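Name: shfl_scan
Title: CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL_Scan)
Type: exe
Primary file: shfl_scan.cu
Compiler flags: -O3
Device compilation: whole
Include paths: ./, ../, ../../common/inc
Key concepts: Data-Parallel Algorithms, Performance Strategies
Keywords: GPGPU, CUDA, scan, parallel prefix sum, Data-Parallel Algorithms
Nsight Eclipse: true
Scopes: 1:CUDA Advanced Topics, 1:Data-Parallel Algorithms, 1:Performance Strategies
SM architectures: sm30, sm35, sm37, sm50, sm52, sm60, sm61, sm70, sm72, sm75
Supported environments: x86_64 linux, windows7, x86_64 macosx, arm, aarch64, ppc64le linux
Minimum compute capability: 3.0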