Name: shfl_scan
Compiler flags: --std=c++11 -O3
CUDA API calls: cudaMemcpy, cudaFree, cudaMallocHost, cudaEventSynchronize, cudaEventRecord, cudaFreeHost, cudaGetDevice, cudaMemset, cudaMalloc, cudaEventElapsedTime, cudaGetDeviceProperties, cudaEventCreate
Device compilation: whole
Include paths: ./, ../, ../../../Common
Key concepts: Data-Parallel Algorithms, Performance Strategies
Keywords: GPGPU, CPP11, CUDA, scan, parallel prefix sum, Data-Parallel Algorithms
Nsight Eclipse: true
Primary file: shfl_scan.cu
Required dependencies: CPP11
Scopes: 1:CUDA Advanced Topics, 1:Data-Parallel Algorithms, 1:Performance Strategies
Supported environments: x86_64 linux; windows7; x86_64 macosx; arm; aarch64; sbsa; ppc64le linux
SM architecture: 3.5
Title: CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL_Scan)
Type: exe