Name: globalToShmemAsyncCopy
Compiler flags: --std=c++11
CUDA API calls: cudaEventCreate, cudaEventRecord, cudaEventQuery, cudaEventDestroy, cudaEventElapsedTime, cudaEventSynchronize, cudaMalloc, cudaFree, cudaMemcpy
Device compilation: whole
Include paths: ./, ../, ../../common/inc
Key concepts: CUDA Runtime API, Linear Algebra, CPP11 CUDA
Keywords: CUDA, matrix multiply, Async copy, CPP11, GCC 5.0.0
Nsight Eclipse: true
Primary file: globalToShmemAsyncCopy.cu
Required dependencies: CPP11
Scopes: 1:CUDA Basic Topics, 3:Linear Algebra
SM architectures: sm35, sm37, sm50, sm52, sm60, sm61, sm70, sm72, sm75, sm80
Supported environments: x86_64 linux; x86_64 macosx; arm; ppc64le linux; aarch64 linux; aarch64 qnx; windows7
Supported SM architectures: all
Title: Global Memory to Shared Memory Async Copy