Sample name: globalToShmemAsyncCopy
Compiler flags: --std=c++11
CUDA Runtime API calls: cudaFree, cudaEventRecord, cudaMallocHost, cudaEventCreate, cudaMemsetAsync, cudaEventElapsedTime, cudaEventSynchronize, cudaDeviceGetAttribute, cudaFreeHost, cudaMalloc, cudaStreamCreateWithFlags, cudaEventDestroy, cudaStreamSynchronize, cudaMemcpyAsync
Device compilation: whole
Include paths: ./, ../, ../../../Common
Key concepts: CUDA Runtime API, Linear Algebra, CPP11 CUDA
Keywords: CUDA, matrix multiply, Async copy, CPP11, GCC 5.1.0
Nsight Eclipse: true
Primary file: globalToShmemAsyncCopy.cu
Required dependencies: CPP11
Scopes: 1:CUDA Basic Topics, 3:Linear Algebra
SM architectures: sm70, sm72, sm75, sm80, sm86, sm87
Supported environments: x86_64 linux, x86_64 macosx, arm, sbsa, ppc64le linux, aarch64 linux, aarch64 qnx, windows7
Supported SM architectures from: 7.0
Title: Global Memory to Shared Memory Async Copy