Name: globalToShmemAsyncCopy
Compiler flags: --std=c++11
CUDA API calls: cudaStreamCreateWithFlags, cudaMalloc, cudaDeviceGetAttribute, cudaFree, cudaMallocHost, cudaEventSynchronize, cudaEventRecord, cudaFreeHost, cudaStreamSynchronize, cudaEventDestroy, cudaEventElapsedTime, cudaMemsetAsync, cudaMemcpyAsync, cudaEventCreate
Device compilation: whole
Include paths: ./, ../, ../../../Common
Key concepts: CUDA Runtime API, Linear Algebra, CPP11 CUDA
Keywords: CUDA, matrix multiply, Async copy, CPP11, GCC 5.1.0
Nsight Eclipse: true
Primary file: globalToShmemAsyncCopy.cu
Required dependencies: CPP11
Scopes: 1:CUDA Basic Topics, 3:Linear Algebra
SM architectures: sm70, sm72, sm75, sm80, sm86, sm87, sm89, sm90
Supported environments: x86_64 linux, x86_64 macosx, arm, sbsa, ppc64le linux, aarch64 linux, aarch64 qnx, windows7
Minimum supported SM architecture: 7.0
Title: Global Memory to Shared Memory Async Copy