Mirror of https://github.com/NVIDIA/cuda-samples.git (synced 2025-01-20 00:25:50 +08:00)
globalToShmemTMACopy - Global Memory to Shared Memory TMA Copy
Description
This sample shows how to use the CUDA driver API and inline PTX assembly to copy a 2D tile of a tensor from global memory into shared memory. It also demonstrates the use of an arrive-wait barrier for synchronization.
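As a rough illustration of the driver API side of this workflow, the sketch below encodes a tensor map that describes a 2D tile of a row-major `int` tensor. It is not the sample's own code: the sizes (`GMEM_WIDTH`, `GMEM_HEIGHT`, `SMEM_WIDTH`, `SMEM_HEIGHT`) and the helpers `row_pitch_bytes` and `encode_2d_tile_map` are hypothetical names chosen for illustration. It assumes the program is linked against the driver library (for example with `-lcuda`) and that a CUDA context already exists, for instance because `cudaMalloc` was called first. The kernel-side copy (inline PTX plus the arrive-wait barrier) is not shown here.

```cuda
#include <cuda.h>          // CUtensorMap, cuTensorMapEncodeTiled (driver API)
#include <cuda_runtime.h>  // cudaMalloc, cudaFree (runtime API, used by callers)
#include <cstdint>

// Hypothetical sizes, chosen for illustration only.
constexpr uint32_t GMEM_WIDTH  = 1024;  // tensor columns (fastest-varying dimension)
constexpr uint32_t GMEM_HEIGHT = 1024;  // tensor rows
constexpr uint32_t SMEM_WIDTH  = 32;    // tile columns copied into shared memory
constexpr uint32_t SMEM_HEIGHT = 8;     // tile rows copied into shared memory

// Byte distance between consecutive rows of a densely packed int tensor.
constexpr uint64_t row_pitch_bytes(uint32_t width) {
  return static_cast<uint64_t>(width) * sizeof(int);
}

// Hypothetical helper: fills *tensor_map with a descriptor for
// SMEM_HEIGHT x SMEM_WIDTH tiles of the tensor at `tensor`.
// `tensor` must be a device pointer, e.g. obtained from cudaMalloc.
CUresult encode_2d_tile_map(int* tensor, CUtensorMap* tensor_map) {
  constexpr uint32_t rank = 2;

  // Dimensions are listed fastest-varying first.
  cuuint64_t size[rank]        = {GMEM_WIDTH, GMEM_HEIGHT};
  // Strides in bytes for every dimension except the fastest one.
  cuuint64_t stride[rank - 1]  = {row_pitch_bytes(GMEM_WIDTH)};
  // Extent of the tile ("box") transferred per copy.
  cuuint32_t box_size[rank]    = {SMEM_WIDTH, SMEM_HEIGHT};
  // Step of 1 in each dimension: copy every element inside the box.
  cuuint32_t elem_stride[rank] = {1, 1};

  return cuTensorMapEncodeTiled(
      tensor_map,
      CU_TENSOR_MAP_DATA_TYPE_INT32,
      rank,
      tensor,
      size,
      stride,
      box_size,
      elem_stride,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```

A caller would typically allocate `row_pitch_bytes(GMEM_WIDTH) * GMEM_HEIGHT` bytes with `cudaMalloc`, pass the pointer to `encode_2d_tile_map`, check that the result equals `CUDA_SUCCESS`, and then hand the resulting `CUtensorMap` to a kernel as a `const __grid_constant__` parameter. The sizes above were picked so that the row pitch and the tile's row width in bytes are both multiples of 16, which the tensor-map encoding requires.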
Key Concepts
CUDA Runtime API, CUDA Driver API, PTX ISA, CPP11 CUDA
Supported SM Architectures
This sample requires compute capability 9.0 or higher.
Supported OSes
Linux, Windows, QNX
Supported CPU Architecture
x86_64, ppc64le, armv7l, aarch64
CUDA APIs involved
CUDA Runtime API
cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize
CUDA Driver API
cuTensorMapEncodeTiled
CUDA PTX ISA
Dependencies needed to build/run
Prerequisites
Download and install the CUDA Toolkit 12.2 for your platform. Make sure the dependencies listed in the Dependencies section above are installed.