# globalToShmemTMACopy - Global Memory to Shared Memory TMA Copy

## Description

This sample shows how to use the CUDA driver API and inline PTX assembly to copy
a 2D tile of a tensor from global memory into shared memory with the Tensor
Memory Accelerator (TMA). It also demonstrates the use of an arrive-wait barrier
for synchronization.
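To make the copied data concrete, the following is a minimal host-side C++ sketch of what a 2D tile copy produces. It is not the sample's code and does not use TMA, PTX, or a barrier. The function name `copy_tile_2d`, its parameters, and the row-major layout are hypothetical choices made for illustration, and the zero fill for out-of-bounds elements is an assumption meant to mirror the default TMA fill behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for a 2D tile copy (illustrative only, not the sample's code).
// `tensor` is a row-major array of `height` x `width` elements. The tile of size
// `tile_h` x `tile_w` whose top-left corner is (row0, col0) is copied into a dense
// row-major buffer. Elements outside the tensor are left as zero (assumption:
// this mirrors zero fill for out-of-bounds accesses).
std::vector<int> copy_tile_2d(const std::vector<int>& tensor,
                              int height, int width,
                              int row0, int col0,
                              int tile_h, int tile_w) {
  assert(height >= 0 && width >= 0 && tile_h >= 0 && tile_w >= 0);
  assert(tensor.size() == static_cast<std::size_t>(height) * width);

  std::vector<int> tile(static_cast<std::size_t>(tile_h) * tile_w, 0);
  for (int r = 0; r < tile_h; ++r) {
    for (int c = 0; c < tile_w; ++c) {
      const int gr = row0 + r;  // row index in the full tensor
      const int gc = col0 + c;  // column index in the full tensor
      if (gr >= 0 && gr < height && gc >= 0 && gc < width) {
        tile[static_cast<std::size_t>(r) * tile_w + c] =
            tensor[static_cast<std::size_t>(gr) * width + gc];
      }
    }
  }
  return tile;
}
```

A reference like this can be used on the host to check the contents that a kernel writes back after loading a tile into shared memory.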

## Key Concepts

CUDA Runtime API, CUDA Driver API, PTX ISA, CPP11 CUDA

## Supported SM Architectures

This sample requires compute capability 9.0 or higher.

[SM 9.0](https://developer.nvidia.com/cuda-gpus)

## Supported OSes

Linux, Windows, QNX

## Supported CPU Architecture

x86_64, ppc64le, armv7l, aarch64

## CUDA APIs involved

### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize

### [CUDA Driver API](http://docs.nvidia.com/cuda/cuda-driver-api/index.html)
cuTensorMapEncodeTiled

### [CUDA PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html)

## Dependencies needed to build/run
[CPP11](../../../README.md#cpp11)

## Prerequisites

Download and install the [CUDA Toolkit 12.2](https://developer.nvidia.com/cuda-downloads) for your platform.
Make sure the dependencies listed in the [Dependencies](#dependencies-needed-to-buildrun) section above are installed.

## Build and Run

## References (for more details)