# globalToShmemTMACopy - Global Memory to Shared Memory TMA Copy

## Description

This sample shows how to use the CUDA driver API and inline PTX assembly to copy
a 2D tile of a tensor from global memory into shared memory using TMA (the
Tensor Memory Accelerator). It also demonstrates the use of an arrive-wait
barrier for synchronization.
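
As a rough illustration of the host-side setup that precedes such a copy, the sketch below computes the arrays one would typically pass to the driver API when encoding a tiled tensor map (`cuTensorMapEncodeTiled`) for a row-major 2D tensor: dimensions listed fastest-varying first, byte strides for every dimension except the fastest, and the tile ("box") extent. `TileDesc2D`, `make_tile_desc`, and `tile_desc_ok` are hypothetical names that do not come from the sample. The checks reflect our reading of the driver API restrictions and should be verified against the documentation for your toolkit version. The code is plain host C++ and makes no CUDA calls.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type (not part of the sample): the host-side numbers
// that describe one 2D tile copy.
struct TileDesc2D {
    std::array<std::uint64_t, 2> global_dim;    // tensor extent in elements, fastest-varying dimension first
    std::array<std::uint64_t, 1> global_stride; // byte stride of each dimension except the fastest one
    std::array<std::uint32_t, 2> box_dim;       // tile extent in elements, same ordering as global_dim
    std::size_t tile_bytes;                     // size of one tile in bytes
};

// Describe a tile of tile_rows x tile_cols elements inside a densely packed,
// row-major tensor of rows x cols elements.
inline TileDesc2D make_tile_desc(std::uint64_t rows, std::uint64_t cols,
                                 std::uint32_t tile_rows, std::uint32_t tile_cols,
                                 std::size_t elem_size) {
    TileDesc2D d;
    d.global_dim = {cols, rows};            // columns vary fastest in row-major storage
    d.global_stride = {cols * elem_size};   // bytes from one row to the next
    d.box_dim = {tile_cols, tile_rows};
    d.tile_bytes = static_cast<std::size_t>(tile_rows) * tile_cols * elem_size;
    return d;
}

// Assumed restrictions (check the driver API documentation): strides are
// multiples of 16 bytes, each box extent is between 1 and 256 elements, and
// the innermost box row is a multiple of 16 bytes.
inline bool tile_desc_ok(const TileDesc2D& d, std::size_t elem_size) {
    const bool stride_ok = d.global_stride[0] % 16 == 0;
    const bool box_ok = d.box_dim[0] >= 1 && d.box_dim[0] <= 256 &&
                        d.box_dim[1] >= 1 && d.box_dim[1] <= 256;
    const bool inner_ok = (d.box_dim[0] * elem_size) % 16 == 0;
    return stride_ok && box_ok && inner_ok;
}
```

For a hypothetical 1024 x 1024 tensor of 4-byte elements and an 8 x 64 tile, `make_tile_desc(1024, 1024, 8, 64, 4)` gives a row stride of 4096 bytes and a tile size of 2048 bytes. That tile size is the byte count a kernel would tell the arrive-wait barrier to expect before any thread reads the tile from shared memory.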

## Key Concepts

CUDA Runtime API, CUDA Driver API, PTX ISA, CPP11 CUDA

## Supported SM Architectures

This sample requires compute capability 9.0 or higher.

[SM 9.0](https://developer.nvidia.com/cuda-gpus)

## Supported OSes

Linux, Windows, QNX

## Supported CPU Architecture

x86_64, ppc64le, armv7l, aarch64

## CUDA APIs involved

### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize

### [CUDA Driver API](http://docs.nvidia.com/cuda/cuda-driver-api/index.html)
cuTensorMapEncodeTiled

### [CUDA PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html)

## Dependencies needed to build/run
[CPP11](../../../README.md#cpp11)

## Prerequisites

Download and install the [CUDA Toolkit 12.2](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies mentioned in the [Dependencies](#dependencies-needed-to-buildrun) section above are installed.

## Build and Run


## References (for more details)

Commit: Add TMA example (2023-06-30 23:39:53 +08:00)