This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.
tmaTensorMap (Python)
Description
This sample demonstrates how to use Tensor Memory Accelerator (TMA)
descriptors with cuda.core on Hopper and later GPUs (compute
capability >= 9.0). TMA enables efficient bulk data movement between
global and shared memory using hardware-managed tensor map
descriptors, which are a key building block for modern GEMM kernels
and large shared-memory tile loads.
The sample:
- Creates a TMA tiled descriptor from a CuPy device array via StridedMemoryView.from_any_interface(...).as_tensor_map(...).
- Passes the descriptor by value (as __grid_constant__) to a kernel that uses libcudacxx TMA/barrier wrappers to bulk-load a tile into shared memory, then copies it out to verify correctness.
- Reuses the same descriptor against a new source tensor with replace_address() to avoid rebuilding it.
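To make the data flow in these steps concrete, here is a CPU-only reference of the result the kernel is expected to produce. It is a sketch under assumptions: a 1D tensor, one 128-element tile per thread block, and a hypothetical function name `tiled_copy_reference`. No TMA hardware or cuda.core call is involved; the slice merely stands in for the bulk load into shared memory.

```python
TILE_ELEMS = 128  # tile size mentioned in the run instructions


def tiled_copy_reference(src, tile_elems=TILE_ELEMS):
    """Copy `src` tile by tile and return (dst, number_of_tiles).

    Each loop iteration plays the role of one thread block (assumption):
    it "loads" one tile into a staging buffer, then writes it back out.
    """
    if len(src) % tile_elems != 0:
        raise ValueError("length must be a multiple of the tile size")

    dst = [0] * len(src)
    num_tiles = len(src) // tile_elems
    for block in range(num_tiles):
        start = block * tile_elems
        staging = src[start:start + tile_elems]   # stands in for the TMA load
        dst[start:start + tile_elems] = staging   # copy-out used for verification
    return dst, num_tiles


src = list(range(1024))
dst, num_tiles = tiled_copy_reference(src)
print(f"copied {len(dst)} elements across {num_tiles} tiles, match={dst == src}")
```

With 1024 elements and 128-element tiles the loop runs 8 times, which matches the tile count in the expected output further down.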
What You'll Learn
- Creating a TMA descriptor from a strided device tensor via StridedMemoryView.as_tensor_map(box_dim=...)
- Passing a tensor map to a kernel by value using __grid_constant__
- Using libcudacxx (cuda/barrier) to coordinate TMA loads with a block-scoped barrier
- Reusing a descriptor against a new source buffer via tensor_map.replace_address(new_tensor)
- Compiling a kernel to CUBIN for a specific target arch so Hopper features are available
- Using cuda.pathfinder to locate the CUDA toolkit include directory that provides the CCCL headers and libcudacxx
Key Libraries
- cuda.core - compilation, launching, and tensor-map helpers
- cuda.pathfinder - locate the CUDA toolkit include directory
- cupy - allocate and fill device tensors
- numpy - scalar kernel arguments
Key APIs
From cuda.core
- StridedMemoryView.from_any_interface(tensor, stream_ptr=-1) - build a typed view from any DLPack/CUDA-array-interface tensor
- StridedMemoryView.as_tensor_map(box_dim=(...)) - produce a TMA descriptor for the given tile shape
- tensor_map.replace_address(new_tensor) - retarget an existing descriptor at a new tensor
- Program(code, code_type="c++", options=ProgramOptions(std="c++17", arch="sm_90", include_path=[...])) - compile a C++ kernel against libcudacxx
- program.compile("cubin") - produce a CUBIN so __grid_constant__ and TMA intrinsics are fully supported
- launch(stream, config, kernel, tensor_map, ...) - pass the TMA descriptor as a kernel argument
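The two descriptor-building calls in this list can be chained roughly as follows. This is a minimal sketch, not the sample's actual code: the helper name `build_tile_descriptor` is hypothetical, the import path `cuda.core.utils` is an assumption (check where your cuda-core version exposes StridedMemoryView), and the 128-element default comes from the run instructions. The import is deferred into the function so the module still loads on machines without cuda-core.

```python
TILE_ELEMS = 128  # tile size mentioned in the run instructions


def build_tile_descriptor(tensor, tile_elems=TILE_ELEMS):
    """Return a TMA descriptor that covers `tensor` in 1D tiles of `tile_elems`."""
    # Assumption: StridedMemoryView is importable from cuda.core.utils.
    from cuda.core.utils import StridedMemoryView

    # stream_ptr=-1 mirrors the signature listed under Key APIs.
    view = StridedMemoryView.from_any_interface(tensor, stream_ptr=-1)

    # box_dim is the tile shape; a one-element tuple assumes a 1D tensor.
    return view.as_tensor_map(box_dim=(tile_elems,))
```

On a supported GPU, `tensor` would be a CuPy device array. Retargeting the returned descriptor at a second array is then the single call `tensor_map.replace_address(new_tensor)` from the list above.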
From cuda.pathfinder
- get_cuda_path_or_home() - return the detected CUDA toolkit root for locating include/cccl
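Once the toolkit root is known, the include directories for the compiler options can be derived from it. The sketch below is an illustration under assumptions: `cccl_include_dirs` is a hypothetical helper, the root is passed in as a plain string (in the sample it would come from get_cuda_path_or_home()), and listing both include and include/cccl is a guess at what the kernel needs.

```python
import os


def cccl_include_dirs(cuda_root):
    """Return the include directories under `cuda_root` that actually exist.

    Candidates are <root>/include and <root>/include/cccl, in that order.
    """
    include = os.path.join(cuda_root, "include")
    candidates = [include, os.path.join(include, "cccl")]
    # Skip missing directories so a bad root shows up as an empty list
    # instead of a confusing compiler error later on.
    return [path for path in candidates if os.path.isdir(path)]
```

The resulting list is the kind of value that would be passed as include_path=[...] to ProgramOptions; an empty list is a cue to check the toolkit installation.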
From cuda_samples_utils
- print_gpu_info() - print device name and compute capability
Requirements
Hardware
- NVIDIA Hopper or newer GPU with Compute Capability 9.0 or higher (H100, H200, B200, ...)
- On GPUs older than Hopper the sample exits cleanly without running the kernel
- Minimum GPU memory: 512 MB
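The hardware gate described above boils down to a comparison on the compute capability. A minimal sketch, assuming the capability is available as a (major, minor) pair; the helper names `supports_tma` and `arch_flag` are hypothetical and not taken from the sample.

```python
def supports_tma(major, minor=0):
    """True for compute capability 9.0 (Hopper) or higher."""
    return (major, minor) >= (9, 0)


def arch_flag(major, minor=0):
    """Format a compute capability as an arch string such as 'sm_90'."""
    return f"sm_{major}{minor}"


# Example: decide whether to run or exit cleanly (values are illustrative).
for cc in [(8, 9), (9, 0), (10, 0)]:
    action = f"compile for {arch_flag(*cc)}" if supports_tma(*cc) else "skip: TMA unavailable"
    print(cc, action)
```

Comparing tuples keeps the check correct for capabilities such as 10.0, where comparing the major and minor numbers separately is easy to get wrong.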
Software
- CUDA Toolkit 13.0 or newer with libcudacxx (cccl) headers
- Python 3.10 or newer
- cuda-python (>=13.0.0)
- cuda-core (>=1.0.0)
- cupy-cuda13x (>=14.0.0)
Installation
Install the required packages from requirements.txt:
cd /path/to/cuda-samples/python/2_CoreConcepts/tmaTensorMap
pip install -r requirements.txt
The requirements.txt installs:
- cuda-python (>=13.0.0)
- cuda-core (>=1.0.0)
- cupy-cuda13x (>=14.0.0)
How to Run
Basic usage
cd cuda-samples/python/2_CoreConcepts/tmaTensorMap
python tmaTensorMap.py
With custom parameters
# Larger tensor (must be a multiple of the 128-element tile)
python tmaTensorMap.py --elements 8192
# Use a specific GPU
python tmaTensorMap.py --device 1
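The two flags and the multiple-of-128 rule could be handled along these lines. This is a sketch, not the sample's actual argument parser: the defaults (1024 elements, device 0) are assumptions, with 1024 inferred from the expected output, and `parse_args` is a hypothetical name.

```python
import argparse

TILE_ELEMS = 128  # tile size stated in the comment above


def parse_args(argv=None):
    """Parse --elements and --device, rejecting sizes that do not tile evenly."""
    parser = argparse.ArgumentParser(description="TMA tensor map sample (sketch)")
    parser.add_argument("--elements", type=int, default=1024,
                        help="tensor length; must be a multiple of the tile size")
    parser.add_argument("--device", type=int, default=0,
                        help="index of the GPU to use")
    args = parser.parse_args(argv)

    # parser.error prints a usage message and exits with status 2.
    if args.elements <= 0 or args.elements % TILE_ELEMS != 0:
        parser.error(f"--elements must be a positive multiple of {TILE_ELEMS}")
    return args


args = parse_args(["--elements", "8192"])
print(args.elements // TILE_ELEMS, "tiles on device", args.device)
```

Validating up front means a bad size is reported as a usage error before any GPU work starts.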
Expected Output
On a Hopper (sm_90) GPU:
Device: NVIDIA H100 PCIe
Compute Capability: 9.0
TMA copy verified: 1024 elements across 8 tiles
replace_address verified: descriptor reused with new source tensor
Note: Device name and compute capability will vary based on your GPU.
Files
- tmaTensorMap.py - Python implementation using cuda.core TMA APIs
- README.md - This file
- requirements.txt - Sample dependencies
- ../../Utilities/cuda_samples_utils.py - Common utilities (imported by this sample)