# tmaTensorMap (Python)

## Description

This sample demonstrates how to use Tensor Memory Accelerator (TMA) descriptors with `cuda.core` on Hopper and later GPUs (compute capability >= 9.0). TMA enables efficient bulk data movement between global and shared memory using hardware-managed tensor map descriptors, which are a key building block for modern GEMM kernels and large shared-memory tile loads.

The sample:

1. Creates a TMA tiled descriptor from a CuPy device array via `StridedMemoryView.from_any_interface(...).as_tensor_map(...)`.
2. Passes the descriptor by value (as `__grid_constant__`) to a kernel that uses libcudacxx TMA/barrier wrappers to bulk-load a tile into shared memory, then copies it out to verify correctness.
3. Reuses the same descriptor against a new source tensor with `replace_address()` to avoid rebuilding it.

## What You'll Learn

- Creating a TMA descriptor from a strided device tensor via `StridedMemoryView.as_tensor_map(box_dim=...)`
- Passing a tensor map to a kernel by value using `__grid_constant__`
- Using libcudacxx (`cuda/barrier`) to coordinate TMA loads with a block-scoped barrier
- Reusing a descriptor against a new source buffer via `tensor_map.replace_address(new_tensor)`
- Compiling a kernel to CUBIN for a specific target architecture so Hopper features are available
- Using `cuda.pathfinder` to locate the CUDA toolkit include directory for the CCCL headers (libcudacxx)

## Key Libraries

- [`cuda.core`](https://nvidia.github.io/cuda-python/cuda-core/latest/) - compilation, launching, and tensor-map helpers
- `cuda.pathfinder` - locate the CUDA toolkit include directory
- `cupy` - allocate and fill device tensors
- `numpy` - scalar kernel arguments

## Key APIs

### From `cuda.core`

- `StridedMemoryView.from_any_interface(tensor, stream_ptr=-1)` - build a typed view from any DLPack/CUDA-array-interface tensor
- `StridedMemoryView.as_tensor_map(box_dim=(...))` - produce a TMA descriptor for the given tile shape
- `tensor_map.replace_address(new_tensor)` - retarget an existing descriptor at a new tensor
- `Program(code, code_type="c++", options=ProgramOptions(std="c++17", arch="sm_90", include_path=[...]))` - compile a C++ kernel against libcudacxx
- `program.compile("cubin")` - produce a CUBIN so `__grid_constant__` and TMA intrinsics are fully supported
- `launch(stream, config, kernel, tensor_map, ...)` - pass the TMA descriptor as a kernel argument

### From `cuda.pathfinder`

- `get_cuda_path_or_home()` - return the detected CUDA toolkit root for locating `include/cccl`

### From `cuda_samples_utils`

- `print_gpu_info()` - print device name and compute capability

## Requirements

### Hardware

- NVIDIA Hopper or newer GPU with Compute Capability 9.0 or higher (H100, H200, B200, ...)
- On GPUs older than Hopper the sample exits cleanly without running the kernel
- Minimum GPU memory: 512 MB

### Software

- CUDA Toolkit 13.0 or newer with libcudacxx (CCCL) headers
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)

## Installation

Install the required packages from `requirements.txt`:

```bash
cd /path/to/cuda-samples/python/2_CoreConcepts/tmaTensorMap
pip install -r requirements.txt
```

The `requirements.txt` installs:

- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)

## How to Run

### Basic usage

```bash
cd cuda-samples/python/2_CoreConcepts/tmaTensorMap
python tmaTensorMap.py
```

### With custom parameters

```bash
# Larger tensor (must be a multiple of the 128-element tile)
python tmaTensorMap.py --elements 8192

# Use a specific GPU
python tmaTensorMap.py --device 1
```

## Expected Output

On a Hopper (sm_90) GPU:

```
Device: NVIDIA H100 PCIe
Compute Capability: 9.0
TMA copy verified: 1024 elements across 8 tiles
replace_address verified: descriptor reused with new source tensor
```

**Note:** Device name and compute capability will vary based on your GPU.
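To make the end-to-end flow concrete, here is a rough host-side sketch of the steps above (build a descriptor, compile to CUBIN, launch, then reuse the descriptor via `replace_address`). It is a hedged illustration, not the sample's actual code: the exact module paths and signatures are assumptions based on the API list above and may differ between `cuda.core` releases, the C++ kernel body is elided (see `tmaTensorMap.py` for the real one), and the kernel name `tma_copy_kernel` is hypothetical. The GPU-dependent parts require an sm_90+ device.

```python
# Hedged sketch of the host-side TMA flow; NOT the sample's exact code.
# Module paths and signatures below are assumptions and may vary across
# cuda.core releases.

BOX = 128  # tile width used by the sample; --elements must be a multiple


def num_tiles(elements: int, box: int = BOX) -> int:
    """Number of box-wide tiles covering `elements` (e.g. 1024 -> 8 tiles)."""
    if elements % box:
        raise ValueError("elements must be a multiple of the tile size")
    return elements // box


def run(elements: int = 1024) -> None:
    # GPU-only imports are kept local so the tile arithmetic above stays
    # usable on machines without CUDA installed.
    import cupy as cp
    from cuda.core.experimental import (
        Device, LaunchConfig, Program, ProgramOptions, launch,
    )
    from cuda.core.experimental.utils import StridedMemoryView
    from cuda.pathfinder import get_cuda_path_or_home

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    src = cp.arange(elements, dtype=cp.float32)
    dst = cp.zeros_like(src)

    # 1. Build a TMA tiled descriptor for BOX-element tiles of `src`.
    view = StridedMemoryView.from_any_interface(src, stream_ptr=-1)
    tensor_map = view.as_tensor_map(box_dim=(BOX,))

    # 2. Compile the C++ kernel (body elided) to CUBIN for sm_90 so
    #    __grid_constant__ and the TMA intrinsics are available.
    kernel_source = "/* TMA kernel using cuda/barrier; see tmaTensorMap.py */"
    cccl_include = f"{get_cuda_path_or_home()}/include/cccl"
    program = Program(
        kernel_source,
        code_type="c++",
        options=ProgramOptions(std="c++17", arch="sm_90",
                               include_path=[cccl_include]),
    )
    kernel = program.compile("cubin").get_kernel("tma_copy_kernel")

    # The descriptor is passed by value; the kernel receives it as a
    # __grid_constant__ CUtensorMap.
    config = LaunchConfig(grid=num_tiles(elements), block=BOX)
    launch(stream, config, kernel, tensor_map, dst.data.ptr)
    stream.sync()
    assert cp.allclose(src, dst)

    # 3. Retarget the same descriptor at a new tensor without rebuilding it.
    src2 = cp.random.random(elements).astype(cp.float32)
    tensor_map.replace_address(src2)
    launch(stream, config, kernel, tensor_map, dst.data.ptr)
    stream.sync()
    assert cp.allclose(src2, dst)
```

Note how the descriptor is built once and only its base address changes in step 3; this is what lets the sample skip rebuilding the tensor map for the second source tensor.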
## Files

- `tmaTensorMap.py` - Python implementation using `cuda.core` TMA APIs
- `README.md` - This file
- `requirements.txt` - Sample dependencies
- `../../Utilities/cuda_samples_utils.py` - Common utilities (imported by this sample)

## See Also

- [CUDA Python Documentation](https://nvidia.github.io/cuda-python/)
- [TMA in the CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#tensor-memory-accelerator)
- [`cuda::barrier` reference](https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/barrier.html)