tmaTensorMap (Python)

Description

This sample demonstrates how to use Tensor Memory Accelerator (TMA) descriptors with cuda.core on Hopper and later GPUs (compute capability >= 9.0). TMA enables efficient bulk data movement between global and shared memory using hardware-managed tensor map descriptors, which are a key building block for modern GEMM kernels and large shared-memory tile loads.

The sample:

  1. Creates a TMA tiled descriptor from a CuPy device array via StridedMemoryView.from_any_interface(...).as_tensor_map(...).
  2. Passes the descriptor by value (as __grid_constant__) to a kernel that uses libcudacxx TMA/barrier wrappers to bulk-load a tile into shared memory, then copies it out to verify correctness.
  3. Reuses the same descriptor against a new source tensor with replace_address() to avoid rebuilding it.
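The three steps above can be sketched as follows. The API names (`from_any_interface`, `as_tensor_map`, `replace_address`) are the ones quoted in this README; the import paths are an assumption and may differ across cuda.core versions, and the kernel launch in step 2 is omitted. Since TMA requires a Hopper+ GPU, the whole flow is guarded so it skips cleanly elsewhere.

```python
try:
    import cupy as cp
    from cuda.core.experimental import Device
    from cuda.core.experimental.utils import StridedMemoryView

    dev = Device()
    dev.set_current()
    if dev.compute_capability < (9, 0):
        raise RuntimeError("TMA requires compute capability >= 9.0")

    src = cp.arange(1024, dtype=cp.float32)
    # Step 1: build a typed view, then a TMA descriptor for 128-wide tiles.
    view = StridedMemoryView.from_any_interface(src, stream_ptr=-1)
    tensor_map = view.as_tensor_map(box_dim=(128,))
    # Step 2 would pass `tensor_map` by value to the kernel via launch(...).
    # Step 3: retarget the same descriptor at a new source tensor.
    tensor_map.replace_address(cp.arange(1024, dtype=cp.float32) * 2)
    status = "ok"
except Exception as exc:  # no GPU, pre-Hopper GPU, or missing packages
    status = f"skipped: {exc}"
print(status)
```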

What You'll Learn

  • Creating a TMA descriptor from a strided device tensor via StridedMemoryView.as_tensor_map(box_dim=...)
  • Passing a tensor map to a kernel by value using __grid_constant__
  • Using libcudacxx (cuda/barrier) to coordinate TMA loads with a block-scoped barrier
  • Reusing a descriptor against a new source buffer via tensor_map.replace_address(new_tensor)
  • Compiling a kernel to CUBIN for a specific target arch so Hopper features are available
  • Using cuda.pathfinder to locate the CUDA toolkit include directory for the CCCL headers and libcudacxx

Key Libraries

  • cuda.core - compilation, launching, and tensor-map helpers
  • cuda.pathfinder - locate the CUDA toolkit include directory
  • cupy - allocate and fill device tensors
  • numpy - scalar kernel arguments

Key APIs

From cuda.core

  • StridedMemoryView.from_any_interface(tensor, stream_ptr=-1) - build a typed view from any DLPack/CUDA-array-interface tensor
  • StridedMemoryView.as_tensor_map(box_dim=(...)) - produce a TMA descriptor for the given tile shape
  • tensor_map.replace_address(new_tensor) - retarget an existing descriptor at a new tensor
  • Program(code, code_type="c++", options=ProgramOptions(std="c++17", arch="sm_90", include_path=[...])) - compile a C++ kernel against libcudacxx
  • program.compile("cubin") - produce a CUBIN so __grid_constant__ and TMA intrinsics are fully supported
  • launch(stream, config, kernel, tensor_map, ...) - pass the TMA descriptor as a kernel argument
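A minimal compile-to-CUBIN sketch using the `Program` API shown above. The trivial `noop` kernel, the omitted `include_path`, and the `get_kernel` call are simplifications for illustration; the snippet is guarded so it skips cleanly when cuda.core or an NVRTC-capable toolkit is unavailable.

```python
code = r"""
extern "C" __global__ void noop() {}
"""
try:
    from cuda.core.experimental import Program, ProgramOptions

    # arch="sm_90" targets Hopper; a CUBIN (not PTX) is needed so that
    # __grid_constant__ and the TMA intrinsics are fully supported.
    opts = ProgramOptions(std="c++17", arch="sm_90")
    program = Program(code, code_type="c++", options=opts)
    mod = program.compile("cubin")
    kernel = mod.get_kernel("noop")
    status = "compiled"
except Exception as exc:  # no NVRTC / unsupported toolkit / missing package
    status = f"skipped: {exc}"
print(status)
```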

From cuda.pathfinder

  • get_cuda_path_or_home() - return the detected CUDA toolkit root for locating include/cccl
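One way the toolkit root might be turned into an `include_path` for `ProgramOptions`. `get_cuda_path_or_home()` is the name quoted above; the `include/cccl` layout is an assumption about the installed toolkit, and the snippet falls back to compiler defaults when the root cannot be found.

```python
import os

try:
    from cuda.pathfinder import get_cuda_path_or_home
    root = get_cuda_path_or_home()
except Exception:  # cuda.pathfinder missing or toolkit not detected
    root = None

if root:
    include_path = [os.path.join(root, "include"),
                    os.path.join(root, "include", "cccl")]
else:
    include_path = []  # fall back to the compiler's default search paths
print(include_path)
```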

From cuda_samples_utils

  • print_gpu_info() - print device name and compute capability

Requirements

Hardware

  • NVIDIA Hopper or newer GPU with Compute Capability 9.0 or higher (H100, H200, B200, ...)
  • On GPUs older than Hopper, the sample exits cleanly without running the kernel
  • Minimum GPU memory: 512 MB

Software

  • CUDA Toolkit 13.0 or newer with libcudacxx (cccl) headers
  • Python 3.10 or newer
  • cuda-python (>=13.0.0)
  • cuda-core (>=0.6.0)
  • cupy-cuda13x (>=13.0.0)

Installation

Install the required packages from requirements.txt:

cd /path/to/cuda-samples/python/2_CoreConcepts/tmaTensorMap
pip install -r requirements.txt

The requirements.txt installs:

  • cuda-python (>=13.0.0)
  • cuda-core (>=0.6.0)
  • cupy-cuda13x (>=13.0.0)

How to Run

Basic usage

cd cuda-samples/python/2_CoreConcepts/tmaTensorMap
python tmaTensorMap.py

With custom parameters

# Larger tensor (must be a multiple of the 128-element tile)
python tmaTensorMap.py --elements 8192

# Use a specific GPU
python tmaTensorMap.py --device 1
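The tile constraint on --elements comes down to simple arithmetic: with a 128-element box, the element count must divide evenly into tiles. A small stand-in check (`validate_elements` is a hypothetical helper name, not part of the sample):

```python
TILE = 128  # matches the sample's 128-element box_dim

def validate_elements(elements: int) -> int:
    """Return the number of TMA tiles, rejecting sizes that do not
    divide evenly into 128-element boxes."""
    if elements <= 0 or elements % TILE != 0:
        raise ValueError(f"--elements must be a positive multiple of {TILE}")
    return elements // TILE

print(validate_elements(1024))  # default size -> 8 tiles
print(validate_elements(8192))  # -> 64 tiles
```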

Expected Output

On a Hopper (sm_90) GPU:

Device: NVIDIA H100 PCIe
Compute Capability: 9.0

TMA copy verified: 1024 elements across 8 tiles
replace_address verified: descriptor reused with new source tensor

Note: Device name and compute capability will vary based on your GPU.

Files

  • tmaTensorMap.py - Python implementation using cuda.core TMA APIs
  • README.md - This file
  • requirements.txt - Sample dependencies
  • ../../Utilities/cuda_samples_utils.py - Common utilities (imported by this sample)

See Also