mirror of https://github.com/NVIDIA/cuda-samples.git synced 2026-07-16 21:06:52 +08:00


Release v13.3 of the CUDA samples with CUDA 13.3 Toolkit (#435)

This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.

2026-05-27 16:50:59 -05:00

CMakeLists.txt

README.md

tileBmm.cu

README.md

tileBmm

Description

This sample demonstrates a static-persistent batched matrix multiplication (BMM) using CUDA Tile C++. Given inputs A of shape (Q, M, K) and B of shape (Q, K, N), the kernel computes C = A x B of shape (Q, M, N), multiplying each of the Q matrix pairs independently.

The grid launches a fixed number of persistent blocks, sized from the device's SM count, and each block walks the (M, N, Q-chunk) tile space in a grid-stride loop. The batch dimension is tiled by BLOCK_SIZE_Q, so each K-step of an iteration issues a single rank-3 batched cuda::tiles::mma over tiles of shape (BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_K) and (BLOCK_SIZE_Q, BLOCK_SIZE_K, BLOCK_SIZE_N). Grouped ordering on (pid_m, pid_n) improves L2 cache reuse.

The accumulator is kept in float32 for precision, and masked loads and stores handle tiles that overhang the matrix or batch boundaries. Inputs and outputs use __half precision.
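The tile scheduling described above can be sketched on the host in plain C++. This is a minimal illustration, not the code in tileBmm.cu: the names TileCoord, decode_tile, tiles_for_block, and group_size_m are hypothetical. Two further points are assumptions: the grouping scheme (a common grouped-ordering pattern in which consecutive tile ids stay within a band of group_size_m tile rows) and the choice of the Q-chunk as the slowest-varying index.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tile coordinate; field names are illustrative only.
struct TileCoord {
    int q_chunk;  // index of the BLOCK_SIZE_Q-sized batch chunk
    int pid_m;    // tile row within the (M, N) tile grid
    int pid_n;    // tile column within the (M, N) tile grid
};

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Decode a linear tile id into (q_chunk, pid_m, pid_n).
// Assumption: the Q-chunk is the slowest-varying index, and (pid_m, pid_n)
// are grouped so that consecutive ids stay within group_size_m tile rows.
TileCoord decode_tile(int tile_id, int num_pid_m, int num_pid_n, int group_size_m) {
    assert(group_size_m > 0 && num_pid_m > 0 && num_pid_n > 0);
    const int tiles_per_chunk = num_pid_m * num_pid_n;
    const int q_chunk = tile_id / tiles_per_chunk;
    const int pid = tile_id % tiles_per_chunk;

    const int pids_per_group = group_size_m * num_pid_n;
    const int first_pid_m = (pid / pids_per_group) * group_size_m;
    // The last group may contain fewer than group_size_m tile rows.
    const int rows_in_group = std::min(num_pid_m - first_pid_m, group_size_m);
    const int local = pid % pids_per_group;

    return {q_chunk, first_pid_m + local % rows_in_group, local / rows_in_group};
}

// Tiles visited by one persistent block in a grid-stride loop.
// bq, bm, bn stand in for BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_N.
std::vector<TileCoord> tiles_for_block(int block_id, int num_blocks,
                                       int Q, int M, int N,
                                       int bq, int bm, int bn,
                                       int group_size_m) {
    const int num_pid_m = ceil_div(M, bm);
    const int num_pid_n = ceil_div(N, bn);
    const int num_q_chunks = ceil_div(Q, bq);
    const int total_tiles = num_pid_m * num_pid_n * num_q_chunks;

    std::vector<TileCoord> visited;
    // Block b handles tiles b, b + num_blocks, b + 2 * num_blocks, ...
    for (int tile_id = block_id; tile_id < total_tiles; tile_id += num_blocks) {
        visited.push_back(decode_tile(tile_id, num_pid_m, num_pid_n, group_size_m));
    }
    return visited;
}
```

For example, with Q = 4, M = N = 256, tiles of (2, 128, 128), group_size_m = 2, and 3 persistent blocks, there are 8 tiles in total and block 0 visits tile ids 0, 3, and 6. In the real kernel the number of blocks would be derived from the SM count, and partially filled tiles at the edges would be handled by the masked loads and stores.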

Expected Output

Success! BMM matches expected results.
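A message like this implies that the GPU result is compared against a reference. The sample's own verification code is not shown here, so the following is only a sketch of one plausible check: a naive host BMM with a float accumulator, plus a tolerance comparison. The function names bmm_reference and all_close, the row-major layout, the use of float in place of __half, and the tolerance parameters are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive reference: C[q] = A[q] x B[q] for each batch q.
// Assumed layout: row-major, A is (Q, M, K), B is (Q, K, N), C is (Q, M, N).
std::vector<float> bmm_reference(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t Q, std::size_t M,
                                 std::size_t N, std::size_t K) {
    std::vector<float> C(Q * M * N, 0.0f);
    for (std::size_t q = 0; q < Q; ++q) {
        for (std::size_t m = 0; m < M; ++m) {
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.0f;  // float accumulator, as in the kernel description
                for (std::size_t k = 0; k < K; ++k) {
                    acc += A[(q * M + m) * K + k] * B[(q * K + k) * N + n];
                }
                C[(q * M + m) * N + n] = acc;
            }
        }
    }
    return C;
}

// Element-wise comparison with absolute and relative tolerances.
bool all_close(const std::vector<float>& got, const std::vector<float>& want,
               float atol, float rtol) {
    if (got.size() != want.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (std::fabs(got[i] - want[i]) > atol + rtol * std::fabs(want[i])) {
            return false;
        }
    }
    return true;
}
```

Because the inputs and outputs are half precision, a comparison of this kind generally needs a nonzero tolerance rather than exact equality. Suitable values depend on K and on the input range.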

Prerequisites

CUDA Toolkit version 13.3 or later.
CUDA Driver version 580 or later.
Host compiler with C++20 support.