Mirror of https://github.com/NVIDIA/cuda-samples.git (synced 2026-06-04 00:06:52 +08:00)
tileMatmul
Description
This sample demonstrates how to write a matrix multiplication kernel with good
performance in CUDA Tile C++. The kernel multiplies FP16 input tiles with FP32
accumulation using cuda::tiles::mma.
The sample compares a naive implementation with an optimized one that follows recommended practices and gives the compiler additional guidance for better code generation. The host code measures the execution time of both kernels with CUDA events.
The optimized kernel uses neutral placeholder values for LOAD_LATENCY and
STORE_LATENCY. See tileMatmulAutotuner for an
example of autotuning these values to find a suitable
configuration for your hardware.
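The FP16-input, FP32-accumulation scheme described above, and the CPU cross-validation that checks it, can be illustrated with a small host-side reference. This is only a sketch under stated assumptions and not the sample's actual validation code: `reference_matmul` and `all_close` are hypothetical names, `float` stands in for FP16 because standard C++20 has no half-precision type, the matrices are assumed to be row-major, and the tolerances are placeholders.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: C = A * B with row-major storage.
// A is m x k, B is k x n, C is m x n. The input type In is the narrow
// storage type (float stands in for FP16 here); every dot product is
// accumulated in float, mirroring FP32 accumulation.
template <typename In>
std::vector<float> reference_matmul(const std::vector<In>& a,
                                    const std::vector<In>& b,
                                    std::size_t m, std::size_t n, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;  // wide accumulator
            for (std::size_t p = 0; p < k; ++p) {
                acc += static_cast<float>(a[i * k + p]) *
                       static_cast<float>(b[p * n + j]);
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Returns true when every element of `got` lies within a mixed
// absolute/relative tolerance of the matching element of `ref`.
// The default tolerances are placeholder values, not the sample's.
bool all_close(const std::vector<float>& got, const std::vector<float>& ref,
               float rtol = 1e-2f, float atol = 1e-3f) {
    if (got.size() != ref.size()) return false;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (std::fabs(got[i] - ref[i]) > atol + rtol * std::fabs(ref[i])) {
            return false;
        }
    }
    return true;
}
```

A validation pass would copy the GPU result back to the host and call `all_close(gpu_result, reference_matmul(a, b, m, n, k))`. A relative tolerance is used because FP16 inputs limit how closely two differently ordered summations can agree.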
Running
# Run with default warmup and benchmark iterations. Validation is disabled by default.
./tileMatmul
# Enable CPU validation
./tileMatmul --validate
# For a quicker run, skip warmup (not recommended) and run a single benchmark iteration
./tileMatmul --skip-warmup -i 1
# Show all options
./tileMatmul --help
Command-Line Options
| Option | Description |
|---|---|
| `--validate` | Enable CPU cross-validation. Validation is disabled by default. |
| `--skip-warmup` | Disable warmup iterations. |
| `--warmup=N` | Set warmup iterations. The default is 5. |
| `-i N`, `--iters=N` | Set benchmark iterations. The default is 20. |
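To make the option semantics concrete, the following is a minimal sketch of how these flags could be parsed. It is not the sample's parser: `Options` and `parse_options` are hypothetical names, and two behaviors are assumptions of this sketch, namely that `--skip-warmup` is equivalent to zero warmup iterations and that the last of several conflicting options wins.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical option set; the defaults follow the table above.
struct Options {
    bool validate = false;  // --validate
    int warmup = 5;         // --warmup=N, or 0 after --skip-warmup
    int iters = 20;         // -i N, --iters=N
    bool help = false;      // --help
};

// Parses the arguments that follow the program name.
// Unknown arguments are ignored in this sketch.
Options parse_options(const std::vector<std::string>& args) {
    Options opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "--validate") {
            opt.validate = true;
        } else if (arg == "--skip-warmup") {
            opt.warmup = 0;  // assumption: skipping means zero iterations
        } else if (arg.rfind("--warmup=", 0) == 0) {
            opt.warmup = std::atoi(arg.c_str() + 9);  // text after "--warmup="
        } else if (arg.rfind("--iters=", 0) == 0) {
            opt.iters = std::atoi(arg.c_str() + 8);   // text after "--iters="
        } else if (arg == "-i" && i + 1 < args.size()) {
            opt.iters = std::atoi(args[++i].c_str()); // value is the next argument
        } else if (arg == "--help") {
            opt.help = true;
        }
    }
    return opt;
}
```

In a `main(int argc, char** argv)` function, the arguments can be passed as `parse_options(std::vector<std::string>(argv + 1, argv + argc))`.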
Example Output
Note: CPU cross-validation disabled
matmul : 0.073 ms, 115.2 GB/s, 29495.8 GFLOPS
matmul_naive : 0.497 ms, 16.9 GB/s, 4322.8 GFLOPS
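The throughput columns can be related to the elapsed time with simple arithmetic. The sketch below assumes the usual conventions: 2·M·N·K floating-point operations per multiplication, and each matrix counted once for memory traffic, with A and B stored as 2-byte FP16 and C as 4-byte FP32. The problem size is not stated in this README. The figures above are consistent with M = N = K = 1024, but that is an inference from the numbers, and `matmul_metrics` is a hypothetical helper rather than part of the sample.

```cpp
#include <cassert>
#include <cstddef>

struct MatmulMetrics {
    double gflops;  // 1e9 floating-point operations per second
    double gbps;    // 1e9 bytes per second
};

// Assumptions: one multiply and one add per inner-product term
// (2*M*N*K FLOPs); A (m x k) and B (k x n) are 2-byte FP16 and
// C (m x n) is 4-byte FP32, each counted once.
MatmulMetrics matmul_metrics(std::size_t m, std::size_t n, std::size_t k,
                             double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;
    const double dm = static_cast<double>(m);
    const double dn = static_cast<double>(n);
    const double dk = static_cast<double>(k);
    const double flops = 2.0 * dm * dn * dk;
    const double bytes = 2.0 * (dm * dk + dk * dn) + 4.0 * dm * dn;
    return {flops / seconds / 1e9, bytes / seconds / 1e9};
}
```

Under these assumptions, `matmul_metrics(1024, 1024, 1024, 0.073)` gives roughly 29,400 GFLOPS and 115 GB/s, close to the `matmul` line above. The small gap is what one would expect if the printed time is rounded to three decimals.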
Prerequisites
- CUDA Toolkit version 13.3 or later.
- CUDA Driver version 580 or later.
- Host compiler with C++20 support.