Matrix Multiplication with Shared Memory (GEMM)

Demonstrates efficient matrix multiplication using nvmath-python APIs and custom CUDA kernels with tiling, shared memory, and loop unrolling.

Overview

  • Uses nvmath.linalg.advanced.Matmul for high-performance GEMM via cuBLASLt
  • Compares it against a custom CUDA kernel that uses tiling and shared memory
  • Shows how tiling reduces global memory bandwidth requirements
  • Demonstrates shared memory for data reuse within thread blocks
  • Uses loop unrolling to improve instruction-level parallelism

What You'll Learn

  • How to use the nvmath stateful API for optimized matrix multiplication
  • How to tile matrix operations for better cache locality
  • Using shared memory to reduce redundant global memory accesses
  • Loop unrolling techniques for GPU kernels
  • Benchmarking and comparing kernel performance

Key Libraries

  • nvmath-python - NVIDIA math library with cuBLASLt access
  • cuda.core - Modern CUDA Python API for custom kernel compilation
  • cupy - GPU array library for Python

Key APIs

From nvmath.linalg.advanced:

  • Matmul() - Stateful matrix multiplication with planning and execution phases
  • MatmulComputeType - Compute type options for mixed-precision

From cuda.core:

  • Device() - CUDA device management and properties
  • Program() - Runtime kernel compilation (NVRTC)
  • LaunchConfig() - Kernel launch configuration (grid/block dimensions)
  • launch() - Kernel execution on a stream
  • Stream.record_event() / Event.elapsed_time() - GPU timing
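
The sketch below ties the cuda.core pieces listed above together: compile a kernel from source with NVRTC and launch it on a stream. It is a minimal illustration, not the sample's code; the scale kernel is a stand-in for the sample's GEMM kernel, and exact signatures can shift between cuda.core.experimental releases.

import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Stand-in kernel; the sample compiles its tiled GEMM kernel the same way.
code = r"""
extern "C" __global__ void scale(float* x, float s, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
"""

arch = "".join(str(d) for d in dev.compute_capability)  # e.g. "89"
prog = Program(code, code_type="c++", options=ProgramOptions(arch=f"sm_{arch}"))
kernel = prog.compile("cubin").get_kernel("scale")

n = 1 << 20
x = cp.ones(n, dtype=cp.float32)
dev.sync()  # CuPy issues work on its own stream; sync before reusing the data

config = LaunchConfig(grid=(n + 255) // 256, block=256)
launch(stream, config, kernel, x.data.ptr, cp.float32(2.0), cp.uint64(n))
stream.sync()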

Requirements

Hardware:

  • NVIDIA GPU with Compute Capability 7.0 or higher
  • Minimum GPU memory: 256 MB (for 1024×1024 matrices)

Software:

  • CUDA Toolkit 13.0 or newer
  • Python 3.10 or newer
  • See requirements.txt for package dependencies

Installation

cd cuda-samples/python/2_CoreConcepts/matrixMulSharedMem
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

How to Run

python matrixMulSharedMem.py

Expected Output

======================================================================
Matrix Multiplication with Shared Memory (GEMM)
Using nvmath and cuda.core APIs
======================================================================

Device: NVIDIA GeForce RTX 4090
Compute Capability: sm_89

Custom kernel compiled ✓

Matrix dimensions: A(1024x1024) × B(1024x1024) = C(1024x1024)
Custom kernel tile size: 16x16

----------------------------------------------------------------------
NVMATH MATMUL (cuBLASLt)
----------------------------------------------------------------------
Using nvmath.linalg.advanced.Matmul stateful API
Average time: X.XXX ms
Performance: XXXX.XX GFLOPS

----------------------------------------------------------------------
CUSTOM KERNEL (Tiled + Shared Memory + Loop Unrolling)
----------------------------------------------------------------------
Grid: (64, 64), Block: (16, 16)
Average time: X.XXX ms
Performance: XXX.XX GFLOPS

----------------------------------------------------------------------
VERIFICATION
----------------------------------------------------------------------
nvmath         : PASSED (max error: X.XXe-XX)
Custom kernel  : PASSED (max error: X.XXe-XX)

======================================================================
PERFORMANCE SUMMARY
======================================================================
Implementation                 Time (ms)    GFLOPS
----------------------------------------------------------------------
nvmath (cuBLASLt)              X.XXX        XXXX.XX
Custom (shared mem + unroll)   X.XXX        XXX.XX
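
The GFLOPS figures follow from the standard GEMM operation count of 2 × M × N × K (one multiply and one add per K step per output element). A small helper shows the conversion (illustrative, not the sample's code):

def gflops(m, n, k, time_ms):
    # 2*M*N*K floating-point operations, converted from ms to GFLOP/s
    return (2 * m * n * k) / (time_ms * 1e-3) / 1e9

print(gflops(1024, 1024, 1024, 1.0))  # ≈ 2147.48 for a 1 ms run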

Tiling Concept

     Matrix A (M×K)          Matrix B (K×N)          Matrix C (M×N)
    ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
    │ T00 │ T01 │...│       │ T00 │ T01 │...│       │     │     │   │
    ├─────┼─────┼───┤       ├─────┼─────┼───┤       ├─────┼─────┼───┤
    │ T10 │ T11 │...│   ×   │ T10 │ T11 │...│   =   │     │ Cij │   │
    ├─────┼─────┼───┤       ├─────┼─────┼───┤       ├─────┼─────┼───┤
    │ ... │ ... │...│       │ ... │ ... │...│       │     │     │   │
    └───────────────┘       └───────────────┘       └───────────────┘

    Cij = Σ (A_tile_row × B_tile_col) for all tiles along K
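
The kernel below sketches this scheme as CUDA C embedded in a Python string, the form the sample feeds to runtime compilation via cuda.core. It is an illustration rather than the sample's exact source, and it assumes M, N, and K are multiples of TILE:

TILE = 16
kernel_source = r"""
#define TILE 16
extern "C" __global__ void matmul_tiled(const float* A, const float* B,
                                        float* C, int M, int N, int K) {
    // One tile of A and one tile of B staged in fast on-chip shared memory.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March both tiles along the K dimension (assumes K % TILE == 0).
    for (int t = 0; t < K / TILE; ++t) {
        // Cooperative load: each thread issues one global read per matrix.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the staged tiles; unrolling exposes ILP.
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
"""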

nvmath Stateful API

import cupy as cp
import nvmath.linalg.advanced as nvmath_advanced

m = n = k = 1024

# Create matrices (CuPy arrays)
A = cp.random.rand(m, k).astype(cp.float32)
B = cp.random.rand(k, n).astype(cp.float32)

# Use stateful API for fine-grained control
with nvmath_advanced.Matmul(A, B) as mm:
    mm.plan()           # Find optimal algorithm
    C = mm.execute()    # Execute computation
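
Separating plan() from execute() lets one plan be reused across repeated multiplications, and the stateful object also exposes an optional autotune() step that benchmarks candidate algorithms before execution. A sketch (see the nvmath-python Matmul documentation for the full interface):

with nvmath_advanced.Matmul(A, B) as mm:
    mm.plan()
    mm.autotune()        # optional: pick the fastest candidate algorithm
    for _ in range(5):   # planning cost is amortized over repeated runs
        C = mm.execute()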

Memory Access Optimization (Custom Kernel)

Implementation    Global reads per C element    Reduction
----------------------------------------------------------
Naive             2 × K                         (baseline)
Tiled (16×16)     2 × K / 16                    16×
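
With the sample's K = 1024, that works out to 2 × 1024 = 2048 global reads per element of C for the naive kernel versus 2 × 1024 / 16 = 128 for the tiled kernel, since each thread block stages a tile once in shared memory and reuses it 16 times.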

Files

  • matrixMulSharedMem.py - Python implementation comparing nvmath vs custom kernel
  • README.md - This file
  • requirements.txt - Sample dependencies

See Also