# Matrix Multiplication with Shared Memory (GEMM)

Demonstrates efficient matrix multiplication using nvmath-python APIs and a custom CUDA kernel that combines tiling, shared memory, and loop unrolling.

## Overview

- Uses `nvmath.linalg.advanced.Matmul` for high-performance GEMM via cuBLASLt
- Compares it with a custom CUDA kernel that uses tiling and shared memory (see the kernel sketch below)
- Shows how tiling reduces global memory bandwidth requirements
- Demonstrates shared memory for data reuse within thread blocks
- Uses loop unrolling to improve instruction-level parallelism

## What You'll Learn

- How to use the nvmath stateful API for optimized matrix multiplication
- How to tile matrix operations for better cache locality
- Using shared memory to reduce redundant global memory accesses
- Loop unrolling techniques for GPU kernels
- Benchmarking and comparing kernel performance (see the timing sketch below)

## Key Libraries

- `nvmath-python` - NVIDIA math library with cuBLASLt access
- `cuda.core` - Modern CUDA Python API for custom kernel compilation
- `cupy` - GPU array library for Python

## Key APIs

### From `nvmath.linalg.advanced`:

- `Matmul()` - Stateful matrix multiplication with planning and execution phases
- `MatmulComputeType` - Compute type options for mixed precision

### From `cuda.core`:

- `Device()` - CUDA device management and properties
- `Program()` - Runtime kernel compilation (NVRTC)
- `LaunchConfig()` - Kernel launch configuration (grid/block dimensions)
- `launch()` - Kernel execution on a stream
- `Stream.record_event()` / `Event.elapsed_time()` - GPU timing

## Requirements

### Hardware:

- NVIDIA GPU with Compute Capability 7.0 or higher
- Minimum GPU memory: 256 MB (for 1024×1024 matrices)

### Software:

- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- See requirements.txt for package dependencies

## Installation

```bash
cd cuda-samples/python/2_CoreConcepts/matrixMulSharedMem
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## How to Run

```bash
python matrixMulSharedMem.py
```

## Expected Output

```
======================================================================
Matrix Multiplication with Shared Memory (GEMM)
Using nvmath and cuda.core APIs
======================================================================

Device: NVIDIA GeForce RTX 4090
Compute Capability: sm_89

Custom kernel compiled ✓

Matrix dimensions: A(1024x1024) × B(1024x1024) = C(1024x1024)
Custom kernel tile size: 16x16

----------------------------------------------------------------------
NVMATH MATMUL (cuBLASLt)
----------------------------------------------------------------------
Using nvmath.linalg.advanced.Matmul stateful API
Average time: X.XXX ms
Performance: XXXX.XX GFLOPS

----------------------------------------------------------------------
CUSTOM KERNEL (Tiled + Shared Memory + Loop Unrolling)
----------------------------------------------------------------------
Grid: (64, 64), Block: (16, 16)
Average time: X.XXX ms
Performance: XXX.XX GFLOPS

----------------------------------------------------------------------
VERIFICATION
----------------------------------------------------------------------
nvmath        : PASSED (max error: X.XXe-XX)
Custom kernel : PASSED (max error: X.XXe-XX)

======================================================================
PERFORMANCE SUMMARY
======================================================================
Implementation                      Time (ms)    GFLOPS
----------------------------------------------------------------------
nvmath (cuBLASLt)                   X.XXX        XXXX.XX
Custom (shared mem + unroll)        X.XXX        XXX.XX
```

## Tiling Concept

```
 Matrix A (M×K)        Matrix B (K×N)        Matrix C (M×N)
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ T00 │ T01 │...│     │ T00 │ T01 │...│     │     │     │   │
├─────┼─────┼───┤     ├─────┼─────┼───┤     ├─────┼─────┼───┤
│ T10 │ T11 │...│  ×  │ T10 │ T11 │...│  =  │     │ Cij │   │
├─────┼─────┼───┤     ├─────┼─────┼───┤     ├─────┼─────┼───┤
│ ... │ ... │...│     │ ... │ ... │...│     │     │     │   │
└───────────────┘     └───────────────┘     └───────────────┘

Cij = Σ (A_tile_row × B_tile_col)  for all tiles along K
```

Each thread block computes one tile of C: it steps through the tiles of A and B along the K dimension, staging each pair of tiles in shared memory before accumulating their product.
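## Tiled Kernel Sketch

The sample's actual kernel lives in `matrixMulSharedMem.py`. The following is a minimal sketch of the same technique: a 16×16 tiled GEMM with shared-memory staging and `#pragma unroll`, compiled at runtime with `cuda.core`. The `Program`/`LaunchConfig`/`launch` usage follows the patterns in the cuda.core examples and may need adjustment for your release; the kernel name `matmul_tiled` and the dimensions are illustrative.

```python
import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

# 16x16 tiled GEMM: each block stages one tile of A and one tile of B in
# shared memory, so every loaded element is reused by 16 threads.
KERNEL = r"""
#define TILE 16
extern "C" __global__
void matmul_tiled(const float* A, const float* B, float* C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperatively load one tile of A and one of B (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] = (row < M && t * TILE + threadIdx.x < K)
            ? A[row * K + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < K && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        #pragma unroll   // unroll the inner product over the tile
        for (int i = 0; i < TILE; ++i)
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Compile with NVRTC for the current GPU architecture.
arch = "".join(str(d) for d in dev.compute_capability)
prog = Program(KERNEL, code_type="c++", options=ProgramOptions(std="c++17", arch=f"sm_{arch}"))
kernel = prog.compile("cubin").get_kernel("matmul_tiled")

m = n = k = 1024
A = cp.random.rand(m, k).astype(cp.float32)
B = cp.random.rand(k, n).astype(cp.float32)
C = cp.zeros((m, n), dtype=cp.float32)
cp.cuda.get_current_stream().synchronize()  # arrays ready before launch

TILE = 16
config = LaunchConfig(grid=((n + TILE - 1) // TILE, (m + TILE - 1) // TILE),
                      block=(TILE, TILE))
launch(stream, config, kernel,
       A.data.ptr, B.data.ptr, C.data.ptr,
       cp.int32(m), cp.int32(n), cp.int32(k))
stream.sync()
```

Because `TILE` is a compile-time constant, `#pragma unroll` lets NVRTC fully unroll the 16-iteration inner product, and each thread performs K/16 global loads per input matrix instead of K, which is the 16× reduction shown in the table below.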
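## Timing with CUDA Events

The timings in the expected output are collected with CUDA events. Below is a minimal sketch using the `Stream.record_event()` / `Event.elapsed_time()` APIs named in the Key APIs section, reusing `stream`, `config`, `kernel`, and the arrays from the sketch above; `EventOptions(enable_timing=True)` and the millisecond return value of `elapsed_time()` are assumptions to verify against your cuda.core release.

```python
from cuda.core.experimental import EventOptions

# Average over several launches; one warm-up launch excludes first-run overhead.
n_iters = 10
timing = EventOptions(enable_timing=True)   # assumed: timing-enabled events

launch(stream, config, kernel, A.data.ptr, B.data.ptr, C.data.ptr,
       cp.int32(m), cp.int32(n), cp.int32(k))          # warm-up
start = stream.record_event(options=timing)
for _ in range(n_iters):
    launch(stream, config, kernel, A.data.ptr, B.data.ptr, C.data.ptr,
           cp.int32(m), cp.int32(n), cp.int32(k))
end = stream.record_event(options=timing)
end.sync()

avg_ms = end.elapsed_time(start) / n_iters  # assumed: milliseconds since `start`
gflops = (2 * m * n * k) / (avg_ms * 1e-3) / 1e9  # GEMM costs 2*M*N*K FLOPs
print(f"Average time: {avg_ms:.3f} ms   Performance: {gflops:.2f} GFLOPS")
```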
## nvmath Stateful API

```python
import cupy as cp
import nvmath.linalg.advanced as nvmath_advanced

# Create matrices (CuPy arrays)
m = n = k = 1024
A = cp.random.rand(m, k).astype(cp.float32)
B = cp.random.rand(k, n).astype(cp.float32)

# Use the stateful API for fine-grained control
with nvmath_advanced.Matmul(A, B) as mm:
    mm.plan()           # Find optimal algorithm
    C = mm.execute()    # Execute computation
```

## Memory Access Optimization (Custom Kernel)

| Implementation | Global reads per C element | Reduction  |
|----------------|----------------------------|------------|
| Naive          | 2 × K                      | (baseline) |
| Tiled (16×16)  | 2 × K / 16                 | 16×        |

With tiling, each element loaded into a 16×16 shared-memory tile is reused by 16 threads of the block, so each thread performs K/16 global reads per input matrix instead of K.

## Files

- `matrixMulSharedMem.py` - Python implementation comparing nvmath vs. the custom kernel
- `README.md` - This file
- `requirements.txt` - Sample dependencies

## See Also

- [nvmath-python Documentation](https://docs.nvidia.com/cuda/nvmath-python/)
- [CUDA Python Documentation](https://nvidia.github.io/cuda-python/)
- [cuda.core API Guide](https://nvidia.github.io/cuda-python/cuda-core/latest/)
- [CuPy Documentation](https://docs.cupy.dev/)