# simplePrint - Printing from CUDA Kernels

## Description

This sample demonstrates how to use `printf()` inside CUDA kernels using **two different approaches**:

1. **CUDA C++ kernels** compiled with `cuda.core.Program` - Full C++ features and control
2. **Numba CUDA kernels** - Pythonic kernel authoring using `numba.cuda.grid()` for modern indexing

The sample shows basic device management, kernel compilation with inline CUDA C++ code, and multi-dimensional kernel launches (2D grid × 3D blocks) using modern CUDA Python. The Numba example demonstrates the recommended `numba.cuda.grid()` indexing style while also showing how it relates to classic CUDA C++ block/thread IDs. Both approaches use `cuda.core` APIs for stream management and synchronization, demonstrating interoperability.

This is the Python equivalent of the C++ `simplePrintf` sample, enhanced with Numba CUDA examples.

## Key Concepts

CUDA Python (cuda.core), Numba CUDA, Kernel Compilation, Printf in Kernels, Multi-dimensional Launch, Pythonic GPU Programming, Modern Thread Indexing (grid()), Stream-based Execution, cuda.core/Numba Interoperability

## CUDA APIs involved

### [cuda.core (cuda-python)](https://nvidia.github.io/cuda-python/)

- `Device()` - Device management
- `Device.create_stream()` - Create CUDA streams
- `Stream.sync()` - Synchronize stream execution
- `Program()` - Compile CUDA C++ kernels
- `LaunchConfig()` - Configure kernel launch
- `launch()` - Execute kernels on streams

### [Numba CUDA](https://nvidia.github.io/numba-cuda/)

- `@cuda.jit` - JIT compile Python functions to CUDA kernels
- `cuda.grid()` - Get global thread position (recommended modern approach)
- `cuda.blockIdx`, `cuda.threadIdx` - Thread/block indices (classic style)
- `cuda.gridDim`, `cuda.blockDim` - Grid/block dimensions
- **Note:** Uses `cuda.core` APIs for stream management (interoperability)

### CUDA Kernel Functions

- `printf()` - Print from device code (C++)
- `print()` - Print from device code (Numba, limited
formatting)
- `blockIdx`, `threadIdx` - Thread/block indices
- `gridDim`, `blockDim` - Grid/block dimensions

### What You Learn

- Device initialization with `cuda.core.Device`
- Compiling CUDA C++ kernels with `Program` and `ProgramOptions`
- Writing Pythonic CUDA kernels with Numba's `@cuda.jit` decorator
- Using `numba.cuda.grid()` for modern thread indexing (recommended approach)
- Understanding the relationship between global coordinates and classic block/thread IDs
- **Interoperability**: Using `cuda.core` streams with Numba CUDA kernels
- Comparing CUDA C++ vs Pythonic kernel authoring approaches
- Multi-dimensional kernel launches (2D grid, 3D blocks)
- Using streams for kernel execution and synchronization
- Using `printf()` and `print()` in GPU kernels for debugging
- Understanding print limitations in Numba CUDA (no f-strings)
- Proper error handling and resource management

## Requirements

### Hardware:

- NVIDIA GPU with Compute Capability 7.0 or higher
- Minimum GPU memory: 512 MB

### Software:

- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- `cuda-python` package (13.0+)
- `cuda-core` package (>=0.6.0)
- `numba-cuda` package (0.24.0+, for Pythonic kernel authoring)

Download and install:

- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
- [cuda-python package](https://nvidia.github.io/cuda-python/): `pip install cuda-python`
- [numba-cuda](https://nvidia.github.io/numba-cuda/): `pip install numba-cuda`

## Build and Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the sample
python simplePrint.py
```

## Expected Output

```
Simple Print - Printing from CUDA Kernels
Demonstrating both CUDA C++ and Numba CUDA approaches

Device:
Compute Capability: sm_

======================================================================
METHOD 1: CUDA C++ Kernel (via cuda.core.Program)
======================================================================
Advantage: Full C++ features, better for complex kernels

Compiling CUDA C++ kernel...
Kernel compiled successfully.

Kernel configuration:
  Grid: (2, 2)
  Block: (2, 2, 2)
  Total threads: 32

Launching kernel with value=10.

Output:
[0, 0]: Value is: 10
[0, 1]: Value is: 10
[0, 2]: Value is: 10
[0, 3]: Value is: 10
[0, 4]: Value is: 10
[0, 5]: Value is: 10
[0, 6]: Value is: 10
[0, 7]: Value is: 10
[1, 0]: Value is: 10
...
[3, 7]: Value is: 10

CUDA C++ kernel execution complete.

======================================================================
METHOD 2: Numba CUDA Kernel (Pythonic / modern indexing)
======================================================================
Advantage: Uses numba.cuda.grid(3) for global indexing,
while still showing classic CUDA C++ IDs for reference.
Uses cuda.core for stream management (interoperability).

Kernel configuration:
  Grid: (2, 2)
  Block: (2, 2, 2)
  Total threads: 32

Launching Numba CUDA kernel (grid(3) + classic IDs) with value=10:
Uses numba.cuda.grid(3) to get global (x, y, z), and prints
the corresponding blockId/threadId like the C++ sample.
Stream managed by cuda.core for consistency with C++ example.

global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10
global[ 1 , 0 , 0 ] -> [ 0 , 1 ]: Value is: 10
global[ 0 , 1 , 0 ] -> [ 0 , 2 ]: Value is: 10
...
global[ 3 , 3 , 1 ] -> [ 3 , 7 ]: Value is: 10

Numba CUDA kernel execution complete.

======================================================================
Done! Both kernel approaches demonstrated successfully.
======================================================================
```

## Understanding the Output

- **Grid**: 2×2 = 4 blocks (labeled 0-3)
- **Block**: 2×2×2 = 8 threads per block (labeled 0-7)
- **Total**: 32 threads, each printing its position and value

### CUDA C++ Kernel:

Each thread calculates:

- Block ID (linear): `blockIdx.y * gridDim.x + blockIdx.x`
- Thread ID (linear): `threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x`

### Numba CUDA Kernel:

Each thread shows:

- **Global position** using `numba.cuda.grid(3)` → `(x, y, z)` coordinates across the entire grid
- **Classic IDs** (block ID, thread ID) calculated the same way as in C++ for comparison
- This demonstrates how modern indexing relates to traditional CUDA C++ style

### Comparing the Two Approaches

**CUDA C++ Kernel (Method 1):**

- Uses C++ syntax and `printf()` with full formatting control
- Requires compilation via `cuda.core.Program`
- Best for complex kernels needing C++ features (templates, libraries, etc.)
- Uses classic block/thread ID indexing
- Output: `[0, 0]: Value is: 10` (clean formatting)

**Numba CUDA Kernel (Method 2):**

- Uses Python syntax with the `@cuda.jit` decorator
- JIT compiled automatically when called
- Best for prototyping and simpler kernels
- **Modern indexing**: Uses `numba.cuda.grid(3)` to get global thread coordinates (recommended)
- Also shows classic block/thread IDs to help relate the two indexing models
- **Interoperability**: Uses `cuda.core` streams via `stream` for consistency
- Demonstrates that numba-cuda kernels can work seamlessly with cuda.core infrastructure
- Limited print formatting (no f-strings; basic `print()` only, which adds spaces between arguments)
- Output: `global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10` (shows both indexing styles; note the extra spaces due to `print()` behavior)

## Experiments

Try modifying:

### For Both Approaches:

- **Grid size**: Change `grid=(4, 4)` for 16 blocks
- **Block size**: Change `block=(4, 4, 4)` for 64 threads per block
- **Conditional printing**: Print only from specific threads (e.g., `if threadId == 0:`)

### CUDA C++ Specific:

- **Format strings**: Experiment with different `printf()` formats
- **Kernel code**: Add complex C++ computations before printing
- **External libraries**: Include CUDA math libraries or device functions

### Numba CUDA Specific:

- **Grid indexing**: Try `numba.cuda.grid(1)` or `numba.cuda.grid(2)` for different dimensions
- **Conditional printing**: Print only from threads where `x == 0` or `y == z`
- **Python operations**: Use NumPy-like operations in the kernel
- **Device math libraries**: Use [nvmath-python device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/index.html) for optimized math operations (similar to CUDA math libraries in C++)
- **Shared memory**: Add `numba.cuda.shared.array()` for fast inter-thread communication
- **Atomic operations**: Try `numba.cuda.atomic.add()` for thread-safe updates
- **Print variations**:
Experiment with what numba-cuda's `print()` can and cannot handle
- **Streams**: Create multiple `cuda.core` streams and launch numba-cuda kernels on them concurrently
- **Interoperability**: Mix numba-cuda kernels and CUDA C++ kernels on the same stream

## Notes

### General:

- Printing from the GPU is relatively slow - use it sparingly in production code
- Printf output is buffered and limited (~1 MB buffer on most GPUs)

### CUDA C++ Kernels:

- Always call `stream.sync()` after kernel launch to flush printf output
- Full `printf()` format string support (%, flags, width, precision)

### Numba CUDA Kernels:

- **Recommended**: Use `numba.cuda.grid(ndim)` for thread indexing (modern, Pythonic)
  - `grid(1)` for 1D indexing, `grid(2)` for 2D, `grid(3)` for 3D
  - Returns the global thread position across the entire grid
- **Interoperability**: Use `cuda.core` streams with Numba kernels via `stream`
  - Create streams: `stream = device.create_stream()`
  - Launch kernels: `kernel[grid, block, stream](args)`
  - Synchronize: `stream.sync()`
- Numba's `print()` has limited capabilities compared to Python's `print()`
  - F-strings are NOT supported in Numba CUDA kernels
  - Use comma-separated arguments: `print("Value:", x)` instead of f-strings
  - **Note**: `print()` automatically adds spaces between comma-separated arguments (e.g., `print("[", x, "]")` outputs `[ 0 ]`, not `[0]`)
- Always synchronize the stream to flush output

## Files

- `simplePrint.py` - Python implementation using the cuda.core API
- `README.md` - This file
- `requirements.txt` - Sample dependencies

## See Also

### CUDA Python (cuda.core):

- [cuda.core Documentation](https://nvidia.github.io/cuda-python/)
- [CUDA Python Examples](https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples)

### Numba CUDA:

- [Numba CUDA Documentation](https://nvidia.github.io/numba-cuda/)
- [numba.cuda.grid() Reference](https://nvidia.github.io/numba-cuda/reference/kernel.html#numba.cuda.grid)
- [nvmath-python Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/index.html) - Optimized math operations for Numba CUDA kernels

### CUDA References:

- [CUDA C Programming Guide - Printf](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output)
- [C++ simplePrintf Sample](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simplePrintf)