Mirror of https://github.com/NVIDIA/cuda-samples.git (synced 2026-05-14 14:06:53 +08:00)

# simplePrint - Printing from CUDA Kernels

## Description

This sample demonstrates how to use `printf()` inside CUDA kernels using **two different approaches**:

1. **CUDA C++ kernels** compiled with `cuda.core.Program` - full C++ features and control
2. **Numba CUDA kernels** - Pythonic kernel authoring using `numba.cuda.grid()` for modern indexing

The sample shows basic device management, kernel compilation with inline CUDA C++ code, and multi-dimensional kernel launches (2D grid × 3D blocks) using modern CUDA Python. The Numba example demonstrates the recommended `numba.cuda.grid()` indexing style while also showing how it relates to classic CUDA C++ block/thread IDs. Both approaches use `cuda.core` APIs for stream management and synchronization, demonstrating interoperability.

This is the Python equivalent of the C++ `simplePrintf` sample, enhanced with Numba CUDA examples.

## Key Concepts

CUDA Python (cuda.core), Numba CUDA, Kernel Compilation, Printf in Kernels, Multi-dimensional Launch, Pythonic GPU Programming, Modern Thread Indexing (grid()), Stream-based Execution, cuda.core/Numba Interoperability

## CUDA APIs involved

### [cuda.core (cuda-python)](https://nvidia.github.io/cuda-python/)

- `Device()` - Device management
- `Device.create_stream()` - Create CUDA streams
- `Stream.sync()` - Synchronize stream execution
- `Program()` - Compile CUDA C++ kernels
- `LaunchConfig()` - Configure kernel launch
- `launch()` - Execute kernels on streams

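As a rough illustration of how these calls fit together, here is a minimal, untested sketch using the `cuda.core.experimental` module layout. The kernel name `print_value`, the `total_threads` helper, and the exact compile options are illustrative assumptions, not the sample's actual code; the GPU-dependent part is wrapped in `main()`, to be called only on a machine with an NVIDIA GPU and `cuda-core` installed.

```python
import ctypes
from math import prod

# Inline CUDA C++ source as it might be handed to cuda.core's Program.
# The kernel name and format string here are illustrative.
KERNEL_SOURCE = r"""
extern "C" __global__ void print_value(int value) {
    int block_id  = blockIdx.y * gridDim.x + blockIdx.x;
    int thread_id = threadIdx.z * blockDim.x * blockDim.y
                  + threadIdx.y * blockDim.x + threadIdx.x;
    printf("[%d, %d]: Value is: %d\n", block_id, thread_id, value);
}
"""

def total_threads(grid, block):
    """Total threads in a launch: product of all grid and block dimensions."""
    return prod(grid) * prod(block)

def main():
    """GPU-dependent sketch; API names follow cuda.core's experimental
    namespace and may change between releases."""
    from cuda.core.experimental import (Device, LaunchConfig, Program,
                                        ProgramOptions, launch)

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Target the current GPU's architecture, e.g. "sm_80".
    arch = "".join(str(i) for i in dev.compute_capability)
    prog = Program(KERNEL_SOURCE, code_type="c++",
                   options=ProgramOptions(arch=f"sm_{arch}"))
    kernel = prog.compile("cubin").get_kernel("print_value")

    config = LaunchConfig(grid=(2, 2), block=(2, 2, 2))  # 4 blocks x 8 threads
    launch(stream, config, kernel, ctypes.c_int(10))     # value = 10
    stream.sync()                                        # flushes printf output
```

With the sample's 2×2 grid of 2×2×2 blocks, `total_threads((2, 2), (2, 2, 2))` gives the 32 threads seen in the expected output.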
### [Numba CUDA](https://nvidia.github.io/numba-cuda/)

- `@cuda.jit` - JIT compile Python functions to CUDA kernels
- `cuda.grid()` - Get global thread position (recommended modern approach)
- `cuda.blockIdx`, `cuda.threadIdx` - Thread/block indices (classic style)
- `cuda.gridDim`, `cuda.blockDim` - Grid/block dimensions
- **Note:** Uses `cuda.core` APIs for stream management (interoperability)

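A hedged sketch of how such a Numba kernel might look, mirroring the sample's output format. `classic_ids` is a pure-Python helper invented here for the linear-ID formulas; the GPU part is wrapped in `main()`, to be called only where `numba-cuda` and an NVIDIA GPU are available.

```python
def classic_ids(bx, by, tx, ty, tz, grid_dim, block_dim):
    """Pure-Python version of the linear IDs the kernels print:
    the same formulas the C++ sample uses for blockId/threadId."""
    gx, gy = grid_dim
    dx, dy, dz = block_dim
    block_id = by * gx + bx
    thread_id = tz * dx * dy + ty * dx + tx
    return block_id, thread_id

def main():
    """GPU-dependent sketch of a Numba CUDA kernel (illustrative names)."""
    from numba import cuda

    @cuda.jit
    def print_value(value):
        # Modern indexing: global (x, y, z) across the whole grid.
        x, y, z = cuda.grid(3)
        # Classic indexing, for comparison with the C++ kernel.
        block_id = cuda.blockIdx.y * cuda.gridDim.x + cuda.blockIdx.x
        thread_id = (cuda.threadIdx.z * cuda.blockDim.x * cuda.blockDim.y
                     + cuda.threadIdx.y * cuda.blockDim.x + cuda.threadIdx.x)
        # Numba's device print() inserts spaces between arguments.
        print("global[", x, ",", y, ",", z, "] -> [",
              block_id, ",", thread_id, "]: Value is:", value)

    print_value[(2, 2), (2, 2, 2)](10)  # 2x2 grid of 2x2x2 blocks
    cuda.synchronize()                  # flush device print output
```

For example, the sample's line `global[ 3 , 3 , 1 ] -> [ 3 , 7 ]` corresponds to block (1, 1), thread (1, 1, 1), which `classic_ids` maps to block ID 3 and thread ID 7.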
### CUDA Kernel Functions

- `printf()` - Print from device code (C++)
- `print()` - Print from device code (Numba, limited formatting)
- `blockIdx`, `threadIdx` - Thread/block indices
- `gridDim`, `blockDim` - Grid/block dimensions

### What You Learn

- Device initialization with `cuda.core.Device`
- Compiling CUDA C++ kernels with `Program` and `ProgramOptions`
- Writing Pythonic CUDA kernels with Numba's `@cuda.jit` decorator
- Using `numba.cuda.grid()` for modern thread indexing (recommended approach)
- Understanding the relationship between global coordinates and classic block/thread IDs
- **Interoperability**: Using `cuda.core` streams with Numba CUDA kernels
- Comparing CUDA C++ vs Pythonic kernel authoring approaches
- Multi-dimensional kernel launches (2D grid, 3D blocks)
- Using streams for kernel execution and synchronization
- Using `printf()` and `print()` in GPU kernels for debugging
- Understanding print limitations in Numba CUDA (no f-strings)
- Proper error handling and resource management

## Requirements

### Hardware:

- NVIDIA GPU with Compute Capability 7.0 or higher
- Minimum GPU memory: 512 MB

### Software:

- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- `cuda-python` package (13.0+)
- `cuda-core` package (>=0.6.0)
- `numba-cuda` package (0.24.0+, for Pythonic kernel authoring)

Download and install:
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
- [cuda-python package](https://nvidia.github.io/cuda-python/): `pip install cuda-python`
- [numba-cuda](https://nvidia.github.io/numba-cuda/): `pip install numba-cuda`

## Build and Run

```bash
# Install dependencies
pip install -r requirements.txt

# Run the sample
python simplePrint.py
```

## Expected Output

```
Simple Print - Printing from CUDA Kernels
Demonstrating both CUDA C++ and Numba CUDA approaches

Device: <Your GPU Name>
Compute Capability: sm_<XX>

======================================================================
METHOD 1: CUDA C++ Kernel (via cuda.core.Program)
======================================================================
Advantage: Full C++ features, better for complex kernels

Compiling CUDA C++ kernel...
Kernel compiled successfully.

Kernel configuration:
Grid: (2, 2)
Block: (2, 2, 2)
Total threads: 32

Launching kernel with value=10. Output:

[0, 0]: Value is: 10
[0, 1]: Value is: 10
[0, 2]: Value is: 10
[0, 3]: Value is: 10
[0, 4]: Value is: 10
[0, 5]: Value is: 10
[0, 6]: Value is: 10
[0, 7]: Value is: 10
[1, 0]: Value is: 10
...
[3, 7]: Value is: 10

CUDA C++ kernel execution complete.


======================================================================
METHOD 2: Numba CUDA Kernel (Pythonic / modern indexing)
======================================================================
Advantage: Uses numba.cuda.grid(3) for global indexing,
while still showing classic CUDA C++ IDs for reference.
Uses cuda.core for stream management (interoperability).

Kernel configuration:
Grid: (2, 2)
Block: (2, 2, 2)
Total threads: 32

Launching Numba CUDA kernel (grid(3) + classic IDs) with value=10:
Uses numba.cuda.grid(3) to get global (x, y, z),
and prints the corresponding blockId/threadId like the C++ sample.
Stream managed by cuda.core for consistency with C++ example.

global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10
global[ 1 , 0 , 0 ] -> [ 0 , 1 ]: Value is: 10
global[ 0 , 1 , 0 ] -> [ 0 , 2 ]: Value is: 10
...
global[ 3 , 3 , 1 ] -> [ 3 , 7 ]: Value is: 10

Numba CUDA kernel execution complete.

======================================================================
Done! Both kernel approaches demonstrated successfully.
======================================================================
```

## Understanding the Output

- **Grid**: 2×2 = 4 blocks (labeled 0-3)
- **Block**: 2×2×2 = 8 threads per block (labeled 0-7)
- **Total**: 32 threads, each printing its position and value

### CUDA C++ Kernel:
Each thread calculates:
- Block ID (linear): `blockIdx.y * gridDim.x + blockIdx.x`
- Thread ID (linear): `threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x`

### Numba CUDA Kernel:
Each thread shows:
- **Global position** using `numba.cuda.grid(3)` → `(x, y, z)` coordinates across the entire grid
- **Classic IDs** (block ID, thread ID) calculated the same way as in C++ for comparison
- This demonstrates how modern indexing relates to the traditional CUDA C++ style

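The two linear formulas can be sanity-checked in plain Python by enumerating the sample's 2×2 grid of 2×2×2 blocks. `linear_ids` is a helper invented here for illustration:

```python
from itertools import product

def linear_ids(grid_dim, block_dim):
    """Enumerate every thread and compute the linear block/thread IDs
    using the same formulas as the kernels."""
    gx, gy = grid_dim
    dx, dy, dz = block_dim
    ids = []
    for by, bx in product(range(gy), range(gx)):
        for tz, ty, tx in product(range(dz), range(dy), range(dx)):
            block_id = by * gx + bx                  # blockIdx.y * gridDim.x + blockIdx.x
            thread_id = tz * dx * dy + ty * dx + tx  # tz*Dx*Dy + ty*Dx + tx
            ids.append((block_id, thread_id))
    return ids

ids = linear_ids((2, 2), (2, 2, 2))
# 32 threads total: block IDs 0-3, each paired with thread IDs 0-7 exactly once,
# matching the [block, thread] labels in the expected output.
```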
### Comparing the Two Approaches

**CUDA C++ Kernel (Method 1):**
- Uses C++ syntax and `printf()` with full formatting control
- Requires compilation via `cuda.core.Program`
- Best for complex kernels needing C++ features (templates, libraries, etc.)
- Uses classic block/thread ID indexing
- Output: `[0, 0]: Value is: 10` (clean formatting)

**Numba CUDA Kernel (Method 2):**
- Uses Python syntax with the `@cuda.jit` decorator
- JIT compiled automatically when called
- Best for prototyping and simpler kernels
- **Modern indexing**: uses `numba.cuda.grid(3)` to get global thread coordinates (recommended)
- Also shows classic block/thread IDs to help relate the two indexing models
- **Interoperability**: uses `cuda.core` streams for consistency with the C++ example
- Demonstrates that numba-cuda kernels can work seamlessly with cuda.core infrastructure
- Limited print formatting (no f-strings, basic `print()` only; adds spaces between arguments)
- Output: `global[ 0 , 0 , 0 ] -> [ 0 , 0 ]: Value is: 10` (shows both indexing styles; note the extra spaces from `print()` behavior)

## Experiments

Try modifying:

### For Both Approaches:
- **Grid size**: Change `grid=(4, 4)` for 16 blocks
- **Block size**: Change `block=(4, 4, 4)` for 64 threads per block
- **Conditional printing**: Print only from specific threads (e.g., `if threadId == 0:`)

### CUDA C++ Specific:
- **Format strings**: Experiment with different `printf()` formats
- **Kernel code**: Add complex C++ computations before printing
- **External libraries**: Include CUDA math libraries or device functions (e.g., `<cuda/std/cmath>`, `<cub/cub.cuh>`)

### Numba CUDA Specific:
- **Grid indexing**: Try `numba.cuda.grid(1)` or `numba.cuda.grid(2)` for different dimensions
- **Conditional printing**: Print only from threads where `x == 0` or `y == z`
- **Python operations**: Use NumPy-like operations in the kernel
- **Device math libraries**: Use [nvmath-python device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/index.html) for optimized math operations (similar to the CUDA math libraries in C++)
- **Shared memory**: Add `numba.cuda.shared.array()` for fast inter-thread communication
- **Atomic operations**: Try `numba.cuda.atomic.add()` for thread-safe updates
- **Print variations**: Experiment with what numba-cuda's `print()` can and cannot handle
- **Streams**: Create multiple `cuda.core` streams and launch numba-cuda kernels on them concurrently
- **Interoperability**: Mix numba-cuda kernels and CUDA C++ kernels on the same stream

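For the grid-indexing experiments, it can help to model what `numba.cuda.grid(ndim)` returns: per axis, the global coordinate is `blockIdx * blockDim + threadIdx`. The sketch below is a plain-Python emulation of that rule; `emulate_grid` is a name invented here, not a numba API.

```python
def emulate_grid(ndim, block_idx, thread_idx, block_dim):
    """Plain-Python model of numba.cuda.grid(ndim): for each axis,
    global = blockIdx * blockDim + threadIdx. Returns a scalar for
    ndim == 1 and a tuple otherwise, as cuda.grid() does."""
    coords = tuple(block_idx[i] * block_dim[i] + thread_idx[i]
                   for i in range(ndim))
    return coords[0] if ndim == 1 else coords

# Block (1, 1, 0), thread (1, 1, 1) in 2x2x2 blocks is global (3, 3, 1) --
# the last line of the sample's Numba output.
```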
## Notes

### General:
- Printing from the GPU is relatively slow - use it sparingly in production code
- Printf output is buffered and limited (~1 MB buffer by default)

### CUDA C++ Kernels:
- Always call `stream.sync()` after a kernel launch to flush printf output
- Full `printf()` format string support (%, flags, width, precision)

### Numba CUDA Kernels:
- **Recommended**: Use `numba.cuda.grid(ndim)` for thread indexing (modern, Pythonic)
  - `grid(1)` for 1D indexing, `grid(2)` for 2D, `grid(3)` for 3D
  - Returns the global thread position across the entire grid
- **Interoperability**: Use `cuda.core` streams with Numba kernels
  - Create streams: `stream = device.create_stream()`
  - Launch kernels: `kernel[grid, block, stream](args)`
  - Synchronize: `stream.sync()`
- Numba's `print()` has limited capabilities compared to Python's `print()`
  - F-strings are NOT supported in Numba CUDA kernels
  - Use comma-separated arguments: `print("Value:", x)` instead of f-strings
  - **Note**: `print()` automatically adds spaces between comma-separated arguments (e.g., `print("[", x, "]")` outputs `[ 0 ]`, not `[0]`)
- Always synchronize the stream to flush output

## Files

- `simplePrint.py` - Python implementation using the cuda.core API
- `README.md` - This file
- `requirements.txt` - Sample dependencies

## See Also

### CUDA Python (cuda.core):
- [cuda.core Documentation](https://nvidia.github.io/cuda-python/)
- [CUDA Python Examples](https://github.com/NVIDIA/cuda-python/tree/main/cuda_core/examples)

### Numba CUDA:
- [Numba CUDA Documentation](https://nvidia.github.io/numba-cuda/)
- [numba.cuda.grid() Reference](https://nvidia.github.io/numba-cuda/reference/kernel.html#numba.cuda.grid)
- [nvmath-python Device APIs](https://docs.nvidia.com/cuda/nvmath-python/latest/device-apis/index.html) - Optimized math operations for Numba CUDA kernels

### CUDA References:
- [CUDA C Programming Guide - Printf](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output)
- [C++ simplePrintf Sample](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simplePrintf)