Mirror of https://github.com/NVIDIA/cuda-samples.git (synced 2026-05-14 14:06:53 +08:00)
# Launch Configuration Tuning (Python)

## Description
Benchmark different CUDA kernel launch configurations to find the optimal block-size setting using cuda.core APIs. This sample demonstrates performance tuning by measuring execution time across various thread block sizes.
## What You'll Learn
- Compiling CUDA kernels at runtime with `cuda.core.Program`
- Launching kernels with different `LaunchConfig` settings
- Benchmarking kernel performance with precise timing
- Understanding how thread block size affects performance
- Tuning for memory-bound vs compute-bound kernels
## Key Concepts

### Launch Configuration with cuda.core

```python
# Configure kernel launch with specific thread block size
config = LaunchConfig(
    grid=(grid_size,),
    block=(block_size,),
    shmem_size=shared_memory_bytes
)

# Launch kernel
launch(stream, config, kernel, *args)
stream.sync()
```
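The grid size above is typically derived from the problem size. A minimal sketch, assuming a 1-D problem of `n` elements (the helper name is ours, not part of the sample):

```python
def grid_size_for(n, block_size):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size

# A 10,000,000-element problem with 32 threads per block needs
# 312,500 blocks (matching the sample output below):
print(grid_size_for(10_000_000, 32))   # → 312500
print(grid_size_for(10_000_000, 256))  # → 39063
```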
### Thread Block Sizing
Thread block size significantly impacts performance due to:
| Factor | Impact |
|---|---|
| Occupancy | More active warps can hide memory latency |
| Registers | More threads/block = fewer registers/thread |
| Shared Memory | Divided among blocks on each SM |
| Warp Efficiency | Block size should be multiple of 32 |
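The register row can be made concrete with back-of-the-envelope arithmetic. Assuming a 64K 32-bit register file per SM (typical of recent NVIDIA architectures; check your GPU's compute capability, as this is not stated in the sample):

```python
REGISTERS_PER_SM = 65536  # assumption: 64K 32-bit registers per SM

def max_registers_per_thread(block_size, blocks_per_sm):
    """Upper bound on registers/thread before occupancy must drop."""
    return REGISTERS_PER_SM // (block_size * blocks_per_sm)

# One 1024-thread block per SM leaves at most 64 registers per thread:
print(max_registers_per_thread(1024, 1))  # → 64
# Two 256-thread blocks per SM leave 128:
print(max_registers_per_thread(256, 2))   # → 128
```

If the compiler needs more registers per thread than this bound, fewer blocks fit per SM and occupancy falls, which is why very large blocks can backfire on register-heavy kernels.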
### Benchmarking Approach

```python
# Use CUDA events for accurate GPU timing (not CPU wall-clock)
start_event = device.create_event(options=EventOptions(enable_timing=True))
end_event = device.create_event(options=EventOptions(enable_timing=True))

stream.record(start_event)
for _ in range(n_iterations):
    launch(stream, config, kernel, *args)
stream.record(end_event)
end_event.sync()

elapsed_ms = (end_event - start_event) / n_iterations
```
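The timing loop above is then repeated for each candidate block size and the fastest one selected. A GPU-free sketch of that selection step, where `time_config` stands in for the event-timed launch (the function names here are ours, not the sample's):

```python
def best_block_size(time_config, candidates=(32, 64, 128, 256, 512, 1024)):
    """Time each candidate block size; return (block_size, elapsed_ms) of the fastest."""
    timings = {bs: time_config(bs) for bs in candidates}
    best = min(timings, key=timings.get)
    return best, timings[best]

# Example with a fake timing model that favors 256 threads/block:
fake_timings = {32: 1.9, 64: 1.4, 128: 1.1, 256: 0.9, 512: 1.0, 1024: 1.2}
print(best_block_size(fake_timings.get))  # → (256, 0.9)
```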
## Key APIs

From `cuda.core`:

- `Device` - CUDA device management
- `Program` - Runtime kernel compilation (NVRTC)
- `ProgramOptions` - Compilation options (architecture target)
- `LaunchConfig` - Kernel launch configuration (grid/block dimensions)
- `launch` - Execute compiled kernel (accepts Buffer objects directly)
- `EventOptions` - GPU timing with CUDA events
- `ManagedMemoryResource` - Device-preferred unified memory
- `ManagedMemoryResourceOptions` - Set `preferred_location` for representative benchmarks
From `numpy`:

- `np.from_dlpack()` - Zero-copy view of GPU buffers via DLPack
Benchmarked Kernels:
- `vector_add` - Simple memory-bound kernel (C = A + B); low sensitivity to block size
- `reduce_sum` - Shared-memory reduction; high sensitivity to block size
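In this style of sample, kernels are held as CUDA C source strings and compiled at runtime via `cuda.core.Program` (NVRTC). A hedged sketch of what the `vector_add` source might look like; the exact source in `launchConfigTuning.py` may differ:

```python
# CUDA C source for the memory-bound kernel, to be handed to
# cuda.core's Program for NVRTC compilation at runtime.
vector_add_src = r"""
extern "C" __global__
void vector_add(const float* A, const float* B, float* C, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // guard: the grid may overshoot n
        C[i] = A[i] + B[i];
    }
}
"""
```

The bounds check is what lets any block size from the sweep be used safely: the last block may contain threads past the end of the array.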
## Requirements
Hardware:
- NVIDIA GPU with CUDA support
- Minimum GPU memory: 512 MB
Software:
- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- See `requirements.txt` for Python packages
## Installation

```shell
pip install -r requirements.txt
```
## How to Run

```shell
python launchConfigTuning.py
```
## Expected Output

```text
============================================================
Launch Configuration Tuning (cuda.core)
Finding the Best Block Size for Your Kernel
============================================================
Device: <Your GPU Name>
Compute Capability: X.X

Compiling CUDA kernels with cuda.core.Program...
Target architecture: sm_XX
✓ vector_add kernel compiled
✓ reduce_sum kernel compiled

============================================================
VECTOR ADDITION - Launch Configuration Tuning
============================================================
Problem size: 10,000,000 elements
Kernel: vector_add (C = A + B)
Testing thread configurations: [32, 64, 128, 256, 512, 1024]
------------------------------------------------------------
Block Size:   32 | Blocks: 312500 | Time: X.XXXX ± X.XXXX ms
Block Size:   64 | Blocks: 156250 | Time: X.XXXX ± X.XXXX ms
...
------------------------------------------------------------
✓ OPTIMAL: block_size=XXX (X.XXXX ms)
✗ WORST:   block_size=XXX (X.XXXX ms)
  Speedup: X.XXx
✓ Results verified correct!
...
============================================================
SAMPLE COMPLETE
============================================================
Key Takeaway: The optimal thread configuration depends on your
specific kernel characteristics. Always benchmark to find the best!
```
## Tuning Guidelines

### Start Here
- 128-256 threads/block is a good starting point for most kernels
- Always use multiples of 32 (warp size)
### Memory-Bound Kernels
- Less sensitive to thread configuration
- Focus on memory access patterns
- Higher thread counts help hide latency
### Compute-Bound Kernels
- More sensitive to thread configuration
- Watch for register pressure at high thread counts
- Profile with Nsight Compute
### Reduction Kernels
- Block size affects shared memory usage
- Power-of-2 sizes simplify reduction logic
- Often 256-512 threads works well
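The guidelines above amount to restricting the sweep to warp-multiple, power-of-two sizes up to the device limit. A small helper to generate such candidates (our naming, not from the sample):

```python
def candidate_block_sizes(max_threads=1024):
    """Power-of-two block sizes from one warp (32) up to the device limit.

    32 is the warp size on all current NVIDIA GPUs; max_threads is the
    per-block limit, 1024 on most devices.
    """
    sizes = []
    bs = 32
    while bs <= max_threads:
        sizes.append(bs)
        bs *= 2
    return sizes

print(candidate_block_sizes())  # → [32, 64, 128, 256, 512, 1024]
```

This reproduces the configuration list the sample sweeps; a real tuner could query the limit from the device's attributes instead of hard-coding 1024.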
## Files

- `launchConfigTuning.py` - Python implementation using cuda.core
- `README.md` - This file
- `requirements.txt` - Sample dependencies