# Sample: Single-Pass Multi-Block Reduction with Cooperative Groups (Python)

## Description
Single-kernel, two-stage reduction using Cooperative Groups and `grid.sync()`, so all blocks synchronize inside one launch: no second kernel and no CPU stage for the reduction tree.

Stack: `cuda-core` (device, compile, cooperative `launch()`, stream, CUDA events for GPU timing); CuPy for H↔D copies on the same stream (`Stream.from_external(cuda.core_stream)`, `ndarray.data.ptr` to `launch()`); `try`/`finally` closes the stream if the cooperative launch fails. Requires compute capability 6.0 or higher (Pascal+).
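The kernel shape this describes, sketched as the kind of NVRTC source string the sample compiles. This is a minimal sketch, not the sample's actual code: the name `reduce_cg`, the `partial` buffer, and the power-of-two block-size assumption are illustrative.

```python
# Sketch of the two-stage cooperative kernel (illustrative, not the sample's source).
KERNEL = r"""
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

extern "C" __global__
void reduce_cg(const float *in, float *partial, float *out, size_t n) {
    cg::grid_group grid = cg::this_grid();
    extern __shared__ float sdata[];  // one float per thread (blockDim.x * 4 bytes)

    // Stage 1: grid-stride load, then a shared-memory tree reduction per block.
    float sum = 0.0f;
    for (size_t i = grid.thread_rank(); i < n; i += grid.num_threads())
        sum += in[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {  // assumes power-of-two blockDim.x
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = sdata[0];

    grid.sync();  // every block reaches here before stage 2 reads `partial`

    // Stage 2: one thread folds the per-block partials into the final result.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
    }
}
"""
```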
## What you will learn

- `cooperative_groups::grid_group` and `grid.sync()` across the grid
- Cooperative `LaunchConfig(..., cooperative_launch=True)` and sizing blocks for residency (sketched below)
- Timing the GPU path with `EventOptions` / `stream.record()` / event elapsed time (sketched below)
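A minimal sketch of that launch-and-time path with cuda-core, compiling the `KERNEL` string from the Description. The block count, buffer names, and include paths are illustrative assumptions; grid size must respect co-residency on your GPU.

```python
import cupy as cp
from cuda.core.experimental import (Device, EventOptions, LaunchConfig,
                                    Program, ProgramOptions, launch)

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Compile for the current GPU; in the sample the include dirs come from
# --cuda-include-dir (paths below are typical desktop assumptions).
cuda_include_dirs = ["/usr/local/cuda/include/cccl", "/usr/local/cuda/include"]
arch = "".join(str(x) for x in dev.compute_capability)
prog = Program(KERNEL, code_type="c++",
               options=ProgramOptions(std="c++17", arch=f"sm_{arch}",
                                      include_path=cuda_include_dirs))
ker = prog.compile("cubin").get_kernel("reduce_cg")

n, threads, blocks = 1 << 25, 1024, 20       # blocks must all be co-resident
d_in = cp.arange(n, dtype=cp.float32)        # demo data
d_partial = cp.empty(blocks, dtype=cp.float32)
d_out = cp.empty(1, dtype=cp.float32)
dev.sync()                                   # CuPy allocated on its own stream

config = LaunchConfig(grid=blocks, block=threads,
                      shmem_size=threads * 4,      # one float per thread
                      cooperative_launch=True)     # enables grid.sync()

start = stream.record(options=EventOptions(enable_timing=True))
launch(stream, config, ker,
       d_in.data.ptr, d_partial.data.ptr, d_out.data.ptr, cp.uint64(n))
end = stream.record(options=EventOptions(enable_timing=True))
end.sync()
print("GPU ms:", end - start)                # event elapsed time
```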
## Key libraries

| Library | Role |
|---|---|
| `cuda-core` | `Device`, stream, events, `Program` / `ProgramOptions`, cooperative `launch()` |
| `cupy` | `cp.empty`, `cp.asarray`, `cp.asnumpy`, `Stream.from_external` |
| `numpy` | Host data, reference sum, `default_rng` |
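How the two libraries meet on one stream, sketched with `cupy.cuda.ExternalStream` wrapping the cuda-core stream handle (the sample's `Stream.from_external` plays the same role); `dev` is a cuda-core `Device` and `host_data` an assumed NumPy array.

```python
import cupy as cp

stream = dev.create_stream()
try:
    cp_stream = cp.cuda.ExternalStream(int(stream.handle))
    with cp_stream:                     # CuPy work now targets the shared stream
        d_in = cp.asarray(host_data)    # H->D copy ordered with the kernel
    # ... cooperative launch on `stream` goes here ...
finally:
    stream.close()                      # runs even if the cooperative launch fails
```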
## Requirements

- NVIDIA GPU, Pascal or newer; CUDA Toolkit 13+; Python 3.10+
- NVRTC must see `cooperative_groups.h` and the CCCL headers (`cuda/std/*`)

```bash
pip install -r requirements.txt
```

Pick a CuPy wheel that matches your CUDA major version (e.g. `cupy-cuda13x` in `requirements.txt`).
## How to run

`--cuda-include-dir` is required (a colon-separated list). Typical desktop layout:

```bash
python reductionMultiBlockCG.py \
    --cuda-include-dir /usr/local/cuda/include/cccl:/usr/local/cuda/include
```

Jetson / split include trees: pass every directory NVRTC needs in one `--cuda-include-dir` argument, e.g. `/usr/local/cuda/include/cccl:/usr/local/cuda/targets/sbsa-linux/include` (adjust paths to your image). If headers are scattered, you can instead merge them into one tree with symlinks and point `--cuda-include-dir` at that folder.

Defaults: 2²⁵ elements, threads = device max (capped at 1024), auto `--maxblocks`, 100 iterations. Other flags: `--n`, `--threads`, `--maxblocks`, `--iterations`. See `python reductionMultiBlockCG.py --help`.
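For illustration, a hypothetical way the colon-separated flag can be split and handed to NVRTC; the sample's actual parsing may differ.

```python
import argparse
from cuda.core.experimental import ProgramOptions

parser = argparse.ArgumentParser()
parser.add_argument("--cuda-include-dir", required=True,
                    help="colon-separated include directories for NVRTC")
args = parser.parse_args()

include_dirs = args.cuda_include_dir.split(":")   # one flag, many paths
opts = ProgramOptions(include_path=include_dirs)  # fed to Program(...) at compile time
```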
## Output

```text
======================================================================
Single-Pass Multi-Block Reduction with Cooperative Groups
======================================================================
Demonstrates: Multi-stage reduction in a single kernel using grid.sync()

Device Information:
  Name: NVIDIA Thor
  Compute Capability: sm_11.0

Reduction Configuration:
  Number of elements: 33,554,432
  Data size: 128.00 MB

Compiling CUDA kernel...
Kernel compiled successfully

Launch Configuration:
  Threads per block: 1024
  Number of blocks: 20
  Total threads: 20,480
  Shared memory per block: 4096 bytes
  Launch mode: Cooperative (grid-wide sync enabled)

> Generating random input data...
> Computing reference result on CPU...
  CPU time: 0.008903 seconds
> Warming up GPU...
  Warm-up successful
> Running benchmark (100 iterations)...
> Performance Results:
  Average GPU time: 0.977166 ms
  Throughput: 137.35 GB/s
  Speedup vs CPU: 9.11x
> Validating results...
  Test PASSED

======================================================================
Summary
======================================================================
Single-kernel two-stage reduction:
  Stage 1: 20 blocks → 20 partial sums
           grid.sync() ← All blocks synchronize (KEY innovation)
  Stage 2: Block 0 → 1 final result
  Total:   1 kernel launch, 137.35 GB/s

Comparison:
  • Traditional: 2 kernel launches or kernel + CPU
  • This sample: 1 kernel with grid.sync() between stages
  • Benefit: Eliminates ~5-20μs launch overhead per stage

======================================================================
Single-Pass Multi-Block Reduction completed successfully!
======================================================================
```
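As a cross-check, the reported throughput and speedup follow directly from the numbers above (pure arithmetic, no GPU needed):

```python
n_bytes = 33_554_432 * 4         # float32 input, the "128.00 MB" above
gpu_s = 0.977166e-3              # average GPU time from the run above
print(n_bytes / gpu_s / 1e9)     # ~137.3 GB/s, matching "Throughput"
print(0.008903 / gpu_s)          # ~9.11x, matching "Speedup vs CPU"
```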
## Troubleshooting (short)

- Cooperative launch not supported / fails: needs sm_60+; reduce `--maxblocks` or `--threads` so all blocks can be resident (a probe sketch follows this list).
- Compile errors / missing headers: extend `--cuda-include-dir` with the path that contains the CCCL / cooperative-groups headers (see the Jetson note above).
- Low throughput: usually a block-count vs. occupancy issue; try the defaults first, then tune `--threads` / `--maxblocks`.
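A hypothetical pre-flight probe along these lines; the attribute names follow cuda-core's `DeviceProperties`, so verify them against your installed version.

```python
from cuda.core.experimental import Device

dev = Device()
dev.set_current()

cc = dev.compute_capability
assert (cc.major, cc.minor) >= (6, 0), "grid.sync() needs sm_60 or newer"
assert dev.properties.cooperative_launch, "driver/GPU lacks cooperative launch"

# Max resident blocks scales with SM count; keep the grid size conservative.
print("SMs:", dev.properties.multiprocessor_count)
```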
## Related samples

`blockArraySum` (atomics + grid-stride) → `reduction` (two-stage shared memory) → this sample (single kernel + `grid.sync()`).
## Further reading
## Files

`reductionMultiBlockCG.py` · `requirements.txt` · `README.md`