Dheemanth aeab82ff30
CUDA 13.2 samples update (#432)
- Added Python samples for CUDA Python 1.0 release
- Renamed top-level `Samples` directory to `cpp` to accommodate Python samples.
2026-05-13 17:13:18 -05:00

Sample: Single-Pass Multi-Block Reduction with Cooperative Groups (Python)

Description

Single-kernel, two-stage reduction using Cooperative Groups and grid.sync(), so all blocks synchronize inside one launch: no second kernel launch or CPU pass is needed to finish the reduction tree.
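
The two-stage flow can be emulated on the host to make it concrete. A minimal NumPy sketch, assuming illustrative block counts (not the sample's actual launch parameters):

```python
import numpy as np

def single_pass_reduction(data, num_blocks=20):
    """Host-side emulation of the single-kernel two-stage reduction.

    Stage 1: each 'block' strides over the input and produces one partial
    sum. On the GPU, grid.sync() then makes every partial visible.
    Stage 2: 'block 0' alone reduces the partials to the final result.
    """
    partials = np.zeros(num_blocks, dtype=np.float64)
    # Stage 1: grid-stride partial sums (one entry per block)
    for block in range(num_blocks):
        partials[block] = data[block::num_blocks].sum()
    # --- grid.sync() happens here on the device ---
    # Stage 2: block 0 reduces the partial sums
    return partials.sum()

rng = np.random.default_rng(0)
x = rng.random(1 << 20)
assert np.isclose(single_pass_reduction(x), x.sum())
```

On the device the barrier between the stages is the whole point: without grid.sync(), block 0 could read partial sums that other blocks have not written yet.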

Stack: cuda-core provides the device, NVRTC compilation, the cooperative launch(), the stream, and CUDA events for GPU timing. CuPy handles H↔D copies on the same stream (Stream.from_external(cuda.core_stream) to wrap it; ndarray.data.ptr to pass buffers to launch()). A try/finally block closes the stream even if the cooperative launch fails. Requires compute capability ≥ 6.0 (Pascal or newer).

What you will learn

  • cooperative_groups::grid_group and grid.sync() across the grid
  • Cooperative LaunchConfig(..., cooperative_launch=True) and sizing blocks for residency
  • Timing the GPU path with EventOptions / stream.record() / event elapsed time
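
The device-side pattern behind the first two bullets looks roughly like this. A hedged CUDA C++ sketch held as an NVRTC source string; the kernel name, signature, and exact reduction shape are illustrative, not necessarily what reductionMultiBlockCG.py compiles:

```python
# Illustrative CUDA C++ source for NVRTC; the real sample's kernel may differ.
KERNEL_SOURCE = r"""
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

extern "C" __global__
void reduce_single_pass(const float *in, float *partials, float *out, size_t n)
{
    cg::grid_group   grid  = cg::this_grid();
    cg::thread_block block = cg::this_thread_block();
    extern __shared__ float smem[];

    // Stage 1: grid-stride load, then a block-level tree reduction
    // (assumes a power-of-two block size).
    float sum = 0.0f;
    for (size_t i = grid.thread_rank(); i < n; i += grid.size())
        sum += in[i];
    smem[block.thread_rank()] = sum;
    block.sync();
    for (unsigned s = block.size() / 2; s > 0; s >>= 1) {
        if (block.thread_rank() < s)
            smem[block.thread_rank()] += smem[block.thread_rank() + s];
        block.sync();
    }
    if (block.thread_rank() == 0)
        partials[blockIdx.x] = smem[0];

    // Every block must finish stage 1 before any block reads partials.
    grid.sync();

    // Stage 2: block 0 reduces the per-block partials to one value.
    if (blockIdx.x == 0) {
        float total = 0.0f;
        for (unsigned i = block.thread_rank(); i < gridDim.x; i += block.size())
            total += partials[i];
        smem[block.thread_rank()] = total;
        block.sync();
        for (unsigned s = block.size() / 2; s > 0; s >>= 1) {
            if (block.thread_rank() < s)
                smem[block.thread_rank()] += smem[block.thread_rank() + s];
            block.sync();
        }
        if (block.thread_rank() == 0)
            out[0] = smem[0];
    }
}
"""
```

grid.sync() is only legal when the kernel is started as a cooperative launch (LaunchConfig(..., cooperative_launch=True)) and every block of the grid is simultaneously resident, which is why the block count must be sized for residency.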

Key libraries

  Library     Role
  cuda-core   Device, stream, events, Program / ProgramOptions, cooperative launch()
  cupy        cp.empty, cp.asarray, cp.asnumpy, Stream.from_external
  numpy       Host data, reference sum, default_rng

Requirements

  • NVIDIA GPU, Pascal or newer; CUDA Toolkit 13+; Python 3.10+
  • NVRTC must see cooperative_groups.h and CCCL headers (cuda/std/*)
pip install -r requirements.txt

Pick a CuPy wheel that matches your CUDA major version (e.g. cupy-cuda13x in requirements.txt).

How to run

--cuda-include-dir is required (colon-separated list). Typical desktop layout:

python reductionMultiBlockCG.py \
  --cuda-include-dir /usr/local/cuda/include/cccl:/usr/local/cuda/include

Jetson / split include trees: pass every directory NVRTC needs in one --cuda-include-dir argument, e.g. /usr/local/cuda/include/cccl:/usr/local/cuda/targets/sbsa-linux/include (adjust paths to your image). If headers are scattered, you can instead merge them into one tree with symlinks and point --cuda-include-dir at that folder.
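
The symlink-merge workaround can be scripted. A minimal sketch, assuming you choose the destination directory yourself (the function name and layout are hypothetical, not part of the sample):

```python
import os

def merge_include_trees(dest, *sources):
    """Mirror several include trees into one directory of symlinks, so a
    single --cuda-include-dir path can cover all of them. Files from
    earlier sources win; later duplicates are skipped."""
    for src in sources:
        for root, _dirs, files in os.walk(src):
            rel = os.path.relpath(root, src)
            target_dir = os.path.join(dest, rel)
            os.makedirs(target_dir, exist_ok=True)
            for name in files:
                link = os.path.join(target_dir, name)
                if not os.path.lexists(link):
                    os.symlink(os.path.join(root, name), link)
```

After merging, point --cuda-include-dir at the destination directory instead of a colon-separated list.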

Defaults: 2²⁵ elements, threads = device max (capped at 1024), auto --maxblocks, 100 iterations. Other flags: --n, --threads, --maxblocks, --iterations. See python reductionMultiBlockCG.py --help.
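
The flag set above might be wired up roughly as follows; defaults are taken from the text, but the actual script's parser may differ in help strings and internals:

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(
        description="Single-pass multi-block reduction with Cooperative Groups")
    p.add_argument("--cuda-include-dir", required=True,
                   help="colon-separated list of NVRTC include paths")
    p.add_argument("--n", type=int, default=1 << 25,
                   help="number of input elements (default: 2^25)")
    p.add_argument("--threads", type=int, default=None,
                   help="threads per block (default: device max, capped at 1024)")
    p.add_argument("--maxblocks", type=int, default=None,
                   help="cap on block count (default: auto, sized for residency)")
    p.add_argument("--iterations", type=int, default=100,
                   help="benchmark iterations (default: 100)")
    return p

args = build_parser().parse_args(
    ["--cuda-include-dir", "/usr/local/cuda/include"])
```

Note that --cuda-include-dir is the only required flag; everything else falls back to the defaults listed above.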

Output

======================================================================
Single-Pass Multi-Block Reduction with Cooperative Groups
======================================================================

Demonstrates: Multi-stage reduction in a single kernel using grid.sync()

Device Information:
  Name: NVIDIA Thor
  Compute Capability: sm_11.0

Reduction Configuration:
  Number of elements: 33,554,432
  Data size: 128.00 MB

Compiling CUDA kernel...
  Kernel compiled successfully

Launch Configuration:
  Threads per block: 1024
  Number of blocks: 20
  Total threads: 20,480
  Shared memory per block: 4096 bytes
  Launch mode: Cooperative (grid-wide sync enabled)

> Generating random input data...
> Computing reference result on CPU...
  CPU time: 0.008903 seconds

> Warming up GPU...
  Warm-up successful

> Running benchmark (100 iterations)...

> Performance Results:
  Average GPU time: 0.977166 ms
  Throughput: 137.35 GB/s
  Speedup vs CPU: 9.11x

> Validating results...
Test PASSED

======================================================================
Summary
======================================================================

Single-kernel two-stage reduction:
  Stage 1: 20 blocks → 20 partial sums
  grid.sync() ← All blocks synchronize (KEY innovation)
  Stage 2: Block 0 → 1 final result
  Total: 1 kernel launch, 137.35 GB/s

Comparison:
  • Traditional: 2 kernel launches or kernel + CPU
  • This sample: 1 kernel with grid.sync() between stages
  • Benefit: Eliminates ~5-20μs launch overhead per stage

======================================================================
Single-Pass Multi-Block Reduction completed successfully!
======================================================================

Troubleshooting (short)

  • Cooperative launch not supported / fails: need sm_60+; reduce --maxblocks or --threads so all blocks can be resident.
  • Compile errors missing headers: extend --cuda-include-dir with the path that contains CCCL / cooperative groups (see Jetson note above).
  • Low throughput: often block count vs occupancy; try defaults first, then tune --threads / --maxblocks.
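
The residency constraint in the first bullet can be reasoned about numerically. A hedged helper; the real sample queries per-device occupancy through the driver API, which this pure-Python sketch only stands in for:

```python
def cooperative_grid_size(num_sms, blocks_per_sm, maxblocks=None):
    """Cooperative launch requires every block to be resident at once,
    so the grid is capped at num_sms * blocks_per_sm blocks.
    blocks_per_sm would come from an occupancy query on a real device."""
    limit = num_sms * blocks_per_sm
    return min(limit, maxblocks) if maxblocks else limit
```

For example, 20 SMs with one resident block each yields the 20-block grid shown in the sample output; raising --maxblocks beyond the residency limit has no effect, while lowering it can help when a launch fails.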

Further reading

Suggested progression: blockArraySum (atomics + grid-stride) → reduction (two-stage shared memory) → this sample (single kernel + grid.sync()).

Files

reductionMultiBlockCG.py · requirements.txt · README.md