# Sample: Single-Pass Multi-Block Reduction with Cooperative Groups (Python)

## Description

Single-kernel, two-stage reduction using **Cooperative Groups** and `grid.sync()`, so all blocks synchronize inside one launch; no second kernel or CPU stage is needed for the reduction tree.

**Stack:** `cuda-core` (device, compile, cooperative `launch()`, stream, **CUDA events** for GPU timing). **CuPy** for H↔D copies on the same stream (`Stream.from_external` on the `cuda-core` stream, `ndarray.data.ptr` passed to `launch()`). **`try`/`finally`** closes the stream if the cooperative launch fails. Requires **compute capability 6.0 or newer** (Pascal+). A minimal end-to-end sketch of this pattern appears at the end of this README.

## What you will learn

- `cooperative_groups::grid_group` and `grid.sync()` across the grid
- Cooperative `LaunchConfig(..., cooperative_launch=True)` and sizing blocks for residency
- Timing the GPU path with `EventOptions` / `stream.record()` / event elapsed time

## Key libraries

| Library | Role |
|---------|------|
| `cuda-core` | Device, stream, events, `Program` / `ProgramOptions`, cooperative `launch()` |
| `cupy` | `cp.empty`, `cp.asarray`, `cp.asnumpy`, `Stream.from_external` |
| `numpy` | Host data, reference sum, `default_rng` |

## Requirements

- NVIDIA GPU, **Pascal or newer**; **CUDA Toolkit 13+**; **Python 3.10+**
- NVRTC must see **`cooperative_groups.h`** and **CCCL** headers (`cuda/std/*`)

```bash
pip install -r requirements.txt
```

Pick a CuPy wheel that matches your CUDA major version (e.g. `cupy-cuda13x` in `requirements.txt`).

## How to run

**`--cuda-include-dir` is required** (colon-separated list). Typical desktop layout:

```bash
python reductionMultiBlockCG.py \
  --cuda-include-dir /usr/local/cuda/include/cccl:/usr/local/cuda/include
```

**Jetson / split include trees:** pass every directory NVRTC needs in one `--cuda-include-dir` argument, e.g. `/usr/local/cuda/include/cccl:/usr/local/cuda/targets/sbsa-linux/include` (adjust paths to your image). If headers are scattered, you can instead merge them into one tree with symlinks and point `--cuda-include-dir` at that folder.

Defaults: **2²⁵** elements, threads = device max (capped at 1024), auto `--maxblocks`, **100** iterations. Other flags: `--n`, `--threads`, `--maxblocks`, `--iterations`. See **`python reductionMultiBlockCG.py --help`**.

## Output

```
======================================================================
Single-Pass Multi-Block Reduction with Cooperative Groups
======================================================================
Demonstrates: Multi-stage reduction in a single kernel using grid.sync()

Device Information:
  Name: NVIDIA Thor
  Compute Capability: sm_11.0

Reduction Configuration:
  Number of elements: 33,554,432
  Data size: 128.00 MB

Compiling CUDA kernel...
Kernel compiled successfully

Launch Configuration:
  Threads per block: 1024
  Number of blocks: 20
  Total threads: 20,480
  Shared memory per block: 4096 bytes
  Launch mode: Cooperative (grid-wide sync enabled)

> Generating random input data...
> Computing reference result on CPU...
  CPU time: 0.008903 seconds
> Warming up GPU...
  Warm-up successful
> Running benchmark (100 iterations)...
> Performance Results:
  Average GPU time: 0.977166 ms
  Throughput: 137.35 GB/s
  Speedup vs CPU: 9.11x
> Validating results...
  Test PASSED

======================================================================
Summary
======================================================================
Single-kernel two-stage reduction:
  Stage 1: 20 blocks → 20 partial sums
  grid.sync() ← All blocks synchronize (KEY innovation)
  Stage 2: Block 0 → 1 final result
  Total: 1 kernel launch, 137.35 GB/s

Comparison:
  • Traditional: 2 kernel launches or kernel + CPU
  • This sample: 1 kernel with grid.sync() between stages
  • Benefit: Eliminates ~5-20μs launch overhead per stage

======================================================================
Single-Pass Multi-Block Reduction completed successfully!
======================================================================
```

## Troubleshooting (short)

- **Cooperative launch not supported / fails:** needs sm_60 or newer; reduce `--maxblocks` or `--threads` so all blocks can be resident at once.
- **Compile errors / missing headers:** extend `--cuda-include-dir` with the path that contains the CCCL / cooperative-groups headers (see the Jetson note above).
- **Low throughput:** usually a block-count vs. occupancy issue; try the defaults first, then tune `--threads` / `--maxblocks`.

## Related samples

**blockArraySum** (atomics + grid-stride) → **reduction** (two-stage shared memory) → **this sample** (single kernel + `grid.sync()`).

## Further reading

- [CUDA Cooperative Groups](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups)
- [Reduction whitepaper (PDF)](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf)

## Files

`reductionMultiBlockCG.py` · `requirements.txt` · `README.md`
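
## How it works (minimal sketch)

The sketch below illustrates the pattern the sample is built around: a CUDA C++ kernel that uses `cooperative_groups::grid_group` and `grid.sync()` between the two reduction stages, compiled at runtime with `cuda-core`'s `Program`, launched with `LaunchConfig(..., cooperative_launch=True)`, with CuPy arrays on the shared stream and CUDA events for timing. It is an illustration under assumptions, not the sample source: the kernel name `reduce_cg`, the simplified stage-2 loop, the use of `cp.cuda.ExternalStream` for stream sharing, the hard-coded include paths, and the small grid/block sizes are all placeholders; `reductionMultiBlockCG.py` adds argument parsing, residency-aware block sizing, warm-up, iteration averaging, and `try`/`finally` cleanup.

```python
# Minimal sketch (assumed names, paths, and sizes), not the actual sample source.
import cupy as cp
import numpy as np
from cuda.core.experimental import (
    Device, EventOptions, LaunchConfig, Program, ProgramOptions, launch,
)

KERNEL = r"""
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two reduction stages in one kernel: every block writes a partial sum,
// grid.sync() makes all partials visible, then block 0 produces the total.
extern "C" __global__ void reduce_cg(const float* in, float* partial,
                                     float* out, unsigned long long n) {
    cg::thread_block block = cg::this_thread_block();
    cg::grid_group grid = cg::this_grid();
    extern __shared__ float sdata[];

    // Stage 1: grid-stride accumulation, then a shared-memory tree per block.
    float acc = 0.0f;
    for (unsigned long long i = grid.thread_rank(); i < n; i += grid.num_threads())
        acc += in[i];
    sdata[block.thread_rank()] = acc;
    block.sync();
    for (unsigned s = block.num_threads() / 2; s > 0; s >>= 1) {
        if (block.thread_rank() < s)
            sdata[block.thread_rank()] += sdata[block.thread_rank() + s];
        block.sync();
    }
    if (block.thread_rank() == 0) partial[blockIdx.x] = sdata[0];

    grid.sync();  // the grid-wide barrier between the two stages

    // Stage 2: block 0 folds the per-block partials into the final result.
    if (blockIdx.x == 0 && block.thread_rank() == 0) {
        float total = 0.0f;
        for (unsigned b = 0; b < gridDim.x; ++b) total += partial[b];
        out[0] = total;
    }
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Compile for the current GPU; pass the same directories you would give
# --cuda-include-dir (the paths below assume a desktop layout).
arch = "".join(f"{i}" for i in dev.compute_capability)
prog = Program(KERNEL, code_type="c++", options=ProgramOptions(
    std="c++17", arch=f"sm_{arch}",
    include_path=["/usr/local/cuda/include/cccl", "/usr/local/cuda/include"]))
kernel = prog.compile("cubin").get_kernel("reduce_cg")

n, threads, blocks = 1 << 20, 256, 20  # deliberately small; the sample auto-sizes
cp_stream = cp.cuda.ExternalStream(int(stream.handle))  # share one stream with CuPy
with cp_stream:
    x = cp.asarray(np.random.default_rng(0).random(n, dtype=np.float32))
    partial = cp.empty(blocks, dtype=cp.float32)
    result = cp.empty(1, dtype=cp.float32)

config = LaunchConfig(grid=blocks, block=threads,
                      shmem_size=threads * 4, cooperative_launch=True)

# Time the cooperative launch with CUDA events recorded on the same stream.
start = stream.record(options=EventOptions(enable_timing=True))
launch(stream, config, kernel,
       x.data.ptr, partial.data.ptr, result.data.ptr, cp.uint64(n))
end = stream.record(options=EventOptions(enable_timing=True))
stream.sync()

print(f"GPU sum: {float(result[0]):.2f}  CPU sum: {float(cp.asnumpy(x).sum()):.2f}")
print(f"Elapsed: {end - start} ms")  # event subtraction yields elapsed milliseconds
```

A cooperative launch only succeeds if every block in the grid can be resident on the GPU at the same time (that is what makes the grid-wide barrier safe), which is why the sample auto-sizes `--maxblocks` and why the sketch uses a deliberately small grid.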