# Sample: Fast Array Sum using Shared Memory (Python)

## Description

Two-stage parallel reduction: each GPU block sums its chunk in **shared memory** (sequential-addressing tree reduction, two elements loaded per thread) and writes one partial sum per block; the host then combines the partial sums into the final result.

**Stack:** `cuda-core` provides the `Device`, stream, events, and `Program` / `launch()`. **CuPy** allocates device memory and performs the copies; `launch()` receives device pointers as `ndarray.data.ptr` (a Python `int`). Copies run on the same CUDA stream as the kernel via `cp.cuda.Stream.from_external(stream)` (the cuda-core `Stream` implements the CUDA stream protocol) and `with cp_stream:`.

## What you will learn

- Shared-memory block reduction and sequential-addressing tree reduction
- `LaunchConfig` with dynamic shared memory and `launch()` with pointer arguments
- Aligning CuPy transfers with a `cuda.core` stream (`Stream.from_external`)
- GPU timing with `EventOptions` / `device.create_event()`

## Key libraries

| Library     | Role                                                            |
|-------------|-----------------------------------------------------------------|
| `cuda-core` | Device, stream, events, compile, launch                         |
| `cupy`      | `cp.empty`, `cp.asarray`, `cp.asnumpy`, `Stream.from_external`  |
| `numpy`     | Host data and CPU reference sum                                 |

## Key APIs (quick reference)

- **cuda.core:** `Device`, `create_stream`, `Program` / `ProgramOptions`, `LaunchConfig`, `launch`, `EventOptions`, `create_event`
- **CuPy:** `cp.empty`, `cp.asarray`, `cp.cuda.Stream.from_external(stream)`, `with cp_stream:`, `cp.asnumpy`

## Requirements

- NVIDIA GPU and a CUDA-capable driver; **CUDA Toolkit 13+** (for toolchain alignment with `cuda-core`)
- **Python 3.10+**

```bash
pip install -r requirements.txt
```

## How to run

```bash
python reduction.py
```

Defaults: 2²⁴ elements, 256 threads/block, `float`, 100 benchmark iterations.

**Change data type** (selects `blockReduceKernel_int` / `_float` / `_double`):

```bash
python reduction.py --type float   # default; 32-bit float
python reduction.py --type double  # 64-bit float
python reduction.py --type int     # 32-bit integer (exact equality check)
```

Combine with other flags as needed, e.g. `python reduction.py --type int --n 1048576`.

Other main flags: `--n`, `--threads`, `--iterations`. Full list: `python reduction.py --help`.
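## Minimal end-to-end sketch

To make the moving parts concrete, here is a condensed, self-contained sketch of the pattern. It is illustrative, not a copy of `reduction.py`: the kernel name `block_reduce_float` and the variable names are made up here, the cuda-core calls follow that library's documented API, and the CuPy interop uses `Stream.from_external` as described above.

```python
import cupy as cp
import numpy as np
from cuda.core.experimental import (
    Device, EventOptions, LaunchConfig, Program, ProgramOptions, launch,
)

# Stage-1 kernel: each thread pre-sums two elements from global memory,
# then the block finishes with a sequential-addressing tree reduction
# in dynamic shared memory and writes one partial sum.
KERNEL = r"""
extern "C" __global__
void block_reduce_float(const float* __restrict__ in,
                        float* __restrict__ out,
                        size_t n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    size_t i = (size_t)blockIdx.x * (blockDim.x * 2) + tid;

    float v = 0.0f;
    if (i < n)              v  = in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Sequential addressing: halve the active range each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Compile for the current device's architecture.
arch = "".join(str(d) for d in dev.compute_capability)
prog = Program(KERNEL, code_type="c++", options=ProgramOptions(arch=f"sm_{arch}"))
kernel = prog.compile("cubin").get_kernel("block_reduce_float")

n, threads = 1 << 24, 256
blocks = (n + threads * 2 - 1) // (threads * 2)  # two elements per thread

# CuPy allocations and copies on the same cuda.core stream.
cp_stream = cp.cuda.Stream.from_external(stream)
with cp_stream:
    d_in = cp.asarray(np.random.rand(n).astype(np.float32))
    d_out = cp.empty(blocks, dtype=cp.float32)

# Dynamic shared memory: one float per thread.
config = LaunchConfig(grid=blocks, block=threads, shmem_size=threads * 4)

timing = EventOptions(enable_timing=True)
start = dev.create_event(options=timing)
stop = dev.create_event(options=timing)

stream.record(start)
launch(stream, config, kernel, d_in.data.ptr, d_out.data.ptr, cp.uint64(n))
stream.record(stop)
stream.sync()

# Stage 2: the host combines the per-block partial sums.
with cp_stream:
    partial = cp.asnumpy(d_out)
total = partial.sum(dtype=np.float64)

print(f"GPU sum: {total:.2f}, Stage-1 time: {stop - start:.3f} ms")  # event diff in ms
```

Two details worth noting: `shmem_size` must match the `extern __shared__` buffer size (one `float` per thread here), and subtracting two recorded, synchronized timing events yields the elapsed GPU time in milliseconds.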
## Output

Example run (`python reduction.py`, defaults) on **Tesla T10**, compute capability **7.5**:

```
======================================================================
Fast Array Sum using Shared Memory - Two-Stage Reduction
======================================================================
Demonstrates: Efficient parallel reduction using shared memory

Device Information:
  Name: Tesla T10
  Compute Capability: sm_7.5

Configuration:
  Array size: 16,777,216 elements
  Data type: float
  Memory: 64.00 MB
  Threads per block: 256

Two-Stage Reduction Strategy:
  Stage 1: GPU block reduction
    - Number of blocks: 32768
    - Elements per block: 512
    - Output: 32768 partial sums
  Stage 2: CPU final reduction
    - Combine 32768 partial sums → 1 final result

Compiling CUDA kernel...
  Kernel 'blockReduceKernel_float' compiled successfully

> Generating random input data...
> Computing reference result on CPU...
  CPU time: 2.428208 seconds
> Warming up GPU...
  Warm-up completed
> Benchmarking Stage 1 (GPU block reduction)...
  Running 100 iterations...
> Running Stage 2 (CPU final reduction)...

======================================================================
Performance Results
======================================================================
Stage 1 (GPU block reduction):
  Average time: 0.338404 ms
  Throughput: 198.31 GB/s
Stage 2 (CPU final reduction):
  Time: 0.078073 ms (32768 partial sums)
Total time: 0.416477 ms
Speedup vs CPU: 5830.35x

> Validating results...
  GPU result: 2147639808.00000000
  CPU result: 2147639929.62027407
  Test PASSED

======================================================================
Summary
======================================================================
Key optimizations:
  - Load 2 elements per thread: 8,388,608 global reads (50% savings)
  - Shared memory for reduction: ~10-20x faster than global memory
  - Parallel block outputs: 32768 independent writes
Result: 198.31 GB/s throughput

======================================================================
Two-Stage Reduction completed successfully!
======================================================================
```

## Files

`reduction.py` · `requirements.txt` · `README.md`