Sample: Fast Array Sum using Shared Memory (Python)

Description

Two-stage parallel reduction: each GPU block sums its chunk of the input in shared memory (a tree reduction, loading two elements per thread) and writes one partial sum per block; the host then combines the partial sums into the final result.
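
To make the Stage-1 shape concrete, here is a minimal sketch of the float kernel variant (illustrative only; the source shipped in reduction.py may differ in details):

kernel_source = r"""
extern "C" __global__
void blockReduceKernel_float(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];                  // dynamic shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;  // two elements per thread

    float v = 0.0f;
    if (i < n)              v  = in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Sequential-addressing tree reduction: the active stride halves each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];               // one partial sum per block
}
"""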

Stack: cuda-core provides the Device, stream, events, and Program / launch() machinery. CuPy allocates device memory and performs the host/device copies; launch() receives device pointers as ndarray.data.ptr (a Python int). To keep the copies on the same CUDA stream as the kernel, the cuda.core stream is wrapped with cp.cuda.Stream.from_external(stream) (cuda.core's Stream implements the CUDA stream protocol) and entered with a with cp_stream: block.
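
A condensed, hypothetical host-side flow showing how these pieces fit together (variable names and compile options are illustrative, not lifted from reduction.py):

import cupy as cp
import numpy as np
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

dev = Device()
dev.set_current()
stream = dev.create_stream()

# Compile the kernel source (see the sketch above) for this GPU's architecture
arch = "".join(str(x) for x in dev.compute_capability)
prog = Program(kernel_source, code_type="c++", options=ProgramOptions(arch=f"sm_{arch}"))
kernel = prog.compile("cubin").get_kernel("blockReduceKernel_float")

n, threads = 1 << 24, 256
blocks = (n + threads * 2 - 1) // (threads * 2)   # each block covers 2*threads elements

# Keep CuPy's allocations/copies on the same stream as the kernel
cp_stream = cp.cuda.Stream.from_external(stream)
with cp_stream:
    d_in = cp.asarray(np.random.rand(n).astype(np.float32))
    d_partial = cp.empty(blocks, dtype=cp.float32)

config = LaunchConfig(grid=blocks, block=threads, shmem_size=threads * 4)  # 4 B per float
launch(stream, config, kernel, d_in.data.ptr, d_partial.data.ptr, np.int32(n))
stream.sync()

total = cp.asnumpy(d_partial).sum()               # Stage 2: combine on the host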

What you will learn

  • Shared-memory block reduction and sequential-addressing tree reduction
  • LaunchConfig with dynamic shared memory and launch() with pointer arguments
  • Aligning CuPy transfers with a cuda.core stream (Stream.from_external)
  • GPU timing with EventOptions / device.create_event() (see the timing sketch after this list)
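
A minimal timing sketch using cuda.core events (assuming, as in recent cuda.core releases, that subtracting two timing-enabled events yields the elapsed time in milliseconds):

from cuda.core.experimental import Device, EventOptions

dev = Device()
dev.set_current()
stream = dev.create_stream()

opts = EventOptions(enable_timing=True)
start = dev.create_event(opts)
stop = dev.create_event(opts)

stream.record(start)
# ... enqueue the reduction kernel launch here ...
stream.record(stop)
stop.sync()                      # wait until the stop event has occurred
elapsed_ms = stop - start        # event subtraction -> elapsed milliseconds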

Key libraries

Library     Role
---------   ------------------------------------------------------
cuda-core   Device, stream, events, compile, launch
cupy        cp.empty, cp.asarray, cp.asnumpy, Stream.from_external
numpy       Host data and CPU reference sum

Key APIs (quick reference)

  • cuda.core: Device, create_stream, Program / ProgramOptions, LaunchConfig, launch, EventOptions, create_event
  • CuPy: cp.empty, cp.asarray, cp.cuda.Stream.from_external(stream), with cp_stream:, cp.asnumpy

Requirements

  • NVIDIA GPU with a CUDA-capable driver; CUDA Toolkit 13+ (for toolchain alignment with cuda-core)
  • Python 3.10+
  • Python dependencies: pip install -r requirements.txt
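
requirements.txt is expected to list roughly the following (a sketch, not the shipped file; pick the CuPy wheel matching your installed CUDA toolkit):

cuda-core
cupy-cuda12x   # or the CuPy wheel for your CUDA version
numpy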

How to run

python reduction.py

Defaults: 2²⁴ elements, 256 threads/block, float, 100 benchmark iterations.
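
With these defaults, Stage 1 launches 2²⁴ / (256 × 2) = 32,768 blocks and writes 32,768 partial sums, which is exactly the geometry reported in the example output below.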

Change data type (selects blockReduceKernel_int / _float / _double):

python reduction.py --type float    # default; 32-bit float
python reduction.py --type double   # 64-bit float
python reduction.py --type int      # 32-bit integer (exact equality check)

Combine with other flags as needed, e.g. python reduction.py --type int --n 1048576.

Other main flags: --n, --threads, --iterations. Full list: python reduction.py --help.
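
The --type flag presumably maps to a dtype and kernel name along these lines (a hypothetical sketch; args, module, and threads stand in for the script's argparse result, compiled module, and block size):

import numpy as np

KERNELS = {
    "float":  (np.float32, "blockReduceKernel_float"),
    "double": (np.float64, "blockReduceKernel_double"),
    "int":    (np.int32,   "blockReduceKernel_int"),
}
np_dtype, kernel_name = KERNELS[args.type]
kernel = module.get_kernel(kernel_name)
# Dynamic shared memory must scale with the element size
shmem_size = threads * np.dtype(np_dtype).itemsize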

Output

Example run (python reduction.py, defaults) on Tesla T10, compute capability 7.5:

======================================================================
Fast Array Sum using Shared Memory - Two-Stage Reduction
======================================================================

Demonstrates: Efficient parallel reduction using shared memory

Device Information:
  Name: Tesla T10
  Compute Capability: sm_7.5

Configuration:
  Array size: 16,777,216 elements
  Data type: float
  Memory: 64.00 MB
  Threads per block: 256

Two-Stage Reduction Strategy:
  Stage 1: GPU block reduction
    - Number of blocks: 32768
    - Elements per block: 512
    - Output: 32768 partial sums
  Stage 2: CPU final reduction
    - Combine 32768 partial sums → 1 final result

Compiling CUDA kernel...
  Kernel 'blockReduceKernel_float' compiled successfully

> Generating random input data...
> Computing reference result on CPU...
  CPU time: 2.428208 seconds

> Warming up GPU...
  Warm-up completed

> Benchmarking Stage 1 (GPU block reduction)...
  Running 100 iterations...

> Running Stage 2 (CPU final reduction)...

======================================================================
Performance Results
======================================================================

Stage 1 (GPU block reduction):
  Average time: 0.338404 ms
  Throughput: 198.31 GB/s

Stage 2 (CPU final reduction):
  Time: 0.078073 ms
  (32768 partial sums)

Total time: 0.416477 ms
Speedup vs CPU: 5830.35x

> Validating results...
  GPU result: 2147639808.00000000
  CPU result: 2147639929.62027407
Test PASSED

======================================================================
Summary
======================================================================
Key optimizations:
  - Load 2 elements per thread: 8,388,608 global reads (50% savings)
  - Shared memory for reduction: ~10-20x faster than global memory
  - Parallel block outputs: 32768 independent writes
Result: 198.31 GB/s throughput
======================================================================
Two-Stage Reduction completed successfully!
======================================================================
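
For reference, the reported throughput corresponds to the input reads alone: 16,777,216 elements × 4 bytes = 67,108,864 bytes, and 67,108,864 B / 0.338404 ms ≈ 198.31 GB/s.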

Files

reduction.py · requirements.txt · README.md