# Fast Array Sum using Shared Memory (Python)
## Description
Two-stage parallel reduction: each GPU block sums its chunk in shared memory (tree reduction, two elements per thread), writes one partial sum per block; the host combines partial sums for the final result.
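The two-stage strategy can be mimicked on the CPU with plain NumPy. This is a hypothetical sketch of the algorithm's shape (chunk size, per-block partial sums, host combine), not the GPU kernel itself; the function name `two_stage_sum` is illustrative:

```python
import numpy as np

def two_stage_sum(x, threads_per_block=256):
    """CPU mimic of the two-stage reduction: each 'block' sums a chunk of
    2 * threads_per_block elements (two loads per thread), then the host
    combines the per-block partial sums into the final result."""
    elems_per_block = 2 * threads_per_block
    n_blocks = (len(x) + elems_per_block - 1) // elems_per_block
    # Stage 1: one partial sum per block (the GPU does this in shared memory).
    partials = np.array([x[b * elems_per_block:(b + 1) * elems_per_block].sum()
                         for b in range(n_blocks)])
    # Stage 2: the host combines the partial sums.
    return partials.sum()

x = np.random.rand(1 << 20).astype(np.float32)
assert np.isclose(two_stage_sum(x), x.sum(), rtol=1e-3)
```

With the sample's defaults (2²⁴ elements, 256 threads/block) this yields 16,777,216 / 512 = 32,768 partial sums, matching the output below.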
Stack: `cuda-core` for `Device`, stream, events, `Program`, and `launch()`. CuPy allocates device memory and performs the copies; `launch()` takes device pointers as `ndarray.data.ptr` (a Python `int`). Copies run on the same CUDA stream as the kernel via `cp.cuda.Stream.from_external(stream)` (the cuda.core `Stream` implements the CUDA stream protocol) inside a `with cp_stream:` block.
## What you will learn
- Shared-memory block reduction and sequential-addressing tree reduction
- `LaunchConfig` with dynamic shared memory and `launch()` with pointer arguments
- Aligning CuPy transfers with a `cuda.core` stream (`Stream.from_external`)
- GPU timing with `EventOptions` / `device.create_event()`
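Sequential addressing (the first bullet above) halves the stride each step so the active threads read adjacent shared-memory slots, which avoids bank conflicts. A pure-Python sketch of the per-block pattern (hypothetical, mirroring what one block does in shared memory; vectorized slices stand in for the parallel threads):

```python
import numpy as np

def tree_reduce_block(shared):
    """Sequential-addressing tree reduction over one block's shared-memory
    array (length must be a power of two): stride halves each step, and
    'thread' t adds slot t + stride into slot t."""
    shared = shared.copy()
    stride = len(shared) // 2
    while stride > 0:
        # Threads 0..stride-1 are active; each folds the upper half onto the lower.
        shared[:stride] += shared[stride:2 * stride]
        stride //= 2
    return shared[0]  # slot 0 holds the block's partial sum

block = np.arange(256, dtype=np.float64)
assert tree_reduce_block(block) == block.sum()
```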
## Key libraries
| Library | Role |
|---|---|
| `cuda-core` | `Device`, stream, events, compile, launch |
| `cupy` | `cp.empty`, `cp.asarray`, `cp.asnumpy`, `Stream.from_external` |
| `numpy` | Host data and CPU reference sum |
## Key APIs (quick reference)
- cuda.core: `Device`, `create_stream`, `Program`/`ProgramOptions`, `LaunchConfig`, `launch`, `EventOptions`, `create_event`
- CuPy: `cp.empty`, `cp.asarray`, `cp.cuda.Stream.from_external(stream)`, `with cp_stream:`, `cp.asnumpy`
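The call sequence these APIs imply can be outlined as below. This is a hypothetical sketch under the assumption of cuda.core's experimental API and a CUDA-capable GPU, not the sample's actual code; the kernel source is deliberately elided (see `reduction.py`), and the imports live inside the function so the outline can be read without a GPU installed:

```python
import numpy as np

def gpu_sum_sketch(host_data, threads=256):
    """Hypothetical outline of the pipeline: compile with Program, move data
    with CuPy on a cuda.core stream, then launch(). Never called here."""
    import cupy as cp
    from cuda.core.experimental import (Device, LaunchConfig, Program,
                                        ProgramOptions, launch)

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Compile a block-reduction kernel (source elided; see reduction.py).
    kernel_src = r'''extern "C" __global__ void blockReduceKernel_float(/* ... */) { /* ... */ }'''
    prog = Program(kernel_src, code_type="c++", options=ProgramOptions(std="c++17"))
    kernel = prog.compile("cubin").get_kernel("blockReduceKernel_float")

    n = host_data.size
    blocks = (n + 2 * threads - 1) // (2 * threads)  # two elements per thread

    # Run CuPy allocations/copies on the same stream the kernel uses.
    cp_stream = cp.cuda.Stream.from_external(stream)  # per this README's description
    with cp_stream:
        d_in = cp.asarray(host_data)
        d_out = cp.empty(blocks, dtype=cp.float32)

    config = LaunchConfig(grid=blocks, block=threads,
                          shmem_size=threads * np.dtype(np.float32).itemsize)
    launch(stream, config, kernel, d_in.data.ptr, d_out.data.ptr, np.uint32(n))
    stream.sync()

    with cp_stream:
        return cp.asnumpy(d_out).sum()  # Stage 2 on the host
```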
## Requirements
- NVIDIA GPU with a CUDA-capable driver; CUDA Toolkit 13+ (for toolchain alignment with `cuda-core`)
- Python 3.10+
- `pip install -r requirements.txt`
## How to run
```bash
python reduction.py
```
Defaults: 2²⁴ elements, 256 threads/block, float, 100 benchmark iterations.
Change data type (selects `blockReduceKernel_int` / `_float` / `_double`):

```bash
python reduction.py --type float   # default; 32-bit float
python reduction.py --type double  # 64-bit float
python reduction.py --type int     # 32-bit integer (exact equality check)
```
Combine with other flags as needed, e.g. `python reduction.py --type int --n 1048576`.
Other main flags: `--n`, `--threads`, `--iterations`. Full list: `python reduction.py --help`.
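The flags above suggest an argparse setup along these lines. This is a hypothetical reconstruction of the CLI; the flag names and defaults are taken from this README, while the actual parser lives in `reduction.py`:

```python
import argparse

def build_parser():
    # Defaults per this README: 2**24 elements, 256 threads/block, float, 100 iterations.
    p = argparse.ArgumentParser(description="Two-stage parallel array sum")
    p.add_argument("--n", type=int, default=1 << 24, help="number of elements")
    p.add_argument("--threads", type=int, default=256, help="threads per block")
    p.add_argument("--iterations", type=int, default=100, help="benchmark iterations")
    p.add_argument("--type", choices=["float", "double", "int"], default="float",
                   help="selects blockReduceKernel_<type>")
    return p

# Mirrors `python reduction.py --type int --n 1048576` from the text above.
args = build_parser().parse_args(["--type", "int", "--n", "1048576"])
```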
## Output
Example run (`python reduction.py`, defaults) on a Tesla T10, compute capability 7.5:

```text
======================================================================
Fast Array Sum using Shared Memory - Two-Stage Reduction
======================================================================
Demonstrates: Efficient parallel reduction using shared memory

Device Information:
  Name: Tesla T10
  Compute Capability: sm_7.5

Configuration:
  Array size: 16,777,216 elements
  Data type: float
  Memory: 64.00 MB
  Threads per block: 256

Two-Stage Reduction Strategy:
  Stage 1: GPU block reduction
    - Number of blocks: 32768
    - Elements per block: 512
    - Output: 32768 partial sums
  Stage 2: CPU final reduction
    - Combine 32768 partial sums → 1 final result

Compiling CUDA kernel...
  Kernel 'blockReduceKernel_float' compiled successfully
> Generating random input data...
> Computing reference result on CPU...
  CPU time: 2.428208 seconds
> Warming up GPU...
  Warm-up completed
> Benchmarking Stage 1 (GPU block reduction)...
  Running 100 iterations...
> Running Stage 2 (CPU final reduction)...

======================================================================
Performance Results
======================================================================
Stage 1 (GPU block reduction):
  Average time: 0.338404 ms
  Throughput: 198.31 GB/s
Stage 2 (CPU final reduction):
  Time: 0.078073 ms
  (32768 partial sums)
Total time: 0.416477 ms
Speedup vs CPU: 5830.35x

> Validating results...
  GPU result: 2147639808.00000000
  CPU result: 2147639929.62027407
  Test PASSED

======================================================================
Summary
======================================================================
Key optimizations:
  - Load 2 elements per thread: 8,388,608 global reads (50% savings)
  - Shared memory for reduction: ~10-20x faster than global memory
  - Parallel block outputs: 32768 independent writes
Result: 198.31 GB/s throughput
======================================================================
Two-Stage Reduction completed successfully!
======================================================================
```
## Files
`reduction.py` · `requirements.txt` · `README.md`