mirror of
https://github.com/NVIDIA/cuda-samples.git
synced 2026-05-14 14:06:53 +08:00
Prefix Sum (Scan)
Demonstrates parallel prefix sum (scan) algorithms using cuda.compute with cuda.core stream management.
Overview
- Inclusive scan: output[i] = [init_value] + input[0] + input[1] + ... + input[i]
- Exclusive scan: output[i] = init_value + input[0] + input[1] + ... + input[i-1]
- Uses cuda.compute APIs for optimized CUB-based implementations
- Uses cuda.core APIs for device and stream management
- Demonstrates CuPy integration via ExternalStream
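The two scan definitions above can be checked on the CPU with a few lines of standard-library Python. This is a minimal reference sketch, not the sample's GPU code: the names `inclusive_scan_reference` and `exclusive_scan_reference` are hypothetical helpers introduced here for illustration, and treating a missing `init_value` as "no initial term" for the inclusive case is an assumption based on the bracketed `[init_value]` in the formula.

```python
from itertools import accumulate
import operator


def inclusive_scan_reference(values, init_value=None):
    """CPU reference: output[i] = [init_value] + values[0] + ... + values[i]."""
    running = list(accumulate(values, operator.add))
    if init_value is None:
        # No initial term: plain running sum.
        return running
    return [init_value + x for x in running]


def exclusive_scan_reference(values, init_value=0):
    """CPU reference: output[i] = init_value + values[0] + ... + values[i-1]."""
    # accumulate(..., initial=...) yields len(values) + 1 items; drop the last
    # one so the output has the same length as the input.
    return list(accumulate(values, operator.add, initial=init_value))[:-1]


data = [3, 1, 4, 1, 5]
print(inclusive_scan_reference(data))  # [3, 4, 8, 9, 14]
print(exclusive_scan_reference(data))  # [0, 3, 4, 8, 9]
```

A reference like this is mainly useful for validating GPU output on small inputs; it runs sequentially and says nothing about how the CUB-based implementation parallelizes the work.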
Requirements
Hardware
- NVIDIA GPU with CUDA support
Software
- CUDA Toolkit 13.0+
- Python 3.10+
- cuda-python (13.0.0+)
- cuda-core (>=0.6.0)
- cuda-cccl (1.0.0+)
- cupy-cuda13x (13.0.0+)
- numpy (>=2.3.2)
Usage
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Run sample
python prefixSum.py
Key Concepts
| Scan Type | Formula | First Element |
|---|---|---|
| Inclusive | output[i] = [init_value] + Σ input[0..i] | [init_value] + input[0] |
| Exclusive | output[i] = init_value + Σ input[0..i-1] | init_value (typically 0, the identity for sum) |
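The "First Element" column follows from a simple shift relationship between the two scans, which a short worked example makes concrete. The sketch below uses only `itertools.accumulate` on the CPU and assumes `init_value = 0` with no initial term on the inclusive side; the sample data is made up for illustration.

```python
from itertools import accumulate

data = [2, 7, 1, 8]
init_value = 0  # identity for addition

# Inclusive scan without an initial term: running sum up to and including i.
inclusive = list(accumulate(data))  # [2, 9, 10, 18]

# Exclusive scan: init_value first, then the inclusive result shifted right by one.
exclusive = [init_value] + [init_value + x for x in inclusive[:-1]]  # [0, 2, 9, 10]

# First-element column of the table.
assert inclusive[0] == data[0]
assert exclusive[0] == init_value

# Each inclusive entry is the exclusive entry plus the current input.
assert all(inc == exc + x for inc, exc, x in zip(inclusive, exclusive, data))
```

Because of this relationship, either scan can be derived from the other with one extra element-wise step, so the choice usually comes down to which form the consuming algorithm needs.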
Stream Management
This sample shares a single CUDA stream across cuda.core, CuPy, and cuda.compute:
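# Import paths are assumed from the package names listed under Requirements
import cupy as cp
import numpy as np
from cuda.compute import OpKind, inclusive_scan
from cuda.core import Device
# Select a GPU and make it current (device 0 is an assumption)
device = Device(0)
device.set_current()
# Placeholder host input for illustration
data = np.arange(1, 9, dtype=np.int32)
# Create stream with cuda.core
stream = device.create_stream()
# Wrap for CuPy compatibility (requires int handle)
cp_stream = cp.cuda.ExternalStream(int(stream.handle))
# Use with CuPy operations: work inside the block is queued on the shared stream
with cp_stream:
    d_input = cp.asarray(data)
    d_output = cp.empty_like(d_input)
# Pass to cuda.compute on the same stream
inclusive_scan(
    d_in=d_input,
    d_out=d_output,
    op=OpKind.PLUS,
    init_value=None,
    num_items=len(d_input),
    stream=stream,
)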
Applications
- Stream compaction
- Radix sort
- Histogram computation
- Polynomial evaluation
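As an illustration of the first application, stream compaction can be built from an exclusive scan of 0/1 flags: the scan result gives each kept item its output slot. The following is a sequential CPU sketch of that idea under stated assumptions, not code from this sample; `compact` is a hypothetical helper name, and on a GPU the final loop would be one independent write per thread.

```python
from itertools import accumulate


def compact(values, predicate):
    """Keep the items that satisfy predicate, preserving order, via an exclusive scan."""
    flags = [1 if predicate(v) else 0 for v in values]
    # Running sum with a leading 0: offsets[i] is the exclusive-scan value for
    # values[i], and the final entry is the total number of kept items.
    offsets = list(accumulate(flags, initial=0))
    out = [None] * offsets[-1]
    # zip stops at len(values), so the trailing total is ignored here.
    for value, flag, slot in zip(values, flags, offsets):
        if flag:
            out[slot] = value  # each kept item writes to its own unique slot
    return out


print(compact([5, -2, 7, 0, -9, 3], lambda v: v > 0))  # [5, 7, 3]
```

The same flag-and-offset pattern underlies the split step of a radix sort, where the flags come from one bit of each key instead of a predicate.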