Sample: Parallel Histogram with Atomics (Python)
Description
Compute histograms on the GPU using atomic operations to handle concurrent updates from multiple threads. This sample demonstrates the modern cuda.core API for kernel compilation and launch, comparing two approaches:
- Global Atomics - All threads atomically update a single global histogram
- Privatized Histograms - Each block uses shared memory, then merges to global
What You'll Learn
- Compiling CUDA C kernels with `cuda.core.Program`
- Configuring kernel launches with `cuda.core.LaunchConfig`
- Launching kernels with `cuda.core.launch()`
- Using atomic operations (`atomicAdd`) for thread-safe updates
- Optimizing with shared memory privatization
- GPU timing with `cuda.core` events
Key Concepts
Atomic Operations
When multiple threads update the same histogram bin with plain reads and writes, their read-modify-write sequences can interleave and increments are lost (a race condition). Atomic operations make each update indivisible:
    atomicAdd(&histogram[data[i]], 1); // Thread-safe increment
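To make the race concrete, here is a small CPU-only Python sketch (not part of the sample) that replays a non-atomic increment as separate load and store steps. The function names and the fixed worst-case interleaving are illustrative assumptions; real GPU scheduling is nondeterministic, so the number of lost updates varies from run to run.

```python
def racy_increments(num_threads):
    """Worst case without atomics: every thread loads the bin before any stores."""
    bin_value = 0
    loaded = [bin_value for _ in range(num_threads)]  # step 1: all threads read 0
    for old in loaded:                                # step 2: each writes old + 1
        bin_value = old + 1
    return bin_value


def atomic_increments(num_threads):
    """With atomics: each load-add-store finishes before the next one starts."""
    bin_value = 0
    for _ in range(num_threads):
        bin_value = bin_value + 1  # modeled as one indivisible read-modify-write
    return bin_value


if __name__ == "__main__":
    print("racy:  ", racy_increments(4))    # prints 1: three increments are lost
    print("atomic:", atomic_increments(4))  # prints 4
```

In the racy version every thread computes `0 + 1`, so the bin ends at 1 no matter how many threads ran. `atomicAdd` rules out exactly this interleaving.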
Global vs Privatized Atomics
| Approach | Pros | Cons |
|---|---|---|
| Global | Simple | High contention on popular bins |
| Privatized | Typically much faster: most atomics go to per-block shared memory | Extra shared memory, synchronization |
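The trade-off in the table can be illustrated with a CPU-only model that counts how many updates reach the global histogram under each approach. This is a sketch under stated assumptions, not the sample's code: `NUM_BINS`, `elems_per_block`, and both function names are hypothetical, and each "block" is modeled as a contiguous slice of the input.

```python
import random

NUM_BINS = 256  # assumed bin count (one bin per byte value)


def global_histogram(data):
    """Model of the global approach: one global atomic per input element."""
    hist = [0] * NUM_BINS
    global_atomics = 0
    for value in data:
        hist[value] += 1          # stands in for atomicAdd on global memory
        global_atomics += 1
    return hist, global_atomics


def privatized_histogram(data, elems_per_block):
    """Model of privatization: per-block histogram, then one merge per block."""
    hist = [0] * NUM_BINS
    global_atomics = 0
    for start in range(0, len(data), elems_per_block):
        private = [0] * NUM_BINS  # stands in for the block's shared memory
        for value in data[start:start + elems_per_block]:
            private[value] += 1   # shared-memory atomic, not counted as global
        for b, count in enumerate(private):
            if count:             # merge only non-empty bins
                hist[b] += count  # stands in for atomicAdd on global memory
                global_atomics += 1
    return hist, global_atomics


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.randrange(NUM_BINS) for _ in range(100_000)]
    g_hist, g_ops = global_histogram(data)
    p_hist, p_ops = privatized_histogram(data, elems_per_block=10_000)
    print("same result:", g_hist == p_hist)
    print("global atomics:", g_ops, "vs", p_ops)
```

Both functions produce the same histogram, but the privatized model issues at most `NUM_BINS` global updates per block instead of one per element. The model counts operations only; it says nothing about actual GPU timings.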
Key APIs
From cuda.core:
- `Device` - Device management and context
- `Program` - Compile CUDA C source code
- `ProgramOptions` - Set architecture, optimization flags
- `LaunchConfig` - Configure grid and block dimensions
- `launch()` - Launch compiled kernel
- `Stream` - Async stream management
- `EventOptions` - Configure events for GPU timing
- `stream.record()` - Record events for timing
From cupy:
- `cp.random.randint()` - Generate random data directly on GPU
- `cp.zeros()` - Allocate zeroed GPU arrays
CUDA Atomic Functions (in kernel):
- `atomicAdd()` - Thread-safe addition
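As a rough illustration of what these APIs operate on, the sketch below holds hypothetical CUDA C source for the two kernels in a Python string (the form in which source would be handed to a compiler) and computes a grid size with ceiling division. Everything here is an assumption for illustration: the kernel names, the 256-bin layout, `int` inputs already in the range [0, 256), and the `grid_size` helper. The sample's real kernels may differ, and the `cuda.core` compile and launch calls are deliberately left out because their exact signatures are not shown in this README.

```python
# Hypothetical CUDA C source; assumes 256 bins, values in [0, 256), n well below 2**32.
KERNEL_SOURCE = r"""
extern "C" __global__
void histogram_global(const int *data, unsigned int *hist, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&hist[data[i]], 1u);          // every update hits global memory
}

extern "C" __global__
void histogram_privatized(const int *data, unsigned int *hist, unsigned int n)
{
    __shared__ unsigned int local_hist[256];
    for (unsigned int b = threadIdx.x; b < 256; b += blockDim.x)
        local_hist[b] = 0;                      // zero the block's private copy
    __syncthreads();

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&local_hist[data[i]], 1u);    // contention stays in shared memory
    __syncthreads();

    for (unsigned int b = threadIdx.x; b < 256; b += blockDim.x)
        if (local_hist[b] != 0)
            atomicAdd(&hist[b], local_hist[b]); // merge into the global result
}
"""


def grid_size(num_elements, threads_per_block, max_blocks=None):
    """Blocks needed to cover the input; a cap relies on the grid-stride loops."""
    blocks = (num_elements + threads_per_block - 1) // threads_per_block
    return blocks if max_blocks is None else min(blocks, max_blocks)


if __name__ == "__main__":
    print(grid_size(10_000_000, 256))                   # one thread per element
    print(grid_size(10_000_000, 256, max_blocks=1024))  # capped grid
```

Because both kernels use grid-stride loops, a capped grid still visits every element. For the privatized kernel, fewer blocks also means fewer merge passes into the global histogram.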
Requirements
Hardware:
- NVIDIA GPU with CUDA support
Software:
- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- See `requirements.txt` for Python packages
Installation
    pip install -r requirements.txt
How to Run
    python parallelHistogram.py
Expected Output
    ============================================================
    Parallel Histogram with Atomics (cuda.core)
    ============================================================
    Device: <Your GPU>
    Compute Capability: ComputeCapability(major=X, minor=Y)
    Compiling CUDA kernels with cuda.core.Program...
    Compiled for architecture: sm_XY
    Generating 10,000,000 random values on GPU...
    Verifying correctness...
      Global atomics: PASSED
      Privatized atomics: PASSED
    Benchmarking (100 iterations)...
      Global atomics: X.XXX ms
      Privatized atomics: X.XXX ms
      Speedup: XXx
    Test PASSED
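The verification and speedup lines can be reproduced on the CPU along these lines. This is one plausible approach, not necessarily what `parallelHistogram.py` does: the helper names are hypothetical, the reference is a plain `collections.Counter` count, and the speedup is assumed to be the global-atomics time divided by the privatized time.

```python
from collections import Counter


def reference_histogram(data, num_bins):
    """CPU reference: count occurrences of each bin index."""
    counts = Counter(data)
    return [counts[b] for b in range(num_bins)]


def verify(label, result, data, num_bins):
    """Print PASSED or FAILED for one kernel's result and return the verdict."""
    ok = list(result) == reference_histogram(data, num_bins)
    print(f"  {label}: {'PASSED' if ok else 'FAILED'}")
    return ok


def speedup(global_ms, privatized_ms):
    """Assumed definition: baseline time divided by optimized time."""
    return global_ms / privatized_ms


if __name__ == "__main__":
    data = [0, 1, 1, 3]                      # tiny made-up input
    verify("Global atomics", [1, 2, 0, 1], data, num_bins=4)
    print(f"  Speedup: {speedup(4.0, 0.5):.1f}x")  # made-up timings, not measurements
```

With GPU results, the arrays would first be copied back to the host before the comparison.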
Files
- `parallelHistogram.py` - Main sample using cuda.core
- `README.md` - This file
- `requirements.txt` - Dependencies