Dheemanth aeab82ff30
CUDA 13.2 samples update (#432)
- Added Python samples for CUDA Python 1.0 release
- Renamed top-level `Samples` directory to `cpp` to accommodate Python samples.
2026-05-13 17:13:18 -05:00

Sample: Parallel Histogram with Atomics (Python)

Description

Computes a histogram on the GPU, using atomic operations to handle concurrent updates from many threads. The sample uses the cuda.core API to compile and launch the kernels, and compares two approaches:

  1. Global Atomics - All threads atomically update a single histogram in global memory
  2. Privatized Histograms - Each block accumulates into its own shared-memory histogram, then merges it into the global histogram
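
The difference between the two approaches can be modeled on the CPU. The following is a minimal pure-Python sketch, not the sample's GPU code: the function names, the 256-bin count, and the block size are assumptions made for illustration.

```python
from collections import Counter
import random

NUM_BINS = 256  # assumed bin count for this illustration


def histogram_global(data, num_bins=NUM_BINS):
    """Model of approach 1: every element updates the one shared histogram."""
    hist = [0] * num_bins
    for value in data:
        hist[value] += 1  # on the GPU this would be an atomicAdd to global memory
    return hist


def histogram_privatized(data, num_bins=NUM_BINS, block_size=1024):
    """Model of approach 2: each block fills a private copy, then merges it."""
    hist = [0] * num_bins
    for start in range(0, len(data), block_size):
        private = [0] * num_bins  # stands in for one block's shared-memory histogram
        for value in data[start:start + block_size]:
            private[value] += 1
        for b, count in enumerate(private):  # merge: one update per bin per block
            hist[b] += count
    return hist


random.seed(0)
data = [random.randrange(NUM_BINS) for _ in range(10_000)]
reference = Counter(data)

# Both models must agree with each other and with an independent count.
assert histogram_global(data) == histogram_privatized(data)
assert all(histogram_global(data)[b] == reference[b] for b in range(NUM_BINS))
```

In this model the privatized version touches the final histogram once per bin per block rather than once per element, which is where the reduced contention on the GPU comes from.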

What You'll Learn

  • Compiling CUDA C kernels with cuda.core.Program
  • Configuring kernel launches with cuda.core.LaunchConfig
  • Launching kernels with cuda.core.launch()
  • Using atomic operations (atomicAdd) for thread-safe updates
  • Optimizing with shared memory privatization
  • GPU timing with cuda.core Events

Key Concepts

Atomic Operations

When multiple threads increment the same histogram bin with an ordinary read-modify-write, the updates race and some increments can be lost. Atomic operations make each update indivisible, so every increment is counted:

atomicAdd(&histogram[data[i]], 1);  // Thread-safe increment
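
To see why a plain increment is unsafe, the sketch below replays one interleaving of two threads deterministically in pure Python. It is a model rather than GPU code, and the helper names are hypothetical.

```python
def racy_increment_pair(hist, bin_index):
    """Two threads do a read-modify-write with their steps interleaved."""
    read_a = hist[bin_index]      # thread A reads the old count
    read_b = hist[bin_index]      # thread B reads the same old count
    hist[bin_index] = read_a + 1  # thread A writes old + 1
    hist[bin_index] = read_b + 1  # thread B overwrites with old + 1: one update is lost


def atomic_increment_pair(hist, bin_index):
    """Each read-modify-write completes before the next begins, as atomicAdd guarantees."""
    hist[bin_index] += 1  # thread A
    hist[bin_index] += 1  # thread B


racy = [0, 0, 0, 0]
racy_increment_pair(racy, 2)

safe = [0, 0, 0, 0]
atomic_increment_pair(safe, 2)

print(racy[2], safe[2])  # prints "1 2"
```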

Global vs Privatized Atomics

Approach    Pros                  Cons
Global      Simple                High contention on popular bins
Privatized  Significantly faster  Extra shared memory, synchronization
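
For concreteness, here is one way the two kernels could be written, held in a Python string of the kind that is compiled at run time. This is a sketch and not the source shipped in parallelHistogram.py: the kernel names, the NUM_BINS value of 256, the int element types, and the grid-stride loops are all assumptions. It also assumes every input value lies in the range 0 to NUM_BINS - 1.

```python
# Hypothetical kernel source; names, types, and NUM_BINS are illustrative assumptions.
KERNEL_SOURCE = r"""
#define NUM_BINS 256

// Approach 1: every thread updates the single histogram in global memory.
extern "C" __global__
void histogram_global(const int* data, int* histogram, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        atomicAdd(&histogram[data[i]], 1);
    }
}

// Approach 2: each block accumulates into shared memory, then merges once.
extern "C" __global__
void histogram_privatized(const int* data, int* histogram, int n)
{
    __shared__ int local_hist[NUM_BINS];

    // Zero the block-private histogram cooperatively.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        local_hist[b] = 0;
    }
    __syncthreads();

    // Contention is now limited to the threads of one block.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        atomicAdd(&local_hist[data[i]], 1);
    }
    __syncthreads();

    // Merge: at most one global atomic per bin per block.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        if (local_hist[b] != 0) {
            atomicAdd(&histogram[b], local_hist[b]);
        }
    }
}
"""
```

Both __syncthreads() calls matter in the privatized kernel: the first ensures the shared histogram is fully zeroed before any thread counts into it, and the second ensures all counts have landed before the merge reads them.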

Key APIs

From cuda.core:

  • Device - Device management and context
  • Program - Compile CUDA C source code
  • ProgramOptions - Set architecture, optimization flags
  • LaunchConfig - Configure grid and block dimensions
  • launch() - Launch compiled kernel
  • Stream - Async stream management
  • EventOptions - Configure events for GPU timing
  • stream.record() - Record events for timing
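
LaunchConfig needs concrete grid and block sizes. A common way to derive them is ceiling division of the element count by the block size, optionally capped when the kernel uses a grid-stride loop. The helper below is a pure-Python sketch; its name, the 256-thread default, and the optional cap are assumptions, not values taken from the sample.

```python
def launch_geometry(num_elements, threads_per_block=256, max_blocks=None):
    """Return (grid, block) sizes covering num_elements with one thread each.

    If max_blocks is given, the grid is capped. That is only valid when the
    kernel iterates with a grid-stride loop, so each thread handles several
    elements.
    """
    if num_elements <= 0 or threads_per_block <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: round up so the last partial block is included.
    blocks = (num_elements + threads_per_block - 1) // threads_per_block
    if max_blocks is not None:
        blocks = min(blocks, max_blocks)
    return blocks, threads_per_block


grid, block = launch_geometry(10_000_000)
print(grid, block)  # prints "39063 256"
```

The resulting pair would then be supplied as the grid and block dimensions when building a LaunchConfig.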

From cupy:

  • cp.random.randint() - Generate random data directly on GPU
  • cp.zeros() - Allocate zeroed GPU arrays

CUDA Atomic Functions (in kernel):

  • atomicAdd() - Thread-safe addition

Requirements

Hardware:

  • NVIDIA GPU with CUDA support

Software:

  • CUDA Toolkit 13.0 or newer
  • Python 3.10 or newer
  • See requirements.txt for Python packages

Installation

pip install -r requirements.txt

How to Run

python parallelHistogram.py

Expected Output

============================================================
Parallel Histogram with Atomics (cuda.core)
============================================================

Device: <Your GPU>
Compute Capability: ComputeCapability(major=X, minor=Y)

Compiling CUDA kernels with cuda.core.Program...
  Compiled for architecture: sm_XY

Generating 10,000,000 random values on GPU...

Verifying correctness...
  Global atomics:     PASSED
  Privatized atomics: PASSED

Benchmarking (100 iterations)...
  Global atomics:     X.XXX ms
  Privatized atomics: X.XXX ms
  Speedup:            XXx

Test PASSED

Files

  • parallelHistogram.py - Main sample using cuda.core
  • README.md - This file
  • requirements.txt - Dependencies

See Also