Dheemanth aeab82ff30
CUDA 13.2 samples update (#432)
- Added Python samples for CUDA Python 1.0 release
- Renamed top-level `Samples` directory to `cpp` to accommodate Python samples.
2026-05-13 17:13:18 -05:00

Sample: multiGPUGradientAverage (Python)

Description

This sample demonstrates gradient averaging across multiple GPUs using MPI and cuda.core. Each GPU computes local gradients, which are then averaged across all GPUs via MPI Allreduce using host staging (GPU → CPU → MPI → CPU → GPU) for maximum compatibility.

What you will learn

  • How to initialize MPI for multi-process GPU communication
  • How to map MPI ranks to CUDA devices consistently
  • How to integrate cuda.core streams with CuPy using ExternalStream
  • How to compile and launch custom CUDA kernels using cuda.core
  • How to use cuda.core Event for GPU timing measurements
  • How to use MPI Allreduce with host-staging for universal compatibility
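The rank-to-device mapping in the second bullet can be sketched without MPI or a GPU present. This is a hypothetical illustration: `rank_to_device` is not a function from the sample, and in the real code the rank comes from `MPI.COMM_WORLD.Get_rank()` and the device count from the CUDA runtime.

```python
# Hypothetical sketch: map each MPI rank to a CUDA device round-robin,
# so ranks launched on one node spread consistently across its local GPUs.
def rank_to_device(rank: int, num_devices: int) -> int:
    if num_devices < 1:
        raise ValueError("need at least one CUDA device")
    return rank % num_devices

# Four ranks on a node with two GPUs alternate between devices 0 and 1.
mapping = [rank_to_device(r, 2) for r in range(4)]
print(mapping)  # [0, 1, 0, 1]
```

With more ranks than GPUs, ranks oversubscribe devices in round-robin order; the sample expects one rank per GPU.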

Prerequisites

  • Python 3.10+
  • CUDA Toolkit 13.0+
  • Standard MPI implementation (OpenMPI, MPICH, or Intel MPI)
  • Multiple NVIDIA GPUs (tested with 2+ GPUs)

Installation

pip install mpi4py cupy-cuda13x cuda-python cuda-core

Running

IMPORTANT: This sample MUST be launched with mpirun (or mpiexec) using at least 2 processes.

# Single node (2 GPUs)
mpirun -np 2 python multiGPUGradientAverage.py --size 10000

# Single node (4 GPUs)
mpirun -np 4 python multiGPUGradientAverage.py --size 10000

# With specific GPUs
CUDA_VISIBLE_DEVICES=0,2 mpirun -np 2 python multiGPUGradientAverage.py

Sample Output

[Rank 0] World size = 4

======================================================================
Multi-GPU Gradient Average Demo
======================================================================
Number of MPI ranks (GPUs): 4
Gradient vector length per GPU: 10000
Device: NVIDIA GeForce RTX 4090
Computation: gradients computed on GPU via cuda.core.
Communication: gradients averaged via MPI_Allreduce on host (CPU) buffers.
======================================================================

Sample averaged gradient values (rank 0):
  avg_grad[0] = 1.500000
  avg_grad[5000] = 6.500000
  avg_grad[9999] = 11.499000

Expected values:
  expected[0] = 1.500000
  expected[5000] = 6.500000
  expected[9999] = 11.499000

Verifying gradient averaging correctness...
[PASS] Gradient averaging is correct.
[PASS] Gradient averaging is correct on all ranks.

Performance:
  Kernel time (GPU only): 0.123 ms
  MPI communication time (host-staging, end-to-end): 0.456 ms
  Total time: 0.579 ms

======================================================================
Demo complete.
======================================================================

Key Technical Details

The sample uses cuda.core streams and makes CuPy execute on them via ExternalStream:

import cupy as cp

stream = device.create_stream()  # device: the cuda.core Device selected for this rank
cp.cuda.ExternalStream(int(stream.handle)).use()  # route subsequent CuPy work onto this stream

GPU timing is measured using cuda.core Event:

from cuda.core import EventOptions
timing_options = EventOptions(enable_timing=True)
start_event = stream.record(options=timing_options)
# ... GPU work ...
end_event = stream.record(options=timing_options)
end_event.sync()
kernel_time = end_event - start_event  # Returns milliseconds

The host-staging pattern transfers data GPU → CPU → MPI → CPU → GPU for universal MPI compatibility without requiring CUDA-aware MPI.
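The host-staging round trip can be sketched with the MPI exchange simulated on the CPU. This is a minimal illustration, not the sample's code: in the real sample each rank stages its GPU gradient to a host buffer (e.g. `cp.asnumpy`), calls `comm.Allreduce` with `MPI.SUM`, divides by the world size, and copies the result back to the GPU; here the allreduce is replaced by a plain NumPy sum over per-rank host buffers.

```python
import numpy as np

def average_gradients(local_grads):
    """Average one host-staged gradient array per rank.

    Stands in for: Allreduce(SUM) over host buffers, then divide
    by the number of ranks on every rank.
    """
    world_size = len(local_grads)
    total = np.sum(local_grads, axis=0)  # simulated MPI_Allreduce(SUM)
    return total / world_size            # each rank divides by world size

# Four simulated ranks holding constant gradients 1.0, 2.0, 3.0, 4.0.
grads = [np.full(4, float(r + 1)) for r in range(4)]
avg = average_gradients(grads)
print(avg)  # [2.5 2.5 2.5 2.5]
```

Because the reduction runs on host buffers, this pattern works with any standard MPI build; CUDA-aware MPI would instead pass device pointers directly to Allreduce and skip the staging copies.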

Troubleshooting

Error: "This sample requires at least 2 MPI processes!"

Solution: Run with mpirun -np 2 python multiGPUGradientAverage.py