Sample: multiGPUGradientAverage (Python)
Description
This sample demonstrates gradient averaging across multiple GPUs using MPI and cuda.core. Each GPU computes local gradients, which are synchronized (averaged) across all GPUs using MPI Allreduce with host-staging (GPU → CPU → MPI → CPU → GPU) for maximum compatibility.
What you will learn
- How to initialize MPI for multi-process GPU communication
- How to map MPI ranks to CUDA devices consistently (see the sketch after this list)
- How to integrate cuda.core streams with CuPy using ExternalStream
- How to compile and launch custom CUDA kernels using cuda.core
- How to use cuda.core Event for GPU timing measurements
- How to use MPI Allreduce with host-staging for universal compatibility
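The first two bullets boil down to initializing MPI and giving each rank its own device. A minimal sketch, assuming mpi4py, CuPy, and the same cuda.core namespace used later in this README (some releases expose it as cuda.core.experimental); the round-robin mapping and variable names are illustrative, not the sample's exact code:
from mpi4py import MPI
import cupy as cp
from cuda.core import Device  # assumption: same namespace as the EventOptions import below

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Map ranks to GPUs round-robin so every process gets a consistent device.
num_devices = cp.cuda.runtime.getDeviceCount()
device = Device(rank % num_devices)
device.set_current()
print(f"[Rank {rank}/{world_size}] using CUDA device {rank % num_devices}")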
Prerequisites
- Python 3.10+
- CUDA Toolkit 13.0+
- Standard MPI implementation (OpenMPI, MPICH, or Intel MPI)
- Multiple NVIDIA GPUs (tested with 2+ GPUs)
Installation
pip install mpi4py cupy-cuda13x cuda-python cuda-core
Running
IMPORTANT: This sample MUST be run under mpirun with at least 2 processes.
# Single node (2 GPUs)
mpirun -np 2 python multiGPUGradientAverage.py --size 10000
# Single node (4 GPUs)
mpirun -np 4 python multiGPUGradientAverage.py --size 10000
# With specific GPUs
CUDA_VISIBLE_DEVICES=0,2 mpirun -np 2 python multiGPUGradientAverage.py
Sample Output
[Rank 0] World size = 4
======================================================================
Multi-GPU Gradient Average Demo
======================================================================
Number of MPI ranks (GPUs): 4
Gradient vector length per GPU: 10000
Device: NVIDIA GeForce RTX 4090
Computation: gradients computed on GPU via cuda.core.
Communication: gradients averaged via MPI_Allreduce on host (CPU) buffers.
======================================================================
Sample averaged gradient values (rank 0):
avg_grad[0] = 1.500000
avg_grad[5000] = 6.500000
avg_grad[9999] = 11.499000
Expected values:
expected[0] = 1.500000
expected[5000] = 6.500000
expected[9999] = 11.499000
Verifying gradient averaging correctness...
[PASS] Gradient averaging is correct.
[PASS] Gradient averaging is correct on all ranks.
Performance:
Kernel time (GPU only): 0.123 ms
MPI communication time (host-staging, end-to-end): 0.456 ms
Total time: 0.579 ms
======================================================================
Demo complete.
======================================================================
Key Technical Details
The sample uses cuda.core streams and makes CuPy use them via ExternalStream:
stream = device.create_stream()
cp.cuda.ExternalStream(int(stream.handle)).use()
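Once the external stream is active, ordinary CuPy operations are enqueued on the cuda.core stream and can be synchronized through it. A small illustrative continuation (here a and b are assumed CuPy arrays already on the device):
c = a * 2.0 + b   # CuPy kernel launched on the cuda.core stream
stream.sync()     # synchronize using the cuda.core Stream object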
GPU timing is measured using cuda.core Event:
from cuda.core import EventOptions
timing_options = EventOptions(enable_timing=True)
start_event = stream.record(options=timing_options)
# ... GPU work ...
end_event = stream.record(options=timing_options)
end_event.sync()
kernel_time = end_event - start_event # Returns milliseconds
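The learning objectives also mention compiling and launching a custom CUDA kernel with cuda.core. A sketch of that pattern, reusing device and stream from the snippets above; the toy scale_grad kernel and the non-experimental cuda.core namespace are assumptions, not the sample's actual code:
import cupy as cp
from cuda.core import LaunchConfig, Program, ProgramOptions, launch  # assumption: non-experimental namespace

kernel_src = r'''
extern "C" __global__ void scale_grad(float* grad, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] *= factor;
}
'''

# Compile for the current device's architecture and retrieve the kernel.
arch = "".join(str(x) for x in device.compute_capability)
prog = Program(kernel_src, code_type="c++", options=ProgramOptions(arch=f"sm_{arch}"))
kernel = prog.compile("cubin").get_kernel("scale_grad")

n = 10_000
grad = cp.arange(n, dtype=cp.float32)
block = 256
grid = (n + block - 1) // block
launch(stream, LaunchConfig(grid=grid, block=block), kernel,
       grad.data.ptr, cp.float32(0.5), cp.uint64(n))
stream.sync()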
The host-staging pattern transfers data GPU → CPU → MPI → CPU → GPU for universal MPI compatibility without requiring CUDA-aware MPI.
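A condensed sketch of that pattern with mpi4py and CuPy (local_grad is an assumed CuPy array holding this rank's gradients):
import numpy as np
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
send_host = cp.asnumpy(local_grad)                  # GPU -> CPU
recv_host = np.empty_like(send_host)
comm.Allreduce(send_host, recv_host, op=MPI.SUM)    # CPU -> MPI -> CPU
avg_grad = cp.asarray(recv_host / comm.Get_size())  # CPU -> GPU, then average
Because the MPI call only ever sees host (NumPy) buffers, this works with any standard MPI build, at the cost of two extra copies per exchange compared to CUDA-aware MPI.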
Troubleshooting
Error: "This sample requires at least 2 MPI processes!"
Solution: Run with mpirun -np 2 python multiGPUGradientAverage.py
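The check behind this message is just a world-size guard at startup; a sketch with mpi4py (assumed, not the sample's exact code):
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_size() < 2:
    if comm.Get_rank() == 0:
        print("This sample requires at least 2 MPI processes!", file=sys.stderr)
    comm.Abort(1)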