cuda-samples/python/2_CoreConcepts/streamingCopyComputeOverlap
Dheemanth aeab82ff30
CUDA 13.2 samples update (#432)
- Added Python samples for CUDA Python 1.0 release
- Renamed top-level `Samples` directory to `cpp` to accommodate Python samples.
2026-05-13 17:13:18 -05:00

Sample: Streaming Copy + Compute Overlap (Python)

Description

Demonstrates how to overlap memory transfers (H2D/D2H) with kernel computation using CUDA streams. This technique hides transfer latency and improves GPU utilization.

What You'll Learn

  • Using PinnedMemoryResource for async-capable host memory
  • Using DeviceMemoryResource for GPU memory allocation
  • Creating multiple streams with Device.create_stream()
  • Async memory copies with Buffer.copy_to()
  • Overlapping H2D transfers, kernel execution, and D2H transfers

Key Concept

Without overlap (sequential):

[====H2D====][====Compute====][====D2H====]

With overlap (multiple streams):

Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
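A back-of-the-envelope timing model makes the diagram concrete. This is a simplified sketch that assumes the copy engines and compute engine overlap perfectly; once the pipeline is full, throughput is limited by the slowest of the three stages:

```python
# Pipeline timing model: n chunks, each needing h (H2D), c (compute),
# and d (D2H) time units on separate hardware engines.

def sequential_time(n, h, c, d):
    # No overlap: every chunk runs all three stages back to back.
    return n * (h + c + d)

def overlapped_time(n, h, c, d):
    # Pipelined: after the first chunk fills the pipeline, a new chunk
    # completes every max(h, c, d) time units.
    return h + c + d + (n - 1) * max(h, c, d)

# 8 chunks with equal stage times:
print(sequential_time(8, 1, 1, 1))  # 24
print(overlapped_time(8, 1, 1, 1))  # 10  (2.4x faster)
```

The model also explains the note at the end of the sample output: if compute dwarfs the transfers (or vice versa), max(h, c, d) approaches h + c + d and the speedup shrinks.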

Key APIs (all from cuda.core)

  • Device - Device management
  • Device.create_stream() - Create CUDA streams
  • Stream.sync() - Synchronize stream
  • PinnedMemoryResource - Pinned host memory (required for async transfers)
  • DeviceMemoryResource - GPU device memory
  • Buffer.copy_to(dst, stream=stream) - Async memory copy
  • Program, LaunchConfig, launch - Kernel compilation and execution
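The APIs above combine into a per-chunk loop along these lines. This is a sketch only, not the sample's exact code: it needs a CUDA-capable GPU, constructor signatures may differ across cuda.core versions, and `kernel`, `config`, `n_chunks`, and `chunk_bytes` are placeholders:

```python
from cuda.core.experimental import Device, LaunchConfig, launch
from cuda.core.experimental import DeviceMemoryResource, PinnedMemoryResource

dev = Device()
dev.set_current()

pinned = PinnedMemoryResource()              # async-capable host memory
dmem = DeviceMemoryResource(dev.device_id)   # GPU memory

streams = [dev.create_stream() for _ in range(n_chunks)]
for i, stream in enumerate(streams):
    h_in = pinned.allocate(chunk_bytes)
    d_buf = dmem.allocate(chunk_bytes, stream=stream)
    h_out = pinned.allocate(chunk_bytes)

    h_in.copy_to(d_buf, stream=stream)       # H2D, async on this stream
    launch(stream, config, kernel, d_buf)    # compute on the same stream
    d_buf.copy_to(h_out, stream=stream)      # D2H, async on this stream

for stream in streams:
    stream.sync()                            # drain all pipelines
```

Because each chunk's H2D, kernel, and D2H are issued on its own stream, chunk i+1's upload can proceed while chunk i is still computing, which is exactly the staggered timeline shown above.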

From numpy:

  • np.from_dlpack() - Zero-copy view of pinned memory buffers

Requirements

  • CUDA Toolkit 13.0+
  • Python 3.10+
  • cuda-python, cuda-core, numpy

Installation

pip install -r requirements.txt

How to Run

python streamingCopyComputeOverlap.py

Expected Output

============================================================
Streaming Copy + Compute Overlap
Using pure cuda.core APIs
============================================================

Device: NVIDIA GeForce RTX XXXX
Kernel compiled ✓

Problem size: 16,000,000 elements (61 MB)

--- Sequential (no overlap) ---
Timeline: [H2D][Compute][D2H]
Time: X.XX ms (±X.XX)

--- Streamed (with overlap) ---
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
...
2 streams: X.XX ms (±X.XX) - speedup: X.XXx
4 streams: X.XX ms (±X.XX) - speedup: X.XXx
8 streams: X.XX ms (±X.XX) - speedup: X.XXx

============================================================
Key: Pinned memory + multiple streams = overlap transfers with compute

Note: Speedup depends on hardware characteristics. This technique
benefits most when transfer time is significant relative to compute.
============================================================

See Also