cuda-samples/python/2_CoreConcepts/streamingCopyComputeOverlap
Dheemanth b7c5481c55
Release v13.3 of the CUDA samples with CUDA 13.3 Toolkit (#435)
This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.
2026-05-27 16:50:59 -05:00
..

Sample: Streaming Copy + Compute Overlap (Python)

Description

Demonstrate how to overlap memory transfers (H2D/D2H) with kernel computation using CUDA streams. This technique hides transfer latency and improves GPU utilization.

What You'll Learn

  • Using PinnedMemoryResource for async-capable host memory
  • Using DeviceMemoryResource for GPU memory allocation
  • Creating multiple streams with Device.create_stream()
  • Async memory copies with Buffer.copy_to()
  • Overlapping H2D transfers, kernel execution, and D2H transfers

Key Concept

Without overlap (sequential):

[====H2D====][====Compute====][====D2H====]

With overlap (multiple streams):

Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]

Key APIs (all from cuda.core)

  • Device - Device management
  • Device.create_stream() - Create CUDA streams
  • Stream.sync() - Synchronize stream
  • PinnedMemoryResource - Pinned host memory (required for async transfers)
  • DeviceMemoryResource - GPU device memory
  • Buffer.copy_to(dst, stream=stream) - Async memory copy
  • Program, LaunchConfig, launch - Kernel compilation and execution

From numpy:

  • np.from_dlpack() - Zero-copy view of pinned memory buffers

Requirements

  • CUDA Toolkit 13.0+
  • Python 3.10+
  • cuda-python, cuda-core, numpy

Installation

pip install -r requirements.txt

How to Run

python streamingCopyComputeOverlap.py

Expected Output

============================================================
Streaming Copy + Compute Overlap
Using pure cuda.core APIs
============================================================

Device: NVIDIA GeForce RTX XXXX
Kernel compiled [OK]

Problem size: 16,000,000 elements (61 MB)

--- Sequential (no overlap) ---
Timeline: [H2D][Compute][D2H]
Time: X.XX ms (±X.XX)

--- Streamed (with overlap) ---
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
...
2 streams: X.XX ms (±X.XX) - speedup: X.XXx
4 streams: X.XX ms (±X.XX) - speedup: X.XXx
8 streams: X.XX ms (±X.XX) - speedup: X.XXx

============================================================
Key: Pinned memory + multiple streams = overlap transfers with compute

Note: Speedup depends on hardware characteristics. This technique
benefits most when transfer time is significant relative to compute.
============================================================

See Also