mirror of https://github.com/NVIDIA/cuda-samples.git synced 2026-07-16 21:06:52 +08:00

History

Release v13.3 of the CUDA samples with CUDA 13.3 Toolkit (#435)

This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.

2026-05-27 16:50:59 -05:00

README.md

Sample: Streaming Copy + Compute Overlap (Python)

Description

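Demonstrates how to overlap host-to-device (H2D) and device-to-host (D2H) memory transfers with kernel computation using CUDA streams. Splitting the work across several streams lets one chunk's transfer run while another chunk computes, which hides transfer latency and improves GPU utilization.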

What You'll Learn

Using PinnedMemoryResource for async-capable host memory
Using DeviceMemoryResource for GPU memory allocation
Creating multiple streams with Device.create_stream()
Async memory copies with Buffer.copy_to()
Overlapping H2D transfers, kernel execution, and D2H transfers

Key Concept

Without overlap (sequential):

[====H2D====][====Compute====][====D2H====]

With overlap (multiple streams):

Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
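The timelines above can be approximated with a small timing model. The sketch below is not part of the sample: it assumes one H2D copy engine, one compute engine, and one D2H copy engine, each handling one chunk at a time, with equal-sized chunks and no launch or synchronization overhead. The function names and the stage times are hypothetical and chosen only for illustration.

```python
def sequential_time(h2d, compute, d2h):
    """Total time when the three stages run back to back on one stream."""
    return h2d + compute + d2h


def streamed_time(h2d, compute, d2h, num_chunks):
    """Idealized total time when the work is split into equal chunks.

    Assumption: three independent engines (H2D, compute, D2H), each
    processing one chunk at a time, with zero per-chunk overhead.
    """
    stage_costs = [h2d / num_chunks, compute / num_chunks, d2h / num_chunks]
    stage_free_at = [0.0, 0.0, 0.0]  # time at which each engine becomes idle
    finish = 0.0
    for _ in range(num_chunks):
        ready = 0.0  # every chunk's input is available from the start
        for stage, cost in enumerate(stage_costs):
            # A stage starts once the chunk has left the previous stage
            # and the engine has finished the previous chunk.
            start = max(ready, stage_free_at[stage])
            ready = start + cost
            stage_free_at[stage] = ready
        finish = ready
    return finish


if __name__ == "__main__":
    # Hypothetical stage times in milliseconds, for illustration only.
    h2d_ms, compute_ms, d2h_ms = 4.0, 4.0, 4.0
    base = sequential_time(h2d_ms, compute_ms, d2h_ms)
    for n in (1, 2, 4, 8):
        t = streamed_time(h2d_ms, compute_ms, d2h_ms, n)
        print(f"{n} chunk(s): {t:.2f} ms, speedup {base / t:.2f}x")
```

With three equal 4 ms stages, this model gives 12 ms for the sequential case and 6 ms when the work is split into four chunks. Real hardware adds per-chunk overhead, so measured speedups are typically lower than the model predicts.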

Key APIs (all from `cuda.core`)

Device - Device management
Device.create_stream() - Create CUDA streams
Stream.sync() - Synchronize stream
PinnedMemoryResource - Pinned host memory (required for async transfers)
DeviceMemoryResource - GPU device memory
Buffer.copy_to(dst, stream=stream) - Async memory copy
Program, LaunchConfig, launch - Kernel compilation and execution

From `numpy`:

np.from_dlpack() - Zero-copy view of pinned memory buffers
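To illustrate what "zero-copy view" means here, the following sketch uses a plain NumPy array as the DLPack producer. That substitution is an assumption made so the snippet runs without a GPU; in the sample the producer would be a pinned host buffer instead. The variable names are illustrative.

```python
import numpy as np

# Stand-in DLPack producer (assumption): an ordinary NumPy array rather
# than a pinned host buffer.
source = np.arange(8, dtype=np.float32)

# from_dlpack wraps the producer's memory instead of copying it.
view = np.from_dlpack(source)

# A write through the original array is visible through the view,
# because both refer to the same underlying memory.
source[0] = 42.0
shares = np.shares_memory(source, view)
print(shares, view[0])
```

Because no copy is made, host-side initialization and result checking can operate directly on the memory that the asynchronous transfers read from and write to.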

Requirements

CUDA Toolkit 13.0+
Python 3.10+
cuda-python, cuda-core, numpy

Installation

pip install -r requirements.txt

How to Run

python streamingCopyComputeOverlap.py

Expected Output

============================================================
Streaming Copy + Compute Overlap
Using pure cuda.core APIs
============================================================

Device: NVIDIA GeForce RTX XXXX
Kernel compiled [OK]

Problem size: 16,000,000 elements (61 MB)

--- Sequential (no overlap) ---
Timeline: [H2D][Compute][D2H]
Time: X.XX ms (±X.XX)

--- Streamed (with overlap) ---
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
...
2 streams: X.XX ms (±X.XX) - speedup: X.XXx
4 streams: X.XX ms (±X.XX) - speedup: X.XXx
8 streams: X.XX ms (±X.XX) - speedup: X.XXx

============================================================
Key: Pinned memory + multiple streams = overlap transfers with compute

Note: Speedup depends on hardware characteristics. This technique
benefits most when transfer time is significant relative to compute.
============================================================
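The problem size and the per-stream work split in the output above can be checked with a little arithmetic. The sketch below assumes 4-byte (float32) elements and an even, contiguous split across streams; neither detail is stated in this README, and `chunk_ranges` is a hypothetical helper, not a function from the sample.

```python
def chunk_ranges(num_elements, num_streams):
    """Split [0, num_elements) into contiguous (start, stop) ranges,
    one per stream, spreading any remainder over the first ranges."""
    base, extra = divmod(num_elements, num_streams)
    ranges = []
    start = 0
    for i in range(num_streams):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


num_elements = 16_000_000
itemsize = 4  # assumption: float32 elements
total_mib = num_elements * itemsize / 2**20  # bytes converted to MiB

if __name__ == "__main__":
    print(f"{num_elements:,} elements, about {total_mib:.1f} MiB")
    for n in (2, 4, 8):
        first_start, first_stop = chunk_ranges(num_elements, n)[0]
        print(f"{n} streams: {first_stop - first_start:,} elements per chunk")
```

Under the float32 assumption, 16,000,000 elements occupy 64,000,000 bytes, which is about 61 MiB and matches the size reported in the output.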
