# Sample: Streaming Copy + Compute Overlap (Python)

## Description

Demonstrate how to overlap memory transfers (H2D/D2H) with kernel computation using CUDA streams. This technique hides transfer latency and improves GPU utilization.

## What You'll Learn

- Using `PinnedMemoryResource` for async-capable host memory
- Using `DeviceMemoryResource` for GPU memory allocation
- Creating multiple streams with `Device.create_stream()`
- Async memory copies with `Buffer.copy_to()`
- Overlapping H2D transfers, kernel execution, and D2H transfers

## Key Concept

**Without overlap (sequential):**

```
[====H2D====][====Compute====][====D2H====]
```

**With overlap (multiple streams):**

```
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
```

## Key APIs (all from `cuda.core`)

- `Device` - Device management
- `Device.create_stream()` - Create CUDA streams
- `Stream.sync()` - Synchronize stream
- `PinnedMemoryResource` - Pinned host memory (required for async transfers)
- `DeviceMemoryResource` - GPU device memory
- `Buffer.copy_to(dst, stream=stream)` - Async memory copy
- `Program`, `LaunchConfig`, `launch` - Kernel compilation and execution

### From `numpy`

- `np.from_dlpack()` - Zero-copy view of pinned memory buffers

## Requirements

- CUDA Toolkit 13.0+
- Python 3.10+
- `cuda-python`, `cuda-core`, `numpy`

## Installation

```bash
pip install -r requirements.txt
```

## How to Run

```bash
python streamingCopyComputeOverlap.py
```

## Expected Output

```
============================================================
Streaming Copy + Compute Overlap
Using pure cuda.core APIs
============================================================
Device: NVIDIA GeForce RTX XXXX
Kernel compiled ✓

Problem size: 16,000,000 elements (61 MB)

--- Sequential (no overlap) ---
Timeline: [H2D][Compute][D2H]
Time: X.XX ms (±X.XX)

--- Streamed (with overlap) ---
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
...

2 streams: X.XX ms (±X.XX) - speedup: X.XXx
4 streams: X.XX ms (±X.XX) - speedup: X.XXx
8 streams: X.XX ms (±X.XX) - speedup: X.XXx

============================================================
Key: Pinned memory + multiple streams = overlap transfers with compute

Note: Speedup depends on hardware characteristics. This technique
benefits most when transfer time is significant relative to compute.
============================================================
```

## See Also

- [cuda.core Documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/)
- [CUDA Streams Best Practices](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#overlapping-data-transfers)
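## Appendix: Sketch of the Streamed Loop

The streamed pipeline described above can be sketched with the `cuda.core` APIs listed in this README. This is a minimal illustration, not the sample itself: the kernel name `scale`, the `allocate(...)` call shapes on the memory resources, and passing a `Buffer` directly as a kernel pointer argument are assumptions about the current `cuda.core.experimental` surface and may need adjusting to your installed version. The pure helper `chunk_bounds` shows the one piece of host-side logic the overlap pattern needs: splitting the problem into contiguous per-stream chunks.

```python
def chunk_bounds(n, num_chunks):
    """Split n elements into num_chunks contiguous [start, stop) ranges."""
    base, rem = divmod(n, num_chunks)
    bounds, start = [], 0
    for i in range(num_chunks):
        stop = start + base + (1 if i < rem else 0)  # spread the remainder
        bounds.append((start, stop))
        start = stop
    return bounds


def run(n=1 << 22, num_streams=4):
    # GPU-only imports live here so the helper above stays importable anywhere.
    # Exact class/method names are assumptions based on the APIs this README lists.
    import numpy as np
    from cuda.core.experimental import (
        Device, LaunchConfig, Program, launch,
        DeviceMemoryResource, PinnedMemoryResource,
    )

    dev = Device()
    dev.set_current()
    streams = [dev.create_stream() for _ in range(num_streams)]

    # extern "C" avoids C++ name mangling so get_kernel("scale") finds it.
    code = r"""
    extern "C" __global__ void scale(float *data, unsigned long long n) {
        unsigned long long i = blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }
    """
    kernel = Program(code, code_type="c++").compile("cubin").get_kernel("scale")

    host_mr = PinnedMemoryResource()       # pinned host memory: required for async copies
    dev_mr = DeviceMemoryResource(dev.device_id)

    for (start, stop), s in zip(chunk_bounds(n, num_streams), streams):
        count = stop - start
        nbytes = count * 4                 # float32
        h_buf = host_mr.allocate(nbytes)   # pinned staging buffer for this chunk
        d_buf = dev_mr.allocate(nbytes, stream=s)

        # Zero-copy numpy view of the pinned buffer (per the README's np.from_dlpack note).
        host = np.from_dlpack(h_buf).view(np.float32)
        host[:] = 1.0

        h_buf.copy_to(d_buf, stream=s)     # async H2D on this chunk's stream
        cfg = LaunchConfig(grid=(count + 255) // 256, block=256)
        launch(s, cfg, kernel, d_buf, np.uint64(count))  # compute on the same stream
        d_buf.copy_to(h_buf, stream=s)     # async D2H; overlaps with other streams' work

    for s in streams:
        s.sync()                           # wait for all pipelines to drain


if __name__ == "__main__":
    run()
```

Because each chunk's H2D copy, kernel launch, and D2H copy are enqueued on the same stream, ordering within a chunk is guaranteed, while different streams' transfers and compute are free to overlap in hardware.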