Dheemanth aeab82ff30
CUDA 13.2 samples update (#432)
- Added Python samples for CUDA Python 1.0 release
- Renamed top-level `Samples` directory to `cpp` to accommodate Python samples.
2026-05-13 17:13:18 -05:00

Sample: simpleZeroCopy (Python)

Description

This sample demonstrates zero-copy access: it uses cuda.core to compile and launch a kernel, and cuda.bindings.runtime for mapped pinned host memory (cudaHostAlloc with cudaHostAllocMapped, cudaHostGetDevicePointer, and cudaFreeHost). The GPU loads and stores through device addresses that refer to that host memory, so no cudaMemcpy is needed in either direction. The example is a vector add whose inputs and output are NumPy views of the host side of those buffers.

What you will learn

  • How to allocate mapped pinned host memory with cudaHostAlloc (via cuda.bindings.runtime) so the GPU can use cudaHostGetDevicePointer addresses in a kernel
  • How cuda.core.PinnedMemoryResource differs (staging/copies; not guaranteed to be cudaHostAllocMapped for direct kernel access)
  • How to build NumPy views of host addresses with ctypes and numpy.frombuffer
  • How to launch CUDA kernels with cuda.core's Program and launch, passing device pointers for the mapped buffers
  • When zero-copy is beneficial vs. device memory with explicit transfers
  • How to validate results on the host without a D2H memcpy
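The last point can be sketched directly: because the output array is a NumPy view of mapped host memory, the host can validate results right after stream synchronization, with no device-to-host copy. A minimal, GPU-free sketch of the relative-error check (the name check_result is hypothetical, and plain arrays stand in for the mapped buffers):

```python
import numpy as np

def check_result(gpu_out: np.ndarray, a: np.ndarray, b: np.ndarray,
                 tol: float = 1e-6) -> bool:
    """Validate a zero-copy vector add: gpu_out is a NumPy view of the
    mapped host buffer, so no device-to-host copy is needed first."""
    ref = a + b                                        # CPU reference result
    err = np.linalg.norm(ref - gpu_out) / max(np.linalg.norm(ref), 1e-30)
    print(f"Relative error: {err:.6e}")
    return err < tol

# Demo with plain host arrays standing in for the mapped buffers:
a = np.arange(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)
out = a + b                                            # pretend the GPU wrote this
assert check_result(out, a, b)
```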

Key libraries

  • numpy: CPU arrays and reference computation
  • cuda-core: Device, stream, Program, LaunchConfig, launch
  • cuda-python (cuda.bindings.runtime): cudaHostAlloc / cudaHostGetDevicePointer / cudaFreeHost for mapped host memory

Key APIs

From cuda.core: Device, device.create_stream(), Program, ProgramOptions, LaunchConfig, launch

From cuda.bindings.runtime: cudaHostAlloc (with cudaHostAllocMapped | cudaHostAllocPortable), cudaHostGetDevicePointer, cudaFreeHost

From the standard library: ctypes, to wrap raw host pointers so numpy.frombuffer can build float32 views over them
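The ctypes-plus-frombuffer pattern can be sketched without any CUDA dependency; here a ctypes array stands in for the pinned allocation, and view_float32 is a hypothetical helper name:

```python
import ctypes
import numpy as np

def view_float32(host_ptr: int, num_elements: int) -> np.ndarray:
    """Wrap a raw host address (e.g. one returned by cudaHostAlloc) as a
    writable float32 NumPy view, without copying."""
    buf_type = ctypes.c_byte * (num_elements * 4)      # 4 bytes per float32
    buf = buf_type.from_address(host_ptr)
    return np.frombuffer(buf, dtype=np.float32)

# Demo with a ctypes-allocated buffer standing in for pinned memory:
n = 4
raw = (ctypes.c_float * n)(1.0, 2.0, 3.0, 4.0)
view = view_float32(ctypes.addressof(raw), n)
view[0] = 42.0            # writes through to the underlying buffer
assert raw[0] == 42.0
```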

Memory management: Free host memory with cudaFreeHost in a finally block; call stream.close() when done.

Zero-Copy Memory: When to Use

Benefits

  • No explicit transfers: Simplifies code by eliminating cudaMemcpy calls
  • No staging step: Once the kernel's stream is synchronized, results are already in host memory and readable
  • Good for small data: For small arrays, the fixed overhead of explicit transfers can exceed the cost of accessing the data over PCIe
  • Excellent for integrated GPUs: On systems like Jetson (Tegra), CPU and GPU share physical memory

Limitations

  • Slower access: Limited by PCIe bandwidth vs. device memory bandwidth
  • Not for data reuse: Device memory is much faster for frequently accessed data
  • Discrete GPU overhead: Each access crosses PCIe bus

Best Use Cases

  1. Small data sets where transfer overhead dominates
  2. Data accessed infrequently by GPU
  3. Integrated GPU platforms (shared memory)
  4. Streaming data from host to device
  5. Prototyping and debugging (simplifies memory management)

Requirements

  1. NVIDIA GPU and a driver compatible with your installed cuda-python / cuda-core wheels.
  2. Python 3.10 or newer
  3. Install dependencies with pip install -r requirements.txt (NumPy, cuda-python, cuda-core). A system CUDA Toolkit is not strictly required as long as the process can load the driver and runtime libraries; set LD_LIBRARY_PATH as shown under How to run if you hit missing-library errors.

Install packages:

pip install -r requirements.txt

Or manually:

pip install "numpy>=2.3.2" "cuda-core>=0.6.0" "cuda-python>=13.0.0"

How to run

Basic usage:

# Pre-steps: Set library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Run with default parameters (1M elements)
python simpleZeroCopy.py

With custom parameters:

# Use 2M elements
python simpleZeroCopy.py --num_elements 2097152

# Show help
python simpleZeroCopy.py --help

Command line arguments

  • --num_elements: Number of elements in vectors (default: 1048576)
    • Each vector uses num_elements * 4 bytes (float32)
    • Default: ~4 MB per vector, ~12 MB total
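The sizes quoted above follow directly from the float32 element width:

```python
num_elements = 1_048_576                      # default --num_elements
bytes_per_vector = num_elements * 4           # float32 = 4 bytes per element
mb_per_vector = bytes_per_vector / (1024 ** 2)
total_mb = 3 * mb_per_vector                  # a, b, and the output vector
print(f"{mb_per_vector:.2f} MB per vector, {total_mb:.2f} MB total")
# → 4.00 MB per vector, 12.00 MB total
```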

Expected Output

Device name and compute capability depend on your system; the rest of the log should match this shape when validation passes.

======================================================================
simpleZeroCopy - CUDA Python Sample
======================================================================

Device Information:
  Name: <your GPU>
  Compute Capability: <major>.<minor>

> Memory: mapped pinned host (cudaHostAlloc + cudaHostGetDevicePointer)

Compiling CUDA kernel...
  Kernel compiled successfully

Allocating memory:
  Vector size: 1,048,576 elements
  Memory per vector: 4.00 MB
  Total memory: 12.00 MB

> Allocating mapped pinned host memory...
  Mapped host memory allocated successfully

> Initializing vectors on host...
> Computing reference result on CPU...

> Launching vectorAddGPU kernel...
  Note: GPU accesses host memory directly (zero-copy)
  Kernel execution complete

> Checking results from vectorAddGPU()...
  Comparing 1,048,576 elements...
  Relative error: 0.000000e+00
  Validation PASSED

======================================================================
simpleZeroCopy completed successfully!
======================================================================

Files

  • simpleZeroCopy.py: Main Python implementation
  • README.md: This file
  • requirements.txt: Python package dependencies