# cudaGraphs (Python)

## Description
This sample demonstrates how to capture a multi-stage kernel pipeline as a
CUDA graph with `cuda.core` and replay it with a single driver call.

The sample runs a three-stage elementwise pipeline, `r3 = (a + b) * c - a`, in two modes:

- **Individual launches** - one `launch(stream, ...)` per stage, repeated for every iteration of the pipeline.
- **CUDA graph replay** - the same three launches are recorded into a `Graph` once and replayed with `graph.launch(stream)` on each iteration.

Both paths are timed over N iterations and their results are verified against a reference computation. The sample also re-launches the graph after mutating the input buffers to show that the graph captures pointers (not data), so the same graph can process new inputs without rebuilding.
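The "pointers, not data" behavior can be sketched on the host with NumPy as a stand-in for the device buffers (the values and names here are illustrative, not taken from the sample):

```python
import numpy as np

# Host-side analogy for the sample's verification step: compute the
# three-stage pipeline r3 = (a + b) * c - a as a reference result.
a = np.array([1.0, 2.0], dtype=np.float32)
b = np.array([3.0, 4.0], dtype=np.float32)
c = np.array([5.0, 6.0], dtype=np.float32)

pipeline = lambda: (a + b) * c - a  # bound to the buffers, not their contents

first = pipeline()   # -> [19., 34.]

# Refill a buffer in place -- analogous to mutating the device arrays the
# captured graph points at -- and "replay" without rebuilding anything.
a[:] = [10.0, 20.0]
second = pipeline()  # -> [55., 124.]
```

The same principle lets a captured graph process new inputs: the graph holds device addresses, so refilling those buffers changes what the next replay computes.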
## What You'll Learn

- Creating a `GraphBuilder` from a stream with `stream.create_graph_builder()`
- Capturing launches with `begin_building()` and `end_building()`
- Completing a graph with `builder.complete()` and uploading it to a stream
- Replaying the graph with `graph.launch(stream)`
- Measuring the launch-overhead savings for small kernels
- Re-running the same graph against updated input data
## Key Libraries

- `cuda.core` - Pythonic access to CUDA runtime, programs, and graphs
- `cupy` - input buffers and result verification
- `numpy` - scalar kernel arguments
## Key APIs

### From `cuda.core`

- `Stream.create_graph_builder()` - obtain a `GraphBuilder`
- `GraphBuilder.begin_building()` / `end_building()` - begin and finish recording launches issued against the builder
- `GraphBuilder.complete()` - produce an executable `Graph`
- `Graph.upload(stream)` - upload the graph structure to the device
- `Graph.launch(stream)` - replay the entire graph
- `launch(graph_builder, config, kernel, ...)` - record a kernel launch into the graph being built
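Put together, these calls form a short lifecycle. A minimal sketch of the ordering, assuming the handles (`stream`, `launch`, `config`, `kernel`, `args`) were created elsewhere with `cuda.core`; treat it as illustrative rather than a drop-in implementation:

```python
def build_and_replay(stream, launch, config, kernel, args, iters):
    """Record a kernel launch into a graph once, then replay it."""
    gb = stream.create_graph_builder()  # GraphBuilder tied to this stream
    gb.begin_building()                 # launches below are recorded, not run
    launch(gb, config, kernel, *args)   # one call per pipeline stage
    gb.end_building()                   # stop recording
    graph = gb.complete()               # produce the executable Graph
    graph.upload(stream)                # pre-load the graph onto the device
    for _ in range(iters):
        graph.launch(stream)            # a single call replays every stage
    return graph
```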
### From `cuda_samples_utils`

- `print_gpu_info()` - print device name and compute capability
## Requirements

### Hardware
- NVIDIA GPU with Compute Capability 7.0 or higher
- Minimum GPU memory: 512 MB
### Software

- CUDA Toolkit 13.0 or newer (matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
## Installation

Install the required packages from `requirements.txt`:

```bash
cd /path/to/cuda-samples/python/2_CoreConcepts/cudaGraphs
pip install -r requirements.txt
```

The `requirements.txt` installs:

- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
## How to Run

### Basic usage

```bash
cd cuda-samples/python/2_CoreConcepts/cudaGraphs
python cudaGraphs.py
```

### With custom parameters

```bash
# Larger vectors and more iterations
python cudaGraphs.py --elements 4096 --iters 2000

# Use a specific GPU
python cudaGraphs.py --device 1
```
Short vectors exaggerate the launch-overhead savings; larger vectors will show the two approaches converging because per-launch overhead becomes negligible next to kernel runtime.
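A toy cost model makes the convergence concrete. The per-launch overhead numbers below are invented for illustration, not measured:

```python
def modeled_speedup(kernel_us, stages=3, launch_us=2.5, graph_launch_us=1.0):
    """Per-iteration speedup of graph replay over individual launches.

    Individual launches pay one CPU-side launch overhead per stage;
    a graph replay pays a single (smaller) overhead for all stages.
    """
    individual = stages * (launch_us + kernel_us)
    graphed = graph_launch_us + stages * kernel_us
    return individual / graphed

print(f"tiny kernel  (0.1 us): {modeled_speedup(0.1):.2f}x")   # large speedup
print(f"large kernel (100 us): {modeled_speedup(100.0):.2f}x") # near 1x
```

As `kernel_us` grows, both numerator and denominator are dominated by the kernel time and the ratio approaches 1, which is exactly the convergence described above.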
## Expected Output

Speedup numbers vary with GPU and host CPU.

```
Device: <Your GPU Name>
Compute Capability: <X.Y>
Individual launches: 1000 iters in 0.0085s (8.49 us/iter)
Building CUDA graph...
Graph replay: 1000 iters in 0.0034s (3.41 us/iter)
Graph speedup: 2.49x
Graph replay on updated data verified (same graph, new buffer contents)
Done
```
## Files

- `cudaGraphs.py` - Python implementation using `cuda.core` CUDA graphs
- `README.md` - This file
- `requirements.txt` - Sample dependencies
- `../../Utilities/cuda_samples_utils.py` - Common utilities (imported by this sample)
## See Also

- CUDA Python Documentation
- `cuda.core` graphs API
- Upstream `cuda.core` example: `cuda_graphs.py`
- CUDA Graphs programming guide