# JIT Compilation and Link-Time Optimization (Python)

## Description
This sample demonstrates how to build a kernel out of two independently
compiled translation units and link them at runtime with
`cuda.core.Linker`. This is the pattern a library would use to accept
user-supplied device code as a plug-in without recompiling its own
kernels from scratch.
The sample runs the same program in two linking modes:
- PTX linking - each module is compiled with `ProgramOptions(relocatable_device_code=True)` down to PTX, and the `Linker` emits a final cubin. The two modules stay independently compiled (no cross-module inlining).
- Link-Time Optimization (LTO) - each module is compiled with `ProgramOptions(link_time_optimization=True)` down to LTO IR, and the `Linker` is configured with `LinkerOptions(link_time_optimization=True)` so the optimizer runs again across both modules, typically matching the code generation of a single-source build.
The "main" kernel apply_transform calls a user_transform device
function that lives in a separate source string, and the results of both
linking modes are verified against a NumPy reference.
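To make the plug-in pattern concrete, here is a minimal sketch of what the two translation units could look like when held as Python source strings. The names `apply_transform` and `user_transform` come from the sample; the signatures and function bodies below are illustrative assumptions, not the sample's actual sources.

```python
# Hypothetical source strings for the two translation units (illustrative only).
# The "library" kernel only *declares* user_transform; its definition lives in a
# separately compiled, user-supplied module and is resolved by the JIT linker.
MAIN_SOURCE = r"""
extern "C" __device__ float user_transform(float x);   // resolved at link time

extern "C" __global__ void apply_transform(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = user_transform(in[i]);
    }
}
"""

USER_SOURCE = r"""
// User-supplied plug-in: the definition of the hook declared in the main module.
extern "C" __device__ float user_transform(float x)
{
    return 2.0f * x + 1.0f;   // illustrative transform
}
"""
```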
## What You'll Learn
- Compiling multiple `Program` objects into PTX or LTO IR
- Linking independent object codes into a single cubin with `Linker` (see the sketch after this list)
- Choosing between `relocatable_device_code` and `link_time_optimization`
- How a library's main kernel can call into user-supplied device code
- When to prefer LTO over plain PTX linking
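
The PTX-linking mode boils down to a few calls. The sketch below reuses the source strings from the Description section; the `cuda.core.experimental` import path and the `std`/`arch` option values are assumptions about a typical `cuda-core` setup rather than the sample's exact code.

```python
from cuda.core.experimental import Device, Program, ProgramOptions, Linker, LinkerOptions

dev = Device()   # device 0 unless selected otherwise
dev.set_current()
arch = "".join(str(d) for d in dev.compute_capability)   # e.g. "90"

# Compile each translation unit independently to relocatable PTX.
ptx_opts = ProgramOptions(std="c++17", arch=f"sm_{arch}", relocatable_device_code=True)
main_ptx = Program(MAIN_SOURCE, code_type="c++", options=ptx_opts).compile("ptx")
user_ptx = Program(USER_SOURCE, code_type="c++", options=ptx_opts).compile("ptx")

# Link the two object codes into a single loadable cubin and fetch the kernel.
linker = Linker(main_ptx, user_ptx, options=LinkerOptions(arch=f"sm_{arch}"))
module = linker.link("cubin")
kernel = module.get_kernel("apply_transform")
```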
## Key Libraries
- `cuda.core` - Pythonic access to CUDA runtime, programs, and the JIT linker
- `cupy` - input and output buffers on the GPU
- `numpy` - reference computation on the host
## Key APIs

### From `cuda.core`
- `ProgramOptions(relocatable_device_code=True)` + `Program.compile("ptx")` - produce relocatable PTX
- `ProgramOptions(link_time_optimization=True)` + `Program.compile("ltoir")` - produce LTO IR
- `Linker(*object_codes, options=LinkerOptions(...))` - create a JIT linker over multiple object codes
- `LinkerOptions(link_time_optimization=True)` - opt into LTO during linking
- `Linker.link("cubin")` - produce a loadable module
- `ObjectCode.get_kernel(name)` - fetch a kernel from the linked module
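Continuing the sketch above, the LTO mode differs only in the compile target and the linker options (same imports, source strings, and `arch` assumption):

```python
# Compile each translation unit to LTO IR instead of PTX.
lto_opts = ProgramOptions(std="c++17", arch=f"sm_{arch}", link_time_optimization=True)
main_lto = Program(MAIN_SOURCE, code_type="c++", options=lto_opts).compile("ltoir")
user_lto = Program(USER_SOURCE, code_type="c++", options=lto_opts).compile("ltoir")

# Run the optimizer again across both modules at link time, so the
# user-supplied device function can be inlined into the main kernel.
lto_linker = Linker(
    main_lto, user_lto,
    options=LinkerOptions(arch=f"sm_{arch}", link_time_optimization=True),
)
lto_module = lto_linker.link("cubin")
lto_kernel = lto_module.get_kernel("apply_transform")
```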
### From `cuda_samples_utils`

- `print_gpu_info()` - print device name and compute capability
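
Putting the pieces together, launching a linked kernel and checking it against a NumPy reference could look roughly like this. It reuses `dev` and `kernel` from the sketches above; the `LaunchConfig`/`launch` usage and the way kernel arguments are passed follow common `cuda.core` + `cupy` examples and are assumptions, not this sample's exact code, and the `2*x + 1` reference matches the illustrative `user_transform` above.

```python
import cupy as cp
import numpy as np
from cuda.core.experimental import LaunchConfig, launch

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)      # input buffer on the GPU
out = cp.empty_like(x)                       # output buffer on the GPU
cp.cuda.get_current_stream().synchronize()   # finish CuPy work before using another stream

stream = dev.create_stream()
block = 256
grid = (n + block - 1) // block
config = LaunchConfig(grid=grid, block=block)

# Launch the linked kernel; device pointers are passed as raw addresses,
# the element count as a 64-bit integer matching the kernel's size_t.
launch(stream, config, kernel, x.data.ptr, out.data.ptr, np.uint64(n))
stream.sync()

# Verify against a NumPy reference of the illustrative transform.
np.testing.assert_allclose(cp.asnumpy(out), 2.0 * cp.asnumpy(x) + 1.0, rtol=1e-6)
print("result verified against NumPy reference")
```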
## Requirements

### Hardware
- NVIDIA GPU with Compute Capability 7.0 or higher
### Software

- CUDA Toolkit 13.0 or newer (matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
## Installation

Install the required packages from `requirements.txt`:

```bash
cd /path/to/cuda-samples/python/2_CoreConcepts/jitLtoLinking
pip install -r requirements.txt
```

The `requirements.txt` installs:

- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
## How to Run

### Basic usage

```bash
cd cuda-samples/python/2_CoreConcepts/jitLtoLinking
python jitLtoLinking.py
```

### With custom parameters

```bash
# Larger element count
python jitLtoLinking.py --elements 1048576

# Use a specific GPU
python jitLtoLinking.py --device 1
```
## Expected Output

```
Device: <Your GPU Name>
Compute Capability: <X.Y>
[1] PTX linking (no LTO)
[ptx] result verified against NumPy reference
[2] LTO linking (link-time optimization)
[lto] result verified against NumPy reference
Both PTX and LTO linked kernels produced matching results. Done
```
Note: Device name and compute capability will vary based on your GPU.
## Files

- `jitLtoLinking.py` - Python implementation using `cuda.core.Linker`
- `README.md` - This file
- `requirements.txt` - Sample dependencies
- `../../Utilities/cuda_samples_utils.py` - Common utilities (imported by this sample)
## See Also

- CUDA Python Documentation: `cuda.core` compilation API
- Upstream `cuda.core` example: `jit_lto_fractal.py`
- NVIDIA nvJitLink documentation