# greenContext (Python)

## Description

This sample demonstrates how to use **green contexts** with `cuda.core` to statically partition a GPU's streaming multiprocessors (SMs) so that independent kernels can run on dedicated subsets of the device.

The sample launches a long-running kernel that fills the GPU's SMs; shortly afterwards it launches a short but latency-sensitive "critical" kernel. Without green contexts, the critical kernel must wait for SMs to free up. With green contexts, the GPU's SMs are partitioned so the critical kernel has its own dedicated SMs and can start immediately.

Three timed scenarios are compared:

1. **Reference**: the critical kernel alone on the primary context, with no competing work. Establishes the pure compute time of the critical kernel when every SM on the device is available to it.
2. **Baseline**: both kernels run on the device's primary context, on two non-blocking streams that contend for all SMs.
3. **Green contexts**: the SMs are split into two disjoint groups (e.g. 112 + 16). Each kernel runs on a stream that belongs to its own green context, so the critical kernel never waits for SMs held by the long-running kernel.

The headline metric is the total wall time of the critical kernel from launch to completion. In the baseline it is dominated by time spent waiting behind the long-running kernel. With green contexts it reflects the kernel's own compute time on its (smaller) SM partition. The reference row lets you separate those two effects:

- `baseline - reference` is roughly the time the critical kernel spent waiting for SMs in the baseline run (the cost that green contexts eliminate).
- `green / reference` is the compute slowdown caused by running on a smaller SM partition (the cost that green contexts introduce).
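The core flow is small; the sketch below strings together the `cuda.core` calls listed under *Key APIs* further down. The import path and the exact shape of `sm.split`'s return value are assumptions here; `greenContext.py` is the authoritative usage.

```python
# Sketch only: the import path (cuda.core.experimental) and the shape of
# sm.split()'s return value are assumptions; see greenContext.py for the
# sample's actual usage.
from cuda.core.experimental import ContextOptions, Device, SMResourceOptions

device = Device(0)
device.set_current()

sm = device.resources.sm                      # whole-device SM resource
print(sm.sm_count, sm.min_partition_size)     # e.g. 128, 2

# Split the SMs into two disjoint groups (a remainder group may also be
# produced, depending on the device and the requested counts).
groups = sm.split(SMResourceOptions(count=(112, 16)))
long_group, crit_group = groups[0], groups[1]

# One green context per group, each with its own non-blocking stream.
long_ctx = device.create_context(ContextOptions(resources=[long_group]))
crit_ctx = device.create_context(ContextOptions(resources=[crit_group]))
long_stream = long_ctx.create_stream()
crit_stream = crit_ctx.create_stream()

# ... compile kernels, then launch(long_stream, ...) / launch(crit_stream, ...)

# A green context must not be the thread's current context when closed.
long_ctx.close()
crit_ctx.close()
```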
## What You'll Learn

- Querying a device's SM resources via `Device.resources.sm` and reading `sm_count`, `min_partition_size`, `coscheduled_alignment`
- Splitting an `SMResource` into disjoint partitions with `sm.split(SMResourceOptions(count=(A, B)))`
- Creating a green context from an SM partition via `Device.create_context(ContextOptions(resources=[group]))`
- Creating a non-blocking stream on a green context with `ctx.create_stream()`
- Using CUDA events with timing enabled to measure kernel wall time across streams
- Cleaning up green contexts safely with `ctx.close()`

## Key Libraries

- [`cuda.core`](https://nvidia.github.io/cuda-python/cuda-core/latest/) - device management, SM partitioning, green contexts, compilation, and launching
- `numpy` - scalar kernel arguments

## Key APIs

### From `cuda.core`

- `Device.resources.sm` - the device's SM-type device resource
- `SMResource.split(SMResourceOptions(count=(A, B)))` - partition SMs into disjoint groups (plus an optional remainder)
- `Device.create_context(ContextOptions(resources=[sm_group]))` - create a green context provisioned with a specific SM partition
- `Context.is_green` / `Context.resources` - introspect a green context
- `Context.create_stream()` - create a non-blocking stream that is tied to the green context's SM partition
- `Context.close()` - destroy a green context (must not be the thread's current context when closed)
- `Device.create_event(EventOptions(timing_enabled=True))` / `Stream.record(event)` / `event2 - event1` - measure elapsed time in milliseconds between two events on the device
- `Program(..., ProgramOptions(std="c++17", arch=f"sm_{device.arch}"))` / `program.compile("cubin", name_expressions=(...))` - compile the delay and critical kernels in one translation unit
- `launch(stream, LaunchConfig(grid=..., block=...), kernel, ...)` - submit a kernel on a specific stream

## Requirements

### Hardware

- Any NVIDIA GPU supported by green contexts.
- Green-context SM partitioning is designed for larger server GPUs (H100, H200, B200, ...) but works on any supported GPU as long as the SM count is large enough to split meaningfully.

### Software

- NVIDIA driver with support for CUDA 12.4 or newer
- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- `cuda-core` (`>=1.0.0`)

## Installation

Install the required packages from `requirements.txt`:

```bash
cd /path/to/cuda-samples/python/2_CoreConcepts/greenContext
pip install -r requirements.txt
```

## How to Run

### Basic usage

The default (auto-selected) split reserves a small partition (~16 SMs) for the critical kernel and gives the rest to the long-running kernel. The exact sizes are chosen by probing the driver with a dry-run `sm.split`, escalating the alignment granularity in powers of two until the driver accepts the pair. This handles architectures where the driver enforces stricter alignment (e.g. TPC/GPC-pair alignment on Blackwell) than the reported `min_partition_size`. When that happens the sample prints a `Note:` line with the granularity it landed on; a sketch of the probing loop is shown below.
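The probing logic can be pictured roughly as follows. The helper below is hypothetical (it is not the sample's actual code) and assumes that an unsatisfiable `count` makes `sm.split` raise; consult `greenContext.py` for how the sample really selects the split.

```python
# Hypothetical illustration of the auto-split probing described above; it
# assumes sm.split() raises when the requested counts violate the driver's
# architecture-specific alignment rules.
from cuda.core.experimental import SMResourceOptions  # import path assumed


def pick_split(sm, reserve=16):
    """Return a (long, critical) SM pair the driver accepts."""
    granularity = max(sm.min_partition_size, 1)
    while granularity <= sm.sm_count // 2:
        # Round the critical side up, and the long side down, to the
        # current granularity.
        crit = -(-max(reserve, granularity) // granularity) * granularity
        long_side = (sm.sm_count - crit) // granularity * granularity
        try:
            sm.split(SMResourceOptions(count=(long_side, crit)))  # dry run
            return long_side, crit
        except Exception:
            granularity *= 2  # escalate alignment in powers of two, retry
    raise RuntimeError("no driver-accepted SM split found")
```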
```bash
cd cuda-samples/python/2_CoreConcepts/greenContext
python greenContext.py
```

### Match the CUDA programming guide example (112 + 16)

```bash
python greenContext.py --split 112,16
```

### Tune the workload

```bash
# Longer long-running kernel, larger host launch gap
python greenContext.py --delay-us 3000 --launch-gap-ms 2.0

# Smaller/lighter critical kernel so its own compute time is negligible
python greenContext.py --critical-n 65536 --critical-iters 128

# Symmetric split: maximum SMs for the critical kernel, long kernel is
# roughly 2x slower but the critical kernel runs close to its reference time.
python greenContext.py --split 64,64

# Use a specific GPU
python greenContext.py --device 1
```

### All options

```
--device           CUDA device ID (default: 0)
--split            SM split as 'LONG,CRITICAL', e.g. '112,16'. Each side must be a
                   multiple of the device's min_partition_size, and the driver may
                   enforce additional architecture-specific alignment (e.g.
                   TPC/partition-grid alignment on Blackwell). Omit --split to
                   auto-select a driver-accepted split.
--delay-us         Per-block busy-wait of the delay kernel, in us (default: 2000)
--delay-waves      Number of waves of the delay kernel on the long partition.
                   Drives the default --delay-blocks (default: 16)
--delay-blocks     Number of blocks for the delay kernel. Overrides --delay-waves
                   if set. (default: --delay-waves * device SM count)
--critical-n       Work size of the critical kernel (default: 4194304)
--critical-iters   Inner math-loop iterations inside the critical kernel. Higher
                   values make the critical kernel's own compute time more
                   substantial relative to its wait time (default: 1024)
--launch-gap-ms    Host delay between launching the long and critical kernels
                   (default: 1.0 ms)
```

## Expected Output

The output depends on the GPU and the number of SMs. On an RTX 4090 (128 SMs) with the default auto split:

```
[Green Context Sample using CUDA Core API]

Device: NVIDIA GeForce RTX 4090
Compute Capability: sm_89
Total SMs: 128
Min. SM partition size: 2
SM co-scheduled alignment: 2
SM split (long/critical): 112 / 16

Workload parameters:
  delay kernel:    2048 blocks, 2000 us/block (~32.0 ms on 128 SMs)
  critical kernel: 4194304 elements, 1024 inner iterations
  host launch gap: 1.0 ms

Compiling kernels ...
Running reference scenario (critical kernel alone) ...
Running baseline scenario (primary context) ...
Running green context scenario ...

scenario                    SMs (long/crit)   long (ms)   crit total (ms)   crit offset (ms)
---------------------------------------------------------------------------------------------
crit alone (primary ctx)    -/128             -           0.425             -
baseline (primary ctx)      128/128           32.034      30.024            1.090
green ctx (112+16 SMs)      112/16            38.017      2.696             1.075

long (ms)        : wall time of the delay kernel
crit total (ms)  : launch-to-complete wall time of the critical kernel
crit offset (ms) : when the critical stream started, relative to the long stream start

Critical-kernel latency speedup (baseline vs green ctx): 11.14x
Green-ctx compute cost vs unconstrained (crit alone): 6.34x
Baseline time spent waiting for SMs (not computing): ~29.60 ms

Done
```
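All of the millisecond figures above are wall times measured with CUDA events, as listed under *Key APIs*. A minimal sketch of that timing pattern (the import path is assumed, and the actual kernel launch is left as a placeholder comment):

```python
# Sketch of event-based timing as used for the wall times above; the import
# path is an assumption and the launch line is a placeholder.
from cuda.core.experimental import Device, EventOptions

device = Device(0)
device.set_current()
stream = device.create_stream()

start = device.create_event(EventOptions(timing_enabled=True))
stop = device.create_event(EventOptions(timing_enabled=True))

stream.record(start)
# launch(stream, LaunchConfig(grid=..., block=...), kernel, ...)  # work being timed
stream.record(stop)
stream.sync()                    # wait for the recorded work to finish

elapsed_ms = stop - start        # event subtraction yields elapsed milliseconds
print("wall time (ms):", elapsed_ms)
```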
**What to look for:**

- The critical kernel alone (reference row) takes only a fraction of a millisecond; almost all of the baseline's `crit total` is time spent queued waiting for SMs, not compute.
- The **critical kernel's wall time drops sharply** in the green-context scenario (from ~30 ms to a few ms in the example above) because it no longer waits for SMs held by the long-running kernel.
- The **long-running kernel's duration may increase** in proportion to the reduction in SMs available to it (128 -> 112 SMs ~= 14% slower; 128 -> 64 SMs ~= 2x slower). This is an expected tradeoff: you reserve SMs for a critical kernel by taking them away from the background workload.
- The **compute cost** ratio (`green / reference`) shows how close the critical kernel is to ideal linear scaling with its SM count. A 112/16 split gives the critical kernel only 12.5% of the SMs and costs it roughly 6-7x its reference time; a 64/64 split gives it half the SMs and costs roughly 1.5-2x.
- The `crit offset` column is approximately `--launch-gap-ms` in both full scenarios; it confirms the host launched the critical kernel the same amount of time after the long kernel in both runs.

Exact timings vary with GPU model, driver version, clock state, and other concurrent GPU work.

## Files

- `greenContext.py` - Python implementation using `cuda.core` green-context APIs
- `README.md` - This file
- `requirements.txt` - Sample dependencies

## See Also

- [CUDA Python Documentation](https://nvidia.github.io/cuda-python/)
- [Green Contexts in the CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#green-contexts)
- [`cuda.core` green-context test suite](https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/tests/test_green_context.py) - the authoritative API reference