# Sample: Launch Configuration Tuning (Python)

## Description

Benchmark different CUDA kernel launch configurations to find the optimal block-size setting using `cuda.core` APIs. This sample demonstrates **performance tuning** by measuring execution time across various thread block sizes.

## What You'll Learn

- Compiling CUDA kernels at runtime with `cuda.core.Program`
- Launching kernels with different `LaunchConfig` settings
- Benchmarking kernel performance with precise timing
- Understanding how thread block size affects performance
- Tuning for memory-bound vs. compute-bound kernels

## Key Concepts

### Launch Configuration with cuda.core

```python
# Configure kernel launch with a specific thread block size
config = LaunchConfig(
    grid=(grid_size,),
    block=(block_size,),
    shmem_size=shared_memory_bytes,
)

# Launch kernel
launch(stream, config, kernel, *args)
stream.sync()
```

### Thread Block Sizing

Thread block size significantly impacts performance due to:

| Factor | Impact |
|--------|--------|
| **Occupancy** | More active warps can hide memory latency |
| **Registers** | More threads/block = fewer registers/thread |
| **Shared Memory** | Divided among blocks on each SM |
| **Warp Efficiency** | Block size should be a multiple of 32 (the warp size) |

### Benchmarking Approach

```python
# Use CUDA events for accurate GPU timing (not CPU wall-clock)
start_event = device.create_event(options=EventOptions(enable_timing=True))
end_event = device.create_event(options=EventOptions(enable_timing=True))

stream.record(start_event)
for _ in range(n_iterations):
    launch(stream, config, kernel, *args)
stream.record(end_event)
end_event.sync()

elapsed_ms = (end_event - start_event) / n_iterations
```

## Key APIs

### From `cuda.core`:

- `Device` - CUDA device management
- `Program` - Runtime kernel compilation (NVRTC)
- `ProgramOptions` - Compilation options (architecture target)
- `LaunchConfig` - Kernel launch configuration (grid/block dimensions)
- `launch` - Execute compiled kernel (accepts
Buffer objects directly)
- `EventOptions` - GPU timing with CUDA events
- `ManagedMemoryResource` - Device-preferred unified memory
- `ManagedMemoryResourceOptions` - Set `preferred_location` for representative benchmarks

### From `numpy`:

- `np.from_dlpack()` - Zero-copy view of GPU buffers via DLPack

### Benchmarked Kernels:

- **vector_add** - Simple memory-bound kernel (C = A + B) - low sensitivity to block size
- **reduce_sum** - Shared memory reduction - high sensitivity to block size

## Requirements

### Hardware:

- NVIDIA GPU with CUDA support
- Minimum GPU memory: 512 MB

### Software:

- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- See `requirements.txt` for Python packages

## Installation

```bash
pip install -r requirements.txt
```

## How to Run

```bash
python launchConfigTuning.py
```

## Expected Output

```
============================================================
Launch Configuration Tuning (cuda.core)
Finding the Best Block Size for Your Kernel
============================================================

Device:
Compute Capability: X.X

Compiling CUDA kernels with cuda.core.Program...
  Target architecture: sm_XX
  ✓ vector_add kernel compiled
  ✓ reduce_sum kernel compiled

============================================================
VECTOR ADDITION - Launch Configuration Tuning
============================================================
Problem size: 10,000,000 elements
Kernel: vector_add (C = A + B)

Testing thread configurations: [32, 64, 128, 256, 512, 1024]
------------------------------------------------------------
Block Size:   32 | Blocks: 312500 | Time: X.XXXX ± X.XXXX ms
Block Size:   64 | Blocks: 156250 | Time: X.XXXX ± X.XXXX ms
...
------------------------------------------------------------
✓ OPTIMAL: block_size=XXX (X.XXXX ms)
✗ WORST:   block_size=XXX (X.XXXX ms)
  Speedup: X.XXx
✓ Results verified correct!

...
============================================================
SAMPLE COMPLETE
============================================================

Key Takeaway: The optimal thread configuration depends on your
specific kernel characteristics. Always benchmark to find the best!
```

## Tuning Guidelines

### Start Here

- **128-256 threads/block** is a good starting point for most kernels
- Always use **multiples of 32** (the warp size)

### Memory-Bound Kernels

- Less sensitive to thread configuration
- Focus on memory access patterns
- Higher thread counts help hide latency

### Compute-Bound Kernels

- More sensitive to thread configuration
- Watch for register pressure at high thread counts
- Profile with Nsight Compute

### Reduction Kernels

- Block size affects shared memory usage
- Power-of-2 sizes simplify reduction logic
- Often 256-512 threads work well

## Files

- `launchConfigTuning.py` - Python implementation using cuda.core
- `README.md` - This file
- `requirements.txt` - Sample dependencies

## See Also

- [cuda.core Documentation](https://nvidia.github.io/cuda-python/cuda-core/latest/)
- [cuda.core.LaunchConfig](https://nvidia.github.io/cuda-python/cuda-core/latest/generated/cuda.core.LaunchConfig.html)
- [CUDA Occupancy Calculator](https://docs.nvidia.com/cuda/cuda-occupancy-calculator/)
- [CUDA Best Practices Guide - Execution Configuration](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#execution-configuration-optimizations)
- [Nsight Compute Profiler](https://developer.nvidia.com/nsight-compute)
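
## Appendix: Grid Sizing Sketch

The block counts in the sample output (e.g. 312500 blocks at block size 32 for 10,000,000 elements) come from ceil-division of the problem size by the block size, so every element is covered by a thread. A minimal GPU-free sketch; the helper name `grid_size_for` is illustrative and not part of the sample:

```python
def grid_size_for(n_elements: int, block_size: int) -> int:
    """Number of blocks needed so that every element gets a thread."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Ceil division: round up so trailing elements are still covered
    return (n_elements + block_size - 1) // block_size


# Matches the block counts shown in the sample output above
print(grid_size_for(10_000_000, 32))    # 312500
print(grid_size_for(10_000_000, 64))    # 156250
print(grid_size_for(10_000_000, 1024))  # 9766 (rounded up from 9765.625)
```

The rounding up means the last block may contain idle threads, which is why CUDA kernels typically guard with `if (i < n)` before touching memory.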
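
## Appendix: Ranking Timed Configurations

The OPTIMAL/WORST/Speedup summary in the sample output reduces to taking the min and max of the per-configuration mean timings. A GPU-free sketch of that bookkeeping, with a hypothetical `summarize` helper and fabricated timings (not measured results):

```python
import statistics


def summarize(timings_ms: dict[int, list[float]]) -> dict:
    """Pick the best and worst block size from repeated timing samples (ms)."""
    means = {bs: statistics.mean(samples) for bs, samples in timings_ms.items()}
    best = min(means, key=means.get)
    worst = max(means, key=means.get)
    return {
        "best": best,
        "worst": worst,
        # How many times faster the best configuration is than the worst
        "speedup": means[worst] / means[best],
    }


# Fabricated example timings for three block sizes
timings = {32: [1.90, 2.10], 256: [0.95, 1.05], 1024: [1.20, 1.30]}
result = summarize(timings)
print(result["best"], result["worst"], round(result["speedup"], 2))  # 256 32 2.0
```

Averaging several iterations per configuration (as the sample's `± X.XXXX ms` column suggests) smooths out launch-overhead jitter before the ranking is made.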