Dheemanth b7c5481c55
Release v13.3 of the CUDA samples with CUDA 13.3 Toolkit (#435)
This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.
2026-05-27 16:50:59 -05:00

tileRope

Description

This sample demonstrates a Rotary Position Embedding (RoPE) forward pass using CUDA Tile C++. RoPE injects positional information into the query and key projections of an attention layer by rotating pairs of features in the head dimension by per-position angles. This implementation uses the split-half convention: for a token at position s and each index i in [0, D/2), where D is the RoPE dimension of each head, the pair (q[i], q[i + D/2]) is rotated by theta = s * 10000^(-2i / D), so q[i]' = q[i]*cos(theta) - q[i + D/2]*sin(theta) and q[i + D/2]' = q[i]*sin(theta) + q[i + D/2]*cos(theta). The cuda::tiles::partition_view type partitions the Q and K tensors of each (batch, position) token into 2D tiles over (heads, half_rope_dim). A single block processes all heads for one token in parallel and writes the result back in place. A separate SIMT kernel initializes the inputs and the cos/sin tables.
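The split-half rotation described above can be sketched on the host as a plain C++ reference. This is not the sample's CUDA Tile C++ kernel and does not use cuda::tiles::partition_view. The function name rope_split_half and its parameters are hypothetical, and the sketch computes cos/sin on the fly rather than reading them from precomputed tables. It rotates a single head vector of length D for one token position.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of the sample): applies split-half
// RoPE in place to one head vector x of even length D for the token at
// sequence position pos.
void rope_split_half(std::vector<float>& x, std::size_t pos,
                     float base = 10000.0f) {
    const std::size_t D = x.size();
    assert(D % 2 == 0 && "RoPE dimension must be even");
    const std::size_t half = D / 2;

    for (std::size_t i = 0; i < half; ++i) {
        // theta = s * base^(-2i / D)
        const float exponent =
            -2.0f * static_cast<float>(i) / static_cast<float>(D);
        const float theta = static_cast<float>(pos) * std::pow(base, exponent);
        const float c = std::cos(theta);
        const float s = std::sin(theta);

        // Read both elements of the pair before writing, since the
        // update is done in place.
        const float a = x[i];
        const float b = x[i + half];
        x[i]        = a * c - b * s;
        x[i + half] = a * s + b * c;
    }
}
```

Calling rope_split_half on every head of Q and K for each (batch, position) token would give a reference to compare against. Because each pair undergoes a pure rotation, the vector norm is preserved, and position 0 leaves the input unchanged.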

Expected Output

Success! RoPE matches expected results.

Prerequisites