Dheemanth b7c5481c55
Release v13.3 of the CUDA samples with CUDA 13.3 Toolkit (#435)
This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.
2026-05-27 16:50:59 -05:00

tileLayerNorm

Description

This sample demonstrates a persistent layer-norm forward pass using CUDA Tile C++: y = (x - mean) * rsqrt(var + eps) * weight + bias. The grid launches NUM_SMS persistent blocks. Each block walks the row dimension with a grid-stride loop, processing a tile of BLOCK_N rows by BLOCK_D columns per iteration and advancing by NUM_SMS * BLOCK_N rows between iterations. The per-row mean and inverse standard deviation are computed by reducing across the column dimension with cuda::tiles row reductions and are saved to float32 side buffers. The weight and bias tiles are loaded once and broadcast across rows.
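
The row-walking pattern and the per-row math described above can be sketched on the CPU in plain C++. This is not the sample's kernel and does not use the cuda::tiles API. It is a minimal reference under stated assumptions: the function name layer_norm_reference and the values chosen for NUM_SMS and BLOCK_N are hypothetical, each row is assumed to fit in a single BLOCK_D-wide tile, and the variance is assumed to be the biased (divide by column count) form commonly used for layer norm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical values for illustration; the sample's real settings may differ.
constexpr int NUM_SMS = 4;  // number of persistent "blocks"
constexpr int BLOCK_N = 2;  // rows handled per block per iteration

// CPU reference for y = (x - mean) * rsqrt(var + eps) * weight + bias.
// x and y are row-major [n_rows x n_cols]; mean and rstd hold one value per row.
void layer_norm_reference(const std::vector<float>& x,
                          const std::vector<float>& weight,
                          const std::vector<float>& bias,
                          std::vector<float>& y,
                          std::vector<float>& mean,
                          std::vector<float>& rstd,
                          int n_rows, int n_cols, float eps)
{
    const std::size_t rows = static_cast<std::size_t>(n_rows);
    const std::size_t cols = static_cast<std::size_t>(n_cols);
    assert(x.size() == rows * cols && y.size() == rows * cols);
    assert(weight.size() == cols && bias.size() == cols);
    assert(mean.size() == rows && rstd.size() == rows);

    // The outer loop plays the role of the NUM_SMS persistent blocks.
    for (int block = 0; block < NUM_SMS; ++block) {
        // Grid-stride loop: start at this block's first tile, then jump by
        // NUM_SMS * BLOCK_N rows so the blocks together cover every row once.
        for (int row0 = block * BLOCK_N; row0 < n_rows; row0 += NUM_SMS * BLOCK_N) {
            const int row_end = std::min(row0 + BLOCK_N, n_rows);
            for (int r = row0; r < row_end; ++r) {
                const float* xr = &x[static_cast<std::size_t>(r) * cols];

                // Row reduction 1: mean over the column dimension.
                float sum = 0.0f;
                for (std::size_t c = 0; c < cols; ++c) sum += xr[c];
                const float m = sum / static_cast<float>(n_cols);

                // Row reduction 2: biased variance over the column dimension.
                float sq = 0.0f;
                for (std::size_t c = 0; c < cols; ++c) {
                    const float d = xr[c] - m;
                    sq += d * d;
                }
                const float inv_std = 1.0f / std::sqrt(sq / static_cast<float>(n_cols) + eps);

                // Save the per-row statistics to the side buffers.
                mean[static_cast<std::size_t>(r)] = m;
                rstd[static_cast<std::size_t>(r)] = inv_std;

                // weight and bias are indexed by column only, i.e. broadcast across rows.
                float* yr = &y[static_cast<std::size_t>(r) * cols];
                for (std::size_t c = 0; c < cols; ++c)
                    yr[c] = (xr[c] - m) * inv_std * weight[c] + bias[c];
            }
        }
    }
}
```

A reference like this is one way to produce the expected results that a GPU implementation is checked against. Choosing a row count that is not a multiple of NUM_SMS * BLOCK_N exercises the partial final tile.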

Expected Output

Success! Persistent LayerNorm matches expected results.

Prerequisites