Release v13.3 of the CUDA samples with CUDA 13.3 Toolkit (#435)

This is the release of the CUDA 13.3 samples, which include additions for CUDA Tile C++, and updated CCCL and Python samples.
Dheemanth 2026-05-27 14:50:59 -07:00 committed by GitHub
parent aeab82ff30
commit b7c5481c55
GPG Key ID: B5690EEEBB952194
150 changed files with 7977 additions and 246 deletions

5
.gitignore vendored

@ -1,6 +1,11 @@
build
build-*/
test-results*/
.vs
.clangd
test
settings.json
launch.json
__pycache__/
*.py[co]
.pytest_cache/

@ -1,5 +1,18 @@
## Changelog
### CUDA 13.3
* Added **CUDA Tile C++** samples under `cpp/9_CUDA_Tile`.
* Added a set of **CCCL 3.3 feature samples** under `cpp/4_CUDA_Libraries/`, each built against CCCL fetched via CPM (pinned to v3.3.3, with an optional `CCCL_SOURCE_DIR` override):
  * `cubDeviceFind` - `cub::DeviceFind::FindIf`, `LowerBound`, and `UpperBound` device-wide search algorithms.
  * `cubDeviceSegmentedScan` - `cub::DeviceSegmentedScan::ExclusiveSegmentedSum` and `InclusiveSegmentedScan` with a custom binary operator.
  * `cubDeviceTransform` - N-to-M `cub::DeviceTransform::Transform` where the op returns a `cuda::std::tuple`.
  * `libcuxxRandom` - `cuda::pcg64` and `cuda::std::philox4x32` engines driving the uniform, normal, Poisson, and Bernoulli distributions from `<cuda/std/random>`.
  * `libcuxxMdspan` - DLPack <-> `cuda::std::mdspan` bridging via `cuda::to_device_mdspan` / `cuda::to_dlpack_tensor`, plus `cuda::shared_memory_mdspan` for multi-dimensional views of shared memory.
* Added **cuda.compute 1.0 Python samples** under `python/2_CoreConcepts/`:
  * `cudaComputeLambdas` - Python lambdas / regular callables driving `reduce_into`, `unary_transform`, and `inclusive_scan` in `cuda.compute` (from the `cuda-cccl` package).
  * `binarySearch` - parallel `cuda.compute.upper_bound` / `lower_bound`, verified against `numpy.searchsorted`.
### CUDA 13.2 (update)
* Added **CUDA Python samples** under `python/`. These scripts use [CUDA Python](https://nvidia.github.io/cuda-python/) (including `cuda.core`) and are organized like the C++ tree: `1_GettingStarted`, `2_CoreConcepts`, `3_FrameworkInterop`, and `4_DistributedComputing`, plus shared helpers in `python/Utilities`. Each sample includes a `README.md` and `requirements.txt`. They are **not** built by the root CMake project; install dependencies with `pip install -r requirements.txt` in the sample directory, then run the corresponding `.py` file as documented in that sample's README.
* Renamed top-level `Samples` directory to `cpp` to accommodate Python samples alongside existing C++ samples; updated path references in `CMakeLists.txt`, `README.md`, and `Common` headers accordingly.

@ -354,7 +354,13 @@ inline char *sdkFindFilePath(const char *filename,
"../../../../Common/data/", // up 4 in tree
"../../../Common/data/", // up 3 in tree
- "../../Common/data/" // up 2 in tree
+ "../../Common/data/", // up 2 in tree
"../../../../cpp/9_CUDA_Tile/<executable_name>/", // up 4 in tree
"../../../cpp/9_CUDA_Tile/<executable_name>/", // up 3 in tree
"../../cpp/9_CUDA_Tile/<executable_name>/", // up 2 in tree
"../cpp/9_CUDA_Tile/<executable_name>/", // up 1 in tree
"./cpp/9_CUDA_Tile/<executable_name>/" // up 0 in tree
};
// Extract the executable name
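
The search-path entries added in this hunk carry an `<executable_name>` placeholder. The following is a minimal host-only sketch of how such a template could be expanded, assuming plain string substitution. `expand_search_path` and `tile_search_paths` are hypothetical names used for illustration only; they are not part of `helper_string.h`, and the real helper may resolve paths differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): replace every "<executable_name>"
// placeholder in a search-path template with the given executable name.
std::string expand_search_path(std::string path_template, const std::string &executable_name)
{
    assert(!executable_name.empty());
    const std::string placeholder = "<executable_name>";
    std::string::size_type pos = 0;
    while ((pos = path_template.find(placeholder, pos)) != std::string::npos) {
        path_template.replace(pos, placeholder.size(), executable_name);
        pos += executable_name.size(); // continue searching after the inserted name
    }
    return path_template;
}

// Expand the five CUDA Tile templates listed in the hunk for one executable.
std::vector<std::string> tile_search_paths(const std::string &executable_name)
{
    const char *templates[] = {
        "../../../../cpp/9_CUDA_Tile/<executable_name>/", // up 4 in tree
        "../../../cpp/9_CUDA_Tile/<executable_name>/",    // up 3 in tree
        "../../cpp/9_CUDA_Tile/<executable_name>/",       // up 2 in tree
        "../cpp/9_CUDA_Tile/<executable_name>/",          // up 1 in tree
        "./cpp/9_CUDA_Tile/<executable_name>/"            // up 0 in tree
    };
    std::vector<std::string> out;
    for (const char *t : templates)
        out.push_back(expand_search_path(t, executable_name));
    return out;
}
```

Calling `tile_search_paths` with a sample's executable name yields the candidate directories in the same order as the list above, from four levels up to the current directory.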

@ -1,6 +1,6 @@
# CUDA Samples
- Samples for CUDA Developers which demonstrates features in CUDA Toolkit. This version supports [CUDA Toolkit 13.2](https://developer.nvidia.com/cuda-downloads).
+ Samples for CUDA developers that demonstrate features in the CUDA Toolkit. This version supports [CUDA Toolkit 13.3](https://developer.nvidia.com/cuda-downloads).
## Release Notes
@ -181,10 +181,10 @@ QNX_HOST=/path/to/qnx/host \
QNX_TARGET=/path/to/qnx/target \
cmake .. \
-DBUILD_TEGRA=True \
- -DCMAKE_CUDA_COMPILER=/usr/local/cuda-safe-13.0/bin/nvcc \
+ -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.3/bin/nvcc \
-DCMAKE_TOOLCHAIN_FILE=../cmake/toolchains/toolchain-aarch64-qnx.cmake \
- -DCMAKE_LIBRARY_PATH=/usr/local/cuda-safe-13.0/thor/targets/aarch64-qnx/lib/stubs/ \
- -DCMAKE_INCLUDE_PATH=/usr/local/cuda-safe-13.0/thor/targets/aarch64-qnx/include/
+ -DCMAKE_LIBRARY_PATH=/usr/local/cuda-13.3/thor/targets/aarch64-qnx/lib/stubs/ \
+ -DCMAKE_INCLUDE_PATH=/usr/local/cuda-13.3/thor/targets/aarch64-qnx/include/
```
### Forward Compatibility
@ -476,6 +476,9 @@ Samples that demonstrate the use of libNVVM and NVVM IR.
### [8. Platform Specific](./cpp/8_Platform_Specific/Tegra/README.md)
Samples that are specific to certain platforms (Tegra, cuDLA, NvMedia, NvSci, OpenGL ES).
### [9. CUDA Tile](./cpp/9_CUDA_Tile/README.md)
Samples that demonstrate how to use CUDA Tile C++.
## Dependencies
Some CUDA Samples rely on third-party applications and/or libraries, or features provided by the CUDA Toolkit and Driver, to either build or execute. These dependencies are listed below.

1297
cmake/CPM.cmake Normal file

File diff suppressed because it is too large

@ -9,6 +9,9 @@ add_subdirectory(conjugateGradientMultiBlockCG)
add_subdirectory(conjugateGradientMultiDeviceCG)
add_subdirectory(conjugateGradientPrecond)
add_subdirectory(conjugateGradientUM)
add_subdirectory(cubDeviceFind)
add_subdirectory(cubDeviceSegmentedScan)
add_subdirectory(cubDeviceTransform)
add_subdirectory(cudaNvSci)
add_subdirectory(cuSolverDn_LinearSolver)
add_subdirectory(cuSolverRf)
@ -18,6 +21,8 @@ add_subdirectory(cuSolverSp_LowlevelQR)
add_subdirectory(freeImageInteropNPP)
add_subdirectory(histEqualizationNPP)
add_subdirectory(jitLto)
add_subdirectory(libcuxxMdspan)
add_subdirectory(libcuxxRandom)
add_subdirectory(lineOfSight)
add_subdirectory(matrixMulCUBLAS)
add_subdirectory(nvJPEG)

@ -0,0 +1,62 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(cubDeviceFind LANGUAGES C CXX CUDA)
# Disable response file for libraries on QNX as qcc does not support lib paths with double quotes
if(CMAKE_SYSTEM_NAME STREQUAL "QNX")
    set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES OFF)
endif()
find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (may significantly affect performance on some targets)
else()
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Fetch CCCL via CPM.
# Override with -DCCCL_SOURCE_DIR=/path/to/cccl to use a local checkout
set(CCCL_SAMPLES_CCCL_TAG "v3.3.3" CACHE STRING
    "Tag/branch of NVIDIA/cccl to fetch for the CCCL samples")
if(NOT TARGET CCCL::CCCL)
    include("${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/CPM.cmake")
    if(DEFINED CCCL_SOURCE_DIR AND NOT CCCL_SOURCE_DIR STREQUAL "")
        CPMAddPackage(NAME CCCL SOURCE_DIR "${CCCL_SOURCE_DIR}")
    else()
        CPMAddPackage(
            NAME CCCL
            GIT_REPOSITORY "https://github.com/NVIDIA/cccl"
            GIT_TAG "${CCCL_SAMPLES_CCCL_TAG}"
        )
    endif()
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
# Add target for cubDeviceFind
add_executable(cubDeviceFind cubDeviceFind.cu)
target_compile_options(cubDeviceFind PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
target_compile_features(cubDeviceFind PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(cubDeviceFind PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(cubDeviceFind PRIVATE
    CUDA::cudart
    CCCL::CCCL
)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

@ -0,0 +1,48 @@
# cubDeviceFind - CUB DeviceFind Search Algorithms
## Description
This sample demonstrates the three device-wide search algorithms: `cub::DeviceFind::FindIf` for predicate search, and `cub::DeviceFind::LowerBound` / `UpperBound` for parallel binary search. Results are verified against `std::find_if`, `std::lower_bound`, and `std::upper_bound` on the host.
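
The host-side checks described above can be sketched with the standard library alone. This is a minimal illustration of the expected index semantics, not the sample's code and not the CUB implementation; `find_if_index`, `lower_bound_indices`, and `upper_bound_indices` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Index of the first element satisfying pred (in.size() if none does).
template <typename Pred>
int find_if_index(const std::vector<int> &in, Pred pred)
{
    return static_cast<int>(std::find_if(in.begin(), in.end(), pred) - in.begin());
}

// For each value: index of the first element in sorted_range that is >= value.
std::vector<int> lower_bound_indices(const std::vector<int> &sorted_range, const std::vector<int> &values)
{
    assert(std::is_sorted(sorted_range.begin(), sorted_range.end()));
    std::vector<int> out;
    out.reserve(values.size());
    for (int v : values) {
        auto it = std::lower_bound(sorted_range.begin(), sorted_range.end(), v);
        out.push_back(static_cast<int>(it - sorted_range.begin()));
    }
    return out;
}

// For each value: index of the first element in sorted_range that is > value.
std::vector<int> upper_bound_indices(const std::vector<int> &sorted_range, const std::vector<int> &values)
{
    assert(std::is_sorted(sorted_range.begin(), sorted_range.end()));
    std::vector<int> out;
    out.reserve(values.size());
    for (int v : values) {
        auto it = std::upper_bound(sorted_range.begin(), sorted_range.end(), v);
        out.push_back(static_cast<int>(it - sorted_range.begin()));
    }
    return out;
}
```

The two bound functions differ only when a searched value occurs in the range: for the range `{0, 2, 2, 4, 6, 8}` and the value `2`, the lower bound is index 1 and the upper bound is index 3.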
## Key Concepts
CCCL 3.3, CUB Device Algorithms, Parallel Search, Binary Search
## Supported SM Architectures
[SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
## Supported OSes
Linux, Windows
## Supported CPU Architecture
x86_64, aarch64
## CUDA APIs involved
### [CCCL CUB](https://nvidia.github.io/cccl/unstable/cub/index.html)
cub::DeviceFind::FindIf, cub::DeviceFind::LowerBound, cub::DeviceFind::UpperBound
### [CCCL libcu++](https://nvidia.github.io/cccl/unstable/libcudacxx/index.html)
cuda::std::less
### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaDeviceSynchronize, cudaGetDeviceProperties
## Dependencies needed to build/run
[CCCL 3.3+](https://github.com/NVIDIA/cccl). Fetched automatically via CPM at configure time (pinned to `v3.3.3`). Override with `-DCCCL_SOURCE_DIR=/path/to/cccl` to use a local checkout.
## Prerequisites
Download and install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies mentioned in the [Dependencies]() section above are installed.
## References (for more details)
[CCCL 3.3 release notes](https://github.com/NVIDIA/cccl/releases), [cub::DeviceFind header](https://github.com/NVIDIA/cccl/blob/main/cub/cub/device/device_find.cuh)

@ -0,0 +1,248 @@
/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* This sample demonstrates the three device-wide search algorithms
* introduced in CCCL 3.3: cub::DeviceFind::FindIf for predicate search,
* and cub::DeviceFind::LowerBound / UpperBound for parallel binary
* search. Results are verified against std::find_if, std::lower_bound,
* and std::upper_bound on the host.
*/
/* Includes, system */
#include <algorithm>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
/* Includes, cuda */
#include <cuda_runtime.h>
#include <helper_cuda.h>
/* Includes, cccl */
#include <cub/device/device_find.cuh>
#include <cuda/std/functional>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
/* Predicate used with cub::DeviceFind::FindIf. */
struct is_greater_than_t
{
    int threshold;
    __host__ __device__ bool operator()(int value) const { return value > threshold; }
};
static bool run_find_if()
{
    /* Input: 0, 1, ..., 15. Predicate: value > 9. Expected index: 10. */
    const int num_items = 16;
    thrust::device_vector<int> d_in(num_items);
    for (int i = 0; i < num_items; ++i)
        d_in[i] = i;
    thrust::device_vector<int> d_out(1);
    is_greater_than_t predicate{9};
    /* First call: a null temp-storage pointer only queries the required size. */
    size_t temp_bytes = 0;
    checkCudaErrors(
        cub::DeviceFind::FindIf(nullptr, temp_bytes, d_in.begin(), d_out.begin(), predicate, num_items));
    /* Second call: run the search with the allocated temporary storage. */
    thrust::device_vector<char> temp(temp_bytes);
    checkCudaErrors(cub::DeviceFind::FindIf(thrust::raw_pointer_cast(temp.data()),
                                            temp_bytes,
                                            d_in.begin(),
                                            d_out.begin(),
                                            predicate,
                                            num_items));
    checkCudaErrors(cudaDeviceSynchronize());
    const int got = d_out[0];
    /* Host reference. */
    thrust::host_vector<int> h_in = d_in;
    auto host_it = std::find_if(h_in.begin(), h_in.end(),
                                [&](int v) { return v > predicate.threshold; });
    const int expected = static_cast<int>(host_it - h_in.begin());
    printf("cub::DeviceFind::FindIf(value > %d) over [0..%d)\n", predicate.threshold, num_items);
    printf(" got index = %d, expected = %d %s\n", got, expected, (got == expected ? "OK" : "FAIL"));
    return got == expected;
}
static bool run_lower_bound()
{
    /* Sorted range: [0, 2, 4, 6, 8]. Values to locate: [1, 3, 5, 7]. */
    thrust::device_vector<int> d_range = {0, 2, 4, 6, 8};
    thrust::device_vector<int> d_values = {1, 3, 5, 7};
    thrust::device_vector<int> d_out(d_values.size());
    size_t temp_bytes = 0;
    checkCudaErrors(cub::DeviceFind::LowerBound(nullptr,
                                                temp_bytes,
                                                d_range.begin(),
                                                static_cast<int>(d_range.size()),
                                                d_values.begin(),
                                                static_cast<int>(d_values.size()),
                                                d_out.begin(),
                                                cuda::std::less{}));
    thrust::device_vector<char> temp(temp_bytes);
    checkCudaErrors(cub::DeviceFind::LowerBound(thrust::raw_pointer_cast(temp.data()),
                                                temp_bytes,
                                                d_range.begin(),
                                                static_cast<int>(d_range.size()),
                                                d_values.begin(),
                                                static_cast<int>(d_values.size()),
                                                d_out.begin(),
                                                cuda::std::less{}));
    checkCudaErrors(cudaDeviceSynchronize());
    thrust::host_vector<int> h_range = d_range;
    thrust::host_vector<int> h_values = d_values;
    thrust::host_vector<int> got = d_out;
    std::vector<int> expected(h_values.size());
    for (size_t i = 0; i < h_values.size(); ++i) {
        expected[i] = static_cast<int>(
            std::lower_bound(h_range.begin(), h_range.end(), h_values[i]) - h_range.begin());
    }
    bool ok = true;
    printf("cub::DeviceFind::LowerBound\n");
    printf(" range = { 0, 2, 4, 6, 8 }\n");
    printf(" values = { 1, 3, 5, 7 }\n");
    printf(" got = {");
    for (size_t i = 0; i < got.size(); ++i) {
        printf(" %d", got[i]);
        if (got[i] != expected[i])
            ok = false;
    }
    printf(" }\n expect = {");
    for (size_t i = 0; i < expected.size(); ++i)
        printf(" %d", expected[i]);
    printf(" } %s\n", ok ? "OK" : "FAIL");
    return ok;
}
static bool run_upper_bound()
{
    /* Range with duplicates so LowerBound and UpperBound differ on values
     * that appear in the range. */
    thrust::device_vector<int> d_range = {0, 2, 2, 4, 6, 8};
    thrust::device_vector<int> d_values = {2, 2};
    thrust::device_vector<int> d_lb(d_values.size());
    thrust::device_vector<int> d_ub(d_values.size());
    size_t temp_bytes_lb = 0;
    checkCudaErrors(cub::DeviceFind::LowerBound(nullptr,
                                                temp_bytes_lb,
                                                d_range.begin(),
                                                static_cast<int>(d_range.size()),
                                                d_values.begin(),
                                                static_cast<int>(d_values.size()),
                                                d_lb.begin(),
                                                cuda::std::less{}));
    thrust::device_vector<char> temp_lb(temp_bytes_lb);
    checkCudaErrors(cub::DeviceFind::LowerBound(thrust::raw_pointer_cast(temp_lb.data()),
                                                temp_bytes_lb,
                                                d_range.begin(),
                                                static_cast<int>(d_range.size()),
                                                d_values.begin(),
                                                static_cast<int>(d_values.size()),
                                                d_lb.begin(),
                                                cuda::std::less{}));
    size_t temp_bytes_ub = 0;
    checkCudaErrors(cub::DeviceFind::UpperBound(nullptr,
                                                temp_bytes_ub,
                                                d_range.begin(),
                                                static_cast<int>(d_range.size()),
                                                d_values.begin(),
                                                static_cast<int>(d_values.size()),
                                                d_ub.begin(),
                                                cuda::std::less{}));
    thrust::device_vector<char> temp_ub(temp_bytes_ub);
    checkCudaErrors(cub::DeviceFind::UpperBound(thrust::raw_pointer_cast(temp_ub.data()),
                                                temp_bytes_ub,
                                                d_range.begin(),
                                                static_cast<int>(d_range.size()),
                                                d_values.begin(),
                                                static_cast<int>(d_values.size()),
                                                d_ub.begin(),
                                                cuda::std::less{}));
    checkCudaErrors(cudaDeviceSynchronize());
    thrust::host_vector<int> h_range = d_range;
    thrust::host_vector<int> h_values = d_values;
    thrust::host_vector<int> got_lb = d_lb;
    thrust::host_vector<int> got_ub = d_ub;
    std::vector<int> exp_lb(h_values.size());
    std::vector<int> exp_ub(h_values.size());
    for (size_t i = 0; i < h_values.size(); ++i) {
        exp_lb[i] =
            static_cast<int>(std::lower_bound(h_range.begin(), h_range.end(), h_values[i]) - h_range.begin());
        exp_ub[i] =
            static_cast<int>(std::upper_bound(h_range.begin(), h_range.end(), h_values[i]) - h_range.begin());
    }
    bool ok = true;
    printf("cub::DeviceFind::UpperBound (with duplicates in range)\n");
    printf(" range = { 0, 2, 2, 4, 6, 8 }\n");
    printf(" values = { 2, 2 }\n");
    printf(" lb = {");
    for (size_t i = 0; i < got_lb.size(); ++i) {
        printf(" %d", got_lb[i]);
        if (got_lb[i] != exp_lb[i])
            ok = false;
    }
    printf(" } expected = {");
    for (size_t i = 0; i < exp_lb.size(); ++i)
        printf(" %d", exp_lb[i]);
    printf(" }\n ub = {");
    for (size_t i = 0; i < got_ub.size(); ++i) {
        printf(" %d", got_ub[i]);
        if (got_ub[i] != exp_ub[i])
            ok = false;
    }
    printf(" } expected = {");
    for (size_t i = 0; i < exp_ub.size(); ++i)
        printf(" %d", exp_ub[i]);
    printf(" } %s\n", ok ? "OK" : "FAIL");
    return ok;
}
int main(int argc, char **argv)
{
    int devID = findCudaDevice(argc, (const char **)argv);
    cudaDeviceProp props;
    checkCudaErrors(cudaGetDeviceProperties(&props, devID));
    printf("Device: %s (Compute Capability %d.%d)\n\n", props.name, props.major, props.minor);
    bool ok = true;
    ok &= run_find_if();
    printf("\n");
    ok &= run_lower_bound();
    printf("\n");
    ok &= run_upper_bound();
    printf("\n%s\n", ok ? "Done" : "FAILED");
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}

@ -0,0 +1,63 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(cubDeviceSegmentedScan LANGUAGES C CXX CUDA)
# Disable response file for libraries on QNX as qcc does not support lib paths with double quotes
if(CMAKE_SYSTEM_NAME STREQUAL "QNX")
    set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES OFF)
endif()
find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (may significantly affect performance on some targets)
else()
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Fetch CCCL via CPM.
# Override with -DCCCL_SOURCE_DIR=/path/to/cccl to use a local checkout
# instead of fetching from GitHub.
set(CCCL_SAMPLES_CCCL_TAG "v3.3.3" CACHE STRING
    "Tag/branch of NVIDIA/cccl to fetch for the CCCL samples")
if(NOT TARGET CCCL::CCCL)
    include("${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/CPM.cmake")
    if(DEFINED CCCL_SOURCE_DIR AND NOT CCCL_SOURCE_DIR STREQUAL "")
        CPMAddPackage(NAME CCCL SOURCE_DIR "${CCCL_SOURCE_DIR}")
    else()
        CPMAddPackage(
            NAME CCCL
            GIT_REPOSITORY "https://github.com/NVIDIA/cccl"
            GIT_TAG "${CCCL_SAMPLES_CCCL_TAG}"
        )
    endif()
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
# Add target for cubDeviceSegmentedScan
add_executable(cubDeviceSegmentedScan cubDeviceSegmentedScan.cu)
target_compile_options(cubDeviceSegmentedScan PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
target_compile_features(cubDeviceSegmentedScan PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(cubDeviceSegmentedScan PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(cubDeviceSegmentedScan PRIVATE
    CUDA::cudart
    CCCL::CCCL
)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

@ -0,0 +1,48 @@
# cubDeviceSegmentedScan - CUB DeviceSegmentedScan
## Description
This sample demonstrates `cub::DeviceSegmentedScan`. A segmented scan computes an independent scan over each of many contiguous segments in a single device-wide call. Two operations are shown: `ExclusiveSegmentedSum` across three independent segments, and `InclusiveSegmentedScan` with a custom binary operator (running maximum via `cuda::maximum<>`).
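
To make the per-segment semantics concrete, here is a minimal host-only sketch, assuming segments are described by an offsets array in which segment `s` covers the half-open index range `[offsets[s], offsets[s + 1])`. `segmented_exclusive_scan` and `segmented_inclusive_scan` are hypothetical reference functions written for illustration; they are not the CUB API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference: exclusive scan restarted at every segment.
// Each output element holds the combination of all earlier elements in its
// segment, starting from init.
template <typename BinaryOp>
std::vector<int> segmented_exclusive_scan(const std::vector<int> &in,
                                          const std::vector<std::size_t> &offsets,
                                          BinaryOp op,
                                          int init)
{
    std::vector<int> out(in.size(), 0);
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1] && offsets[s + 1] <= in.size());
        int running = init; // every segment restarts from the initial value
        for (std::size_t i = offsets[s]; i < offsets[s + 1]; ++i) {
            out[i]  = running;            // write first, then combine: exclusive
            running = op(running, in[i]);
        }
    }
    return out;
}

// Hypothetical host reference: inclusive scan restarted at every segment.
template <typename BinaryOp>
std::vector<int> segmented_inclusive_scan(const std::vector<int> &in,
                                          const std::vector<std::size_t> &offsets,
                                          BinaryOp op)
{
    std::vector<int> out(in.size(), 0);
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1] && offsets[s + 1] <= in.size());
        for (std::size_t i = offsets[s]; i < offsets[s + 1]; ++i) {
            // The first element of a segment is copied as-is; later elements
            // fold in the previous output of the same segment.
            out[i] = (i == offsets[s]) ? in[i] : op(out[i - 1], in[i]);
        }
    }
    return out;
}
```

With offsets `{0, 3, 5, 8}`, an exclusive sum over `{1, 2, 3, 4, 5, 6, 7, 8}` gives `{0, 1, 3, 0, 4, 0, 6, 13}`, and an inclusive running maximum over `{3, 1, 4, 5, 2, 9, 7, 8}` gives `{3, 3, 4, 5, 5, 9, 9, 9}`.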
## Key Concepts
CUB Device Algorithms, Segmented Scan, Prefix Sum
## Supported SM Architectures
[SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
## Supported OSes
Linux, Windows
## Supported CPU Architecture
x86_64, aarch64
## CUDA APIs involved
### [CCCL CUB](https://nvidia.github.io/cccl/cub/)
cub::DeviceSegmentedScan::ExclusiveSegmentedSum, cub::DeviceSegmentedScan::InclusiveSegmentedScan
### [CCCL libcu++](https://nvidia.github.io/cccl/libcudacxx/)
cuda::maximum
### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaDeviceSynchronize, cudaGetDeviceProperties
## Dependencies needed to build/run
[CCCL 3.3+](https://github.com/NVIDIA/cccl). Fetched automatically via CPM at configure time (pinned to `v3.3.3`). Override with `-DCCCL_SOURCE_DIR=/path/to/cccl` to use a local checkout.
## Prerequisites
Download and install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies mentioned in the [Dependencies]() section above are installed.
## References (for more details)
[CCCL 3.3 release notes](https://github.com/NVIDIA/cccl/releases), [cub::DeviceSegmentedScan header](https://github.com/NVIDIA/cccl/blob/main/cub/cub/device/device_segmented_scan.cuh)

@ -0,0 +1,188 @@
/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* This sample demonstrates cub::DeviceSegmentedScan, added in CCCL 3.3.
* Two operations are shown: ExclusiveSegmentedSum across three independent
* segments, and InclusiveSegmentedScan with a custom binary operator
* (running maximum via cuda::maximum<>). Each is verified against a
* host reference implementation.
*/
/* Includes, system */
#include <algorithm>
#include <limits>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
/* Includes, cuda */
#include <cuda_runtime.h>
#include <helper_cuda.h>
/* Includes, cccl */
#include <cub/device/device_segmented_scan.cuh>
#include <cuda/functional>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
template <typename T>
static void print_vec(const char *label, const std::vector<T> &v)
{
    printf(" %-24s{", label);
    for (size_t i = 0; i < v.size(); ++i)
        printf(" %d", static_cast<int>(v[i]));
    printf(" }\n");
}
static std::vector<int> host_exclusive_segmented_sum(const std::vector<int> &input, const std::vector<size_t> &offsets)
{
    std::vector<int> out(input.size(), 0);
    for (size_t s = 0; s + 1 < offsets.size(); ++s) {
        int running = 0;
        for (size_t i = offsets[s]; i < offsets[s + 1]; ++i) {
            out[i] = running;
            running += input[i];
        }
    }
    return out;
}
static std::vector<int> host_inclusive_segmented_max(const std::vector<int> &input, const std::vector<size_t> &offsets)
{
    std::vector<int> out(input.size(), 0);
    for (size_t s = 0; s + 1 < offsets.size(); ++s) {
        int running = std::numeric_limits<int>::min();
        for (size_t i = offsets[s]; i < offsets[s + 1]; ++i) {
            running = std::max(running, input[i]);
            out[i] = running;
        }
    }
    return out;
}
static bool run_exclusive_segmented_sum()
{
    /* 3 segments: [1,2,3] [4,5] [6,7,8] -> [0,1,3] [0,4] [0,6,13]. */
    thrust::device_vector<int> d_in = {1, 2, 3, 4, 5, 6, 7, 8};
    thrust::device_vector<size_t> d_offsets = {0, 3, 5, 8};
    thrust::device_vector<int> d_out(d_in.size());
    const auto num_segments = d_offsets.size() - 1;
    /* Segment s spans [d_offsets[s], d_offsets[s + 1]), so the end offsets
     * are the same array shifted by one element. */
    auto begin_offsets = d_offsets.begin();
    auto end_offsets = d_offsets.begin() + 1;
    size_t temp_bytes = 0;
    checkCudaErrors(cub::DeviceSegmentedScan::ExclusiveSegmentedSum(
        nullptr, temp_bytes, d_in.begin(), d_out.begin(), begin_offsets, end_offsets, num_segments));
    thrust::device_vector<char> temp(temp_bytes);
    checkCudaErrors(cub::DeviceSegmentedScan::ExclusiveSegmentedSum(thrust::raw_pointer_cast(temp.data()),
                                                                    temp_bytes,
                                                                    d_in.begin(),
                                                                    d_out.begin(),
                                                                    begin_offsets,
                                                                    end_offsets,
                                                                    num_segments));
    checkCudaErrors(cudaDeviceSynchronize());
    std::vector<int> h_in(d_in.begin(), d_in.end());
    std::vector<size_t> h_off(d_offsets.begin(), d_offsets.end());
    std::vector<int> got(d_out.begin(), d_out.end());
    std::vector<int> expected = host_exclusive_segmented_sum(h_in, h_off);
    printf("cub::DeviceSegmentedScan::ExclusiveSegmentedSum\n");
    print_vec("input:", h_in);
    printf(" %-24s{", "offsets:");
    for (auto o : h_off)
        printf(" %zu", o);
    printf(" }\n");
    print_vec("got:", got);
    print_vec("expected:", expected);
    const bool ok = got == expected;
    printf(" %s\n", ok ? "OK" : "FAIL");
    return ok;
}
static bool run_inclusive_segmented_max()
{
    /* Same segment offsets as above, with different input values:
     * compute the running max per segment. */
    thrust::device_vector<int> d_in = {3, 1, 4, 5, 2, 9, 7, 8};
    thrust::device_vector<size_t> d_offsets = {0, 3, 5, 8};
    thrust::device_vector<int> d_out(d_in.size());
    const auto num_segments = d_offsets.size() - 1;
    auto begin_offsets = d_offsets.begin();
    auto end_offsets = d_offsets.begin() + 1;
    auto max_op = [] __host__ __device__(int a, int b) -> int { return cuda::maximum<>{}(a, b); };
    size_t temp_bytes = 0;
    checkCudaErrors(cub::DeviceSegmentedScan::InclusiveSegmentedScan(
        nullptr, temp_bytes, d_in.begin(), d_out.begin(), begin_offsets, end_offsets, num_segments, max_op));
    thrust::device_vector<char> temp(temp_bytes);
    checkCudaErrors(cub::DeviceSegmentedScan::InclusiveSegmentedScan(thrust::raw_pointer_cast(temp.data()),
                                                                     temp_bytes,
                                                                     d_in.begin(),
                                                                     d_out.begin(),
                                                                     begin_offsets,
                                                                     end_offsets,
                                                                     num_segments,
                                                                     max_op));
    checkCudaErrors(cudaDeviceSynchronize());
    std::vector<int> h_in(d_in.begin(), d_in.end());
    std::vector<size_t> h_off(d_offsets.begin(), d_offsets.end());
    std::vector<int> got(d_out.begin(), d_out.end());
    std::vector<int> expected = host_inclusive_segmented_max(h_in, h_off);
    printf("cub::DeviceSegmentedScan::InclusiveSegmentedScan (running max)\n");
    print_vec("input:", h_in);
    printf(" %-24s{", "offsets:");
    for (auto o : h_off)
        printf(" %zu", o);
    printf(" }\n");
    print_vec("got:", got);
    print_vec("expected:", expected);
    const bool ok = got == expected;
    printf(" %s\n", ok ? "OK" : "FAIL");
    return ok;
}
int main(int argc, char **argv)
{
    int devID = findCudaDevice(argc, (const char **)argv);
    cudaDeviceProp props;
    checkCudaErrors(cudaGetDeviceProperties(&props, devID));
    printf("Device: %s (Compute Capability %d.%d)\n\n", props.name, props.major, props.minor);
    bool ok = true;
    ok &= run_exclusive_segmented_sum();
    printf("\n");
    ok &= run_inclusive_segmented_max();
    printf("\n%s\n", ok ? "Done" : "FAILED");
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}

@ -0,0 +1,62 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(cubDeviceTransform LANGUAGES C CXX CUDA)
# Disable response file for libraries on QNX as qcc does not support lib paths with double quotes
if(CMAKE_SYSTEM_NAME STREQUAL "QNX")
    set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES OFF)
endif()
find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (may significantly affect performance on some targets)
else()
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Fetch CCCL via CPM.
# Override with -DCCCL_SOURCE_DIR=/path/to/cccl to use a local checkout
set(CCCL_SAMPLES_CCCL_TAG "v3.3.3" CACHE STRING
    "Tag/branch of NVIDIA/cccl to fetch for the CCCL samples")
if(NOT TARGET CCCL::CCCL)
    include("${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/CPM.cmake")
    if(DEFINED CCCL_SOURCE_DIR AND NOT CCCL_SOURCE_DIR STREQUAL "")
        CPMAddPackage(NAME CCCL SOURCE_DIR "${CCCL_SOURCE_DIR}")
    else()
        CPMAddPackage(
            NAME CCCL
            GIT_REPOSITORY "https://github.com/NVIDIA/cccl"
            GIT_TAG "${CCCL_SAMPLES_CCCL_TAG}"
        )
    endif()
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
# Add target for cubDeviceTransform
add_executable(cubDeviceTransform cubDeviceTransform.cu)
target_compile_options(cubDeviceTransform PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
target_compile_features(cubDeviceTransform PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(cubDeviceTransform PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(cubDeviceTransform PRIVATE
    CUDA::cudart
    CCCL::CCCL
)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

@ -0,0 +1,48 @@
# cubDeviceTransform - CUB DeviceTransform N-to-M
## Description
This sample demonstrates `cub::DeviceTransform` in its N-input / M-output form. A single device-wide call reads from N input sequences and writes to M output sequences, driven by a user-provided op that returns a `cuda::std::tuple` of M values. Two cases are shown: N=3 inputs producing 1 output, and N=2 inputs producing 2 outputs (sum and difference in one fused pass).
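
The N=2 inputs / 2 outputs case can be pictured with a host-only sketch in which the op returns a tuple and each tuple element is routed to its own output sequence. This assumes `std::tuple` as a stand-in for `cuda::std::tuple`; `transform_2_to_2` and `sum_and_difference` are hypothetical names for illustration and do not reflect the sample's actual code or the CUB call signature.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical op: one evaluation produces both the sum and the difference.
inline std::tuple<int, int> sum_and_difference(int x, int y)
{
    return std::make_tuple(x + y, x - y);
}

// Hypothetical host analogue of a 2-input / 2-output elementwise transform:
// a single pass over the inputs fills both output sequences.
template <typename Op>
std::pair<std::vector<int>, std::vector<int>>
transform_2_to_2(const std::vector<int> &a, const std::vector<int> &b, Op op)
{
    assert(a.size() == b.size());
    std::vector<int> out0(a.size());
    std::vector<int> out1(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Unpack the tuple: element 0 goes to out0, element 1 to out1.
        std::tie(out0[i], out1[i]) = op(a[i], b[i]);
    }
    return {out0, out1};
}
```

The point of the fused form is that each input element is read once while both outputs are produced, instead of running one pass for the sum and a second pass for the difference.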
## Key Concepts
CCCL 3.3, CUB Device Algorithms, Fused Elementwise Transforms, Counting Iterators
## Supported SM Architectures
[SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
## Supported OSes
Linux, Windows
## Supported CPU Architecture
x86_64, aarch64
## CUDA APIs involved
### [CCCL CUB](https://nvidia.github.io/cccl/cub/)
cub::DeviceTransform::Transform
### [CCCL libcu++](https://nvidia.github.io/cccl/libcudacxx/)
cuda::counting_iterator, cuda::std::tuple
### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaDeviceSynchronize, cudaGetDeviceProperties
## Dependencies needed to build/run
[CCCL 3.3+](https://github.com/NVIDIA/cccl). Fetched automatically via CPM at configure time (pinned to `v3.3.3`). Override with `-DCCCL_SOURCE_DIR=/path/to/cccl` to use a local checkout.
## Prerequisites
Download and install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies listed in the "Dependencies needed to build/run" section above are available.
## References (for more details)
[CCCL 3.3 release notes](https://github.com/NVIDIA/cccl/releases), [cub::DeviceTransform header](https://github.com/NVIDIA/cccl/blob/main/cub/cub/device/device_transform.cuh)


@ -0,0 +1,155 @@
/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* This sample demonstrates cub::DeviceTransform in its N-input/M-output
* form (extended in CCCL 3.3). A single device-wide call reads from
* N input sequences and writes to M output sequences, driven by a
* user-provided op that returns a tuple of M values. Two cases are
* shown: N=3 -> 1 and N=2 -> 2. Results are verified against a host
* reference.
*/
/* Includes, system */
#include <stdio.h>
#include <stdlib.h>
#include <vector>
/* Includes, cuda */
#include <cuda_runtime.h>
#include <helper_cuda.h>
/* Includes, cccl */
#include <cub/device/device_transform.cuh>
#include <cuda/iterator>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
static bool run_n_to_one_transform()
{
/* result[i] = (a[i] + b[i]) * c[i], with c = counting_iterator<int>(100). */
thrust::device_vector<int> a = {0, -2, 5, 3};
thrust::device_vector<float> b = {5.2f, 3.1f, -1.1f, 3.0f};
auto counting = cuda::counting_iterator<int>{100};
thrust::device_vector<int> result(a.size());
auto op = [] __host__ __device__(int x, float y, int z) -> int {
return static_cast<int>((x + y) * z);
};
checkCudaErrors(cub::DeviceTransform::Transform(
cuda::std::tuple{a.begin(), b.begin(), counting}, result.begin(), a.size(), op));
checkCudaErrors(cudaDeviceSynchronize());
thrust::host_vector<int> ha = a;
thrust::host_vector<float> hb = b;
thrust::host_vector<int> got = result;
std::vector<int> expected(a.size());
for (size_t i = 0; i < a.size(); ++i) {
expected[i] = static_cast<int>((ha[i] + hb[i]) * static_cast<int>(100 + i));
}
bool ok = true;
printf("cub::DeviceTransform::Transform (N=3 inputs -> 1 output)\n");
printf(" result = (a + b) * c with c = counting_iterator(100)\n");
printf(" got = {");
for (size_t i = 0; i < got.size(); ++i) {
printf(" %d", got[i]);
if (got[i] != expected[i])
ok = false;
}
printf(" }\n expected = {");
for (size_t i = 0; i < expected.size(); ++i)
printf(" %d", expected[i]);
printf(" } %s\n", ok ? "OK" : "FAIL");
return ok;
}
static bool run_n_to_m_transform()
{
/* (sum[i], diff[i]) = (a[i] + b[i], a[i] - b[i]) in one pass. */
thrust::device_vector<int> a = {1, 5, 10, 7, 3};
thrust::device_vector<int> b = {4, 2, 8, 1, 9};
thrust::device_vector<int> sum(a.size());
thrust::device_vector<int> diff(a.size());
auto op = [] __host__ __device__(int x, int y) -> cuda::std::tuple<int, int> {
return {x + y, x - y};
};
checkCudaErrors(cub::DeviceTransform::Transform(cuda::std::tuple{a.begin(), b.begin()},
cuda::std::tuple{sum.begin(), diff.begin()},
a.size(),
op));
checkCudaErrors(cudaDeviceSynchronize());
thrust::host_vector<int> ha = a, hb = b, got_sum = sum, got_diff = diff;
std::vector<int> exp_sum(a.size()), exp_diff(a.size());
for (size_t i = 0; i < a.size(); ++i) {
exp_sum[i] = ha[i] + hb[i];
exp_diff[i] = ha[i] - hb[i];
}
bool ok = true;
printf("cub::DeviceTransform::Transform (N=2 inputs -> M=2 outputs)\n");
printf(" op returns cuda::std::tuple{a + b, a - b}\n");
printf(" sum = {");
for (size_t i = 0; i < got_sum.size(); ++i) {
printf(" %d", got_sum[i]);
if (got_sum[i] != exp_sum[i])
ok = false;
}
printf(" } expected = {");
for (auto v : exp_sum)
printf(" %d", v);
printf(" }\n diff = {");
for (size_t i = 0; i < got_diff.size(); ++i) {
printf(" %d", got_diff[i]);
if (got_diff[i] != exp_diff[i])
ok = false;
}
printf(" } expected = {");
for (auto v : exp_diff)
printf(" %d", v);
printf(" } %s\n", ok ? "OK" : "FAIL");
return ok;
}
int main(int argc, char **argv)
{
int devID = findCudaDevice(argc, (const char **)argv);
cudaDeviceProp props;
checkCudaErrors(cudaGetDeviceProperties(&props, devID));
printf("Device: %s (Compute Capability %d.%d)\n\n", props.name, props.major, props.minor);
bool ok = true;
ok &= run_n_to_one_transform();
printf("\n");
ok &= run_n_to_m_transform();
printf("\n%s\n", ok ? "Done" : "FAILED");
return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}


@ -0,0 +1,92 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(libcuxxMdspan LANGUAGES C CXX CUDA)
# Disable response file for libraries on QNX as qcc does not support lib paths with double quotes
if(CMAKE_SYSTEM_NAME STREQUAL "QNX")
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES OFF)
endif()
find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (may significantly affect performance on some targets)
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Fetch CCCL (CUB + libcu++ + Thrust) via CPM. The toolkit that ships
# with CUDA 13.2 bundles CCCL 3.2, but this sample uses APIs added in
# CCCL 3.3. Pinning the tag here lets the sample build on any toolkit
# with a usable nvcc. Override with -DCCCL_SOURCE_DIR=/path/to/cccl
# to use a local checkout instead of fetching from GitHub.
set(CCCL_SAMPLES_CCCL_TAG "v3.3.3" CACHE STRING
"Tag/branch of NVIDIA/cccl to fetch for the CCCL samples")
if(NOT TARGET CCCL::CCCL)
include("${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/CPM.cmake")
if(DEFINED CCCL_SOURCE_DIR AND NOT CCCL_SOURCE_DIR STREQUAL "")
CPMAddPackage(NAME CCCL SOURCE_DIR "${CCCL_SOURCE_DIR}")
else()
CPMAddPackage(
NAME CCCL
GIT_REPOSITORY "https://github.com/NVIDIA/cccl"
GIT_TAG "${CCCL_SAMPLES_CCCL_TAG}"
)
endif()
endif()
# DLPack headers are required for cuda::to_device_mdspan and
# cuda::to_dlpack_tensor. Fetch via CPM (override with DLPACK_SOURCE_DIR).
set(CCCL_SAMPLES_DLPACK_TAG "v1.3" CACHE STRING
"Tag of dmlc/dlpack to fetch for libcuxxMdspan")
if(NOT TARGET dlpack::dlpack)
if(NOT COMMAND CPMAddPackage)
include("${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/CPM.cmake")
endif()
if(DEFINED DLPACK_SOURCE_DIR AND NOT DLPACK_SOURCE_DIR STREQUAL "")
CPMAddPackage(NAME dlpack SOURCE_DIR "${DLPACK_SOURCE_DIR}" DOWNLOAD_ONLY YES)
else()
CPMAddPackage(
NAME dlpack
GIT_REPOSITORY "https://github.com/dmlc/dlpack"
GIT_TAG "${CCCL_SAMPLES_DLPACK_TAG}"
DOWNLOAD_ONLY YES
)
endif()
if(dlpack_ADDED)
add_library(dlpack_headers INTERFACE)
target_include_directories(dlpack_headers INTERFACE "${dlpack_SOURCE_DIR}/include")
add_library(dlpack::dlpack ALIAS dlpack_headers)
endif()
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
# Add target for libcuxxMdspan
add_executable(libcuxxMdspan libcuxxMdspan.cu)
target_compile_options(libcuxxMdspan PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
target_compile_features(libcuxxMdspan PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(libcuxxMdspan PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(libcuxxMdspan PRIVATE
CUDA::cudart
CCCL::CCCL
dlpack::dlpack
)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()


@ -0,0 +1,44 @@
# libcuxxMdspan - libcu++ mdspan Interop (DLPack + shared_memory_mdspan)
## Description
This sample demonstrates two mdspan-centric features from CCCL 3.3: DLPack <-> `cuda::std::mdspan` bridging via `cuda::to_device_mdspan` / `cuda::to_dlpack_tensor` (DLPack is the tensor-interchange protocol used by PyTorch, JAX, CuPy, and others), and `cuda::shared_memory_mdspan` for multi-dimensional views of shared-memory tiles with address-space-safe accessors. A small matrix is built on the device, wrapped in a DLTensor, converted to a `device_mdspan`, scaled row-wise, and transposed through a `shared_memory_mdspan` tile. The output mdspan is then converted back to DLPack and its metadata is printed.
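The shape/stride metadata that a DLTensor carries, and the two operations the sample applies to it, can be sketched on the host as follows. This is an illustrative stand-in under stated assumptions: `View2D`, `make_row_major_view`, `scale_rows`, and `transpose` are hypothetical names, and the code uses neither DLPack nor `cuda::std::mdspan`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a 2-D view described by shape and
// strides. As in DLPack, strides are counted in elements, not bytes.
struct View2D {
    float* data;
    std::array<std::int64_t, 2> shape;    // {rows, cols}
    std::array<std::int64_t, 2> strides;  // {row stride, column stride}

    float& operator()(std::int64_t r, std::int64_t c) const
    {
        assert(r >= 0 && r < shape[0] && c >= 0 && c < shape[1]);
        return data[r * strides[0] + c * strides[1]];
    }
};

// Row-major layout: stepping one row skips `cols` elements.
inline View2D make_row_major_view(float* data, std::int64_t rows, std::int64_t cols)
{
    return View2D{data, {rows, cols}, {cols, 1}};
}

// Multiply row i by (i + 1), the same arithmetic as the row-scaling kernel.
inline void scale_rows(View2D t)
{
    for (std::int64_t r = 0; r < t.shape[0]; ++r)
        for (std::int64_t c = 0; c < t.shape[1]; ++c)
            t(r, c) *= static_cast<float>(r + 1);
}

// out(c, r) = in(r, c); the device version stages each tile in shared memory.
inline void transpose(View2D in, View2D out)
{
    assert(out.shape[0] == in.shape[1] && out.shape[1] == in.shape[0]);
    for (std::int64_t r = 0; r < in.shape[0]; ++r)
        for (std::int64_t c = 0; c < in.shape[1]; ++c)
            out(c, r) = in(r, c);
}
```

Because the view stores strides explicitly, the same indexing code would also work for a non-contiguous layout, which is the reason an interchange format carries strides alongside the shape.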
## Key Concepts
CCCL 3.3, libcu++ mdspan, DLPack Interoperability, Shared Memory Views
## Supported SM Architectures
[SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
## Supported OSes
Linux, Windows
## Supported CPU Architecture
x86_64, aarch64
## CUDA APIs involved
### [CCCL libcu++](https://nvidia.github.io/cccl/libcudacxx/)
cuda::to_device_mdspan, cuda::to_dlpack_tensor, cuda::device_mdspan, cuda::shared_memory_mdspan, cuda::std::mdspan
### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaMalloc, cudaFree, cudaMemcpy, cudaMemset, cudaDeviceSynchronize, cudaGetDeviceProperties
## Dependencies needed to build/run
[CCCL 3.3+](https://github.com/NVIDIA/cccl), [DLPack 1.2+](https://github.com/dmlc/dlpack). Both fetched automatically via CPM at configure time (pinned to `v3.3.3` and `v1.3` respectively). Override with `-DCCCL_SOURCE_DIR=/path/to/cccl` and `-DDLPACK_SOURCE_DIR=/path/to/dlpack` to use local checkouts.
## Prerequisites
Download and install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies listed in the "Dependencies needed to build/run" section above are available.
## References (for more details)
[CCCL 3.3 release notes](https://github.com/NVIDIA/cccl/releases), [cuda::to_device_mdspan header](https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__mdspan/dlpack_to_mdspan.h), [cuda::shared_memory_mdspan docs](https://nvidia.github.io/cccl/libcudacxx/extended_api/mdspan/shared_memory_accessor.html), [DLPack specification](https://dmlc.github.io/dlpack/latest/)


@ -0,0 +1,246 @@
/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* This sample demonstrates two mdspan-centric features from CCCL 3.3:
*
* 1. DLPack <-> cuda::std::mdspan bridging through
* cuda::to_device_mdspan<T, Rank>(DLTensor) -> cuda::device_mdspan
* cuda::to_dlpack_tensor(device_mdspan) -> DLManagedTensor
* The DLPack format is the interchange protocol used by PyTorch,
* JAX, CuPy, and other frameworks; cuda::device_mdspan is the
* device-side view with rich shape/stride metadata for kernels.
*
* 2. cuda::shared_memory_mdspan: a multi-dimensional view over a
* shared-memory tile. The accessor guarantees shared-memory
* load/store instructions and adds address-space safety checks.
*
* A sample matrix is built on the device, wrapped in a DLTensor,
* converted to a cuda::device_mdspan, and two kernels run against it:
* scale_rows_kernel multiplies row i by (i + 1), and
* shared_tile_transpose_kernel uses a cuda::shared_memory_mdspan to
* transpose a block-sized tile through shared memory. The output
* mdspan is then converted back to DLPack metadata and printed.
*/
/* Includes, system */
#include <stdio.h>
#include <stdlib.h>
#include <vector>
/* Includes, cuda */
#include <cuda_runtime.h>
#include <helper_cuda.h>
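/* Includes, dlpack (DLTensor, DLDevice, DLDataType are used directly below) */
#include <dlpack/dlpack.h>
/* Includes, cccl */
#include <cuda/mdspan>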
#include <cuda/std/array>
#include <cuda/std/cstdint>
#include <cuda/std/mdspan>
#define ROWS 8
#define COLS 8
#define TILE 8 /* matches ROWS / COLS for simplicity */
using extents2d = cuda::std::dextents<cuda::std::size_t, 2>;
/* Kernel A: multiply row i of a 2-D device_mdspan by (i + 1). Templated
* on the mdspan type so it accepts the exact type produced by
* cuda::to_device_mdspan (which uses layout_stride_relaxed and int64_t
* extents). */
template <typename Tensor>
__global__ void scale_rows_kernel(Tensor tensor)
{
const int r = blockIdx.y * blockDim.y + threadIdx.y;
const int c = blockIdx.x * blockDim.x + threadIdx.x;
if (r < static_cast<int>(tensor.extent(0)) && c < static_cast<int>(tensor.extent(1))) {
tensor(r, c) *= static_cast<float>(r + 1);
}
}
/* Kernel B: block-tile transpose driven by a shared_memory_mdspan.
* Each block loads a TILE x TILE tile from the input into shared memory
* through a cuda::shared_memory_mdspan, transposes in shared, and writes
* to the output. */
template <typename InTensor, typename OutTensor>
__global__ void shared_tile_transpose_kernel(InTensor in, OutTensor out)
{
__shared__ float smem_storage[TILE * TILE];
cuda::shared_memory_mdspan smem(smem_storage, cuda::std::dextents<cuda::std::size_t, 2>{TILE, TILE});
const int tr = threadIdx.y;
const int tc = threadIdx.x;
const int r = blockIdx.y * TILE + tr;
const int c = blockIdx.x * TILE + tc;
if (r < static_cast<int>(in.extent(0)) && c < static_cast<int>(in.extent(1))) {
smem(tr, tc) = in(r, c);
}
__syncthreads();
const int r_out = blockIdx.x * TILE + tr;
const int c_out = blockIdx.y * TILE + tc;
if (r_out < static_cast<int>(out.extent(0)) && c_out < static_cast<int>(out.extent(1))) {
out(r_out, c_out) = smem(tc, tr);
}
}
struct DLTensorStorage
{
::DLTensor tensor{};
cuda::std::array<cuda::std::int64_t, 2> shape{};
cuda::std::array<cuda::std::int64_t, 2> strides{};
};
static DLTensorStorage make_row_major_dltensor(float *device_ptr, int rows, int cols, int device_ordinal)
{
DLTensorStorage s;
s.shape = {rows, cols};
s.strides = {cols, 1};
s.tensor.data = device_ptr;
s.tensor.device = ::DLDevice{::kDLCUDA, device_ordinal};
s.tensor.ndim = 2;
s.tensor.dtype = ::DLDataType{::DLDataTypeCode::kDLFloat, 32, 1};
s.tensor.shape = s.shape.data();
s.tensor.strides = s.strides.data();
s.tensor.byte_offset = 0;
return s;
}
int main(int argc, char **argv)
{
int devID = findCudaDevice(argc, (const char **)argv);
cudaDeviceProp props;
checkCudaErrors(cudaGetDeviceProperties(&props, devID));
printf("Device: %s (Compute Capability %d.%d)\n\n", props.name, props.major, props.minor);
float *d_in = nullptr;
float *d_out = nullptr;
const size_t nelem = static_cast<size_t>(ROWS) * COLS;
checkCudaErrors(cudaMalloc(&d_in, nelem * sizeof(float)));
checkCudaErrors(cudaMalloc(&d_out, nelem * sizeof(float)));
std::vector<float> host(nelem);
for (int r = 0; r < ROWS; ++r) {
for (int c = 0; c < COLS; ++c) {
host[r * COLS + c] = static_cast<float>(r * COLS + c);
}
}
checkCudaErrors(cudaMemcpy(d_in, host.data(), nelem * sizeof(float), cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemset(d_out, 0, nelem * sizeof(float)));
DLTensorStorage in_dl = make_row_major_dltensor(d_in, ROWS, COLS, devID);
DLTensorStorage out_dl = make_row_major_dltensor(d_out, ROWS, COLS, devID);
auto in_md = cuda::to_device_mdspan<float, 2>(in_dl.tensor);
auto out_md = cuda::to_device_mdspan<float, 2>(out_dl.tensor);
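/* extent() may return a signed 64-bit type here, so cast explicitly instead of using %zu. */
printf("cuda::to_device_mdspan produced a 2-D device_mdspan of shape (%lld, %lld)\n\n",
static_cast<long long>(in_md.extent(0)),
static_cast<long long>(in_md.extent(1)));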
dim3 block(8, 8);
dim3 grid((COLS + block.x - 1) / block.x, (ROWS + block.y - 1) / block.y);
scale_rows_kernel<<<grid, block>>>(in_md);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
std::vector<float> scaled(nelem);
checkCudaErrors(cudaMemcpy(scaled.data(), d_in, nelem * sizeof(float), cudaMemcpyDeviceToHost));
bool scale_ok = true;
for (int r = 0; r < ROWS && scale_ok; ++r) {
for (int c = 0; c < COLS && scale_ok; ++c) {
const float expect = static_cast<float>((r * COLS + c) * (r + 1));
if (scaled[r * COLS + c] != expect) {
printf("scale_rows mismatch at (%d,%d): got %g expected %g\n",
r,
c,
scaled[r * COLS + c],
expect);
scale_ok = false;
}
}
}
if (scale_ok) {
printf("scale_rows kernel: OK (row i scaled by i+1 via cuda::device_mdspan)\n");
}
cuda::device_mdspan<const float, extents2d> in_md_const(d_in, extents2d{ROWS, COLS});
cuda::device_mdspan<float, extents2d> out_md_rw(d_out, extents2d{ROWS, COLS});
dim3 tile_block(TILE, TILE);
dim3 tile_grid((COLS + TILE - 1) / TILE, (ROWS + TILE - 1) / TILE);
shared_tile_transpose_kernel<<<tile_grid, tile_block>>>(in_md_const, out_md_rw);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
std::vector<float> transposed(nelem);
checkCudaErrors(cudaMemcpy(transposed.data(), d_out, nelem * sizeof(float), cudaMemcpyDeviceToHost));
bool tp_ok = true;
for (int r = 0; r < ROWS && tp_ok; ++r) {
for (int c = 0; c < COLS && tp_ok; ++c) {
const float expect = scaled[c * COLS + r];
if (transposed[r * COLS + c] != expect) {
printf("transpose mismatch at (%d,%d): got %g expected %g\n",
r,
c,
transposed[r * COLS + c],
expect);
tp_ok = false;
}
}
}
if (tp_ok) {
printf("shared_tile_transpose kernel: OK (tile transpose via cuda::shared_memory_mdspan)\n");
}
auto dl_wrapper = cuda::to_dlpack_tensor(out_md);
const auto &dltensor = dl_wrapper.get();
printf("\ncuda::to_dlpack_tensor metadata:\n");
printf(" device : kDLCUDA (ordinal %d)\n", dltensor.device.device_id);
printf(" ndim : %d\n", dltensor.ndim);
printf(" dtype : code=%u bits=%u lanes=%u\n",
static_cast<unsigned>(dltensor.dtype.code),
static_cast<unsigned>(dltensor.dtype.bits),
static_cast<unsigned>(dltensor.dtype.lanes));
printf(" shape : [%lld, %lld]\n",
static_cast<long long>(dltensor.shape[0]),
static_cast<long long>(dltensor.shape[1]));
if (dltensor.strides != nullptr) {
printf(" strides : [%lld, %lld]\n",
static_cast<long long>(dltensor.strides[0]),
static_cast<long long>(dltensor.strides[1]));
}
checkCudaErrors(cudaFree(d_in));
checkCudaErrors(cudaFree(d_out));
if (!scale_ok || !tp_ok) {
return EXIT_FAILURE;
}
printf("\nDone\n");
return EXIT_SUCCESS;
}


@ -0,0 +1,65 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(libcuxxRandom LANGUAGES C CXX CUDA)
# Disable response file for libraries on QNX as qcc does not support lib paths with double quotes
if(CMAKE_SYSTEM_NAME STREQUAL "QNX")
set(CMAKE_CUDA_USE_RESPONSE_FILE_FOR_LIBRARIES OFF)
endif()
find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (may significantly affect performance on some targets)
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Fetch CCCL (CUB + libcu++ + Thrust) via CPM. The toolkit that ships
# with CUDA 13.2 bundles CCCL 3.2, but this sample uses APIs added in
# CCCL 3.3. Pinning the tag here lets the sample build on any toolkit
# with a usable nvcc. Override with -DCCCL_SOURCE_DIR=/path/to/cccl
# to use a local checkout instead of fetching from GitHub.
set(CCCL_SAMPLES_CCCL_TAG "v3.3.3" CACHE STRING
"Tag/branch of NVIDIA/cccl to fetch for the CCCL samples")
if(NOT TARGET CCCL::CCCL)
include("${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/CPM.cmake")
if(DEFINED CCCL_SOURCE_DIR AND NOT CCCL_SOURCE_DIR STREQUAL "")
CPMAddPackage(NAME CCCL SOURCE_DIR "${CCCL_SOURCE_DIR}")
else()
CPMAddPackage(
NAME CCCL
GIT_REPOSITORY "https://github.com/NVIDIA/cccl"
GIT_TAG "${CCCL_SAMPLES_CCCL_TAG}"
)
endif()
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
# Add target for libcuxxRandom
add_executable(libcuxxRandom libcuxxRandom.cu)
target_compile_options(libcuxxRandom PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
target_compile_features(libcuxxRandom PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(libcuxxRandom PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(libcuxxRandom PRIVATE
CUDA::cudart
CCCL::CCCL
)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()


@ -0,0 +1,44 @@
# libcuxxRandom - libcu++ Random Distributions
## Description
This sample demonstrates the random-number facilities added to libcu++ in CCCL 3.3. `<cuda/std/random>` now offers host- and device-compatible implementations of the standard C++ distributions (uniform, normal, Poisson, Bernoulli, and more) and backports the C++26 `cuda::std::philox4x32` / `philox4x64` engines. `<cuda/random>` adds `cuda::pcg64` as an NVIDIA extension (the same generator NumPy uses by default). A kernel draws samples from each distribution on every thread, and the host computes empirical statistics and compares them to the theoretical mean / variance / probability.
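The host-side comparison of empirical and theoretical statistics can be sketched as below. This is a hedged illustration, not the sample's code: it uses `std::mt19937` and `std::uniform_real_distribution` from the host standard library as stand-ins for the `cuda::` engines and distributions, and the names `Stats`, `summarize`, `draw_uniform`, and `close_to` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Stats {
    double mean;
    double variance;  // unbiased sample variance (divides by n - 1)
};

// Two-pass mean/variance over a batch of samples.
inline Stats summarize(const std::vector<double>& samples)
{
    assert(samples.size() > 1);
    double sum = 0.0;
    for (double v : samples)
        sum += v;
    const double mean = sum / static_cast<double>(samples.size());
    double sq = 0.0;
    for (double v : samples) {
        const double d = v - mean;
        sq += d * d;
    }
    return Stats{mean, sq / static_cast<double>(samples.size() - 1)};
}

// Host stand-in for the device kernel: draw n samples from uniform(0, 1).
inline std::vector<double> draw_uniform(std::size_t n, unsigned seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> out(n);
    for (double& v : out)
        v = dist(rng);
    return out;
}

// True when an empirical value lies within tol of the theoretical one.
inline bool close_to(double got, double expected, double tol)
{
    return std::fabs(got - expected) <= tol;
}
```

For uniform(0, 1) the theoretical mean is 1/2 and the variance is 1/12, so `close_to(stats.mean, 0.5, tol)` and `close_to(stats.variance, 1.0 / 12.0, tol)` would turn the printed comparison into a pass/fail check. The tolerance should be chosen relative to the sample count.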
## Key Concepts
CCCL 3.3, libcu++ Random, PCG, Philox, Device-Side PRNG
## Supported SM Architectures
[SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus) [SM 10.0 ](https://developer.nvidia.com/cuda-gpus) [SM 11.0 ](https://developer.nvidia.com/cuda-gpus) [SM 12.0 ](https://developer.nvidia.com/cuda-gpus)
## Supported OSes
Linux, Windows
## Supported CPU Architecture
x86_64, aarch64
## CUDA APIs involved
### [CCCL libcu++](https://nvidia.github.io/cccl/libcudacxx/)
cuda::pcg64, cuda::std::philox4x32, cuda::std::uniform_real_distribution, cuda::std::normal_distribution, cuda::std::poisson_distribution, cuda::std::bernoulli_distribution
### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaMalloc, cudaFree, cudaMemcpy, cudaDeviceSynchronize, cudaGetDeviceProperties
## Dependencies needed to build/run
[CCCL 3.3+](https://github.com/NVIDIA/cccl). Fetched automatically via CPM at configure time (pinned to `v3.3.3`). Override with `-DCCCL_SOURCE_DIR=/path/to/cccl` to use a local checkout.
## Prerequisites
Download and install the [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies listed in the "Dependencies needed to build/run" section above are available.
## References (for more details)
[CCCL 3.3 release notes](https://github.com/NVIDIA/cccl/releases), [cuda::pcg64 header](https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__random/pcg_engine.h), [NumPy PCG64](https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html)


@ -0,0 +1,180 @@
/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* This sample demonstrates the random-number facilities added to libcu++
* in CCCL 3.3: <cuda/std/random> now offers host- and device-compatible
* implementations of the standard C++ distributions (uniform, normal,
* Poisson, Bernoulli, ...), and backports the C++26 Philox counter-based
* engines. <cuda/random> adds cuda::pcg64 as an NVIDIA extension (the
* same generator NumPy uses by default).
*
* A kernel draws many samples from four different distributions on each
* thread and the host computes empirical summary statistics, comparing
* them to the theoretical mean / variance / probability.
*/
/* Includes, system */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
/* Includes, cuda */
#include <cuda_runtime.h>
#include <helper_cuda.h>
/* Includes, cccl */
#include <cuda/random>
#include <cuda/std/random>
#define THREADS_PER_BLOCK 256
#define SAMPLES_PER_THREAD 256
/* Per-thread kernel: seed a PCG engine, draw samples from four
* distributions, and also pull Philox output through a Bernoulli dist
* to show that distributions work with any engine. */
__global__ void sample_kernel(unsigned long long base_seed,
int num_samples_per_thread,
float *uniform_out,
float *normal_out,
int *poisson_out,
int *bernoulli_out)
{
const int tid = blockIdx.x * blockDim.x + threadIdx.x;
const int total_threads = gridDim.x * blockDim.x;
cuda::pcg64 rng(base_seed + static_cast<unsigned long long>(tid));
cuda::std::uniform_real_distribution<float> uniform_dist(0.0f, 1.0f);
cuda::std::normal_distribution<float> normal_dist(0.0f, 1.0f);
cuda::std::poisson_distribution<int> poisson_dist(4.0);
cuda::std::bernoulli_distribution bernoulli_dist(0.25);
cuda::std::philox4x32 philox(static_cast<cuda::std::uint32_t>(base_seed + 17u + tid));
for (int i = 0; i < num_samples_per_thread; ++i) {
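/* Index in size_t so large --blocks values cannot overflow a 32-bit int. */
const size_t idx = static_cast<size_t>(i) * total_threads + tid;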
uniform_out[idx] = uniform_dist(rng);
normal_out[idx] = normal_dist(rng);
poisson_out[idx] = poisson_dist(rng);
bernoulli_out[idx] = bernoulli_dist(philox) ? 1 : 0;
}
}
template <typename T>
static void summarize(const std::vector<T> &samples, double expected_mean, double expected_var, const char *label)
{
const size_t n = samples.size();
double sum = 0.0;
for (const auto v : samples)
sum += static_cast<double>(v);
const double mean = sum / static_cast<double>(n);
double sq = 0.0;
for (const auto v : samples) {
const double d = static_cast<double>(v) - mean;
sq += d * d;
}
const double var = sq / static_cast<double>(n - 1);
printf("%-24s n=%zu mean=%.4f (exp %.4f) var=%.4f (exp %.4f)\n",
label,
n,
mean,
expected_mean,
var,
expected_var);
}
static void summarize_bernoulli(const std::vector<int> &samples, double expected_p)
{
long long ones = 0;
for (int v : samples)
ones += v;
const double p = static_cast<double>(ones) / static_cast<double>(samples.size());
printf("%-24s n=%zu p(1)=%.4f (exp %.4f)\n", "bernoulli(0.25):", samples.size(), p, expected_p);
}
int main(int argc, char **argv)
{
int num_blocks = 64;
for (int i = 1; i + 1 < argc; ++i) {
if (strcmp(argv[i], "--blocks") == 0)
num_blocks = atoi(argv[i + 1]);
}
if (num_blocks <= 0)
num_blocks = 1;
int devID = findCudaDevice(argc, (const char **)argv);
cudaDeviceProp props;
checkCudaErrors(cudaGetDeviceProperties(&props, devID));
printf("Device: %s (Compute Capability %d.%d)\n\n", props.name, props.major, props.minor);
const int total_threads = num_blocks * THREADS_PER_BLOCK;
const size_t n = static_cast<size_t>(total_threads) * SAMPLES_PER_THREAD;
printf("Drawing %zu samples per distribution (%d blocks x %d threads x %d samples/thread)\n\n",
n,
num_blocks,
THREADS_PER_BLOCK,
SAMPLES_PER_THREAD);
float *d_uniform = nullptr;
float *d_normal = nullptr;
int *d_poisson = nullptr;
int *d_bernoulli = nullptr;
checkCudaErrors(cudaMalloc(&d_uniform, n * sizeof(float)));
checkCudaErrors(cudaMalloc(&d_normal, n * sizeof(float)));
checkCudaErrors(cudaMalloc(&d_poisson, n * sizeof(int)));
checkCudaErrors(cudaMalloc(&d_bernoulli, n * sizeof(int)));
const unsigned long long seed = 0xC0FFEE00ULL;
sample_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(
seed, SAMPLES_PER_THREAD, d_uniform, d_normal, d_poisson, d_bernoulli);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
std::vector<float> uniform(n), normal(n);
std::vector<int> poisson(n), bernoulli(n);
checkCudaErrors(cudaMemcpy(uniform.data(), d_uniform, n * sizeof(float), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(normal.data(), d_normal, n * sizeof(float), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(poisson.data(), d_poisson, n * sizeof(int), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(bernoulli.data(), d_bernoulli, n * sizeof(int), cudaMemcpyDeviceToHost));
summarize(uniform, /*mean=*/0.5, /*var=*/1.0 / 12.0, "uniform(0,1):");
summarize(normal, /*mean=*/0.0, /*var=*/1.0, "normal(0,1):");
summarize(poisson, /*mean=*/4.0, /*var=*/4.0, "poisson(lambda=4):");
summarize_bernoulli(bernoulli, /*p=*/0.25);
printf("\nEngines exercised: cuda::pcg64 (NumPy-compatible) and cuda::std::philox4x32 (C++26)\n");
checkCudaErrors(cudaFree(d_uniform));
checkCudaErrors(cudaFree(d_normal));
checkCudaErrors(cudaFree(d_poisson));
checkCudaErrors(cudaFree(d_bernoulli));
printf("Done\n");
return EXIT_SUCCESS;
}


@ -461,8 +461,6 @@ void autoTest(char **argv)
// Run fluids Simulation
bool runFluidsSimulation(int argc, char **argv, char *ref_file)
{
// Create the CUTIL timer
sdkCreateTimer(&timer);
if (ref_file != NULL) {


@ -0,0 +1,206 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This file provides common benchmark utilities for CUDA Tile C++
* microbenchmarks.
*/
#ifndef CUDA_TILE_BENCHMARK_H
#define CUDA_TILE_BENCHMARK_H
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// benchmark configuration
// global settings for benchmark behavior, controllable via command line
struct BenchmarkConfig {
bool use_validation = false; // --validate enables CPU cross-validation
int warmup_iters = 5; // --warmup=N (0 to disable)
int bench_iters = 20; // --iters=N, -i N
};
inline BenchmarkConfig& bench_config() {
static BenchmarkConfig config;
return config;
}
// convenience accessors
inline bool use_validation() { return bench_config().use_validation; }
inline int warmup_iters() { return bench_config().warmup_iters; }
inline int bench_iters() { return bench_config().bench_iters; }
// Parse command line options. Call from main() before benchmarks.
inline void parse_benchmark_args(int argc, char** argv) {
for (int i = 1; i < argc; i++) {
if (strcmp(argv[i], "--validate") == 0) {
bench_config().use_validation = true;
} else if (strcmp(argv[i], "--skip-warmup") == 0) {
bench_config().warmup_iters = 0;
} else if (strncmp(argv[i], "--warmup=", 9) == 0) {
bench_config().warmup_iters = atoi(argv[i] + 9);
} else if (strncmp(argv[i], "--iters=", 8) == 0) {
bench_config().bench_iters = atoi(argv[i] + 8);
} else if (strcmp(argv[i], "-i") == 0) {
// Parse "-i N" where the next argument is the value.
if (i + 1 < argc) {
bench_config().bench_iters = atoi(argv[++i]);
} else {
fprintf(stderr, "Error: -i requires an argument\n");
exit(EXIT_FAILURE);
}
} else if (strcmp(argv[i], "--help") == 0 || strcmp(argv[i], "-h") == 0) {
printf("Benchmark options:\n");
printf(" --validate Enable CPU cross-validation\n");
printf(" --skip-warmup Disable warmup iterations\n");
printf(" --warmup=N Warmup iterations (default: 5)\n");
printf(" -i N, --iters=N Benchmark iterations (default: 20)\n");
exit(0);
} else if (argv[i][0] == '-') {
fprintf(stderr, "Error: unknown option '%s'\n", argv[i]);
fprintf(stderr, "Try '--help' for usage information.\n");
exit(EXIT_FAILURE);
}
}
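// Reject non-positive iteration counts: time_kernel() divides by bench_iters().
if (bench_config().bench_iters <= 0) {
fprintf(stderr, "Error: iteration count must be positive\n");
exit(EXIT_FAILURE);
}
// A negative warmup count is treated as "no warmup".
if (bench_config().warmup_iters < 0) {
bench_config().warmup_iters = 0;
}
// print active configuration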
if (!bench_config().use_validation) {
printf("Note: CPU cross-validation disabled\n");
}
if (bench_config().warmup_iters == 0) {
printf("Note: warmup disabled, iters=%d\n", bench_config().bench_iters);
} else if (bench_config().warmup_iters != 5 || bench_config().bench_iters != 20) {
printf("Note: warmup=%d, iters=%d\n",
bench_config().warmup_iters, bench_config().bench_iters);
}
}
// CUDA error checking
#define CHECK_CUDA(call) do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, \
cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// device information
inline void print_device_info() {
cudaDeviceProp prop;
CHECK_CUDA(cudaGetDeviceProperties(&prop, 0));
printf("Device: %s\n", prop.name);
int memoryClockRateKHz;
cudaError_t status = cudaDeviceGetAttribute(&memoryClockRateKHz, cudaDevAttrMemoryClockRate, 0);
if (status == cudaSuccess) {
printf("Memory Bandwidth: %.0f GB/s (theoretical peak)\n",
2.0 * memoryClockRateKHz * (prop.memoryBusWidth / 8) / 1e6);
}
}
// timing utilities
class CudaTimer {
public:
CudaTimer() {
CHECK_CUDA(cudaEventCreate(&start_));
CHECK_CUDA(cudaEventCreate(&stop_));
}
~CudaTimer() {
cudaEventDestroy(start_);
cudaEventDestroy(stop_);
}
void start() {
CHECK_CUDA(cudaEventRecord(start_));
}
void stop() {
CHECK_CUDA(cudaEventRecord(stop_));
CHECK_CUDA(cudaEventSynchronize(stop_));
}
float elapsed_ms() const {
float ms;
CHECK_CUDA(cudaEventElapsedTime(&ms, start_, stop_));
return ms;
}
private:
cudaEvent_t start_, stop_;
};
// Time a kernel launch, returning average time per iteration in milliseconds.
// Uses global bench_config() for warmup/iteration counts.
template<typename KernelFunc>
inline double time_kernel(KernelFunc kernel_launch) {
// warmup
if (warmup_iters() > 0) {
for (int i = 0; i < warmup_iters(); i++) {
kernel_launch();
}
CHECK_CUDA(cudaDeviceSynchronize());
}
// benchmark
CudaTimer timer;
timer.start();
for (int i = 0; i < bench_iters(); i++) {
kernel_launch();
}
timer.stop();
return timer.elapsed_ms() / bench_iters();
}
// benchmark result structure
struct BenchmarkResult {
const char* name;
double time_ms;
double bandwidth_gb_s;
double gflops;
bool correct;
BenchmarkResult() : name(nullptr), time_ms(0), bandwidth_gb_s(0), gflops(0), correct(false) {}
};
// result printing
inline void print_result(const BenchmarkResult& r) {
const char* status = use_validation() ? (r.correct ? "[OK]" : "[FAIL]") : "";
if (r.gflops > 0) {
printf(" %-42s: %7.3f ms, %7.1f GB/s, %6.1f GFLOPS %s\n",
r.name, r.time_ms, r.bandwidth_gb_s, r.gflops, status);
} else {
printf(" %-42s: %7.3f ms, %7.1f GB/s %s\n",
r.name, r.time_ms, r.bandwidth_gb_s, status);
}
}
#endif // CUDA_TILE_BENCHMARK_H


@ -0,0 +1,96 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This file provides matrix multiplication benchmark helpers shared by
* tileMatmul and tileMatmulAutotuner.
*/
#ifndef CUDA_TILE_MATMUL_BENCHMARK_H
#define CUDA_TILE_MATMUL_BENCHMARK_H
#include "benchmark.h"
#include <cuda_fp16.h>
#include <cmath>
inline void fill_matmul_metrics(BenchmarkResult& result, int M, int N, int K) {
// FLOPs: 2 * M * N * K (multiply + add for each output element).
double flops = 2.0 * M * N * K;
result.gflops = flops / (result.time_ms * 1e6);
// Bandwidth: read A, read B, and write C.
size_t bytes = ((size_t)M * K + (size_t)K * N) * sizeof(__half) +
(size_t)M * N * sizeof(float);
result.bandwidth_gb_s = (bytes / 1e9) / (result.time_ms / 1000.0);
}
// CPU reference implementation (FP16 -> FP32).
inline void matmul_cpu(float* C, const __half* A, const __half* B, int M, int N, int K) {
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
float sum = 0.0f;
for (int k = 0; k < K; k++) {
sum += __half2float(A[i * K + k]) * __half2float(B[k * N + j]);
}
C[i * N + j] = sum;
}
}
}
inline bool verify_matmul_result(const char* name,
const float* h_result,
const float* h_expected,
int M, int N) {
for (int i = 0; i < M * N; i++) {
float abs_err = std::abs(h_result[i] - h_expected[i]);
float rel_err = abs_err / (std::abs(h_expected[i]) + 1e-6f);
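// FP16 has less precision, so allow a larger tolerance. An element fails
// only when both tolerances are exceeded; the negated form also rejects NaN.
if (!(abs_err <= 1e-2f || rel_err <= 0.1f)) {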
printf("%s verification failed at %d: got %f, expected %f\n",
name, i, h_result[i], h_expected[i]);
return false;
}
}
return true;
}
template<typename KernelFunc, typename ValidateFunc>
inline BenchmarkResult run_benchmark(const char* name,
KernelFunc kernel_launch,
ValidateFunc validate_result,
int M, int N, int K) {
BenchmarkResult result;
result.name = name;
result.time_ms = time_kernel(kernel_launch);
fill_matmul_metrics(result, M, N, K);
result.correct = !use_validation() || validate_result();
return result;
}
#endif // CUDA_TILE_MATMUL_BENCHMARK_H


@ -0,0 +1,17 @@
# Some GCC toolchains (notably distribution builds) enable _FORTIFY_SOURCE by
# default when optimizing at -O1 and above, which routes printf through a
# __host__ __device__ wrapper that tile kernels can't call. Turn it off for
# the tile category.
add_compile_options(
$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-U_FORTIFY_SOURCE>
$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=-D_FORTIFY_SOURCE=0>
)
add_subdirectory(helloTile)
add_subdirectory(tileVectorAdd)
add_subdirectory(tileTranspose)
add_subdirectory(tileMatmulAutotuner)
add_subdirectory(tileMatmul)
add_subdirectory(tileBmm)
add_subdirectory(tileLayerNorm)
add_subdirectory(tileRope)
add_subdirectory(tileSpMV)

cpp/9_CUDA_Tile/README.md

@ -0,0 +1,74 @@
# 9. CUDA Tile
### [helloTile](./helloTile)
A CUDA Tile C++ sample demonstrating basic usage of tile kernels. This sample shows how to launch a tile kernel and how data can be passed between SIMT and Tile kernels through global device memory.
### [tileVectorAdd](./tileVectorAdd)
This sample demonstrates a simple vector addition using CUDA Tile C++.
The dataset is split into chunks of 1024 elements, and each block
processes one chunk. The `cuda::tiles::partition_view` type partitions
the data into these chunks. Each block loads its chunk from `a` and
`b`, performs an elementwise addition, then stores the result to the
corresponding chunk of `c`. Masked loads and stores ensure that the
last chunk, which may be partially out of bounds, is handled correctly.
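The chunking scheme can be modeled on the host in plain C++. This is only a sketch of the indexing, not the tile kernel itself: the function name `chunked_add` is hypothetical, and an ordinary bounds check stands in for the masked loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the chunking scheme: each "block" owns one chunk of
// `chunk` elements, and the bounds check plays the role of the mask.
std::vector<float> chunked_add(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t chunk = 1024) {
    assert(a.size() == b.size() && chunk > 0);
    const std::size_t n = a.size();
    const std::size_t num_blocks = (n + chunk - 1) / chunk;  // ceiling division
    std::vector<float> c(n);
    for (std::size_t block = 0; block < num_blocks; ++block) {
        for (std::size_t lane = 0; lane < chunk; ++lane) {
            const std::size_t i = block * chunk + lane;
            if (i < n) {  // mask: skip lanes past the end of the data
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

With 2500 elements and the default chunk size, three chunks are needed and the last one covers only 452 valid elements.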
### [tileTranspose](./tileTranspose)
This sample demonstrates how to transpose a 2D matrix using CUDA Tile
C++. Each block handles an n x m sized chunk of the source matrix. The
block loads a chunk, transposes it locally, and stores it to the
correct position in the result matrix. A cuda::tiles::partition_view
is used to model the chunking of the source and result matrices.
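As a rough host-side illustration of the chunked transpose (the name `tiled_transpose` and the row-major layout are assumptions made for this sketch), each tile of the source is copied to its mirrored position in the result. Unlike the sample, it writes elements directly instead of transposing a loaded tile locally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of a tiled transpose. `src` is a row-major rows x cols
// matrix; the result is a row-major cols x rows matrix. Each (tr, tc) pair
// stands in for one block handling a tile_r x tile_c chunk.
std::vector<float> tiled_transpose(const std::vector<float>& src,
                                   std::size_t rows, std::size_t cols,
                                   std::size_t tile_r, std::size_t tile_c) {
    assert(src.size() == rows * cols && tile_r > 0 && tile_c > 0);
    std::vector<float> dst(rows * cols);
    for (std::size_t tr = 0; tr < rows; tr += tile_r) {
        for (std::size_t tc = 0; tc < cols; tc += tile_c) {
            // Walk one chunk; the bounds checks cover partial edge chunks.
            for (std::size_t r = tr; r < tr + tile_r && r < rows; ++r) {
                for (std::size_t c = tc; c < tc + tile_c && c < cols; ++c) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
    return dst;
}
```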
### [tileMatmul](./tileMatmul)
This sample demonstrates how to write a matrix multiplication kernel with good performance in CUDA Tile C++. The kernel multiplies FP16 input tiles with FP32 accumulation using cuda::tiles::mma. The sample compares a naive implementation with an optimized implementation that applies good practices and provides the compiler with additional guidance for better code generation. The host code validates both results and uses CUDA events to compare execution time.
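To make the timing comparison concrete, the sketch below converts a measured time into throughput figures. It assumes FP16 inputs, an FP32 output, and that A, B, and C are each touched exactly once; the struct and function names are hypothetical.

```cpp
#include <cassert>

struct MatmulMetrics {
    double gflops;
    double bandwidth_gb_s;
};

// Assumes FP16 inputs (2 bytes each) and an FP32 output (4 bytes each),
// with A, B, and C each touched exactly once.
MatmulMetrics matmul_metrics(int M, int N, int K, double time_ms) {
    assert(time_ms > 0.0);
    const double flops = 2.0 * M * N * K;  // one multiply and one add per term
    const double bytes =
        (static_cast<double>(M) * K + static_cast<double>(K) * N) * 2.0 +
        static_cast<double>(M) * N * 4.0;
    MatmulMetrics m;
    m.gflops = flops / (time_ms * 1e6);                     // FLOP per ms -> GFLOP/s
    m.bandwidth_gb_s = (bytes / 1e9) / (time_ms / 1000.0);  // bytes -> GB, ms -> s
    return m;
}
```

For M = N = K = 1024 and a 1 ms run, this gives roughly 2147 GFLOPS and about 8.4 GB/s of minimum memory traffic, which shows why a kernel of this shape is compute-bound rather than bandwidth-bound.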
### [tileMatmulAutotuner](./tileMatmulAutotuner)
A CUDA Tile C++ sample demonstrating an nvrtc/nvcc autotuner over a matrix multiplication kernel. This sample shows how autotuning can help guide the choice of tile sizes and optimization hints.
### [tileBmm](./tileBmm)
This sample demonstrates a static-persistent batched matrix multiplication
(BMM) using CUDA Tile C++. Given inputs A of shape (Q, M, K) and B of
shape (Q, K, N), the kernel computes C = A x B of shape (Q, M, N). The
grid launches a fixed number of persistent blocks sized from the device's
SM count, and each block walks the (M, N, Q-chunk) tile space via a
grid-stride loop. Each iteration consumes a chunk of BLOCK_SIZE_Q batches
and issues a single rank-3 batched cuda::tiles::mma per K-step, with the
(M, N) output partitioned into BLOCK_SIZE_M x BLOCK_SIZE_N tiles via
cuda::tiles::partition_view.
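The persistent grid-stride pattern can be pictured with a small host-side sketch. The helper `assign_tiles` is hypothetical and only lists which linear tile ids each persistent block would visit. It does not model the kernel's decoding of a linear id into (Q-chunk, M, N) coordinates.

```cpp
#include <cassert>
#include <vector>

// For each of `num_blocks` persistent blocks, list the linear tile ids it
// visits: start at the block id and stride by the number of blocks.
std::vector<std::vector<int>> assign_tiles(int total_tiles, int num_blocks) {
    assert(total_tiles >= 0 && num_blocks > 0);
    std::vector<std::vector<int>> per_block(num_blocks);
    for (int bid = 0; bid < num_blocks; ++bid) {
        for (int tile = bid; tile < total_tiles; tile += num_blocks) {
            per_block[bid].push_back(tile);
        }
    }
    return per_block;
}
```

With 10 tiles and 4 blocks, block 0 handles tiles 0, 4, and 8, while block 3 handles tiles 3 and 7, so the work differs by at most one tile per block.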
### [tileLayerNorm](./tileLayerNorm)
This sample demonstrates a persistent layer-norm forward pass using
CUDA Tile C++: `y = (x - mean) * rsqrt(var + eps) * weight + bias`.
The grid launches `NUM_SMS` persistent blocks; each block walks the
row dimension with a grid-stride loop, processing `BLOCK_N` rows by
`BLOCK_D` cols per iteration. Per-row mean and inverse standard
deviation are reduced across the column dimension with `cuda::tiles`
row reductions, while the weight and bias tiles are loaded once and
broadcast across rows. Compile-time template parameters for `N`,
`D`, `NUM_SMS`, and `EPS` let the tile compiler fold the loop step,
the `(1/D)` reciprocal, partition_view extents, and the eps
broadcast.
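A CPU reference for the formula above might look like the following sketch. The function name `layer_norm_ref` is hypothetical, and the use of the biased (1/D) variance is an assumption suggested by the `(1/D)` reciprocal mentioned in the description.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: x is row-major with n rows of d columns; weight and bias
// have d entries and are shared by every row. Uses the biased (1/d)
// variance, which is an assumption about the sample's convention.
std::vector<float> layer_norm_ref(const std::vector<float>& x,
                                  const std::vector<float>& weight,
                                  const std::vector<float>& bias,
                                  std::size_t n, std::size_t d, float eps) {
    assert(x.size() == n * d && weight.size() == d && bias.size() == d && d > 0);
    std::vector<float> y(n * d);
    for (std::size_t row = 0; row < n; ++row) {
        const float* xr = x.data() + row * d;
        float mean = 0.0f;
        for (std::size_t j = 0; j < d; ++j) mean += xr[j];
        mean /= static_cast<float>(d);
        float var = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            const float centered = xr[j] - mean;
            var += centered * centered;
        }
        var /= static_cast<float>(d);
        const float rstd = 1.0f / std::sqrt(var + eps);  // inverse standard deviation
        for (std::size_t j = 0; j < d; ++j) {
            y[row * d + j] = (xr[j] - mean) * rstd * weight[j] + bias[j];
        }
    }
    return y;
}
```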
### [tileRope](./tileRope)
This sample demonstrates a Rotary Position Embedding (RoPE) forward
pass using CUDA Tile C++. The implementation uses the split-half
(GPT-NeoX style) convention: for each token at position `s` the pair
`(q[i], q[i + D/2])` is rotated by `theta = s * 10000^(-2i / D)`. The
`cuda::tiles::partition_view` type partitions the Q and K tensors
over (heads, half_rope_dim), and a single block processes all heads
for one (batch, position) token in parallel, rotating them with
precomputed cos/sin tables and writing the result back in place.
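The split-half rotation can be sketched for a single head vector on the host. The function name `rope_split_half` is hypothetical, and the rotation sign follows the common GPT-NeoX convention, which is assumed here rather than taken from the sample. The sketch also computes cos/sin on the fly instead of reading precomputed tables.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate one head vector `q` of even length D in place for position `s`.
// Element i is paired with element i + D/2 (split-half convention).
void rope_split_half(std::vector<float>& q, int s, float base = 10000.0f) {
    const std::size_t D = q.size();
    assert(D % 2 == 0);
    const std::size_t half = D / 2;
    for (std::size_t i = 0; i < half; ++i) {
        // theta = s * base^(-2i / D)
        const float theta = static_cast<float>(s) *
            std::pow(base, -2.0f * static_cast<float>(i) / static_cast<float>(D));
        const float c = std::cos(theta);
        const float sn = std::sin(theta);
        const float lo = q[i];
        const float hi = q[i + half];
        q[i] = lo * c - hi * sn;
        q[i + half] = lo * sn + hi * c;
    }
}
```

Because each pair undergoes a plain 2D rotation, the vector's length is preserved, and position 0 leaves the input unchanged.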
### [tileSpMV](./tileSpMV)
This sample demonstrates sparse matrix-vector multiplication (SpMV)
`y = A * x` using CUDA Tile C++.
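For reference, a CPU version of `y = A * x` is sketched below. The description does not state which sparse format the sample uses, so the CSR layout and the name `spmv_csr` are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CSR storage (assumed format): row_ptr has rows + 1 entries, and the
// nonzeros of row r live at positions [row_ptr[r], row_ptr[r + 1]) of
// col_idx and values.
std::vector<float> spmv_csr(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx,
                            const std::vector<float>& values,
                            const std::vector<float>& x) {
    assert(!row_ptr.empty() && col_idx.size() == values.size());
    const std::size_t rows = row_ptr.size() - 1;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int p = row_ptr[r]; p < row_ptr[r + 1]; ++p) {
            sum += values[p] * x[col_idx[p]];
        }
        y[r] = sum;
    }
    return y;
}
```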


@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(helloTile LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(helloTile helloTile.cu)
target_compile_features(helloTile PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()


@ -0,0 +1,31 @@
# helloTile
## Description
This CUDA Tile C++ sample demonstrates basic usage of tile
kernels. This code launches a tile kernel using the triple chevron
syntax and passes data between SIMT and Tile code through global
device memory.
Error checks are performed using `cudaGetLastError` to catch kernel launch issues and `cudaDeviceSynchronize` to catch kernel execution issues.
## Expected Output
```
Hello, SIMT!
[SIMT] *x == 0
[SIMT] *x = 100
Hello, Tile!
[Tile] *x == 100
[Tile] *x = 200
Hello, Host!
[Host] *x == 200
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.


@ -0,0 +1,76 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This CUDA Tile C++ sample demonstrates basic usage of tile
* kernels. This code launches a tile kernel using the triple chevron
* syntax and passes data between SIMT and Tile code through global
* device memory. Error checks are performed using `cudaGetLastError`
* to catch kernel launch issues and `cudaDeviceSynchronize` to catch
* kernel execution issues.
*/
#include "helper_cuda.h"
__global__ void simtKernel(int* x) {
printf("Hello, SIMT!\n");
printf("[SIMT] *x == %i\n", *x);
*x = 100;
printf("[SIMT] *x = %i\n\n", *x);
}
__tile_global__ void tileKernel(int* x) {
printf("Hello, Tile!\n");
printf("[Tile] *x == %i\n", *x);
*x = 200;
printf("[Tile] *x = %i\n\n", *x);
}
int main() {
int* d_x = nullptr;
checkCudaErrors(cudaMalloc(&d_x, sizeof(int)));
checkCudaErrors(cudaMemset(d_x, 0, sizeof(int)));
simtKernel<<<1, 1>>>(d_x);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
/* Launch the tile kernel. The threads-per-block parameter is omitted because it must always be 1. */
tileKernel<<<1>>>(d_x);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
int h_x = 0;
checkCudaErrors(cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaFree(d_x));
printf("Hello, Host!\n");
printf("[Host] *x == %i\n", h_x);
}


@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileBmm LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(tileBmm tileBmm.cu)
target_compile_features(tileBmm PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()


@ -0,0 +1,28 @@
# tileBmm
## Description
This sample demonstrates a static-persistent batched matrix multiplication
(BMM) using CUDA Tile C++. Given inputs A of shape (Q, M, K) and B of shape
(Q, K, N), the kernel computes C = A x B of shape (Q, M, N). The grid
launches a fixed number of persistent blocks sized from the device's SM
count, and each block walks the (M, N, Q-chunk) tile space via a grid-stride
loop. The batch dimension is tiled by BLOCK_SIZE_Q so every iteration issues
a single rank-3 batched cuda::tiles::mma per K-step over tiles of shape
(BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_K) and
(BLOCK_SIZE_Q, BLOCK_SIZE_K, BLOCK_SIZE_N). Grouped ordering on
(pid_m, pid_n) gives L2 reuse. The accumulator is kept in float32 for
precision, and masked loads/stores handle tiles that overhang the matrix
or batch boundaries. Inputs and outputs use __half precision.
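The grouped ordering can be explored on the host with a small decoder that follows the same arithmetic as the kernel's index computation. The struct `TileCoord` and function `decode_tile` are hypothetical names for this sketch.

```cpp
#include <cassert>

struct TileCoord {
    int pid_q;
    int pid_m;
    int pid_n;
};

// Decode a linear tile id into (batch chunk, tile row, tile column).
// Within a batch chunk, ids are grouped into bands of up to `group_size_m`
// tile rows, and consecutive ids walk down a band before moving to the
// next tile column.
TileCoord decode_tile(int linear_id, int num_tiles_m, int num_tiles_n,
                      int group_size_m) {
    assert(linear_id >= 0 && num_tiles_m > 0 && num_tiles_n > 0 && group_size_m > 0);
    const int tiles_per_batch = num_tiles_m * num_tiles_n;
    const int pid_q = linear_id / tiles_per_batch;
    const int id_2d = linear_id % tiles_per_batch;
    const int num_pid_in_group = group_size_m * num_tiles_n;
    const int group_id = id_2d / num_pid_in_group;
    const int first_pid_m = group_id * group_size_m;
    const int remaining = num_tiles_m - first_pid_m;
    // The last band may hold fewer than group_size_m tile rows.
    const int rows_in_group = remaining < group_size_m ? remaining : group_size_m;
    const int pid_m = first_pid_m + (id_2d % rows_in_group);
    const int pid_n = (id_2d % num_pid_in_group) / rows_in_group;
    return TileCoord{pid_q, pid_m, pid_n};
}
```

With 4 x 3 tiles per batch chunk and a group size of 2, ids 0, 1, 2 map to (m, n) = (0, 0), (1, 0), (0, 1). Blocks running at the same time therefore tend to work on nearby rows and columns, which is presumably where the L2 reuse comes from.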
## Expected Output
```
Success! BMM matches expected results.
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.


@ -0,0 +1,268 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This sample demonstrates a static-persistent batched matrix multiplication
* (BMM) using CUDA Tile C++. Given A of shape (Q, M, K) and B of shape
* (Q, K, N), the kernel computes C = A x B of shape (Q, M, N). The grid
* launches a fixed number of persistent blocks (sized from the device's SM
* count); each block walks the (M, N, Q-chunk) tile space via a grid-stride
* loop. The batch dimension is tiled by BLOCK_SIZE_Q so every iteration issues
* a single rank-3 batched cuda::tiles::mma per K-step over tiles of shape
* (BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_K) x
* (BLOCK_SIZE_Q, BLOCK_SIZE_K, BLOCK_SIZE_N). Grouped ordering on
* (pid_m, pid_n) gives L2 reuse. The accumulator is kept in float32 for
* precision, and masked loads/stores handle tiles that overhang the matrix
* or batch boundaries. Inputs and outputs use __half.
*
* A SIMT kernel is used to initialize the input matrices.
*/
#include "helper_cuda.h"
#include "cuda_tile.h"
#include "cuda_fp16.h"
#include <cstdio>
#include <cstdlib>
/* SIMT initializer for A (shape Q x M x K) and B (shape Q x K x N).
* Values are bounded so the K-summed result fits comfortably in __half. */
__global__ void initializeMatrices(__half* a, __half* b,
int Q, int M, int N, int K) {
auto idx = blockIdx.x * blockDim.x + threadIdx.x;
std::size_t a_size = std::size_t(Q) * M * K;
std::size_t b_size = std::size_t(Q) * K * N;
if (idx < a_size) {
int k = idx % K;
int m = (idx / K) % M;
a[idx] = __half{float((m + k + 1) % 8) / 32.0f};
}
if (idx < b_size) {
int n = idx % N;
int k = (idx / N) % K;
b[idx] = __half{float((k + n + 1) % 8) / 32.0f};
}
}
/* Static-persistent tile kernel computing C = A @ B for batched 3D tensors.
* A: (Q, M, K), B: (Q, K, N), C: (Q, M, N). Both inputs are in their
* natural (non-transposed) layout. The grid is sized from the device SM
* count and each block walks the (M, N, Q-chunk) tile space via a
* grid-stride irange loop. Each iteration consumes a chunk of BLOCK_SIZE_Q
* batches with a single rank-3 batched mma per K-step. */
template<typename T, int BLOCK_SIZE_Q, int BLOCK_SIZE_M, int BLOCK_SIZE_N,
int BLOCK_SIZE_K, int GROUP_SIZE_M, int Q, int M, int N, int K,
int NUM_CTAS, int OCCUPANCY>
[[ using cutile :
hint(0, num_cta_in_cga=NUM_CTAS),
hint(0, occupancy=OCCUPANCY)
]]
__tile_global__ void persistent_bmm_kernel(const T* __restrict__ _a_ptr,
const T* __restrict__ _b_ptr,
T* __restrict__ _c_ptr) {
namespace ct = cuda::tiles;
/* tell the compiler the pointers are aligned (important for codegen) */
const T* a_ptr = ct::assume_aligned<16>(_a_ptr);
const T* b_ptr = ct::assume_aligned<16>(_b_ptr);
T* c_ptr = ct::assume_aligned<16>(_c_ptr);
/* accumulator tile kept in float32 for numerical precision; rank-3
* so the batched mma can fold the (q, m, k) x (q, k, n) -> (q, m, n)
* contraction in a single call. */
using AccTile = ct::tile<float,
ct::shape<BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_N>>;
int bid = ct::bid().x;
int num_programs = ct::num_blocks().x;
/* tile counts include a chunked batch axis */
constexpr int num_tiles_m = (M + BLOCK_SIZE_M - 1) / BLOCK_SIZE_M;
constexpr int num_tiles_n = (N + BLOCK_SIZE_N - 1) / BLOCK_SIZE_N;
constexpr int num_tiles_q = (Q + BLOCK_SIZE_Q - 1) / BLOCK_SIZE_Q;
constexpr int total_tiles = num_tiles_m * num_tiles_n * num_tiles_q;
/* loop-invariant partition views for A (Q, M, K), B (Q, K, N), C (Q, M, N) */
auto a_layout = ct::layout_right_mapping{ct::extents{Q, M, K}};
auto pA = ct::partition_view{
ct::tensor_span{a_ptr, a_layout},
ct::shape<BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_K>{}};
auto b_layout = ct::layout_right_mapping{ct::extents{Q, K, N}};
auto pB = ct::partition_view{
ct::tensor_span{b_ptr, b_layout},
ct::shape<BLOCK_SIZE_Q, BLOCK_SIZE_K, BLOCK_SIZE_N>{}};
auto c_layout = ct::layout_right_mapping{ct::extents{Q, M, N}};
auto pC = ct::partition_view{
ct::tensor_span{c_ptr, c_layout},
ct::shape<BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_N>{}};
/* grid-stride loop over (pid_q_chunk, pid_m, pid_n) tiles */
for (auto current_bid : ct::irange(bid, total_tiles, num_programs)) {
/* decode the linear tile id with grouped ordering on (m, n) for L2 reuse */
int pid_q = current_bid / (num_tiles_m * num_tiles_n);
int num_pid_in_group = GROUP_SIZE_M * num_tiles_n;
int current_bid_2d = current_bid % (num_tiles_m * num_tiles_n);
int group_id = current_bid_2d / num_pid_in_group;
int first_pid_m = group_id * GROUP_SIZE_M;
int group_size_m_temp = num_tiles_m - first_pid_m;
int group_size_m = (group_size_m_temp < GROUP_SIZE_M)
? group_size_m_temp : GROUP_SIZE_M;
int pid_m = first_pid_m + (current_bid_2d % group_size_m);
int pid_n = (current_bid_2d % num_pid_in_group) / group_size_m;
auto accumulator = ct::zeros<AccTile>();
/* K-dimension accumulation loop; each iteration issues a single
* rank-3 mma across BLOCK_SIZE_Q batches. */
constexpr int num_k_tiles = (K + BLOCK_SIZE_K - 1) / BLOCK_SIZE_K;
for (auto k_tile : ct::irange(0, num_k_tiles)) {
auto a_tile = pA.load_masked(pid_q, pid_m, k_tile);
auto b_tile = pB.load_masked(pid_q, k_tile, pid_n);
accumulator = ct::mma(a_tile, b_tile, accumulator);
}
auto result = ct::element_cast<T>(accumulator);
pC.store_masked(result, pid_q, pid_m, pid_n);
}
}
int main() {
/* tile-shape template parameters: BLOCK_SIZE_M, BLOCK_SIZE_N, and
* BLOCK_SIZE_K are multiples of 16 (tensor-core friendly) that divide the
* test problem cleanly. BLOCK_SIZE_Q controls how many batches each block
* fuses into a single rank-3 mma. NUM_CTAS and OCCUPANCY are launch hints
* for the cutile compiler. These values mirror the production defaults
* used in the Ocean / TileGym BMM kernel for sm_100-class GPUs. */
constexpr int BLOCK_SIZE_Q = 1;
constexpr int BLOCK_SIZE_M = 256;
constexpr int BLOCK_SIZE_N = 256;
constexpr int BLOCK_SIZE_K = 64;
constexpr int GROUP_SIZE_M = 8;
constexpr int NUM_CTAS = 2;
constexpr int OCCUPANCY = 1;
/* problem dimensions are compile-time NTTPs so partition extents fold and
* total_tiles is constexpr inside the kernel. Sizes are kept small so the
* CPU reference comparison stays fast; the launch config above is still
* the production sm_100 set (which is tuned for much larger shapes). */
constexpr int Q = 4;
constexpr int M = 256;
constexpr int N = 256;
constexpr int K = 128;
std::size_t a_size = std::size_t(Q) * M * K;
std::size_t b_size = std::size_t(Q) * K * N;
std::size_t c_size = std::size_t(Q) * M * N;
__half* d_A = nullptr;
__half* d_B = nullptr;
__half* d_C = nullptr;
checkCudaErrors(cudaMalloc(&d_A, a_size * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_B, b_size * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_C, c_size * sizeof(__half)));
/* populate A and B with deterministic test data on the device */
int init_threads = 256;
std::size_t init_elems = (a_size > b_size) ? a_size : b_size;
int init_blocks = int((init_elems + init_threads - 1) / init_threads);
initializeMatrices<<<init_blocks, init_threads>>>(d_A, d_B, Q, M, N, K);
checkCudaErrors(cudaGetLastError());
/* compute a CPU reference using double accumulation, then cast to __half */
__half* h_A = new __half[a_size];
__half* h_B = new __half[b_size];
__half* h_C_ref = new __half[c_size];
checkCudaErrors(cudaMemcpy(h_A, d_A, a_size * sizeof(__half), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_B, d_B, b_size * sizeof(__half), cudaMemcpyDeviceToHost));
for (int q = 0; q < Q; ++q) {
for (int m = 0; m < M; ++m) {
for (int n = 0; n < N; ++n) {
double acc = 0.0;
for (int k = 0; k < K; ++k) {
double av = double(float(h_A[(std::size_t(q) * M + m) * K + k]));
double bv = double(float(h_B[(std::size_t(q) * K + k) * N + n]));
acc += av * bv;
}
h_C_ref[(std::size_t(q) * M + m) * N + n] = __half{float(acc)};
}
}
}
/* launch the persistent BMM kernel: grid size mirrors the static-persistent
* formula min(NUM_SMS / NUM_CTAS, total_tiles) * OCCUPANCY -- enough
* blocks to either saturate the device or cover all tiles, whichever is
* smaller. */
cudaDeviceProp prop;
checkCudaErrors(cudaGetDeviceProperties(&prop, 0));
int num_sms = prop.multiProcessorCount;
constexpr int num_tiles_m = (M + BLOCK_SIZE_M - 1) / BLOCK_SIZE_M;
constexpr int num_tiles_n = (N + BLOCK_SIZE_N - 1) / BLOCK_SIZE_N;
constexpr int num_tiles_q = (Q + BLOCK_SIZE_Q - 1) / BLOCK_SIZE_Q;
constexpr int total_tiles = num_tiles_m * num_tiles_n * num_tiles_q;
int base_programs = num_sms / NUM_CTAS;
int grid_size = (base_programs < total_tiles ? base_programs : total_tiles)
* OCCUPANCY;
persistent_bmm_kernel<__half, BLOCK_SIZE_Q, BLOCK_SIZE_M, BLOCK_SIZE_N,
BLOCK_SIZE_K, GROUP_SIZE_M, Q, M, N, K,
NUM_CTAS, OCCUPANCY>
<<<dim3(grid_size, 1, 1)>>>(d_A, d_B, d_C);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
__half* h_C = new __half[c_size];
checkCudaErrors(cudaMemcpy(h_C, d_C, c_size * sizeof(__half), cudaMemcpyDeviceToHost));
for (std::size_t idx = 0; idx < c_size; ++idx) {
float got = float(h_C[idx]);
float ref = float(h_C_ref[idx]);
float diff = got > ref ? got - ref : ref - got;
if (diff > 1e-1f) {
printf("Expected: h_C[%zu] == %f\n", idx, ref);
printf("Actual: h_C[%zu] == %f\n", idx, got);
return 1;
}
}
printf("Success! BMM matches expected results.\n");
checkCudaErrors(cudaFree(d_A));
checkCudaErrors(cudaFree(d_B));
checkCudaErrors(cudaFree(d_C));
delete[] h_A;
delete[] h_B;
delete[] h_C;
delete[] h_C_ref;
}

@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileLayerNorm LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(tileLayerNorm tileLayerNorm.cu)
target_compile_features(tileLayerNorm PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

@ -0,0 +1,26 @@
# tileLayerNorm
## Description
This sample demonstrates a persistent layer-norm forward pass using
CUDA Tile C++:
`y = (x - mean) * rsqrt(var + eps) * weight + bias`. The grid launches `NUM_SMS`
persistent blocks; each block walks the row dimension with a grid-stride loop,
processing `BLOCK_N` rows by `BLOCK_D` cols per iteration and
striding by `NUM_SMS * BLOCK_N` rows between iterations. Per-row
mean and inverse standard deviation are reduced across the column
dimension with `cuda::tiles` row reductions and saved to float32
side buffers, while the weight and bias tiles are loaded once and
broadcast across rows.
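
As a rough host-side illustration of the description above, the following plain C++ sketch applies the same formula (with the variance taken as `E[x^2] - mean^2`) and walks row tiles in the same grid-stride order. It is not the Tile kernel: the names `layer_norm_row` and `layer_norm_persistent` are hypothetical, the "blocks" are emulated with an ordinary loop, and everything runs in `float`/`double` on the CPU.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: normalizes one row of length d using
// y = (x - mean) * rsqrt(var + eps) * w + b, with var = E[x^2] - mean^2.
inline void layer_norm_row(const float* x, const float* w, const float* b,
                           float* y, int d, float eps,
                           float* mean_out, float* rstd_out) {
    double sum = 0.0, sumsq = 0.0;
    for (int j = 0; j < d; ++j) {
        sum += x[j];
        sumsq += double(x[j]) * double(x[j]);
    }
    const double mean = sum / d;
    const double var = sumsq / d - mean * mean;
    const double rstd = 1.0 / std::sqrt(var + double(eps));
    for (int j = 0; j < d; ++j) {
        y[j] = float((x[j] - mean) * rstd * w[j] + b[j]);
    }
    *mean_out = float(mean);
    *rstd_out = float(rstd);
}

// Hypothetical emulation of the persistent schedule: `num_blocks` blocks,
// each starting at row tile `block` and striding by `num_blocks` tiles.
// Returns how many times each row was visited (every entry should be 1).
inline std::vector<int> layer_norm_persistent(
        const std::vector<float>& x, const std::vector<float>& w,
        const std::vector<float>& b, std::vector<float>& y,
        std::vector<float>& mean, std::vector<float>& rstd,
        int n, int d, int block_n, int num_blocks, float eps) {
    std::vector<int> visits(n, 0);
    const int num_tiles = (n + block_n - 1) / block_n;
    for (int block = 0; block < num_blocks; ++block) {
        for (int tile = block; tile < num_tiles; tile += num_blocks) {
            // The bound `r < n` plays the role of the masked load/store
            // on a partial last tile.
            for (int r = tile * block_n; r < (tile + 1) * block_n && r < n; ++r) {
                const std::size_t off = std::size_t(r) * d;
                layer_norm_row(&x[off], w.data(), b.data(), &y[off],
                               d, eps, &mean[r], &rstd[r]);
                ++visits[r];
            }
        }
    }
    return visits;
}
```

With the sample's values (`N = 1024`, `BLOCK_N = 4`, `NUM_SMS = 132`) there are 256 row tiles, so under this schedule blocks 0 to 123 process two tiles each and blocks 124 to 131 process one.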
## Expected Output
```
Success! Persistent LayerNorm matches expected results.
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.

@ -0,0 +1,270 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This sample demonstrates a persistent LayerNorm forward pass using
* CUDA Tile C++: y = (x - mean) * rsqrt(var + eps) * weight + bias.
* The grid launches NUM_SMS persistent blocks; each block walks the
* row dimension with a grid-stride loop, processing BLOCK_N rows x
* BLOCK_D cols per iteration and striding by NUM_SMS * BLOCK_N rows.
* Per-row mean and rstd are reduced over the column dimension and
* (when COMPUTE_MEAN_AND_RSTD and TRAINING are enabled) saved to
* float32 side buffers. N, D, NUM_SMS, and EPS are template NTTPs so
* the tile compiler can fold the loop step, the (1/D) reciprocal,
* partition_view extents, and the (var + eps) broadcast. A SIMT
* kernel is used to initialize X, W, and B on device.
*/
#include "helper_cuda.h"
#include "cuda_tile.h"
#include "cuda_fp16.h"
#include <cstdio>
#include <cstdlib>
#include <cmath>
/* SIMT initializer for X (N x D), W (D,), B (D,) with deterministic data. */
__global__ void initializeInputs(__half* X, __half* W, __half* B,
int N, int D) {
auto idx = blockIdx.x * blockDim.x + threadIdx.x;
auto total = N * D;
if (idx < total) {
int m = idx / D;
int n = idx - m * D;
X[idx] = __half{float((m + n) % 7) - 3.5f};
}
if (idx < D) {
W[idx] = __half{1.0f + 0.1f * float(idx % 5)};
B[idx] = __half{0.1f * float(idx % 3)};
}
}
template<typename T,
int BLOCK_N,
int BLOCK_D,
bool TRAINING,
bool COMPUTE_MEAN_AND_RSTD,
int N,
int D, // compile-time so `(... / D)` and partition_view extents fold.
int NUM_SMS, // compile-time so the persistent for-loop step is constant.
float EPS> // compile-time so the (var + eps) reshape/broadcast hoists out of the for-loop
[[ using cutile : hint(0, num_cta_in_cga=1) ]]
__tile_global__ void persistent_layer_norm_fwd_kernel(
const T* __restrict__ X, // (N, D)
T* __restrict__ Y, // (N, D)
const T* __restrict__ W, // (D,)
const T* __restrict__ B, // (D,)
float* __restrict__ Mean, // (N,)
float* __restrict__ Rstd // (N,)
) {
namespace ct = cuda::tiles;
using f32_N = ct::tile<float, ct::shape<BLOCK_N>>;
X = ct::assume_aligned<16>(X);
Y = ct::assume_aligned<16>(Y);
W = ct::assume_aligned<16>(W);
B = ct::assume_aligned<16>(B);
int pid = ct::bid().x;
constexpr int upper_bound = (N + BLOCK_N - 1) / BLOCK_N;
// Partitioned views with compile-time extents (N, D are template NTTPs).
using ExtND = ct::extents<uint32_t, static_cast<uint32_t>(N), static_cast<uint32_t>(D)>;
using ExtD = ct::extents<uint32_t, static_cast<uint32_t>(D)>;
using ExtN = ct::extents<uint32_t, static_cast<uint32_t>(N)>;
auto pX = ct::partition_view(
ct::tensor_span{X, ExtND{}},
ct::shape<BLOCK_N, BLOCK_D>{});
auto pY = ct::partition_view(
ct::tensor_span{Y, ExtND{}},
ct::shape<BLOCK_N, BLOCK_D>{});
auto pW = ct::partition_view(
ct::tensor_span{W, ExtD{}},
ct::shape<BLOCK_D>{});
auto pB = ct::partition_view(
ct::tensor_span{B, ExtD{}},
ct::shape<BLOCK_D>{});
auto pMean = ct::partition_view(
ct::tensor_span{Mean, ExtN{}},
ct::shape<BLOCK_N>{});
auto pRstd = ct::partition_view(
ct::tensor_span{Rstd, ExtN{}},
ct::shape<BLOCK_N>{});
// Load weights once (hoisted out of the grid-stride loop).
auto w = ct::element_cast<float>(pW.load(0)); // (BLOCK_D,)
auto b = ct::element_cast<float>(pB.load(0)); // (BLOCK_D,)
// Reshape weights to (1, BLOCK_D) so they broadcast across (BLOCK_N, BLOCK_D).
auto w_bcast = ct::reshape<ct::shape<1, BLOCK_D>>(w);
auto b_bcast = ct::reshape<ct::shape<1, BLOCK_D>>(b);
constexpr float inv_D_scalar = 1.0f / static_cast<float>(D);
auto inv_D_tile = ct::full<f32_N>(inv_D_scalar);
auto eps_tile = ct::full<f32_N>(EPS);
using TileXNxD = ct::tile<T, ct::shape<BLOCK_N, BLOCK_D>>;
for (auto current_pid : ct::irange(pid, upper_bound, NUM_SMS)) {
TileXNxD x_tile;
[[ using cutile : hint(0, latency=4) ]]
x_tile = pX.load_masked(current_pid, 0);
auto x = ct::element_cast<float>(x_tile);
f32_N mean;
f32_N rstd;
if constexpr (COMPUTE_MEAN_AND_RSTD) {
// Compute E[x^2] and E[x] with row sums, then var = E[x^2] - mean^2.
// Use the loop-invariant `inv_D_tile` and `eps_tile` (built outside
// the loop) so the reshape + broadcast of those scalars stays hoisted.
auto x_squared = x * x;
auto avg_square_2d = ct::sum<1>(x_squared);
auto avg_square = ct::reshape<ct::shape<BLOCK_N>>(avg_square_2d) * inv_D_tile;
auto mean_2d = ct::sum<1>(x);
mean = ct::reshape<ct::shape<BLOCK_N>>(mean_2d) * inv_D_tile;
auto var = avg_square - mean * mean;
rstd = ct::rsqrt(var + eps_tile);
if constexpr (TRAINING) {
[[ using cutile : hint(0, allow_tma=false) ]]
pMean.store_masked(mean, current_pid);
[[ using cutile : hint(0, allow_tma=false) ]]
pRstd.store_masked(rstd, current_pid);
}
} else {
mean = pMean.load_masked(current_pid);
rstd = pRstd.load_masked(current_pid);
}
// Reshape mean/rstd to (BLOCK_N, 1), then rely on implicit broadcast
// against (BLOCK_N, BLOCK_D).
auto mean_col = ct::reshape<ct::shape<BLOCK_N, 1>>(mean);
auto rstd_col = ct::reshape<ct::shape<BLOCK_N, 1>>(rstd);
auto x_hat = (x - mean_col) * rstd_col;
auto y_f32 = x_hat * w_bcast + b_bcast;
auto y_T = ct::element_cast<T>(y_f32);
[[ using cutile : hint(0, allow_tma=false) ]]
pY.store_masked(y_T, current_pid, 0);
}
}
int main() {
/* BLOCK_D == D so each persistent-loop iteration covers a full row's
* columns in one tile per row. NUM_SMS is a template NTTP, hence
* compile-time; 132 matches the SM count of an H100 SXM (Hopper) GPU -
* adjust it to the target device's `multiProcessorCount` for best perf. */
constexpr int N = 1024, D = 256;
constexpr int BLOCK_N = 4, BLOCK_D = 256;
constexpr int NUM_SMS = 132;
constexpr float EPS = 1e-5f;
__half *d_X = nullptr, *d_Y = nullptr, *d_W = nullptr, *d_B = nullptr;
float *d_Mean = nullptr, *d_Rstd = nullptr;
checkCudaErrors(cudaMalloc(&d_X, N * D * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_Y, N * D * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_W, D * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_B, D * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_Mean, N * sizeof(float)));
checkCudaErrors(cudaMalloc(&d_Rstd, N * sizeof(float)));
int init_threads = 256, init_blocks = 1 + ((N * D - 1) / init_threads);
initializeInputs<<<init_blocks, init_threads>>>(d_X, d_W, d_B, N, D);
checkCudaErrors(cudaGetLastError());
/* NUM_SMS is a compile-time NTTP that doubles as the persistent-loop
* stride; the launch grid x must equal NUM_SMS for correctness.
* Adjust the constant (and recompile) for one block per SM on a
* device with a different SM count. */
persistent_layer_norm_fwd_kernel<__half, BLOCK_N, BLOCK_D,
/*TRAINING=*/true, /*COMPUTE_MEAN_AND_RSTD=*/true,
N, D, NUM_SMS, EPS>
<<<dim3(NUM_SMS, 1, 1)>>>(d_X, d_Y, d_W, d_B, d_Mean, d_Rstd);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
__half* h_Y = new __half[N * D];
__half* h_Y_ref = new __half[N * D];
float* h_Mean = new float[N];
float* h_Rstd = new float[N];
float* h_Mean_ref = new float[N];
float* h_Rstd_ref = new float[N];
checkCudaErrors(cudaMemcpy(h_Y, d_Y, N * D * sizeof(__half), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_Mean, d_Mean, N * sizeof(float), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_Rstd, d_Rstd, N * sizeof(float), cudaMemcpyDeviceToHost));
/* CPU reference in double precision; compare with 1e-1 fp16 tolerance
* for Y and 1e-3 for the float32 Mean/Rstd outputs. */
for (int m = 0; m < N; ++m) {
double sum = 0.0, sumsq = 0.0;
for (int n = 0; n < D; ++n) {
double x = double(float((m + n) % 7) - 3.5f);
sum += x; sumsq += x * x;
}
double mu = sum / double(D);
double var = sumsq / double(D) - mu * mu;
double inv_std = 1.0 / std::sqrt(var + double(EPS));
h_Mean_ref[m] = float(mu);
h_Rstd_ref[m] = float(inv_std);
for (int n = 0; n < D; ++n) {
double x = double(float((m + n) % 7) - 3.5f);
double w = double(1.0f + 0.1f * float(n % 5));
double b = double(0.1f * float(n % 3));
h_Y_ref[m * D + n] = __half(float((x - mu) * inv_std * w + b));
}
}
for (int idx = 0; idx < N * D; ++idx) {
float expected = float(h_Y_ref[idx]), actual = float(h_Y[idx]);
if (std::fabs(expected - actual) > 1e-1f) {
printf("Mismatch h_Y[%d]: expected %f, actual %f\n", idx, expected, actual);
return 1;
}
}
for (int m = 0; m < N; ++m) {
if (std::fabs(h_Mean_ref[m] - h_Mean[m]) > 1e-3f) {
printf("Mismatch h_Mean[%d]: expected %f, actual %f\n", m, h_Mean_ref[m], h_Mean[m]);
return 1;
}
if (std::fabs(h_Rstd_ref[m] - h_Rstd[m]) > 1e-3f) {
printf("Mismatch h_Rstd[%d]: expected %f, actual %f\n", m, h_Rstd_ref[m], h_Rstd[m]);
return 1;
}
}
printf("Success! Persistent LayerNorm matches expected results.\n");
checkCudaErrors(cudaFree(d_X)); checkCudaErrors(cudaFree(d_Y));
checkCudaErrors(cudaFree(d_W)); checkCudaErrors(cudaFree(d_B));
checkCudaErrors(cudaFree(d_Mean)); checkCudaErrors(cudaFree(d_Rstd));
delete[] h_Y; delete[] h_Y_ref;
delete[] h_Mean; delete[] h_Rstd;
delete[] h_Mean_ref; delete[] h_Rstd_ref;
}

@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileMatmul LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common ../Benchmark_Common)
# Source file
add_executable(tileMatmul tileMatmul.cu)
target_compile_features(tileMatmul PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

@ -0,0 +1,55 @@
# tileMatmul
## Description
This sample demonstrates how to write a matrix multiplication kernel with good
performance in CUDA Tile C++. The kernel multiplies FP16 input tiles with FP32
accumulation using `cuda::tiles::mma`.
The sample compares a naive implementation with an optimized implementation
that applies good practices and provides the compiler with additional guidance
for better code generation. The host code uses CUDA events to capture execution time.
The optimized kernel uses neutral placeholder values for `LOAD_LATENCY` and
`STORE_LATENCY`. See [tileMatmulAutotuner](../tileMatmulAutotuner) for an
example of autotuning these values to find a suitable
configuration for your hardware.
## Running
```
# Run with default warmup and benchmark iterations. Validation is disabled by default.
./tileMatmul
# Enable CPU validation
./tileMatmul --validate
# To run faster, skip warmups (not recommended) and set the iteration count to 1
./tileMatmul --skip-warmup -i 1
# Show all options
./tileMatmul --help
```
## Command-Line Options
| Option | Description |
|--------|-------------|
| `--validate` | Enable CPU cross-validation. Validation is disabled by default. |
| `--skip-warmup` | Disable warmup iterations. |
| `--warmup=N` | Set warmup iterations. The default is 5. |
| `-i N`, `--iters=N` | Set benchmark iterations. The default is 20. |
## Example Output
```
Note: CPU cross-validation disabled
matmul : 0.073 ms, 115.2 GB/s, 29495.8 GFLOPS
matmul_naive : 0.497 ms, 16.9 GB/s, 4322.8 GFLOPS
```
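
The figures in the example output are consistent with a simple accounting for the 1024 x 1024 x 1024 problem: `2 * M * N * K` floating-point operations, with A and B read once as 2-byte FP16 and C written once as 4-byte FP32. The sketch below reproduces that arithmetic. The accounting is inferred from the numbers above rather than taken from `matmul_benchmark.h`, and `MatmulMetrics` and `matmul_metrics` are hypothetical names.

```cpp
// Hypothetical result type for the two reported throughput figures.
struct MatmulMetrics {
    double gflops;        // 1e9 floating-point operations per second
    double gbytes_per_s;  // 1e9 bytes per second
};

// Assumed accounting (inferred from the example output, not from the sample's
// benchmark helper): 2*M*N*K flops; A and B read once as FP16, C written
// once as FP32.
inline MatmulMetrics matmul_metrics(long long m, long long n, long long k,
                                    double time_ms) {
    const double flops = 2.0 * double(m) * double(n) * double(k);
    const double bytes = 2.0 * double(m * k + k * n) + 4.0 * double(m * n);
    const double seconds = time_ms * 1e-3;
    return {flops / seconds / 1e9, bytes / seconds / 1e9};
}
```

For example, `matmul_metrics(1024, 1024, 1024, 0.0728)` gives roughly 29.5 TFLOPS and 115 GB/s, in line with the `matmul` row above. The bandwidth figure counts only the minimum traffic, so it says nothing about how often tiles are actually re-read.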
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.

@ -0,0 +1,282 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This sample demonstrates matrix multiplication using CUDA Tile C++.
* The input matrices are split into tiles. Each Tile block computes one
* output tile by loading FP16 tiles from A and B, accumulating with
* cuda::tiles::mma in FP32, and storing an FP32 result tile.
*
* The sample compares a naive implementation with an optimized implementation.
* The optimized kernel adds compiler guidance such as pointer alignment and
* divisibility assumptions, cuda::tiles::irange, and latency hints. The host
* code uses CUDA events to compare execution time and, when run with
* --validate, checks both results against a CPU reference.
*/
#include "helper_cuda.h"
#include "matmul_benchmark.h"
#include "cuda_tile.h"
#include "cuda_fp16.h"
#include <cstdlib>
#include <cstdio>
#include <vector>
constexpr int TILE_BLOCK_M = 32;
constexpr int TILE_BLOCK_N = 64;
constexpr int TILE_BLOCK_K = 64;
constexpr int LOAD_LATENCY = 5;
constexpr int STORE_LATENCY = 5;
/*
* Baseline Tile C++ matmul. This keeps the same tensor_span, partition_view,
* and ct::mma structure as the optimized kernel, but avoids optimization
* hints, assumptions, Tile-specific loop guidance, and the divisibility
* precondition used by non-masked view loads and stores.
*/
__tile_global__ void matmul_naive(float* C,
const __half* A,
const __half* B,
int M, int N, int K) {
namespace ct = cuda::tiles;
// create tensor spans with runtime shapes (FP16 for A and B)
auto a_span = ct::tensor_span{A, ct::extents{M, K}};
auto b_span = ct::tensor_span{B, ct::extents{K, N}};
auto c_span = ct::tensor_span{C, ct::extents{M, N}};
// create partition views with compile-time tile sizes
auto a_view = ct::partition_view{a_span, ct::shape<TILE_BLOCK_M, TILE_BLOCK_K>{}};
auto b_view = ct::partition_view{b_span, ct::shape<TILE_BLOCK_K, TILE_BLOCK_N>{}};
auto c_view = ct::partition_view{c_span, ct::shape<TILE_BLOCK_M, TILE_BLOCK_N>{}};
// get block indices from the 2D grid
auto [pid_m, pid_n, dummy] = ct::bid();
// initialize FP32 accumulator
auto acc = ct::zeros<ct::tile<float, ct::shape<TILE_BLOCK_M, TILE_BLOCK_N>>>();
// loop over the K dimension in blocks
int num_k_blocks = (K + TILE_BLOCK_K - 1) / TILE_BLOCK_K;
for (int k_block = 0; k_block < num_k_blocks; ++k_block) {
ct::tile<__half, ct::shape<TILE_BLOCK_M, TILE_BLOCK_K>> a_tile;
ct::tile<__half, ct::shape<TILE_BLOCK_K, TILE_BLOCK_N>> b_tile;
// load blocks of A and B (FP16), using the default zero padding for boundary tiles
a_tile = a_view.load_masked(pid_m, k_block);
b_tile = b_view.load_masked(k_block, pid_n);
// accumulate: acc += A_block @ B_block (FP16 inputs, FP32 accumulator)
// ct::mma handles mixed precision: FP16 operands with FP32 accumulator.
acc = ct::mma(a_tile, b_tile, acc);
}
// store result (FP32), suppressing stores outside the output matrix
c_view.store_masked(acc, pid_m, pid_n);
}
/*
* optimization strategy: declare non-aliasing pointers with __restrict__.
* This gives the compiler more freedom to schedule loads and stores for A,
* B, and C because the output tile cannot overlap the input tiles.
*/
__tile_global__ void matmul(float* __restrict__ _C,
const __half* __restrict__ _A,
const __half* __restrict__ _B,
int _M, int _N, int _K) {
namespace ct = cuda::tiles;
using namespace ct::literals;
/*
* optimization strategy: communicate pointer alignment explicitly.
* Assumptions are optimization facts, not runtime checks; the returned
* value must be used, and violating the assumption is undefined behavior.
*/
float* C = ct::assume_aligned(_C, 16_ic);
const __half* A = ct::assume_aligned(_A, 16_ic);
const __half* B = ct::assume_aligned(_B, 16_ic);
/*
* optimization strategy: communicate divisible problem dimensions. This
* gives the compiler static facts about index arithmetic and avoids extra
* boundary handling in this optimized divisible-size path.
*/
auto M = ct::assume_divisible(_M, 16_ic);
auto N = ct::assume_divisible(_N, 16_ic);
auto K = ct::assume_divisible(_K, 16_ic);
auto a_span = ct::tensor_span{A, ct::extents{M, K}};
auto b_span = ct::tensor_span{B, ct::extents{K, N}};
auto c_span = ct::tensor_span{C, ct::extents{M, N}};
auto a_view = ct::partition_view{a_span, ct::shape<TILE_BLOCK_M, TILE_BLOCK_K>{}};
auto b_view = ct::partition_view{b_span, ct::shape<TILE_BLOCK_K, TILE_BLOCK_N>{}};
auto c_view = ct::partition_view{c_span, ct::shape<TILE_BLOCK_M, TILE_BLOCK_N>{}};
auto [pid_m, pid_n, dummy] = ct::bid();
auto acc = ct::zeros<ct::tile<float, ct::shape<TILE_BLOCK_M, TILE_BLOCK_N>>>();
/*
* optimization strategy: use ct::irange for the K traversal. This gives the
* compiler a Tile-aware structured loop form for iterating through the
* reduction dimension, unlike the plain C++ loop used in matmul_naive.
*/
int num_k_blocks = (K + TILE_BLOCK_K - 1) / TILE_BLOCK_K;
for (auto k_block : ct::irange(0, num_k_blocks)) {
ct::tile<__half, ct::shape<TILE_BLOCK_M, TILE_BLOCK_K>> a_tile;
ct::tile<__half, ct::shape<TILE_BLOCK_K, TILE_BLOCK_N>> b_tile;
/*
* optimization strategy: attach a latency hint to the A tile load.
* Tile optimization hints can appertain to expression statements and
* influence scheduling decisions for that construct. The non-masked
* load also lets the compiler assume every element in this tile is
* in bounds; the host only launches this kernel for divisible sizes.
*/
[[
cutile::hint(0, latency=LOAD_LATENCY),
]]
a_tile = a_view.load(pid_m, k_block);
/*
* optimization strategy: attach the same latency hint to the B tile
* load so the compiler can schedule both input streams with the
* expected memory latency in mind. As above, this relies on the
* optimized kernel's divisibility precondition.
*/
[[
cutile::hint(0, latency=LOAD_LATENCY),
]]
b_tile = b_view.load(k_block, pid_n);
acc = ct::mma(a_tile, b_tile, acc);
}
/*
* optimization strategy: attach a store latency hint to the final C tile
* store. This mirrors the load hints and gives the compiler information
* about the expected memory latency of the output path. The non-masked
* store also avoids boundary masking under the same divisibility
* precondition.
*/
[[
cutile::hint(0, latency=STORE_LATENCY),
]]
c_view.store(acc, pid_m, pid_n);
}
void run_with_size(int M, int N, int K) {
if (M <= 0 || N <= 0 || K <= 0) {
std::fprintf(stderr,
"M, N, and K must be positive (got M=%d, N=%d, K=%d).\n",
M, N, K);
std::exit(EXIT_FAILURE);
}
// allocate and initialize FP16 inputs on the host
std::vector<__half> h_A(M * K), h_B(K * N);
srand(42);
for (int i = 0; i < M * K; i++) h_A[i] = __float2half((float)rand() / RAND_MAX - 0.5f);
for (int i = 0; i < K * N; i++) h_B[i] = __float2half((float)rand() / RAND_MAX - 0.5f);
std::vector<float> h_expected;
if (use_validation()) {
// build the expected result once, then reuse it for both device kernels
h_expected.resize(M * N);
matmul_cpu(h_expected.data(), h_A.data(), h_B.data(), M, N, K);
}
__half *d_A, *d_B;
float *d_C;
checkCudaErrors(cudaMalloc(&d_A, M * K * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_B, K * N * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_C, M * N * sizeof(float)));
checkCudaErrors(cudaMemcpy(d_A, h_A.data(), M * K * sizeof(__half), cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_B, h_B.data(), K * N * sizeof(__half), cudaMemcpyHostToDevice));
bool passed = true;
// both kernels run only when matmul()'s divisibility precondition is met
if (M % TILE_BLOCK_M == 0 && N % TILE_BLOCK_N == 0 && K % TILE_BLOCK_K == 0) {
dim3 grid(M / TILE_BLOCK_M, N / TILE_BLOCK_N);
auto run_kernel = [&](const char* name, auto kernel_launch) {
// clear C before each benchmarked kernel for independent validation
checkCudaErrors(cudaMemset(d_C, 0, M * N * sizeof(float)));
BenchmarkResult result = run_benchmark(name,
[&]() {
kernel_launch();
checkCudaErrors(cudaGetLastError());
},
[&]() {
std::vector<float> h_C(M * N);
checkCudaErrors(cudaMemcpy(h_C.data(), d_C, M * N * sizeof(float),
cudaMemcpyDeviceToHost));
return verify_matmul_result(name, h_C.data(), h_expected.data(), M, N);
},
M, N, K);
print_result(result);
return result.correct;
};
// run and validate the optimized kernel
passed &= run_kernel("matmul", [&]() {
matmul<<<grid, 1>>>(d_C, d_A, d_B, M, N, K);
});
// run and validate the baseline kernel
passed &= run_kernel("matmul_naive", [&]() {
matmul_naive<<<grid, 1>>>(d_C, d_A, d_B, M, N, K);
});
} else {
std::fprintf(stderr,
"Skipping M=%d, N=%d, K=%d as the optimized kernel assumes "
"dimensions divisible by TILE_BLOCK_M=%d, TILE_BLOCK_N=%d, "
"TILE_BLOCK_K=%d.\n",
M, N, K, TILE_BLOCK_M, TILE_BLOCK_N, TILE_BLOCK_K);
passed = false;
}
checkCudaErrors(cudaFree(d_A));
checkCudaErrors(cudaFree(d_B));
checkCudaErrors(cudaFree(d_C));
if (!passed) {
std::exit(EXIT_FAILURE);
}
}
int main(int argc, char** argv) {
parse_benchmark_args(argc, argv);
run_with_size(1024, 1024, 1024);
}

@ -0,0 +1,67 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileMatmulAutotuner LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
get_filename_component(CUDA_TOOLKIT_BIN_DIR "${CMAKE_CUDA_COMPILER}" DIRECTORY)
find_program(TILEIRAS_EXECUTABLE tileiras
HINTS "${CUDA_TOOLKIT_BIN_DIR}"
REQUIRED
)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common ../Benchmark_Common)
# Source files
add_executable(tileMatmulAutotuner
matmul_autotuner.cpp
autotuner_search_space.conf
)
target_compile_features(tileMatmulAutotuner PRIVATE cxx_std_20 cuda_std_20)
target_include_directories(tileMatmulAutotuner PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
list(GET CUDAToolkit_INCLUDE_DIRS 0 CUDA_INCLUDE_DIR)
file(TO_CMAKE_PATH "${CUDA_INCLUDE_DIR}" CUDA_INCLUDE_DIR_FOR_DEFINE)
file(TO_CMAKE_PATH "${TILEIRAS_EXECUTABLE}" TILEIRAS_EXECUTABLE_FOR_DEFINE)
file(TO_CMAKE_PATH "${CMAKE_CUDA_COMPILER}" NVCC_EXECUTABLE_FOR_DEFINE)
target_compile_definitions(tileMatmulAutotuner PRIVATE
CUDA_INCLUDE_PATH="${CUDA_INCLUDE_DIR_FOR_DEFINE}"
NVCC_PATH="${NVCC_EXECUTABLE_FOR_DEFINE}"
TILEIRAS_PATH="${TILEIRAS_EXECUTABLE_FOR_DEFINE}"
)
target_link_libraries(tileMatmulAutotuner PRIVATE
CUDA::cuda_driver
CUDA::cudart
CUDA::nvrtc
)
add_custom_command(TARGET tileMatmulAutotuner POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_if_different
${CMAKE_CURRENT_SOURCE_DIR}/matmul.cu
${CMAKE_CURRENT_BINARY_DIR}
COMMAND ${CMAKE_COMMAND} -E copy_if_different
${CMAKE_CURRENT_SOURCE_DIR}/autotuner_search_space.conf
${CMAKE_CURRENT_BINARY_DIR}
)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

@ -0,0 +1,59 @@
# tileMatmulAutotuner
## Description
A CUDA Tile C++ sample demonstrating how to autotune tile sizes and optimization hints for matrix multiplication with FP16 inputs and FP32 accumulation when compiling with nvrtc or nvcc.
The sample explores combinations of tile sizes and optimization hints, compiles the Tile C++ matrix multiplication kernel, executes it, and reports the best measured configuration. The launch grid is derived from the selected tile size and matrix dimensions.
The search space is read from `autotuner_search_space.conf` so one can edit tile sizes and hint values without rebuilding the sample.
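
As a small sketch of how a launch grid can be derived from a tile size and the matrix dimensions, the snippet below uses a ceiling division per output dimension. `Grid2D`, `ceil_div`, and `launch_grid` are hypothetical names, and mapping the x dimension to M tiles is an assumption that mirrors the `tileMatmul` sample rather than something confirmed for the autotuner.

```cpp
// Hypothetical 2D grid: x counts tiles along M, y counts tiles along N.
struct Grid2D {
    int x;
    int y;
};

// Ceiling division for positive integers.
inline int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

// One Tile block per output tile of size block_m x block_n.
inline Grid2D launch_grid(int m, int n, int block_m, int block_n) {
    return {ceil_div(m, block_m), ceil_div(n, block_n)};
}
```

For a 1024 x 1024 output with a `tile 128 64 32` entry this yields an 8 x 16 grid, i.e. 128 blocks. `block_k` only affects the number of K-loop iterations inside each block, not the grid.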
## Running
```
# Run autotuner. Validation is disabled by default.
./tileMatmulAutotuner
# Select a backend
./tileMatmulAutotuner --backend=nvrtc
./tileMatmulAutotuner --backend=nvcc
# Enable CPU validation
./tileMatmulAutotuner --validate
# To run faster, skip warmups and set the iteration count to 1
./tileMatmulAutotuner --skip-warmup -i 1
# Show all options
./tileMatmulAutotuner --help
```
NOTE: When using the nvrtc backend, the libnvrtc library must be on the dynamic library search path. This may require adding the CUDA Toolkit `lib` directory to the `LD_LIBRARY_PATH` (Linux) or `PATH` (Windows) environment variable.
## Command-Line Options
| Option | Description |
|--------|-------------|
| `--validate` | Enable CPU cross-validation. Validation is disabled by default. |
| `--skip-warmup` | Disable warmup iterations. |
| `--warmup=N` | Set warmup iterations. The default is 5. |
| `-i N`, `--iters=N` | Set benchmark iterations. The default is 20. |
| `--backend=nvrtc\|nvcc` | Select the backend. The default is `nvrtc`. |
## Search Space Configuration
Each `tile` line adds one `TILE_BLOCK_M`, `TILE_BLOCK_N`, and `TILE_BLOCK_K` combination. The `load_latency` and `store_latency` lines list values that are combined with every `tile` entry.
```
tile 64 64 32
tile 128 64 32
load_latency 2 5 8
store_latency 2 5 8
```
Lines beginning with `#` are comments.
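
The format described above can be read with a few lines of standard C++. The following is a minimal sketch and not the sample's own parser: `Tile`, `Space`, `parse_search_space`, and `candidate_count` are hypothetical names, malformed `tile` lines are silently skipped, and unknown keys are ignored.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for this sketch (not the sample's own structs).
struct Tile {
    int m, n, k;
};

struct Space {
    std::vector<Tile> tiles;
    std::vector<int> load_latency;
    std::vector<int> store_latency;
};

// Parses the text of a search-space file: `tile M N K` lines add one tile,
// `load_latency ...` / `store_latency ...` lines append hint values, and
// blank lines or lines starting with `#` are skipped.
inline Space parse_search_space(const std::string& text) {
    Space space;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string key;
        if (!(fields >> key) || key[0] == '#') continue;
        if (key == "tile") {
            Tile t{};
            if (fields >> t.m >> t.n >> t.k) space.tiles.push_back(t);
        } else if (key == "load_latency" || key == "store_latency") {
            std::vector<int>& out =
                (key == "load_latency") ? space.load_latency : space.store_latency;
            for (int v; fields >> v;) out.push_back(v);
        }
    }
    return space;
}

// Every tile is combined with every load and store latency value.
inline std::size_t candidate_count(const Space& s) {
    return s.tiles.size() * s.load_latency.size() * s.store_latency.size();
}
```

For the four-line example above this gives 2 tiles x 3 load latencies x 3 store latencies = 18 candidate configurations, which is the number of kernels an exhaustive search over that file would have to compile and time.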
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later. The NVRTC backend invokes `tileiras` to compile Tile IR to cubin. JIT-compiling Tile IR to cubin with the CUDA Driver API instead requires version 590 or later.
- Host compiler with C++20 support.

@ -0,0 +1,15 @@
# tileMatmulAutotuner search space
#
# The autotuner tries every tile entry with every load_latency and
# store_latency value listed below. Edit this file to experiment without
# rebuilding the sample.
# tile block_m block_n block_k
tile 64 64 32
tile 128 64 32
tile 64 128 32
tile 128 128 32
# latency hint values
load_latency 2 5 8
store_latency 2 5 8

@ -0,0 +1,314 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#pragma once
#include <cerrno>
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <helper_string.h>
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
#include <process.h>
#else
#include <unistd.h>
#endif
enum class CompilerBackend {
NVRTC,
NVCC
};
struct CompiledKernel {
std::vector<char> image;
};
struct TileConfig {
int block_m;
int block_n;
int block_k;
};
struct SearchSpace {
std::vector<TileConfig> tile_options;
std::vector<int> load_latency_options;
std::vector<int> store_latency_options;
};
static constexpr const char *kSearchSpaceFileName = "autotuner_search_space.conf";
inline const char* compilerBackendName(CompilerBackend compiler_backend) {
return compiler_backend == CompilerBackend::NVCC ? "NVCC" : "NVRTC";
}
inline int ceilDiv(int a, int b) {
return (a + b - 1) / b;
}
inline unsigned long getProcessId() {
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
return static_cast<unsigned long>(_getpid());
#else
return static_cast<unsigned long>(getpid());
#endif
}
inline std::string makeTempPath(const char *prefix, const char *suffix) {
static unsigned int counter = 0;
std::string filename = std::string(prefix) + "_" + std::to_string(getProcessId()) + "_" +
std::to_string(static_cast<long>(time(NULL))) + "_" +
std::to_string(counter++) + suffix;
return (std::filesystem::temp_directory_path() / filename).string();
}
inline std::string shellQuote(const std::string& arg) {
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
std::string quoted = "\"";
for (char c : arg) {
if (c == '"') {
quoted += "\\\"";
} else {
quoted += c;
}
}
quoted += "\"";
return quoted;
#else
std::string quoted = "'";
for (char c : arg) {
if (c == '\'') {
quoted += "'\\''";
} else {
quoted += c;
}
}
quoted += "'";
return quoted;
#endif
}
inline std::string joinShellCommand(const std::vector<std::string>& args) {
std::string cmd;
for (const auto& arg : args) {
if (!cmd.empty()) {
cmd += " ";
}
cmd += shellQuote(arg);
}
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
return "\"" + cmd + "\"";
#else
return cmd;
#endif
}
inline std::vector<char> readBinaryFile(const std::string& path) {
std::ifstream file(path, std::ios::in | std::ios::binary | std::ios::ate);
if (!file.is_open()) {
std::cerr << "\nerror: unable to open " << path << " for reading!\n";
exit(EXIT_FAILURE);
}
std::streamsize size = file.tellg();
if (size < 0) {
std::cerr << "\nerror: unable to determine size of " << path << "\n";
exit(EXIT_FAILURE);
}
std::vector<char> data(static_cast<size_t>(size));
file.seekg(0, std::ios::beg);
if (size > 0 && !file.read(data.data(), size)) {
std::cerr << "\nerror: unable to read " << path << "\n";
exit(EXIT_FAILURE);
}
return data;
}
inline std::string baseNameWithoutExtension(const std::string& path) {
size_t slash = path.find_last_of("/\\");
std::string base = (slash == std::string::npos) ? path : path.substr(slash + 1);
size_t dot = base.find_last_of('.');
if (dot != std::string::npos) {
base.resize(dot);
}
return base;
}
inline void appendTileBlockMacroOptions(std::vector<std::string>& options,
int block_m, int block_n, int block_k) {
options.push_back("-DTILE_BLOCK_M=" + std::to_string(block_m));
options.push_back("-DTILE_BLOCK_N=" + std::to_string(block_n));
options.push_back("-DTILE_BLOCK_K=" + std::to_string(block_k));
}
inline bool parsePositiveInt(const std::string& text, int *value) {
char *end = nullptr;
errno = 0;
long parsed = std::strtol(text.c_str(), &end, 10);
if (errno != 0 || end == text.c_str() || *end != '\0' ||
parsed <= 0 || parsed > INT_MAX) {
return false;
}
*value = static_cast<int>(parsed);
return true;
}
inline void searchSpaceError(const char *filename,
int line_number,
const std::string& message) {
fprintf(stderr, "Error: %s:%d: %s\n", filename, line_number, message.c_str());
exit(EXIT_FAILURE);
}
inline char *copyFilePath(const std::string& path) {
char *file_path = reinterpret_cast<char *>(malloc(path.length() + 1));
if (file_path == NULL) {
fprintf(stderr, "Error: failed to allocate memory for file path\n");
exit(EXIT_FAILURE);
}
std::memcpy(file_path, path.c_str(), path.length() + 1);
return file_path;
}
inline char *findSampleFile(const char *filename, const char *executable_path) {
if (executable_path != NULL) {
std::filesystem::path executable_dir =
std::filesystem::path(executable_path).parent_path();
if (!executable_dir.empty()) {
std::filesystem::path candidate = executable_dir / filename;
if (std::filesystem::exists(candidate)) {
return copyFilePath(candidate.string());
}
}
}
return sdkFindFilePath(filename, executable_path);
}
inline SearchSpace loadSearchSpace(const char *filename) {
std::ifstream input(filename);
if (!input.is_open()) {
fprintf(stderr, "Error: unable to open search space file %s\n", filename);
exit(EXIT_FAILURE);
}
SearchSpace search_space;
std::string line;
int line_number = 0;
while (std::getline(input, line)) {
line_number++;
size_t comment = line.find('#');
if (comment != std::string::npos) {
line.resize(comment);
}
std::istringstream tokens(line);
std::string directive;
if (!(tokens >> directive)) {
continue;
}
std::vector<std::string> values;
std::string value;
while (tokens >> value) {
values.push_back(value);
}
if (directive == "tile") {
if (values.size() != 3) {
searchSpaceError(filename, line_number,
"tile expects block_m block_n block_k");
}
TileConfig tile = {};
if (!parsePositiveInt(values[0], &tile.block_m) ||
!parsePositiveInt(values[1], &tile.block_n) ||
!parsePositiveInt(values[2], &tile.block_k)) {
searchSpaceError(filename, line_number,
"tile values must be positive integers");
}
search_space.tile_options.push_back(tile);
} else if (directive == "load_latency") {
if (values.empty()) {
searchSpaceError(filename, line_number,
"load_latency expects at least one value");
}
for (const auto& option : values) {
int latency = 0;
if (!parsePositiveInt(option, &latency)) {
searchSpaceError(filename, line_number,
"load_latency values must be positive integers");
}
search_space.load_latency_options.push_back(latency);
}
} else if (directive == "store_latency") {
if (values.empty()) {
searchSpaceError(filename, line_number,
"store_latency expects at least one value");
}
for (const auto& option : values) {
int latency = 0;
if (!parsePositiveInt(option, &latency)) {
searchSpaceError(filename, line_number,
"store_latency values must be positive integers");
}
search_space.store_latency_options.push_back(latency);
}
} else {
searchSpaceError(filename, line_number,
"unknown search space directive '" + directive + "'");
}
}
if (search_space.tile_options.empty()) {
fprintf(stderr, "Error: search space file %s does not list any tile entries\n", filename);
exit(EXIT_FAILURE);
}
if (search_space.load_latency_options.empty()) {
fprintf(stderr, "Error: search space file %s does not list any load_latency entries\n",
filename);
exit(EXIT_FAILURE);
}
if (search_space.store_latency_options.empty()) {
fprintf(stderr, "Error: search space file %s does not list any store_latency entries\n",
filename);
exit(EXIT_FAILURE);
}
return search_space;
}


@ -0,0 +1,111 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#pragma once
#include "backend_common.h"
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>
#include <vector>
inline CompiledKernel compileFileWithNVCC(const char *filename,
int sm_value,
int block_m, int block_n, int block_k,
const std::vector<std::string>& extra_flags) {
// Check CUDA include path for cuda_fp16.h
const char *include_path = CUDA_INCLUDE_PATH;
if (include_path[0] == '\0') {
fprintf(stderr, "\n ERROR: unable to locate CUDA include directory containing cuda_fp16.h\n");
exit(EXIT_FAILURE);
}
std::filesystem::path keep_dir = makeTempPath("matmul_nvcc_keep", "");
std::error_code ec;
if (!std::filesystem::create_directory(keep_dir, ec)) {
std::cerr << "\nerror: unable to create " << keep_dir.string()
<< " (" << ec.message() << ")\n";
exit(EXIT_FAILURE);
}
std::string base = baseNameWithoutExtension(filename);
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
const char *object_suffix = ".obj";
#else
const char *object_suffix = ".o";
#endif
std::filesystem::path object_file = keep_dir / (base + object_suffix);
std::filesystem::path tile_cubin_file = keep_dir / (base + ".tile.cubin");
std::vector<std::string> args = {
NVCC_PATH,
"--enable-tile",
"-std=c++20",
"-arch=sm_" + std::to_string(sm_value),
"-lineinfo",
"-c",
filename,
"-o",
object_file.string(),
"--keep",
"--keep-dir",
keep_dir.string(),
"-I",
include_path
};
appendTileBlockMacroOptions(args, block_m, block_n, block_k);
for (const auto& flag : extra_flags) {
args.push_back(flag);
}
std::string cmd = joinShellCommand(args);
std::cerr << "\nCompiling file with NVCC\n";
int ret = system(cmd.c_str());
if (ret != 0) {
fprintf(stderr, "Error: nvcc compilation failed (system() returned %d)\n", ret);
fprintf(stderr, "Command: %s\n", cmd.c_str());
std::filesystem::remove_all(keep_dir);
exit(EXIT_FAILURE);
}
if (!std::filesystem::exists(tile_cubin_file)) {
fprintf(stderr, "Error: nvcc did not produce expected Tile cubin %s\n",
tile_cubin_file.string().c_str());
std::filesystem::remove_all(keep_dir);
exit(EXIT_FAILURE);
}
CompiledKernel kernel;
kernel.image = readBinaryFile(tile_cubin_file.string());
std::filesystem::remove_all(keep_dir);
return kernel;
}


@ -0,0 +1,186 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#pragma once
#include "backend_common.h"
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <nvrtc.h>
#define NVRTC_SAFE_CALL(Name, x) \
do { \
nvrtcResult result = (x); \
if (result != NVRTC_SUCCESS) { \
std::cerr << "\nerror: " << Name << " failed with error " << \
nvrtcGetErrorString(result) << "\n"; \
exit(EXIT_FAILURE); \
} \
} while(0)
inline std::vector<char> compileTileIRToCubin(const char *tileIR,
size_t tileIRSize,
int sm_value) {
// Generate unique temporary file names using PID, timestamp, and a local counter.
std::string tileir_file = makeTempPath("tileir", ".bc");
std::string cubin_file = makeTempPath("cubin", ".cubin");
// Write TileIR to a temporary file because tileiras consumes files rather than memory buffers.
FILE* fp = fopen(tileir_file.c_str(), "wb");
if (!fp) {
fprintf(stderr, "Error: failed to open %s for writing\n", tileir_file.c_str());
exit(EXIT_FAILURE);
}
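// Check both the write and the close so a short write is not passed on to tileiras.
size_t written = fwrite(tileIR, 1, tileIRSize, fp);
if (fclose(fp) != 0 || written != tileIRSize) {
fprintf(stderr, "Error: failed to write %s\n", tileir_file.c_str());
remove(tileir_file.c_str());
exit(EXIT_FAILURE);
}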
/*
* The CUDA Driver API can also JIT-compile Tile IR to cubin, but that path
* requires driver version 590 or later. To keep the sample usable with
* driver 580, it invokes tileiras from the CUDA Toolkit instead.
*
* The Driver API path would look like:
* (requires #include <helper_cuda_drvapi.h> and <cstdint> for uintptr_t)
*
* CUmodule module;
* CUjit_option options[] = {
* CU_JIT_GENERATE_LINE_INFO
* };
* void* optionValues[] = {
* (void*)(uintptr_t)1, // line info
* };
* unsigned int numOptions = sizeof(options) / sizeof(options[0]);
* checkCudaErrors(cuModuleLoadDataEx(&module, tileIR, numOptions,
* options, optionValues));
*/
// Compile TileIR to cubin using tileiras from the configured CUDA Toolkit.
// This happens before benchmarking so the timed path matches the NVCC backend.
std::string cmd = joinShellCommand({
TILEIRAS_PATH,
"-arch=sm_" + std::to_string(sm_value),
tileir_file,
"-o",
cubin_file
});
int ret = system(cmd.c_str());
if (ret != 0) {
fprintf(stderr, "Error: tileiras compilation failed (system() returned %d)\n", ret);
fprintf(stderr, "Command: %s\n", cmd.c_str());
remove(tileir_file.c_str());
remove(cubin_file.c_str());
exit(EXIT_FAILURE);
}
// Read the generated cubin into memory so later benchmark iterations do not depend on temp files.
std::vector<char> cubin = readBinaryFile(cubin_file);
// Remove temporary files after the cubin has been captured.
remove(tileir_file.c_str());
remove(cubin_file.c_str());
return cubin;
}
inline CompiledKernel compileFileWithNVRTC(const char *filename,
int sm_value,
int block_m, int block_n, int block_k,
const std::vector<std::string>& extra_flags) {
// Check include path for cuda_fp16.h
const char *ptr = CUDA_INCLUDE_PATH;
if (ptr[0] == '\0') {
fprintf(stderr, "\n ERROR: unable to locate CUDA include directory containing cuda_fp16.h\n");
exit(EXIT_FAILURE);
}
std::vector<std::string> option_storage = {
"-I",
ptr,
"-std=c++20",
"-enable-tile",
"-lineinfo",
"-arch=compute_" + std::to_string(sm_value)
};
appendTileBlockMacroOptions(option_storage, block_m, block_n, block_k);
for (const auto& flag : extra_flags) {
option_storage.push_back(flag);
}
std::vector<const char*> argv_vec;
for (const auto& option : option_storage) {
argv_vec.push_back(option.c_str());
}
const char **argv = argv_vec.data();
int argc = static_cast<int>(argv_vec.size());
std::cerr << "\nCompiling file with NVRTC\n";
std::ifstream inputFile(filename, std::ios::in | std::ios::binary |
std::ios::ate);
if (!inputFile.is_open()) {
std::cerr << "\nerror: unable to open " << filename << " for reading!\n";
exit(EXIT_FAILURE);
}
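// tellg() returns -1 on failure; check it before using it as a size.
std::streamoff pos = inputFile.tellg();
if (pos < 0) {
std::cerr << "\nerror: unable to determine size of " << filename << "\n";
exit(EXIT_FAILURE);
}
size_t inputSize = static_cast<size_t>(pos);
char *memBlock = new char[inputSize + 1];
inputFile.seekg(0, std::ios::beg);
if (inputSize > 0 && !inputFile.read(memBlock, static_cast<std::streamsize>(inputSize))) {
std::cerr << "\nerror: unable to read " << filename << "\n";
exit(EXIT_FAILURE);
}
inputFile.close();
memBlock[inputSize] = '\0';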
// Compile the source string to PTX and Tile IR.
nvrtcProgram prog;
NVRTC_SAFE_CALL("nvrtcCreateProgram", nvrtcCreateProgram(&prog, memBlock,
"testprog", 0, NULL, NULL));
nvrtcResult res = nvrtcCompileProgram(prog, argc, argv);
// Dump the NVRTC compilation log
size_t logSize;
NVRTC_SAFE_CALL("nvrtcGetProgramLogSize", nvrtcGetProgramLogSize(prog, &logSize));
// logSize includes the trailing NUL; std::string owns the buffer, so no unchecked malloc/free.
std::string log(logSize, '\0');
NVRTC_SAFE_CALL("nvrtcGetProgramLog", nvrtcGetProgramLog(prog, log.data()));
std::cerr << "\n compilation log ---\n";
std::cerr << log.c_str();
std::cerr << "\n end log ---\n\n";
NVRTC_SAFE_CALL("nvrtcCompileProgram", res);
// Fetch Tile IR and compile it to cubin before benchmarking.
size_t tileIRSize;
NVRTC_SAFE_CALL("nvrtcGetTileIRSize", nvrtcGetTileIRSize(prog, &tileIRSize));
std::vector<char> tileIR(tileIRSize);
NVRTC_SAFE_CALL("nvrtcGetTileIR", nvrtcGetTileIR(prog, tileIR.data()));
CompiledKernel kernel;
kernel.image = compileTileIRToCubin(tileIR.data(), tileIR.size(), sm_value);
NVRTC_SAFE_CALL("nvrtcDestroyProgram", nvrtcDestroyProgram(&prog));
delete[] memBlock;
return kernel;
}


@ -0,0 +1,101 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* CUDA Tile C++ matrix multiplication kernel used by tileMatmulAutotuner.
*
* This sample implements a tiled FP16 -> FP32 matrix multiplication with
* ct::partition_view and ct::mma. The autotuner compiles this file repeatedly
* with TILE_BLOCK_M, TILE_BLOCK_N, TILE_BLOCK_K, LOAD_LATENCY, and
* STORE_LATENCY defined on the compiler command line.
*
* Approach:
* - Uses ct::tensor_span and ct::partition_view for blocked access.
* - Uses a K-dimension accumulation loop with ct::mma.
* - Loads FP16 inputs into tiles and accumulates into FP32.
*/
#include "cuda_tile.h"
#include <cuda_fp16.h>
namespace ct = cuda::tiles;
extern "C" __tile_global__ void matmul_tile(float* __restrict__ _C,
const __half* __restrict__ _A,
const __half* __restrict__ _B,
int _M, int _N, int _K) {
float* C = ct::assume_aligned<16>(_C);
const __half* A = ct::assume_aligned<16>(_A);
const __half* B = ct::assume_aligned<16>(_B);
auto M = ct::assume_divisible<16>(_M);
auto N = ct::assume_divisible<16>(_N);
auto K = ct::assume_divisible<16>(_K);
// Create tensor spans with runtime shapes (FP16 for A and B)
auto a_span = ct::tensor_span{A, ct::extents{M, K}};
auto b_span = ct::tensor_span{B, ct::extents{K, N}};
auto c_span = ct::tensor_span{C, ct::extents{M, N}};
// Create partition views with compile-time block sizes
auto a_view = ct::partition_view{a_span, ct::shape<TILE_BLOCK_M, TILE_BLOCK_K>{}};
auto b_view = ct::partition_view{b_span, ct::shape<TILE_BLOCK_K, TILE_BLOCK_N>{}};
auto c_view = ct::partition_view{c_span, ct::shape<TILE_BLOCK_M, TILE_BLOCK_N>{}};
// get block indices from the 2D grid
auto [pid_m, pid_n, dummy] = ct::bid();
// initialize FP32 accumulator
auto acc = ct::zeros<ct::tile<float, ct::shape<TILE_BLOCK_M, TILE_BLOCK_N>>>();
// loop over the K dimension in blocks
int num_k_blocks = (K + TILE_BLOCK_K - 1) / TILE_BLOCK_K;
for (auto k_block : ct::irange(0, num_k_blocks)) {
ct::tile<__half, ct::shape<TILE_BLOCK_M, TILE_BLOCK_K>> a_tile;
ct::tile<__half, ct::shape<TILE_BLOCK_K, TILE_BLOCK_N>> b_tile;
// load blocks of A and B (FP16)
[[
cutile::hint(0, latency=LOAD_LATENCY),
]]
a_tile = a_view.load(pid_m, k_block);
[[
cutile::hint(0, latency=LOAD_LATENCY),
]]
b_tile = b_view.load(k_block, pid_n);
// accumulate: acc += A_block @ B_block (FP16 inputs, FP32 accumulator)
// ct::mma handles mixed precision: FP16 operands with FP32 accumulator.
acc = ct::mma(a_tile, b_tile, acc);
}
// store result (FP32)
[[
cutile::hint(0, latency=STORE_LATENCY),
]]
c_view.store(acc, pid_m, pid_n);
}


@ -0,0 +1,331 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* CUDA Tile C++ matrix multiplication autotuner.
*
* This sample is the host-side driver for the tiled matrix multiplication
* kernel in matmul.cu. It compiles that kernel repeatedly with different
* TILE_BLOCK_M, TILE_BLOCK_N, TILE_BLOCK_K, LOAD_LATENCY, and STORE_LATENCY
* values configured in a search-space file, derives the launch grid from the
* selected tile size, and reports the fastest configuration for the requested
* problem size.
*
* Backend flow:
* - NVRTC compiles matmul.cu to TileIR, then invokes tileiras to produce a
* cubin image.
* - NVCC compiles matmul.cu as a standalone source file and reuses the
* generated Tile cubin artifact.
* - Both paths load the resulting cubin with the CUDA Driver API and launch
* the same matmul_tile entry point.
*/
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <cuda.h>
#include <cuda_fp16.h>
#include "backend_common.h"
#include "backend_nvcc.h"
#include "backend_nvrtc.h"
#include "matmul_benchmark.h"
#include <helper_cuda_drvapi.h>
// global SM value (compute capability)
static int smValue = 0;
static constexpr const char *kMatmulKernelName = "matmul_tile";
CompilerBackend parseCompilerBackendValue(const char *value) {
if (std::strcmp(value, "nvrtc") == 0) {
return CompilerBackend::NVRTC;
}
if (std::strcmp(value, "nvcc") == 0) {
return CompilerBackend::NVCC;
}
fprintf(stderr, "Error: unsupported backend '%s'\n", value);
fprintf(stderr, "Expected 'nvrtc' or 'nvcc'.\n");
exit(EXIT_FAILURE);
}
void printCompilerOptions() {
printf("Backend options:\n");
printf(" --backend=nvrtc|nvcc Select backend (default: NVRTC)\n");
printf("\n");
}
CompilerBackend parseCompilerBackendArgs(int argc, char** argv, std::vector<char*>& benchmark_argv) {
CompilerBackend compiler_backend = CompilerBackend::NVRTC;
benchmark_argv.clear();
benchmark_argv.push_back(argv[0]);
for (int i = 1; i < argc; i++) {
if (std::strcmp(argv[i], "--help") == 0 || std::strcmp(argv[i], "-h") == 0) {
printCompilerOptions();
benchmark_argv.push_back(argv[i]);
} else if (std::strncmp(argv[i], "--backend=", 10) == 0) {
compiler_backend = parseCompilerBackendValue(argv[i] + 10);
} else if (std::strcmp(argv[i], "--backend") == 0) {
if (i + 1 >= argc) {
fprintf(stderr, "Error: %s requires an argument\n", argv[i]);
exit(EXIT_FAILURE);
}
compiler_backend = parseCompilerBackendValue(argv[++i]);
} else {
benchmark_argv.push_back(argv[i]);
}
}
return compiler_backend;
}
void setSMValue() {
CUdevice device;
int major = 0, minor = 0;
// initialize the CUDA Driver API
checkCudaErrors(cuInit(0));
// get the first device (device 0)
checkCudaErrors(cuDeviceGet(&device, 0));
checkCudaErrors(cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device));
checkCudaErrors(cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device));
printf("GPU Compute Capability: %d.%d\n", major, minor);
smValue = major * 10 + minor;
}
CompiledKernel compileFile(const char *filename,
int block_m, int block_n, int block_k,
CompilerBackend compiler_backend,
const std::vector<std::string>& extra_flags = {}) {
if (compiler_backend == CompilerBackend::NVCC) {
return compileFileWithNVCC(filename, smValue, block_m, block_n, block_k, extra_flags);
}
return compileFileWithNVRTC(filename, smValue, block_m, block_n, block_k, extra_flags);
}
void loadAndExecuteKernel(const CompiledKernel& compiled_kernel,
CUdeviceptr d_A, CUdeviceptr d_B, CUdeviceptr d_C,
int M, int N, int K,
unsigned int gridDimX, unsigned int gridDimY, unsigned int sMem) {
CUmodule module;
CUfunction kernel_addr;
void* args[] = {
(void*)&d_C,
(void*)&d_A,
(void*)&d_B,
(void*)&M,
(void*)&N,
(void*)&K
};
checkCudaErrors(cuModuleLoadData(&module, compiled_kernel.image.data()));
checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, kMatmulKernelName));
checkCudaErrors(cuLaunchKernel(kernel_addr,
gridDimX, gridDimY, 1, // grid dim
1, 1, 1, // block dim
sMem, 0, // shared mem, stream
args, // arguments
NULL));
checkCudaErrors(cuCtxSynchronize());
// cleanup
checkCudaErrors(cuModuleUnload(module));
}
void autotuner(int M, int N, int K,
const char *kernel_file,
const SearchSpace& search_space,
CompilerBackend compiler_backend) {
printf("\n=== Matrix: C[%dx%d] = A[%dx%d] x B[%dx%d] (FP16->FP32) ===\n",
M, N, M, K, K, N);
printf(" FLOPs: %.2f GFLOP\n", 2.0 * M * N * K / 1e9);
// allocate and initialize (FP16 for A and B)
std::vector<__half> h_A(M * K), h_B(K * N);
std::vector<float> h_C;
srand(42);
for (int i = 0; i < M * K; i++) h_A[i] = __float2half((float)rand() / RAND_MAX - 0.5f);
for (int i = 0; i < K * N; i++) h_B[i] = __float2half((float)rand() / RAND_MAX - 0.5f);
// compute the CPU reference unless validation is disabled
if (use_validation()) {
h_C.resize(M * N);
matmul_cpu(h_C.data(), h_A.data(), h_B.data(), M, N, K);
}
// allocate device memory
CUdeviceptr d_A, d_B, d_C;
checkCudaErrors(cuMemAlloc(&d_A, M * K * sizeof(__half)));
checkCudaErrors(cuMemAlloc(&d_B, K * N * sizeof(__half)));
checkCudaErrors(cuMemAlloc(&d_C, M * N * sizeof(float)));
checkCudaErrors(cuMemcpyHtoD(d_A, h_A.data(), M * K * sizeof(__half)));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B.data(), K * N * sizeof(__half)));
// helper lambda to clear output
auto clear_output = [&]() {
std::vector<float> zeros(M * N, 0.0f);
checkCudaErrors(cuMemcpyHtoD(d_C, zeros.data(), M * N * sizeof(float)));
};
struct AutotuneResult {
int block_m;
int block_n;
int block_k;
int grid_x;
int grid_y;
int load_latency;
int store_latency;
BenchmarkResult result;
};
std::vector<AutotuneResult> autotune_results;
size_t config_count = 0;
size_t total_configs = search_space.tile_options.size() *
search_space.load_latency_options.size() *
search_space.store_latency_options.size();
for (const auto& tile : search_space.tile_options) {
int grid_x = ceilDiv(M, tile.block_m);
int grid_y = ceilDiv(N, tile.block_n);
for (int load_lat : search_space.load_latency_options) {
for (int store_lat : search_space.store_latency_options) {
config_count++;
printf(" [%zu/%zu] ", config_count, total_configs);
std::vector<std::string> compile_flags = {
"-DLOAD_LATENCY=" + std::to_string(load_lat),
"-DSTORE_LATENCY=" + std::to_string(store_lat)
};
CompiledKernel compiled_kernel = compileFile(kernel_file,
tile.block_m, tile.block_n, tile.block_k,
compiler_backend,
compile_flags);
clear_output();
std::string config_name = "bm=" + std::to_string(tile.block_m) +
",bn=" + std::to_string(tile.block_n) +
",bk=" + std::to_string(tile.block_k) +
",gx=" + std::to_string(grid_x) +
",gy=" + std::to_string(grid_y) +
",ld=" + std::to_string(load_lat) +
",st=" + std::to_string(store_lat);
auto result = run_benchmark(config_name.c_str(),
[&]() {
loadAndExecuteKernel(compiled_kernel, d_A, d_B, d_C, M, N, K,
grid_x, grid_y, 0);
},
[&]() {
std::vector<float> h_result(M * N);
checkCudaErrors(cuMemcpyDtoH(h_result.data(), d_C,
M * N * sizeof(float)));
return verify_matmul_result(config_name.c_str(),
h_result.data(), h_C.data(), M, N);
},
M, N, K);
print_result(result);
autotune_results.push_back({tile.block_m, tile.block_n, tile.block_k,
grid_x, grid_y, load_lat, store_lat, result});
}
}
}
// find the best configuration by GFLOPS
auto best = std::max_element(autotune_results.begin(), autotune_results.end(),
[](const AutotuneResult& a, const AutotuneResult& b) {
return a.result.gflops < b.result.gflops;
});
printf("\n *** BEST CONFIGURATION ***\n");
printf(" BLOCK_M=%d, BLOCK_N=%d, BLOCK_K=%d\n",
best->block_m, best->block_n, best->block_k);
printf(" LOAD_LATENCY=%d, STORE_LATENCY=%d, grid_x=%d, grid_y=%d\n",
best->load_latency, best->store_latency, best->grid_x, best->grid_y);
printf(" Performance: %.1f GFLOPS, %.3f ms, %.1f GB/s\n",
best->result.gflops, best->result.time_ms, best->result.bandwidth_gb_s);
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
}
int main(int argc, char** argv) {
std::vector<char*> benchmark_argv;
CompilerBackend compiler_backend = parseCompilerBackendArgs(argc, argv, benchmark_argv);
parse_benchmark_args(static_cast<int>(benchmark_argv.size()), benchmark_argv.data());
print_device_info();
// initialize CUDA and get compute capability
setSMValue();
CUcontext context;
CUdevice cuDevice = 0;
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGet(&cuDevice, 0));
checkCudaErrors(cuCtxCreate(&context, NULL, 0, cuDevice));
printf("\nMatrix Multiplication Autotuner (FP16 inputs, FP32 accumulate)\n");
printf("==============================================================\n");
printf("Backend: %s\n", compilerBackendName(compiler_backend));
char *kernel_file = findSampleFile("matmul.cu", argv[0]);
if (kernel_file == NULL) {
fprintf(stderr, "Error: unable to locate matmul.cu\n");
return 1;
}
char *search_space_file = findSampleFile(kSearchSpaceFileName, argv[0]);
if (search_space_file == NULL) {
fprintf(stderr, "Error: unable to locate %s\n", kSearchSpaceFileName);
free(kernel_file);
return 1;
}
SearchSpace search_space = loadSearchSpace(search_space_file);
printf("Search space: %s\n", search_space_file);
printf("Tuning for M=1024, N=4096, K=1024\n");
autotuner(1024, 4096, 1024, kernel_file, search_space, compiler_backend);
free(kernel_file);
free(search_space_file);
return 0;
}
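
The best-configuration step at the end of `autotuner` can be sketched in isolation as follows. This is a minimal host-only illustration, not the sample's code: `TuneEntry` and `pickBest` are hypothetical names, and only the fields needed for the comparison are kept. It also shows one way to handle an empty result list, where `std::max_element` returns `end()`.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical stand-in for the sample's AutotuneResult; only the fields
// needed for the selection step are kept.
struct TuneEntry {
    int block_m;
    int block_n;
    int block_k;
    double gflops;
};

// Returns the entry with the highest GFLOPS, or std::nullopt when no
// configuration was measured. For an empty range std::max_element returns
// end(), which must not be dereferenced.
std::optional<TuneEntry> pickBest(const std::vector<TuneEntry>& entries) {
    auto best = std::max_element(entries.begin(), entries.end(),
                                 [](const TuneEntry& a, const TuneEntry& b) {
                                     return a.gflops < b.gflops;
                                 });
    if (best == entries.end()) {
        return std::nullopt;
    }
    return *best;
}
```

Returning `std::optional` pushes the "nothing measured" case onto the caller, which can then print a message instead of reading through an invalid iterator.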

View File

@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileRope LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(tileRope tileRope.cu)
target_compile_features(tileRope PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

View File

@ -0,0 +1,29 @@
# tileRope
## Description
This sample demonstrates a Rotary Position Embedding (RoPE) forward pass
using CUDA Tile C++. RoPE injects positional information into the query
and key projections of an attention layer by rotating pairs of features
in the head dimension by per-position angles. This implementation uses
the split-half convention: for each token at position `s` the pair
`(q[i], q[i + D/2])` is rotated by `theta = s * 10000^(-2i / D)`, so
`q[i]' = q[i]*cos(theta) - q[i+D/2]*sin(theta)` and
`q[i+D/2]' = q[i]*sin(theta) + q[i+D/2]*cos(theta)`. The
`cuda::tiles::partition_view` type is used to partition each (batch,
position) token's Q and K tensors into 2D tiles over (heads,
half_rope_dim), and a single block processes all heads for one token
in parallel, writing the result back in place. A SIMT kernel is used
to initialize the inputs and the cos/sin tables.
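
The split-half rotation described above can be sketched on the host for a single head vector. This is a minimal illustration in plain C++ rather than CUDA Tile C++; `ropeSplitHalf` is a hypothetical helper name, it works in `float` instead of `__half`, and it assumes the vector length `D` is even.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates one head vector of even length D in place using the split-half
// convention: element i is paired with element i + D/2.
// `position` is the token index s; the base 10000 follows the formula above.
void ropeSplitHalf(std::vector<float>& x, int position) {
    const std::size_t D = x.size();
    const std::size_t half = D / 2;
    for (std::size_t i = 0; i < half; ++i) {
        float exponent = -2.0f * static_cast<float>(i) / static_cast<float>(D);
        float theta = static_cast<float>(position) * std::pow(10000.0f, exponent);
        float c = std::cos(theta);
        float s = std::sin(theta);
        float x1 = x[i];         // first half of the pair
        float x2 = x[i + half];  // second half of the pair
        x[i] = x1 * c - x2 * s;
        x[i + half] = x1 * s + x2 * c;
    }
}
```

Because each pair is multiplied by a 2D rotation matrix, the Euclidean norm of the vector is preserved up to rounding, and position 0 leaves the vector unchanged.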
## Expected Output
```
Success! RoPE matches expected results.
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.

View File

@ -0,0 +1,274 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This sample demonstrates a Rotary Position Embedding (RoPE) forward pass
* using CUDA Tile C++. RoPE injects positional information into the query
* and key projections of an attention layer by rotating pairs of features
* in the head dimension by per-position angles. This implementation uses
* the split-half convention: for each token at position 's' the pair
* (q[i], q[i + D/2]) is rotated by theta = s * 10000^(-2i / D), yielding
* q[i]' = q[i] * cos(theta) - q[i + D/2] * sin(theta)
* q[i + D/2]' = q[i] * sin(theta) + q[i + D/2] * cos(theta)
* Each block handles one (batch, position) token and processes all heads
* in parallel using 2D tiles over (heads, half_rope_dim). The kernel
* writes back to q and k in place. A SIMT kernel is used to initialize
* the inputs and the cos/sin tables.
*/
#include "helper_cuda.h"
#include "cuda_tile.h"
#include "cuda_fp16.h"
#include <cstdio>
#include <cstdlib>
#include <cmath>
/* Compile-time sample shape: one block per (batch, position) token. */
constexpr int BATCH = 1;
constexpr int Q_HEADS = 8;
constexpr int K_HEADS = 8;
constexpr int SEQ_LEN = 64;
constexpr int HEAD_DIM = 64;
constexpr int HALF_ROPE_DIM = HEAD_DIM / 2;
constexpr int BLOCK_QH = Q_HEADS;
constexpr int BLOCK_KH = K_HEADS;
constexpr int BLOCK_HD = HALF_ROPE_DIM;
constexpr int COS_BS = 1;
constexpr std::size_t Q_SIZE = (std::size_t)BATCH * Q_HEADS * SEQ_LEN * HEAD_DIM;
constexpr std::size_t K_SIZE = (std::size_t)BATCH * K_HEADS * SEQ_LEN * HEAD_DIM;
constexpr std::size_t COS_SIZE = (std::size_t)COS_BS * SEQ_LEN * HALF_ROPE_DIM;
constexpr std::size_t cmax(std::size_t a, std::size_t b) { return a > b ? a : b; }
constexpr std::size_t INIT_N = cmax(Q_SIZE, cmax(K_SIZE, COS_SIZE));
/* Initializes q, k and the cos/sin tables on device. The cos/sin tables
* are laid out as (COS_BS, SEQ_LEN, HALF_ROPE_DIM): one entry per
* (batch, position, frequency-index) triple. */
__global__ void initializeInputs(__half* q, __half* k, __half* cos, __half* sin) {
std::size_t tid = (std::size_t)blockIdx.x * blockDim.x + threadIdx.x;
if (tid < Q_SIZE) {
int d = (int)(tid % HEAD_DIM);
q[tid] = __half{float(d % 11) / 10.0f - 0.5f};
}
if (tid < K_SIZE) {
int d = (int)(tid % HEAD_DIM);
k[tid] = __half{float(d % 13) / 10.0f - 0.5f};
}
if (tid < COS_SIZE) {
int i = (int)(tid % HALF_ROPE_DIM);
int s = (int)((tid / HALF_ROPE_DIM) % SEQ_LEN);
float exponent = -2.0f * float(i) / float(HEAD_DIM);
float theta = float(s) * powf(10000.0f, exponent);
cos[tid] = __half{cosf(theta)};
sin[tid] = __half{sinf(theta)};
}
}
/* RoPE forward kernel - processes all heads at once using 2D tiles. One
* block handles one (batch, seq) position. Tile 0 along the head_dim axis
* spans [0:BLOCK_HD) (the first half) and tile 1 spans [BLOCK_HD:2*BLOCK_HD)
* (the second half), so the rotation pairs are (q[i], q[i + D/2]). */
template <typename T, int BATCH_, int Q_HEADS_, int K_HEADS_,
int BLOCK_QH_, int BLOCK_KH_, int BLOCK_HD_,
int HALF_ROPE_DIM_, int HEAD_DIM_, int COS_BS_, int SEQ_LEN_>
__tile_global__ void rope(T* __restrict__ q,
T* __restrict__ k,
T* __restrict__ cos,
T* __restrict__ sin) {
namespace ct = cuda::tiles;
q = ct::assume_aligned<16>(q);
k = ct::assume_aligned<16>(k);
cos = ct::assume_aligned<16>(cos);
sin = ct::assume_aligned<16>(sin);
int pid = ct::bid().x;
int batch_idx = pid / SEQ_LEN_;
int row_idx = pid % SEQ_LEN_;
int cos_batch_idx = (COS_BS_ == 1) ? 0 : batch_idx;
auto pCos = ct::partition_view(ct::tensor_span{cos, ct::extents{COS_BS_, SEQ_LEN_, HALF_ROPE_DIM_}},
ct::shape<1, 1, BLOCK_HD_>{});
auto cos_loaded = pCos.load(cos_batch_idx, row_idx, 0);
auto cos_row = ct::reshape<ct::shape<1, BLOCK_HD_>>(cos_loaded);
auto pSin = ct::partition_view(ct::tensor_span{sin, ct::extents{COS_BS_, SEQ_LEN_, HALF_ROPE_DIM_}},
ct::shape<1, 1, BLOCK_HD_>{});
auto sin_loaded = pSin.load(cos_batch_idx, row_idx, 0);
auto sin_row = ct::reshape<ct::shape<1, BLOCK_HD_>>(sin_loaded);
/* Process Q (in place). Tile indices 0 and 1 along the head_dim axis cover
* [0:2*BLOCK_HD) == [0:rope_dim). Elements past rope_dim would be left
* unchanged; in this sample rope_dim == HEAD_DIM, so every element rotates. */
auto pQ = ct::partition_view(ct::tensor_span{q, ct::extents{BATCH_, Q_HEADS_, SEQ_LEN_, HEAD_DIM_}},
ct::shape<1, BLOCK_QH_, 1, BLOCK_HD_>{});
auto q_tile_1_loaded = pQ.load(batch_idx, 0, row_idx, 0);
auto q_tile_1 = ct::reshape<ct::shape<BLOCK_QH_, BLOCK_HD_>>(q_tile_1_loaded);
auto q_tile_2_loaded = pQ.load(batch_idx, 0, row_idx, 1);
auto q_tile_2 = ct::reshape<ct::shape<BLOCK_QH_, BLOCK_HD_>>(q_tile_2_loaded);
auto cos_bcast_q = ct::broadcast<ct::shape<BLOCK_QH_, BLOCK_HD_>>(cos_row);
auto sin_bcast_q = ct::broadcast<ct::shape<BLOCK_QH_, BLOCK_HD_>>(sin_row);
/* y = [x1, x2] * [cos, cos] + [-x2, x1] * [sin, sin] */
auto new_q_tile_1 = q_tile_1 * cos_bcast_q - q_tile_2 * sin_bcast_q;
auto new_q_tile_2 = q_tile_2 * cos_bcast_q + q_tile_1 * sin_bcast_q;
auto new_q_tile_1_reshaped = ct::reshape<ct::shape<1, BLOCK_QH_, 1, BLOCK_HD_>>(new_q_tile_1);
auto new_q_tile_2_reshaped = ct::reshape<ct::shape<1, BLOCK_QH_, 1, BLOCK_HD_>>(new_q_tile_2);
pQ.store(new_q_tile_1_reshaped, batch_idx, 0, row_idx, 0);
pQ.store(new_q_tile_2_reshaped, batch_idx, 0, row_idx, 1);
/* Process K (in place). */
auto pK = ct::partition_view(ct::tensor_span{k, ct::extents{BATCH_, K_HEADS_, SEQ_LEN_, HEAD_DIM_}},
ct::shape<1, BLOCK_KH_, 1, BLOCK_HD_>{});
auto k_tile_1_loaded = pK.load(batch_idx, 0, row_idx, 0);
auto k_tile_1 = ct::reshape<ct::shape<BLOCK_KH_, BLOCK_HD_>>(k_tile_1_loaded);
auto k_tile_2_loaded = pK.load(batch_idx, 0, row_idx, 1);
auto k_tile_2 = ct::reshape<ct::shape<BLOCK_KH_, BLOCK_HD_>>(k_tile_2_loaded);
auto cos_bcast_k = ct::broadcast<ct::shape<BLOCK_KH_, BLOCK_HD_>>(cos_row);
auto sin_bcast_k = ct::broadcast<ct::shape<BLOCK_KH_, BLOCK_HD_>>(sin_row);
auto new_k_tile_1 = k_tile_1 * cos_bcast_k - k_tile_2 * sin_bcast_k;
auto new_k_tile_2 = k_tile_2 * cos_bcast_k + k_tile_1 * sin_bcast_k;
auto new_k_tile_1_reshaped = ct::reshape<ct::shape<1, BLOCK_KH_, 1, BLOCK_HD_>>(new_k_tile_1);
auto new_k_tile_2_reshaped = ct::reshape<ct::shape<1, BLOCK_KH_, 1, BLOCK_HD_>>(new_k_tile_2);
pK.store(new_k_tile_1_reshaped, batch_idx, 0, row_idx, 0);
pK.store(new_k_tile_2_reshaped, batch_idx, 0, row_idx, 1);
}
/* CPU reference matching the split-half convention used by the kernel:
* (q[i], q[i + D/2]) is rotated by the angle stored at cos/sin[cb, s, i]. */
static bool verify(const __half* h_in,
const __half* h_out,
const __half* h_cos,
const __half* h_sin,
int heads,
const char* name) {
for (int b = 0; b < BATCH; ++b) {
for (int h = 0; h < heads; ++h) {
for (int s = 0; s < SEQ_LEN; ++s) {
for (int i = 0; i < HALF_ROPE_DIM; ++i) {
std::size_t base = (((std::size_t)b * heads + h) * SEQ_LEN + s) * HEAD_DIM;
std::size_t i1 = base + i;
std::size_t i2 = base + i + HALF_ROPE_DIM;
int cb = (COS_BS == 1) ? 0 : b;
std::size_t ci = ((std::size_t)cb * SEQ_LEN + s) * HALF_ROPE_DIM + i;
double q1 = (double)(float)h_in[i1];
double q2 = (double)(float)h_in[i2];
double c = (double)(float)h_cos[ci];
double si = (double)(float)h_sin[ci];
__half exp1 = __half{(float)(q1 * c - q2 * si)};
__half exp2 = __half{(float)(q2 * c + q1 * si)};
float diff1 = std::fabs((float)h_out[i1] - (float)exp1);
float diff2 = std::fabs((float)h_out[i2] - (float)exp2);
if (diff1 > 1e-1f || diff2 > 1e-1f) {
printf("Mismatch in %s at (b=%d, h=%d, s=%d, i=%d):\n", name, b, h, s, i);
printf(" Expected: %s[%d,%d,%d,%d]=%f, %s[%d,%d,%d,%d]=%f\n",
name, b, h, s, i, (float)exp1,
name, b, h, s, i + HALF_ROPE_DIM, (float)exp2);
printf(" Actual: %s[%d,%d,%d,%d]=%f, %s[%d,%d,%d,%d]=%f\n",
name, b, h, s, i, (float)h_out[i1],
name, b, h, s, i + HALF_ROPE_DIM, (float)h_out[i2]);
return false;
}
}
}
}
}
return true;
}
int main() {
__half* d_q = nullptr;
__half* d_k = nullptr;
__half* d_cos = nullptr;
__half* d_sin = nullptr;
checkCudaErrors(cudaMalloc(&d_q, Q_SIZE * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_k, K_SIZE * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_cos, COS_SIZE * sizeof(__half)));
checkCudaErrors(cudaMalloc(&d_sin, COS_SIZE * sizeof(__half)));
int threads_per_block = 256;
int num_blocks = (int)((INIT_N + threads_per_block - 1) / threads_per_block);
initializeInputs<<<num_blocks, threads_per_block>>>(d_q, d_k, d_cos, d_sin);
checkCudaErrors(cudaGetLastError());
/* Snapshot the inputs before the in-place kernel mutates them. */
__half* h_q_in = new __half[Q_SIZE];
__half* h_k_in = new __half[K_SIZE];
__half* h_cos = new __half[COS_SIZE];
__half* h_sin = new __half[COS_SIZE];
checkCudaErrors(cudaMemcpy(h_q_in, d_q, Q_SIZE * sizeof(__half), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_k_in, d_k, K_SIZE * sizeof(__half), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_cos, d_cos, COS_SIZE * sizeof(__half), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_sin, d_sin, COS_SIZE * sizeof(__half), cudaMemcpyDeviceToHost));
rope<__half, BATCH, Q_HEADS, K_HEADS, BLOCK_QH, BLOCK_KH, BLOCK_HD,
HALF_ROPE_DIM, HEAD_DIM, COS_BS, SEQ_LEN>
<<<BATCH * SEQ_LEN>>>(d_q, d_k, d_cos, d_sin);
checkCudaErrors(cudaGetLastError());
checkCudaErrors(cudaDeviceSynchronize());
__half* h_q_out = new __half[Q_SIZE];
__half* h_k_out = new __half[K_SIZE];
checkCudaErrors(cudaMemcpy(h_q_out, d_q, Q_SIZE * sizeof(__half), cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_k_out, d_k, K_SIZE * sizeof(__half), cudaMemcpyDeviceToHost));
if (!verify(h_q_in, h_q_out, h_cos, h_sin, Q_HEADS, "Q")) return 1;
if (!verify(h_k_in, h_k_out, h_cos, h_sin, K_HEADS, "K")) return 1;
printf("Success! RoPE matches expected results.\n");
checkCudaErrors(cudaFree(d_q));
checkCudaErrors(cudaFree(d_k));
checkCudaErrors(cudaFree(d_cos));
checkCudaErrors(cudaFree(d_sin));
delete[] h_q_in;
delete[] h_k_in;
delete[] h_cos;
delete[] h_sin;
delete[] h_q_out;
delete[] h_k_out;
}
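
The addressing that the kernel delegates to `partition_view` can be written out by hand as a rough host-side sketch. It assumes the row-major `(batch, heads, seq_len, head_dim)` layout implied by the `base` computation in `verify`; `TokenIndex`, `tokenOfBlock`, and `halfTileOffset` are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; the sample itself relies on
// cuda::tiles::partition_view for this addressing.
struct TokenIndex {
    int batch;
    int position;
};

// Maps a linear block id onto (batch, position), mirroring the kernel's
// pid / SEQ_LEN and pid % SEQ_LEN.
constexpr TokenIndex tokenOfBlock(int pid, int seq_len) {
    return TokenIndex{pid / seq_len, pid % seq_len};
}

// Flat offset of the first element of half-tile `half_tile` (0 or 1) for
// head `head` in a row-major (batch, heads, seq_len, head_dim) tensor.
constexpr std::size_t halfTileOffset(TokenIndex t, int head, int half_tile,
                                     int heads, int seq_len, int head_dim) {
    std::size_t row =
        (static_cast<std::size_t>(t.batch) * heads + head) * seq_len + t.position;
    return row * head_dim + static_cast<std::size_t>(half_tile) * (head_dim / 2);
}
```

Two different tokens never map to the same row of the tensor, so each block reads and writes a disjoint set of elements. That is what makes the in-place update safe without synchronization between blocks.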

View File

@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileSpMV LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(tileSpMV tileSpMV.cu)
target_compile_features(tileSpMV PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()

View File

@ -0,0 +1,44 @@
# tileSpMV
## Description
This sample demonstrates sparse matrix-vector multiplication (SpMV)
`y = A * x` using CUDA Tile C++.
The matrix is built directly on the host in Sliced ELLPACK (SELL)
format, which is the format the Tile kernel reads. SELL applies the
ELLPACK idea per slice: rows are first sorted by length (to minimize
padding within a slice), then grouped into slices of `ROWS`
consecutive rows in that sorted order. Each slice is padded to the
length of its longest row and stored column-major, so that *the k-th
nonzero of every row in the slice* occupies a contiguous span of
`ROWS` elements in memory.
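
As a rough illustration of this layout, the sketch below packs a small matrix into SELL on the host. It is a simplified stand-in, not the sample's `packSell`: `TinySell` and `packTinySell` are hypothetical names, rows are given as per-row lists, ties are kept in original order with `std::stable_sort`, and unused permutation slots are marked with `-1` (the sample pads them with row 0 and masks them in the kernel).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical, simplified container for illustration (no device pointers).
struct TinySell {
    int slice_rows = 0;
    std::vector<int> row_perm;       // sorted position -> original row, -1 = none
    std::vector<int> slice_offsets;  // start of each slice in cols/vals
    std::vector<int> slice_widths;   // longest row in each slice
    std::vector<int> cols;           // column indices, 0 in padding slots
    std::vector<float> vals;         // values, 0.0f in padding slots
};

// Packs a matrix given as per-row (column, value) lists into SELL.
TinySell packTinySell(const std::vector<std::vector<int>>& row_cols,
                      const std::vector<std::vector<float>>& row_vals,
                      int slice_rows) {
    const int num_rows = static_cast<int>(row_cols.size());
    const int num_slices = (num_rows + slice_rows - 1) / slice_rows;
    TinySell S;
    S.slice_rows = slice_rows;

    // Sort rows by ascending length; stable_sort keeps ties in original order.
    std::vector<int> perm(num_rows);
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        return row_cols[a].size() < row_cols[b].size();
    });
    // Pad the permutation to a whole number of slices with -1 (no row).
    S.row_perm.assign(static_cast<std::size_t>(num_slices) * slice_rows, -1);
    std::copy(perm.begin(), perm.end(), S.row_perm.begin());

    // Slice width = longest row in the slice; offsets are a running sum.
    std::size_t running = 0;
    for (int s = 0; s < num_slices; ++s) {
        int width = 0;
        for (int r = 0; r < slice_rows; ++r) {
            int row = S.row_perm[s * slice_rows + r];
            if (row >= 0) {
                width = std::max(width, static_cast<int>(row_cols[row].size()));
            }
        }
        S.slice_offsets.push_back(static_cast<int>(running));
        S.slice_widths.push_back(width);
        running += static_cast<std::size_t>(width) * slice_rows;
    }

    // Zero-fill so that untouched slots act as padding.
    S.cols.assign(running, 0);
    S.vals.assign(running, 0.0f);
    for (int s = 0; s < num_slices; ++s) {
        for (int r = 0; r < slice_rows; ++r) {
            int row = S.row_perm[s * slice_rows + r];
            if (row < 0) continue;
            for (std::size_t k = 0; k < row_cols[row].size(); ++k) {
                // Column-major within the slice: k-th nonzero of slice row r.
                std::size_t dst = S.slice_offsets[s] + k * slice_rows + r;
                S.cols[dst] = row_cols[row][k];
                S.vals[dst] = row_vals[row][k];
            }
        }
    }
    return S;
}
```

With `slice_rows = 2` and four rows of lengths 3, 1, 2, 1, the two short rows share a slice of width 1 and the two longer rows share a slice of width 3, so only one padding slot is needed instead of the five that a single ELLPACK block of width 3 would require.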
Each CTA processes one slice using a 2D tile of `shape<ROWS, COLS>`:
- Dimension 0 (`ROWS`): the rows of the slice (one tile row per
matrix row in the slice)
- Dimension 1 (`COLS`): the next `COLS` nonzeros of every row in the
slice, processed simultaneously
The kernel computes partial products against the x-vector (an
irreducible gather), accumulates into a 2D tile, reduces along the
column dimension with `cuda::tiles::sum(acc, 1_ic)` to produce one
sum per row, and scatters the per-row sums to `y` using the slice
permutation array.
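
The per-slice computation outlined above can be emulated on the host with ordinary loops. This is a sketch under stated assumptions, not the Tile kernel: `spmvSliceHost` is a hypothetical name, the tile shape is shrunk to 2x2 for readability, and a single masked loop is used where the kernel may organize its iterations differently.

```cpp
#include <array>
#include <vector>

// Tile shape for this sketch; the sample uses ROWS=64, COLS=16.
constexpr int ROWS = 2;
constexpr int COLS = 2;

// Processes one SELL slice: walk the slice COLS nonzeros at a time,
// accumulate products into a ROWS x COLS tile, reduce each tile row,
// then scatter the row sums through the row permutation.
void spmvSliceHost(int slice, int num_rows,
                   const std::vector<int>& cols, const std::vector<float>& vals,
                   const std::vector<int>& slice_offsets,
                   const std::vector<int>& slice_widths,
                   const std::vector<int>& row_perm,
                   const std::vector<float>& x, std::vector<float>& y) {
    const int offset = slice_offsets[slice];
    const int width = slice_widths[slice];
    std::array<std::array<float, COLS>, ROWS> acc{};  // zero-initialized tile

    for (int k = 0; k < width; k += COLS) {
        for (int r = 0; r < ROWS; ++r) {
            for (int c = 0; c < COLS; ++c) {
                // Mask: the lane must map to a real row and lie inside the slice width.
                bool valid = (slice * ROWS + r < num_rows) && (k + c < width);
                if (!valid) continue;
                int idx = offset + (k + c) * ROWS + r;
                acc[r][c] += vals[idx] * x[cols[idx]];  // gather from x
            }
        }
    }
    for (int r = 0; r < ROWS; ++r) {
        if (slice * ROWS + r >= num_rows) continue;
        float row_sum = 0.0f;
        for (int c = 0; c < COLS; ++c) row_sum += acc[r][c];  // reduce along columns
        y[row_perm[slice * ROWS + r]] = row_sum;              // scatter to y
    }
}
```

Padding slots hold a value of `0.0f` and a column index of 0, so even when they fall inside the slice width they add nothing to the row sum.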
The sample generates a single random sparse matrix and verifies the
Tile kernel's output against a CPU reference SpMV.
## Expected Output
```
Random sparse matrix: rows=100000, cols=100000, nnz=..., avg nnz/row=...
Tile configuration: ROWS=64, COLS=16 (... slices)
Success! Tile SpMV matches the CPU reference.
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.

View File

@ -0,0 +1,494 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This sample demonstrates sparse matrix-vector multiplication
* (SpMV) using CUDA Tile C++.
*
* y = A * x
*
* The matrix is built directly on the host in Sliced ELLPACK (SELL)
* format, which is the format the Tile kernel reads. SELL applies
* the ELLPACK idea per slice: rows are sorted by length (to minimize
* padding) and grouped into slices of SLICE_ROWS consecutive rows in
* that sorted order. Each slice is stored column-major so that "the
* k-th nonzero of every row in the slice" occupies contiguous memory.
*
* slice s contains rows row_perm[s*SLICE_ROWS .. s*SLICE_ROWS + SLICE_ROWS)
* sell_data[slice_offsets[s] + k * SLICE_ROWS + r] is the k-th
* nonzero of the r-th row of slice s, with padding (column = 0,
* value = 0.0f) when the row has fewer than slice_widths[s]
* nonzeros
*
* Each CTA processes one slice using a 2D tile of shape<ROWS, COLS>:
*
* - Dimension 0 (ROWS): the rows of the slice (one tile row per
* matrix row in the slice)
* - Dimension 1 (COLS): the next COLS nonzeros of every row in
* the slice, processed simultaneously
*
* Because of the column-major slice layout, "load the next COLS
* nonzeros for every row" is a contiguous, fully coalesced load.
* This is the property that distinguishes SELL from a row-by-row
* CSR kernel: CSR forces the kernel to gather column indices and
* values from disjoint per-row address streams. The kernel computes
* partial products against the x-vector (an irreducible gather),
* accumulates into a 2D tile, then reduces along the column
* dimension with 'cuda::tiles::sum(acc, 1_ic)' to produce one sum
* per row, and scatters the sums to y using row_perm.
*/
#include "helper_cuda.h"
#include "cuda_tile.h"
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>
namespace ct = cuda::tiles;
//=============================================================================
// Tile SpMV kernel: 2D SELL SpMV
//
// Each CTA processes one slice of ROWS rows. The inner loop walks
// the slice's nonzeros COLS at a time. Because the SELL arrays are
// laid out column-major within a slice, each iteration reads
// 'sell_col_indices' and 'sell_values' as one contiguous run of
// ROWS*COLS elements (i.e. a fully coalesced load), even though
// row lengths vary across the matrix.
//=============================================================================
template <int ROWS, int COLS, int OCCUPANCY = 8>
[[cutile::hint(0, occupancy = OCCUPANCY)]]
__tile_global__ void spmvSell(int num_rows,
const int* __restrict__ sell_col_indices,
const float* __restrict__ sell_values,
const int* __restrict__ slice_offsets,
const int* __restrict__ slice_widths,
const int* __restrict__ row_perm,
const float* __restrict__ vector_x,
float* __restrict__ vector_y) {
using namespace ct::literals;
using Tile2D = ct::tile<float, ct::shape<ROWS, COLS>>;
using RowI = ct::tile<int, ct::shape<ROWS>>;
using ColI = ct::tile<int, ct::shape<COLS>>;
sell_col_indices = ct::assume_aligned<16>(sell_col_indices);
sell_values = ct::assume_aligned<16>(sell_values);
vector_x = ct::assume_aligned<16>(vector_x);
int slice = ct::bid().x;
int row_base = slice * ROWS;
auto local_row = ct::iota<RowI>();
auto row_valid = (row_base + local_row) < num_rows;
/* destination row in y for each lane of the slice */
auto actual_row = ct::load_masked(row_perm + row_base + local_row,
row_valid, 0);
int offset = slice_offsets[slice];
int width = slice_widths[slice];
/* Build 2D index tiles. row_2d broadcasts the in-slice row index
* along the column dimension; col_base_2d broadcasts the COLS
* lanes along the row dimension. The SELL element address is
* offset + (k + col) * ROWS + r
* which becomes the body of the inner loop below. */
auto row_2d = ct::broadcast(
ct::reshape(local_row, ct::shape<ROWS, 1>{}), ct::shape<ROWS, COLS>{});
auto col_base = ct::iota<ColI>();
auto row_valid_2d = ct::broadcast(
ct::reshape(row_valid, ct::shape<ROWS, 1>{}), ct::shape<ROWS, COLS>{});
auto col_base_2d = ct::broadcast(
ct::reshape(col_base, ct::shape<1, COLS>{}), ct::shape<ROWS, COLS>{});
Tile2D acc = ct::zeros<Tile2D>();
/* Loop-split: the first full_width / COLS iterations cover whole
* tiles, so they need no per-element column mask (every lane is
* within 'width'); at most one trailing iteration handles the
* remainder with a mask. Dropping the column mask from the hot
* loop saves a predicate evaluation per element per iteration. */
int full_width = (width / COLS) * COLS;
#pragma unroll 1
for (int k = 0; k < full_width; k += COLS) {
auto sell_idx = offset + (k + col_base_2d) * ROWS + row_2d;
auto cols_2d = ct::load_masked(sell_col_indices + sell_idx,
row_valid_2d, 0);
auto vals_2d = ct::load_masked(sell_values + sell_idx,
row_valid_2d, 0.0f);
auto x_2d = ct::load_masked(vector_x + cols_2d, row_valid_2d, 0.0f);
acc = acc + vals_2d * x_2d;
}
if (full_width < width) {
auto col_offsets_2d = full_width + col_base_2d;
auto valid = row_valid_2d & (col_offsets_2d < width);
auto sell_idx = offset + col_offsets_2d * ROWS + row_2d;
auto cols_2d = ct::load_masked(sell_col_indices + sell_idx, valid, 0);
auto vals_2d = ct::load_masked(sell_values + sell_idx, valid, 0.0f);
auto x_2d = ct::load_masked(vector_x + cols_2d, valid, 0.0f);
acc = acc + vals_2d * x_2d;
}
/* Reduce along the column dimension to get one sum per slice row,
* then scatter to the destination row in y. */
auto row_sums = ct::sum(acc, 1_ic);
ct::store_masked(vector_y + actual_row,
ct::reshape(row_sums, ct::shape<ROWS>{}), row_valid);
}
//=============================================================================
// Tile shape configuration
//
// The kernel is templated on (ROWS, COLS); ROWS is also the slice
// size in the SELL packing. We use a single shape sized for the
// random matrix generated below (~16 nonzeros per row on average).
//=============================================================================
constexpr int SLICE_ROWS = 64;
constexpr int TILE_COLS = 16;
//=============================================================================
// Sliced ELLPACK (SELL) matrix
//
// Layout:
// slice s covers row_perm[s*SLICE_ROWS .. s*SLICE_ROWS + SLICE_ROWS)
// slice_widths[s] = max( nnz_per_row[row] for row in slice s )
// slice_offsets[s] = sum_{t<s} slice_widths[t] * SLICE_ROWS
// sell_col_indices[slice_offsets[s] + k * SLICE_ROWS + r] =
// column index of the k-th nonzero of slice s, row r
// (or 0 if that slot is padding)
// sell_values[...] same but for the value
//=============================================================================
struct SellMatrix {
int num_rows = 0;
int num_cols = 0;
int num_slices = 0;
int nnz_total = 0; /* non-padding entries only */
std::size_t total_sell_entries = 0; /* including padding */
std::vector<int> row_perm; /* size num_slices * SLICE_ROWS */
std::vector<int> slice_offsets; /* size num_slices */
std::vector<int> slice_widths; /* size num_slices */
std::vector<int> row_lengths; /* size num_slices * SLICE_ROWS, padded */
std::vector<int> sell_col_indices;
std::vector<float> sell_values;
int* d_row_perm = nullptr;
int* d_slice_offsets = nullptr;
int* d_slice_widths = nullptr;
int* d_sell_col_indices = nullptr;
float* d_sell_values = nullptr;
void uploadToDevice() {
checkCudaErrors(cudaMalloc(&d_row_perm, row_perm.size() * sizeof(int)));
checkCudaErrors(cudaMalloc(&d_slice_offsets,
slice_offsets.size() * sizeof(int)));
checkCudaErrors(cudaMalloc(&d_slice_widths,
slice_widths.size() * sizeof(int)));
checkCudaErrors(cudaMalloc(&d_sell_col_indices,
sell_col_indices.size() * sizeof(int)));
checkCudaErrors(cudaMalloc(&d_sell_values,
sell_values.size() * sizeof(float)));
checkCudaErrors(cudaMemcpy(d_row_perm, row_perm.data(),
row_perm.size() * sizeof(int),
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_slice_offsets, slice_offsets.data(),
slice_offsets.size() * sizeof(int),
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_slice_widths, slice_widths.data(),
slice_widths.size() * sizeof(int),
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_sell_col_indices, sell_col_indices.data(),
sell_col_indices.size() * sizeof(int),
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_sell_values, sell_values.data(),
sell_values.size() * sizeof(float),
cudaMemcpyHostToDevice));
}
void freeDevice() {
if (d_row_perm) checkCudaErrors(cudaFree(d_row_perm));
if (d_slice_offsets) checkCudaErrors(cudaFree(d_slice_offsets));
if (d_slice_widths) checkCudaErrors(cudaFree(d_slice_widths));
if (d_sell_col_indices) checkCudaErrors(cudaFree(d_sell_col_indices));
if (d_sell_values) checkCudaErrors(cudaFree(d_sell_values));
d_row_perm = nullptr;
d_slice_offsets = nullptr;
d_slice_widths = nullptr;
d_sell_col_indices = nullptr;
d_sell_values = nullptr;
}
};
/* Build a SellMatrix from per-row column-index and value lists. Rows
* are sorted by ascending length before slicing so that each slice's
* longest row is close to its shortest — this minimizes the amount
* of zero-padding required inside the slice. */
static SellMatrix packSell(int num_rows, int num_cols,
const std::vector<int>& row_lengths,
const std::vector<int>& row_cols_flat,
const std::vector<float>& row_vals_flat) {
SellMatrix S;
S.num_rows = num_rows;
S.num_cols = num_cols;
S.num_slices = (num_rows + SLICE_ROWS - 1) / SLICE_ROWS;
/* prefix sums into the flattened row arrays */
std::vector<int> prefix(num_rows + 1, 0);
for (int r = 0; r < num_rows; ++r) {
prefix[r + 1] = prefix[r] + row_lengths[r];
}
S.nnz_total = prefix[num_rows];
/* sort row indices by ascending length */
std::vector<int> perm(num_rows);
std::iota(perm.begin(), perm.end(), 0);
std::sort(perm.begin(), perm.end(), [&](int a, int b) {
return row_lengths[a] < row_lengths[b];
});
/* pad the permutation up to a whole number of slices; the padding
* slots map to row 0, but those lanes are masked out by row_valid
* in the kernel and contribute nothing */
std::size_t padded_rows =
static_cast<std::size_t>(S.num_slices) * SLICE_ROWS;
S.row_perm.assign(padded_rows, 0);
S.row_lengths.assign(padded_rows, 0);
for (int i = 0; i < num_rows; ++i) {
S.row_perm[i] = perm[i];
S.row_lengths[i] = row_lengths[perm[i]];
}
/* per-slice width = max row length within the slice */
S.slice_widths.assign(S.num_slices, 0);
for (int s = 0; s < S.num_slices; ++s) {
int w = 0;
for (int r = 0; r < SLICE_ROWS; ++r) {
w = std::max(w, S.row_lengths[s * SLICE_ROWS + r]);
}
S.slice_widths[s] = w;
}
/* per-slice offset = prefix sum of slice_width * SLICE_ROWS */
S.slice_offsets.assign(S.num_slices, 0);
std::size_t running = 0;
for (int s = 0; s < S.num_slices; ++s) {
S.slice_offsets[s] = static_cast<int>(running);
running += static_cast<std::size_t>(S.slice_widths[s]) * SLICE_ROWS;
}
S.total_sell_entries = running;
/* pack into column-major slice layout */
S.sell_col_indices.assign(S.total_sell_entries, 0);
S.sell_values.assign(S.total_sell_entries, 0.0f);
for (int s = 0; s < S.num_slices; ++s) {
int offset = S.slice_offsets[s];
for (int r = 0; r < SLICE_ROWS; ++r) {
int global_idx = s * SLICE_ROWS + r;
if (global_idx >= num_rows) continue;
int row = S.row_perm[global_idx];
int row_start = prefix[row];
int row_len = row_lengths[row];
for (int k = 0; k < row_len; ++k) {
std::size_t dst = static_cast<std::size_t>(offset)
+ static_cast<std::size_t>(k) * SLICE_ROWS + r;
S.sell_col_indices[dst] = row_cols_flat[row_start + k];
S.sell_values[dst] = row_vals_flat[row_start + k];
}
/* the remaining k in [row_len, slice_widths[s]) stays as the
* zero-init padding written by the assign() calls above */
}
}
return S;
}
//=============================================================================
// Random matrix generator (produces SellMatrix directly)
//
// Each row draws a Poisson-distributed count of candidate columns
// (clamped to [1, num_cols]); the column indices are uniform random,
// then sorted and de-duplicated, so a row can end up slightly shorter
// than the drawn count. packSell() handles the slice layout.
//=============================================================================
static SellMatrix generateRandom(int num_rows, int num_cols,
int avg_nnz_per_row, unsigned seed) {
std::mt19937 rng(seed);
std::uniform_int_distribution<int> col_dist(0, num_cols - 1);
std::uniform_real_distribution<float> val_dist(-1.0f, 1.0f);
std::poisson_distribution<int> len_dist(static_cast<double>(avg_nnz_per_row));
std::vector<int> cols_flat;
std::vector<float> vals_flat;
std::vector<int> row_lengths;
row_lengths.reserve(num_rows);
std::vector<int> scratch_cols;
for (int r = 0; r < num_rows; ++r) {
int len = std::min(num_cols, std::max(1, len_dist(rng)));
scratch_cols.clear();
scratch_cols.reserve(len);
for (int k = 0; k < len; ++k) {
scratch_cols.push_back(col_dist(rng));
}
std::sort(scratch_cols.begin(), scratch_cols.end());
scratch_cols.erase(std::unique(scratch_cols.begin(), scratch_cols.end()),
scratch_cols.end());
for (int c : scratch_cols) {
cols_flat.push_back(c);
vals_flat.push_back(val_dist(rng));
}
row_lengths.push_back(static_cast<int>(scratch_cols.size()));
}
return packSell(num_rows, num_cols, row_lengths, cols_flat, vals_flat);
}
//=============================================================================
// CPU reference SpMV — reads SELL directly so the sample has no
// dependency on CSR or any external sparse-matrix library.
//=============================================================================
static void cpuSpMV(const SellMatrix& S, const std::vector<float>& x,
std::vector<float>& y) {
y.assign(S.num_rows, 0.0f);
for (int s = 0; s < S.num_slices; ++s) {
int offset = S.slice_offsets[s];
for (int r = 0; r < SLICE_ROWS; ++r) {
int global_idx = s * SLICE_ROWS + r;
if (global_idx >= S.num_rows) continue;
int dst_row = S.row_perm[global_idx];
/* iterate only over the row's real nonzeros; padding slots in
* [row_len, slice_widths[s]) hold zeros and are skipped */
int row_len = S.row_lengths[global_idx];
float sum = 0.0f;
for (int k = 0; k < row_len; ++k) {
std::size_t src = static_cast<std::size_t>(offset)
+ static_cast<std::size_t>(k) * SLICE_ROWS + r;
sum += S.sell_values[src] * x[S.sell_col_indices[src]];
}
y[dst_row] = sum;
}
}
}
/* Compare device result to the CPU reference. SpMV is performed in
* single precision and the device kernel reduces in a different
* order than the CPU reference, so we accept differences within a
* relative tolerance OR a small absolute tolerance — whichever is
* larger. */
static bool verify(const std::vector<float>& reference,
                   const std::vector<float>& result,
                   float rel_tol = 1e-2f, float abs_tol = 1e-4f) {
    float max_err = 0.0f;
    int bad_idx = -1;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        float diff = std::fabs(reference[i] - result[i]);
        float allowed = std::max(abs_tol, rel_tol * std::fabs(reference[i]));
        float over = diff - allowed;
        if (over > max_err) {
            max_err = over;
            bad_idx = static_cast<int>(i);
        }
    }
    if (max_err > 0.0f) {
        printf("Verification FAILED at index %d (ref=%g, got=%g, diff=%g)\n",
               bad_idx, reference[bad_idx], result[bad_idx],
               std::fabs(reference[bad_idx] - result[bad_idx]));
        return false;
    }
    return true;
}
//=============================================================================
// Main
//=============================================================================
int main() {
    /* Random sparse matrix: ~16 nonzeros per row on average, sized to
     * match the chosen tile shape (SLICE_ROWS = 64, TILE_COLS = 16). */
    SellMatrix S = generateRandom(/*num_rows=*/100000, /*num_cols=*/100000,
                                  /*avg_nnz_per_row=*/16, /*seed=*/0xA5A5);
    printf("Random sparse matrix: rows=%d, cols=%d, nnz=%d, "
           "avg nnz/row=%.1f\n",
           S.num_rows, S.num_cols, S.nnz_total,
           static_cast<double>(S.nnz_total) / S.num_rows);
    printf("Tile configuration: ROWS=%d, COLS=%d (%d slices)\n",
           SLICE_ROWS, TILE_COLS, S.num_slices);

    /* host inputs */
    std::vector<float> h_x(S.num_cols);
    std::mt19937 rng(0xC0FFEE);
    std::uniform_real_distribution<float> x_dist(-1.0f, 1.0f);
    for (float& v : h_x) v = x_dist(rng);

    /* CPU reference */
    std::vector<float> ref_y;
    cpuSpMV(S, h_x, ref_y);

    /* device allocations */
    S.uploadToDevice();
    float* d_x = nullptr;
    float* d_y = nullptr;
    checkCudaErrors(cudaMalloc(&d_x, S.num_cols * sizeof(float)));
    checkCudaErrors(cudaMalloc(&d_y, S.num_rows * sizeof(float)));
    checkCudaErrors(cudaMemcpy(d_x, h_x.data(), S.num_cols * sizeof(float),
                               cudaMemcpyHostToDevice));

    /* Launch the SELL Tile kernel: one CTA per slice. */
    spmvSell<SLICE_ROWS, TILE_COLS><<<S.num_slices>>>(
        S.num_rows, S.d_sell_col_indices, S.d_sell_values,
        S.d_slice_offsets, S.d_slice_widths, S.d_row_perm, d_x, d_y);
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    /* copy result back and verify */
    std::vector<float> h_y(S.num_rows);
    checkCudaErrors(cudaMemcpy(h_y.data(), d_y, S.num_rows * sizeof(float),
                               cudaMemcpyDeviceToHost));
    S.freeDevice();
    checkCudaErrors(cudaFree(d_x));
    checkCudaErrors(cudaFree(d_y));

    if (!verify(ref_y, h_y)) {
        return 1;
    }
    printf("Success! Tile SpMV matches the CPU reference.\n");
    return 0;
}


@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileTranspose LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information for debug tools in non-debug builds (mutually exclusive with the -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(tileTranspose tileTranspose.cu)
target_compile_features(tileTranspose PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()


@ -0,0 +1,21 @@
# tileTranspose
## Description
This sample demonstrates how to transpose a 2D matrix using CUDA Tile
C++. Each block handles an n x m sized chunk of the source matrix. The
block loads a chunk, transposes it locally, and stores it to the
correct position in the result matrix. A `cuda::tiles::partition_view`
is used to model the chunking of the source and result matrices.
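
The chunking described above can be modeled on the host in plain Python. This is only a rough sketch and is not part of the sample: it does not use the CUDA Tile API, `ceil_div` and `chunked_transpose` are hypothetical helper names, the loops over `bx` and `by` stand in for the grid of blocks, and clamping each chunk to the matrix bounds stands in for the masked loads and stores.

```python
def ceil_div(a, b):
    """Number of chunks of size b needed to cover a elements (assumes a > 0)."""
    return 1 + (a - 1) // b


def chunked_transpose(src, n, m, chunk_n, chunk_m):
    """Transpose a row-major n x m list into a row-major m x n list, chunk by chunk."""
    dst = [None] * (n * m)
    for bx in range(ceil_div(n, chunk_n)):      # stands in for the block index in x
        for by in range(ceil_div(m, chunk_m)):  # stands in for the block index in y
            # Clamp the chunk to the matrix bounds (models the masked access).
            rows = range(bx * chunk_n, min((bx + 1) * chunk_n, n))
            cols = range(by * chunk_m, min((by + 1) * chunk_m, m))
            # Load chunk (bx, by) of the source matrix.
            tile = [[src[i * m + j] for j in cols] for i in rows]
            # Transpose the chunk locally.
            tile_t = [list(col) for col in zip(*tile)]
            # Store it to chunk (by, bx) of the result matrix.
            for tj, j in enumerate(cols):
                for ti, i in enumerate(rows):
                    dst[j * n + i] = tile_t[tj][ti]
    return dst


n, m = 5, 3
src = list(range(n * m))
dst = chunked_transpose(src, n, m, chunk_n=2, chunk_m=2)
print(dst)  # [0, 3, 6, 9, 12, 1, 4, 7, 10, 13, 2, 5, 8, 11, 14]
```

With the sizes used in the sample's `main` (an 800 x 400 matrix and 128 x 256 chunks), the same `ceil_div` arithmetic gives a 7 x 2 grid of blocks, and the chunks along the edges are only partially in bounds.
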
## Expected Output
```
Success! Matrix transpose matches expected results.
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.


@ -0,0 +1,126 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/**
* This sample demonstrates how to transpose a 2D matrix using CUDA
* Tile C++. Each block handles an n x m sized chunk of the source
* matrix. The block loads a chunk, transposes it locally, and stores
* it to the correct position in the result matrix. A
* cuda::tiles::partition_view is used to model the chunking of the
* source and result matrices.
*/
#include "helper_cuda.h"
#include "cuda_tile.h"
#include <cstdio>
constexpr int CHUNK_N = 128;
constexpr int CHUNK_M = 256;
/* Declares a tile kernel with '__restrict__' pointers (important for performance) */
__tile_global__ void transpose(float* __restrict__ a,
                               float* __restrict__ b,
                               std::size_t n,
                               std::size_t m) {
    /* set up the namespace */
    namespace ct = cuda::tiles;
    using namespace ct::literals;

    /* indicate to the compiler that the pointers are aligned (important for optimizations) */
    a = ct::assume_aligned(a, 16_ic);
    b = ct::assume_aligned(b, 16_ic);

    /* get the block index for the x and y dimensions */
    auto [idx, idy, idz] = ct::bid();

    /* create tensor spans representing n x m and m x n row major matrices */
    ct::tensor_span a_span{a, ct::extents{n, m}};
    ct::tensor_span b_span{b, ct::extents{m, n}};

    /* create partition views over the arrays */
    auto view_a = ct::partition_view{a_span, ct::shape<CHUNK_N, CHUNK_M>{}};
    auto view_b = ct::partition_view{b_span, ct::shape<CHUNK_M, CHUNK_N>{}};

    /* load the tile from the input partition */
    auto tile_a = view_a.load_masked(idx, idy);

    /* transpose the tile locally */
    auto tile_transposed = ct::transpose(tile_a);

    /* store the tile to the correct output partition */
    view_b.store_masked(tile_transposed, idy, idx);
}
int main() {
    int n = 800;
    int m = 400;

    float* h_a = new float[n * m];
    for (int idx = 0; idx != n * m; ++idx) {
        h_a[idx] = idx;
    }

    float* d_a = nullptr;
    float* d_b = nullptr;
    int num_blocks_n = 1 + (n - 1) / CHUNK_N;
    int num_blocks_m = 1 + (m - 1) / CHUNK_M;

    checkCudaErrors(cudaMalloc(&d_a, n * m * sizeof(float)));
    checkCudaErrors(cudaMemcpy(d_a, h_a, n * m * sizeof(float), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMalloc(&d_b, n * m * sizeof(float)));

    transpose<<<dim3(num_blocks_n, num_blocks_m)>>>(d_a, d_b, n, m);
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    float* h_b = new float[n * m];
    checkCudaErrors(cudaMemcpy(h_b, d_b, n * m * sizeof(float), cudaMemcpyDeviceToHost));

    for (int idx = 0; idx != n; ++idx) {
        for (int jdx = 0; jdx != m; ++jdx) {
            float expected = h_a[idx * m + jdx];
            float actual = h_b[jdx * n + idx];
            if (expected != actual) {
                printf("Expected: h_b[%i][%i] == %f\n", jdx, idx, expected);
                printf("Actual: h_b[%i][%i] == %f\n", jdx, idx, actual);
                return 1;
            }
        }
    }
    printf("Success! Matrix transpose matches expected results.\n");

    checkCudaErrors(cudaFree(d_a));
    checkCudaErrors(cudaFree(d_b));
    delete[] h_a;
    delete[] h_b;
}


@ -0,0 +1,29 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(tileVectorAdd LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES 80 86 87 89 90 100 110 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --enable-tile")
if(ENABLE_CUDA_DEBUG)
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")
else()
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information for debug tools in non-debug builds (mutually exclusive with the -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
add_executable(tileVectorAdd tileVectorAdd.cu)
target_compile_features(tileVectorAdd PRIVATE cxx_std_20 cuda_std_20)
# Include installation configuration
include(${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/InstallSamples.cmake)
setup_samples_install()


@ -0,0 +1,24 @@
# tileVectorAdd
## Description
This sample demonstrates a simple vector addition using CUDA Tile C++.
The vector addition is performed by splitting the dataset across blocks
that each process 1024 elements. The `cuda::tiles::partition_view`
type is used to partition the data into chunks of size 1024. Each
block loads its respective chunk from 'a' and 'b', performs an
elementwise addition, then stores the result to the corresponding chunk
of 'c'. Masked loads and stores ensure that the last chunk, which is
partially out of bounds, is handled correctly.
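
The partitioning can be illustrated with a small host-side model in plain Python. This is a sketch under stated assumptions rather than the sample's code: `vector_add_chunked` is a hypothetical helper, it uses Python floats instead of `__half`, and clamping the upper bound of each chunk plays the role of the masked loads and stores.

```python
def vector_add_chunked(a, b, chunk=1024):
    """Add two equal-length lists one chunk at a time; return (result, num_blocks)."""
    n = len(a)
    c = [0.0] * n
    # Same rounding-up formula the sample uses for its launch configuration.
    num_blocks = 1 + (n - 1) // chunk if n else 0
    for block in range(num_blocks):  # stands in for the block index
        lo = block * chunk
        hi = min(lo + chunk, n)      # the last chunk is clamped to the array length
        c[lo:hi] = [x + y for x, y in zip(a[lo:hi], b[lo:hi])]
    return c, num_blocks


n = 8000
a = [0.5 * i for i in range(n)]
b = [1.5 * i for i in range(n)]
c, num_blocks = vector_add_chunked(a, b)

# Number of blocks, and number of in-bounds elements in the last chunk.
print(num_blocks, n - (num_blocks - 1) * 1024)  # 8 832
```

For the 8000-element input used in the sample, eight chunks of 1024 are needed and only 832 elements of the last chunk are in bounds, which is the case the masked accesses exist to handle.
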
## Expected Output
```
Success! Vector addition matches expected results.
```
## Prerequisites
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) version 13.3 or later.
- [CUDA Driver](https://www.nvidia.com/en-us/drivers/) version 580 or later.
- Host compiler with C++20 support.


@ -0,0 +1,136 @@
/* Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of NVIDIA CORPORATION nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/*
* This sample demonstrates a simple vector addition using CUDA Tile C++.
* The vector addition is performed by splitting the dataset into blocks
* which process 1024 elements at a time. The cuda::tiles::partition_view
* type is used to partition the data into chunks of size 1024. Each block loads
* its respective chunk from 'a' and 'b', performs an elementwise addition,
* then stores it to the corresponding chunk of 'c'. Masked loads and stores
* ensure that the last chunk, which is partially out of bounds, is handled
* correctly.
*
* A SIMT kernel is used to initialize the input vectors.
*/
#include "helper_cuda.h"
#include "cuda_tile.h"
#include "cuda_fp16.h"
#include <cstdio>
__global__ void initializeVectors(__half* a, __half* b, std::size_t n) {
    auto idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] = __half{0.5 * idx};
        b[idx] = __half{1.5 * idx};
    }
}
/* Declares a tile kernel with '__restrict__' pointers (important for performance) */
__tile_global__ void vectorAdd(__half* __restrict__ a,
                               __half* __restrict__ b,
                               __half* __restrict__ c,
                               std::size_t n) {
    /* set up the namespace */
    namespace ct = cuda::tiles;
    using namespace ct::literals;

    /* indicate to the compiler that the pointers are aligned (important for optimizations) */
    a = ct::assume_aligned(a, 16_ic);
    b = ct::assume_aligned(b, 16_ic);
    c = ct::assume_aligned(c, 16_ic);

    /* get the block index in the x dimension */
    auto idx = ct::bid().x;

    /* create tensor spans representing arrays of length 'n' based on the pointers 'a', 'b', and 'c' */
    ct::tensor_span a_span{a, ct::extents{n}};
    ct::tensor_span b_span{b, ct::extents{n}};
    ct::tensor_span c_span{c, ct::extents{n}};

    /* create partition views over the full arrays, partitioned into chunks of 1024 */
    auto view_a = ct::partition_view{a_span, ct::shape{1024_ic}};
    auto view_b = ct::partition_view{b_span, ct::shape{1024_ic}};
    auto view_c = ct::partition_view{c_span, ct::shape{1024_ic}};

    /* load the tiles from the input partitions */
    auto tile_a = view_a.load_masked(idx);
    auto tile_b = view_b.load_masked(idx);

    /* add the tiles together, elementwise */
    auto tile_c = tile_a + tile_b;

    /* store the result tile to the output partition */
    view_c.store_masked(tile_c, idx);
}
int main() {
    __half* d_a = nullptr;
    __half* d_b = nullptr;
    __half* d_c = nullptr;

    int N = 8000;
    int chunk_size = 1024;
    int num_blocks = 1 + ((N - 1) / chunk_size);

    checkCudaErrors(cudaMalloc(&d_a, N * sizeof(__half)));
    checkCudaErrors(cudaMalloc(&d_b, N * sizeof(__half)));
    checkCudaErrors(cudaMalloc(&d_c, N * sizeof(__half)));

    initializeVectors<<<num_blocks, chunk_size>>>(d_a, d_b, N);
    checkCudaErrors(cudaGetLastError());

    vectorAdd<<<num_blocks>>>(d_a, d_b, d_c, N);
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    __half* h_c = new __half[N];
    checkCudaErrors(cudaMemcpy(h_c, d_c, N * sizeof(__half), cudaMemcpyDeviceToHost));

    for (int idx = 0; idx != N; ++idx) {
        if (h_c[idx] != __half{2 * idx}) {
            printf("Expected: h_c[%i] == %i\n", idx, 2 * idx);
            printf("Actual: h_c[%i] == %f\n", idx, float(h_c[idx]));
            return 1;
        }
    }
    printf("Success! Vector addition matches expected results.\n");

    checkCudaErrors(cudaFree(d_a));
    checkCudaErrors(cudaFree(d_b));
    checkCudaErrors(cudaFree(d_c));
    delete[] h_c;
}


@ -31,3 +31,6 @@ if(BUILD_TEGRA)
set(CMAKE_FOLDER "8_Platform_Specific/Tegra")
add_subdirectory(8_Platform_Specific/Tegra)
endif()
set(CMAKE_FOLDER "9_CUDA_Tile")
add_subdirectory(9_CUDA_Tile)


@ -106,10 +106,19 @@ When returning a zero-copy view, the caller must close the buffers after use (e.
- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- `cuda-python` package (13.0.0+)
- `cuda-core` package (>=0.6.0)
- `cuda-core` package (>=1.0.0)
- `numpy` package (>=2.3.2)
- `pillow` package (10.0.0+)
### Platform Support:
This sample relies on `ManagedMemoryResource` with **concurrent host access**
to managed allocations while GPU kernels are in flight. That behavior
requires the device property `concurrent_managed_access=True`, which is only
supported on Linux with HMM (Pascal and newer). On Windows (WDDM/MCDM/TCC)
the property is `False`, so the sample exits early with a waive message and
exit code `2` instead of attempting a run that would crash the process.
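
The waive decision described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `should_waive`, `exit_code_for`, and `WAIVED_EXIT_CODE` are hypothetical names, and the `concurrent_managed_access` value is passed in by the caller rather than queried from a device. The sample itself, as shown in this change, only checks `sys.platform == "win32"`.

```python
WAIVED_EXIT_CODE = 2  # exit code this README describes for a waived run


def should_waive(platform, concurrent_managed_access):
    """Return True when the sample should exit early instead of running.

    platform: a value such as sys.platform ("win32", "linux", ...).
    concurrent_managed_access: the device property described above,
    supplied by the caller (hypothetical input for this sketch).
    """
    return platform == "win32" or not concurrent_managed_access


def exit_code_for(platform, concurrent_managed_access):
    """Map the waive decision to a process exit code (0 means: go ahead and run)."""
    return WAIVED_EXIT_CODE if should_waive(platform, concurrent_managed_access) else 0


print(exit_code_for("win32", False))  # 2
print(exit_code_for("linux", True))   # 0
```

Gating on the property as well as the platform is a design choice of this sketch, intended to cover any non-Windows configuration where the property is also `False`.
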
## Installation
```bash


@ -142,8 +142,8 @@ def blur_image_unified_memory(
mr = ManagedMemoryResource(options)
# Allocate unified memory buffers for source and destination images
src_buf = mr.allocate(n_bytes, stream)
dst_buf = mr.allocate(n_bytes, stream)
src_buf = mr.allocate(n_bytes, stream=stream)
dst_buf = mr.allocate(n_bytes, stream=stream)
try:
# Synchronize to ensure allocations are complete before CPU access
stream.sync()
@ -197,6 +197,14 @@ def main():
3. Unified memory with cuda.core.ManagedMemoryResource
4. Kernel launch with cuda.core.launch and LaunchConfig
"""
if sys.platform == "win32":
print(
"This sample relies on ManagedMemoryResource with concurrent host "
"access, which is not supported on Windows "
"(concurrent_managed_access=False). Waiving this sample."
)
sys.exit(2)
print("=" * 60)
print("Image Blur with Unified Memory (cuda.core)")
print("=" * 60)


@ -1,6 +1,6 @@
# Image Blur with Unified Memory Sample Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cuda-core>=1.0.0
numpy>=2.3.2
pillow>=10.0.0


@ -39,7 +39,7 @@ Copy image arrays between CPU and GPU memory using the modern `cuda.core` API wi
### From `cupy`:
- `cp.from_dlpack()` - Create GPU array view from DLPack capsule
- `cp.cuda.ExternalStream()` - Use external CUDA stream
- `cp.cuda.Stream.from_external()` - Use external CUDA stream
### From `cuda_samples_utils`:
@ -58,8 +58,8 @@ Copy image arrays between CPU and GPU memory using the modern `cuda.core` API wi
- Python 3.10 or newer
- NumPy 2.3.2 or newer (required for DLPack support)
- `cuda-python` package (>=13.0.0)
- `cuda-core` package (>=0.6.0)
- `cupy-cuda13x` package (13.0.0+)
- `cuda-core` package (>=1.0.0)
- `cupy-cuda13x` package (14.0.0+)
## Installation
@ -73,8 +73,8 @@ pip install -r requirements.txt
The requirements.txt installs:
- `numpy` (2.3.2+, required for DLPack)
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (13.0.0+)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (14.0.0+)
## How to Run


@ -195,7 +195,7 @@ def main():
print(f"[Image array copy of {H}x{W}x{C} image]")
# Step 2: Configure CuPy to use our CUDA stream (for interoperability)
cp.cuda.ExternalStream(int(stream.handle)).use()
cp.cuda.Stream.from_external(stream).use()
# Step 3: Create a test image on CPU
print("Creating sample image...")


@ -2,5 +2,5 @@
numpy>=2.3.2
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0


@ -90,7 +90,7 @@ Query and display detailed properties of all CUDA-capable devices in your system
- CUDA Toolkit 13.0 or newer (recommended; matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` package (>=13.0.0)
- `cuda-core` package (>=0.6.0)
- `cuda-core` package (>=1.0.0)
## Installation
@ -103,7 +103,7 @@ pip install -r requirements.txt
The requirements.txt installs:
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cuda-core` (>=1.0.0)
## How to Run


@ -138,7 +138,7 @@ def print_device_info(dev_id, device):
print(f"Device {dev_id}: {device.name}")
# cuda.bindings workaround: runtime version not in cuda.core
driver_major, driver_minor = system.get_driver_version()
driver_major, driver_minor = system.get_user_mode_driver_version()
err, runtime_version = cudart.cudaRuntimeGetVersion()
if err != cudart.cudaError_t.cudaSuccess:
raise RuntimeError(f"Failed to get CUDA runtime version: {err}")


@ -1,4 +1,4 @@
# Device Query Sample Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cuda-core>=1.0.0


@ -2,6 +2,6 @@
numpy>=2.3.2
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0
nvtx


@ -57,7 +57,7 @@ def timer(message):
@contextlib.contextmanager
def gpu_timer(message, stream):
"""GPU timing context manager using cuda.core CUDA events."""
event_options = EventOptions(enable_timing=True)
event_options = EventOptions(timing_enabled=True)
start_event = stream.record(options=event_options)
yield
end_event = stream.record(options=event_options)


@ -3,5 +3,5 @@
numpy>=2.3.2
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0


@ -68,7 +68,7 @@ CUDA Python (cuda.core), Numba CUDA, Kernel Compilation, Printf in Kernels, Mult
- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- `cuda-python` package (13.0+)
- `cuda-core` package (>=0.6.0)
- `cuda-core` package (>=1.0.0)
- `numba-cuda` package (0.24.0+, for Pythonic kernel authoring)
Download and install:


@ -1,7 +1,7 @@
# Simple Printf Sample - Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cuda-core>=1.0.0
# Numba JIT uses nvJitLink from pip; keep in step with cuda-bindings (e.g. 13.2.x).
nvidia-nvjitlink>=13.2.0
numba-cuda>=0.24.0
numba-cuda>=0.29.0


@ -63,7 +63,7 @@ Import stable symbols from the top-level `cuda.core` package (not `cuda.core.exp
- CUDA Toolkit 13.0 or newer (matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cuda-core` (>=1.0.0)
## Installation
@ -77,7 +77,7 @@ pip install -r requirements.txt
The `requirements.txt` installs:
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cuda-core` (>=1.0.0)
## How to Run


@ -1,4 +1,4 @@
# System Information Sample Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cuda-core>=1.0.0


@ -43,11 +43,8 @@ import sys
try:
from cuda.core import system
from cuda.core.system import (
CUDA_BINDINGS_NVML_IS_COMPATIBLE,
GpuP2PCapsIndex,
TemperatureSensors,
)
from cuda.core.system import CUDA_BINDINGS_NVML_IS_COMPATIBLE
from cuda.core.system.typing import GpuP2PCapsIndex
except ImportError as e:
print(f"Error: Required package not found: {e}")
print("Please install from requirements.txt:")
@ -75,10 +72,11 @@ def format_bytes(nbytes: int) -> str:
def print_driver_info() -> None:
print_header("Driver / NVML")
major, minor = system.get_driver_version()
print(f"CUDA driver version: {major}.{minor}")
print(f"CUDA driver version (full): {system.get_driver_version_full()}")
major, minor = system.get_user_mode_driver_version()
print(f"CUDA driver version (user-mode): {major}.{minor}")
if CUDA_BINDINGS_NVML_IS_COMPATIBLE:
kmd = system.get_kernel_mode_driver_version()
print(f"CUDA driver version (kernel-mode): {'.'.join(str(x) for x in kmd)}")
print(f"NVML version: {system.get_nvml_version()}")
try:
print(f"Driver branch: {system.get_driver_branch()}")
@ -106,7 +104,7 @@ def print_device_info(device: "system.Device") -> None:
except Exception as e: # noqa: BLE001
print(f"Architecture: unavailable ({e})")
try:
print(f"Brand: {device.brand.name}")
print(f"Brand: {device.brand}")
except Exception as e: # noqa: BLE001
print(f"Brand: unavailable ({e})")
@ -133,7 +131,7 @@ def print_device_info(device: "system.Device") -> None:
# Temperature (GPU sensor)
try:
temp_c = device.temperature.sensor(TemperatureSensors.TEMPERATURE_GPU)
temp_c = device.temperature.get_sensor()
print(f"Temperature (GPU sensor): {temp_c} C")
except Exception as e: # noqa: BLE001
print(f"Temperature: unavailable ({e})")
@ -158,12 +156,8 @@ def print_topology(devices: list) -> None:
except Exception as e: # noqa: BLE001
level_name = f"unavailable ({e})"
try:
read = system.get_p2p_status(
d0, d1, GpuP2PCapsIndex.P2P_CAPS_INDEX_READ
)
write = system.get_p2p_status(
d0, d1, GpuP2PCapsIndex.P2P_CAPS_INDEX_WRITE
)
read = system.get_p2p_status(d0, d1, GpuP2PCapsIndex.READ)
write = system.get_p2p_status(d0, d1, GpuP2PCapsIndex.WRITE)
read_name = read.name
write_name = write.name
except Exception as e: # noqa: BLE001


@ -58,8 +58,8 @@ Import stable symbols from the top-level package (not `cuda.core.experimental`).
- CUDA Toolkit 13.0 or newer (matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
## Installation
@ -73,8 +73,8 @@ pip install -r requirements.txt
The requirements.txt installs:
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
## How to Run


@ -1,5 +1,5 @@
# Vector Addition Sample Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0


@ -0,0 +1,129 @@
# binarySearch (Python)
## Description
This sample demonstrates the parallel binary-search algorithms
exposed by **cuda.compute** (from the `cuda-cccl` package). Given
a sorted `d_data` array and a batch of `d_values` to locate, one
device-wide call returns the insertion index for every value:
- `cuda.compute.lower_bound` writes, for each value, the lowest index
where it could be inserted into `d_data` without breaking the sort
order. Equivalent to `numpy.searchsorted(..., side="left")`.
- `cuda.compute.upper_bound` is the analogous upper form, equivalent
  to `numpy.searchsorted(..., side="right")`.

The sample runs both algorithms on two curated inputs: one with
distinct elements (where `lower_bound` and `upper_bound` agree on
any value not in the data) and one with duplicates (where they
diverge on present values). Results are verified against
`numpy.searchsorted`.
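
The difference between the two searches can be reproduced on the host with Python's standard `bisect` module, which implements the same left and right insertion-point semantics. This is a CPU-only illustration, not the `cuda.compute` API; the `lower_bound` and `upper_bound` functions below are local helpers defined just for this sketch, and the inputs are the two cases this sample uses.

```python
from bisect import bisect_left, bisect_right


def lower_bound(data, values):
    """Lowest insertion index for each value (like searchsorted side="left")."""
    return [bisect_left(data, v) for v in values]


def upper_bound(data, values):
    """Highest insertion index for each value (like searchsorted side="right")."""
    return [bisect_right(data, v) for v in values]


# Case 1: distinct data; the two searches differ only for the value that is present (3).
data1, values1 = [1, 3, 5, 7, 9], [0, 3, 4, 10]
print(lower_bound(data1, values1))  # [0, 1, 2, 5]
print(upper_bound(data1, values1))  # [0, 2, 2, 5]

# Case 2: duplicates in the data; present values span a range of indices.
data2, values2 = [1, 3, 3, 5, 7, 9], [3, 3, 5, 8]
print(lower_bound(data2, values2))  # [1, 1, 3, 5]
print(upper_bound(data2, values2))  # [3, 3, 4, 5]
```

For a value that is present, `upper_bound - lower_bound` is the number of times it occurs in the data; for an absent value the two indices coincide.
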
## What You'll Learn
- How to call `cuda.compute.lower_bound` / `upper_bound` with CuPy
arrays
- The semantic difference between `lower_bound` and `upper_bound`,
especially for inputs containing duplicates
- How the output dtype (`np.uintp`) is used for indices
## Key Libraries
- [`cuda.compute`](https://nvidia.github.io/cccl/python.html) (from the `cuda-cccl` package) - device algorithms
- [`cuda.core`](https://nvidia.github.io/cuda-python/cuda-core/latest/) - device setup
- `cupy` - device buffers
- `numpy` - host-side reference via `numpy.searchsorted`
## Key APIs
### From `cuda.compute`
- `cuda.compute.lower_bound(d_data, num_items, d_values, num_values, d_out)`
- `cuda.compute.upper_bound(d_data, num_items, d_values, num_values, d_out)`
### From `cuda_samples_utils`
- `print_gpu_info()` - print device name and compute capability
## Requirements
### Hardware
- NVIDIA GPU with Compute Capability 7.0 or higher
- Minimum GPU memory: 512 MB
### Software
- CUDA Toolkit 13.0 or newer
- Python 3.10 or newer
- `cuda-cccl` (>=1.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)

If the CUDA toolkit is not on your `PATH`, set `CUDA_HOME` so that
cuda.compute's JIT path can locate its dependencies:
```bash
export CUDA_HOME=/usr/local/cuda
```
## Installation
Install the required packages from `requirements.txt`:
```bash
cd /path/to/cuda-samples/python/2_CoreConcepts/binarySearch
pip install -r requirements.txt
```
The `requirements.txt` installs:
- `cuda-cccl` (>=1.0.0) - ships the `cuda.compute` module
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
- `numpy` (>=1.24.0)
## How to Run
### Basic usage
```bash
cd cuda-samples/python/2_CoreConcepts/binarySearch
python binarySearch.py
```
### With custom parameters
```bash
python binarySearch.py --device 1
```
## Expected Output
```
Device: <Your GPU Name>
Compute Capability: <X.Y>
Case 1: distinct data, mixed queries
data = [1, 3, 5, 7, 9]
values = [0, 3, 4, 10]
lower_bound: got [0, 1, 2, 5] expected [0, 1, 2, 5] OK
upper_bound: got [0, 2, 2, 5] expected [0, 2, 2, 5] OK
Case 2: duplicates in data
data = [1, 3, 3, 5, 7, 9]
values = [3, 3, 5, 8]
lower_bound: got [1, 1, 3, 5] expected [1, 1, 3, 5] OK
upper_bound: got [3, 3, 4, 5] expected [3, 3, 4, 5] OK
Done
```
**Note:** Device name and compute capability will vary based on your GPU.
## Files
- `binarySearch.py` - Python implementation
- `README.md` - This file
- `requirements.txt` - Sample dependencies
- `../../Utilities/cuda_samples_utils.py` - Common utilities (imported by this sample)


@ -0,0 +1,147 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""
This sample demonstrates the parallel binary-search algorithms exposed
by cuda.compute (from the cuda-cccl package). Given a sorted
``d_data`` array and a batch of ``d_values`` to locate:

- ``cuda.compute.lower_bound(d_data, num_items, d_values, num_values, d_out)``
  writes, for each value, the lowest index where it could be inserted
  into d_data without breaking the sort order. Matches
  ``numpy.searchsorted(..., side="left")``.
- ``cuda.compute.upper_bound(d_data, num_items, d_values, num_values, d_out)``
  writes the highest such index, matching ``side="right"``.

The sample runs both algorithms on two curated sorted inputs, one with
distinct elements and one with duplicates so the lower/upper distinction
is visible, verifies the results against ``numpy.searchsorted``, and
prints the computed and expected indices side-by-side.
"""
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent.parent / "Utilities"))

try:
    import cuda.compute
    import cupy as cp
    import numpy as np
    from cuda.core import Device
    from cuda_samples_utils import print_gpu_info  # noqa: E402
except ImportError as e:
    print(f"Error: Required package not found: {e}")
    print("Please install from requirements.txt:")
    print(" pip install -r requirements.txt")
    sys.exit(1)


def run_binary_search(h_data: np.ndarray, h_values: np.ndarray) -> bool:
    d_data = cp.asarray(h_data)
    d_values = cp.asarray(h_values)
    d_lb = cp.empty(len(h_values), dtype=np.uintp)
    d_ub = cp.empty(len(h_values), dtype=np.uintp)

    cuda.compute.lower_bound(
        d_data=d_data,
        num_items=len(d_data),
        d_values=d_values,
        num_values=len(d_values),
        d_out=d_lb,
    )
    cuda.compute.upper_bound(
        d_data=d_data,
        num_items=len(d_data),
        d_values=d_values,
        num_values=len(d_values),
        d_out=d_ub,
    )

    got_lb = cp.asnumpy(d_lb)
    got_ub = cp.asnumpy(d_ub)
    expected_lb = np.searchsorted(h_data, h_values, side="left").astype(np.uintp)
    expected_ub = np.searchsorted(h_data, h_values, side="right").astype(np.uintp)
    ok_lb = np.array_equal(got_lb, expected_lb)
    ok_ub = np.array_equal(got_ub, expected_ub)

    print(f" data = {h_data.tolist()}")
    print(f" values = {h_values.tolist()}")
    print(
        f" lower_bound: got {got_lb.tolist()} "
        f"expected {expected_lb.tolist()} {'OK' if ok_lb else 'FAIL'}"
    )
    print(
        f" upper_bound: got {got_ub.tolist()} "
        f"expected {expected_ub.tolist()} {'OK' if ok_ub else 'FAIL'}"
    )
    return ok_lb and ok_ub
def main():
import argparse
parser = argparse.ArgumentParser(
description="Parallel upper_bound / lower_bound via cuda.compute"
)
parser.add_argument("--device", type=int, default=0, help="CUDA device id")
args = parser.parse_args()
device = Device(args.device)
device.set_current()
print_gpu_info(device)
print()
ok = True
# Case 1: values both inside and outside the data range; no duplicates
# in the data. lower_bound and upper_bound agree on values not present.
print("Case 1: distinct data, mixed queries")
h_data1 = np.array([1, 3, 5, 7, 9], dtype=np.int32)
h_values1 = np.array([0, 3, 4, 10], dtype=np.int32)
ok &= run_binary_search(h_data1, h_values1)
print()
# Case 2: duplicates in the data so lower_bound and upper_bound diverge
# on present values.
print("Case 2: duplicates in data")
h_data2 = np.array([1, 3, 3, 5, 7, 9], dtype=np.int32)
h_values2 = np.array([3, 3, 5, 8], dtype=np.int32)
ok &= run_binary_search(h_data2, h_values2)
print()
if ok:
print("Done")
return 0
print("FAILED")
return 1
if __name__ == "__main__":
sys.exit(main())
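
The lower/upper distinction that this sample verifies on the GPU can also be reproduced on the host with Python's standard `bisect` module. The sketch below is not part of the sample and does not call `cuda.compute`; `host_bounds` is a hypothetical helper name. It uses `bisect_left` and `bisect_right`, which correspond to `numpy.searchsorted` with `side="left"` and `side="right"`, on the Case 2 input.

```python
from bisect import bisect_left, bisect_right


def host_bounds(data, values):
    """Return (lower, upper) insertion indices for each query in a sorted list.

    Hypothetical host-side helper for illustration only.
    """
    lower = [bisect_left(data, v) for v in values]
    upper = [bisect_right(data, v) for v in values]
    return lower, upper


data = [1, 3, 3, 5, 7, 9]  # Case 2 data from the sample (contains a duplicate 3)
values = [3, 3, 5, 8]      # Case 2 queries from the sample

lower, upper = host_bounds(data, values)

# upper - lower is the number of occurrences of each query in the data.
counts = [u - l for l, u in zip(lower, upper)]

print("lower:", lower)    # lower: [1, 1, 3, 5]
print("upper:", upper)    # upper: [3, 3, 4, 5]
print("counts:", counts)  # counts: [2, 2, 1, 0]
```

The two index lists differ only for queries that are present in the data, which is why the sample needs duplicates (or at least present values) to make the difference visible.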


@ -0,0 +1,4 @@
cuda-cccl[cu13]>=1.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0
numpy>=1.24.0


@ -139,7 +139,7 @@ def run_sample(num_elements: int = 1024 * 1024, device_id: int = 0) -> bool:
try:
# Make CuPy use our stream
cp.cuda.ExternalStream(int(stream.handle)).use()
cp.cuda.Stream.from_external(stream).use()
# Compile kernels
program = Program(
@ -216,7 +216,7 @@ def run_sample(num_elements: int = 1024 * 1024, device_id: int = 0) -> bool:
test3 = verify_array_result(d_partial, expected_partial)
# Performance timing
event_opts = EventOptions(enable_timing=True)
event_opts = EventOptions(timing_enabled=True)
iterations = 100
stream.sync()


@ -1,6 +1,6 @@
# Block-wise Array Sum Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0
numpy>=2.3.2


@ -0,0 +1,130 @@
# cudaComputeLambdas (Python)
## Description
This sample demonstrates how **cuda.compute** (from the
`cuda-cccl` package) accepts plain Python callables, including
lambdas, as the operators that drive device-wide reductions,
transforms, and scans. Internally `cuda.compute` JIT-compiles the
callable through Numba for the GPU, so you can iterate on the
operator in pure Python and still get a fused device-wide kernel.
The sample exercises three algorithm families:
1. `cuda.compute.reduce_into` - sum via `lambda a, b: a + b`.
2. `cuda.compute.unary_transform` - elementwise `y = x*x + 1` via a
lambda.
3. `cuda.compute.inclusive_scan` - prefix sum over only the even
values, driven by a regular Python function as the binary
operator.
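
As a host-side reference for what the three calls compute, the same Python callables can be applied sequentially with the standard library. This is only a sketch: it does not call `cuda.compute`, and `functools.reduce`, a list comprehension, and `itertools.accumulate` stand in for the device-wide algorithms. The inputs are the ones the sample uses.

```python
from functools import reduce
from itertools import accumulate


def add_evens(a, b):
    # Treat odd operands as zero, so only even values are accumulated.
    return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)


# 1. Reduction: sum of 1..10 with an initial value of 0.
total = reduce(lambda a, b: a + b, range(1, 11), 0)

# 2. Unary transform: y = x*x + 1 over 0..7.
square_plus_one = lambda x: x * x + 1
transformed = [square_plus_one(x) for x in range(8)]

# 3. Inclusive scan with an initial value of 0. accumulate() also emits the
#    initial value itself, so drop the first element.
scanned = list(accumulate([1, 2, 3, 4, 5, 6], add_evens, initial=0))[1:]

print(total)        # 55
print(transformed)  # [1, 2, 5, 10, 17, 26, 37, 50]
print(scanned)      # [0, 2, 2, 6, 6, 12]
```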
## What You'll Learn
- Passing a Python `lambda` directly as the operator to a cuda.compute
device algorithm
- Using a regular Python `def` function for the same purpose when the
op is non-trivial
- The three core algorithm families in cuda.compute: reductions,
transforms, and scans
- How cuda.compute auto-compiles the op to LTO-IR via Numba
## Key Libraries
- [`cuda.compute`](https://nvidia.github.io/cccl/python.html) (from the `cuda-cccl` package) - device algorithms and JIT-compiled Python ops
- [`cuda.core`](https://nvidia.github.io/cuda-python/cuda-core/latest/) - device setup
- `cupy` - device buffers
- `numpy` - scalar init values and host-side verification
## Key APIs
### From `cuda.compute`
- `cuda.compute.reduce_into(d_in, d_out, num_items, op, h_init)` - device-wide reduction
- `cuda.compute.unary_transform(d_in, d_out, num_items, op)` - elementwise unary transform
- `cuda.compute.inclusive_scan(d_in, d_out, op, init_value, num_items)` - inclusive prefix scan
### From `cuda_samples_utils`
- `print_gpu_info(device)` - print the device name and compute capability
## Requirements
### Hardware
- NVIDIA GPU with Compute Capability 7.5 or higher (CUDA 13.x no longer supports older architectures)
### Software
- CUDA Toolkit 13.0 or newer (cuda.compute compiles ops to LTO-IR via
Numba, which needs the toolkit's `nvvm` and `libdevice`).
- Python 3.10 or newer
- `cuda-cccl` (>=1.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
- `numpy` (>=1.24.0)
- `numba-cuda` (pulled in transitively by `cuda-cccl`)
If the CUDA toolkit is not on your `PATH`, set `CUDA_HOME` so Numba
can locate `libdevice`:
```bash
export CUDA_HOME=/usr/local/cuda
```
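
Before running the sample, it can help to confirm that the directory `CUDA_HOME` points at really contains a libdevice bitcode file. The sketch below assumes the usual toolkit layout of `nvvm/libdevice/libdevice*.bc` below the toolkit root; `find_libdevice` is a hypothetical helper and is not part of Numba, `cuda-cccl`, or the samples.

```python
import os
from pathlib import Path


def find_libdevice(cuda_home):
    """Return libdevice bitcode files found below cuda_home.

    Assumes the layout <cuda_home>/nvvm/libdevice/libdevice*.bc.
    Returns an empty list if cuda_home is unset or nothing matches.
    """
    if not cuda_home:
        return []
    return sorted(Path(cuda_home).glob("nvvm/libdevice/libdevice*.bc"))


found = find_libdevice(os.environ.get("CUDA_HOME"))
if found:
    print(f"libdevice found: {found[0]}")
else:
    print("libdevice not found; check CUDA_HOME")
```

An empty result only means this particular layout was not found. Numba may still locate the toolkit by other means, so treat the check as a hint rather than a verdict.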
## Installation
Install the required packages from `requirements.txt`:
```bash
cd /path/to/cuda-samples/python/2_CoreConcepts/cudaComputeLambdas
pip install -r requirements.txt
```
The `requirements.txt` installs:
- `cuda-cccl` (>=1.0.0) - ships the `cuda.compute` module
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
- `numpy` (>=1.24.0)
## How to Run
### Basic usage
```bash
cd cuda-samples/python/2_CoreConcepts/cudaComputeLambdas
python cudaComputeLambdas.py
```
### With custom parameters
```bash
python cudaComputeLambdas.py --device 1
```
## Expected Output
```
Device: <Your GPU Name>
Compute Capability: <X.Y>
reduce_into(lambda a,b: a+b) over 1..10 -> 55 (expected 55) OK
unary_transform(lambda x: x*x + 1):
got = [1, 2, 5, 10, 17, 26, 37, 50]
expected = [1, 2, 5, 10, 17, 26, 37, 50] OK
inclusive_scan(add-evens-only) over [1,2,3,4,5,6]:
got = [0, 2, 2, 6, 6, 12]
expected = [0, 2, 2, 6, 6, 12] OK
Done
```
**Note:** Device name and compute capability will vary based on your GPU.
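
The scan result above relies on `add_evens` being associative, because a parallel scan may combine partial results in a different grouping than a left-to-right loop. That holds here: every intermediate result is a sum of even numbers and therefore even, so re-applying the odd-to-zero filter never discards anything. The brute-force check below is a host-only sketch over a small integer range. It is not a proof and not part of the sample, and `is_associative` is a hypothetical helper.

```python
from itertools import product


def add_evens(a, b):
    # Same operator as in the sample: odd operands count as zero.
    return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)


def subtract(a, b):
    # A deliberately non-associative operator, for contrast.
    return a - b


def is_associative(op, domain):
    """Check op(op(a, b), c) == op(a, op(b, c)) for all triples in domain."""
    return all(
        op(op(a, b), c) == op(a, op(b, c))
        for a, b, c in product(domain, repeat=3)
    )


domain = range(-8, 9)
print(is_associative(add_evens, domain))  # True
print(is_associative(subtract, domain))   # False
```

An operator like `subtract` would still run, but its scan output could depend on how the work is partitioned across threads.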
## Files
- `cudaComputeLambdas.py` - Python implementation
- `README.md` - This file
- `requirements.txt` - Sample dependencies
- `../../Utilities/cuda_samples_utils.py` - Common utilities (imported by this sample)


@ -0,0 +1,179 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""
cuda.compute: Python lambdas as device-wide operators
This sample demonstrates how cuda.compute 1.0 (from the cuda-cccl
package) accepts plain Python callables, including lambdas, as the
operators that drive device-wide reductions, transforms, and scans.
Internally cuda.compute JIT-compiles the callable with Numba for the
device, so you can iterate on the operator in pure Python and still
get a fused GPU kernel.
The sample exercises three algorithm families with Python lambdas /
regular functions:
1. cuda.compute.reduce_into - sum via a lambda.
2. cuda.compute.unary_transform - elementwise y = x*x + 1 via a lambda.
3. cuda.compute.inclusive_scan - prefix sum over only the even values,
using a regular Python function as the binary operator.
"""
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent.parent / "Utilities"))

try:
    import cuda.compute
    import cupy as cp
    import numpy as np
    from cuda.core import Device
    from cuda_samples_utils import print_gpu_info  # noqa: E402
except ImportError as e:
    print(f"Error: Required package not found: {e}")
    print("Please install from requirements.txt:")
    print(" pip install -r requirements.txt")
    sys.exit(1)


def demo_reduce_lambda() -> bool:
    """reduce_into driven by a lambda."""
    dtype = np.int32
    h_init = np.array([0], dtype=dtype)
    d_in = cp.arange(1, 11, dtype=dtype)  # 1..10
    d_out = cp.empty(1, dtype=dtype)

    cuda.compute.reduce_into(
        d_in=d_in,
        d_out=d_out,
        num_items=int(d_in.size),
        op=lambda a, b: a + b,
        h_init=h_init,
    )

    got = int(d_out.get()[0])
    expected = int(d_in.get().sum())
    ok = got == expected
    print(
        f"reduce_into(lambda a,b: a+b) over 1..10 -> {got} "
        f"(expected {expected}) {'OK' if ok else 'FAIL'}"
    )
    return ok


def demo_unary_transform_lambda() -> bool:
    """unary_transform driven by a lambda: y = x*x + 1."""
    d_in = cp.arange(8, dtype=cp.int32)
    d_out = cp.empty_like(d_in)

    cuda.compute.unary_transform(
        d_in=d_in,
        d_out=d_out,
        num_items=int(d_in.size),
        op=lambda x: x * x + 1,
    )

    got = d_out.get()
    expected = (d_in.get().astype(np.int64) ** 2 + 1).astype(np.int32)
    ok = np.array_equal(got, expected)
    print(
        f"unary_transform(lambda x: x*x + 1):\n"
        f" got = {got.tolist()}\n"
        f" expected = {expected.tolist()} {'OK' if ok else 'FAIL'}"
    )
    return ok


def demo_scan_custom_op() -> bool:
    """inclusive_scan with a Python function that sums only even values.

    This shows the same pattern that also works for reduce/transform:
    the Python callable is JIT-compiled for the device by cuda.compute.
    """
    dtype = np.int32
    d_in = cp.array([1, 2, 3, 4, 5, 6], dtype=dtype)
    d_out = cp.empty_like(d_in)
    h_init = np.array([0], dtype=dtype)

    def add_evens(a, b):
        # Treat odd operands as zero; scan accumulates only even values.
        return (a if a % 2 == 0 else 0) + (b if b % 2 == 0 else 0)

    cuda.compute.inclusive_scan(
        d_in=d_in,
        d_out=d_out,
        op=add_evens,
        init_value=h_init,
        num_items=int(d_in.size),
    )

    got = d_out.get()
    # Host reference: running sum of even-only projection of the input.
    h_in = d_in.get()
    proj = np.where(h_in % 2 == 0, h_in, 0)
    expected = np.cumsum(proj).astype(dtype)
    ok = np.array_equal(got, expected)
    print(
        f"inclusive_scan(add-evens-only) over [1,2,3,4,5,6]:\n"
        f" got = {got.tolist()}\n"
        f" expected = {expected.tolist()} {'OK' if ok else 'FAIL'}"
    )
    return ok


def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Drive cuda.compute device algorithms with Python lambdas / callables"
    )
    parser.add_argument("--device", type=int, default=0, help="CUDA device id")
    args = parser.parse_args()

    device = Device(args.device)
    device.set_current()
    print_gpu_info(device)
    print()

    ok = True
    ok &= demo_reduce_lambda()
    print()
    ok &= demo_unary_transform_lambda()
    print()
    ok &= demo_scan_custom_op()
    print()

    if ok:
        print("Done")
        return 0
    print("FAILED")
    return 1


if __name__ == "__main__":
    sys.exit(main())


@ -0,0 +1,4 @@
cuda-cccl[cu13]>=1.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0
numpy>=1.24.0


@ -62,8 +62,8 @@ rebuilding.
- CUDA Toolkit 13.0 or newer (matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
## Installation
@ -77,8 +77,8 @@ pip install -r requirements.txt
The `requirements.txt` installs:
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
## How to Run


@ -182,7 +182,7 @@ def main() -> int:
stream = device.create_stream()
# Tell CuPy to order its allocations on our stream so buffer initialization
# below is serialized with the kernels we launch.
cp.cuda.ExternalStream(int(stream.handle)).use()
cp.cuda.Stream.from_external(stream).use()
graph_builder = graph = None
try:
@ -214,9 +214,9 @@ def main() -> int:
t_individual = run_pipeline_individual(
stream, kernels, config, buffers, N, n_iters=args.iters
)
assert cp.allclose(r3, expected, rtol=1e-5, atol=1e-5), (
"Individual pipeline produced incorrect results"
)
assert cp.allclose(
r3, expected, rtol=1e-5, atol=1e-5
), "Individual pipeline produced incorrect results"
print(
f"\nIndividual launches: {args.iters} iters in {t_individual:.4f}s"
f" ({t_individual * 1e6 / args.iters:.2f} us/iter)"
@ -228,9 +228,9 @@ def main() -> int:
run_pipeline_graph(stream, graph, n_iters=5) # warm up
t_graph = run_pipeline_graph(stream, graph, n_iters=args.iters)
assert cp.allclose(r3, expected, rtol=1e-5, atol=1e-5), (
"Graph pipeline produced incorrect results"
)
assert cp.allclose(
r3, expected, rtol=1e-5, atol=1e-5
), "Graph pipeline produced incorrect results"
print(
f"Graph replay: {args.iters} iters in {t_graph:.4f}s"
f" ({t_graph * 1e6 / args.iters:.2f} us/iter)"


@ -1,5 +1,5 @@
# CUDA Graphs Sample Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0


@ -28,7 +28,7 @@ This sample demonstrates CuPy integration with cuda.core streams:
stream = device.create_stream()
# Use with CuPy operations
cp.cuda.ExternalStream(int(stream.handle)).use()
cp.cuda.Stream.from_external(stream).use()
```
## Key APIs
@ -43,7 +43,7 @@ cp.cuda.ExternalStream(int(stream.handle)).use()
- `cp.fft.rfft()` - Real-to-complex FFT (GPU-accelerated via cuFFT)
- `cp.fft.rfftfreq()` - Generate frequency bins for rfft
- `cp.cuda.ExternalStream()` - Interop with cuda.core streams
- `cp.cuda.Stream.from_external()` - Interop with cuda.core streams
### From NumPy:
@ -115,7 +115,7 @@ VERIFICATION
GPU vs CPU FFT magnitude: Test PASSED
Frequency Detection Accuracy:
440 Hz:
440 Hz: [OK]
...
Done


@ -180,7 +180,7 @@ def run_fft_analysis(
print(f"Compute Capability: sm_{device.arch}")
# Make CuPy use our cuda.core stream
cp.cuda.ExternalStream(int(stream.handle)).use()
cp.cuda.Stream.from_external(stream).use()
# Define test signal: composite of multiple frequencies
test_frequencies = [440.0, 880.0, 1320.0, 2000.0, 5000.0] # Hz
@ -208,7 +208,7 @@ def run_fft_analysis(
print("GPU FFT (cuFFT)")
print("-" * 60)
event_opts = EventOptions(enable_timing=True)
event_opts = EventOptions(timing_enabled=True)
# Warmup
d_fft_result = cp.fft.rfft(d_signal)
@ -291,7 +291,7 @@ def run_fft_analysis(
all_found = True
for expected_freq in test_frequencies:
found = any(abs(f - expected_freq) < 10 for f in detected_freqs)
status = "" if found else ""
status = "[OK]" if found else "[FAIL]"
print(f" {expected_freq:6.0f} Hz: {status}")
all_found = all_found and found


@ -1,6 +1,6 @@
# FFT Signal Analysis Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0
numpy>=2.3.2


@ -417,7 +417,7 @@ def run_critical_alone(
Establishes the pure compute time with every SM on the device available.
"""
stream = device.create_stream()
out = device.allocate(critical_n * 4)
out = device.allocate(critical_n * 4, stream=stream)
total_sm = device.resources.sm.sm_count
try:
opts = EventOptions(timing_enabled=True)
@ -461,7 +461,7 @@ def run_baseline(
"""Both kernels on the primary context, two non-blocking streams."""
long_stream = device.create_stream()
critical_stream = device.create_stream()
out = device.allocate(critical_n * 4)
out = device.allocate(critical_n * 4, stream=critical_stream)
total_sm = device.resources.sm.sm_count
try:
return _run_one(
@ -513,7 +513,7 @@ def run_green_context(
long_stream = ctx_long.create_stream()
critical_stream = ctx_crit.create_stream()
out = device.allocate(critical_n * 4)
out = device.allocate(critical_n * 4, stream=critical_stream)
return _run_one(
device,


@ -1,3 +1,3 @@
cuda-python>=13.0.0
cuda-core>=0.7.0
cuda-core>=1.0.0
numpy>=2.3.2


@ -64,8 +64,8 @@ linking modes are verified against a NumPy reference.
- CUDA Toolkit 13.0 or newer (matches `cuda-python` 13.x)
- Python 3.10 or newer
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
## Installation
@ -79,8 +79,8 @@ pip install -r requirements.txt
The `requirements.txt` installs:
- `cuda-python` (>=13.0.0)
- `cuda-core` (>=0.6.0)
- `cupy-cuda13x` (>=13.0.0)
- `cuda-core` (>=1.0.0)
- `cupy-cuda13x` (>=14.0.0)
## How to Run


@ -140,9 +140,7 @@ def link_lto(device):
main_obj = Program(MAIN_SRC, "c++", options=prog_opts).compile("ltoir")
user_obj = Program(USER_SRC, "c++", options=prog_opts).compile("ltoir")
linker_opts = LinkerOptions(
arch=f"sm_{device.arch}", link_time_optimization=True
)
linker_opts = LinkerOptions(arch=f"sm_{device.arch}", link_time_optimization=True)
linker = Linker(main_obj, user_obj, options=linker_opts)
return linker.link("cubin")
@ -175,7 +173,9 @@ def main() -> int:
description="JIT + LTO linking of two device modules with cuda.core"
)
parser.add_argument(
"--elements", type=int, default=1 << 16,
"--elements",
type=int,
default=1 << 16,
help="Number of float32 elements (default: 65536)",
)
parser.add_argument("--device", type=int, default=0, help="CUDA device id")
@ -186,7 +186,7 @@ def main() -> int:
print_gpu_info(device)
stream = device.create_stream()
cp.cuda.ExternalStream(int(stream.handle)).use()
cp.cuda.Stream.from_external(stream).use()
try:
N = args.elements


@ -1,5 +1,5 @@
# JIT + LTO Linking Sample Requirements
cuda-python>=13.0.0
cuda-core>=0.6.0
cupy-cuda13x>=13.0.0
cuda-core>=1.0.0
cupy-cuda13x>=14.0.0


@ -90,6 +90,15 @@ elapsed_ms = (end_event - start_event) / n_iterations
- Python 3.10 or newer
- See `requirements.txt` for Python packages
### Platform Support:
The benchmark loops in this sample read kernel results back from
`ManagedMemoryResource` allocations between launches, which requires the
device property `concurrent_managed_access=True`. This is only supported on
Linux with HMM (Pascal and newer). On Windows (WDDM/MCDM/TCC) the property
is `False`, so the sample exits early with a waive message and exit code
`2`.
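
The waive decision described above can be written as a small pure function. This is a sketch under stated assumptions: `waive_exit_code` and `EXIT_WAIVED` are hypothetical names, and the caller is assumed to supply the value of the `concurrent_managed_access` device property, however it is obtained. It is not the sample's actual implementation.

```python
import sys

EXIT_WAIVED = 2  # exit code this README describes for a waived run


def waive_exit_code(platform, concurrent_managed_access):
    """Return EXIT_WAIVED if the sample cannot run, otherwise None.

    platform: a value in the style of sys.platform, e.g. "linux" or "win32".
    concurrent_managed_access: bool, assumed to come from the device property.
    """
    if platform == "win32" or not concurrent_managed_access:
        return EXIT_WAIVED
    return None


# Assumption: the property value is hard-coded here for illustration.
code = waive_exit_code(sys.platform, concurrent_managed_access=True)
print("waived" if code is not None else "ok to run")
```

Checking the property in addition to the platform would also cover a Linux configuration where concurrent managed access is unavailable, at the cost of needing an initialized device first.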
## Installation
```bash
@ -115,8 +124,8 @@ Compute Capability: X.X
Compiling CUDA kernels with cuda.core.Program...
Target architecture: sm_XX
vector_add kernel compiled
reduce_sum kernel compiled
[OK] vector_add kernel compiled
[OK] reduce_sum kernel compiled
============================================================
VECTOR ADDITION - Launch Configuration Tuning
@ -132,11 +141,11 @@ Block Size: 64 | Blocks: 156250 | Time: X.XXXX ± X.XXXX ms
...
------------------------------------------------------------
OPTIMAL: block_size=XXX (X.XXXX ms)
✗ WORST: block_size=XXX (X.XXXX ms)
[OK] OPTIMAL: block_size=XXX (X.XXXX ms)
[FAIL] WORST: block_size=XXX (X.XXXX ms)
Speedup: X.XXx
Results verified correct!
[OK] Results verified correct!
...


@ -148,7 +148,7 @@ def benchmark_kernel_1d(
stream.sync()
# Timed runs with CUDA events
event_opts = EventOptions(enable_timing=True)
event_opts = EventOptions(timing_enabled=True)
start_event = device.create_event(options=event_opts)
end_event = device.create_event(options=event_opts)
@ -178,7 +178,7 @@ def print_gpu_info(device):
def allocate_managed_array(mr, stream, n_elements, dtype=np.float32):
"""Allocate device-preferred unified memory and return buffer with numpy view."""
n_bytes = n_elements * np.dtype(dtype).itemsize
buffer = mr.allocate(n_bytes, stream)
buffer = mr.allocate(n_bytes, stream=stream)
stream.sync()
# Zero-copy numpy view via DLPack (holds reference to buffer)
@ -240,11 +240,11 @@ def demo_vector_add_tuning(device, stream, mr, kernel):
print("-" * 60)
print(
f"\n OPTIMAL: block_size={best['block_size']} "
f"\n[OK] OPTIMAL: block_size={best['block_size']} "
f"({best['mean_time_ms']:.4f} ms)"
)
print(
f"✗ WORST: block_size={worst['block_size']} "
f"[FAIL] WORST: block_size={worst['block_size']} "
f"({worst['mean_time_ms']:.4f} ms)"
)
print(f" Speedup: {worst['mean_time_ms']/best['mean_time_ms']:.2f}x")
@ -253,7 +253,7 @@ def demo_vector_add_tuning(device, stream, mr, kernel):
stream.sync()
expected = np_a + np_b
if np.allclose(np_c, expected):
print("\n Results verified correct!")
print("\n[OK] Results verified correct!")
return results
finally:
@ -316,7 +316,7 @@ def demo_reduction_tuning(device, stream, mr, kernel):
worst = max(results, key=lambda x: x["mean_time_ms"])
print("-" * 60)
print(f"\n OPTIMAL: block_size={best['block_size']}")
print(f"\n[OK] OPTIMAL: block_size={best['block_size']}")
print(
f" Speedup over worst: {worst['mean_time_ms']/best['mean_time_ms']:.2f}x"
)
@ -341,6 +341,14 @@ def main():
3. Benchmarking different thread block configurations
4. Finding optimal threads-per-block for various kernel types
"""
if sys.platform == "win32":
print(
"This sample relies on ManagedMemoryResource with concurrent host "
"access, which is not supported on Windows "
"(concurrent_managed_access=False). Waiving this sample."
)
sys.exit(2)
print("=" * 60)
print("Launch Configuration Tuning (cuda.core)")
print("Finding the Best Block Size for Your Kernel")
@ -365,10 +373,10 @@ def main():
print(f" Target architecture: {arch}")
vec_add_kernel = compile_kernel(device, VECTOR_ADD_KERNEL, "vector_add")
print(" vector_add kernel compiled")
print(" [OK] vector_add kernel compiled")
reduction_kernel = compile_kernel(device, REDUCTION_KERNEL, "reduce_sum")
print(" reduce_sum kernel compiled")
print(" [OK] reduce_sum kernel compiled")
# Run demonstrations
demo_vector_add_tuning(device, stream, mr, vec_add_kernel)


@ -2,5 +2,5 @@
# Requires Python 3.10+, CUDA Toolkit 13.0+
cuda-python>=13.0.0
cuda-core>=0.6.0
cuda-core>=1.0.0
numpy>=2.3.2

Some files were not shown because too many files have changed in this diff.