cuda-samples/Samples/3_CUDA_Features

3. CUDA Features

bf16TensorCoreGemm

A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 for the Tensor Cores of the Ampere chip family, which accelerate matrix operations. The sample also uses the asynchronous copy provided by the CUDA pipeline interface for global-memory-to-shared-memory loads, which improves kernel performance and reduces register pressure.

binaryPartitionCG

This simple sample illustrates binary partition cooperative groups and a reduction within the thread block.
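As a rough illustration of the idea (not the sample's own code), the sketch below splits each 32-thread tile into two groups with `cg::binary_partition` and reduces each group with `cg::reduce`. The kernel name `oddEvenSum`, the input data, and the launch configuration are assumptions made for this example; it also assumes a device of compute capability 7.0 or higher with managed memory support.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: accumulates odd and even inputs into separate sums.
__global__ void oddEvenSum(const int *in, int n, int *oddSum, int *evenSum) {
  cg::thread_block block = cg::this_thread_block();
  cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int value = (idx < n) ? in[idx] : 0;  // padding threads contribute 0
  bool isOdd = (value & 1) != 0;

  // Every thread of the tile calls binary_partition; threads with the same
  // predicate value end up in the same coalesced group.
  cg::coalesced_group sub = cg::binary_partition(tile, isOdd);

  // Reduce within the sub-group, then let one thread publish the partial sum.
  int partial = cg::reduce(sub, value, cg::plus<int>());
  if (sub.thread_rank() == 0) atomicAdd(isOdd ? oddSum : evenSum, partial);
}

int main() {
  const int n = 1000;
  int *in = nullptr, *sums = nullptr;
  cudaMallocManaged(&in, n * sizeof(int));
  cudaMallocManaged(&sums, 2 * sizeof(int));
  for (int i = 0; i < n; ++i) in[i] = i + 1;  // values 1..1000
  sums[0] = sums[1] = 0;

  oddEvenSum<<<(n + 127) / 128, 128>>>(in, n, &sums[0], &sums[1]);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("odd sum = %d, even sum = %d\n", sums[0], sums[1]);  // 250000, 250500

  cudaFree(in);
  cudaFree(sums);
  return 0;
}
```

The block size is kept a multiple of 32 so that every tile is fully populated before it is partitioned.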

bindlessTexture

This example demonstrates the use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. A GPU with compute capability 3.0 or higher is required to run the sample.

cdpAdvancedQuicksort

This sample demonstrates an advanced quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

cdpBezierTessellation

This sample demonstrates Bézier tessellation of lines implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

cdpQuadtree

This sample demonstrates Quad Trees implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

cdpSimplePrint

This sample demonstrates a simple printf implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
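A minimal sketch of the underlying mechanism, a kernel launching another kernel from the device, might look like the following. The kernel names are hypothetical, and the build line in the comment is an assumption about a typical setup (relocatable device code plus the device runtime library).

```cuda
// Assumed build command: nvcc -rdc=true cdp_print.cu -o cdp_print -lcudadevrt
#include <cstdio>

// Child grid, launched from the device.
__global__ void childKernel(int parentThread) {
  printf("child thread %d launched by parent thread %d\n", threadIdx.x,
         parentThread);
}

// Parent grid: each thread launches its own small child grid.
__global__ void parentKernel() {
  childKernel<<<1, 2>>>(threadIdx.x);
}

int main() {
  parentKernel<<<1, 2>>>();
  // The parent grid is not complete until all of its child grids finish,
  // so a host-side synchronize covers both levels.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  return 0;
}
```

The order of the printed lines is not guaranteed, since the child grids run independently of one another.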

cdpSimpleQuicksort

This sample demonstrates simple quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

cudaCompressibleMemory

This sample demonstrates compressible memory allocation using the cuMemMap API.

cudaTensorCoreGemm

CUDA sample demonstrating a GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 9.

This sample demonstrates the use of the new CUDA WMMA API employing the Tensor Cores introduced in the Volta chip family for faster matrix operations.

In addition, it demonstrates the use of the CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.
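The WMMA flow described above can be sketched for a single 16x16x16 tile computed by one warp. This is a deliberately tiny illustration rather than the tiled, shared-memory GEMM in the sample; the kernel name `wmmaTile` and the constant input values are assumptions, and the code needs a device of compute capability 7.0 or higher.

```cuda
// Assumed build command: nvcc -arch=sm_70 wmma_tile.cu -o wmma_tile
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes C = A * B for a single 16x16x16 tile.
__global__ void wmmaTile(const half *a, const half *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

  wmma::fill_fragment(cFrag, 0.0f);            // start from C = 0
  wmma::load_matrix_sync(aFrag, a, 16);        // leading dimension 16
  wmma::load_matrix_sync(bFrag, b, 16);
  wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // C += A * B on Tensor Cores
  wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}

int main() {
  half *a = nullptr, *b = nullptr;
  float *c = nullptr;
  cudaMallocManaged(&a, 256 * sizeof(half));
  cudaMallocManaged(&b, 256 * sizeof(half));
  cudaMallocManaged(&c, 256 * sizeof(float));
  for (int i = 0; i < 256; ++i) {
    a[i] = __float2half(1.0f);
    b[i] = __float2half(2.0f);
  }

  wmmaTile<<<1, 32>>>(a, b, c);  // exactly one warp
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // Each element is a dot product of 16 ones with 16 twos.
  printf("c[0] = %.1f\n", c[0]);  // prints "c[0] = 32.0"

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}
```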

dmmaTensorCoreGemm

A CUDA sample demonstrating double-precision GEMM computation using the double-precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 for the Tensor Cores of the Ampere chip family, which accelerate matrix operations. The sample also uses the asynchronous copy provided by the CUDA pipeline interface for global-memory-to-shared-memory loads, which improves kernel performance and reduces register pressure. It further demonstrates how to use the cooperative groups async copy interface over a group to perform the same loads.

globalToShmemAsyncCopy

This sample implements matrix multiplication using asynchronous copies of data from global to shared memory on devices of compute capability 8.0 or higher. It also demonstrates the arrive-wait barrier for synchronization.
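The global-to-shared staging step can be sketched with the cooperative groups `memcpy_async` interface. This is a simplified stand-in for the sample's matrix multiplication: the kernel `scaleTiles`, the tile size, and the data are assumptions for illustration, and the arrive-wait barrier is not shown.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int kTile = 256;  // elements per block (assumed for this sketch)

// Each block stages one tile in shared memory, then scales it.
__global__ void scaleTiles(const float *in, float *out, float factor) {
  __shared__ float tile[kTile];
  cg::thread_block block = cg::this_thread_block();

  // Collective copy issued by the whole block; the hardware async-copy path
  // is typically used on compute capability 8.0+, with a fallback elsewhere.
  cg::memcpy_async(block, tile, in + blockIdx.x * kTile,
                   sizeof(float) * kTile);
  cg::wait(block);  // all copies done and visible to the block

  out[blockIdx.x * kTile + threadIdx.x] = tile[threadIdx.x] * factor;
}

int main() {
  const int blocks = 4, n = blocks * kTile;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;

  scaleTiles<<<blocks, kTile>>>(in, out, 3.0f);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("out[0] = %.1f\n", out[0]);  // prints "out[0] = 3.0"

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```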

graphMemoryFootprint

This sample demonstrates how graph memory nodes re-use virtual addresses and physical memory.

graphMemoryNodes

A demonstration of memory allocations and frees within CUDA graphs using Graph APIs and Stream Capture APIs.

immaTensorCoreGemm

CUDA sample demonstrating an integer GEMM computation using the integer Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 10. This sample demonstrates the use of the CUDA WMMA API employing the Tensor Cores introduced in the Volta chip family for faster matrix operations. In addition, it demonstrates the use of the CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.
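The shared-memory opt-in mentioned above can be sketched independently of the GEMM itself. The kernel `fillAndSum` and the 64 KB request are assumptions chosen for illustration; the code first queries the device limit, since the opt-in maximum varies by architecture.

```cuda
#include <cstdio>

// Hypothetical kernel that fills a large dynamic shared buffer and sums it.
__global__ void fillAndSum(float *out, int count) {
  extern __shared__ float buf[];
  for (int i = threadIdx.x; i < count; i += blockDim.x) buf[i] = 1.0f;
  __syncthreads();
  if (threadIdx.x == 0) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) sum += buf[i];
    *out = sum;
  }
}

int main() {
  const int count = 16 * 1024;                // 16K floats
  const int bytes = count * sizeof(float);    // 64 KB, above the 48 KB default

  int maxOptin = 0;
  cudaDeviceGetAttribute(&maxOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
  if (maxOptin < bytes) {
    printf("device allows at most %d bytes of opt-in shared memory\n", maxOptin);
    return 0;
  }

  // Without this call, a launch requesting more than 48 KB fails.
  cudaFuncSetAttribute(fillAndSum, cudaFuncAttributeMaxDynamicSharedMemorySize,
                       bytes);

  float *out = nullptr;
  cudaMallocManaged(&out, sizeof(float));
  fillAndSum<<<1, 256, bytes>>>(out, count);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("sum = %.1f\n", *out);  // prints "sum = 16384.0"

  cudaFree(out);
  return 0;
}
```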

jacobiCudaGraphs

Demonstrates updating an instantiated CUDA graph, with the Jacobi iterative method as the workload, using both the cudaGraphExecKernelNodeSetParams() and the cudaGraphExecUpdate() approaches.

memMapIPCDrv

This basic CUDA Driver API sample demonstrates inter-process communication using the cuMemMap APIs, with one process per GPU for computation. Requires compute capability 3.0 or higher and a Linux or Windows operating system.

newdelete

This sample demonstrates dynamic global memory allocation through device C++ new and delete operators and virtual function declarations available with CUDA 4.0.
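A minimal sketch of device-side `new` and `delete` follows. The kernel `sumSquares` is hypothetical and does not cover the virtual-function part of the sample; the heap-size call is optional here and included only to show where the device heap limit is configured.

```cuda
#include <cstdio>

// Each thread allocates a scratch array on the device heap, uses it, frees it.
__global__ void sumSquares(int *out, int len) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int *scratch = new int[len];
  if (scratch == nullptr) {  // device heap exhausted
    out[tid] = -1;
    return;
  }
  for (int j = 0; j < len; ++j) scratch[j] = j * j;
  int sum = 0;
  for (int j = 0; j < len; ++j) sum += scratch[j];
  out[tid] = sum;
  delete[] scratch;
}

int main() {
  // Optional: size the device heap before any kernel that allocates runs.
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);

  const int threads = 64;
  int *out = nullptr;
  cudaMallocManaged(&out, threads * sizeof(int));

  sumSquares<<<1, threads>>>(out, 4);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("out[0] = %d\n", out[0]);  // 0 + 1 + 4 + 9, prints "out[0] = 14"

  cudaFree(out);
  return 0;
}
```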

ptxjit

This sample uses the Driver API to just-in-time compile (JIT) a Kernel from PTX code. Additionally, this sample demonstrates the seamless interoperability capability of the CUDA Runtime and CUDA Driver API calls. For CUDA 5.5, this sample shows how to use cuLink* functions to link PTX assembly using the CUDA driver at runtime.

simpleCudaGraphs

A demonstration of CUDA graph creation, instantiation, and launch using the Graph APIs and the Stream Capture APIs.
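The stream-capture path can be sketched as follows: record work into a graph, instantiate it once, then launch it repeatedly. The kernel `addOne` and the sizes are assumptions, and the sketch uses `cudaGraphInstantiateWithFlags`, which assumes CUDA 11.4 or newer.

```cuda
#include <cstdio>

__global__ void addOne(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

int main() {
  const int n = 1024;
  int *data = nullptr;
  cudaMallocManaged(&data, n * sizeof(int));
  for (int i = 0; i < n; ++i) data[i] = 0;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // 1. Capture: the two launches are recorded into a graph, not executed.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  addOne<<<n / 256, 256, 0, stream>>>(data, n);
  addOne<<<n / 256, 256, 0, stream>>>(data, n);
  cudaStreamEndCapture(stream, &graph);

  // 2. Instantiate once into an executable graph.
  cudaGraphExec_t graphExec;
  cudaGraphInstantiateWithFlags(&graphExec, graph, 0);

  // 3. Launch the whole graph three times.
  for (int iter = 0; iter < 3; ++iter) cudaGraphLaunch(graphExec, stream);

  cudaError_t err = cudaStreamSynchronize(stream);
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("data[0] = %d\n", data[0]);  // 3 launches x 2 kernels, prints 6

  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(data);
  return 0;
}
```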

StreamPriorities

This sample demonstrates basic use of stream priorities.
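A short sketch of the API involved: query the priority range the device supports, then create one stream at each end of it. The kernel `spin` and the stream names are placeholders; note that numerically lower values mean higher priority.

```cuda
#include <cstdio>

// Placeholder workload.
__global__ void spin(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = data[i];
    for (int k = 0; k < 1000; ++k) v = v * 0.999f + 0.001f;
    data[i] = v;
  }
}

int main() {
  // Lower numbers mean higher priority, so "greatest" is the smaller value.
  int leastPriority = 0, greatestPriority = 0;
  cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
  printf("priority range: least=%d greatest=%d\n", leastPriority,
         greatestPriority);

  cudaStream_t lowStream, highStream;
  cudaStreamCreateWithPriority(&lowStream, cudaStreamNonBlocking, leastPriority);
  cudaStreamCreateWithPriority(&highStream, cudaStreamNonBlocking,
                               greatestPriority);

  const int n = 1 << 20;
  float *a = nullptr, *b = nullptr;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&b, n * sizeof(float));
  cudaMemset(a, 0, n * sizeof(float));
  cudaMemset(b, 0, n * sizeof(float));

  // When both streams have pending blocks, the scheduler prefers blocks
  // from the higher-priority stream.
  spin<<<n / 256, 256, 0, lowStream>>>(a, n);
  spin<<<n / 256, 256, 0, highStream>>>(b, n);

  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaStreamDestroy(lowStream);
  cudaStreamDestroy(highStream);
  cudaFree(a);
  cudaFree(b);
  return 0;
}
```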

tf32TensorCoreGemm

A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 for the Tensor Cores of the Ampere chip family, which accelerate matrix operations. The sample also uses the asynchronous copy provided by the CUDA pipeline interface for global-memory-to-shared-memory loads, which improves kernel performance and reduces register pressure.

warpAggregatedAtomicsCG

This sample demonstrates how to use Cooperative Groups (CG) to perform warp-aggregated atomics on single and multiple counters, a useful technique for improving performance when many threads atomically add to one or more counters.
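The single-counter case can be sketched as below: the active threads of a warp elect one leader to perform a single `atomicAdd` on behalf of all of them, and each thread then derives its own slot. The helper `atomicAggInc`, the filter kernel, and the test data are assumptions for illustration, not the sample's code.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// One atomic per group of active threads instead of one per thread.
__device__ int atomicAggInc(int *counter) {
  cg::coalesced_group active = cg::coalesced_threads();
  int base = 0;
  if (active.thread_rank() == 0)
    base = atomicAdd(counter, static_cast<int>(active.size()));
  // Broadcast the leader's base offset and add this thread's rank.
  return active.shfl(base, 0) + active.thread_rank();
}

// Hypothetical stream-compaction kernel: keeps only positive values.
__global__ void filterPositive(int *dst, int *count, const int *src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (src[i] > 0) dst[atomicAggInc(count)] = src[i];
}

int main() {
  const int n = 1000;
  int *src = nullptr, *dst = nullptr, *count = nullptr;
  cudaMallocManaged(&src, n * sizeof(int));
  cudaMallocManaged(&dst, n * sizeof(int));
  cudaMallocManaged(&count, sizeof(int));
  for (int i = 0; i < n; ++i) src[i] = (i % 3 == 0) ? -1 : i;
  *count = 0;

  filterPositive<<<(n + 255) / 256, 256>>>(dst, count, src, n);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("kept %d of %d values\n", *count, n);  // 666 of 1000

  cudaFree(src);
  cudaFree(dst);
  cudaFree(count);
  return 0;
}
```

The order of the values in `dst` is not deterministic, only their count and contents.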

graphConditionalNodes

Demonstrates the use of CUDA Graph conditional nodes, available starting in CUDA 12.4.