Fix two obvious errors, the first one is that five tasks were submitted to pipeline at the same time and task 4 conflicts with task 0, the remaining two are copy errors |
||
---|---|---|
bf16TensorCoreGemm | ||
binaryPartitionCG | ||
bindlessTexture | ||
cdpAdvancedQuicksort | ||
cdpBezierTessellation | ||
cdpQuadtree | ||
cdpSimplePrint | ||
cdpSimpleQuicksort | ||
cudaCompressibleMemory | ||
cudaTensorCoreGemm | ||
dmmaTensorCoreGemm | ||
globalToShmemAsyncCopy | ||
graphConditionalNodes | ||
graphMemoryFootprint | ||
graphMemoryNodes | ||
immaTensorCoreGemm | ||
jacobiCudaGraphs | ||
memMapIPCDrv | ||
newdelete | ||
ptxjit | ||
simpleCudaGraphs | ||
StreamPriorities | ||
tf32TensorCoreGemm | ||
warpAggregatedAtomicsCG | ||
README.md |
3. CUDA Features
bf16TensorCoreGemm
A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API, introduced with CUDA 11 for the tensor cores of the Ampere chip family to speed up matrix operations. This sample also uses the async copy provided by the CUDA pipeline interface for global memory (gmem) to shared memory (shmem) async loads, which improves kernel performance and reduces register pressure.
binaryPartitionCG
This simple sample illustrates binary partitioning of a thread block tile with cooperative groups, followed by a reduction within each resulting group.
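As a rough illustration of the idea, the sketch below splits each warp into an "odd" and an "even" group with `cg::binary_partition` and sums each group with `cg::reduce`. The kernel name `oddEvenSums` and the host helper `runOddEvenSums` are hypothetical names chosen for this example, not code from the sample, and error checking is omitted for brevity.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel: each warp partitions itself by parity and reduces
// each partition separately. sums[0] collects odd values, sums[1] even ones.
__global__ void oddEvenSums(const int *in, int *sums, int n) {
  cg::thread_block block = cg::this_thread_block();
  cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Out-of-range threads still take part in the collective calls; they
  // contribute 0, which lands in the even group and does not change its sum.
  int v = (idx < n) ? in[idx] : 0;

  bool isOdd = (v & 1) != 0;
  cg::coalesced_group sub = cg::binary_partition(warp, isOdd);
  int partial = cg::reduce(sub, v, cg::plus<int>());

  // One thread per subgroup publishes the subgroup's sum.
  if (sub.thread_rank() == 0) atomicAdd(&sums[isOdd ? 0 : 1], partial);
}

// Hypothetical host helper: returns {sum of odd values, sum of even values}.
std::vector<int> runOddEvenSums(const std::vector<int> &h) {
  std::vector<int> out(2, 0);
  int n = static_cast<int>(h.size());
  if (n == 0) return out;

  int *dIn = nullptr, *dSums = nullptr;
  cudaMalloc(&dIn, n * sizeof(int));
  cudaMalloc(&dSums, 2 * sizeof(int));
  cudaMemcpy(dIn, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemset(dSums, 0, 2 * sizeof(int));

  int threads = 128;  // a multiple of the warp size
  int blocks = (n + threads - 1) / threads;
  oddEvenSums<<<blocks, threads>>>(dIn, dSums, n);

  cudaMemcpy(out.data(), dSums, 2 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dSums);
  return out;
}
```

Calling `runOddEvenSums({1, 2, 3, 4, 5})` should return the odd sum followed by the even sum. Padding out-of-range threads with zero, rather than returning early, keeps every thread of the tile inside the collective calls.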
bindlessTexture
This example demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. A GPU with Compute Capability SM 3.0 is required to run the sample.
cdpAdvancedQuicksort
This sample demonstrates an advanced quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
cdpBezierTessellation
This sample demonstrates Bézier tessellation of lines implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
cdpQuadtree
This sample demonstrates Quad Trees implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
cdpSimplePrint
This sample demonstrates simple printf implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
cdpSimpleQuicksort
This sample demonstrates simple quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
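The CUDA Dynamic Parallelism samples above all rely on a kernel launching another kernel from device code. The following is a minimal sketch of that mechanism only, not code from any of the samples; `childDouble`, `parentLaunch`, and `runParentChild` are hypothetical names. It assumes compilation with relocatable device code enabled (for example `nvcc -rdc=true`), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Child kernel: doubles each element. It is launched from the device.
__global__ void childDouble(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2;
}

// Parent kernel: a single thread sizes the child grid and launches it.
__global__ void parentLaunch(int *data, int n) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    childDouble<<<blocks, threads>>>(data, n);
  }
}

// Hypothetical host helper: returns the input with every element doubled.
std::vector<int> runParentChild(std::vector<int> h) {
  int n = static_cast<int>(h.size());
  if (n == 0) return h;

  int *d = nullptr;
  cudaMalloc(&d, n * sizeof(int));
  cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

  parentLaunch<<<1, 1>>>(d, n);
  // The parent grid is not considered complete until its child grids finish,
  // so synchronizing on the parent also waits for the child.
  cudaDeviceSynchronize();

  cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
  return h;
}
```

The parent decides the child's launch configuration at run time, which is what lets algorithms such as quicksort or quadtree construction recurse on the GPU without returning to the host.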
cudaCompressibleMemory
This sample demonstrates compressible memory allocation using the cuMemMap API.
cudaTensorCoreGemm
A CUDA sample demonstrating a GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 9.
This sample demonstrates the use of the CUDA WMMA API employing the Tensor Cores introduced in the Volta chip family for faster matrix operations.
In addition, it demonstrates the use of the CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.
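To make the WMMA workflow concrete, here is a minimal sketch in which one warp multiplies a single 16x16 tile. It is far simpler than the sample's tiled GEMM and is not taken from it; `wmmaTile` and `runWmmaTile` are hypothetical names. It assumes a device of compute capability 7.0 or higher and a matching `-arch` flag, and it omits error checking.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

using namespace nvcuda;

// One warp computes C = A * B for a single 16x16 tile.
// A is row-major half, B is column-major half, C is row-major float.
__global__ void wmmaTile(const half *a, const half *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

  wmma::fill_fragment(cFrag, 0.0f);            // start from a zero accumulator
  wmma::load_matrix_sync(aFrag, a, 16);        // leading dimension is 16
  wmma::load_matrix_sync(bFrag, b, 16);
  wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // cFrag = aFrag * bFrag + cFrag
  wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}

// Hypothetical host helper: fills A with aVal and B with bVal, runs the
// kernel with exactly one warp, and returns the 256 entries of C.
std::vector<float> runWmmaTile(float aVal, float bVal) {
  const int elems = 16 * 16;
  std::vector<half> hA(elems, __float2half(aVal));
  std::vector<half> hB(elems, __float2half(bVal));
  std::vector<float> hC(elems, 0.0f);

  half *dA = nullptr, *dB = nullptr;
  float *dC = nullptr;
  cudaMalloc(&dA, elems * sizeof(half));
  cudaMalloc(&dB, elems * sizeof(half));
  cudaMalloc(&dC, elems * sizeof(float));
  cudaMemcpy(dA, hA.data(), elems * sizeof(half), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), elems * sizeof(half), cudaMemcpyHostToDevice);

  wmmaTile<<<1, 32>>>(dA, dB, dC);  // WMMA operations are warp-wide

  cudaMemcpy(hC.data(), dC, elems * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return hC;
}
```

With constant inputs, every entry of C is 16 times the product of the two fill values, which makes the result easy to check by hand.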
dmmaTensorCoreGemm
A CUDA sample demonstrating double-precision GEMM computation using the double-precision Warp Matrix Multiply and Accumulate (WMMA) API, introduced with CUDA 11 for the tensor cores of the Ampere chip family to speed up matrix operations. This sample also uses the async copy provided by the CUDA pipeline interface for global memory (gmem) to shared memory (shmem) async loads, which improves kernel performance and reduces register pressure. Further, this sample demonstrates how to use the cooperative groups async copy interface over a group to perform gmem to shmem async loads.
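The cooperative groups async copy interface mentioned above can be sketched in isolation as follows. This is not the sample's GEMM kernel; it only stages one block-sized chunk of an array through shared memory with `cg::memcpy_async` and `cg::wait`. The names `scaleViaShared` and `runScaleViaShared` are hypothetical, and error checking is omitted.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Each block copies its chunk of 'in' to shared memory with a group-wide
// async copy, waits for it, then writes the scaled values to 'out'.
__global__ void scaleViaShared(const float *in, float *out, int n, float k) {
  extern __shared__ float tile[];
  cg::thread_block block = cg::this_thread_block();

  int base = blockIdx.x * blockDim.x;
  int count = min(static_cast<int>(blockDim.x), n - base);

  // Collective call: every thread of the block must reach it.
  cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
  cg::wait(block);  // blocks until the staged data is visible to the group

  if (static_cast<int>(threadIdx.x) < count)
    out[base + threadIdx.x] = tile[threadIdx.x] * k;
}

// Hypothetical host helper: returns every element of h multiplied by k.
std::vector<float> runScaleViaShared(const std::vector<float> &h, float k) {
  int n = static_cast<int>(h.size());
  std::vector<float> result(n);
  if (n == 0) return result;

  float *dIn = nullptr, *dOut = nullptr;
  cudaMalloc(&dIn, n * sizeof(float));
  cudaMalloc(&dOut, n * sizeof(float));
  cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  int threads = 128;
  int blocks = (n + threads - 1) / threads;  // guarantees base < n per block
  scaleViaShared<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n, k);

  cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);
  return result;
}
```

On devices of compute capability 8.0 or higher the copy can bypass registers; on older devices the same call is expected to fall back to an ordinary copy, so the sketch should stay portable.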
globalToShmemAsyncCopy
This sample implements matrix multiplication using asynchronous copies of data from global to shared memory when running on compute capability 8.0 or higher. It also demonstrates the arrive-wait barrier for synchronization.
graphMemoryFootprint
This sample demonstrates how graph memory nodes re-use virtual addresses and physical memory.
graphMemoryNodes
A demonstration of memory allocations and frees within CUDA graphs using Graph APIs and Stream Capture APIs.
immaTensorCoreGemm
A CUDA sample demonstrating an integer GEMM computation using the integer Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 10. This sample demonstrates the use of the CUDA WMMA API employing the Tensor Cores introduced in the Volta chip family for faster matrix operations. In addition, it demonstrates the use of the CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.
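The opt-in for extra dynamic shared memory can be sketched on its own, independent of any GEMM. In the example below, `sumOnes` and `runWithOptInSharedMemory` are hypothetical names, and the helper returns -1 when the device cannot provide the requested amount; this is an illustrative convention, not behavior of the sample.

```cuda
#include <cuda_runtime.h>

// Fills 'count' floats of dynamic shared memory with 1.0 and sums them.
__global__ void sumOnes(float *out, int count) {
  extern __shared__ float buf[];
  for (int i = threadIdx.x; i < count; i += blockDim.x) buf[i] = 1.0f;
  __syncthreads();
  if (threadIdx.x == 0) {
    float s = 0.0f;
    for (int i = 0; i < count; ++i) s += buf[i];
    *out = s;
  }
}

// Hypothetical host helper: launches sumOnes with 'bytes' of dynamic shared
// memory and returns the sum, or -1 if the request cannot be satisfied.
float runWithOptInSharedMemory(int bytes) {
  int dev = 0, maxOptIn = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&maxOptIn, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
  if (bytes > maxOptIn) return -1.0f;

  // Sizes above the default per-block limit must be requested explicitly
  // before the launch.
  cudaFuncSetAttribute(sumOnes, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);

  float *dOut = nullptr;
  cudaMalloc(&dOut, sizeof(float));
  int count = bytes / static_cast<int>(sizeof(float));
  sumOnes<<<1, 256, bytes>>>(dOut, count);

  float h = -1.0f;
  if (cudaMemcpy(&h, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
    h = -1.0f;
  cudaFree(dOut);
  return h;
}
```

Requesting 64 KB, for instance `runWithOptInSharedMemory(64 * 1024)`, exceeds the 48 KB default, so the launch only succeeds on devices whose opt-in limit is large enough.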
jacobiCudaGraphs
Demonstrates updating an instantiated CUDA Graph with the Jacobi iterative method, using both the cudaGraphExecKernelNodeSetParams() and the cudaGraphExecUpdate() approaches.
memMapIPCDrv
This very basic CUDA Driver API sample demonstrates Inter Process Communication using cuMemMap APIs with one process per GPU for computation. Requires Compute Capability 3.0 or higher and a Linux or Windows operating system.
newdelete
This sample demonstrates dynamic global memory allocation through device C++ new and delete operators and virtual function declarations available with CUDA 4.0.
ptxjit
This sample uses the Driver API to just-in-time compile (JIT) a Kernel from PTX code. Additionally, this sample demonstrates the seamless interoperability capability of the CUDA Runtime and CUDA Driver API calls. For CUDA 5.5, this sample shows how to use cuLink* functions to link PTX assembly using the CUDA driver at runtime.
simpleCudaGraphs
A demonstration of CUDA Graphs creation, instantiation and launch using Graphs APIs and Stream Capture APIs.
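A minimal sketch of the stream capture path, under the assumption of CUDA 11.4 or newer (for `cudaGraphInstantiateWithFlags`): kernel launches issued to a stream are recorded into a graph, which is then instantiated once and launched several times. `addOne` and `runCapturedGraph` are hypothetical names, not taken from the sample, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(int *x) { *x += 1; }

// Hypothetical host helper: captures 'kernelsPerGraph' launches of addOne
// into a graph, launches the graph 'launches' times, and returns the counter.
int runCapturedGraph(int kernelsPerGraph, int launches) {
  int *d = nullptr;
  cudaMalloc(&d, sizeof(int));
  cudaMemset(d, 0, sizeof(int));
  cudaDeviceSynchronize();  // finish setup before capture begins

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaGraphExec_t exec;

  // Work issued to 'stream' between Begin and End is recorded, not executed.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int i = 0; i < kernelsPerGraph; ++i) addOne<<<1, 1, 0, stream>>>(d);
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiateWithFlags(&exec, graph, 0);
  for (int i = 0; i < launches; ++i) cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  int h = 0;
  cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(d);
  return h;
}
```

Because the kernels are captured on one stream, they become a linear chain of graph nodes, and successive graph launches on the same stream run in order, so the counter ends at `kernelsPerGraph * launches`.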
StreamPriorities
This sample demonstrates basic use of stream priorities.
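As a small sketch of the runtime calls involved (not the sample's own code), the helper below queries the device's priority range and creates one stream at each end of it. `PriorityInfo` and `makePriorityStreams` are hypothetical names introduced for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical result type for this sketch.
struct PriorityInfo {
  int least;         // numerically largest value, lowest priority
  int greatest;      // numerically smallest value, highest priority
  int highPriority;  // priority reported back for the high-priority stream
  bool created;      // whether both streams were created successfully
};

PriorityInfo makePriorityStreams() {
  PriorityInfo info{0, 0, 0, false};
  cudaDeviceGetStreamPriorityRange(&info.least, &info.greatest);

  cudaStream_t low, high;
  cudaError_t e1 = cudaStreamCreateWithPriority(&low, cudaStreamNonBlocking, info.least);
  cudaError_t e2 = cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, info.greatest);
  info.created = (e1 == cudaSuccess && e2 == cudaSuccess);

  if (e2 == cudaSuccess) {
    cudaStreamGetPriority(high, &info.highPriority);
    cudaStreamDestroy(high);
  }
  if (e1 == cudaSuccess) cudaStreamDestroy(low);
  return info;
}
```

Lower numbers mean higher priority, so `greatest` is never larger than `least`. Kernels launched into the high-priority stream are preferred by the scheduler when both streams have pending work.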
tf32TensorCoreGemm
A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API, introduced with CUDA 11 for the tensor cores of the Ampere chip family to speed up matrix operations. This sample also uses the async copy provided by the CUDA pipeline interface for global memory (gmem) to shared memory (shmem) async loads, which improves kernel performance and reduces register pressure.
warpAggregatedAtomicsCG
This sample demonstrates how to use Cooperative Groups (CG) to perform warp-aggregated atomics on single and multiple counters, a useful technique to improve performance when many threads atomically add to a single counter or to multiple counters.
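The single-counter case can be sketched as a stream-compaction filter: the active threads of a warp elect a leader, the leader performs one `atomicAdd` for the whole group, and each thread derives its own slot from the result. The names `atomicAggInc`, `filterPositive`, and `countPositive` are chosen for this illustration and the code is a simplified sketch, not the sample itself; error checking is omitted.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Warp-aggregated increment: one atomicAdd per group of active threads.
__device__ int atomicAggInc(int *counter) {
  cg::coalesced_group active = cg::coalesced_threads();
  int base = 0;
  if (active.thread_rank() == 0)
    base = atomicAdd(counter, static_cast<int>(active.size()));
  // Broadcast the leader's base offset, then add this thread's rank.
  return active.shfl(base, 0) + static_cast<int>(active.thread_rank());
}

// Writes the positive elements of 'in' to 'out' (in unspecified order).
__global__ void filterPositive(const int *in, int *out, int *count, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && in[i] > 0) out[atomicAggInc(count)] = in[i];
}

// Hypothetical host helper: returns how many elements of h are positive.
int countPositive(const std::vector<int> &h) {
  int n = static_cast<int>(h.size());
  if (n == 0) return 0;

  int *dIn = nullptr, *dOut = nullptr, *dCount = nullptr;
  cudaMalloc(&dIn, n * sizeof(int));
  cudaMalloc(&dOut, n * sizeof(int));
  cudaMalloc(&dCount, sizeof(int));
  cudaMemcpy(dIn, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemset(dCount, 0, sizeof(int));

  int threads = 128;
  filterPositive<<<(n + threads - 1) / threads, threads>>>(dIn, dOut, dCount, n);

  int result = 0;
  cudaMemcpy(&result, dCount, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);
  cudaFree(dCount);
  return result;
}
```

Compared with one `atomicAdd` per thread, this issues at most one atomic per warp per call, which is where the performance benefit described above comes from.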
graphConditionalNodes
Demonstrates the use of CUDA Graph conditional nodes, available starting in CUDA 12.4.