## Changelog

### CUDA 12.5

### CUDA 12.4
* Added `graphConditionalNodes` sample

### CUDA 12.3
* Added cuDLA samples
* Fixed jitLto regression

### CUDA 12.2
* Updated libNVVM samples
* Fixed `jitLto` case issues
* Enabled the `HOST_COMPILER` flag in the makefiles for GCC; this configuration is untested but may still work.

### CUDA 12.1
* Added new sample for Large Kernels

### CUDA 12.0
* Added new flags for JIT compiling
* Removed deprecated APIs in Hopper Architecture

### CUDA 11.6
* Added new folder structure for samples
* Added support for Visual Studio 2022 to all samples supported on [Windows](./README.md#windows-1).
* All CUDA samples are now only available on [GitHub](https://github.com/nvidia/cuda-samples). They are no longer available via CUDA toolkit.

### CUDA 11.5
* Added `cuDLAHybridMode`. Demonstrates usage of cuDLA in hybrid mode.
* Added `cuDLAStandaloneMode`. Demonstrates usage of cuDLA in standalone mode.
* Added `cuDLAErrorReporting`. Demonstrates DLA error detection via CUDA.
* Added `graphMemoryNodes`. Demonstrates memory allocations and frees within CUDA graphs using Graph APIs and Stream Capture APIs.
* Added `graphMemoryFootprint`. Demonstrates how graph memory nodes re-use virtual addresses and physical memory.
* All samples from CUDA toolkit are now available on [GitHub](https://github.com/nvidia/cuda-samples).

### CUDA 11.4 update 1
* Added support for VS Code on Linux.

### CUDA 11.4
* Added `cdpQuadtree`. Demonstrates a quadtree implementation using CUDA Dynamic Parallelism.
* Updated `simpleVulkan`, `simpleVulkanMMAP` and `vulkanImageCUDA`. Demonstrates use of SPIR-V shaders.

### CUDA 11.3
*  Added `streamOrderedAllocationIPC`. Demonstrates Inter Process Communication using one process per GPU for computation.
*  Added `simpleCUBLAS_LU`. Demonstrates batched matrix LU decomposition using cuBLAS API `cublas<t>getrfBatched()`.
*  Updated `simpleVulkan`. Demonstrates use of timeline semaphore.
*  Updated multiple samples to use pinned memory using `cudaMallocHost()`.

### CUDA 11.2
*  Added `streamOrderedAllocation`. Demonstrates stream-ordered memory allocation on a GPU using the `cudaMallocAsync` and `cudaMemPool` family of APIs.
*  Added `streamOrderedAllocationP2P`. Demonstrates peer-to-peer access of stream-ordered memory allocated using the `cudaMallocAsync` and `cudaMemPool` family of APIs.
*  Dropped Visual Studio 2015 support from all samples supported on Windows.
*  FreeImage is no longer distributed with the CUDA Samples. On Windows, see the [Dependencies](./README.md#freeimage) section for more details on how to set up FreeImage. On Linux, it is recommended to install FreeImage with your distribution's package manager.
*  All samples using CUDA Pipeline and arrive-wait barriers have been updated to use the new `cuda::pipeline` and `cuda::barrier` interfaces.
*  Updated all samples to build with the `nvcc` parallel build option `--threads`.
*  Added `cudaNvSciNvMedia`. Demonstrates CUDA-NvMedia interop via NvSciBuf/NvSciSync APIs.
*  Added `simpleGL`. Demonstrates interoperability between CUDA and OpenGL.

### CUDA 11.1
*  Added `watershedSegmentationNPP`. Demonstrates how to use the NPP watershed segmentation function.
*  Added `batchedLabelMarkersAndLabelCompressionNPP`. Demonstrates how to use the NPP label markers generation and label compression functions based on a Union Find (UF) algorithm including both single image and batched image versions.
*  Dropped Visual Studio 2012 and 2013 support from all samples supported on Windows.
*  Added a kernel to `warpAggregatedAtomicsCG` that performs warp-aggregated atomic max in multiple buckets using `cg::labeled_partition` and `cg::reduce`.
*  Added extended CG shuffle mechanics to `shfl_scan` sample.
*  Added `cudaOpenMP`. Demonstrates how to use OpenMP API to write an application for multiple GPUs.
*  Added `simpleZeroCopy`. Demonstrates how to use zero copy, where kernels read and write directly to pinned system memory.

### CUDA 11.0
*  Added `dmmaTensorCoreGemm`. Demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores.
*  Added `bf16TensorCoreGemm`. Demonstrates `__nv_bfloat16` (e8m7) GEMM computation using the `__nv_bfloat16` WMMA API introduced with CUDA 11 in Ampere chip family tensor cores.
*  Added `tf32TensorCoreGemm`. Demonstrates tf32 (e8m10) GEMM computation using the tf32 WMMA API introduced with CUDA 11 in Ampere chip family tensor cores.
*  Added `globalToShmemAsyncCopy`. Demonstrates async copy of data from global to shared memory when on compute capability 8.0 or higher. Also demonstrates arrive-wait barrier for synchronization.
*  Added `simpleAWBarrier`. Demonstrates arrive-wait barriers.
*  Added `simpleAttributes`. Demonstrates the stream attributes that affect L2 locality.
*  Added a warp-aggregated atomic multi-bucket increments kernel using `labeled_partition` cooperative groups to `warpAggregatedAtomicsCG`. It can be used on GPU architectures with compute capability 7.0 and above.
*  Added `binaryPartitionCG`. Demonstrates binary partition cooperative groups and reduction within the thread block.
*  Added two new reduction kernels in `reduction`: one demonstrates the `reduce_add_sync` intrinsic supported on compute capability 8.0, and the other uses the `cooperative_groups::reduce` function, introduced in CUDA 11.0, to perform a `thread_block_tile`-level reduction.
*  Added `cudaCompressibleMemory`. Demonstrates compressible memory allocation using cuMemMap API.
*  Added `simpleVulkanMMAP`. Demonstrates Vulkan CUDA Interop via cuMemMap APIs.
*  Added `concurrentKernels`. Demonstrates the use of CUDA streams for concurrent execution of several kernels on a GPU.
*  Dropped Mac OSX support from all samples.

### CUDA 10.2
*  Added `simpleD3D11`. Demonstrates CUDA-D3D11 External Resource Interoperability APIs for updating D3D11 buffers from CUDA and synchronization between D3D11 and CUDA with Keyed Mutexes.
*  Added `simpleDrvRuntime`. Demonstrates CUDA Driver and Runtime APIs working together to load fatbinary of a CUDA kernel.
*  Added `vectorAddMMAP`. Demonstrates how cuMemMap API allows the user to specify the physical properties of their memory while retaining the contiguous nature of their access.
*  Added `memMapIPCDrv`. Demonstrates Inter Process Communication using cuMemMap APIs.
*  Added `cudaNvSci`. Demonstrates CUDA-NvSciBuf/NvSciSync Interop.
*  Added `jacobiCudaGraphs`. Demonstrates Instantiated CUDA Graph Update with Jacobi Iterative Method using different approaches.
*  Added `cuSolverSp_LinearSolver`. Demonstrates cuSolverSP's LU, QR and Cholesky factorization.
*  Added `MersenneTwisterGP11213`. Demonstrates the Mersenne Twister random number generator GP11213 in cuRAND.

### CUDA 10.1 Update 2
*  Added `vulkanImageCUDA`. Demonstrates how to perform Vulkan image - CUDA Interop.
*  Added `nvJPEG_encoder`. Demonstrates encoding of JPEG images using the nvJPEG library.
*  Added Windows OS support to `nvJPEG` sample.
*  Added `boxFilterNPP`. Demonstrates how to use NPP FilterBox function to perform a box filter.
*  Added `cannyEdgeDetectorNPP`. Demonstrates the nppiFilterCannyBorder_8u_C1R Canny Edge Detection image filter function.

### CUDA 10.1 Update 1
*  Added `NV12toBGRandResize`. Demonstrates how to convert and resize NV12 frames to BGR planar frames using CUDA in batch.
*  Added `EGLStream_CUDA_Interop`. Demonstrates data exchange between CUDA and EGL Streams.
*  Added `cuSolverDn_LinearSolver`. Demonstrates cuSolverDN's LU, QR and Cholesky factorization.
*  Added support for Visual Studio 2019 to all samples supported on [Windows](./README.md#windows-1).

### CUDA 10.1
*  Added `immaTensorCoreGemm`. Demonstrates integer GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API for integers employing the Tensor Cores.
*  Added `simpleIPC`. Demonstrates Inter Process Communication with one process per GPU for computation.
*  Added `nvJPEG`. Demonstrates single and batched decoding of JPEG images using the nvJPEG library.
*  Added `bandwidthTest`. Measures the memcpy bandwidth of the GPU and the memcpy bandwidth across PCIe.
*  Added `reduction`. Demonstrates several important optimization strategies for data-parallel algorithms such as reduction.
*  Updated all samples to support CUDA 10.1.


### CUDA 10.0
*  Added `simpleCudaGraphs`. Demonstrates CUDA Graphs creation, instantiation and launch using Graphs APIs and Stream Capture APIs.
*  Added `conjugateGradientCudaGraphs`. Demonstrates conjugate gradient solver on GPU using CUBLAS and CUSPARSE library calls captured and called using CUDA Graph APIs.
*  Added `simpleVulkan`. Demonstrates Vulkan - CUDA Interop.
*  Added `simpleD3D12`. Demonstrates DX12 - CUDA Interop.
*  Added `UnifiedMemoryPerf`. Demonstrates a performance comparison of the various memory types available in the system.
*  Added `p2pBandwidthLatencyTest`. Demonstrates Peer-To-Peer (P2P) data transfers between pairs of GPUs and computes latency and bandwidth.
*  Added `systemWideAtomics`. Demonstrates system wide atomic instructions.
*  Added `simpleCUBLASXT`. Demonstrates CUBLAS-XT library which performs GEMM operations over multiple GPUs.
*  Added Windows OS support to `conjugateGradientMultiDeviceCG` sample.
*  Removed support of Visual Studio 2010 from all samples.


### CUDA 9.2

This is the first release of CUDA Samples on GitHub:
*  Added `vectorAdd_nvrtc`. Demonstrates runtime compilation of a simple vectorAdd kernel using the NVRTC library.
*  Added `warpAggregatedAtomicsCG`. Demonstrates warp aggregated atomics using Cooperative Groups.
*  Added `deviceQuery`. Enumerates the properties of the CUDA devices present in the system.
*  Added `matrixMul`. Demonstrates a matrix multiplication using shared memory through tiled approach.
*  Added `matrixMulDrv`. Demonstrates a matrix multiplication using shared memory through tiled approach, uses CUDA Driver API.
*  Added `cudaTensorCoreGemm`. Demonstrates a GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 9, as well as the new Tensor Cores introduced in the Volta chip family.
*  Added `simpleVoteIntrinsics`, which uses the `*_sync` equivalents of the vote intrinsics `_any` and `_all`, added in CUDA 9.0.
*  Added `shfl_scan`, which uses the `*_sync` equivalents of the shfl intrinsics, added in CUDA 9.0.
*  Added `conjugateGradientMultiBlockCG`. Demonstrates a conjugate gradient solver on GPU using Multi Block Cooperative Groups.
*  Added `conjugateGradientMultiDeviceCG`. Demonstrates a conjugate gradient solver on multiple GPUs using Multi Device Cooperative Groups, also uses unified memory prefetching and usage hints APIs.
*  Added `simpleCUBLAS`. Demonstrates how to perform GEMM operations using the CUBLAS library.
*  Added `simpleCUFFT`. Demonstrates how to perform FFT operations using the CUFFT library.