Merge branch 'master' into cuda_a_dev

Rob Armstrong 2025-04-09 08:33:37 -07:00
commit 278f4adbd2
24 changed files with 389 additions and 1279 deletions


@@ -2,6 +2,12 @@
 ### CUDA 12.9
 * Updated toolchain for cross-compilation for Tegra Linux platforms.
+* Repository has been updated with consistent code formatting across all samples
+* Many small code tweaks and bug fixes (see commit history for details)
+* Removed the following outdated samples:
+  * `1_Utilities`
+    * `bandwidthTest` - this sample was out of date and did not produce accurate results. For bandwidth
+      testing of NVIDIA GPU platforms, please refer to [NVBandwidth](https://github.com/NVIDIA/nvbandwidth)
 ### CUDA 12.8
 * Updated build system across the repository to CMake. Removed Visual Studio project files and Makefiles.

CONTRIBUTING.md (new file, 103 lines)

@@ -0,0 +1,103 @@
# Contributing to the CUDA Samples
Thank you for your interest in contributing to the CUDA Samples!
## Getting Started
1. **Fork & Clone the Repository**:
Fork the repository and clone your fork. For more information, check [GitHub's documentation on forking](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo) and [cloning a repository](https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository).
## Making Changes
1. **Create a New Branch**:
```bash
git checkout -b your-feature-branch
```
2. **Make Changes**.
3. **Build and Test**:
Ensure changes don't break existing functionality by building and running tests.
For more details on building and testing, refer to the [Building and Testing](#building-and-testing) section below.
4. **Commit Changes**:
```bash
git commit -m "Brief description of the change"
```
## Building and Testing
For information on building and running tests on the samples, please refer to the main [README](README.md).
## Creating a Pull Request
1. Push changes to your fork, as shown in the example below.
2. Create a pull request targeting the `master` branch of the original CUDA Samples repository. Refer to [GitHub's documentation](https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) for more information on creating a pull request.
3. Describe the purpose and context of the changes in the pull request description.
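For example, pushing the feature branch created earlier to your fork:
```bash
git push -u origin your-feature-branch
```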
## Code Formatting (pre-commit hooks)
The CUDA Samples repository uses [pre-commit](https://pre-commit.com/) to execute all code linters and formatters. These
tools ensure a consistent coding style throughout the project. Using pre-commit ensures that linter
versions and options are aligned for all developers. Additionally, there is a CI check in place to
enforce that committed code follows our standards.
The linters used by the CUDA Samples are listed in `.pre-commit-config.yaml`.
For example, C++ and CUDA code is formatted with [`clang-format`](https://clang.llvm.org/docs/ClangFormat.html).
To use `pre-commit`, install it via `conda`:
```bash
conda config --add channels conda-forge
conda install pre-commit
```
or via `pip`:
```bash
pip install pre-commit
```
Then run pre-commit hooks before committing code:
```bash
pre-commit run
```
By default, pre-commit runs on staged files (only changes and additions that will be committed).
To run pre-commit checks on all files, execute:
```bash
pre-commit run --all-files
```
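You can also run a single hook by its id. For instance, assuming the hook id for C++/CUDA formatting matches the tool name `clang-format` in `.pre-commit-config.yaml`:
```bash
pre-commit run clang-format --all-files
```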
Optionally, you may set up the pre-commit hooks to run automatically when you make a git commit. This can be done by running:
```bash
pre-commit install
```
Now code linters and formatters will be run each time you commit changes.
You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, although please note
that this may result in pull requests being rejected if subsequent checks fail.
## Review Process
Once submitted, maintainers will be automatically assigned to review the pull request. They might suggest changes or improvements. Constructive feedback is a part of the collaborative process, aimed at ensuring the highest quality code.
For constructive feedback and effective communication during reviews, we recommend following [Conventional Comments](https://conventionalcomments.org/).
Further recommended reading for successful PR reviews:
- [How to Do Code Reviews Like a Human (Part One)](https://mtlynch.io/human-code-reviews-1/)
- [How to Do Code Reviews Like a Human (Part Two)](https://mtlynch.io/human-code-reviews-2/)
## Thank You
Your contributions enhance the CUDA Samples for the entire community. We appreciate your effort and collaboration!


@@ -149,11 +149,13 @@ This Python3 script finds all executables in a subdirectory you choose, matching
 the following command line arguments:

 | Switch     | Purpose                                                                                                         | Example                 |
-| -------- | -------------------------------------------------------------------------------------------------------------- | ----------------------- |
+| ---------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
 | --dir      | Specify the root directory to search for executables (recursively)                                             | --dir ./build/Samples   |
 | --config   | JSON configuration file for executable arguments                                                                | --config test_args.json |
 | --output   | Output directory for test results (stdout saved to .txt files - directory will be created if it doesn't exist) | --output ./test         |
 | --args     | Global arguments to pass to all executables (not currently used)                                                | --args arg_1 arg_2 ...  |
+| --parallel | Number of applications to execute in parallel.                                                                  | --parallel 8            |

 Application configurations are loaded from `test_args.json` and matched against executable names (discarding the `.exe` extension on Windows).
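As an illustration of that configuration format, a minimal `test_args.json` might look like the following. The sample names and argument values here are hypothetical; the `skip`, `args`, and `runs` keys are the ones the script parses, as seen in the script changes later in this commit:
```json
{
    "deviceQuery": { "args": [] },
    "bicubicTexture": {
        "runs": [
            { "args": ["-mode=0"] },
            { "args": ["-mode=1"], "skip": true }
        ]
    },
    "postProcessGL": { "skip": true }
}
```
A typical invocation combining the switches above (assuming the script is saved as `run_tests.py`) would then be:
```bash
python3 run_tests.py --dir ./build/Samples --config test_args.json --output ./test --parallel 8
```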
@@ -281,18 +283,18 @@ and system configuration):
 ```
 Test Summary:
-Ran 181 tests
-All tests passed!
+Ran 199 test runs for 180 executables.
+All test runs passed!
 ```

 If some samples fail, you will see something like this:

 ```
 Test Summary:
-Ran 181 tests
-Failed tests (2):
-  volumeFiltering: returned 1
-  postProcessGL: returned 1
+Ran 199 test runs for 180 executables.
+Failed runs (2):
+  bicubicTexture (run 1/5): Failed (code 1)
+  Mandelbrot (run 1/2): Failed (code 1)
 ```

 You can inspect the stdout logs in the output directory (generally `APM_<application_name>.txt` or `APM_<application_name>.run<n>.txt`) to help


@@ -99,8 +99,21 @@ static void childProcess(int id)
     std::vector<void *> ptrs;
     std::vector<cudaEvent_t> events;
     std::vector<char> verification_buffer(DATA_SIZE);
+    char pidString[20] = {0};
+    char lshmName[40] = {0};
+
+    // Use parent process ID to create a unique shared memory name for Linux multi-process
+#ifdef __linux__
+    pid_t pid;
+    pid = getppid();
+    snprintf(pidString, sizeof(pidString), "%d", pid);
+#endif
+    strcat(lshmName, shmName);
+    strcat(lshmName, pidString);
+    printf("CP: lshmName = %s\n", lshmName);

-    if (sharedMemoryOpen(shmName, sizeof(shmStruct), &info) != 0) {
+    if (sharedMemoryOpen(lshmName, sizeof(shmStruct), &info) != 0) {
         printf("Failed to create shared memory slab\n");
         exit(EXIT_FAILURE);
     }

@@ -195,10 +208,23 @@ static void parentProcess(char *app)
     std::vector<void *> ptrs;
     std::vector<cudaEvent_t> events;
     std::vector<Process> processes;
+    char pidString[20] = {0};
+    char lshmName[40] = {0};
+
+    // Use current process ID to create a unique shared memory name for Linux multi-process
+#ifdef __linux__
+    pid_t pid;
+    pid = getpid();
+    snprintf(pidString, sizeof(pidString), "%d", pid);
+#endif
+    strcat(lshmName, shmName);
+    strcat(lshmName, pidString);
+    printf("PP: lshmName = %s\n", lshmName);

     checkCudaErrors(cudaGetDeviceCount(&devCount));

-    if (sharedMemoryCreate(shmName, sizeof(*shm), &info) != 0) {
+    if (sharedMemoryCreate(lshmName, sizeof(*shm), &info) != 0) {
         printf("Failed to create shared memory slab\n");
         exit(EXIT_FAILURE);
     }
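The same PID-suffix naming scheme recurs in the other IPC samples touched by this commit. As a minimal standalone sketch of the technique (POSIX-only; the base name below is hypothetical), note that the two `strcat` calls are equivalent to a single bounded `snprintf`:
```cpp
#include <cstdio>   // snprintf, printf
#include <unistd.h> // getpid

int main()
{
    const char *shmName      = "sampleShm"; // hypothetical base name
    char        lshmName[40] = {0};

    // Append the process ID to the base name; snprintf bounds the write,
    // so the 40-byte buffer cannot overflow.
    snprintf(lshmName, sizeof(lshmName), "%s%d", shmName, (int)getpid());
    printf("lshmName = %s\n", lshmName);
    return 0;
}
```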


@@ -1,4 +1,3 @@
-add_subdirectory(bandwidthTest)
 add_subdirectory(deviceQuery)
 add_subdirectory(deviceQueryDrv)
 add_subdirectory(topologyQuery)


@@ -1,18 +0,0 @@
{
"configurations": [
{
"name": "Linux",
"includePath": [
"${workspaceFolder}/**",
"${workspaceFolder}/../../../Common"
],
"defines": [],
"compilerPath": "/usr/local/cuda/bin/nvcc",
"cStandard": "gnu17",
"cppStandard": "gnu++14",
"intelliSenseMode": "linux-gcc-x64",
"configurationProvider": "ms-vscode.makefile-tools"
}
],
"version": 4
}


@@ -1,7 +0,0 @@
{
"recommendations": [
"nvidia.nsight-vscode-edition",
"ms-vscode.cpptools",
"ms-vscode.makefile-tools"
]
}


@@ -1,10 +0,0 @@
{
"configurations": [
{
"name": "CUDA C++: Launch",
"type": "cuda-gdb",
"request": "launch",
"program": "${workspaceFolder}/bandwidthTest"
}
]
}


@@ -1,15 +0,0 @@
{
"version": "2.0.0",
"tasks": [
{
"label": "sample",
"type": "shell",
"command": "make dbg=1",
"problemMatcher": ["$nvcc"],
"group": {
"kind": "build",
"isDefault": true
}
}
]
}


@@ -1,30 +0,0 @@
cmake_minimum_required(VERSION 3.20)
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")
project(bandwidthTest LANGUAGES C CXX CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 101 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (may significantly affect performance on some targets)
else()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo") # add line information to all builds for debug tools (exclusive to -G option)
endif()
# Include directories and libraries
include_directories(../../../Common)
# Source file
# Add target for bandwidthTest
add_executable(bandwidthTest bandwidthTest.cu)
target_compile_options(bandwidthTest PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
target_compile_features(bandwidthTest PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(bandwidthTest PROPERTIES CUDA_SEPARABLE_COMPILATION ON)


@@ -1,32 +0,0 @@
# bandwidthTest - Bandwidth Test
## Description
This is a simple test program to measure the memcopy bandwidth of the GPU and memcpy bandwidth across PCI-e. This test application is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory.
## Key Concepts
CUDA Streams and Events, Performance Strategies
## Supported SM Architectures
[SM 5.0 ](https://developer.nvidia.com/cuda-gpus) [SM 5.2 ](https://developer.nvidia.com/cuda-gpus) [SM 5.3 ](https://developer.nvidia.com/cuda-gpus) [SM 6.0 ](https://developer.nvidia.com/cuda-gpus) [SM 6.1 ](https://developer.nvidia.com/cuda-gpus) [SM 7.0 ](https://developer.nvidia.com/cuda-gpus) [SM 7.2 ](https://developer.nvidia.com/cuda-gpus) [SM 7.5 ](https://developer.nvidia.com/cuda-gpus) [SM 8.0 ](https://developer.nvidia.com/cuda-gpus) [SM 8.6 ](https://developer.nvidia.com/cuda-gpus) [SM 8.7 ](https://developer.nvidia.com/cuda-gpus) [SM 8.9 ](https://developer.nvidia.com/cuda-gpus) [SM 9.0 ](https://developer.nvidia.com/cuda-gpus)
## Supported OSes
Linux, Windows
## Supported CPU Architecture
x86_64, armv7l
## CUDA APIs involved
### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaHostAlloc, cudaMemcpy, cudaMalloc, cudaMemcpyAsync, cudaFree, cudaGetErrorString, cudaMallocHost, cudaSetDevice, cudaGetDeviceProperties, cudaDeviceSynchronize, cudaEventRecord, cudaFreeHost, cudaEventDestroy, cudaEventElapsedTime, cudaGetDeviceCount, cudaEventCreate
## Prerequisites
Download and install the [CUDA Toolkit 12.5](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
## References (for more details)

(File diff suppressed because it is too large.)


@@ -102,13 +102,23 @@ static void childProcess(int id)
     int threads = 128;
     cudaDeviceProp prop;
     std::vector<void *> ptrs;
+    pid_t pid;
+    char pidString[20] = {0};
+    char lshmName[40] = {0};
     std::vector<char> verification_buffer(DATA_SIZE);
+
+    pid = getppid();
+    snprintf(pidString, sizeof(pidString), "%d", pid);
+    strcat(lshmName, shmName);
+    strcat(lshmName, pidString);
+    printf("CP: lshmName = %s\n", lshmName);

     ipcHandle *ipcChildHandle = NULL;
     checkIpcErrors(ipcOpenSocket(ipcChildHandle));

-    if (sharedMemoryOpen(shmName, sizeof(shmStruct), &info) != 0) {
+    if (sharedMemoryOpen(lshmName, sizeof(shmStruct), &info) != 0) {
         printf("Failed to create shared memory slab\n");
         exit(EXIT_FAILURE);
     }

@@ -245,6 +255,16 @@ static void parentProcess(char *app)
     std::vector<void *> ptrs;
     std::vector<Process> processes;
     cudaMemAllocationHandleType handleType = cudaMemHandleTypeNone;
+    pid_t pid;
+    char pidString[20] = {0};
+    char lshmName[40] = {0};
+
+    pid = getpid();
+    snprintf(pidString, sizeof(pidString), "%d", pid);
+    strcat(lshmName, shmName);
+    strcat(lshmName, pidString);
+    printf("PP: lshmName = %s\n", lshmName);

     checkCudaErrors(cudaGetDeviceCount(&devCount));
     std::vector<CUdevice> devices(devCount);

@@ -252,7 +272,7 @@ static void parentProcess(char *app)
         cuDeviceGet(&devices[i], i);
     }

-    if (sharedMemoryCreate(shmName, sizeof(*shm), &info) != 0) {
+    if (sharedMemoryCreate(lshmName, sizeof(*shm), &info) != 0) {
         printf("Failed to create shared memory slab\n");
         exit(EXIT_FAILURE);
     }


@@ -310,10 +310,24 @@ static void childProcess(int devId, int id, char **argv)
     ipcHandle *ipcChildHandle = NULL;
     int blocks = 0;
     int threads = 128;
+    char pidString[20] = {0};
+    char lshmName[40] = {0};
+
+    // Use parent process ID to create a unique shared memory name for Linux multi-process
+#ifdef __linux__
+    pid_t pid;
+    pid = getppid();
+    snprintf(pidString, sizeof(pidString), "%d", pid);
+#endif
+    strcat(lshmName, shmName);
+    strcat(lshmName, pidString);
+    printf("CP: lshmName = %s\n", lshmName);

     checkIpcErrors(ipcOpenSocket(ipcChildHandle));

-    if (sharedMemoryOpen(shmName, sizeof(shmStruct), &info) != 0) {
+    if (sharedMemoryOpen(lshmName, sizeof(shmStruct), &info) != 0) {
         printf("Failed to create shared memory slab\n");
         exit(EXIT_FAILURE);
     }

@@ -421,11 +435,24 @@ static void parentProcess(char *app)
     volatile shmStruct *shm = NULL;
     sharedMemoryInfo info;
     std::vector<Process> processes;
+    char pidString[20] = {0};
+    char lshmName[40] = {0};
+
+    // Use current process ID to create a unique shared memory name for Linux multi-process
+#ifdef __linux__
+    pid_t pid;
+    pid = getpid();
+    snprintf(pidString, sizeof(pidString), "%d", pid);
+#endif
+    strcat(lshmName, shmName);
+    strcat(lshmName, pidString);
+    printf("PP: lshmName = %s\n", lshmName);

     checkCudaErrors(cuDeviceGetCount(&devCount));
     std::vector<CUdevice> devices(devCount);

-    if (sharedMemoryCreate(shmName, sizeof(*shm), &info) != 0) {
+    if (sharedMemoryCreate(lshmName, sizeof(*shm), &info) != 0) {
         printf("Failed to create shared memory slab\n");
         exit(EXIT_FAILURE);
     }


@@ -25,8 +25,8 @@
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */

-#include <cuda_runtime.h>
 #include <cuda.h>
+#include <cuda_runtime.h>
 #include <helper_cuda.h>
 #include <helper_image.h>
 #include <vector>


@@ -45,13 +45,15 @@
 #include <windows.h>
 #endif

+// includes for OpenGL
+#include <helper_gl.h>
+
 // includes
 #include <cuda_gl_interop.h>
 #include <cuda_runtime.h>
 #include <cufft.h>
 #include <helper_cuda.h>
 #include <helper_functions.h>
-#include <helper_gl.h>
 #include <math.h>
 #include <math_constants.h>
 #include <stdio.h>


@@ -86,12 +86,14 @@
 #include <windows.h>
 #endif

+// includes for OpenGL
+#include <helper_gl.h>
+
 // includes
 #include <cuda_gl_interop.h>
 #include <cuda_runtime.h>
 #include <helper_cuda.h> // includes cuda.h and cuda_runtime_api.h
 #include <helper_functions.h>
-#include <helper_gl.h>
 #include <math.h>
 #include <stdio.h>
 #include <stdlib.h>


@@ -28,11 +28,15 @@
 #include "render_particles.h"

 #define HELPERGL_EXTERN_GL_FUNC_IMPLEMENTATION
+// includes for OpenGL
+#include <helper_gl.h>
+
+// includes
 #include <assert.h>
 #include <cuda_gl_interop.h>
 #include <cuda_runtime.h>
 #include <helper_cuda.h>
-#include <helper_gl.h>
 #include <math.h>

 #define GL_POINT_SPRITE_ARB 0x8861


@@ -31,9 +31,12 @@
 #pragma warning(disable : 4312)

-#include <mmsystem.h>
+// includes for Windows
 #include <windows.h>
+
+// includes for multimedia
+#include <mmsystem.h>

 // This header includes all the necessary D3D11 and CUDA includes
 #include <cuda_d3d11_interop.h>
 #include <cuda_runtime_api.h>


@@ -31,9 +31,12 @@
 #pragma warning(disable : 4312)

-#include <mmsystem.h>
+// includes for Windows
 #include <windows.h>
+
+// includes for multimedia
+#include <mmsystem.h>

 // This header includes all the necessary D3D11 and CUDA includes
 #include <cuda_d3d11_interop.h>
 #include <cuda_runtime_api.h>


@@ -33,11 +33,15 @@
 #include <memory.h>

 #define HELPERGL_EXTERN_GL_FUNC_IMPLEMENTATION
+// includes for OpenGL
+#include <helper_gl.h>
+
+// includes
 #include <cuda_gl_interop.h>
 #include <cuda_runtime.h>
 #include <helper_cuda.h>
 #include <helper_functions.h>
-#include <helper_gl.h>

 #include "ParticleSystem.cuh"
 #include "ParticleSystem.h"


@@ -29,11 +29,15 @@
     This file contains simple wrapper functions that call the CUDA kernels
 */

 #define HELPERGL_EXTERN_GL_FUNC_IMPLEMENTATION
+// includes for OpenGL
+#include <helper_gl.h>
+
+// includes
 #include <cstdio>
 #include <cstdlib>
 #include <cuda_gl_interop.h>
 #include <helper_cuda.h>
-#include <helper_gl.h>
 #include <string.h>

 #include "ParticleSystem.cuh"


@@ -1,11 +1,33 @@
+# This layer of CMakeLists.txt adds folders, for better organization in Visual Studio
+# and other IDEs that support this feature.
+set_property(GLOBAL PROPERTY USE_FOLDERS ON)
+
+set(CMAKE_FOLDER "0_Introduction")
 add_subdirectory(0_Introduction)
+
+set(CMAKE_FOLDER "1_Utilities")
 add_subdirectory(1_Utilities)
+
+set(CMAKE_FOLDER "2_Concepts_and_Techniques")
 add_subdirectory(2_Concepts_and_Techniques)
+
+set(CMAKE_FOLDER "3_CUDA_Features")
 add_subdirectory(3_CUDA_Features)
+
+set(CMAKE_FOLDER "4_CUDA_Libraries")
 add_subdirectory(4_CUDA_Libraries)
+
+set(CMAKE_FOLDER "5_Domain_Specific")
 add_subdirectory(5_Domain_Specific)
+
+set(CMAKE_FOLDER "6_Performance")
 add_subdirectory(6_Performance)
+
+set(CMAKE_FOLDER "7_libNVVM")
 add_subdirectory(7_libNVVM)

 if(BUILD_TEGRA)
+    set(CMAKE_FOLDER "8_Platform_Specific/Tegra")
     add_subdirectory(8_Platform_Specific/Tegra)
 endif()


@@ -33,6 +33,15 @@ import json
 import subprocess
 import argparse
 from pathlib import Path
+import concurrent.futures
+import threading
+
+print_lock = threading.Lock()
+
+def safe_print(*args, **kwargs):
+    """Thread-safe print function"""
+    with print_lock:
+        print(*args, **kwargs)

 def normalize_exe_name(name):
     """Normalize executable name across platforms by removing .exe if present"""
@@ -78,96 +87,49 @@ def find_executables(root_dir):
     return executables

-def run_test(executable, output_dir, args_config, global_args=None):
-    """Run a single test and capture output"""
+def run_single_test_instance(executable, args, output_file, global_args, run_description):
+    """Run a single instance of a test executable with specific arguments."""
     exe_path = str(executable)
     exe_name = executable.name
-    base_name = normalize_exe_name(exe_name)
-
-    # Check if this executable should be skipped
-    if base_name in args_config and args_config[base_name].get("skip", False):
-        print(f"Skipping {exe_name} (marked as skip in config)")
-        return 0
-
-    # Get argument sets for this executable
-    arg_sets = []
-    if base_name in args_config:
-        config = args_config[base_name]
-        if "args" in config:
-            # Single argument set (backwards compatibility)
-            if isinstance(config["args"], list):
-                arg_sets.append(config["args"])
-            else:
-                print(f"Warning: Arguments for {base_name} must be a list")
-        elif "runs" in config:
-            # Multiple argument sets
-            for run in config["runs"]:
-                if isinstance(run.get("args", []), list):
-                    arg_sets.append(run.get("args", []))
-                else:
-                    print(f"Warning: Arguments for {base_name} run must be a list")
-
-    # If no specific args defined, run once with no args
-    if not arg_sets:
-        arg_sets.append([])
-
-    # Run for each argument set
-    failed = False
-    run_number = 1
-    for args in arg_sets:
-        # Create output file name with run number if multiple runs
-        if len(arg_sets) > 1:
-            output_file = os.path.abspath(f"{output_dir}/APM_{exe_name}.run{run_number}.txt")
-            print(f"Running {exe_name} (run {run_number}/{len(arg_sets)})")
-        else:
-            output_file = os.path.abspath(f"{output_dir}/APM_{exe_name}.txt")
-            print(f"Running {exe_name}")
-
-        try:
-            # Prepare command with arguments
-            cmd = [f"./{exe_name}"]
-            cmd.extend(args)
-
-            # Add global arguments if provided
-            if global_args:
-                cmd.extend(global_args)
-
-            print(f"  Command: {' '.join(cmd)}")
-
-            # Store current directory
-            original_dir = os.getcwd()
-            try:
-                # Change to executable's directory
-                os.chdir(os.path.dirname(exe_path))
-
-                # Run the executable and capture output
-                with open(output_file, 'w') as f:
-                    result = subprocess.run(
-                        cmd,
-                        stdout=f,
-                        stderr=subprocess.STDOUT,
-                        timeout=300  # 5 minute timeout
-                    )
-
-                if result.returncode != 0:
-                    failed = True
-                print(f"  Test completed with return code {result.returncode}")
-            finally:
-                # Always restore original directory
-                os.chdir(original_dir)
-        except subprocess.TimeoutExpired:
-            print(f"Error: {exe_name} timed out after 5 minutes")
-            failed = True
-        except Exception as e:
-            print(f"Error running {exe_name}: {str(e)}")
-            failed = True
-
-        run_number += 1
-
-    return 1 if failed else 0
+
+    safe_print(f"Starting {exe_name} {run_description}")
+
+    try:
+        cmd = [f"./{exe_name}"]
+        cmd.extend(args)
+
+        if global_args:
+            cmd.extend(global_args)
+
+        safe_print(f"  Command ({exe_name} {run_description}): {' '.join(cmd)}")
+
+        # Run the executable in its own directory using cwd
+        with open(output_file, 'w') as f:
+            result = subprocess.run(
+                cmd,
+                stdout=f,
+                stderr=subprocess.STDOUT,
+                timeout=300,  # 5 minute timeout
+                cwd=os.path.dirname(exe_path)  # Execute in the executable's directory
+            )
+
+        status = "Passed" if result.returncode == 0 else "Failed"
+        safe_print(f"  Finished {exe_name} {run_description}: {status} (code {result.returncode})")
+        return {"name": exe_name, "description": run_description, "return_code": result.returncode, "status": status}
+    except subprocess.TimeoutExpired:
+        safe_print(f"Error ({exe_name} {run_description}): Timed out after 5 minutes")
+        return {"name": exe_name, "description": run_description, "return_code": -1, "status": "Timeout"}
+    except Exception as e:
+        safe_print(f"Error running {exe_name} {run_description}: {str(e)}")
+        return {"name": exe_name, "description": run_description, "return_code": -1, "status": f"Error: {str(e)}"}
+
+def run_test(executable, output_dir, args_config, global_args=None):
+    """Deprecated: This function is replaced by the parallel execution logic in main."""
+    # This function is no longer called directly by the main logic.
+    # It remains here temporarily in case it's needed for reference or single-threaded debugging.
+    # The core logic is now in run_single_test_instance and managed by ThreadPoolExecutor.
+    print("Warning: run_test function called directly - this indicates an issue in the refactoring.")
+    return 1  # Indicate failure if called

 def main():
     parser = argparse.ArgumentParser(description='Run all executables and capture output')
@@ -175,6 +137,7 @@ def main():
     parser.add_argument('--config', help='JSON configuration file for executable arguments')
     parser.add_argument('--output', default='.',  # Default to current directory
                         help='Output directory for test results')
+    parser.add_argument('--parallel', type=int, default=1, help='Number of parallel tests to run')
     parser.add_argument('--args', nargs=argparse.REMAINDER,
                         help='Global arguments to pass to all executables')
     args = parser.parse_args()
@@ -192,23 +155,104 @@ def main():
         return 1

     print(f"Found {len(executables)} executables")
+    print(f"Running tests with up to {args.parallel} parallel tasks.")
+
+    tasks = []
+    for exe in executables:
+        exe_name = exe.name
+        base_name = normalize_exe_name(exe_name)
+
+        # Check if this executable should be skipped globally
+        if base_name in args_config and args_config[base_name].get("skip", False):
+            safe_print(f"Skipping {exe_name} (marked as skip in config)")
+            continue
+
+        arg_sets_configs = []
+        if base_name in args_config:
+            config = args_config[base_name]
+            if "args" in config:
+                if isinstance(config["args"], list):
+                    arg_sets_configs.append({"args": config["args"]})  # Wrap in dict for consistency
+                else:
+                    safe_print(f"Warning: Arguments for {base_name} must be a list")
+            elif "runs" in config:
+                for i, run_config in enumerate(config["runs"]):
+                    if run_config.get("skip", False):
+                        safe_print(f"Skipping run {i+1} for {exe_name} (marked as skip in config)")
+                        continue
+                    if isinstance(run_config.get("args", []), list):
+                        arg_sets_configs.append(run_config)
+                    else:
+                        safe_print(f"Warning: Arguments for {base_name} run {i+1} must be a list")
+
+        # If no specific args defined, create one run with no args
+        if not arg_sets_configs:
+            arg_sets_configs.append({"args": []})
+
+        # Create tasks for each run configuration
+        num_runs = len(arg_sets_configs)
+        for i, run_config in enumerate(arg_sets_configs):
+            current_args = run_config.get("args", [])
+            run_desc = f"(run {i+1}/{num_runs})" if num_runs > 1 else ""
+
+            # Create output file name
+            if num_runs > 1:
+                output_file = os.path.abspath(f"{args.output}/APM_{exe_name}.run{i+1}.txt")
+            else:
+                output_file = os.path.abspath(f"{args.output}/APM_{exe_name}.txt")
+
+            tasks.append({
+                "executable": exe,
+                "args": current_args,
+                "output_file": output_file,
+                "global_args": args.args,
+                "description": run_desc
+            })

     failed = []
-    for exe in executables:
-        ret_code = run_test(exe, args.output, args_config, args.args)
-        if ret_code != 0:
-            failed.append((exe.name, ret_code))
+    total_runs = len(tasks)
+    completed_runs = 0
+
+    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as executor:
+        future_to_task = {
+            executor.submit(run_single_test_instance,
+                            task["executable"],
+                            task["args"],
+                            task["output_file"],
+                            task["global_args"],
+                            task["description"]): task
+            for task in tasks
+        }
+
+        for future in concurrent.futures.as_completed(future_to_task):
+            task_info = future_to_task[future]
+            completed_runs += 1
+            safe_print(f"Progress: {completed_runs}/{total_runs} runs completed.")
+            try:
+                result = future.result()
+                if result["return_code"] != 0:
+                    failed.append(result)
+            except Exception as exc:
+                safe_print(f'Task {task_info["executable"].name} {task_info["description"]} generated an exception: {exc}')
+                failed.append({
+                    "name": task_info["executable"].name,
+                    "description": task_info["description"],
+                    "return_code": -1,
+                    "status": f"Execution Exception: {exc}"
+                })

     # Print summary
     print("\nTest Summary:")
-    print(f"Ran {len(executables)} tests")
+    print(f"Ran {total_runs} test runs for {len(executables)} executables.")
     if failed:
-        print(f"Failed tests ({len(failed)}):")
-        for name, code in failed:
-            print(f"  {name}: returned {code}")
-        return failed[0][1]  # Return first failure code
+        print(f"Failed runs ({len(failed)}):")
+        for fail in failed:
+            print(f"  {fail['name']} {fail['description']}: {fail['status']} (code {fail['return_code']})")
+        # Return the return code of the first failure, or 1 if only exceptions occurred
+        first_failure_code = next((f["return_code"] for f in failed if f["return_code"] != -1), 1)
+        return first_failure_code
     else:
-        print("All tests passed!")
+        print("All test runs passed!")
         return 0

 if __name__ == '__main__':
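Distilled to its core, the refactor's concurrency pattern (one future per test run, results collected as they complete) is sketched below. The `run_one` function and the task list are simplified stand-ins for `run_single_test_instance` and the real task dictionaries:
```python
import concurrent.futures

def run_one(name, args):
    """Simplified stand-in for run_single_test_instance: returns a result dict."""
    return {"name": name, "return_code": 0}

# Hypothetical (executable, argument-list) pairs standing in for the task list.
tasks = [("deviceQuery", []), ("bicubicTexture", ["-mode=0"])]

failed = []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(run_one, name, args): name for name, args in tasks}
    for future in concurrent.futures.as_completed(futures):
        result = future.result()  # re-raises any exception from the worker
        if result["return_code"] != 0:
            failed.append(result)

print("All test runs passed!" if not failed else f"Failed runs ({len(failed)})")
```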