Mirror of https://github.com/NVIDIA/cuda-samples.git (synced 2025-08-23 23:35:31 +08:00)

Merge branch 'master' into cuda_a_dev

Commit 278f4adbd2
@@ -2,6 +2,12 @@
### CUDA 12.9

* Updated toolchain for cross-compilation for Tegra Linux platforms.
* Repository has been updated with consistent code formatting across all samples
* Many small code tweaks and bug fixes (see commit history for details)
* Removed the following outdated samples:
  * `1_Utilities`
    * `bandwidthTest` - this sample was out of date and did not produce accurate results. For bandwidth testing of NVIDIA GPU platforms, please refer to [NVBandwidth](https://github.com/NVIDIA/nvbandwidth)

### CUDA 12.8

* Updated build system across the repository to CMake. Removed Visual Studio project files and Makefiles.
CONTRIBUTING.md (new file, 103 lines)

@@ -0,0 +1,103 @@
# Contributing to the CUDA Samples

Thank you for your interest in contributing to the CUDA Samples!

## Getting Started

1. **Fork & Clone the Repository**:

   Fork the repository and clone the fork. For more information, check [GitHub's documentation on forking](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo) and [cloning a repository](https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository).

## Making Changes

1. **Create a New Branch**:

   ```bash
   git checkout -b your-feature-branch
   ```

2. **Make Changes**.

3. **Build and Test**:

   Ensure changes don't break existing functionality by building and running tests.

   For more details on building and testing, refer to the [Building and Testing](#building-and-testing) section below.

4. **Commit Changes**:

   ```bash
   git commit -m "Brief description of the change"
   ```

## Building and Testing

For information on building and running tests on the samples, please refer to the main [README](README.md).

## Creating a Pull Request

1. Push changes to your fork (see the sketch after this list).
2. Create a pull request targeting the `master` branch of the original CUDA Samples repository. Refer to [GitHub's documentation](https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) for more information on creating a pull request.
3. Describe the purpose and context of the changes in the pull request description.
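
A minimal sketch of steps 1 and 2 from the command line, using the branch name from the earlier example; the second command assumes you have the optional [GitHub CLI](https://cli.github.com/) installed and is not required:

```bash
# Push the feature branch created earlier to your fork
git push -u origin your-feature-branch

# Optionally open the pull request from the terminal instead of the web UI
# (requires the GitHub CLI; it will prompt for title and description)
gh pr create --base master
```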

## Code Formatting (pre-commit hooks)

The CUDA Samples repository uses [pre-commit](https://pre-commit.com/) to execute all code linters and formatters. These tools ensure a consistent coding style throughout the project. Using pre-commit ensures that linter versions and options are aligned for all developers. Additionally, there is a CI check in place to enforce that committed code follows our standards.

The linters used by the CUDA Samples are listed in `.pre-commit-config.yaml`. For example, C++ and CUDA code is formatted with [`clang-format`](https://clang.llvm.org/docs/ClangFormat.html).

To use `pre-commit`, install it via `conda` or `pip`:

```bash
conda config --add channels conda-forge
conda install pre-commit
```

```bash
pip install pre-commit
```

Then run the pre-commit hooks before committing code:

```bash
pre-commit run
```

By default, pre-commit runs on staged files (only changes and additions that will be committed). To run pre-commit checks on all files, execute:

```bash
pre-commit run --all-files
```

Optionally, you may set up the pre-commit hooks to run automatically when you make a git commit. This can be done by running:

```bash
pre-commit install
```

Now code linters and formatters will be run each time you commit changes.

You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, although please note that this may result in pull requests being rejected if subsequent checks fail.

## Review Process

Once submitted, maintainers will be automatically assigned to review the pull request. They might suggest changes or improvements. Constructive feedback is a part of the collaborative process, aimed at ensuring the highest quality code.

For constructive feedback and effective communication during reviews, we recommend following [Conventional Comments](https://conventionalcomments.org/).

Further recommended reading for successful PR reviews:

- [How to Do Code Reviews Like a Human (Part One)](https://mtlynch.io/human-code-reviews-1/)
- [How to Do Code Reviews Like a Human (Part Two)](https://mtlynch.io/human-code-reviews-2/)

## Thank You

Your contributions enhance the CUDA Samples for the entire community. We appreciate your effort and collaboration!
README.md (16 changed lines)

@@ -149,11 +149,13 @@ This Python3 script finds all executables in a subdirectory you choose, matching
the following command line arguments:

| Switch     | Purpose                                                                                                          | Example                 |
| ---------- | ---------------------------------------------------------------------------------------------------------------- | ----------------------- |
| --dir      | Specify the root directory to search for executables (recursively)                                               | --dir ./build/Samples   |
| --config   | JSON configuration file for executable arguments                                                                  | --config test_args.json |
| --output   | Output directory for test results (stdout saved to .txt files - directory will be created if it doesn't exist)   | --output ./test         |
| --args     | Global arguments to pass to all executables (not currently used)                                                  | --args arg_1 arg_2 ...  |
| --parallel | Number of applications to execute in parallel.                                                                    | --parallel 8            |
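
For example, a typical invocation combining the switches above (using the values from the Example column) might look like:

```bash
python3 run_tests.py --dir ./build/Samples --config test_args.json --output ./test --parallel 8
```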
Application configurations are loaded from `test_args.json` and matched against executable names (discarding the `.exe` extension on Windows).
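
As a rough sketch of the configuration format, inferred from the keys the script reads (`args` for a single argument list, `runs` for multiple argument sets, and `skip` to exclude a test or an individual run); the sample names below appear elsewhere in this commit, but the arguments shown are hypothetical:

```json
{
    "deviceQuery": { "args": [] },
    "bicubicTexture": {
        "runs": [
            { "args": ["-mode=0"] },
            { "args": ["-mode=1"], "skip": true }
        ]
    },
    "postProcessGL": { "skip": true }
}
```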
@@ -281,18 +283,18 @@ and system configuration):

```
Test Summary:
-Ran 181 tests
-All tests passed!
+Ran 199 test runs for 180 executables.
+All test runs passed!
```

If some samples fail, you will see something like this:

```
Test Summary:
-Ran 181 tests
-Failed tests (2):
-  volumeFiltering: returned 1
-  postProcessGL: returned 1
+Ran 199 test runs for 180 executables.
+Failed runs (2):
+  bicubicTexture (run 1/5): Failed (code 1)
+  Mandelbrot (run 1/2): Failed (code 1)
```

You can inspect the stdout logs in the output directory (generally `APM_<application_name>.txt` or `APM_<application_name>.run<n>.txt`) to help diagnose failures.
@@ -99,8 +99,21 @@ static void childProcess(int id)
    std::vector<void *> ptrs;
    std::vector<cudaEvent_t> events;
    std::vector<char> verification_buffer(DATA_SIZE);
    char pidString[20] = {0};
    char lshmName[40] = {0};

-   if (sharedMemoryOpen(shmName, sizeof(shmStruct), &info) != 0) {
    // Use parent process ID to create a unique shared memory name for Linux multi-process
#ifdef __linux__
    pid_t pid;
    pid = getppid();
    snprintf(pidString, sizeof(pidString), "%d", pid);
#endif
    strcat(lshmName, shmName);
    strcat(lshmName, pidString);

    printf("CP: lshmName = %s\n", lshmName);

+   if (sharedMemoryOpen(lshmName, sizeof(shmStruct), &info) != 0) {
        printf("Failed to create shared memory slab\n");
        exit(EXIT_FAILURE);
    }
@@ -195,10 +208,23 @@ static void parentProcess(char *app)
    std::vector<void *> ptrs;
    std::vector<cudaEvent_t> events;
    std::vector<Process> processes;
    char pidString[20] = {0};
    char lshmName[40] = {0};

    // Use current process ID to create a unique shared memory name for Linux multi-process
#ifdef __linux__
    pid_t pid;
    pid = getpid();
    snprintf(pidString, sizeof(pidString), "%d", pid);
#endif
    strcat(lshmName, shmName);
    strcat(lshmName, pidString);

    printf("PP: lshmName = %s\n", lshmName);

    checkCudaErrors(cudaGetDeviceCount(&devCount));

-   if (sharedMemoryCreate(shmName, sizeof(*shm), &info) != 0) {
+   if (sharedMemoryCreate(lshmName, sizeof(*shm), &info) != 0) {
        printf("Failed to create shared memory slab\n");
        exit(EXIT_FAILURE);
    }
@@ -1,4 +1,3 @@
-add_subdirectory(bandwidthTest)
add_subdirectory(deviceQuery)
add_subdirectory(deviceQueryDrv)
add_subdirectory(topologyQuery)
@@ -1,18 +0,0 @@
{
    "configurations": [
        {
            "name": "Linux",
            "includePath": [
                "${workspaceFolder}/**",
                "${workspaceFolder}/../../../Common"
            ],
            "defines": [],
            "compilerPath": "/usr/local/cuda/bin/nvcc",
            "cStandard": "gnu17",
            "cppStandard": "gnu++14",
            "intelliSenseMode": "linux-gcc-x64",
            "configurationProvider": "ms-vscode.makefile-tools"
        }
    ],
    "version": 4
}
@@ -1,7 +0,0 @@
{
    "recommendations": [
        "nvidia.nsight-vscode-edition",
        "ms-vscode.cpptools",
        "ms-vscode.makefile-tools"
    ]
}
@@ -1,10 +0,0 @@
{
    "configurations": [
        {
            "name": "CUDA C++: Launch",
            "type": "cuda-gdb",
            "request": "launch",
            "program": "${workspaceFolder}/bandwidthTest"
        }
    ]
}
@@ -1,15 +0,0 @@
{
    "version": "2.0.0",
    "tasks": [
        {
            "label": "sample",
            "type": "shell",
            "command": "make dbg=1",
            "problemMatcher": ["$nvcc"],
            "group": {
                "kind": "build",
                "isDefault": true
            }
        }
    ]
}
@@ -1,30 +0,0 @@
cmake_minimum_required(VERSION 3.20)

list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/../../../cmake/Modules")

project(bandwidthTest LANGUAGES C CXX CUDA)

find_package(CUDAToolkit REQUIRED)

set(CMAKE_POSITION_INDEPENDENT_CODE ON)

set(CMAKE_CUDA_ARCHITECTURES 75 80 86 87 89 90 100 101 120)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(ENABLE_CUDA_DEBUG)
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")         # enable cuda-gdb (may significantly affect performance on some targets)
else()
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -lineinfo")  # add line information to all builds for debug tools (exclusive to -G option)
endif()

# Include directories and libraries
include_directories(../../../Common)

# Source file
# Add target for bandwidthTest
add_executable(bandwidthTest bandwidthTest.cu)

target_compile_options(bandwidthTest PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)

target_compile_features(bandwidthTest PRIVATE cxx_std_17 cuda_std_17)

set_target_properties(bandwidthTest PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
@@ -1,32 +0,0 @@
# bandwidthTest - Bandwidth Test

## Description

This is a simple test program to measure the memcopy bandwidth of the GPU and memcpy bandwidth across PCI-e. This test application is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory.

## Key Concepts

CUDA Streams and Events, Performance Strategies

## Supported SM Architectures

[SM 5.0](https://developer.nvidia.com/cuda-gpus) [SM 5.2](https://developer.nvidia.com/cuda-gpus) [SM 5.3](https://developer.nvidia.com/cuda-gpus) [SM 6.0](https://developer.nvidia.com/cuda-gpus) [SM 6.1](https://developer.nvidia.com/cuda-gpus) [SM 7.0](https://developer.nvidia.com/cuda-gpus) [SM 7.2](https://developer.nvidia.com/cuda-gpus) [SM 7.5](https://developer.nvidia.com/cuda-gpus) [SM 8.0](https://developer.nvidia.com/cuda-gpus) [SM 8.6](https://developer.nvidia.com/cuda-gpus) [SM 8.7](https://developer.nvidia.com/cuda-gpus) [SM 8.9](https://developer.nvidia.com/cuda-gpus) [SM 9.0](https://developer.nvidia.com/cuda-gpus)

## Supported OSes

Linux, Windows

## Supported CPU Architecture

x86_64, armv7l

## CUDA APIs involved

### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaHostAlloc, cudaMemcpy, cudaMalloc, cudaMemcpyAsync, cudaFree, cudaGetErrorString, cudaMallocHost, cudaSetDevice, cudaGetDeviceProperties, cudaDeviceSynchronize, cudaEventRecord, cudaFreeHost, cudaEventDestroy, cudaEventElapsedTime, cudaGetDeviceCount, cudaEventCreate

## Prerequisites

Download and install the [CUDA Toolkit 12.5](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.

## References (for more details)
(File diff suppressed because it is too large)
@@ -102,13 +102,23 @@ static void childProcess(int id)
    int threads = 128;
    cudaDeviceProp prop;
    std::vector<void *> ptrs;
    pid_t pid;
    char pidString[20] = {0};
    char lshmName[40] = {0};

    std::vector<char> verification_buffer(DATA_SIZE);

    pid = getppid();
    snprintf(pidString, sizeof(pidString), "%d", pid);
    strcat(lshmName, shmName);
    strcat(lshmName, pidString);

    printf("CP: lshmName = %s\n", lshmName);

    ipcHandle *ipcChildHandle = NULL;
    checkIpcErrors(ipcOpenSocket(ipcChildHandle));

-   if (sharedMemoryOpen(shmName, sizeof(shmStruct), &info) != 0) {
+   if (sharedMemoryOpen(lshmName, sizeof(shmStruct), &info) != 0) {
        printf("Failed to create shared memory slab\n");
        exit(EXIT_FAILURE);
    }
@@ -245,6 +255,16 @@ static void parentProcess(char *app)
    std::vector<void *> ptrs;
    std::vector<Process> processes;
    cudaMemAllocationHandleType handleType = cudaMemHandleTypeNone;
    pid_t pid;
    char pidString[20] = {0};
    char lshmName[40] = {0};

    pid = getpid();
    snprintf(pidString, sizeof(pidString), "%d", pid);
    strcat(lshmName, shmName);
    strcat(lshmName, pidString);

    printf("PP: lshmName = %s\n", lshmName);

    checkCudaErrors(cudaGetDeviceCount(&devCount));
    std::vector<CUdevice> devices(devCount);
@@ -252,7 +272,7 @@ static void parentProcess(char *app)
        cuDeviceGet(&devices[i], i);
    }

-   if (sharedMemoryCreate(shmName, sizeof(*shm), &info) != 0) {
+   if (sharedMemoryCreate(lshmName, sizeof(*shm), &info) != 0) {
        printf("Failed to create shared memory slab\n");
        exit(EXIT_FAILURE);
    }
@@ -310,10 +310,24 @@ static void childProcess(int devId, int id, char **argv)
    ipcHandle *ipcChildHandle = NULL;
    int blocks = 0;
    int threads = 128;
    char pidString[20] = {0};
    char lshmName[40] = {0};


    // Use parent process ID to create a unique shared memory name for Linux multi-process
#ifdef __linux__
    pid_t pid;
    pid = getppid();
    snprintf(pidString, sizeof(pidString), "%d", pid);
#endif
    strcat(lshmName, shmName);
    strcat(lshmName, pidString);

    printf("CP: lshmName = %s\n", lshmName);

    checkIpcErrors(ipcOpenSocket(ipcChildHandle));

-   if (sharedMemoryOpen(shmName, sizeof(shmStruct), &info) != 0) {
+   if (sharedMemoryOpen(lshmName, sizeof(shmStruct), &info) != 0) {
        printf("Failed to create shared memory slab\n");
        exit(EXIT_FAILURE);
    }
@@ -421,11 +435,24 @@ static void parentProcess(char *app)
    volatile shmStruct *shm = NULL;
    sharedMemoryInfo info;
    std::vector<Process> processes;
    char pidString[20] = {0};
    char lshmName[40] = {0};

    // Use current process ID to create a unique shared memory name for Linux multi-process
#ifdef __linux__
    pid_t pid;
    pid = getpid();
    snprintf(pidString, sizeof(pidString), "%d", pid);
#endif
    strcat(lshmName, shmName);
    strcat(lshmName, pidString);

    printf("PP: lshmName = %s\n", lshmName);

    checkCudaErrors(cuDeviceGetCount(&devCount));
    std::vector<CUdevice> devices(devCount);

-   if (sharedMemoryCreate(shmName, sizeof(*shm), &info) != 0) {
+   if (sharedMemoryCreate(lshmName, sizeof(*shm), &info) != 0) {
        printf("Failed to create shared memory slab\n");
        exit(EXIT_FAILURE);
    }
@@ -25,8 +25,8 @@
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

-#include <cuda_runtime.h>
#include <cuda.h>
+#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_image.h>
#include <vector>
@@ -45,13 +45,15 @@
#include <windows.h>
#endif

// includes for OpenGL
#include <helper_gl.h>

// includes
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <helper_cuda.h>
#include <helper_functions.h>
#include <helper_gl.h>
#include <math.h>
#include <math_constants.h>
#include <stdio.h>
@@ -86,12 +86,14 @@
#include <windows.h>
#endif

// includes for OpenGL
#include <helper_gl.h>

// includes
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>
#include <helper_cuda.h> // includes cuda.h and cuda_runtime_api.h
#include <helper_functions.h>
#include <helper_gl.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
@@ -28,11 +28,15 @@
#include "render_particles.h"

#define HELPERGL_EXTERN_GL_FUNC_IMPLEMENTATION

// includes for OpenGL
#include <helper_gl.h>

// includes
#include <assert.h>
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_gl.h>
#include <math.h>

#define GL_POINT_SPRITE_ARB 0x8861
@@ -31,9 +31,12 @@

#pragma warning(disable : 4312)

#include <mmsystem.h>
// includes for Windows
#include <windows.h>

// includes for multimedia
#include <mmsystem.h>

// This header includes all the necessary D3D11 and CUDA includes
#include <cuda_d3d11_interop.h>
#include <cuda_runtime_api.h>
@@ -31,9 +31,12 @@

#pragma warning(disable : 4312)

#include <mmsystem.h>
// includes for Windows
#include <windows.h>

// includes for multimedia
#include <mmsystem.h>

// This header includes all the necessary D3D11 and CUDA includes
#include <cuda_d3d11_interop.h>
#include <cuda_runtime_api.h>
@@ -33,11 +33,15 @@
#include <memory.h>

#define HELPERGL_EXTERN_GL_FUNC_IMPLEMENTATION

// includes for OpenGL
#include <helper_gl.h>

// includes
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>
#include <helper_gl.h>

#include "ParticleSystem.cuh"
#include "ParticleSystem.h"
@@ -29,11 +29,15 @@
   This file contains simple wrapper functions that call the CUDA kernels
*/
#define HELPERGL_EXTERN_GL_FUNC_IMPLEMENTATION

// includes for OpenGL
#include <helper_gl.h>

// includes
#include <cstdio>
#include <cstdlib>
#include <cuda_gl_interop.h>
#include <helper_cuda.h>
#include <helper_gl.h>
#include <string.h>

#include "ParticleSystem.cuh"
@@ -1,11 +1,33 @@
# This layer of CMakeLists.txt adds folders, for better organization in Visual Studio
# and other IDEs that support this feature.

set_property(GLOBAL PROPERTY USE_FOLDERS ON)

set(CMAKE_FOLDER "0_Introduction")
add_subdirectory(0_Introduction)

set(CMAKE_FOLDER "1_Utilities")
add_subdirectory(1_Utilities)

set(CMAKE_FOLDER "2_Concepts_and_Techniques")
add_subdirectory(2_Concepts_and_Techniques)

set(CMAKE_FOLDER "3_CUDA_Features")
add_subdirectory(3_CUDA_Features)

set(CMAKE_FOLDER "4_CUDA_Libraries")
add_subdirectory(4_CUDA_Libraries)

set(CMAKE_FOLDER "5_Domain_Specific")
add_subdirectory(5_Domain_Specific)

set(CMAKE_FOLDER "6_Performance")
add_subdirectory(6_Performance)

set(CMAKE_FOLDER "7_libNVVM")
add_subdirectory(7_libNVVM)

if(BUILD_TEGRA)
    set(CMAKE_FOLDER "8_Platform_Specific/Tegra")
    add_subdirectory(8_Platform_Specific/Tegra)
endif()
run_tests.py (200 changed lines)

@@ -33,6 +33,15 @@ import json
import subprocess
import argparse
from pathlib import Path
+import concurrent.futures
+import threading
+
+print_lock = threading.Lock()
+
+def safe_print(*args, **kwargs):
+    """Thread-safe print function"""
+    with print_lock:
+        print(*args, **kwargs)

def normalize_exe_name(name):
    """Normalize executable name across platforms by removing .exe if present"""
@@ -78,96 +87,49 @@ def find_executables(root_dir):

    return executables

-def run_test(executable, output_dir, args_config, global_args=None):
-    """Run a single test and capture output"""
+def run_single_test_instance(executable, args, output_file, global_args, run_description):
+    """Run a single instance of a test executable with specific arguments."""
    exe_path = str(executable)
    exe_name = executable.name
-    base_name = normalize_exe_name(exe_name)
-
-    # Check if this executable should be skipped
-    if base_name in args_config and args_config[base_name].get("skip", False):
-        print(f"Skipping {exe_name} (marked as skip in config)")
-        return 0
-
-    # Get argument sets for this executable
-    arg_sets = []
-    if base_name in args_config:
-        config = args_config[base_name]
-        if "args" in config:
-            # Single argument set (backwards compatibility)
-            if isinstance(config["args"], list):
-                arg_sets.append(config["args"])
-            else:
-                print(f"Warning: Arguments for {base_name} must be a list")
-        elif "runs" in config:
-            # Multiple argument sets
-            for run in config["runs"]:
-                if isinstance(run.get("args", []), list):
-                    arg_sets.append(run.get("args", []))
-                else:
-                    print(f"Warning: Arguments for {base_name} run must be a list")
-
-    # If no specific args defined, run once with no args
-    if not arg_sets:
-        arg_sets.append([])
-
-    # Run for each argument set
-    failed = False
-    run_number = 1
-    for args in arg_sets:
-        # Create output file name with run number if multiple runs
-        if len(arg_sets) > 1:
-            output_file = os.path.abspath(f"{output_dir}/APM_{exe_name}.run{run_number}.txt")
-            print(f"Running {exe_name} (run {run_number}/{len(arg_sets)})")
-        else:
-            output_file = os.path.abspath(f"{output_dir}/APM_{exe_name}.txt")
-            print(f"Running {exe_name}")
+    safe_print(f"Starting {exe_name} {run_description}")

    try:
        # Prepare command with arguments
        cmd = [f"./{exe_name}"]
        cmd.extend(args)

        # Add global arguments if provided
        if global_args:
            cmd.extend(global_args)

-        print(f"  Command: {' '.join(cmd)}")
+        safe_print(f"  Command ({exe_name} {run_description}): {' '.join(cmd)}")

-        # Store current directory
-        original_dir = os.getcwd()
-
-        try:
-            # Change to executable's directory
-            os.chdir(os.path.dirname(exe_path))
-
-            # Run the executable and capture output
+        # Run the executable in its own directory using cwd
        with open(output_file, 'w') as f:
            result = subprocess.run(
                cmd,
                stdout=f,
                stderr=subprocess.STDOUT,
-                timeout=300  # 5 minute timeout
+                timeout=300,  # 5 minute timeout
+                cwd=os.path.dirname(exe_path)  # Execute in the executable's directory
            )

-            if result.returncode != 0:
-                failed = True
-                print(f"  Test completed with return code {result.returncode}")
-
-        finally:
-            # Always restore original directory
-            os.chdir(original_dir)
+        status = "Passed" if result.returncode == 0 else "Failed"
+        safe_print(f"  Finished {exe_name} {run_description}: {status} (code {result.returncode})")
+        return {"name": exe_name, "description": run_description, "return_code": result.returncode, "status": status}

    except subprocess.TimeoutExpired:
-        print(f"Error: {exe_name} timed out after 5 minutes")
-        failed = True
+        safe_print(f"Error ({exe_name} {run_description}): Timed out after 5 minutes")
+        return {"name": exe_name, "description": run_description, "return_code": -1, "status": "Timeout"}
    except Exception as e:
-        print(f"Error running {exe_name}: {str(e)}")
-        failed = True
+        safe_print(f"Error running {exe_name} {run_description}: {str(e)}")
+        return {"name": exe_name, "description": run_description, "return_code": -1, "status": f"Error: {str(e)}"}

-        run_number += 1
-
-    return 1 if failed else 0
+def run_test(executable, output_dir, args_config, global_args=None):
+    """Deprecated: This function is replaced by the parallel execution logic in main."""
+    # This function is no longer called directly by the main logic.
+    # It remains here temporarily in case it's needed for reference or single-threaded debugging.
+    # The core logic is now in run_single_test_instance and managed by ThreadPoolExecutor.
+    print("Warning: run_test function called directly - this indicates an issue in the refactoring.")
+    return 1  # Indicate failure if called

def main():
    parser = argparse.ArgumentParser(description='Run all executables and capture output')
@@ -175,6 +137,7 @@ def main():
    parser.add_argument('--config', help='JSON configuration file for executable arguments')
    parser.add_argument('--output', default='.',  # Default to current directory
                        help='Output directory for test results')
+    parser.add_argument('--parallel', type=int, default=1, help='Number of parallel tests to run')
    parser.add_argument('--args', nargs=argparse.REMAINDER,
                        help='Global arguments to pass to all executables')
    args = parser.parse_args()
@@ -192,23 +155,104 @@ def main():
        return 1

    print(f"Found {len(executables)} executables")
+    print(f"Running tests with up to {args.parallel} parallel tasks.")

+    tasks = []
+    for exe in executables:
+        exe_name = exe.name
+        base_name = normalize_exe_name(exe_name)
+
+        # Check if this executable should be skipped globally
+        if base_name in args_config and args_config[base_name].get("skip", False):
+            safe_print(f"Skipping {exe_name} (marked as skip in config)")
+            continue
+
+        arg_sets_configs = []
+        if base_name in args_config:
+            config = args_config[base_name]
+            if "args" in config:
+                if isinstance(config["args"], list):
+                    arg_sets_configs.append({"args": config["args"]})  # Wrap in dict for consistency
+                else:
+                    safe_print(f"Warning: Arguments for {base_name} must be a list")
+            elif "runs" in config:
+                for i, run_config in enumerate(config["runs"]):
+                    if run_config.get("skip", False):
+                        safe_print(f"Skipping run {i+1} for {exe_name} (marked as skip in config)")
+                        continue
+                    if isinstance(run_config.get("args", []), list):
+                        arg_sets_configs.append(run_config)
+                    else:
+                        safe_print(f"Warning: Arguments for {base_name} run {i+1} must be a list")
+
+        # If no specific args defined, create one run with no args
+        if not arg_sets_configs:
+            arg_sets_configs.append({"args": []})
+
+        # Create tasks for each run configuration
+        num_runs = len(arg_sets_configs)
+        for i, run_config in enumerate(arg_sets_configs):
+            current_args = run_config.get("args", [])
+            run_desc = f"(run {i+1}/{num_runs})" if num_runs > 1 else ""
+
+            # Create output file name
+            if num_runs > 1:
+                output_file = os.path.abspath(f"{args.output}/APM_{exe_name}.run{i+1}.txt")
+            else:
+                output_file = os.path.abspath(f"{args.output}/APM_{exe_name}.txt")
+
+            tasks.append({
+                "executable": exe,
+                "args": current_args,
+                "output_file": output_file,
+                "global_args": args.args,
+                "description": run_desc
+            })

    failed = []
-    for exe in executables:
-        ret_code = run_test(exe, args.output, args_config, args.args)
-        if ret_code != 0:
-            failed.append((exe.name, ret_code))
+    total_runs = len(tasks)
+    completed_runs = 0
+
+    with concurrent.futures.ThreadPoolExecutor(max_workers=args.parallel) as executor:
+        future_to_task = {
+            executor.submit(run_single_test_instance,
+                            task["executable"],
+                            task["args"],
+                            task["output_file"],
+                            task["global_args"],
+                            task["description"]): task
+            for task in tasks
+        }
+
+        for future in concurrent.futures.as_completed(future_to_task):
+            task_info = future_to_task[future]
+            completed_runs += 1
+            safe_print(f"Progress: {completed_runs}/{total_runs} runs completed.")
+            try:
+                result = future.result()
+                if result["return_code"] != 0:
+                    failed.append(result)
+            except Exception as exc:
+                safe_print(f'Task {task_info["executable"].name} {task_info["description"]} generated an exception: {exc}')
+                failed.append({
+                    "name": task_info["executable"].name,
+                    "description": task_info["description"],
+                    "return_code": -1,
+                    "status": f"Execution Exception: {exc}"
+                })

    # Print summary
    print("\nTest Summary:")
-    print(f"Ran {len(executables)} tests")
+    print(f"Ran {total_runs} test runs for {len(executables)} executables.")
    if failed:
-        print(f"Failed tests ({len(failed)}):")
-        for name, code in failed:
-            print(f"  {name}: returned {code}")
-        return failed[0][1]  # Return first failure code
+        print(f"Failed runs ({len(failed)}):")
+        for fail in failed:
+            print(f"  {fail['name']} {fail['description']}: {fail['status']} (code {fail['return_code']})")
+        # Return the return code of the first failure, or 1 if only exceptions occurred
+        first_failure_code = next((f["return_code"] for f in failed if f["return_code"] != -1), 1)
+        return first_failure_code
    else:
-        print("All tests passed!")
+        print("All test runs passed!")
        return 0

if __name__ == '__main__':