mirror of https://github.com/NVIDIA/cuda-samples.git synced 2026-07-16 21:06:52 +08:00

History

- Added Python samples for CUDA Python 1.0 release
- Renamed top-level `Samples` directory to `cpp` to accommodate Python samples.

2026-05-13 17:13:18 -05:00

blockwiseSum.py, README.md, requirements.txt

CUDA 13.2 samples update (#432)

2026-05-13 17:13:18 -05:00

README.md

Sample: Block-wise Array Sum (Python)

Description

Demonstrates the fundamentals of CUDA thread cooperation: thread/block indexing, strided loops, and block-wise reduction using shared memory. This sample shows three progressively more complex kernel patterns using the `cuda.core` API:

Simple indexing - One thread per element
Strided loop - Each thread processes multiple elements
Block partial sum - Shared memory reduction within each block

What You'll Learn

How to calculate a global thread ID from block and thread indices
How to use a strided loop to process arrays with more elements than the grid has threads
How threads in a block cooperate through shared memory and __syncthreads()
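The block-level cooperation in the last item can be emulated on the CPU to make the data flow visible. The following is a minimal sketch, not the kernel from blockwiseSum.py: it assumes one element per thread, a power-of-two block size, and a tree-shaped reduction, and the name `block_partial_sums` is hypothetical. Each pass of the `while` loop stands in for the work done between two `__syncthreads()` barriers.

```python
def block_partial_sums(data: list[float], block_dim: int) -> list[float]:
    """Emulate a per-block sum using a shared-memory style tree reduction."""
    # Assumption: block_dim is a power of two so the halving below is exact.
    if block_dim <= 0 or block_dim & (block_dim - 1) != 0:
        raise ValueError("block_dim must be a positive power of two")

    grid_dim = (len(data) + block_dim - 1) // block_dim  # blocks needed to cover data
    partials = []
    for block_idx in range(grid_dim):
        # "Shared memory": one slot per thread; threads past the end load 0.
        shared = [0.0] * block_dim
        for thread_idx in range(block_dim):
            gid = block_idx * block_dim + thread_idx
            if gid < len(data):
                shared[thread_idx] = data[gid]

        # Halve the number of active threads each pass. Active threads only
        # read slots >= active and only write slots < active, so running them
        # one after another gives the same result as running them in parallel.
        active = block_dim // 2
        while active > 0:
            for thread_idx in range(active):
                shared[thread_idx] += shared[thread_idx + active]
            active //= 2

        partials.append(shared[0])  # slot 0 now holds this block's sum
    return partials


sums = block_partial_sums([1.0] * 1000, block_dim=256)
print(sums)        # [256.0, 256.0, 256.0, 232.0]
print(sum(sums))   # 1000.0
```

On a GPU the per-block results would still need a final combination step (on the host or in a second kernel); the sketch does that with Python's built-in `sum`.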

Key Concepts

Thread and Block Indexing

Global Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Stride = blockDim.x * gridDim.x
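To check the two formulas above by hand, they can be written as plain Python functions for a one-dimensional launch. This is an illustrative sketch only; the function names and the launch shape (4 blocks of 256 threads) are assumptions and do not come from blockwiseSum.py.

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Global thread ID = blockIdx.x * blockDim.x + threadIdx.x (1D launch)."""
    return block_idx * block_dim + thread_idx


def grid_stride(block_dim: int, grid_dim: int) -> int:
    """Stride = blockDim.x * gridDim.x, the total number of threads in the grid."""
    return block_dim * grid_dim


# Hypothetical launch shape chosen for illustration.
BLOCK_DIM, GRID_DIM = 256, 4

ids = [
    global_thread_id(b, BLOCK_DIM, t)
    for b in range(GRID_DIM)
    for t in range(BLOCK_DIM)
]

# Every thread receives a distinct ID, and together they cover [0, stride).
assert ids == list(range(grid_stride(BLOCK_DIM, GRID_DIM)))

print(global_thread_id(block_idx=2, block_dim=BLOCK_DIM, thread_idx=5))  # 517
```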

Strided Loop Pattern

Each thread processes multiple elements, stepping through the array by the total number of threads in the grid, so a fixed grid size can cover an array of any length:

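// tid and stride as defined under "Thread and Block Indexing"
size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
size_t stride = (size_t)blockDim.x * gridDim.x;
for (size_t i = tid; i < N; i += stride) {
    output[i] = input[i] * 2.0f;
}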
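The same loop can be emulated in Python to confirm that a grid with fewer threads than elements still visits every element exactly once. This is a sketch under stated assumptions: the function name `strided_double` and the launch shape are hypothetical, and the simulated threads run sequentially, not in parallel.

```python
def strided_double(data: list[float], block_dim: int, grid_dim: int) -> list[float]:
    """Emulate the strided kernel: thread tid handles i = tid, tid + stride, ..."""
    n = len(data)
    stride = block_dim * grid_dim          # total number of simulated threads
    out = [0.0] * n
    visits = [0] * n                       # how many threads touched each element

    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            tid = block_idx * block_dim + thread_idx
            for i in range(tid, n, stride):
                out[i] = data[i] * 2.0
                visits[i] += 1

    # No element is skipped and no element is processed twice.
    assert all(v == 1 for v in visits)
    return out


values = [float(i) for i in range(1000)]
result = strided_double(values, block_dim=32, grid_dim=4)  # 128 threads, 1000 elements
assert result == [2.0 * v for v in values]
```

With 128 simulated threads and 1,000 elements, each thread handles seven or eight elements, which is why the grid size does not need to grow with the array.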

Key APIs

From `cuda.core`:

Device - Device management and context
Program - Compile CUDA C++ kernels
ProgramOptions - Kernel compilation options (architecture target)
LaunchConfig - Configure grid/block dimensions and shared memory
launch() - Execute kernel
EventOptions - GPU timing configuration

From CuPy:

cp.asarray() - Transfer data to GPU
cp.zeros_like() - Allocate GPU arrays

Requirements

Hardware:

NVIDIA GPU with CUDA support

Software:

CUDA Toolkit 13.0 or newer
Python 3.10 or newer
See requirements.txt for Python packages

Installation

pip install -r requirements.txt

How to Run

python blockwiseSum.py

Expected Output

Device: <Your GPU>
Compute Capability: sm_XX
Array size: 1,048,576 elements

Simple indexing: Test PASSED
Strided loop:    Test PASSED
Block-wise sum:  Test PASSED

Kernel time: X.XXX ms, Bandwidth: XXX.X GB/s

Done
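The bandwidth figure in the output relates the bytes a kernel moves to its measured time. How blockwiseSum.py computes its number is not stated here, so the following is a sketch under assumptions: each element is read once and written once (`passes=2`), elements are 4-byte floats, and 1 GB means 10^9 bytes. The function name and the 0.05 ms timing are hypothetical.

```python
def effective_bandwidth_gbs(
    n_elements: int,
    bytes_per_element: int,
    elapsed_ms: float,
    passes: int = 2,
) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    total_bytes = n_elements * bytes_per_element * passes  # reads + writes
    seconds = elapsed_ms / 1e3
    return total_bytes / seconds / 1e9


# Hypothetical timing: 1,048,576 float32 elements processed in 0.05 ms.
bw = effective_bandwidth_gbs(1_048_576, 4, elapsed_ms=0.05)
print(f"Kernel time: 0.050 ms, Bandwidth: {bw:.1f} GB/s")
```

A reduction kernel writes far fewer bytes than it reads, so `passes` would be closer to 1 when this estimate is applied to the block-wise sum.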

Files

blockwiseSum.py - Python implementation with CUDA kernels
README.md - This file
requirements.txt - Sample dependencies
