Remove unused bin/x86_64 directory hierarchy

Rob Armstrong 2025-05-01 09:53:54 -07:00
parent c70d79cf3b
commit ab68d58d59
261 changed files with 0 additions and 151271 deletions


@ -1,36 +0,0 @@
[./BlackScholes] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.048059 msec
Effective memory bandwidth: 1664.634581 GB/s
Gigaoptions per second : 166.463458
BlackScholes, Throughput = 166.4635 GOptions/s, Time = 0.00005 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128
Reading back GPU results...
Checking the results...
...running CPU calculations.
Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05
Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.
[BlackScholes] - Test Summary
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Test passed
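
The numbers above come from pricing 8,000,000 options with the closed-form Black-Scholes model on both CPU and GPU and comparing the two results. As a rough illustration of the arithmetic each GPU thread performs, a minimal kernel might look like the sketch below (function, kernel, and parameter names are made up here, not taken from the sample's source):

    #include <math.h>

    // Standard normal CDF via the complementary error function: N(d) = 0.5*erfc(-d/sqrt(2)).
    __device__ float cnd(float d) { return 0.5f * erfcf(-d * 0.70710678f); }

    // One thread per option: closed-form European call and put prices.
    __global__ void blackScholesSketch(float *call, float *put, const float *S,
                                       const float *K, const float *T,
                                       float r, float v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float sqrtT = sqrtf(T[i]);
        float d1 = (logf(S[i] / K[i]) + (r + 0.5f * v * v) * T[i]) / (v * sqrtT);
        float d2 = d1 - v * sqrtT;
        float expRT = expf(-r * T[i]);
        call[i] = S[i] * cnd(d1) - K[i] * expRT * cnd(d2);
        put[i]  = K[i] * expRT * cnd(-d2) - S[i] * cnd(-d1);
    }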


@ -1,34 +0,0 @@
[./BlackScholes_nvrtc] - Starting...
Initializing data...
...allocating CPU memory for options.
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.047896 msec
Effective memory bandwidth: 1670.268678 GB/s
Gigaoptions per second : 167.026868
BlackScholes, Throughput = 167.0269 GOptions/s, Time = 0.00005 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128
Reading back GPU results...
Checking the results...
...running CPU calculations.
Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05
Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.
[./BlackScholes_nvrtc] - Test Summary
Test passed


@ -1,37 +0,0 @@
./FDTD3d Starting...
Set-up, based upon target device GMEM size...
getTargetDeviceGlobalMemSize
cudaGetDeviceCount
GPU Device 0: "Hopper" with compute capability 9.0
cudaGetDeviceProperties
generateRandomData
FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...
fdtdReference...
calloc intermediate
Host FDTD loop
t = 0
t = 1
t = 2
t = 3
t = 4
fdtdReference complete
fdtdGPU...
GPU Device 0: "Hopper" with compute capability 9.0
set block size to 32x16
set grid size to 12x24
GPU FDTD loop
t = 0 launch kernel
t = 1 launch kernel
t = 2 launch kernel
t = 3 launch kernel
t = 4 launch kernel
fdtdGPU complete
CompareData (tolerance 0.000100)...


@ -1,9 +0,0 @@
HSOpticalFlow Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Loading "frame10.ppm" ...
Loading "frame11.ppm" ...
Computing optical flow on CPU...
Computing optical flow on GPU...
L1 error : 0.044308


@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with inline PRNG)
==========================================
Estimating Pi on GPU (NVIDIA H100 PCIe)
Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141480e+00
Expected: 3.141593e+00
Absolute error: 1.127720e-04
Relative error: 3.589644e-05
MonteCarloEstimatePiInlineP, Performance = 746954.33 sims/s, Time = 133.88(ms), NumDevsUsed = 1, Blocksize = 128
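
This and the following three Monte Carlo runs all estimate pi the same way, by sampling points in the unit square and counting the fraction that falls inside the quarter circle; they differ only in how the random numbers are produced and consumed. A self-contained sketch of the idea using cuRAND's device-side generator (the sample itself uses its own inline PRNG/QRNG; every name and size below is illustrative):

    #include <curand_kernel.h>
    #include <cstdio>

    // Each thread samples `perThread` points in the unit square and counts how
    // many fall inside the quarter circle; pi ~= 4 * inside / total.
    __global__ void estimatePiSketch(unsigned long long *inside,
                                     int perThread, unsigned long long seed) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, tid, 0, &state);
        unsigned long long hits = 0;
        for (int i = 0; i < perThread; ++i) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++hits;
        }
        atomicAdd(inside, hits);
    }

    int main() {
        const int blocks = 128, threads = 128, perThread = 100;
        unsigned long long *d_inside, h_inside = 0;
        cudaMalloc(&d_inside, sizeof(unsigned long long));
        cudaMemcpy(d_inside, &h_inside, sizeof(h_inside), cudaMemcpyHostToDevice);
        estimatePiSketch<<<blocks, threads>>>(d_inside, perThread, 1234ULL);
        cudaMemcpy(&h_inside, d_inside, sizeof(h_inside), cudaMemcpyDeviceToHost);
        double total = double(blocks) * threads * perThread;
        printf("pi estimate: %f\n", 4.0 * h_inside / total);
        cudaFree(d_inside);
        return 0;
    }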


@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with inline QRNG)
==========================================
Estimating Pi on GPU (NVIDIA H100 PCIe)
Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141840e+00
Expected: 3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05
MonteCarloEstimatePiInlineQ, Performance = 677644.44 sims/s, Time = 147.57(ms), NumDevsUsed = 1, Blocksize = 128


@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with batch PRNG)
=========================================
Estimating Pi on GPU (NVIDIA H100 PCIe)
Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.136320e+00
Expected: 3.141593e+00
Absolute error: 5.272627e-03
Relative error: 1.678329e-03
MonteCarloEstimatePiP, Performance = 652941.82 sims/s, Time = 153.15(ms), NumDevsUsed = 1, Blocksize = 128


@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with batch QRNG)
=========================================
Estimating Pi on GPU (NVIDIA H100 PCIe)
Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141840e+00
Expected: 3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05
MonteCarloEstimatePiQ, Performance = 821146.16 sims/s, Time = 121.78(ms), NumDevsUsed = 1, Blocksize = 128


@ -1,13 +0,0 @@
Monte Carlo Single Asian Option (with PRNG)
===========================================
Pricing option on GPU (NVIDIA H100 PCIe)
Precision: single
Number of sims: 100000
Spot | Strike | r | sigma | tenor | Call/Put | Value | Expected |
-----------|------------|------------|------------|------------|------------|------------|------------|
40 | 35 | 0.03 | 0.2 | 0.333333 | Call | 5.17083 | 5.16253 |
MonteCarloSingleAsianOptionP, Performance = 824402.28 sims/s, Time = 121.30(ms), NumDevsUsed = 1, Blocksize = 128


@ -1,19 +0,0 @@
./MersenneTwisterGP11213 Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Allocating data for 2400000 samples...
Seeding with 777 ...
Generating random numbers on GPU...
Reading back the results...
Generating random numbers on CPU...
Comparing CPU/GPU random numbers...
Max absolute error: 0.000000E+00
L1 norm: 0.000000E+00
MersenneTwisterGP11213, Throughput = 74.5342 GNumbers/s, Time = 0.00003 s, Size = 2400000 Numbers
Shutting down...


@ -1,29 +0,0 @@
./MonteCarloMultiGPU Starting...
Using single CPU thread for multiple GPUs
MonteCarloMultiGPU
==================
Parallelization method = streamed
Problem scaling = weak
Number of GPUs = 1
Total number of options = 8192
Number of paths = 262144
main(): generating input data...
main(): starting 1 host threads...
main(): GPU statistics, streamed
GPU Device #0: NVIDIA H100 PCIe
Options : 8192
Simulation paths: 262144
Total time (ms.): 5.516000
Note: This is elapsed time for all to compute.
Options per sec.: 1485134.210647
main(): comparing Monte Carlo and Black-Scholes results...
Shutting down...
Test Summary...
L1 norm : 4.869781E-04
Average reserve: 14.607882
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Test passed


@ -1,10 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
TEST#1:
CUDA resize nv12(1920x1080 --> 640x480), batch: 24, average time: 0.024 ms ==> 0.001 ms/frame
CUDA convert nv12(640x480) to bgr(640x480), batch: 24, average time: 0.061 ms ==> 0.003 ms/frame
TEST#2:
CUDA convert nv12(1920x1080) to bgr(1920x1080), batch: 24, average time: 0.405 ms ==> 0.017 ms/frame
CUDA resize bgr(1920x1080 --> 640x480), batch: 24, average time: 0.318 ms ==> 0.013 ms/frame


@ -1,19 +0,0 @@
Sobol Quasi-Random Number Generator Starting...
> number of vectors = 100000
> number of dimensions = 100
GPU Device 0: "Hopper" with compute capability 9.0
Allocating CPU memory...
Allocating GPU memory...
Initializing direction numbers...
Copying direction numbers to device...
Executing QRNG on GPU...
Gsamples/s: 7.51315
Reading results from GPU...
Executing QRNG on CPU...
Gsamples/s: 0.232504
Checking results...
L1-Error: 0
Shutting down...


@ -1,6 +0,0 @@
Starting [./StreamPriorities]...
GPU Device 0: "Hopper" with compute capability 9.0
CUDA stream priority range: LOW: 0 to HIGH: -5
elapsed time of kernels launched to LOW priority stream: 0.661 ms
elapsed time of kernels launched to HI priority stream: 0.523 ms
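
The priority range is queried at runtime; lower numbers mean higher priority, so HIGH is -5 here, and work in the high-priority stream finishes sooner. A minimal sketch of creating and using prioritized streams (the spin kernel and cycle counts are placeholders, not the sample's code):

    #include <cstdio>

    __global__ void spin(long long cycles) {        // busy-wait kernel
        long long start = clock64();
        while (clock64() - start < cycles) { }
    }

    int main() {
        int lowPrio, highPrio;
        // Lower numbers mean higher priority; the range above was LOW 0 .. HIGH -5.
        cudaDeviceGetStreamPriorityRange(&lowPrio, &highPrio);

        cudaStream_t stLow, stHigh;
        cudaStreamCreateWithPriority(&stLow,  cudaStreamNonBlocking, lowPrio);
        cudaStreamCreateWithPriority(&stHigh, cudaStreamNonBlocking, highPrio);

        spin<<<1, 1, 0, stLow>>>(1 << 20);   // low-priority work
        spin<<<1, 1, 0, stHigh>>>(1 << 20);  // high-priority work is scheduled ahead
        cudaDeviceSynchronize();

        printf("priority range: low=%d high=%d\n", lowPrio, highPrio);
        cudaStreamDestroy(stLow);
        cudaStreamDestroy(stHigh);
        return 0;
    }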


@ -1,17 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Running ........................................................
Overall Time For matrixMultiplyPerf
Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.210 0.264 0.332 0.014 0.033 0.026 0.037 0.024
16 0.201 0.307 0.489 0.025 0.043 0.035 0.046 0.045
64 0.311 0.381 0.758 0.067 0.084 0.075 0.074 0.063
256 0.545 0.604 1.429 0.323 0.228 0.212 0.197 0.187
1024 1.551 1.444 2.436 1.902 0.831 0.784 0.714 0.728
4096 4.960 4.375 7.863 11.966 3.239 3.179 2.908 2.919
16384 18.911 17.022 29.696 77.375 13.796 13.757 12.874 12.862
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


@ -1,44 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Executing tasks on host / device
Task [0], thread [0] executing on device (569)
Task [1], thread [1] executing on device (904)
Task [2], thread [2] executing on device (529)
Task [3], thread [3] executing on device (600)
Task [4], thread [0] executing on device (975)
Task [5], thread [2] executing on device (995)
Task [6], thread [1] executing on device (576)
Task [7], thread [3] executing on device (700)
Task [8], thread [0] executing on device (716)
Task [9], thread [2] executing on device (358)
Task [10], thread [3] executing on device (941)
Task [11], thread [1] executing on device (403)
Task [12], thread [0] executing on host (97)
Task [13], thread [2] executing on device (451)
Task [14], thread [1] executing on device (789)
Task [15], thread [0] executing on device (810)
Task [16], thread [2] executing on device (807)
Task [17], thread [3] executing on device (756)
Task [18], thread [0] executing on device (509)
Task [19], thread [1] executing on device (252)
Task [20], thread [2] executing on device (515)
Task [21], thread [3] executing on device (676)
Task [22], thread [0] executing on device (948)
Task [23], thread [1] executing on device (944)
Task [24], thread [3] executing on device (974)
Task [25], thread [2] executing on device (513)
Task [26], thread [0] executing on device (207)
Task [27], thread [1] executing on device (509)
Task [28], thread [2] executing on device (344)
Task [29], thread [3] executing on device (198)
Task [30], thread [0] executing on device (223)
Task [31], thread [1] executing on device (382)
Task [32], thread [3] executing on device (980)
Task [33], thread [2] executing on device (519)
Task [34], thread [0] executing on host (92)
Task [35], thread [1] executing on device (677)
Task [36], thread [2] executing on device (769)
Task [37], thread [0] executing on device (199)
Task [38], thread [3] executing on device (845)
Task [39], thread [0] executing on device (844)
All Done!


@ -1,51 +0,0 @@
[./alignedTypes] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
[NVIDIA H100 PCIe] has 114 MP(s) x 128 (Cores/MP) = 14592 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 1.298781 ms / Copy throughput: 35.853619 GB/s.
TEST OK
uint16...
Avg. time: 0.676656 ms / Copy throughput: 68.817823 GB/s.
TEST OK
RGBA8_misaligned...
Avg. time: 0.371437 ms / Copy throughput: 125.367015 GB/s.
TEST OK
LA32_misaligned...
Avg. time: 0.200531 ms / Copy throughput: 232.213238 GB/s.
TEST OK
RGB32_misaligned...
Avg. time: 0.154500 ms / Copy throughput: 301.398134 GB/s.
TEST OK
RGBA32_misaligned...
Avg. time: 0.124531 ms / Copy throughput: 373.930325 GB/s.
TEST OK
Testing aligned types...
RGBA8...
Avg. time: 0.364031 ms / Copy throughput: 127.917614 GB/s.
TEST OK
I32...
Avg. time: 0.363844 ms / Copy throughput: 127.983539 GB/s.
TEST OK
LA32...
Avg. time: 0.200750 ms / Copy throughput: 231.960205 GB/s.
TEST OK
RGB32...
Avg. time: 0.122375 ms / Copy throughput: 380.518985 GB/s.
TEST OK
RGBA32...
Avg. time: 0.122437 ms / Copy throughput: 380.324735 GB/s.
TEST OK
RGBA32_2...
Avg. time: 0.080563 ms / Copy throughput: 578.010964 GB/s.
TEST OK
[alignedTypes] -> Test Results: 0 Failures
Shutting down...
Test passed


@ -1,7 +0,0 @@
[./asyncAPI] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
CUDA device [NVIDIA H100 PCIe]
time spent executing by the GPU: 5.34
time spent by CPU in CUDA calls: 0.03
CPU executed 55200 iterations while waiting for GPU to finish
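
The sample overlaps host work with an asynchronously launched kernel: GPU time is measured with CUDA events, and the host simply counts loop iterations until the stop event reports completion. A minimal sketch of that pattern (kernel and sizes are illustrative, not the sample's):

    #include <cstdio>

    __global__ void increment(int *data, int n, int value) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += value;
    }

    int main() {
        const int n = 1 << 24;
        int *d_data;
        cudaMalloc(&d_data, n * sizeof(int));
        cudaMemset(d_data, 0, n * sizeof(int));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        increment<<<(n + 255) / 256, 256>>>(d_data, n, 1);
        cudaEventRecord(stop);

        // The host keeps working (here: just counting) until the GPU is done.
        unsigned long long cpuIterations = 0;
        while (cudaEventQuery(stop) == cudaErrorNotReady) ++cpuIterations;

        float gpuMs = 0.0f;
        cudaEventElapsedTime(&gpuMs, start, stop);
        printf("GPU time %.2f ms, CPU did %llu iterations meanwhile\n",
               gpuMs, cpuIterations);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }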


@ -1,24 +0,0 @@
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA H100 PCIe
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 27.9
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 25.0
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 1421.4
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
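
Host/device bandwidth in this test is derived from timing transfers of page-locked (pinned) host memory. A rough sketch of the host-to-device measurement (transfer size reused from above; everything else is illustrative, not the sample's code):

    #include <cstdio>

    int main() {
        const size_t bytes = 32000000;          // same transfer size as above
        void *h_pinned, *d_buf;
        cudaMallocHost(&h_pinned, bytes);       // page-locked (PINNED) host memory
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        const int reps = 10;
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbPerSec = (double)bytes * reps / (ms * 1e-3) / 1e9;
        printf("Host to Device (pinned): %.1f GB/s\n", gbPerSec);

        cudaFreeHost(h_pinned);
        cudaFree(d_buf);
        return 0;
    }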


@ -1,59 +0,0 @@
batchCUBLAS Starting...
GPU Device 0: "Hopper" with compute capability 9.0
==== Running single kernels ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00195909 sec GFLOPS=2.14095
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00003910 sec GFLOPS=107.269
@@@@ dgemm test OK
==== Running N=10 without streams ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00016713 sec GFLOPS=250.958
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00144100 sec GFLOPS=29.1069
@@@@ dgemm test OK
==== Running N=10 with streams ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00017214 sec GFLOPS=243.659
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00014997 sec GFLOPS=279.685
@@@@ dgemm test OK
==== Running N=10 batched ====
Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00004101 sec GFLOPS=1022.8
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00004506 sec GFLOPS=930.803
@@@@ dgemm test OK
Test Summary
0 error(s)


@ -1,21 +0,0 @@
NPP Library Version 12.0.0
CUDA Driver Version: 12.0
CUDA Runtime Version: 12.0
Input file load succeeded.
teapot_CompressedMarkerLabelsUF_8Way_512x512_32u succeeded, compressed label count is 155332.
Input file load succeeded.
CT_Skull_CompressedMarkerLabelsUF_8Way_512x512_32u succeeded, compressed label count is 414.
Input file load succeeded.
PCB_METAL_CompressedMarkerLabelsUF_8Way_509x335_32u succeeded, compressed label count is 3731.
Input file load succeeded.
PCB2_CompressedMarkerLabelsUF_8Way_1024x683_32u succeeded, compressed label count is 1224.
Input file load succeeded.
PCB_CompressedMarkerLabelsUF_8Way_1280x720_32u succeeded, compressed label count is 1440.
teapot_CompressedMarkerLabelsUFBatch_8Way_512x512_32u succeeded, compressed label count is 155332.
CT_Skull_CompressedMarkerLabelsUFBatch_8Way_512x512_32u succeeded, compressed label count is 414.
PCB_METAL_CompressedMarkerLabelsUFBatch_8Way_509x335_32u succeeded, compressed label count is 3731.
PCB2_CompressedMarkerLabelsUFBatch_8Way_1024x683_32u succeeded, compressed label count is 1222.
PCB_CompressedMarkerLabelsUFBatch_8Way_1280x720_32u succeeded, compressed label count is 1447.


@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0
M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_bf16gemm_async_copy
Time: 9.149888 ms
TFLOPS: 120.17


@ -1,8 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Launching 228 blocks with 1024 threads...
Array size = 102400 Num of Odds = 50945 Sum of Odds = 1272565 Sum of Evens 1233938
...Done.


@ -1,22 +0,0 @@
[./binomialOptions] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Generating input data...
Running GPU binomial tree...
Options count : 1024
Time steps : 2048
binomialOptionsGPU() time: 2.081000 msec
Options per second : 492071.098457
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.220214E-04
CPU binomial vs. Black-Scholes
L1 norm: 2.220922E-04
CPU binomial vs. GPU binomial
L1 norm: 7.997008E-07
Shutting down...
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Test passed
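
The L1 norms above compare binomial-tree prices against the Black-Scholes closed form; as the number of time steps grows, the tree value converges to it. A host-side sketch of Cox-Ross-Rubinstein backward induction for a European call (not the sample's GPU kernel; names and parameters are illustrative):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Price a European call on a Cox-Ross-Rubinstein binomial tree (host sketch).
    double binomialCallSketch(double S, double K, double r, double sigma,
                              double T, int steps) {
        double dt = T / steps;
        double u  = std::exp(sigma * std::sqrt(dt));   // up factor
        double d  = 1.0 / u;                           // down factor
        double p  = (std::exp(r * dt) - d) / (u - d);  // risk-neutral up probability
        double disc = std::exp(-r * dt);

        // Option values at expiry for every terminal node.
        std::vector<double> v(steps + 1);
        for (int i = 0; i <= steps; ++i) {
            double price = S * std::pow(u, i) * std::pow(d, steps - i);
            v[i] = std::max(price - K, 0.0);
        }
        // Backward induction toward the root node.
        for (int step = steps - 1; step >= 0; --step)
            for (int i = 0; i <= step; ++i)
                v[i] = disc * (p * v[i + 1] + (1.0 - p) * v[i]);

        return v[0];   // approaches the Black-Scholes value as steps grows
    }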


@ -1,23 +0,0 @@
[./binomialOptions_nvrtc] - Starting...
Generating input data...
Running GPU binomial tree...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Options count : 1024
Time steps : 2048
binomialOptionsGPU() time: 3021.375000 msec
Options per second : 338.918539
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.216577E-04
CPU binomial vs. Black-Scholes
L1 norm: 9.435265E-05
CPU binomial vs. GPU binomial
L1 norm: 1.513570E-04
Shutting down...
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Test passed


@ -1,4 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Read 3223503 byte corpus from ../../../../Samples/0_Introduction/c++11_cuda/warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "../../../../Samples/0_Introduction/c++11_cuda/warandpeace.txt"


@ -1,6 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
GPU device NVIDIA H100 PCIe has compute capabilities (SM 9.0)
Running qsort on 1000000 elements with seed 0, on NVIDIA H100 PCIe
cdpAdvancedQuicksort PASSED
Sorted 1000000 elems in 5.015 ms (199.389 Melems/sec)


@ -1,2 +0,0 @@
Running on GPU 0 (NVIDIA H100 PCIe)
Computing Bezier Lines (CUDA Dynamic Parallelism Version) ... Done!


@ -1,5 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
GPU device NVIDIA H100 PCIe has compute capabilities (SM 9.0)
Launching CDP kernel to build the quadtree
Results: OK


@ -1,23 +0,0 @@
starting Simple Print (CUDA Dynamic Parallelism)
GPU Device 0: "Hopper" with compute capability 9.0
***************************************************************************
The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2
In total 2+8=10 blocks are launched!!! (8 from the GPU)
***************************************************************************
Launching cdp_kernel() with CUDA Dynamic Parallelism:
BLOCK 1 launched by the host
BLOCK 0 launched by the host
| BLOCK 3 launched by thread 0 of block 1
| BLOCK 2 launched by thread 0 of block 1
| BLOCK 4 launched by thread 0 of block 0
| BLOCK 5 launched by thread 0 of block 0
| BLOCK 7 launched by thread 1 of block 0
| BLOCK 6 launched by thread 1 of block 0
| BLOCK 9 launched by thread 1 of block 1
| BLOCK 8 launched by thread 1 of block 1
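
The block count works out as the banner says: the host launches 2 blocks of 2 threads, each of those 4 threads launches one more 2x2 grid from device code (8 blocks), and recursion stops at max_depth=2, giving 2+8=10 blocks. A minimal dynamic-parallelism sketch of that pattern (compile with nvcc -rdc=true; the kernel name and messages are made up, not the sample's):

    #include <cstdio>

    // Each thread launches another 2-block, 2-thread grid until maxDepth is reached.
    __global__ void cdpSketch(int depth, int maxDepth) {
        printf("block %d, thread %d at depth %d\n", blockIdx.x, threadIdx.x, depth);
        if (depth + 1 < maxDepth) {
            cdpSketch<<<2, 2>>>(depth + 1, maxDepth);   // device-side launch
        }
    }

    int main() {
        cdpSketch<<<2, 2>>>(0, 2);       // 2 host blocks + 8 device-launched blocks
        cudaDeviceSynchronize();
        return 0;
    }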


@ -1,6 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Initializing data:
Running quicksort on 128 elements
Launching kernel on the GPU
Validating results: OK


@ -1,4 +0,0 @@
CUDA Clock sample
GPU Device 0: "Hopper" with compute capability 9.0
Average clocks/block = 1904.875000


@ -1,5 +0,0 @@
CUDA Clock sample
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Average clocks/block = 1839.750000


@ -1,8 +0,0 @@
[./concurrentKernels] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
> Detected Compute SM 9.0 hardware with 114 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.010s
Test passed


@ -1,13 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities
iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Test Summary: Error amount = 0.000000
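
The residual column shows the usual conjugate gradient convergence on a symmetric positive definite system. For reference, the textbook CG iteration that produces such a sequence, sketched on the host (the matrix, problem size, and tolerance here are illustrative, not the sample's, and the sample runs the corresponding vector operations on the GPU):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // y = A*x for the 1-D Laplacian stencil [-1 2 -1] (a simple SPD test matrix).
    static void laplacian1D(const std::vector<double> &x, std::vector<double> &y) {
        int n = (int)x.size();
        for (int i = 0; i < n; ++i) {
            double left  = (i > 0)     ? x[i - 1] : 0.0;
            double right = (i < n - 1) ? x[i + 1] : 0.0;
            y[i] = 2.0 * x[i] - left - right;
        }
    }

    int main() {
        const int n = 1024;
        const double tol = 1e-6;
        std::vector<double> x(n, 0.0), b(n, 1.0), r = b, p = r, Ap(n);

        double rsold = 0.0;
        for (int i = 0; i < n; ++i) rsold += r[i] * r[i];

        for (int iter = 1; iter <= n; ++iter) {
            laplacian1D(p, Ap);
            double pAp = 0.0;
            for (int i = 0; i < n; ++i) pAp += p[i] * Ap[i];
            double alpha = rsold / pAp;
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rsnew = 0.0;
            for (int i = 0; i < n; ++i) rsnew += r[i] * r[i];
            printf("iteration = %d, residual = %e\n", iter, std::sqrt(rsnew));
            if (std::sqrt(rsnew) < tol) break;
            for (int i = 0; i < n; ++i) p[i] = r[i] + (rsnew / rsold) * p[i];
            rsold = rsnew;
        }
        return 0;
    }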


@ -1,13 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities
iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Test Summary: Error amount = 0.000000


@ -1,8 +0,0 @@
Starting [conjugateGradientMultiBlockCG]...
GPU Device 0: "Hopper" with compute capability 9.0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities
GPU Final, residual = 1.600115e-06, kernel execution time = 16.014656 ms
Test Summary: Error amount = 0.000000
&&&& conjugateGradientMultiBlockCG PASSED


@ -1,4 +0,0 @@
Starting [conjugateGradientMultiDeviceCG]...
GPU Device 0: "NVIDIA H100 PCIe" with compute capability 9.0
No two or more GPUs with same architecture capable of concurrentManagedAccess found.
Waiving the sample


@ -1,18 +0,0 @@
conjugateGradientPrecond starting...
GPU Device 0: "Hopper" with compute capability 9.0
GPU selected Device ID = 0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities
laplace dimension = 128
Convergence of CG without preconditioning:
iteration = 564, residual = 9.174634e-13
Convergence Test: OK
Convergence of CG using ILU(0) preconditioning:
iteration = 188, residual = 9.084683e-13
Convergence Test: OK
Test Summary:
Counted total of 0 errors
qaerr1 = 0.000005 qaerr2 = 0.000003


@ -1,16 +0,0 @@
Starting [conjugateGradientUM]...
GPU Device 0: "Hopper" with compute capability 9.0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities
iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& conjugateGradientUM PASSED
Test Summary: Error amount = 0.000000, result = SUCCESS


@ -1,41 +0,0 @@
[./convolutionFFT2D] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Testing built-in R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating R2C & C2R FFT plans for 2048 x 2048
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 33613.444604 MPix/s (0.119000 ms)
...reading back GPU convolution results
...running reference CPU convolution
...comparing the results: rel L2 = 9.395370E-08 (max delta = 1.208283E-06)
L2norm Error OK
...shutting down
Testing custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 29197.081461 MPix/s (0.137000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.067915E-07 (max delta = 9.817303E-07)
L2norm Error OK
...shutting down
Testing updated custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 39603.959017 MPix/s (0.101000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.065127E-07 (max delta = 9.817303E-07)
L2norm Error OK
...shutting down
Test Summary: 0 errors
Test passed


@ -1,21 +0,0 @@
[./convolutionSeparable] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Image Width x Height = 3072 x 3072
Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU convolution (16 identical iterations)...
convolutionSeparable, Throughput = 74676.0329 MPixels/sec, Time = 0.00013 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0
Reading back GPU results...
Checking the results...
...running convolutionRowCPU()
...running convolutionColumnCPU()
...comparing the results
...Relative L2 norm: 0.000000E+00
Shutting down...
Test passed


@ -1,17 +0,0 @@
[./convolutionTexture] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Initializing data...
Running GPU rows convolution (10 identical iterations)...
Average convolutionRowsGPU() time: 0.117200 msecs; //40261.023178 Mpix/s
Copying convolutionRowGPU() output back to the texture...
cudaMemcpyToArray() time: 0.067000 msecs; //70426.744514 Mpix/s
Running GPU columns convolution (10 iterations)
Average convolutionColumnsGPU() time: 0.116000 msecs; //40677.518412 Mpix/s
Reading back GPU results...
Checking the results...
...running convolutionRowsCPU()
...running convolutionColumnsCPU()
Relative L2 norm: 0.000000E+00
Shutting down...
Test passed


@ -1,4 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Hello World.
Hello World.


@ -1,30 +0,0 @@
C++ Function Overloading starting...
Device Count: 1
GPU Device 0: "Hopper" with compute capability 9.0
Shared Size: 1024
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 12
PTX Version: 90
Binary Version: 90
simple_kernel(const int *pIn, int *pOut, int a) PASSED
Shared Size: 2048
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 14
PTX Version: 90
Binary Version: 90
simple_kernel(const int2 *pIn, int *pOut, int a) PASSED
Shared Size: 2048
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 14
PTX Version: 90
Binary Version: 90
simple_kernel(const int *pIn1, const int *pIn2, int *pOut, int a) PASSED


@ -1,15 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
step 1: read matrix market format
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx]
sparse matrix A is 900 x 900 with 7744 nonzeros, base=1
step 2: convert CSR(A) to dense matrix
step 3: set right hand side vector (b) to 1
step 4: prepare data on device
step 5: solve A*x = b
timing: cholesky = 0.000789 sec
step 6: evaluate residual
|b - A*x| = 1.278977E-13
|A| = 1.600000E+01
|x| = 2.357708E+01
|b - A*x|/(|A|*|x|) = 3.390413E-16


@ -1,58 +0,0 @@
step 1.1: preparation
step 1.1: read matrix market format
GPU Device 0: "Hopper" with compute capability 9.0
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverRf/lap2D_5pt_n100.mtx]
WARNING: cusolverRf only works for base-0
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=0
step 1.2: set right hand side vector (b) to 1
step 2: reorder the matrix to reduce zero fill-in
Q = symrcm(A) or Q = symamd(A)
step 3: B = Q*A*Q^T
step 4: solve A*x = b by LU(B) in cusolverSp
step 4.1: create opaque info structure
step 4.2: analyze LU(B) to know structure of Q and R, and upper bound for nnz(L+U)
step 4.3: workspace for LU(B)
step 4.4: compute Ppivot*B = L*U
step 4.5: check if the matrix is singular
step 4.6: solve A*x = b
i.e. solve B*(Qx) = Q*b
step 4.7: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16
step 5: extract P, Q, L and U from P*B*Q^T = L*U
L has implicit unit diagonal
nnzL = 671550, nnzU = 681550
step 6: form P*A*Q^T = L*U
step 6.1: P = Plu*Qreroder
step 6.2: Q = Qlu*Qreorder
step 7: create cusolverRf handle
step 8: set parameters for cusolverRf
step 9: assemble P*A*Q = L*U
step 10: analyze to extract parallelism
step 11: import A to cusolverRf
step 12: refactorization
step 13: solve A*x = b
step 14: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 4.320100E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 7.513384E+02
(GPU) |b - A*x|/(|A|*|x|) = 7.187340E-16
===== statistics
nnz(A) = 49600, nnz(L+U) = 1353100, zero fill-in ratio = 27.280242
===== timing profile
reorder A : 0.003304 sec
B = Q*A*Q^T : 0.000761 sec
cusolverSp LU analysis: 0.000188 sec
cusolverSp LU factor : 0.069354 sec
cusolverSp LU solve : 0.001780 sec
cusolverSp LU extract : 0.005654 sec
cusolverRf assemble : 0.002426 sec
cusolverRf reset : 0.000021 sec
cusolverRf refactor : 0.097122 sec
cusolverRf solve : 0.123813 sec


@ -1,38 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LinearSolver/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
if the user choose a reordering by -P=symrcm, -P=symamd or -P=metis
step 2.1: no reordering is chosen, Q = 0:n-1
step 2.2: B = A(Q,Q)
step 3: b(j) = 1 + j/n
step 4: prepare data on device
step 5: solve A*x = b on CPU
step 6: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 5.393685E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 1.136492E+03
(CPU) |b| = 1.999900E+00
(CPU) |b - A*x|/(|A|*|x| + |b|) = 5.931079E-16
step 7: solve A*x = b on GPU
step 8: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.970424E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 1.136492E+03
(GPU) |b| = 1.999900E+00
(GPU) |b - A*x|/(|A|*|x| + |b|) = 2.166745E-16
timing chol: CPU = 0.097956 sec , GPU = 0.103812 sec
show last 10 elements of solution vector (GPU)
consistent result for different reordering and solver
x[9990] = 3.000016E+01
x[9991] = 2.807343E+01
x[9992] = 2.601354E+01
x[9993] = 2.380285E+01
x[9994] = 2.141866E+01
x[9995] = 1.883070E+01
x[9996] = 1.599668E+01
x[9997] = 1.285365E+01
x[9998] = 9.299423E+00
x[9999] = 5.147265E+00


@ -1,24 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LowlevelCholesky/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze chol(A) to know structure of L
step 4: workspace for chol(A)
step 5: compute A = L*L^T
step 6: check if the matrix is singular
step 7: solve A*x = b
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 3.637979E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 6.052497E-16
step 9: create opaque info structure
step 10: analyze chol(A) to know structure of L
step 11: workspace for chol(A)
step 12: compute A = L*L^T
step 13: check if the matrix is singular
step 14: solve A*x = b
(GPU) |b - A*x| = 1.477929E-12
(GPU) |b - A*x|/(|A|*|x|) = 2.458827E-16


@ -1,25 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LowlevelQR/lap2D_5pt_n32.mtx]
step 1: read matrix market format
sparse matrix A is 1024 x 1024 with 3008 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze qr(A) to know structure of L
step 4: workspace for qr(A)
step 5: compute A = L*L^T
step 6: check if the matrix is singular
step 7: solve A*x = b
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 5.329071E-15
(CPU) |A| = 6.000000E+00
(CPU) |x| = 5.000000E-01
(CPU) |b - A*x|/(|A|*|x|) = 1.776357E-15
step 9: create opaque info structure
step 10: analyze qr(A) to know structure of L
step 11: workspace for qr(A)
GPU buffer size = 3751424 bytes
step 12: compute A = L*L^T
step 13: check if the matrix is singular
step 14: solve A*x = b
(GPU) |b - A*x| = 4.218847E-15
(GPU) |b - A*x|/(|A|*|x|) = 1.406282E-15


@ -1,9 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Generic memory compression support is available
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 228 blocks x 1024 threads = 0.084 ms 5.960 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 228 blocks x 1024 threads = 0.345 ms 1.460 TB/s
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


@ -1,8 +0,0 @@
./cudaOpenMP Starting...
number of host CPUs: 32
number of CUDA devices: 1
0: NVIDIA H100 PCIe
---------------------------
CPU thread 0 (of 1) uses CUDA device 0
---------------------------


@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 1.223904 ms
TFLOPS: 112.30


@ -1,32 +0,0 @@
./dct8x8 Starting...
GPU Device 0: "Hopper" with compute capability 9.0
CUDA sample DCT/IDCT implementation
===================================
Loading test image: teapot512.bmp... [512 x 512]... Success
Running Gold 1 (CPU) version... Success
Running Gold 2 (CPU) version... Success
Running CUDA 1 (GPU) version... Success
Running CUDA 2 (GPU) version... 82435.220134 MPix/s //0.003180 ms
Success
Running CUDA short (GPU) version... Success
Dumping result to teapot512_gold1.bmp... Success
Dumping result to teapot512_gold2.bmp... Success
Dumping result to teapot512_cuda1.bmp... Success
Dumping result to teapot512_cuda2.bmp... Success
Dumping result to teapot512_cuda_short.bmp... Success
Processing time (CUDA 1) : 0.021800 ms
Processing time (CUDA 2) : 0.003180 ms
Processing time (CUDA short): 0.033000 ms
PSNR Original <---> CPU(Gold 1) : 32.527462
PSNR Original <---> CPU(Gold 2) : 32.527309
PSNR Original <---> GPU(CUDA 1) : 32.527184
PSNR Original <---> GPU(CUDA 2) : 32.527054
PSNR Original <---> GPU(CUDA short): 32.501888
PSNR CPU(Gold 1) <---> GPU(CUDA 1) : 62.845787
PSNR CPU(Gold 2) <---> GPU(CUDA 2) : 66.982300
PSNR CPU(Gold 2) <---> GPU(CUDA short): 40.958466
Test Summary...
Test passed


@ -1,46 +0,0 @@
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA H100 PCIe"
CUDA Driver Version / Runtime Version 12.0 / 12.0
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81082 MBytes (85021163520 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 233472 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 12.0, NumDevs = 1
Result = PASS


@ -1,43 +0,0 @@
./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA H100 PCIe"
CUDA Driver Version: 12.0
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81082 MBytes (85021163520 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS


@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0
M: 8192 (8 x 1024)
N: 8192 (8 x 1024)
K: 4096 (4 x 1024)
Preparing data for GPU...
Required shared memory size: 68 Kb
Computing using high performance kernel = 0 - compute_dgemm_async_copy
Time: 30.856800 ms
FP64 TFLOPS: 17.82


@ -1,11 +0,0 @@
./dwtHaar1D Starting...
GPU Device 0: "Hopper" with compute capability 9.0
source file = "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/signal.dat"
reference file = "result.dat"
gold file = "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/regression.gold.dat"
Reading signal from "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/signal.dat"
Writing result to "result.dat"
Reading reference result from "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/regression.gold.dat"
Test success!


@ -1,16 +0,0 @@
./dxtc Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Image Loaded '../../../../Samples/5_Domain_Specific/dxtc/data/teapot512_std.ppm', 512 x 512 pixels
Running DXT Compression on 512 x 512 image...
16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid...
dxtc, Throughput = 442.8108 MPixels/s, Time = 0.00059 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64
Checking accuracy...
RMS(reference, result) = 0.000000
Test passed


@ -1,13 +0,0 @@
Starting eigenvalues
GPU Device 0: "Hopper" with compute capability 9.0
Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 1.032310 ms
Average time step 2, one intervals: 1.228451 ms
Average time step 2, mult intervals: 2.694728 ms
Average time TOTAL: 4.970180 ms
Test Succeeded!


@ -1,17 +0,0 @@
./fastWalshTransform Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Initializing data...
...allocating CPU memory
...allocating GPU memory
...generating data
Data length: 8388608; kernel length: 128
Running GPU dyadic convolution using Fast Walsh Transform...
GPU time: 0.751000 ms; GOP/s: 385.362158
Reading back GPU results...
Running straightforward CPU dyadic convolution...
Comparing the results...
Shutting down...
L2 norm: 1.021579E-07
Test passed


@ -1,5 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Result native operators : 644622.000000
Result intrinsics : 644622.000000
&&&& fp16ScalarProduct PASSED


@ -1,11 +0,0 @@
[globalToShmemAsyncCopy] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
MatrixA(1280,1280), MatrixB(1280,1280)
Running kernel = 0 - AsyncCopyMultiStageLargeChunk
Computing result using CUDA Kernel...
done
Performance= 5289.33 GFlop/s, Time= 0.793 msec, Size= 4194304000 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


@ -1,89 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Driver version is: 12.0
Running sample.
================================
Running virtual address reuse example.
Sequential allocations & frees within a single graph enable CUDA to reuse virtual addresses.
Check confirms that d_a and d_b share a virtual address.
FOOTPRINT: 67108864 bytes
Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes
================================
Running physical memory reuse example.
CUDA reuses the same physical memory for allocations from separate graphs when the allocation lifetimes don't overlap.
Creating the graph execs does not reserve any physical memory.
FOOTPRINT: 0 bytes
The first graph launched reserves the memory it needs.
FOOTPRINT: 67108864 bytes
A subsequent launch of the same graph in the same stream reuses the same physical memory. Thus the memory footprint does not grow here.
FOOTPRINT: 67108864 bytes
Subsequent launches of other graphs in the same stream also reuse the physical memory. Thus the memory footprint does not grow here.
01: FOOTPRINT: 67108864 bytes
02: FOOTPRINT: 67108864 bytes
03: FOOTPRINT: 67108864 bytes
04: FOOTPRINT: 67108864 bytes
05: FOOTPRINT: 67108864 bytes
06: FOOTPRINT: 67108864 bytes
07: FOOTPRINT: 67108864 bytes
Check confirms all graphs use a different virtual address.
Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes
================================
Running simultaneous streams example.
Graphs that can run concurrently need separate physical memory. In this example, each graph launched in a separate stream increases the total memory footprint.
When launching a new graph, CUDA may reuse physical memory from a graph whose execution has already finished -- even if the new graph is being launched in a different stream from the completed graph. Therefore, a kernel node is added to the graphs to increase runtime.
Initial footprint:
FOOTPRINT: 0 bytes
Each graph launch in a seperate stream grows the memory footprint:
01: FOOTPRINT: 67108864 bytes
02: FOOTPRINT: 134217728 bytes
03: FOOTPRINT: 201326592 bytes
04: FOOTPRINT: 268435456 bytes
05: FOOTPRINT: 335544320 bytes
06: FOOTPRINT: 402653184 bytes
07: FOOTPRINT: 402653184 bytes
Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes
================================
Running unfreed streams example.
CUDA cannot reuse phyiscal memory from graphs which do not free their allocations.
Despite being launched in the same stream, each graph launch grows the memory footprint. Since the allocation is not freed, CUDA keeps the memory valid for use.
00: FOOTPRINT: 67108864 bytes
01: FOOTPRINT: 134217728 bytes
02: FOOTPRINT: 201326592 bytes
03: FOOTPRINT: 268435456 bytes
04: FOOTPRINT: 335544320 bytes
05: FOOTPRINT: 402653184 bytes
06: FOOTPRINT: 469762048 bytes
07: FOOTPRINT: 536870912 bytes
Trimming does not impact the memory footprint since the un-freed allocations are still holding onto the memory.
FOOTPRINT: 536870912 bytes
Freeing the allocations does not shrink the footprint.
FOOTPRINT: 536870912 bytes
Since the allocations are now freed, trimming does reduce the footprint even when the graph execs are not yet destroyed.
FOOTPRINT: 0 bytes
Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes
================================
Sample complete.
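
The footprint figures above are reported through the CUDA graph memory attributes. A condensed sketch of the mechanism: capture an allocate/use/free sequence into a graph with cudaMallocAsync/cudaFreeAsync, launch it, then query and trim the graph memory pool. The attribute choice, kernel, and sizes below are illustrative assumptions, not the sample's code; the CUDA 12 runtime API is assumed.

    #include <cstdio>

    __global__ void touch(char *p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] = (char)i;
    }

    int main() {
        const size_t bytes = 64 * 1024 * 1024;   // 67108864, as in the log above
        int device = 0;
        cudaSetDevice(device);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture allocate -> use -> free; the allocation becomes a graph-owned node.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        char *buf = nullptr;
        cudaMallocAsync(&buf, bytes, stream);
        touch<<<(unsigned)((bytes + 255) / 256), 256, 0, stream>>>(buf, bytes);
        cudaFreeAsync(buf, stream);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature

        // Physical memory is reserved on first launch and reused by later launches
        // of graphs in the same stream.
        cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        unsigned long long footprint = 0;
        cudaDeviceGetGraphMemAttribute(device, cudaGraphMemAttrReservedMemCurrent,
                                       &footprint);
        printf("graph memory footprint: %llu bytes\n", footprint);

        cudaDeviceGraphMemTrim(device);   // release unused reserved memory
        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        return 0;
    }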


@ -1,34 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Driver version is: 12.0
Setting up sample.
Setup complete.
Running negateSquares in a stream.
Validating negateSquares in a stream...
Validation PASSED!
Running negateSquares in a stream-captured graph.
Validating negateSquares in a stream-captured graph...
Validation PASSED!
Running negateSquares in an explicitly constructed graph.
Check verified that d_negSquare and d_input share a virtual address.
Validating negateSquares in an explicitly constructed graph...
Validation PASSED!
Running negateSquares with d_negSquare freed outside the stream.
Check verified that d_negSquare and d_input share a virtual address.
Validating negateSquares with d_negSquare freed outside the stream...
Validation PASSED!
Running negateSquares with d_negSquare freed outside the graph.
Validating negateSquares with d_negSquare freed outside the graph...
Validation PASSED!
Running negateSquares with d_negSquare freed in a different graph.
Validating negateSquares with d_negSquare freed in a different graph...
Validation PASSED!
Cleaning up sample.
Cleanup complete. Exiting sample.


@ -1,48 +0,0 @@
[[histogram]] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
CUDA device [NVIDIA H100 PCIe] has 114 Multi-Processors, Compute 9.0
Initializing data...
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data
Starting up 64-bin histogram...
Running 64-bin GPU histogram for 67108864 bytes (16 runs)...
histogram64() time (average) : 0.00007 sec, 916944.3386 MB/sec
histogram64, Throughput = 916944.3386 MB/s, Time = 0.00007 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64
Validating GPU results...
...reading back GPU results
...histogram64CPU()
...comparing the results...
...64-bin histograms match
Shutting down 64-bin histogram...
Initializing 256-bin histogram...
Running 256-bin GPU histogram for 67108864 bytes (16 runs)...
histogram256() time (average) : 0.00018 sec, 379951.1088 MB/sec
histogram256, Throughput = 379951.1088 MB/s, Time = 0.00018 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192
Validating GPU results...
...reading back GPU results
...histogram256CPU()
...comparing the results
...256-bin histograms match
Shutting down 256-bin histogram...
Shutting down...
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[histogram] - Test Summary
Test passed
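
The 256-bin case above sustains several hundred GB/s by letting each block accumulate a private histogram in fast shared memory and only merging into the global bins at the end. A simplified kernel along those lines (the sample's kernel is more elaborate; names here are illustrative):

    // 256-bin histogram sketch: per-block shared-memory sub-histograms merged
    // into the global result with atomics.
    __global__ void histogram256Sketch(const unsigned char *data, size_t n,
                                       unsigned int *hist) {
        __shared__ unsigned int s_hist[256];
        for (int i = threadIdx.x; i < 256; i += blockDim.x) s_hist[i] = 0;
        __syncthreads();

        size_t stride = (size_t)blockDim.x * gridDim.x;
        for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
            atomicAdd(&s_hist[data[i]], 1u);
        __syncthreads();

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&hist[i], s_hist[i]);
    }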


@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma
Time: 0.629184 ms
TOPS: 218.44


@ -1,4 +0,0 @@
CUDA inline PTX assembler sample
GPU Device 0: "Hopper" with compute capability 9.0
Test Successful.


@ -1,5 +0,0 @@
CUDA inline PTX assembler sample
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Test Successful.


@ -1,15 +0,0 @@
[Interval Computing] starting ...
GPU Device 0: "Hopper" with compute capability 9.0
> GPU Device has Compute Capabilities SM 9.0
GPU naive implementation
Searching for roots in [0.01, 4]...
Found 2 intervals that may contain the root(s)
i[0] = [0.999655515093009, 1.00011722206639]
i[1] = [1.00011907576551, 1.00044661086269]
Number of equations solved: 65536
Time per equation: 0.616870105266571 us
Check against Host computation...


@ -1,9 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
CPU iterations : 2954
CPU error : 4.988e-03
CPU Processing time: 2525.311035 (ms)
GPU iterations : 2954
GPU error : 4.988e-03
GPU Processing time: 57.967999 (ms)
&&&& jacobiCudaGraphs PASSED


@ -1,7 +0,0 @@
[./lineOfSight] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Line of sight
Average time: 0.020620 ms
Test passed


@ -1,10 +0,0 @@
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4756.03 GFlop/s, Time= 0.028 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
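
The workgroup size of 1024 threads corresponds to 32x32 thread blocks, the classic shared-memory tiled matrix multiply: each block stages tiles of A and B in shared memory and accumulates one output element per thread. A condensed sketch (dimension handling simplified; sizes are assumed to be multiples of the tile width, and names are illustrative):

    #define TILE 32

    // C = A * B, with A of size hA x wA and B of size wA x wB, all row-major.
    __global__ void matrixMulTiled(float *C, const float *A, const float *B,
                                   int wA, int wB) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < wA; t += TILE) {
            As[threadIdx.y][threadIdx.x] = A[row * wA + (t + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * wB + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * wB + col] = acc;
    }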


@ -1,12 +0,0 @@
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
GPU Device 0: "NVIDIA H100 PCIe" with compute capability 9.0
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 10873.05 GFlop/s, Time= 0.018 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


@ -1,11 +0,0 @@
[ matrixMulDrv (Driver API) ]
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Total amount of global memory: 85021163520 bytes
> findModulePath found file at <./matrixMul_kernel64.fatbin>
> initCUDA loading module: <./matrixMul_kernel64.fatbin>
> 32 block size selected
Processing time: 0.058000 (ms)
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


@ -1,6 +0,0 @@
[ matrixMulDynlinkJIT (CUDA dynamic linking) ]
> Device 0: "NVIDIA H100 PCIe" with Compute 9.0 capability
> Compiling CUDA module
> PTX JIT log:
Test run success!


@ -1,9 +0,0 @@
[Matrix Multiply Using CUDA] - Starting...
MatrixA(320,320), MatrixB(640,320)
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Computing result using CUDA Kernel...
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


@ -1,6 +0,0 @@
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> PTX JIT log:
Step 0 done
Process 0: verifying...


@ -1,17 +0,0 @@
./mergeSort Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Initializing GPU merge sort...
Running GPU merge sort...
Time: 1.344000 ms
Reading back GPU merge sort results...
Inspecting the results...
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: stable!
Shutting down...


@ -1,11 +0,0 @@
newdelete Starting...
GPU Device 0: "Hopper" with compute capability 9.0
> Container = Vector test OK
> Container = Vector, using placement new on SMEM buffer test OK
> Container = Vector, with user defined datatype test OK
Test Summary: 3/3 succesfully run


@ -1,56 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Using GPU 0 (NVIDIA H100 PCIe, 114 SMs, 2048 th/SM max, CC 9.0, ECC on)
Decoding images in directory: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/, total 8, batchsize 1
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Total decoding time: 3.19197
Avg decoding time per image: 0.398996
Avg images per sec: 2.50629
Avg decoding time per batch: 0.398996


@ -1,62 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0
Using GPU 0 (NVIDIA H100 PCIe, 114 SMs, 2048 th/SM max, CC 9.0, ECC on)
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img1.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img2.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img3.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img4.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img5.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img6.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img7.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img8.jpg
Total images processed: 8
Total time spent on encoding: 1.9711
Avg time/image: 0.246388
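
Avg time/image is the total encode time divided by the eight images (1.9711 / 8 ≈ 0.246388, presumably seconds). A condensed sketch of the nvJPEG encode path for a device-resident RGB image; the sample itself re-encodes the decoded YUV planes, and the helper name, sampling factor, and omitted error checks are illustrative:

    #include <nvjpeg.h>
    #include <cuda_runtime.h>
    #include <vector>

    // Hypothetical helper: encode a device-resident RGB image into a host JPEG bitstream.
    std::vector<unsigned char> encodeRGB(nvjpegHandle_t handle, const nvjpegImage_t &src,
                                         int width, int height, cudaStream_t stream) {
        nvjpegEncoderState_t  state;
        nvjpegEncoderParams_t params;
        nvjpegEncoderStateCreate(handle, &state, stream);
        nvjpegEncoderParamsCreate(handle, &params, stream);
        nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, stream);  // match the 4:2:0 output above

        nvjpegEncodeImage(handle, state, params, &src, NVJPEG_INPUT_RGB, width, height, stream);

        size_t length = 0;
        nvjpegEncodeRetrieveBitstream(handle, state, nullptr, &length, stream);  // query size first
        std::vector<unsigned char> jpeg(length);
        nvjpegEncodeRetrieveBitstream(handle, state, jpeg.data(), &length, stream);
        cudaStreamSynchronize(stream);

        nvjpegEncoderParamsDestroy(params);
        nvjpegEncoderStateDestroy(state);
        return jpeg;
    }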

View File

@ -1,35 +0,0 @@
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
***NOTE: If a device does not have P2P access to another device, it falls back to a normal memcpy procedure.
So you may see lower bandwidth (GB/s) and unstable latency (us) in those cases.
P2P Connectivity Matrix
D\D 0
0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0
0 1628.72
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0
0 1625.75
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0
0 1668.11
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0
0 1668.39
P2P=Disabled Latency Matrix (us)
GPU 0
0 2.67
CPU 0
0 2.04
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0
0 2.68
CPU 0
0 2.02
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
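
With a single H100 there is no peer pair, so every matrix above is just device 0 copying to itself, and bandwidth is the bytes moved divided by the event-timed copy duration. A rough sketch of that style of measurement; buffer size and repeat count are illustrative, and the sample uses peer copies when a second GPU is present:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 64ull << 20;   // 64 MiB test buffer (illustrative)
        const int    reps  = 100;
        char *src, *dst;
        cudaMalloc(&src, bytes);
        cudaMalloc(&dst, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < reps; i++)
            cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbps = (double)bytes * reps / (ms * 1e-3) / 1e9;
        printf("D2D bandwidth: %.2f GB/s\n", gbps);

        cudaFree(src);
        cudaFree(dst);
        return 0;
    }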

View File

@ -1,15 +0,0 @@
[PTX Just In Time (JIT) Compilation (no-qatest)] - Starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> findModulePath <./ptxjit_kernel64.ptx>
> initCUDA loading module: <./ptxjit_kernel64.ptx>
Loading ptxjit_kernel[] program
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'myKernel' for 'sm_90a'
ptxas info : Function properties for myKernel
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers
info : 0 bytes gmem
info : Function properties for 'myKernel':
info : used 8 registers, 0 stack, 0 bytes smem, 536 bytes cmem[0], 0 bytes lmem
CUDA kernel launched
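
The sample loads a prebuilt PTX file and JIT-compiles it for sm_90a through the CUDA driver's linker before launching myKernel, which is where the ptxas log above comes from. A condensed sketch of that flow, assuming cuInit() and a current context are already set up and omitting error checks and kernel arguments:

    #include <cuda.h>

    // Assumes cuInit() and a current CUDA context; error checking omitted for brevity.
    CUmodule loadPtx(const char *path) {        // e.g. "./ptxjit_kernel64.ptx"
        CUlinkState link;
        void  *cubin     = nullptr;
        size_t cubinSize = 0;

        cuLinkCreate(0, nullptr, nullptr, &link);
        cuLinkAddFile(link, CU_JIT_INPUT_PTX, path, 0, nullptr, nullptr);
        cuLinkComplete(link, &cubin, &cubinSize);   // ptxas runs here, producing the log above

        CUmodule module;
        cuModuleLoadData(&module, cubin);           // copies the linked cubin into the module
        cuLinkDestroy(link);
        return module;
    }

    // CUfunction kernel;
    // cuModuleGetFunction(&kernel, module, "myKernel");
    // cuLaunchKernel(kernel, gridX, 1, 1, blockX, 1, 1, 0, nullptr, args, nullptr);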

View File

@ -1,24 +0,0 @@
./quasirandomGenerator Starting...
Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...
Testing QRNG...
quasirandomGenerator, Throughput = 51.2334 GNumbers/s, Time = 0.00006 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384
Reading GPU results...
Comparing to the CPU results...
L1 norm: 7.275964E-12
Testing inverseCNDgpu()...
quasirandomGenerator-inverse, Throughput = 116.2931 GNumbers/s, Time = 0.00003 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...
Comparing to the CPU results...
L1 norm: 9.439909E-08
Shutting down...
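
The summary line's throughput is simply size divided by time: 3145728 numbers at 51.2334 GNumbers/s works out to about 6.1e-5 s, which the rounded Time = 0.00006 s reflects. The same relation applies to the other sample summaries in this set; a trivial check:

    #include <cstdio>

    int main() {
        // Values taken from the summary line above; Throughput = Size / Time.
        const double numElements = 3145728.0;
        const double seconds     = 6.14e-5;   // ~0.00006 s kernel time
        printf("Throughput = %.4f GNumbers/s\n", numElements / seconds / 1.0e9);   // ~51.2
        return 0;
    }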

View File

@ -1,27 +0,0 @@
./quasirandomGenerator_nvrtc Starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...
Testing QRNG...
quasirandomGenerator, Throughput = 45.0355 GNumbers/s, Time = 0.00007 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384
Reading GPU results...
Comparing to the CPU results...
L1 norm: 7.275964E-12
Testing inverseCNDgpu()...
quasirandomGenerator-inverse, Throughput = 94.7508 GNumbers/s, Time = 0.00003 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...
Comparing to the CPU results...
L1 norm: 9.439909E-08
Shutting down...
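
The _nvrtc variants compile their kernels at run time with NVRTC instead of shipping prebuilt binaries, which is why they report the target device before doing any work. A condensed sketch of that runtime-compilation path; the kernel source string, file name, and architecture option are illustrative and error handling is omitted:

    #include <nvrtc.h>
    #include <cuda.h>
    #include <vector>

    // Hypothetical kernel source; the real samples load their .cu source files from disk.
    static const char *kSrc =
        "extern \"C\" __global__ void inverseCND(float *d, int n) { }";

    CUfunction compileKernel(const char *name) {
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, kSrc, "qrng_kernel.cu", 0, nullptr, nullptr);

        const char *opts[] = {"--gpu-architecture=compute_90"};
        nvrtcCompileProgram(prog, 1, opts);

        size_t ptxSize = 0;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::vector<char> ptx(ptxSize);
        nvrtcGetPTX(prog, ptx.data());
        nvrtcDestroyProgram(&prog);

        CUmodule   module;
        CUfunction fn;
        cuModuleLoadData(&module, ptx.data());   // assumes a current CUDA context
        cuModuleGetFunction(&fn, module, name);
        return fn;
    }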

View File

@ -1,9 +0,0 @@
./radixSortThrust Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Sorting 1048576 32-bit unsigned int keys and values
radixSortThrust, Throughput = 2276.9744 MElements/s, Time = 0.00046 s, Size = 1048576 elements
Test passed
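
The sample wraps Thrust's sort, which typically dispatches to a radix sort for primitive key types; throughput is the 1048576 key/value pairs divided by the 0.46 ms sort time. A minimal sketch of sorting device-resident keys with attached values (sizes and initialization are illustrative):

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/sequence.h>
    #include <cstdlib>

    int main() {
        const int n = 1 << 20;                          // 1048576 elements, as in the run above
        thrust::device_vector<unsigned int> keys(n);
        thrust::device_vector<unsigned int> values(n);
        thrust::sequence(values.begin(), values.end()); // values track original positions

        // Fill keys with host-generated data (illustrative initialization).
        thrust::host_vector<unsigned int> h_keys(n);
        for (int i = 0; i < n; i++) h_keys[i] = rand();
        keys = h_keys;

        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());  // sorts keys, permutes values
        return 0;
    }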

View File

@ -1,18 +0,0 @@
./reduction Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Using Device 0: NVIDIA H100 PCIe
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks
Reduction, Throughput = 49.0089 GB/s, Time = 0.00137 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256
GPU result = 2139353471
CPU result = 2139353471
Test passed
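
The reported bandwidth follows from the data volume: 16777216 ints x 4 bytes ≈ 67.1 MB read in 0.00137 s is about 49 GB/s. The sample benchmarks several increasingly optimized kernels; a minimal sketch of the basic shared-memory tree reduction they start from, assuming a power-of-two block size:

    __global__ void reduceSum(const int *in, int *out, unsigned int n) {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;

        // Each thread loads and adds two elements, then the block reduces in shared memory.
        int v = 0;
        if (i < n)              v += in[i];
        if (i + blockDim.x < n) v += in[i + blockDim.x];
        sdata[tid] = v;
        __syncthreads();

        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];   // per-block partial sums, combined in a later pass
    }

    // Launch sketch: reduceSum<<<numBlocks, 256, 256 * sizeof(int)>>>(d_in, d_partials, n);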

View File

@ -1,14 +0,0 @@
reductionMultiBlockCG Starting...
GPU Device 0: "Hopper" with compute capability 9.0
33554432 elements
numThreads: 1024
numBlocks: 228
Launching SinglePass Multi Block Cooperative Groups kernel
Average time: 0.102750 ms
Bandwidth: 1306.254555 GB/s
GPU result = 1.992401599884
CPU result = 1.992401361465
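
Bandwidth again follows from the data volume: 33554432 floats x 4 bytes = 128 MiB in 0.10275 ms is roughly 1306 GB/s. The single-pass variant relies on a grid-wide synchronization from cooperative groups, which requires a cooperative launch; a bare-bones sketch of that pattern, not the sample's actual kernel:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void twoPhaseKernel(float *partials, float *result) {
        cg::grid_group grid = cg::this_grid();

        // Phase 1: each block writes its partial result (real work omitted).
        if (threadIdx.x == 0) partials[blockIdx.x] = 1.0f;   // placeholder work

        grid.sync();   // every block reaches this point before phase 2 starts

        // Phase 2: block 0 combines the partials (serial here only for brevity).
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            float sum = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; b++) sum += partials[b];
            *result = sum;
        }
    }

    // Must be launched with cudaLaunchCooperativeKernel() so grid.sync() is valid.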

View File

@ -1,19 +0,0 @@
./scalarProd Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Initializing data...
...allocating CPU memory.
...allocating GPU memory.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing GPU kernel...
GPU time: 0.042000 msecs.
Reading back GPU result...
Checking GPU results...
...running CPU scalar product calculation
...comparing the results
Shutting down...
L1 error: 2.745062E-08
Test passed
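
The L1 error compares the GPU scalar products against the CPU reference. A bare-bones sketch of a dot-product kernel in the same spirit; the sample computes many independent scalar products, whereas this reduces a single pair of vectors and assumes a power-of-two block size of 256:

    __global__ void dotProduct(const float *a, const float *b, float *result, int n) {
        __shared__ float partial[256];                    // assumes blockDim.x == 256
        int tid = threadIdx.x;

        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
            sum += a[i] * b[i];
        partial[tid] = sum;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // shared-memory tree reduction
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) atomicAdd(result, partial[0]);      // accumulate per-block results
    }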

View File

@ -1,138 +0,0 @@
./scan Starting...
GPU Device 0: "Hopper" with compute capability 9.0
Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Initializing CUDA-C scan...
*** Running GPU scan for short arrays (100 identical iterations)...
Running scan for 4 elements (1703936 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 8 elements (851968 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 16 elements (425984 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 32 elements (212992 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 64 elements (106496 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 128 elements (53248 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 256 elements (26624 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 512 elements (13312 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 1024 elements (6656 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
scan, Throughput = 35.1769 MElements/s, Time = 0.00003 s, Size = 1024 Elements, NumDevsUsed = 1, Workgroup = 256
***Running GPU scan for large arrays (100 identical iterations)...
Running scan for 2048 elements (3328 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 4096 elements (1664 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 8192 elements (832 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 16384 elements (416 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 32768 elements (208 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 65536 elements (104 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 131072 elements (52 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
Running scan for 262144 elements (26 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match
scan, Throughput = 5146.1328 MElements/s, Time = 0.00005 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256
Shutting down...
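
Every configuration is validated against scanExclusiveHost(), a sequential host reference: in an exclusive scan, output element i is the sum of all inputs before index i. A sketch of what such a reference presumably looks like:

    // Sequential exclusive prefix sum, in the spirit of the scanExclusiveHost() reference above.
    void scanExclusiveHost(unsigned int *dst, const unsigned int *src, unsigned int arrayLength) {
        dst[0] = 0;
        for (unsigned int i = 1; i < arrayLength; i++)
            dst[i] = dst[i - 1] + src[i - 1];
    }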

View File

@ -1,6 +0,0 @@
./segmentationTreeThrust Starting...
GPU Device 0: "Hopper" with compute capability 9.0
* Building segmentation tree... done in 24.6388 (ms)
* Dumping levels for each tree...

View File

@ -1,24 +0,0 @@
Starting shfl_scan
GPU Device 0: "Hopper" with compute capability 9.0
> Detected Compute SM 9.0 hardware with 114 multi-processors
Starting shfl_scan
GPU Device 0: "Hopper" with compute capability 9.0
> Detected Compute SM 9.0 hardware with 114 multi-processors
Computing Simple Sum test
---------------------------------------------------
Initialize test data [1, 1, 1...]
Scan summation for 65536 elements, 256 partial sums
Partial summing 256 elements with 1 blocks of size 256
Test Sum: 65536
Time (ms): 0.021504
65536 elements scanned in 0.021504 ms -> 3047.619141 MegaElements/s
CPU verify result diff (GPUvsCPU) = 0
CPU sum (naive) took 0.017810 ms
Computing Integral Image Test on size 1920 x 1080 synthetic data
---------------------------------------------------
Method: Fast Time (GPU Timer): 0.008032 ms Diff = 0
Method: Vertical Scan Time (GPU Timer): 0.068576 ms
CheckSum: 2073600, (expect 1920x1080=2073600)
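
The sample builds its scan from warp shuffles: each lane pulls a value from the lane `offset` positions below it with __shfl_up_sync, doubling the offset each step, so a 32-element inclusive scan finishes in five steps without shared memory. A minimal per-warp sketch:

    __device__ int warpInclusiveScan(int val) {
        const unsigned int mask = 0xffffffffu;    // full warp participates
        int lane = threadIdx.x & 31;
        for (int offset = 1; offset < 32; offset <<= 1) {
            int up = __shfl_up_sync(mask, val, offset);
            if (lane >= offset) val += up;        // lanes below `offset` keep their value
        }
        return val;                               // lane 31 ends up holding the warp total
    }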

View File

@ -1,6 +0,0 @@
./simpleAWBarrier starting...
GPU Device 0: "Hopper" with compute capability 9.0
Launching normVecByDotProductAWBarrier kernel with numBlocks = 228 blockSize = 576
Result = PASSED
./simpleAWBarrier completed, returned OK
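
The kernel synchronizes its 576-thread blocks with an arrive/wait barrier from libcu++ rather than a plain __syncthreads(), which allows independent work between arriving and waiting. A minimal sketch of the barrier setup, with placeholder work standing in for the sample's dot-product normalization:

    #include <cuda/barrier>

    __global__ void barrierSketch(float *data) {
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        if (threadIdx.x == 0) init(&bar, blockDim.x);   // single-thread, one-time initialization
        __syncthreads();

        data[threadIdx.x] *= 2.0f;                      // phase 1 (placeholder work)

        bar.arrive_and_wait();                          // arrive() / wait() can also be split to overlap work

        data[threadIdx.x] += 1.0f;                      // phase 2 (placeholder work)
    }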

View File

@ -1,16 +0,0 @@
simpleAssert starting...
OS_System_Type.release = 5.4.0-131-generic
OS Info: <#147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022>
GPU Device 0: "Hopper" with compute capability 9.0
Launch kernel to generate assertion failures
-- Begin assert output
-- End assert output
Device assert failed as expected, CUDA error message is: device-side assert triggered
simpleAssert completed, returned OK
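
Device-side assert() behaves like the host version: threads whose predicate is false print a diagnostic, and the next CUDA API call returns cudaErrorAssert, whose message is the "device-side assert triggered" string shown above. A minimal sketch:

    #include <cassert>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void assertKernel(int limit) {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        assert(gid < limit);          // threads with gid >= limit trip the device-side assert
    }

    int main() {
        assertKernel<<<2, 32>>>(40);  // 64 threads, so the upper threads fail the assertion
        cudaError_t err = cudaDeviceSynchronize();
        printf("CUDA error message is: %s\n", cudaGetErrorString(err));
        return 0;
    }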

View File

@ -1,12 +0,0 @@
simpleAssert_nvrtc starting...
Launch kernel to generate assertion failures
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
-- Begin assert output
-- End assert output
Device assert failed as expected

View File

@ -1,5 +0,0 @@
simpleAtomicIntrinsics starting...
GPU Device 0: "Hopper" with compute capability 9.0
Processing time: 2.438000 (ms)
simpleAtomicIntrinsics completed, returned OK
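
The sample exercises the atomic intrinsics on a small array and verifies the results on the host. A bare-bones sketch of a few of those operations, not the sample's exact kernel:

    __global__ void atomicOps(int *data) {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(&data[0], 1);        // data[0] ends up equal to the total thread count
        atomicMax(&data[1], gid);      // data[1] ends up equal to the largest thread index
        atomicCAS(&data[2], -1, gid);  // only the first thread to observe -1 wins
    }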

View File

@ -1,6 +0,0 @@
simpleAtomicIntrinsics_nvrtc starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Processing time: 0.171000 (ms)
simpleAtomicIntrinsics_nvrtc completed, returned OK

Some files were not shown because too many files have changed in this diff.