mirror of
https://github.com/NVIDIA/cuda-samples.git
synced 2025-07-01 20:20:29 +08:00
Remove unused bin/x86_64 directory hierarchy
This commit is contained in:
parent
c70d79cf3b
commit
ab68d58d59
@ -1,36 +0,0 @@
[./BlackScholes] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.048059 msec
Effective memory bandwidth: 1664.634581 GB/s
Gigaoptions per second : 166.463458

BlackScholes, Throughput = 166.4635 GOptions/s, Time = 0.00005 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
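The run above checks a GPU kernel against a CPU reference that both evaluate the Black-Scholes closed form. A minimal stdlib-only sketch of that closed form (not the sample's code; the parameter values below are illustrative, not the sample's inputs):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call_put(s, k, t, r, sigma):
    """Closed-form European call and put prices."""
    d1 = (log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    call = s * norm_cdf(d1) - k * exp(-r * t) * norm_cdf(d2)
    put = k * exp(-r * t) * norm_cdf(-d2) - s * norm_cdf(-d1)
    return call, put

# Illustrative at-the-money option: spot 100, strike 100, 1 year, 5% rate, 20% vol.
call, put = black_scholes_call_put(s=100.0, k=100.0, t=1.0, r=0.05, sigma=0.2)
# Put-call parity holds: call - put == s - k * exp(-r * t)
```

The sample's L1 norm above is the normalized sum of absolute differences between the GPU and CPU evaluations of exactly this formula.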
@ -1,34 +0,0 @@
[./BlackScholes_nvrtc] - Starting...
Initializing data...
...allocating CPU memory for options.
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.047896 msec
Effective memory bandwidth: 1670.268678 GB/s
Gigaoptions per second : 167.026868

BlackScholes, Throughput = 167.0269 GOptions/s, Time = 0.00005 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[./BlackScholes_nvrtc] - Test Summary
Test passed
@ -1,37 +0,0 @@
./FDTD3d Starting...

Set-up, based upon target device GMEM size...
getTargetDeviceGlobalMemSize
cudaGetDeviceCount
GPU Device 0: "Hopper" with compute capability 9.0

cudaGetDeviceProperties
generateRandomData

FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...

fdtdReference...
calloc intermediate
Host FDTD loop
t = 0
t = 1
t = 2
t = 3
t = 4

fdtdReference complete
fdtdGPU...
GPU Device 0: "Hopper" with compute capability 9.0

set block size to 32x16
set grid size to 12x24
GPU FDTD loop
t = 0 launch kernel
t = 1 launch kernel
t = 2 launch kernel
t = 3 launch kernel
t = 4 launch kernel

fdtdGPU complete

CompareData (tolerance 0.000100)...
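The run above applies a symmetric radius-4 finite-difference stencil to a 3D volume for 5 timesteps. A 1D stdlib-only sketch of the same idea; the coefficient values and array size here are illustrative, not the sample's:

```python
RADIUS = 4
# Symmetric stencil: one center coefficient plus one shared coefficient per
# offset 1..RADIUS (values are illustrative, not the sample's).
COEFFS = [0.5, 0.1, 0.05, 0.025, 0.0125]

def apply_stencil(data):
    """One timestep: interior points get the weighted sum; boundaries are kept."""
    out = list(data)
    for i in range(RADIUS, len(data) - RADIUS):
        acc = COEFFS[0] * data[i]
        for r in range(1, RADIUS + 1):
            acc += COEFFS[r] * (data[i - r] + data[i + r])
        out[i] = acc
    return out

field = [0.0] * 16
field[8] = 1.0          # unit impulse
for _ in range(5):      # 5 timesteps, as in the run above
    field = apply_stencil(field)
```

The GPU version tiles this update into thread blocks (the 32x16 block size in the log) and compares against the host loop with the stated tolerance.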
@ -1,9 +0,0 @@
HSOpticalFlow Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Loading "frame10.ppm" ...
Loading "frame11.ppm" ...
Computing optical flow on CPU...
Computing optical flow on GPU...
L1 error : 0.044308
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with inline PRNG)
==========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141480e+00
Expected: 3.141593e+00
Absolute error: 1.127720e-04
Relative error: 3.589644e-05

MonteCarloEstimatePiInlineP, Performance = 746954.33 sims/s, Time = 133.88(ms), NumDevsUsed = 1, Blocksize = 128
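The Monte Carlo Pi runs in this commit estimate π by sampling points in the unit square and counting hits inside the quarter circle, then report absolute and relative error against the true value. A stdlib sketch of that estimator (the seed and structure here are illustrative, not the sample's CURAND-based code):

```python
import random
from math import pi

def estimate_pi(num_sims: int, seed: int = 777) -> float:
    """Fraction of random points inside the unit quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(num_sims)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / num_sims

est = estimate_pi(100_000)
abs_err = abs(est - pi)      # the "Absolute error" line above
rel_err = abs_err / pi       # the "Relative error" line above
```

With 100,000 samples the standard error is about 0.005, which is why the logged estimates land within a few 1e-4 of π and pass the 1e-2 tolerance.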
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with inline QRNG)
==========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141840e+00
Expected: 3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05

MonteCarloEstimatePiInlineQ, Performance = 677644.44 sims/s, Time = 147.57(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with batch PRNG)
=========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.136320e+00
Expected: 3.141593e+00
Absolute error: 5.272627e-03
Relative error: 1.678329e-03

MonteCarloEstimatePiP, Performance = 652941.82 sims/s, Time = 153.15(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with batch QRNG)
=========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141840e+00
Expected: 3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05

MonteCarloEstimatePiQ, Performance = 821146.16 sims/s, Time = 121.78(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,13 +0,0 @@
Monte Carlo Single Asian Option (with PRNG)
===========================================

Pricing option on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000

Spot | Strike | r | sigma | tenor | Call/Put | Value | Expected |
-----------|------------|------------|------------|------------|------------|------------|------------|
40 | 35 | 0.03 | 0.2 | 0.333333 | Call | 5.17083 | 5.16253 |

MonteCarloSingleAsianOptionP, Performance = 824402.28 sims/s, Time = 121.30(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,19 +0,0 @@
./MersenneTwisterGP11213 Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Allocating data for 2400000 samples...
Seeding with 777 ...
Generating random numbers on GPU...


Reading back the results...
Generating random numbers on CPU...

Comparing CPU/GPU random numbers...

Max absolute error: 0.000000E+00
L1 norm: 0.000000E+00

MersenneTwisterGP11213, Throughput = 74.5342 GNumbers/s, Time = 0.00003 s, Size = 2400000 Numbers
Shutting down...
@ -1,29 +0,0 @@
./MonteCarloMultiGPU Starting...

Using single CPU thread for multiple GPUs
MonteCarloMultiGPU
==================
Parallelization method = streamed
Problem scaling = weak
Number of GPUs = 1
Total number of options = 8192
Number of paths = 262144
main(): generating input data...
main(): starting 1 host threads...
main(): GPU statistics, streamed
GPU Device #0: NVIDIA H100 PCIe
Options : 8192
Simulation paths: 262144

Total time (ms.): 5.516000
Note: This is elapsed time for all to compute.
Options per sec.: 1485134.210647
main(): comparing Monte Carlo and Black-Scholes results...
Shutting down...
Test Summary...
L1 norm : 4.869781E-04
Average reserve: 14.607882

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
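The throughput line in the MonteCarloMultiGPU summary above is just option count divided by elapsed time. Reproducing it from the logged figures (8192 options, 5.516 ms; the printed time is rounded, so the match is only to that rounding):

```python
options = 8192
elapsed_ms = 5.516                     # "Total time (ms.)" from the log
options_per_sec = options / (elapsed_ms / 1000.0)
# ~1.485e6, matching the logged "Options per sec.: 1485134.210647"
```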
@ -1,10 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0


TEST#1:
CUDA resize nv12(1920x1080 --> 640x480), batch: 24, average time: 0.024 ms ==> 0.001 ms/frame
CUDA convert nv12(640x480) to bgr(640x480), batch: 24, average time: 0.061 ms ==> 0.003 ms/frame

TEST#2:
CUDA convert nv12(1920x1080) to bgr(1920x1080), batch: 24, average time: 0.405 ms ==> 0.017 ms/frame
CUDA resize bgr(1920x1080 --> 640x480), batch: 24, average time: 0.318 ms ==> 0.013 ms/frame
@ -1,19 +0,0 @@
Sobol Quasi-Random Number Generator Starting...

> number of vectors = 100000
> number of dimensions = 100
GPU Device 0: "Hopper" with compute capability 9.0

Allocating CPU memory...
Allocating GPU memory...
Initializing direction numbers...
Copying direction numbers to device...
Executing QRNG on GPU...
Gsamples/s: 7.51315
Reading results from GPU...

Executing QRNG on CPU...
Gsamples/s: 0.232504
Checking results...
L1-Error: 0
Shutting down...
@ -1,6 +0,0 @@
Starting [./StreamPriorities]...
GPU Device 0: "Hopper" with compute capability 9.0

CUDA stream priority range: LOW: 0 to HIGH: -5
elapsed time of kernels launched to LOW priority stream: 0.661 ms
elapsed time of kernels launched to HI priority stream: 0.523 ms
@ -1,17 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Running ........................................................

Overall Time For matrixMultiplyPerf

Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.210 0.264 0.332 0.014 0.033 0.026 0.037 0.024
16 0.201 0.307 0.489 0.025 0.043 0.035 0.046 0.045
64 0.311 0.381 0.758 0.067 0.084 0.075 0.074 0.063
256 0.545 0.604 1.429 0.323 0.228 0.212 0.197 0.187
1024 1.551 1.444 2.436 1.902 0.831 0.784 0.714 0.728
4096 4.960 4.375 7.863 11.966 3.239 3.179 2.908 2.919
16384 18.911 17.022 29.696 77.375 13.796 13.757 12.874 12.862

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,44 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Executing tasks on host / device
Task [0], thread [0] executing on device (569)
Task [1], thread [1] executing on device (904)
Task [2], thread [2] executing on device (529)
Task [3], thread [3] executing on device (600)
Task [4], thread [0] executing on device (975)
Task [5], thread [2] executing on device (995)
Task [6], thread [1] executing on device (576)
Task [7], thread [3] executing on device (700)
Task [8], thread [0] executing on device (716)
Task [9], thread [2] executing on device (358)
Task [10], thread [3] executing on device (941)
Task [11], thread [1] executing on device (403)
Task [12], thread [0] executing on host (97)
Task [13], thread [2] executing on device (451)
Task [14], thread [1] executing on device (789)
Task [15], thread [0] executing on device (810)
Task [16], thread [2] executing on device (807)
Task [17], thread [3] executing on device (756)
Task [18], thread [0] executing on device (509)
Task [19], thread [1] executing on device (252)
Task [20], thread [2] executing on device (515)
Task [21], thread [3] executing on device (676)
Task [22], thread [0] executing on device (948)
Task [23], thread [1] executing on device (944)
Task [24], thread [3] executing on device (974)
Task [25], thread [2] executing on device (513)
Task [26], thread [0] executing on device (207)
Task [27], thread [1] executing on device (509)
Task [28], thread [2] executing on device (344)
Task [29], thread [3] executing on device (198)
Task [30], thread [0] executing on device (223)
Task [31], thread [1] executing on device (382)
Task [32], thread [3] executing on device (980)
Task [33], thread [2] executing on device (519)
Task [34], thread [0] executing on host (92)
Task [35], thread [1] executing on device (677)
Task [36], thread [2] executing on device (769)
Task [37], thread [0] executing on device (199)
Task [38], thread [3] executing on device (845)
Task [39], thread [0] executing on device (844)
All Done!
@ -1,51 +0,0 @@
[./alignedTypes] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

[NVIDIA H100 PCIe] has 114 MP(s) x 128 (Cores/MP) = 14592 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 1.298781 ms / Copy throughput: 35.853619 GB/s.
TEST OK
uint16...
Avg. time: 0.676656 ms / Copy throughput: 68.817823 GB/s.
TEST OK
RGBA8_misaligned...
Avg. time: 0.371437 ms / Copy throughput: 125.367015 GB/s.
TEST OK
LA32_misaligned...
Avg. time: 0.200531 ms / Copy throughput: 232.213238 GB/s.
TEST OK
RGB32_misaligned...
Avg. time: 0.154500 ms / Copy throughput: 301.398134 GB/s.
TEST OK
RGBA32_misaligned...
Avg. time: 0.124531 ms / Copy throughput: 373.930325 GB/s.
TEST OK
Testing aligned types...
RGBA8...
Avg. time: 0.364031 ms / Copy throughput: 127.917614 GB/s.
TEST OK
I32...
Avg. time: 0.363844 ms / Copy throughput: 127.983539 GB/s.
TEST OK
LA32...
Avg. time: 0.200750 ms / Copy throughput: 231.960205 GB/s.
TEST OK
RGB32...
Avg. time: 0.122375 ms / Copy throughput: 380.518985 GB/s.
TEST OK
RGBA32...
Avg. time: 0.122437 ms / Copy throughput: 380.324735 GB/s.
TEST OK
RGBA32_2...
Avg. time: 0.080563 ms / Copy throughput: 578.010964 GB/s.
TEST OK

[alignedTypes] -> Test Results: 0 Failures
Shutting down...
Test passed
@ -1,7 +0,0 @@
[./asyncAPI] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

CUDA device [NVIDIA H100 PCIe]
time spent executing by the GPU: 5.34
time spent by CPU in CUDA calls: 0.03
CPU executed 55200 iterations while waiting for GPU to finish
@ -1,24 +0,0 @@
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: NVIDIA H100 PCIe
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 27.9

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 25.0

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 1421.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
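Each bandwidth figure in the table above is just transfer size divided by transfer time, in decimal gigabytes. Inverting the logged host-to-device number shows the 32 MB transfer took roughly 1.15 ms:

```python
def bandwidth_gb_s(num_bytes: int, seconds: float) -> float:
    """Bandwidth in GB/s (decimal gigabytes, as the sample reports)."""
    return num_bytes / seconds / 1e9

size = 32_000_000                      # bytes, from the log
h2d_seconds = size / (27.9 * 1e9)      # invert the 27.9 GB/s figure: ~1.15e-3 s
```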
@ -1,59 +0,0 @@
batchCUBLAS Starting...

GPU Device 0: "Hopper" with compute capability 9.0


==== Running single kernels ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00195909 sec GFLOPS=2.14095
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00003910 sec GFLOPS=107.269
@@@@ dgemm test OK

==== Running N=10 without streams ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00016713 sec GFLOPS=250.958
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00144100 sec GFLOPS=29.1069
@@@@ dgemm test OK

==== Running N=10 with streams ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00017214 sec GFLOPS=243.659
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00014997 sec GFLOPS=279.685
@@@@ dgemm test OK

==== Running N=10 batched ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00004101 sec GFLOPS=1022.8
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00004506 sec GFLOPS=930.803
@@@@ dgemm test OK

Test Summary
0 error(s)
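The GFLOPS figures above follow the standard GEMM operation count: 2·m·n·k floating-point operations per matrix product, times the batch count, divided by elapsed time. Reproducing two of the logged figures from their logged times:

```python
def gemm_gflops(m, n, k, elapsed_s, batch=1):
    """GFLOPS for a GEMM: 2*m*n*k flops per product, times the batch count."""
    return 2.0 * m * n * k * batch / elapsed_s / 1e9

single = gemm_gflops(128, 128, 128, 0.00195909)       # logged: GFLOPS=2.14095
batched = gemm_gflops(128, 128, 128, 0.00004101, 10)  # logged: GFLOPS=1022.8
```

The batched run amortizes launch overhead across 10 small GEMMs, which is why it is nearly 500x the single-kernel figure despite identical matrix sizes.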
@ -1,21 +0,0 @@
NPP Library Version 12.0.0
CUDA Driver Version: 12.0
CUDA Runtime Version: 12.0

Input file load succeeded.
teapot_CompressedMarkerLabelsUF_8Way_512x512_32u succeeded, compressed label count is 155332.
Input file load succeeded.
CT_Skull_CompressedMarkerLabelsUF_8Way_512x512_32u succeeded, compressed label count is 414.
Input file load succeeded.
PCB_METAL_CompressedMarkerLabelsUF_8Way_509x335_32u succeeded, compressed label count is 3731.
Input file load succeeded.
PCB2_CompressedMarkerLabelsUF_8Way_1024x683_32u succeeded, compressed label count is 1224.
Input file load succeeded.
PCB_CompressedMarkerLabelsUF_8Way_1280x720_32u succeeded, compressed label count is 1440.


teapot_CompressedMarkerLabelsUFBatch_8Way_512x512_32u succeeded, compressed label count is 155332.
CT_Skull_CompressedMarkerLabelsUFBatch_8Way_512x512_32u succeeded, compressed label count is 414.
PCB_METAL_CompressedMarkerLabelsUFBatch_8Way_509x335_32u succeeded, compressed label count is 3731.
PCB2_CompressedMarkerLabelsUFBatch_8Way_1024x683_32u succeeded, compressed label count is 1222.
PCB_CompressedMarkerLabelsUFBatch_8Way_1280x720_32u succeeded, compressed label count is 1447.
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_bf16gemm_async_copy
Time: 9.149888 ms
TFLOPS: 120.17
@ -1,8 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0


Launching 228 blocks with 1024 threads...

Array size = 102400 Num of Odds = 50945 Sum of Odds = 1272565 Sum of Evens 1233938

...Done.
@ -1,22 +0,0 @@
[./binomialOptions] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Generating input data...
Running GPU binomial tree...
Options count : 1024
Time steps : 2048
binomialOptionsGPU() time: 2.081000 msec
Options per second : 492071.098457
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.220214E-04
CPU binomial vs. Black-Scholes
L1 norm: 2.220922E-04
CPU binomial vs. GPU binomial
L1 norm: 7.997008E-07
Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
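The sample above prices European options on a binomial tree and compares against the Black-Scholes closed form; the small "binomial vs. Black-Scholes" L1 norms show the tree converging. A stdlib sketch of a Cox-Ross-Rubinstein binomial pricer for a European call; the parameters below are illustrative, not the sample's generated inputs:

```python
from math import exp, sqrt

def crr_european_call(s, k, t, r, sigma, steps):
    """Cox-Ross-Rubinstein binomial tree price for a European call."""
    dt = t / steps
    u = exp(sigma * sqrt(dt))          # up factor
    d = 1.0 / u                        # down factor
    p = (exp(r * dt) - d) / (u - d)    # risk-neutral up probability
    disc = exp(-r * dt)
    # Payoffs at expiry, then backward induction to the root node.
    values = [max(s * u**j * d**(steps - j) - k, 0.0) for j in range(steps + 1)]
    for _ in range(steps):
        values = [disc * (p * values[j + 1] + (1.0 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

price = crr_european_call(s=100.0, k=100.0, t=1.0, r=0.05, sigma=0.2, steps=1024)
# Converges toward the Black-Scholes value (~10.45 here) as steps grow.
```

The GPU version evaluates many such trees in parallel, one option per thread block, which is where the ~492k options/second figure comes from.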
@ -1,23 +0,0 @@
[./binomialOptions_nvrtc] - Starting...
Generating input data...
Running GPU binomial tree...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Options count : 1024
Time steps : 2048
binomialOptionsGPU() time: 3021.375000 msec
Options per second : 338.918539
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.216577E-04
CPU binomial vs. Black-Scholes
L1 norm: 9.435265E-05
CPU binomial vs. GPU binomial
L1 norm: 1.513570E-04
Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
@ -1,4 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Read 3223503 byte corpus from ../../../../Samples/0_Introduction/c++11_cuda/warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "../../../../Samples/0_Introduction/c++11_cuda/warandpeace.txt"
@ -1,6 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

GPU device NVIDIA H100 PCIe has compute capabilities (SM 9.0)
Running qsort on 1000000 elements with seed 0, on NVIDIA H100 PCIe
cdpAdvancedQuicksort PASSED
Sorted 1000000 elems in 5.015 ms (199.389 Melems/sec)
@ -1,2 +0,0 @@
Running on GPU 0 (NVIDIA H100 PCIe)
Computing Bezier Lines (CUDA Dynamic Parallelism Version) ... Done!
@ -1,5 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

GPU device NVIDIA H100 PCIe has compute capabilities (SM 9.0)
Launching CDP kernel to build the quadtree
Results: OK
@ -1,23 +0,0 @@
starting Simple Print (CUDA Dynamic Parallelism)
GPU Device 0: "Hopper" with compute capability 9.0

***************************************************************************
The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2

In total 2+8=10 blocks are launched!!! (8 from the GPU)
***************************************************************************

Launching cdp_kernel() with CUDA Dynamic Parallelism:

BLOCK 1 launched by the host
BLOCK 0 launched by the host
| BLOCK 3 launched by thread 0 of block 1
| BLOCK 2 launched by thread 0 of block 1
| BLOCK 4 launched by thread 0 of block 0
| BLOCK 5 launched by thread 0 of block 0
| BLOCK 7 launched by thread 1 of block 0
| BLOCK 6 launched by thread 1 of block 0
| BLOCK 9 launched by thread 1 of block 1
| BLOCK 8 launched by thread 1 of block 1
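The "2+8=10" arithmetic above can be checked with a small recursion: the host launches 2 blocks, and any block whose depth is not the last level has each of its 2 threads launch 2 child blocks (the log shows both threads of both host blocks doing so). A sketch purely to reproduce the count; the function and parameter names are mine, not the sample's:

```python
def total_blocks(launched, threads_per_block, blocks_per_thread, depth, max_depth):
    """Count blocks in a depth-limited dynamic-parallelism launch tree."""
    if depth + 1 >= max_depth:          # last level: no further launches
        return launched
    children = launched * threads_per_block * blocks_per_thread
    return launched + total_blocks(children, threads_per_block,
                                   blocks_per_thread, depth + 1, max_depth)

total = total_blocks(launched=2, threads_per_block=2, blocks_per_thread=2,
                     depth=0, max_depth=2)
# 2 host blocks + 2 blocks * 2 threads * 2 child blocks = 10
```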
@ -1,6 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Initializing data:
Running quicksort on 128 elements
Launching kernel on the GPU
Validating results: OK
@ -1,4 +0,0 @@
CUDA Clock sample
GPU Device 0: "Hopper" with compute capability 9.0

Average clocks/block = 1904.875000
@ -1,5 +0,0 @@
CUDA Clock sample
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Average clocks/block = 1839.750000
@ -1,8 +0,0 @@
[./concurrentKernels] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

> Detected Compute SM 9.0 hardware with 114 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.010s
Test passed
@ -1,13 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Test Summary: Error amount = 0.000000
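The residual column above shows the classic conjugate-gradient signature: roughly an order of magnitude of improvement per iteration on a well-conditioned system. A stdlib-only CG sketch on a small 1D tridiagonal Laplacian (the sample itself uses cuBLAS/cuSPARSE on a larger sparse matrix; sizes and right-hand side here are illustrative):

```python
def matvec(x):
    """Tridiagonal 1D Laplacian: (A x)[i] = 2*x[i] - x[i-1] - x[i+1]."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def cg(b, tol=1e-10, max_iter=200):
    """Conjugate gradient for A x = b with A the Laplacian above."""
    x = [0.0] * len(b)
    r = list(b)                 # residual of the zero initial guess
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, rs_new ** 0.5

x, residual = cg([1.0] * 32)
```

In exact arithmetic CG converges in at most n iterations for an n×n SPD system; in practice, as the log shows, the residual drops fast enough that far fewer are needed for a loose tolerance.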
@ -1,13 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Test Summary: Error amount = 0.000000
@ -1,8 +0,0 @@
Starting [conjugateGradientMultiBlockCG]...
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

GPU Final, residual = 1.600115e-06, kernel execution time = 16.014656 ms
Test Summary: Error amount = 0.000000
&&&& conjugateGradientMultiBlockCG PASSED
@ -1,4 +0,0 @@
Starting [conjugateGradientMultiDeviceCG]...
GPU Device 0: "NVIDIA H100 PCIe" with compute capability 9.0
No two or more GPUs with same architecture capable of concurrentManagedAccess found.
Waiving the sample
@ -1,18 +0,0 @@
conjugateGradientPrecond starting...
GPU Device 0: "Hopper" with compute capability 9.0

GPU selected Device ID = 0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

laplace dimension = 128
Convergence of CG without preconditioning:
iteration = 564, residual = 9.174634e-13
Convergence Test: OK

Convergence of CG using ILU(0) preconditioning:
iteration = 188, residual = 9.084683e-13
Convergence Test: OK

Test Summary:
Counted total of 0 errors
qaerr1 = 0.000005 qaerr2 = 0.000003
@ -1,16 +0,0 @@
Starting [conjugateGradientUM]...
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& conjugateGradientUM PASSED
Test Summary: Error amount = 0.000000, result = SUCCESS
@ -1,41 +0,0 @@
[./convolutionFFT2D] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Testing built-in R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating R2C & C2R FFT plans for 2048 x 2048
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 33613.444604 MPix/s (0.119000 ms)
...reading back GPU convolution results
...running reference CPU convolution
...comparing the results: rel L2 = 9.395370E-08 (max delta = 1.208283E-06)
L2norm Error OK
...shutting down
Testing custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 29197.081461 MPix/s (0.137000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.067915E-07 (max delta = 9.817303E-07)
L2norm Error OK
...shutting down
Testing updated custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 39603.959017 MPix/s (0.101000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.065127E-07 (max delta = 9.817303E-07)
L2norm Error OK
...shutting down
Test Summary: 0 errors
Test passed
@ -1,21 +0,0 @@
[./convolutionSeparable] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Image Width x Height = 3072 x 3072

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU convolution (16 identical iterations)...

convolutionSeparable, Throughput = 74676.0329 MPixels/sec, Time = 0.00013 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

Reading back GPU results...

Checking the results...
...running convolutionRowCPU()
...running convolutionColumnCPU()
...comparing the results
...Relative L2 norm: 0.000000E+00

Shutting down...
Test passed
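The CPU gold path (`convolutionRowCPU()` / `convolutionColumnCPU()`) applies the same 1-D kernel once per axis. A minimal sketch of the row pass, assuming clamp-to-edge borders (the column pass is the same loop with the roles of x and y swapped; names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference CPU pass of a separable convolution: convolve each row of a
// w x h image with a 1-D kernel of radius `r` (length 2*r + 1).
std::vector<float> convolveRows(const std::vector<float> &src,
                                const std::vector<float> &kernel, int w, int h,
                                int r) {
  std::vector<float> dst(src.size(), 0.0f);
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      float sum = 0.0f;
      for (int k = -r; k <= r; ++k) {
        int xs = x + k;
        if (xs < 0) xs = 0;  // clamp-to-edge addressing
        if (xs >= w) xs = w - 1;
        sum += src[y * w + xs] * kernel[k + r];
      }
      dst[y * w + x] = sum;
    }
  return dst;
}
```

A normalized kernel leaves a constant image unchanged, which makes the pass easy to sanity-check.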
@ -1,17 +0,0 @@
[./convolutionTexture] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
Running GPU rows convolution (10 identical iterations)...
Average convolutionRowsGPU() time: 0.117200 msecs; //40261.023178 Mpix/s
Copying convolutionRowGPU() output back to the texture...
cudaMemcpyToArray() time: 0.067000 msecs; //70426.744514 Mpix/s
Running GPU columns convolution (10 iterations)
Average convolutionColumnsGPU() time: 0.116000 msecs; //40677.518412 Mpix/s
Reading back GPU results...
Checking the results...
...running convolutionRowsCPU()
...running convolutionColumnsCPU()
Relative L2 norm: 0.000000E+00
Shutting down...
Test passed
@ -1,4 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Hello World.
Hello World.
@ -1,30 +0,0 @@
C++ Function Overloading starting...
Device Count: 1
GPU Device 0: "Hopper" with compute capability 9.0

Shared Size: 1024
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 12
PTX Version: 90
Binary Version: 90
simple_kernel(const int *pIn, int *pOut, int a) PASSED

Shared Size: 2048
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 14
PTX Version: 90
Binary Version: 90
simple_kernel(const int2 *pIn, int *pOut, int a) PASSED

Shared Size: 2048
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 14
PTX Version: 90
Binary Version: 90
simple_kernel(const int *pIn1, const int *pIn2, int *pOut, int a) PASSED
@ -1,15 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

step 1: read matrix market format
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx]
sparse matrix A is 900 x 900 with 7744 nonzeros, base=1
step 2: convert CSR(A) to dense matrix
step 3: set right hand side vector (b) to 1
step 4: prepare data on device
step 5: solve A*x = b
timing: cholesky = 0.000789 sec
step 6: evaluate residual
|b - A*x| = 1.278977E-13
|A| = 1.600000E+01
|x| = 2.357708E+01
|b - A*x|/(|A|*|x|) = 3.390413E-16
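The residual report above follows the pattern |b - A*x| / (|A| * |x|) in the infinity norm. A dense CPU sketch of that check (illustrative helper name, not the sample's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Relative residual |b - A*x|_inf / (|A|_inf * |x|_inf) for a dense
// row-major n x n matrix; |A|_inf is the maximum absolute row sum.
double relativeResidual(const std::vector<double> &a,
                        const std::vector<double> &x,
                        const std::vector<double> &b, int n) {
  double normR = 0.0, normA = 0.0, normX = 0.0;
  for (int i = 0; i < n; ++i) {
    double ri = b[i], rowSum = 0.0;
    for (int j = 0; j < n; ++j) {
      ri -= a[i * n + j] * x[j];           // accumulate b - A*x
      rowSum += std::fabs(a[i * n + j]);   // accumulate the row sum for |A|
    }
    normR = std::max(normR, std::fabs(ri));
    normA = std::max(normA, rowSum);
    normX = std::max(normX, std::fabs(x[i]));
  }
  return normR / (normA * normX);
}
```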
@ -1,58 +0,0 @@
step 1.1: preparation
step 1.1: read matrix market format
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverRf/lap2D_5pt_n100.mtx]
WARNING: cusolverRf only works for base-0
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=0
step 1.2: set right hand side vector (b) to 1
step 2: reorder the matrix to reduce zero fill-in
Q = symrcm(A) or Q = symamd(A)
step 3: B = Q*A*Q^T
step 4: solve A*x = b by LU(B) in cusolverSp
step 4.1: create opaque info structure
step 4.2: analyze LU(B) to know structure of Q and R, and upper bound for nnz(L+U)
step 4.3: workspace for LU(B)
step 4.4: compute Ppivot*B = L*U
step 4.5: check if the matrix is singular
step 4.6: solve A*x = b
i.e. solve B*(Qx) = Q*b
step 4.7: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16
step 5: extract P, Q, L and U from P*B*Q^T = L*U
L has implicit unit diagonal
nnzL = 671550, nnzU = 681550
step 6: form P*A*Q^T = L*U
step 6.1: P = Plu*Qreroder
step 6.2: Q = Qlu*Qreorder
step 7: create cusolverRf handle
step 8: set parameters for cusolverRf
step 9: assemble P*A*Q = L*U
step 10: analyze to extract parallelism
step 11: import A to cusolverRf
step 12: refactorization
step 13: solve A*x = b
step 14: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 4.320100E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 7.513384E+02
(GPU) |b - A*x|/(|A|*|x|) = 7.187340E-16
===== statistics
nnz(A) = 49600, nnz(L+U) = 1353100, zero fill-in ratio = 27.280242

===== timing profile
reorder A : 0.003304 sec
B = Q*A*Q^T : 0.000761 sec

cusolverSp LU analysis: 0.000188 sec
cusolverSp LU factor : 0.069354 sec
cusolverSp LU solve : 0.001780 sec
cusolverSp LU extract : 0.005654 sec

cusolverRf assemble : 0.002426 sec
cusolverRf reset : 0.000021 sec
cusolverRf refactor : 0.097122 sec
cusolverRf solve : 0.123813 sec
@ -1,38 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LinearSolver/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
if the user choose a reordering by -P=symrcm, -P=symamd or -P=metis
step 2.1: no reordering is chosen, Q = 0:n-1
step 2.2: B = A(Q,Q)
step 3: b(j) = 1 + j/n
step 4: prepare data on device
step 5: solve A*x = b on CPU
step 6: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 5.393685E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 1.136492E+03
(CPU) |b| = 1.999900E+00
(CPU) |b - A*x|/(|A|*|x| + |b|) = 5.931079E-16
step 7: solve A*x = b on GPU
step 8: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.970424E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 1.136492E+03
(GPU) |b| = 1.999900E+00
(GPU) |b - A*x|/(|A|*|x| + |b|) = 2.166745E-16
timing chol: CPU = 0.097956 sec , GPU = 0.103812 sec
show last 10 elements of solution vector (GPU)
consistent result for different reordering and solver
x[9990] = 3.000016E+01
x[9991] = 2.807343E+01
x[9992] = 2.601354E+01
x[9993] = 2.380285E+01
x[9994] = 2.141866E+01
x[9995] = 1.883070E+01
x[9996] = 1.599668E+01
x[9997] = 1.285365E+01
x[9998] = 9.299423E+00
x[9999] = 5.147265E+00
@ -1,24 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LowlevelCholesky/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze chol(A) to know structure of L
step 4: workspace for chol(A)
step 5: compute A = L*L^T
step 6: check if the matrix is singular
step 7: solve A*x = b
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 3.637979E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 6.052497E-16
step 9: create opaque info structure
step 10: analyze chol(A) to know structure of L
step 11: workspace for chol(A)
step 12: compute A = L*L^T
step 13: check if the matrix is singular
step 14: solve A*x = b
(GPU) |b - A*x| = 1.477929E-12
(GPU) |b - A*x|/(|A|*|x|) = 2.458827E-16
@ -1,25 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LowlevelQR/lap2D_5pt_n32.mtx]
step 1: read matrix market format
sparse matrix A is 1024 x 1024 with 3008 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze qr(A) to know structure of L
step 4: workspace for qr(A)
step 5: compute A = L*L^T
step 6: check if the matrix is singular
step 7: solve A*x = b
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 5.329071E-15
(CPU) |A| = 6.000000E+00
(CPU) |x| = 5.000000E-01
(CPU) |b - A*x|/(|A|*|x|) = 1.776357E-15
step 9: create opaque info structure
step 10: analyze qr(A) to know structure of L
step 11: workspace for qr(A)
GPU buffer size = 3751424 bytes
step 12: compute A = L*L^T
step 13: check if the matrix is singular
step 14: solve A*x = b
(GPU) |b - A*x| = 4.218847E-15
(GPU) |b - A*x|/(|A|*|x|) = 1.406282E-15
@ -1,9 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Generic memory compression support is available
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 228 blocks x 1024 threads = 0.084 ms 5.960 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 228 blocks x 1024 threads = 0.345 ms 1.460 TB/s

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
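The TB/s figures above are consistent with saxpy moving three words per element (read x, read y, write y). A sketch of that bookkeeping, assuming this access model:

```cpp
#include <cassert>
#include <cmath>

// Effective bandwidth of y = a*x + y over `bytes` bytes per vector:
// the kernel reads x, reads y, and writes y, so 3 * bytes move in `msec`.
double saxpyBandwidthTBs(double bytes, double msec) {
  double moved = 3.0 * bytes;               // total bytes transferred
  return moved / (msec * 1.0e-3) / 1.0e12;  // bytes/s -> TB/s
}
```

Plugging in the compressible-memory run (167772160 bytes, 0.084 ms) lands near the printed 5.960 TB/s; the small gap comes from the rounded timing.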
@ -1,8 +0,0 @@
./cudaOpenMP Starting...

number of host CPUs: 32
number of CUDA devices: 1
0: NVIDIA H100 PCIe
---------------------------
CPU thread 0 (of 1) uses CUDA device 0
---------------------------
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 1.223904 ms
TFLOPS: 112.30
@ -1,32 +0,0 @@
./dct8x8 Starting...

GPU Device 0: "Hopper" with compute capability 9.0

CUDA sample DCT/IDCT implementation
===================================
Loading test image: teapot512.bmp... [512 x 512]... Success
Running Gold 1 (CPU) version... Success
Running Gold 2 (CPU) version... Success
Running CUDA 1 (GPU) version... Success
Running CUDA 2 (GPU) version... 82435.220134 MPix/s //0.003180 ms
Success
Running CUDA short (GPU) version... Success
Dumping result to teapot512_gold1.bmp... Success
Dumping result to teapot512_gold2.bmp... Success
Dumping result to teapot512_cuda1.bmp... Success
Dumping result to teapot512_cuda2.bmp... Success
Dumping result to teapot512_cuda_short.bmp... Success
Processing time (CUDA 1) : 0.021800 ms
Processing time (CUDA 2) : 0.003180 ms
Processing time (CUDA short): 0.033000 ms
PSNR Original <---> CPU(Gold 1) : 32.527462
PSNR Original <---> CPU(Gold 2) : 32.527309
PSNR Original <---> GPU(CUDA 1) : 32.527184
PSNR Original <---> GPU(CUDA 2) : 32.527054
PSNR Original <---> GPU(CUDA short): 32.501888
PSNR CPU(Gold 1) <---> GPU(CUDA 1) : 62.845787
PSNR CPU(Gold 2) <---> GPU(CUDA 2) : 66.982300
PSNR CPU(Gold 2) <---> GPU(CUDA short): 40.958466

Test Summary...
Test passed
@ -1,46 +0,0 @@
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA H100 PCIe"
CUDA Driver Version / Runtime Version 12.0 / 12.0
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81082 MBytes (85021163520 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 233472 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 12.0, NumDevs = 1
Result = PASS
@ -1,43 +0,0 @@
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA H100 PCIe"
CUDA Driver Version: 12.0
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81082 MBytes (85021163520 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 8192 (8 x 1024)
N: 8192 (8 x 1024)
K: 4096 (4 x 1024)
Preparing data for GPU...
Required shared memory size: 68 Kb
Computing using high performance kernel = 0 - compute_dgemm_async_copy
Time: 30.856800 ms
FP64 TFLOPS: 17.82
@ -1,11 +0,0 @@
./dwtHaar1D Starting...

GPU Device 0: "Hopper" with compute capability 9.0

source file = "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/signal.dat"
reference file = "result.dat"
gold file = "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/regression.gold.dat"
Reading signal from "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/signal.dat"
Writing result to "result.dat"
Reading reference result from "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/regression.gold.dat"
Test success!
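The transform under test is the 1-D Haar DWT: one decomposition level maps each input pair to an average and a difference, both scaled by 1/sqrt(2). A CPU sketch of a single level (illustrative name, not the sample's kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One decomposition level of the 1-D Haar wavelet transform. The first
// half of the output holds approximations, the second half details.
std::vector<float> haarLevel(const std::vector<float> &in) {
  const float invSqrt2 = 1.0f / std::sqrt(2.0f);
  size_t half = in.size() / 2;
  std::vector<float> out(in.size());
  for (size_t i = 0; i < half; ++i) {
    out[i] = (in[2 * i] + in[2 * i + 1]) * invSqrt2;         // approximation
    out[half + i] = (in[2 * i] - in[2 * i + 1]) * invSqrt2;  // detail
  }
  return out;
}
```

A constant signal produces zero detail coefficients, which is a quick correctness check.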
@ -1,16 +0,0 @@
./dxtc Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Image Loaded '../../../../Samples/5_Domain_Specific/dxtc/data/teapot512_std.ppm', 512 x 512 pixels

Running DXT Compression on 512 x 512 image...

16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid...

dxtc, Throughput = 442.8108 MPixels/s, Time = 0.00059 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64

Checking accuracy...
RMS(reference, result) = 0.000000

Test passed
@ -1,13 +0,0 @@
Starting eigenvalues
GPU Device 0: "Hopper" with compute capability 9.0

Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 1.032310 ms
Average time step 2, one intervals: 1.228451 ms
Average time step 2, mult intervals: 2.694728 ms
Average time TOTAL: 4.970180 ms
Test Succeeded!
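The Gerschgorin interval printed above bounds every eigenvalue of the symmetric tridiagonal matrix being bisected. A sketch of computing it, assuming the matrix is given by its diagonal and off-diagonal entries (illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Gerschgorin bound for a symmetric tridiagonal matrix with diagonal
// d[0..n-1] and off-diagonal s[0..n-2]: every eigenvalue lies in
// [min_i(d_i - r_i), max_i(d_i + r_i)], where r_i is the absolute sum
// of row i's off-diagonal entries.
std::pair<double, double> gerschgorinInterval(const std::vector<double> &d,
                                              const std::vector<double> &s) {
  int n = static_cast<int>(d.size());
  double lo = d[0], hi = d[0];
  for (int i = 0; i < n; ++i) {
    double r = 0.0;
    if (i > 0) r += std::fabs(s[i - 1]);
    if (i < n - 1) r += std::fabs(s[i]);
    lo = std::min(lo, d[i] - r);
    hi = std::max(hi, d[i] + r);
  }
  return {lo, hi};
}
```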
@ -1,17 +0,0 @@
./fastWalshTransform Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
...allocating CPU memory
...allocating GPU memory
...generating data
Data length: 8388608; kernel length: 128
Running GPU dyadic convolution using Fast Walsh Transform...
GPU time: 0.751000 ms; GOP/s: 385.362158
Reading back GPU results...
Running straightforward CPU dyadic convolution...
Comparing the results...
Shutting down...
L2 norm: 1.021579E-07
Test passed
@ -1,5 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Result native operators : 644622.000000
Result intrinsics : 644622.000000
&&&& fp16ScalarProduct PASSED
@ -1,11 +0,0 @@
[globalToShmemAsyncCopy] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

MatrixA(1280,1280), MatrixB(1280,1280)
Running kernel = 0 - AsyncCopyMultiStageLargeChunk
Computing result using CUDA Kernel...
done
Performance= 5289.33 GFlop/s, Time= 0.793 msec, Size= 4194304000 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,89 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Driver version is: 12.0
Running sample.
================================
Running virtual address reuse example.
Sequential allocations & frees within a single graph enable CUDA to reuse virtual addresses.

Check confirms that d_a and d_b share a virtual address.
FOOTPRINT: 67108864 bytes

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Running physical memory reuse example.
CUDA reuses the same physical memory for allocations from separate graphs when the allocation lifetimes don't overlap.

Creating the graph execs does not reserve any physical memory.
FOOTPRINT: 0 bytes

The first graph launched reserves the memory it needs.
FOOTPRINT: 67108864 bytes
A subsequent launch of the same graph in the same stream reuses the same physical memory. Thus the memory footprint does not grow here.
FOOTPRINT: 67108864 bytes

Subsequent launches of other graphs in the same stream also reuse the physical memory. Thus the memory footprint does not grow here.
01: FOOTPRINT: 67108864 bytes
02: FOOTPRINT: 67108864 bytes
03: FOOTPRINT: 67108864 bytes
04: FOOTPRINT: 67108864 bytes
05: FOOTPRINT: 67108864 bytes
06: FOOTPRINT: 67108864 bytes
07: FOOTPRINT: 67108864 bytes

Check confirms all graphs use a different virtual address.

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Running simultaneous streams example.
Graphs that can run concurrently need separate physical memory. In this example, each graph launched in a separate stream increases the total memory footprint.

When launching a new graph, CUDA may reuse physical memory from a graph whose execution has already finished -- even if the new graph is being launched in a different stream from the completed graph. Therefore, a kernel node is added to the graphs to increase runtime.

Initial footprint:
FOOTPRINT: 0 bytes

Each graph launch in a seperate stream grows the memory footprint:
01: FOOTPRINT: 67108864 bytes
02: FOOTPRINT: 134217728 bytes
03: FOOTPRINT: 201326592 bytes
04: FOOTPRINT: 268435456 bytes
05: FOOTPRINT: 335544320 bytes
06: FOOTPRINT: 402653184 bytes
07: FOOTPRINT: 402653184 bytes

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Running unfreed streams example.
CUDA cannot reuse phyiscal memory from graphs which do not free their allocations.

Despite being launched in the same stream, each graph launch grows the memory footprint. Since the allocation is not freed, CUDA keeps the memory valid for use.
00: FOOTPRINT: 67108864 bytes
01: FOOTPRINT: 134217728 bytes
02: FOOTPRINT: 201326592 bytes
03: FOOTPRINT: 268435456 bytes
04: FOOTPRINT: 335544320 bytes
05: FOOTPRINT: 402653184 bytes
06: FOOTPRINT: 469762048 bytes
07: FOOTPRINT: 536870912 bytes

Trimming does not impact the memory footprint since the un-freed allocations are still holding onto the memory.
FOOTPRINT: 536870912 bytes

Freeing the allocations does not shrink the footprint.
FOOTPRINT: 536870912 bytes

Since the allocations are now freed, trimming does reduce the footprint even when the graph execs are not yet destroyed.
FOOTPRINT: 0 bytes

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Sample complete.
@ -1,34 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Driver version is: 12.0
Setting up sample.
Setup complete.

Running negateSquares in a stream.
Validating negateSquares in a stream...
Validation PASSED!

Running negateSquares in a stream-captured graph.
Validating negateSquares in a stream-captured graph...
Validation PASSED!

Running negateSquares in an explicitly constructed graph.
Check verified that d_negSquare and d_input share a virtual address.
Validating negateSquares in an explicitly constructed graph...
Validation PASSED!

Running negateSquares with d_negSquare freed outside the stream.
Check verified that d_negSquare and d_input share a virtual address.
Validating negateSquares with d_negSquare freed outside the stream...
Validation PASSED!

Running negateSquares with d_negSquare freed outside the graph.
Validating negateSquares with d_negSquare freed outside the graph...
Validation PASSED!

Running negateSquares with d_negSquare freed in a different graph.
Validating negateSquares with d_negSquare freed in a different graph...
Validation PASSED!

Cleaning up sample.
Cleanup complete. Exiting sample.
@ -1,48 +0,0 @@
[[histogram]] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

CUDA device [NVIDIA H100 PCIe] has 114 Multi-Processors, Compute 9.0
Initializing data...
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data

Starting up 64-bin histogram...

Running 64-bin GPU histogram for 67108864 bytes (16 runs)...

histogram64() time (average) : 0.00007 sec, 916944.3386 MB/sec

histogram64, Throughput = 916944.3386 MB/s, Time = 0.00007 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64

Validating GPU results...
...reading back GPU results
...histogram64CPU()
...comparing the results...
...64-bin histograms match

Shutting down 64-bin histogram...


Initializing 256-bin histogram...
Running 256-bin GPU histogram for 67108864 bytes (16 runs)...

histogram256() time (average) : 0.00018 sec, 379951.1088 MB/sec

histogram256, Throughput = 379951.1088 MB/s, Time = 0.00018 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192

Validating GPU results...
...reading back GPU results
...histogram256CPU()
...comparing the results
...256-bin histograms match

Shutting down 256-bin histogram...


Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

[histogram] - Test Summary
Test passed
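The validation step compares the GPU bins against a CPU pass over the same bytes. A minimal 256-bin reference sketch (illustrative name, not the sample's `histogram256CPU()`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference for a 256-bin byte histogram: one bin per byte value.
std::vector<unsigned int> histogram256Ref(const std::vector<uint8_t> &data) {
  std::vector<unsigned int> hist(256, 0);
  for (uint8_t b : data) ++hist[b];  // count occurrences of each byte
  return hist;
}
```

Summing all bins must reproduce the input length, which is a useful invariant when comparing against a parallel implementation.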
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma
Time: 0.629184 ms
TOPS: 218.44
@ -1,4 +0,0 @@
CUDA inline PTX assembler sample
GPU Device 0: "Hopper" with compute capability 9.0

Test Successful.
@ -1,5 +0,0 @@
CUDA inline PTX assembler sample
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Test Successful.
@ -1,15 +0,0 @@
[Interval Computing] starting ...

GPU Device 0: "Hopper" with compute capability 9.0

> GPU Device has Compute Capabilities SM 9.0

GPU naive implementation
Searching for roots in [0.01, 4]...
Found 2 intervals that may contain the root(s)
i[0] = [0.999655515093009, 1.00011722206639]
i[1] = [1.00011907576551, 1.00044661086269]
Number of equations solved: 65536
Time per equation: 0.616870105266571 us

Check against Host computation...
@ -1,9 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

CPU iterations : 2954
CPU error : 4.988e-03
CPU Processing time: 2525.311035 (ms)
GPU iterations : 2954
GPU error : 4.988e-03
GPU Processing time: 57.967999 (ms)
&&&& jacobiCudaGraphs PASSED
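The output above comes from a Jacobi iterative solver (the sample's point is launching it via CUDA graphs; the numerics are plain Jacobi). A minimal host-side sketch of the iteration itself, on a toy system of our own choosing:

```python
def jacobi(A, b, tol=1e-10, max_iter=10000):
    """Plain Jacobi iteration: x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii."""
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        err = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if err < tol:
            return x, it
    return x, max_iter

# Small diagonally dominant system, so the iteration is guaranteed to converge.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x, iters = jacobi(A, b)
```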
@ -1,7 +0,0 @@
[./lineOfSight] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Line of sight
Average time: 0.020620 ms

Test passed
@ -1,10 +0,0 @@
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4756.03 GFlop/s, Time= 0.028 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
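The "Size= 131072000 Ops" field in this output is consistent with the printed shapes if (as we read it) the sample reports matrices as (width, height): C = A*B then has wB x hA entries, each costing one multiply and one add per element of the shared dimension. A quick check of that arithmetic (the interpretation of the shape tuple is our assumption):

```python
# Check the "Size= 131072000 Ops" field against the printed shapes,
# reading MatrixA(320,320) / MatrixB(640,320) as (width, height).
wA, hA = 320, 320
wB, hB = 640, 320

ops = 2 * wB * hA * wA        # 2 ops per multiply-add, wA-deep dot products
gflops = ops / 1e9 / 0.028e-3 # using the rounded "Time= 0.028 msec"
# gflops comes out near 4681; the logged 4756.03 differs only because the
# printed time is rounded to three decimals.
```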
@ -1,12 +0,0 @@
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

GPU Device 0: "NVIDIA H100 PCIe" with compute capability 9.0

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 10873.05 GFlop/s, Time= 0.018 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,11 +0,0 @@
[ matrixMulDrv (Driver API) ]
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Total amount of global memory: 85021163520 bytes
> findModulePath found file at <./matrixMul_kernel64.fatbin>
> initCUDA loading module: <./matrixMul_kernel64.fatbin>
> 32 block size selected
Processing time: 0.058000 (ms)
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,6 +0,0 @@
[ matrixMulDynlinkJIT (CUDA dynamic linking) ]
> Device 0: "NVIDIA H100 PCIe" with Compute 9.0 capability
> Compiling CUDA module
> PTX JIT log:

Test run success!
@ -1,9 +0,0 @@
[Matrix Multiply Using CUDA] - Starting...
MatrixA(320,320), MatrixB(640,320)
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Computing result using CUDA Kernel...
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,6 +0,0 @@
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> PTX JIT log:

Step 0 done
Process 0: verifying...
@ -1,17 +0,0 @@
./mergeSort Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Allocating and initializing host arrays...

Allocating and initializing CUDA arrays...

Initializing GPU merge sort...
Running GPU merge sort...
Time: 1.344000 ms
Reading back GPU merge sort results...
Inspecting the results...
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: stable!
Shutting down...
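The "...stability property: stable!" check above verifies that equal keys keep their original relative order after the key-value sort. A minimal sketch of a stable merge sort and that stability check, in Python (the data and names are illustrative, not the sample's implementation):

```python
import random

def merge(left, right):
    """Stable merge: on equal keys, take from the left run first."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] <= right[j][0]:   # <= preserves order of equal keys
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(pairs):
    if len(pairs) <= 1:
        return pairs
    mid = len(pairs) // 2
    return merge(merge_sort(pairs[:mid]), merge_sort(pairs[mid:]))

# Keys with many duplicates; values record original positions, so
# stability means runs of equal keys carry strictly ascending values.
random.seed(0)
keys = [random.randrange(8) for _ in range(64)]
pairs = merge_sort(list(zip(keys, range(64))))
```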
@ -1,11 +0,0 @@
newdelete Starting...

GPU Device 0: "Hopper" with compute capability 9.0

> Container = Vector test OK

> Container = Vector, using placement new on SMEM buffer test OK

> Container = Vector, with user defined datatype test OK

Test Summary: 3/3 successfully run
@ -1,56 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using GPU 0 (NVIDIA H100 PCIe, 114 SMs, 2048 th/SM max, CC 9.0, ECC on)
Decoding images in directory: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/, total 8, batchsize 1
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Total decoding time: 3.19197
Avg decoding time per image: 0.398996
Avg images per sec: 2.50629
Avg decoding time per batch: 0.398996
@ -1,62 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using GPU 0 (NVIDIA H100 PCIe, 114 SMs, 2048 th/SM max, CC 9.0, ECC on)
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img1.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img2.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img3.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img4.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img5.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img6.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img7.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img8.jpg
Total images processed: 8
Total time spent on encoding: 1.9711
Avg time/image: 0.246388
@ -1,35 +0,0 @@
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: c1, pciDeviceID: 0, pciDomainID:0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0
     0       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0
     0 1628.72
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0
     0 1625.75
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0
     0 1668.11
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0
     0 1668.39
P2P=Disabled Latency Matrix (us)
   GPU     0
     0   2.67

   CPU     0
     0   2.04
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0
     0   2.68

   CPU     0
     0   2.02

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,15 +0,0 @@
[PTX Just In Time (JIT) Compilation (no-qatest)] - Starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> findModulePath <./ptxjit_kernel64.ptx>
> initCUDA loading module: <./ptxjit_kernel64.ptx>
Loading ptxjit_kernel[] program
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'myKernel' for 'sm_90a'
ptxas info    : Function properties for myKernel
ptxas         . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers
info    : 0 bytes gmem
info    : Function properties for 'myKernel':
info    : used 8 registers, 0 stack, 0 bytes smem, 536 bytes cmem[0], 0 bytes lmem
CUDA kernel launched
@ -1,24 +0,0 @@
./quasirandomGenerator Starting...

Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...

Testing QRNG...

quasirandomGenerator, Throughput = 51.2334 GNumbers/s, Time = 0.00006 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384

Reading GPU results...
Comparing to the CPU results...

L1 norm: 7.275964E-12

Testing inverseCNDgpu()...

quasirandomGenerator-inverse, Throughput = 116.2931 GNumbers/s, Time = 0.00003 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...

Comparing to the CPU results...
L1 norm: 9.439909E-08

Shutting down...
@ -1,27 +0,0 @@
./quasirandomGenerator_nvrtc Starting...

> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...

Testing QRNG...

quasirandomGenerator, Throughput = 45.0355 GNumbers/s, Time = 0.00007 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384

Reading GPU results...
Comparing to the CPU results...

L1 norm: 7.275964E-12

Testing inverseCNDgpu()...

quasirandomGenerator-inverse, Throughput = 94.7508 GNumbers/s, Time = 0.00003 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...

Comparing to the CPU results...
L1 norm: 9.439909E-08

Shutting down...
@ -1,9 +0,0 @@
./radixSortThrust Starting...

GPU Device 0: "Hopper" with compute capability 9.0


Sorting 1048576 32-bit unsigned int keys and values

radixSortThrust, Throughput = 2276.9744 MElements/s, Time = 0.00046 s, Size = 1048576 elements
Test passed
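The sample above sorts 32-bit unsigned keys with attached values using Thrust's radix sort. A minimal sketch of the underlying idea, a least-significant-digit radix sort over one byte per pass (illustrative only, not Thrust's implementation):

```python
import random

def radix_sort_pairs(keys, vals):
    """LSD radix sort of 32-bit unsigned keys, one byte per pass.
    Each pass is a stable bucket distribution, so the whole sort is stable."""
    pairs = list(zip(keys, vals))
    for shift in (0, 8, 16, 24):          # low byte first
        buckets = [[] for _ in range(256)]
        for k, v in pairs:
            buckets[(k >> shift) & 0xFF].append((k, v))
        pairs = [p for b in buckets for p in b]
    return pairs

random.seed(1)
keys = [random.getrandbits(32) for _ in range(1000)]
out = radix_sort_pairs(keys, range(1000))
```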
@ -1,18 +0,0 @@
./reduction Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Using Device 0: NVIDIA H100 PCIe

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

Reduction, Throughput = 49.0089 GB/s, Time = 0.00137 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

GPU result = 2139353471
CPU result = 2139353471

Test passed
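The reduction sample's "GPU result = CPU result" check hinges on a tree-shaped sum producing the same total as a sequential one. A minimal sketch of a pairwise tree reduction (a sequential model of the parallel pattern, not the sample's kernel):

```python
def tree_reduce(xs):
    """Pairwise tree reduction: halve the active element count each step,
    mirroring how a GPU reduction runs log2(n) levels."""
    xs = list(xs)
    while len(xs) > 1:
        half = (len(xs) + 1) // 2
        xs = [xs[i] + (xs[i + half] if i + half < len(xs) else 0)
              for i in range(half)]
    return xs[0]

data = list(range(1000))
total = tree_reduce(data)
```

For integers the tree and sequential sums agree exactly; for floats they can differ in the last bits, which is why samples compare against a tolerance rather than bit equality.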
@ -1,14 +0,0 @@
reductionMultiBlockCG Starting...

GPU Device 0: "Hopper" with compute capability 9.0

33554432 elements
numThreads: 1024
numBlocks: 228

Launching SinglePass Multi Block Cooperative Groups kernel
Average time: 0.102750 ms
Bandwidth: 1306.254555 GB/s

GPU result = 1.992401599884
CPU result = 1.992401361465
@ -1,19 +0,0 @@
./scalarProd Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
...allocating CPU memory.
...allocating GPU memory.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing GPU kernel...
GPU time: 0.042000 msecs.
Reading back GPU result...
Checking GPU results...
..running CPU scalar product calculation
...comparing the results
Shutting down...
L1 error: 2.745062E-08
Test passed
@ -1,138 +0,0 @@
./scan Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Initializing CUDA-C scan...

*** Running GPU scan for short arrays (100 identical iterations)...

Running scan for 4 elements (1703936 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 8 elements (851968 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 16 elements (425984 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 32 elements (212992 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 64 elements (106496 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 128 elements (53248 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 256 elements (26624 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 512 elements (13312 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 1024 elements (6656 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match


scan, Throughput = 35.1769 MElements/s, Time = 0.00003 s, Size = 1024 Elements, NumDevsUsed = 1, Workgroup = 256

***Running GPU scan for large arrays (100 identical iterations)...

Running scan for 2048 elements (3328 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 4096 elements (1664 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 8192 elements (832 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 16384 elements (416 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 32768 elements (208 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 65536 elements (104 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 131072 elements (52 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 262144 elements (26 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match


scan, Throughput = 5146.1328 MElements/s, Time = 0.00005 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

Shutting down...
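Each validation pass above compares the GPU output against a host reference (`scanExclusiveHost()`). An exclusive prefix sum is simple to state: output element i is the sum of all inputs strictly before i. A minimal host-side sketch (names are ours, not the sample's):

```python
def scan_exclusive(arr):
    """Host reference exclusive prefix sum: out[i] = sum(arr[:i]), out[0] = 0."""
    out, running = [], 0
    for x in arr:
        out.append(running)   # value *before* adding the current element
        running += x
    return out

result = scan_exclusive([3, 1, 4, 1, 5])
```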
@ -1,6 +0,0 @@
./segmentationTreeThrust Starting...

GPU Device 0: "Hopper" with compute capability 9.0

* Building segmentation tree... done in 24.6388 (ms)
* Dumping levels for each tree...
@ -1,24 +0,0 @@
Starting shfl_scan
GPU Device 0: "Hopper" with compute capability 9.0

> Detected Compute SM 9.0 hardware with 114 multi-processors
Starting shfl_scan
GPU Device 0: "Hopper" with compute capability 9.0

> Detected Compute SM 9.0 hardware with 114 multi-processors
Computing Simple Sum test
---------------------------------------------------
Initialize test data [1, 1, 1...]
Scan summation for 65536 elements, 256 partial sums
Partial summing 256 elements with 1 blocks of size 256
Test Sum: 65536
Time (ms): 0.021504
65536 elements scanned in 0.021504 ms -> 3047.619141 MegaElements/s
CPU verify result diff (GPUvsCPU) = 0
CPU sum (naive) took 0.017810 ms

Computing Integral Image Test on size 1920 x 1080 synthetic data
---------------------------------------------------
Method: Fast  Time (GPU Timer): 0.008032 ms Diff = 0
Method: Vertical Scan  Time (GPU Timer): 0.068576 ms
CheckSum: 2073600, (expect 1920x1080=2073600)
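The integral-image test above checks that, for a synthetic all-ones image, the final entry of the 2-D inclusive prefix sum equals width*height (1920x1080 = 2073600). A minimal sketch of an integral image as a row scan followed by a column scan, demonstrated at toy scale (the implementation is illustrative, not the sample's shuffle-based kernel):

```python
def integral_image(img):
    """Inclusive 2-D prefix sum: a horizontal scan pass, then a vertical one."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):                 # horizontal pass: scan each row
        for x in range(1, w):
            out[y][x] += out[y][x - 1]
    for y in range(1, h):              # vertical pass: scan each column
        for x in range(w):
            out[y][x] += out[y - 1][x]
    return out

# All-ones 4 x 5 image: bottom-right entry must be 4 * 5 = 20, the
# small-scale analogue of the sample's "CheckSum: 2073600" check.
img = [[1] * 5 for _ in range(4)]
ii = integral_image(img)
```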
@ -1,6 +0,0 @@
./simpleAWBarrier starting...
GPU Device 0: "Hopper" with compute capability 9.0

Launching normVecByDotProductAWBarrier kernel with numBlocks = 228 blockSize = 576
Result = PASSED
./simpleAWBarrier completed, returned OK
@ -1,16 +0,0 @@
simpleAssert starting...
OS_System_Type.release = 5.4.0-131-generic
OS Info: <#147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022>

GPU Device 0: "Hopper" with compute capability 9.0

Launch kernel to generate assertion failures

-- Begin assert output


-- End assert output

Device assert failed as expected, CUDA error message is: device-side assert triggered

simpleAssert completed, returned OK
@ -1,12 +0,0 @@
simpleAssert_nvrtc starting...
Launch kernel to generate assertion failures
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability

-- Begin assert output


-- End assert output

Device assert failed as expected
@ -1,5 +0,0 @@
simpleAtomicIntrinsics starting...
GPU Device 0: "Hopper" with compute capability 9.0

Processing time: 2.438000 (ms)
simpleAtomicIntrinsics completed, returned OK
@ -1,6 +0,0 @@
simpleAtomicIntrinsics_nvrtc starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Processing time: 0.171000 (ms)
simpleAtomicIntrinsics_nvrtc completed, returned OK
Some files were not shown because too many files have changed in this diff.