mirror of
https://github.com/NVIDIA/cuda-samples.git
synced 2025-07-01 20:20:29 +08:00
Remove unused bin/x86_64 directory hierarchy
This commit is contained in:
parent
c70d79cf3b
commit
ab68d58d59
@ -1,36 +0,0 @@
[./BlackScholes] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.048059 msec
Effective memory bandwidth: 1664.634581 GB/s
Gigaoptions per second : 166.463458

BlackScholes, Throughput = 166.4635 GOptions/s, Time = 0.00005 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
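The run above checks a GPU kernel against a CPU reference that both evaluate the Black-Scholes closed form. A minimal stdlib-only sketch of that closed form (not the sample's code; the parameter values below are illustrative, not the sample's inputs):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call_put(s, k, t, r, sigma):
    """Closed-form European call and put prices."""
    d1 = (log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    call = s * norm_cdf(d1) - k * exp(-r * t) * norm_cdf(d2)
    put = k * exp(-r * t) * norm_cdf(-d2) - s * norm_cdf(-d1)
    return call, put

# Illustrative at-the-money option: spot 100, strike 100, 1 year, 5% rate, 20% vol.
call, put = black_scholes_call_put(s=100.0, k=100.0, t=1.0, r=0.05, sigma=0.2)
# Put-call parity holds: call - put == s - k * exp(-r * t)
```

The sample's L1 norm above is the normalized sum of absolute differences between the GPU and CPU evaluations of exactly this formula.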
@ -1,34 +0,0 @@
[./BlackScholes_nvrtc] - Starting...
Initializing data...
...allocating CPU memory for options.
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count : 8000000
BlackScholesGPU() time : 0.047896 msec
Effective memory bandwidth: 1670.268678 GB/s
Gigaoptions per second : 167.026868

BlackScholes, Throughput = 167.0269 GOptions/s, Time = 0.00005 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[./BlackScholes_nvrtc] - Test Summary
Test passed
@ -1,37 +0,0 @@
./FDTD3d Starting...

Set-up, based upon target device GMEM size...
getTargetDeviceGlobalMemSize
cudaGetDeviceCount
GPU Device 0: "Hopper" with compute capability 9.0

cudaGetDeviceProperties
generateRandomData

FDTD on 376 x 376 x 376 volume with symmetric filter radius 4 for 5 timesteps...

fdtdReference...
calloc intermediate
Host FDTD loop
t = 0
t = 1
t = 2
t = 3
t = 4

fdtdReference complete
fdtdGPU...
GPU Device 0: "Hopper" with compute capability 9.0

set block size to 32x16
set grid size to 12x24
GPU FDTD loop
t = 0 launch kernel
t = 1 launch kernel
t = 2 launch kernel
t = 3 launch kernel
t = 4 launch kernel

fdtdGPU complete

CompareData (tolerance 0.000100)...
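The run above applies a symmetric radius-4 finite-difference stencil to a 3D volume for 5 timesteps. A 1D stdlib-only sketch of the same idea; the coefficient values and array size here are illustrative, not the sample's:

```python
RADIUS = 4
# Symmetric stencil: one center coefficient plus one shared coefficient per
# offset 1..RADIUS (values are illustrative, not the sample's).
COEFFS = [0.5, 0.1, 0.05, 0.025, 0.0125]

def apply_stencil(data):
    """One timestep: interior points get the weighted sum; boundaries are kept."""
    out = list(data)
    for i in range(RADIUS, len(data) - RADIUS):
        acc = COEFFS[0] * data[i]
        for r in range(1, RADIUS + 1):
            acc += COEFFS[r] * (data[i - r] + data[i + r])
        out[i] = acc
    return out

field = [0.0] * 16
field[8] = 1.0          # unit impulse
for _ in range(5):      # 5 timesteps, as in the run above
    field = apply_stencil(field)
```

The GPU version tiles this update into thread blocks (the 32x16 block size in the log) and compares against the host loop with the stated tolerance.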
@ -1,9 +0,0 @@
HSOpticalFlow Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Loading "frame10.ppm" ...
Loading "frame11.ppm" ...
Computing optical flow on CPU...
Computing optical flow on GPU...
L1 error : 0.044308
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with inline PRNG)
==========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141480e+00
Expected: 3.141593e+00
Absolute error: 1.127720e-04
Relative error: 3.589644e-05

MonteCarloEstimatePiInlineP, Performance = 746954.33 sims/s, Time = 133.88(ms), NumDevsUsed = 1, Blocksize = 128
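The Monte Carlo Pi runs in this commit estimate π by sampling points in the unit square and counting hits inside the quarter circle, then report absolute and relative error against the true value. A stdlib sketch of that estimator (the seed and structure here are illustrative, not the sample's CURAND-based code):

```python
import random
from math import pi

def estimate_pi(num_sims: int, seed: int = 777) -> float:
    """Fraction of random points inside the unit quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(num_sims)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / num_sims

est = estimate_pi(100_000)
abs_err = abs(est - pi)      # the "Absolute error" line above
rel_err = abs_err / pi       # the "Relative error" line above
```

With 100,000 samples the standard error is about 0.005, which is why the logged estimates land within a few 1e-4 of π and pass the 1e-2 tolerance.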
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with inline QRNG)
==========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141840e+00
Expected: 3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05

MonteCarloEstimatePiInlineQ, Performance = 677644.44 sims/s, Time = 147.57(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with batch PRNG)
=========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.136320e+00
Expected: 3.141593e+00
Absolute error: 5.272627e-03
Relative error: 1.678329e-03

MonteCarloEstimatePiP, Performance = 652941.82 sims/s, Time = 153.15(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,14 +0,0 @@
Monte Carlo Estimate Pi (with batch QRNG)
=========================================

Estimating Pi on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000
Tolerance: 1.000000e-02
GPU result: 3.141840e+00
Expected: 3.141593e+00
Absolute error: 2.472401e-04
Relative error: 7.869895e-05

MonteCarloEstimatePiQ, Performance = 821146.16 sims/s, Time = 121.78(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,13 +0,0 @@
Monte Carlo Single Asian Option (with PRNG)
===========================================

Pricing option on GPU (NVIDIA H100 PCIe)

Precision: single
Number of sims: 100000

Spot | Strike | r | sigma | tenor | Call/Put | Value | Expected |
-----------|------------|------------|------------|------------|------------|------------|------------|
40 | 35 | 0.03 | 0.2 | 0.333333 | Call | 5.17083 | 5.16253 |

MonteCarloSingleAsianOptionP, Performance = 824402.28 sims/s, Time = 121.30(ms), NumDevsUsed = 1, Blocksize = 128
@ -1,19 +0,0 @@
./MersenneTwisterGP11213 Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Allocating data for 2400000 samples...
Seeding with 777 ...
Generating random numbers on GPU...


Reading back the results...
Generating random numbers on CPU...

Comparing CPU/GPU random numbers...

Max absolute error: 0.000000E+00
L1 norm: 0.000000E+00

MersenneTwisterGP11213, Throughput = 74.5342 GNumbers/s, Time = 0.00003 s, Size = 2400000 Numbers
Shutting down...
@ -1,29 +0,0 @@
./MonteCarloMultiGPU Starting...

Using single CPU thread for multiple GPUs
MonteCarloMultiGPU
==================
Parallelization method = streamed
Problem scaling = weak
Number of GPUs = 1
Total number of options = 8192
Number of paths = 262144
main(): generating input data...
main(): starting 1 host threads...
main(): GPU statistics, streamed
GPU Device #0: NVIDIA H100 PCIe
Options : 8192
Simulation paths: 262144

Total time (ms.): 5.516000
Note: This is elapsed time for all to compute.
Options per sec.: 1485134.210647
main(): comparing Monte Carlo and Black-Scholes results...
Shutting down...
Test Summary...
L1 norm : 4.869781E-04
Average reserve: 14.607882

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
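The throughput line in the MonteCarloMultiGPU summary above is just option count divided by elapsed time. Reproducing it from the logged figures (8192 options, 5.516 ms; the printed time is rounded, so the match is only to that rounding):

```python
options = 8192
elapsed_ms = 5.516                     # "Total time (ms.)" from the log
options_per_sec = options / (elapsed_ms / 1000.0)
# ~1.485e6, matching the logged "Options per sec.: 1485134.210647"
```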
@ -1,10 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0


TEST#1:
CUDA resize nv12(1920x1080 --> 640x480), batch: 24, average time: 0.024 ms ==> 0.001 ms/frame
CUDA convert nv12(640x480) to bgr(640x480), batch: 24, average time: 0.061 ms ==> 0.003 ms/frame

TEST#2:
CUDA convert nv12(1920x1080) to bgr(1920x1080), batch: 24, average time: 0.405 ms ==> 0.017 ms/frame
CUDA resize bgr(1920x1080 --> 640x480), batch: 24, average time: 0.318 ms ==> 0.013 ms/frame
@ -1,19 +0,0 @@
Sobol Quasi-Random Number Generator Starting...

> number of vectors = 100000
> number of dimensions = 100
GPU Device 0: "Hopper" with compute capability 9.0

Allocating CPU memory...
Allocating GPU memory...
Initializing direction numbers...
Copying direction numbers to device...
Executing QRNG on GPU...
Gsamples/s: 7.51315
Reading results from GPU...

Executing QRNG on CPU...
Gsamples/s: 0.232504
Checking results...
L1-Error: 0
Shutting down...
@ -1,6 +0,0 @@
Starting [./StreamPriorities]...
GPU Device 0: "Hopper" with compute capability 9.0

CUDA stream priority range: LOW: 0 to HIGH: -5
elapsed time of kernels launched to LOW priority stream: 0.661 ms
elapsed time of kernels launched to HI priority stream: 0.523 ms
@ -1,17 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Running ........................................................

Overall Time For matrixMultiplyPerf

Printing Average of 20 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 0.210 0.264 0.332 0.014 0.033 0.026 0.037 0.024
16 0.201 0.307 0.489 0.025 0.043 0.035 0.046 0.045
64 0.311 0.381 0.758 0.067 0.084 0.075 0.074 0.063
256 0.545 0.604 1.429 0.323 0.228 0.212 0.197 0.187
1024 1.551 1.444 2.436 1.902 0.831 0.784 0.714 0.728
4096 4.960 4.375 7.863 11.966 3.239 3.179 2.908 2.919
16384 18.911 17.022 29.696 77.375 13.796 13.757 12.874 12.862

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,44 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Executing tasks on host / device
Task [0], thread [0] executing on device (569)
Task [1], thread [1] executing on device (904)
Task [2], thread [2] executing on device (529)
Task [3], thread [3] executing on device (600)
Task [4], thread [0] executing on device (975)
Task [5], thread [2] executing on device (995)
Task [6], thread [1] executing on device (576)
Task [7], thread [3] executing on device (700)
Task [8], thread [0] executing on device (716)
Task [9], thread [2] executing on device (358)
Task [10], thread [3] executing on device (941)
Task [11], thread [1] executing on device (403)
Task [12], thread [0] executing on host (97)
Task [13], thread [2] executing on device (451)
Task [14], thread [1] executing on device (789)
Task [15], thread [0] executing on device (810)
Task [16], thread [2] executing on device (807)
Task [17], thread [3] executing on device (756)
Task [18], thread [0] executing on device (509)
Task [19], thread [1] executing on device (252)
Task [20], thread [2] executing on device (515)
Task [21], thread [3] executing on device (676)
Task [22], thread [0] executing on device (948)
Task [23], thread [1] executing on device (944)
Task [24], thread [3] executing on device (974)
Task [25], thread [2] executing on device (513)
Task [26], thread [0] executing on device (207)
Task [27], thread [1] executing on device (509)
Task [28], thread [2] executing on device (344)
Task [29], thread [3] executing on device (198)
Task [30], thread [0] executing on device (223)
Task [31], thread [1] executing on device (382)
Task [32], thread [3] executing on device (980)
Task [33], thread [2] executing on device (519)
Task [34], thread [0] executing on host (92)
Task [35], thread [1] executing on device (677)
Task [36], thread [2] executing on device (769)
Task [37], thread [0] executing on device (199)
Task [38], thread [3] executing on device (845)
Task [39], thread [0] executing on device (844)
All Done!
@ -1,51 +0,0 @@
[./alignedTypes] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

[NVIDIA H100 PCIe] has 114 MP(s) x 128 (Cores/MP) = 14592 (Cores)
> Compute scaling value = 1.00
> Memory Size = 49999872
Allocating memory...
Generating host input data array...
Uploading input data to GPU memory...
Testing misaligned types...
uint8...
Avg. time: 1.298781 ms / Copy throughput: 35.853619 GB/s.
TEST OK
uint16...
Avg. time: 0.676656 ms / Copy throughput: 68.817823 GB/s.
TEST OK
RGBA8_misaligned...
Avg. time: 0.371437 ms / Copy throughput: 125.367015 GB/s.
TEST OK
LA32_misaligned...
Avg. time: 0.200531 ms / Copy throughput: 232.213238 GB/s.
TEST OK
RGB32_misaligned...
Avg. time: 0.154500 ms / Copy throughput: 301.398134 GB/s.
TEST OK
RGBA32_misaligned...
Avg. time: 0.124531 ms / Copy throughput: 373.930325 GB/s.
TEST OK
Testing aligned types...
RGBA8...
Avg. time: 0.364031 ms / Copy throughput: 127.917614 GB/s.
TEST OK
I32...
Avg. time: 0.363844 ms / Copy throughput: 127.983539 GB/s.
TEST OK
LA32...
Avg. time: 0.200750 ms / Copy throughput: 231.960205 GB/s.
TEST OK
RGB32...
Avg. time: 0.122375 ms / Copy throughput: 380.518985 GB/s.
TEST OK
RGBA32...
Avg. time: 0.122437 ms / Copy throughput: 380.324735 GB/s.
TEST OK
RGBA32_2...
Avg. time: 0.080563 ms / Copy throughput: 578.010964 GB/s.
TEST OK

[alignedTypes] -> Test Results: 0 Failures
Shutting down...
Test passed
@ -1,7 +0,0 @@
[./asyncAPI] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

CUDA device [NVIDIA H100 PCIe]
time spent executing by the GPU: 5.34
time spent by CPU in CUDA calls: 0.03
CPU executed 55200 iterations while waiting for GPU to finish
@ -1,24 +0,0 @@
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: NVIDIA H100 PCIe
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 27.9

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 25.0

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 1421.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
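Each bandwidth figure in the table above is just transfer size divided by transfer time, in decimal gigabytes. Inverting the logged host-to-device number shows the 32 MB transfer took roughly 1.15 ms:

```python
def bandwidth_gb_s(num_bytes: int, seconds: float) -> float:
    """Bandwidth in GB/s (decimal gigabytes, as the sample reports)."""
    return num_bytes / seconds / 1e9

size = 32_000_000                      # bytes, from the log
h2d_seconds = size / (27.9 * 1e9)      # invert the 27.9 GB/s figure: ~1.15e-3 s
```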
@ -1,59 +0,0 @@
batchCUBLAS Starting...

GPU Device 0: "Hopper" with compute capability 9.0


==== Running single kernels ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00195909 sec GFLOPS=2.14095
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x0000000000000000, 0) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00003910 sec GFLOPS=107.269
@@@@ dgemm test OK

==== Running N=10 without streams ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00016713 sec GFLOPS=250.958
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00144100 sec GFLOPS=29.1069
@@@@ dgemm test OK

==== Running N=10 with streams ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00017214 sec GFLOPS=243.659
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x0000000000000000, 0)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00014997 sec GFLOPS=279.685
@@@@ dgemm test OK

==== Running N=10 batched ====

Testing sgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0x3f800000, 1) beta= (0xbf800000, -1)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00004101 sec GFLOPS=1022.8
@@@@ sgemm test OK
Testing dgemm
#### args: ta=0 tb=0 m=128 n=128 k=128 alpha = (0xbff0000000000000, -1) beta= (0x4000000000000000, 2)
#### args: lda=128 ldb=128 ldc=128
^^^^ elapsed = 0.00004506 sec GFLOPS=930.803
@@@@ dgemm test OK

Test Summary
0 error(s)
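The GFLOPS figures above follow the standard GEMM operation count: 2·m·n·k floating-point operations per matrix product, times the batch count, divided by elapsed time. Reproducing two of the logged figures from their logged times:

```python
def gemm_gflops(m, n, k, elapsed_s, batch=1):
    """GFLOPS for a GEMM: 2*m*n*k flops per product, times the batch count."""
    return 2.0 * m * n * k * batch / elapsed_s / 1e9

single = gemm_gflops(128, 128, 128, 0.00195909)       # logged: GFLOPS=2.14095
batched = gemm_gflops(128, 128, 128, 0.00004101, 10)  # logged: GFLOPS=1022.8
```

The batched run amortizes launch overhead across 10 small GEMMs, which is why it is nearly 500x the single-kernel figure despite identical matrix sizes.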
@ -1,21 +0,0 @@
NPP Library Version 12.0.0
CUDA Driver Version: 12.0
CUDA Runtime Version: 12.0

Input file load succeeded.
teapot_CompressedMarkerLabelsUF_8Way_512x512_32u succeeded, compressed label count is 155332.
Input file load succeeded.
CT_Skull_CompressedMarkerLabelsUF_8Way_512x512_32u succeeded, compressed label count is 414.
Input file load succeeded.
PCB_METAL_CompressedMarkerLabelsUF_8Way_509x335_32u succeeded, compressed label count is 3731.
Input file load succeeded.
PCB2_CompressedMarkerLabelsUF_8Way_1024x683_32u succeeded, compressed label count is 1224.
Input file load succeeded.
PCB_CompressedMarkerLabelsUF_8Way_1280x720_32u succeeded, compressed label count is 1440.


teapot_CompressedMarkerLabelsUFBatch_8Way_512x512_32u succeeded, compressed label count is 155332.
CT_Skull_CompressedMarkerLabelsUFBatch_8Way_512x512_32u succeeded, compressed label count is 414.
PCB_METAL_CompressedMarkerLabelsUFBatch_8Way_509x335_32u succeeded, compressed label count is 3731.
PCB2_CompressedMarkerLabelsUFBatch_8Way_1024x683_32u succeeded, compressed label count is 1222.
PCB_CompressedMarkerLabelsUFBatch_8Way_1280x720_32u succeeded, compressed label count is 1447.
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_bf16gemm_async_copy
Time: 9.149888 ms
TFLOPS: 120.17
@ -1,8 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0


Launching 228 blocks with 1024 threads...

Array size = 102400 Num of Odds = 50945 Sum of Odds = 1272565 Sum of Evens 1233938

...Done.
@ -1,22 +0,0 @@
[./binomialOptions] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Generating input data...
Running GPU binomial tree...
Options count : 1024
Time steps : 2048
binomialOptionsGPU() time: 2.081000 msec
Options per second : 492071.098457
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.220214E-04
CPU binomial vs. Black-Scholes
L1 norm: 2.220922E-04
CPU binomial vs. GPU binomial
L1 norm: 7.997008E-07
Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
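The sample above prices European options on a binomial tree and compares against the Black-Scholes closed form; the small "binomial vs. Black-Scholes" L1 norms show the tree converging. A stdlib sketch of a Cox-Ross-Rubinstein binomial pricer for a European call; the parameters below are illustrative, not the sample's generated inputs:

```python
from math import exp, sqrt

def crr_european_call(s, k, t, r, sigma, steps):
    """Cox-Ross-Rubinstein binomial tree price for a European call."""
    dt = t / steps
    u = exp(sigma * sqrt(dt))          # up factor
    d = 1.0 / u                        # down factor
    p = (exp(r * dt) - d) / (u - d)    # risk-neutral up probability
    disc = exp(-r * dt)
    # Payoffs at expiry, then backward induction to the root node.
    values = [max(s * u**j * d**(steps - j) - k, 0.0) for j in range(steps + 1)]
    for _ in range(steps):
        values = [disc * (p * values[j + 1] + (1.0 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

price = crr_european_call(s=100.0, k=100.0, t=1.0, r=0.05, sigma=0.2, steps=1024)
# Converges toward the Black-Scholes value (~10.45 here) as steps grow.
```

The GPU version evaluates many such trees in parallel, one option per thread block, which is where the ~492k options/second figure comes from.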
@ -1,23 +0,0 @@
[./binomialOptions_nvrtc] - Starting...
Generating input data...
Running GPU binomial tree...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Options count : 1024
Time steps : 2048
binomialOptionsGPU() time: 3021.375000 msec
Options per second : 338.918539
Running CPU binomial tree...
Comparing the results...
GPU binomial vs. Black-Scholes
L1 norm: 2.216577E-04
CPU binomial vs. Black-Scholes
L1 norm: 9.435265E-05
CPU binomial vs. GPU binomial
L1 norm: 1.513570E-04
Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
@ -1,4 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Read 3223503 byte corpus from ../../../../Samples/0_Introduction/c++11_cuda/warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "../../../../Samples/0_Introduction/c++11_cuda/warandpeace.txt"
@ -1,6 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

GPU device NVIDIA H100 PCIe has compute capabilities (SM 9.0)
Running qsort on 1000000 elements with seed 0, on NVIDIA H100 PCIe
cdpAdvancedQuicksort PASSED
Sorted 1000000 elems in 5.015 ms (199.389 Melems/sec)
@ -1,2 +0,0 @@
Running on GPU 0 (NVIDIA H100 PCIe)
Computing Bezier Lines (CUDA Dynamic Parallelism Version) ... Done!
@ -1,5 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

GPU device NVIDIA H100 PCIe has compute capabilities (SM 9.0)
Launching CDP kernel to build the quadtree
Results: OK
@ -1,23 +0,0 @@
starting Simple Print (CUDA Dynamic Parallelism)
GPU Device 0: "Hopper" with compute capability 9.0

***************************************************************************
The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2

In total 2+8=10 blocks are launched!!! (8 from the GPU)
***************************************************************************

Launching cdp_kernel() with CUDA Dynamic Parallelism:

BLOCK 1 launched by the host
BLOCK 0 launched by the host
| BLOCK 3 launched by thread 0 of block 1
| BLOCK 2 launched by thread 0 of block 1
| BLOCK 4 launched by thread 0 of block 0
| BLOCK 5 launched by thread 0 of block 0
| BLOCK 7 launched by thread 1 of block 0
| BLOCK 6 launched by thread 1 of block 0
| BLOCK 9 launched by thread 1 of block 1
| BLOCK 8 launched by thread 1 of block 1
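The "2+8=10" arithmetic above can be checked with a small recursion: the host launches 2 blocks, and any block whose depth is not the last level has each of its 2 threads launch 2 child blocks (the log shows both threads of both host blocks doing so). A sketch purely to reproduce the count; the function and parameter names are mine, not the sample's:

```python
def total_blocks(launched, threads_per_block, blocks_per_thread, depth, max_depth):
    """Count blocks in a depth-limited dynamic-parallelism launch tree."""
    if depth + 1 >= max_depth:          # last level: no further launches
        return launched
    children = launched * threads_per_block * blocks_per_thread
    return launched + total_blocks(children, threads_per_block,
                                   blocks_per_thread, depth + 1, max_depth)

total = total_blocks(launched=2, threads_per_block=2, blocks_per_thread=2,
                     depth=0, max_depth=2)
# 2 host blocks + 2 blocks * 2 threads * 2 child blocks = 10
```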
@ -1,6 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Initializing data:
Running quicksort on 128 elements
Launching kernel on the GPU
Validating results: OK
@ -1,4 +0,0 @@
CUDA Clock sample
GPU Device 0: "Hopper" with compute capability 9.0

Average clocks/block = 1904.875000
@ -1,5 +0,0 @@
CUDA Clock sample
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Average clocks/block = 1839.750000
@ -1,8 +0,0 @@
[./concurrentKernels] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

> Detected Compute SM 9.0 hardware with 114 multi-processors
Expected time for serial execution of 8 kernels = 0.080s
Expected time for concurrent execution of 8 kernels = 0.010s
Measured time for sample = 0.010s
Test passed
@ -1,13 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Test Summary: Error amount = 0.000000
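The residual column above shows the classic conjugate-gradient signature: roughly an order of magnitude of improvement per iteration on a well-conditioned system. A stdlib-only CG sketch on a small 1D tridiagonal Laplacian (the sample itself uses cuBLAS/cuSPARSE on a larger sparse matrix; sizes and right-hand side here are illustrative):

```python
def matvec(x):
    """Tridiagonal 1D Laplacian: (A x)[i] = 2*x[i] - x[i-1] - x[i+1]."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def cg(b, tol=1e-10, max_iter=200):
    """Conjugate gradient for A x = b with A the Laplacian above."""
    x = [0.0] * len(b)
    r = list(b)                 # residual of the zero initial guess
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, rs_new ** 0.5

x, residual = cg([1.0] * 32)
```

In exact arithmetic CG converges in at most n iterations for an n×n SPD system; in practice, as the log shows, the residual drops fast enough that far fewer are needed for a loose tolerance.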
@ -1,13 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Test Summary: Error amount = 0.000000
@ -1,8 +0,0 @@
Starting [conjugateGradientMultiBlockCG]...
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

GPU Final, residual = 1.600115e-06, kernel execution time = 16.014656 ms
Test Summary: Error amount = 0.000000
&&&& conjugateGradientMultiBlockCG PASSED
@ -1,4 +0,0 @@
Starting [conjugateGradientMultiDeviceCG]...
GPU Device 0: "NVIDIA H100 PCIe" with compute capability 9.0
No two or more GPUs with same architecture capable of concurrentManagedAccess found.
Waiving the sample
@ -1,18 +0,0 @@
conjugateGradientPrecond starting...
GPU Device 0: "Hopper" with compute capability 9.0

GPU selected Device ID = 0
> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

laplace dimension = 128
Convergence of CG without preconditioning:
iteration = 564, residual = 9.174634e-13
Convergence Test: OK

Convergence of CG using ILU(0) preconditioning:
iteration = 188, residual = 9.084683e-13
Convergence Test: OK

Test Summary:
Counted total of 0 errors
qaerr1 = 0.000005 qaerr2 = 0.000003
@ -1,16 +0,0 @@
Starting [conjugateGradientUM]...
GPU Device 0: "Hopper" with compute capability 9.0

> GPU device has 114 Multi-Processors, SM 9.0 compute capabilities

iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846193e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& conjugateGradientUM PASSED
Test Summary: Error amount = 0.000000, result = SUCCESS
@ -1,41 +0,0 @@
[./convolutionFFT2D] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Testing built-in R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating R2C & C2R FFT plans for 2048 x 2048
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 33613.444604 MPix/s (0.119000 ms)
...reading back GPU convolution results
...running reference CPU convolution
...comparing the results: rel L2 = 9.395370E-08 (max delta = 1.208283E-06)
L2norm Error OK
...shutting down
Testing custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 29197.081461 MPix/s (0.137000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.067915E-07 (max delta = 9.817303E-07)
L2norm Error OK
...shutting down
Testing updated custom R2C / C2R FFT-based convolution
...allocating memory
...generating random input data
...creating C2C FFT plan for 2048 x 1024
...uploading to GPU and padding convolution kernel and input data
...transforming convolution kernel
...running GPU FFT convolution: 39603.959017 MPix/s (0.101000 ms)
...reading back GPU FFT results
...running reference CPU convolution
...comparing the results: rel L2 = 1.065127E-07 (max delta = 9.817303E-07)
L2norm Error OK
...shutting down
Test Summary: 0 errors
Test passed
@ -1,21 +0,0 @@
[./convolutionSeparable] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Image Width x Height = 3072 x 3072

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Running GPU convolution (16 identical iterations)...

convolutionSeparable, Throughput = 74676.0329 MPixels/sec, Time = 0.00013 s, Size = 9437184 Pixels, NumDevsUsed = 1, Workgroup = 0

Reading back GPU results...

Checking the results...
...running convolutionRowCPU()
...running convolutionColumnCPU()
...comparing the results
...Relative L2 norm: 0.000000E+00

Shutting down...
Test passed
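The CPU gold path (`convolutionRowCPU()` / `convolutionColumnCPU()`) applies the same 1-D kernel once per axis. A minimal sketch of the row pass, assuming clamp-to-edge borders (the column pass is the same loop with the roles of x and y swapped; names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference CPU pass of a separable convolution: convolve each row of a
// w x h image with a 1-D kernel of radius `r` (length 2*r + 1).
std::vector<float> convolveRows(const std::vector<float> &src,
                                const std::vector<float> &kernel, int w, int h,
                                int r) {
  std::vector<float> dst(src.size(), 0.0f);
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      float sum = 0.0f;
      for (int k = -r; k <= r; ++k) {
        int xs = x + k;
        if (xs < 0) xs = 0;  // clamp-to-edge addressing
        if (xs >= w) xs = w - 1;
        sum += src[y * w + xs] * kernel[k + r];
      }
      dst[y * w + x] = sum;
    }
  return dst;
}
```

A normalized kernel leaves a constant image unchanged, which makes the pass easy to sanity-check.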
@ -1,17 +0,0 @@
[./convolutionTexture] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
Running GPU rows convolution (10 identical iterations)...
Average convolutionRowsGPU() time: 0.117200 msecs; //40261.023178 Mpix/s
Copying convolutionRowGPU() output back to the texture...
cudaMemcpyToArray() time: 0.067000 msecs; //70426.744514 Mpix/s
Running GPU columns convolution (10 iterations)
Average convolutionColumnsGPU() time: 0.116000 msecs; //40677.518412 Mpix/s
Reading back GPU results...
Checking the results...
...running convolutionRowsCPU()
...running convolutionColumnsCPU()
Relative L2 norm: 0.000000E+00
Shutting down...
Test passed
@ -1,4 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Hello World.
Hello World.
@ -1,30 +0,0 @@
C++ Function Overloading starting...
Device Count: 1
GPU Device 0: "Hopper" with compute capability 9.0

Shared Size: 1024
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 12
PTX Version: 90
Binary Version: 90
simple_kernel(const int *pIn, int *pOut, int a) PASSED

Shared Size: 2048
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 14
PTX Version: 90
Binary Version: 90
simple_kernel(const int2 *pIn, int *pOut, int a) PASSED

Shared Size: 2048
Constant Size: 0
Local Size: 0
Max Threads Per Block: 1024
Number of Registers: 14
PTX Version: 90
Binary Version: 90
simple_kernel(const int *pIn1, const int *pIn2, int *pOut, int a) PASSED
@ -1,15 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

step 1: read matrix market format
Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverDn_LinearSolver/gr_900_900_crg.mtx]
sparse matrix A is 900 x 900 with 7744 nonzeros, base=1
step 2: convert CSR(A) to dense matrix
step 3: set right hand side vector (b) to 1
step 4: prepare data on device
step 5: solve A*x = b
timing: cholesky = 0.000789 sec
step 6: evaluate residual
|b - A*x| = 1.278977E-13
|A| = 1.600000E+01
|x| = 2.357708E+01
|b - A*x|/(|A|*|x|) = 3.390413E-16
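The residual report above follows the pattern |b - A*x| / (|A| * |x|) in the infinity norm. A dense CPU sketch of that check (illustrative helper name, not the sample's code):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Relative residual |b - A*x|_inf / (|A|_inf * |x|_inf) for a dense
// row-major n x n matrix; |A|_inf is the maximum absolute row sum.
double relativeResidual(const std::vector<double> &a,
                        const std::vector<double> &x,
                        const std::vector<double> &b, int n) {
  double normR = 0.0, normA = 0.0, normX = 0.0;
  for (int i = 0; i < n; ++i) {
    double ri = b[i], rowSum = 0.0;
    for (int j = 0; j < n; ++j) {
      ri -= a[i * n + j] * x[j];           // accumulate b - A*x
      rowSum += std::fabs(a[i * n + j]);   // accumulate the row sum for |A|
    }
    normR = std::max(normR, std::fabs(ri));
    normA = std::max(normA, rowSum);
    normX = std::max(normX, std::fabs(x[i]));
  }
  return normR / (normA * normX);
}
```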
@ -1,58 +0,0 @@
step 1.1: preparation
step 1.1: read matrix market format
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverRf/lap2D_5pt_n100.mtx]
WARNING: cusolverRf only works for base-0
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=0
step 1.2: set right hand side vector (b) to 1
step 2: reorder the matrix to reduce zero fill-in
Q = symrcm(A) or Q = symamd(A)
step 3: B = Q*A*Q^T
step 4: solve A*x = b by LU(B) in cusolverSp
step 4.1: create opaque info structure
step 4.2: analyze LU(B) to know structure of Q and R, and upper bound for nnz(L+U)
step 4.3: workspace for LU(B)
step 4.4: compute Ppivot*B = L*U
step 4.5: check if the matrix is singular
step 4.6: solve A*x = b
i.e. solve B*(Qx) = Q*b
step 4.7: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 4.547474E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 7.565621E-16
step 5: extract P, Q, L and U from P*B*Q^T = L*U
L has implicit unit diagonal
nnzL = 671550, nnzU = 681550
step 6: form P*A*Q^T = L*U
step 6.1: P = Plu*Qreroder
step 6.2: Q = Qlu*Qreorder
step 7: create cusolverRf handle
step 8: set parameters for cusolverRf
step 9: assemble P*A*Q = L*U
step 10: analyze to extract parallelism
step 11: import A to cusolverRf
step 12: refactorization
step 13: solve A*x = b
step 14: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 4.320100E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 7.513384E+02
(GPU) |b - A*x|/(|A|*|x|) = 7.187340E-16
===== statistics
nnz(A) = 49600, nnz(L+U) = 1353100, zero fill-in ratio = 27.280242

===== timing profile
reorder A : 0.003304 sec
B = Q*A*Q^T : 0.000761 sec

cusolverSp LU analysis: 0.000188 sec
cusolverSp LU factor : 0.069354 sec
cusolverSp LU solve : 0.001780 sec
cusolverSp LU extract : 0.005654 sec

cusolverRf assemble : 0.002426 sec
cusolverRf reset : 0.000021 sec
cusolverRf refactor : 0.097122 sec
cusolverRf solve : 0.123813 sec
@ -1,38 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LinearSolver/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
if the user choose a reordering by -P=symrcm, -P=symamd or -P=metis
step 2.1: no reordering is chosen, Q = 0:n-1
step 2.2: B = A(Q,Q)
step 3: b(j) = 1 + j/n
step 4: prepare data on device
step 5: solve A*x = b on CPU
step 6: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 5.393685E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 1.136492E+03
(CPU) |b| = 1.999900E+00
(CPU) |b - A*x|/(|A|*|x| + |b|) = 5.931079E-16
step 7: solve A*x = b on GPU
step 8: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 1.970424E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 1.136492E+03
(GPU) |b| = 1.999900E+00
(GPU) |b - A*x|/(|A|*|x| + |b|) = 2.166745E-16
timing chol: CPU = 0.097956 sec , GPU = 0.103812 sec
show last 10 elements of solution vector (GPU)
consistent result for different reordering and solver
x[9990] = 3.000016E+01
x[9991] = 2.807343E+01
x[9992] = 2.601354E+01
x[9993] = 2.380285E+01
x[9994] = 2.141866E+01
x[9995] = 1.883070E+01
x[9996] = 1.599668E+01
x[9997] = 1.285365E+01
x[9998] = 9.299423E+00
x[9999] = 5.147265E+00
@ -1,24 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LowlevelCholesky/lap2D_5pt_n100.mtx]
step 1: read matrix market format
sparse matrix A is 10000 x 10000 with 49600 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze chol(A) to know structure of L
step 4: workspace for chol(A)
step 5: compute A = L*L^T
step 6: check if the matrix is singular
step 7: solve A*x = b
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 3.637979E-12
(CPU) |A| = 8.000000E+00
(CPU) |x| = 7.513384E+02
(CPU) |b - A*x|/(|A|*|x|) = 6.052497E-16
step 9: create opaque info structure
step 10: analyze chol(A) to know structure of L
step 11: workspace for chol(A)
step 12: compute A = L*L^T
step 13: check if the matrix is singular
step 14: solve A*x = b
(GPU) |b - A*x| = 1.477929E-12
(GPU) |b - A*x|/(|A|*|x|) = 2.458827E-16
@ -1,25 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using default input file [../../../../Samples/4_CUDA_Libraries/cuSolverSp_LowlevelQR/lap2D_5pt_n32.mtx]
step 1: read matrix market format
sparse matrix A is 1024 x 1024 with 3008 nonzeros, base=1
step 2: create opaque info structure
step 3: analyze qr(A) to know structure of L
step 4: workspace for qr(A)
step 5: compute A = L*L^T
step 6: check if the matrix is singular
step 7: solve A*x = b
step 8: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 5.329071E-15
(CPU) |A| = 6.000000E+00
(CPU) |x| = 5.000000E-01
(CPU) |b - A*x|/(|A|*|x|) = 1.776357E-15
step 9: create opaque info structure
step 10: analyze qr(A) to know structure of L
step 11: workspace for qr(A)
GPU buffer size = 3751424 bytes
step 12: compute A = L*L^T
step 13: check if the matrix is singular
step 14: solve A*x = b
(GPU) |b - A*x| = 4.218847E-15
(GPU) |b - A*x|/(|A|*|x|) = 1.406282E-15
@ -1,9 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Generic memory compression support is available
Running saxpy on 167772160 bytes of Compressible memory
Running saxpy with 228 blocks x 1024 threads = 0.084 ms 5.960 TB/s
Running saxpy on 167772160 bytes of Non-Compressible memory
Running saxpy with 228 blocks x 1024 threads = 0.345 ms 1.460 TB/s

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
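The TB/s figures above are consistent with saxpy moving three words per element (read x, read y, write y). A sketch of that bookkeeping, assuming this access model:

```cpp
#include <cassert>
#include <cmath>

// Effective bandwidth of y = a*x + y over `bytes` bytes per vector:
// the kernel reads x, reads y, and writes y, so 3 * bytes move in `msec`.
double saxpyBandwidthTBs(double bytes, double msec) {
  double moved = 3.0 * bytes;               // total bytes transferred
  return moved / (msec * 1.0e-3) / 1.0e12;  // bytes/s -> TB/s
}
```

Plugging in the compressible-memory run (167772160 bytes, 0.084 ms) lands near the printed 5.960 TB/s; the small gap comes from the rounded timing.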
@ -1,8 +0,0 @@
./cudaOpenMP Starting...

number of host CPUs: 32
number of CUDA devices: 1
0: NVIDIA H100 PCIe
---------------------------
CPU thread 0 (of 1) uses CUDA device 0
---------------------------
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 1.223904 ms
TFLOPS: 112.30
@ -1,32 +0,0 @@
./dct8x8 Starting...

GPU Device 0: "Hopper" with compute capability 9.0

CUDA sample DCT/IDCT implementation
===================================
Loading test image: teapot512.bmp... [512 x 512]... Success
Running Gold 1 (CPU) version... Success
Running Gold 2 (CPU) version... Success
Running CUDA 1 (GPU) version... Success
Running CUDA 2 (GPU) version... 82435.220134 MPix/s //0.003180 ms
Success
Running CUDA short (GPU) version... Success
Dumping result to teapot512_gold1.bmp... Success
Dumping result to teapot512_gold2.bmp... Success
Dumping result to teapot512_cuda1.bmp... Success
Dumping result to teapot512_cuda2.bmp... Success
Dumping result to teapot512_cuda_short.bmp... Success
Processing time (CUDA 1) : 0.021800 ms
Processing time (CUDA 2) : 0.003180 ms
Processing time (CUDA short): 0.033000 ms
PSNR Original <---> CPU(Gold 1) : 32.527462
PSNR Original <---> CPU(Gold 2) : 32.527309
PSNR Original <---> GPU(CUDA 1) : 32.527184
PSNR Original <---> GPU(CUDA 2) : 32.527054
PSNR Original <---> GPU(CUDA short): 32.501888
PSNR CPU(Gold 1) <---> GPU(CUDA 1) : 62.845787
PSNR CPU(Gold 2) <---> GPU(CUDA 2) : 66.982300
PSNR CPU(Gold 2) <---> GPU(CUDA short): 40.958466

Test Summary...
Test passed
@ -1,46 +0,0 @@
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA H100 PCIe"
CUDA Driver Version / Runtime Version 12.0 / 12.0
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81082 MBytes (85021163520 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 233472 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.0, CUDA Runtime Version = 12.0, NumDevs = 1
Result = PASS
@ -1,43 +0,0 @@
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA H100 PCIe"
CUDA Driver Version: 12.0
CUDA Capability Major/Minor version number: 9.0
Total amount of global memory: 81082 MBytes (85021163520 bytes)
(114) Multiprocessors, (128) CUDA Cores/MP: 14592 CUDA Cores
GPU Max Clock rate: 1650 MHz (1.65 GHz)
Memory Clock rate: 1593 Mhz
Memory Bus Width: 5120-bit
L2 Cache Size: 52428800 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 193 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 8192 (8 x 1024)
N: 8192 (8 x 1024)
K: 4096 (4 x 1024)
Preparing data for GPU...
Required shared memory size: 68 Kb
Computing using high performance kernel = 0 - compute_dgemm_async_copy
Time: 30.856800 ms
FP64 TFLOPS: 17.82
@ -1,11 +0,0 @@
./dwtHaar1D Starting...

GPU Device 0: "Hopper" with compute capability 9.0

source file = "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/signal.dat"
reference file = "result.dat"
gold file = "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/regression.gold.dat"
Reading signal from "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/signal.dat"
Writing result to "result.dat"
Reading reference result from "../../../../Samples/5_Domain_Specific/dwtHaar1D/data/regression.gold.dat"
Test success!
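The transform under test is the 1-D Haar DWT: one decomposition level maps each input pair to an average and a difference, both scaled by 1/sqrt(2). A CPU sketch of a single level (illustrative name, not the sample's kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One decomposition level of the 1-D Haar wavelet transform. The first
// half of the output holds approximations, the second half details.
std::vector<float> haarLevel(const std::vector<float> &in) {
  const float invSqrt2 = 1.0f / std::sqrt(2.0f);
  size_t half = in.size() / 2;
  std::vector<float> out(in.size());
  for (size_t i = 0; i < half; ++i) {
    out[i] = (in[2 * i] + in[2 * i + 1]) * invSqrt2;         // approximation
    out[half + i] = (in[2 * i] - in[2 * i + 1]) * invSqrt2;  // detail
  }
  return out;
}
```

A constant signal produces zero detail coefficients, which is a quick correctness check.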
@ -1,16 +0,0 @@
./dxtc Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Image Loaded '../../../../Samples/5_Domain_Specific/dxtc/data/teapot512_std.ppm', 512 x 512 pixels

Running DXT Compression on 512 x 512 image...

16384 Blocks, 64 Threads per Block, 1048576 Threads in Grid...

dxtc, Throughput = 442.8108 MPixels/s, Time = 0.00059 s, Size = 262144 Pixels, NumDevsUsed = 1, Workgroup = 64

Checking accuracy...
RMS(reference, result) = 0.000000

Test passed
@ -1,13 +0,0 @@
Starting eigenvalues
GPU Device 0: "Hopper" with compute capability 9.0

Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 1.032310 ms
Average time step 2, one intervals: 1.228451 ms
Average time step 2, mult intervals: 2.694728 ms
Average time TOTAL: 4.970180 ms
Test Succeeded!
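The Gerschgorin interval printed above bounds every eigenvalue of the symmetric tridiagonal matrix being bisected. A sketch of computing it, assuming the matrix is given by its diagonal and off-diagonal entries (illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Gerschgorin bound for a symmetric tridiagonal matrix with diagonal
// d[0..n-1] and off-diagonal s[0..n-2]: every eigenvalue lies in
// [min_i(d_i - r_i), max_i(d_i + r_i)], where r_i is the absolute sum
// of row i's off-diagonal entries.
std::pair<double, double> gerschgorinInterval(const std::vector<double> &d,
                                              const std::vector<double> &s) {
  int n = static_cast<int>(d.size());
  double lo = d[0], hi = d[0];
  for (int i = 0; i < n; ++i) {
    double r = 0.0;
    if (i > 0) r += std::fabs(s[i - 1]);
    if (i < n - 1) r += std::fabs(s[i]);
    lo = std::min(lo, d[i] - r);
    hi = std::max(hi, d[i] + r);
  }
  return {lo, hi};
}
```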
@ -1,17 +0,0 @@
./fastWalshTransform Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
...allocating CPU memory
...allocating GPU memory
...generating data
Data length: 8388608; kernel length: 128
Running GPU dyadic convolution using Fast Walsh Transform...
GPU time: 0.751000 ms; GOP/s: 385.362158
Reading back GPU results...
Running straightforward CPU dyadic convolution...
Comparing the results...
Shutting down...
L2 norm: 1.021579E-07
Test passed
@ -1,5 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Result native operators : 644622.000000
Result intrinsics : 644622.000000
&&&& fp16ScalarProduct PASSED
@ -1,11 +0,0 @@
[globalToShmemAsyncCopy] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

MatrixA(1280,1280), MatrixB(1280,1280)
Running kernel = 0 - AsyncCopyMultiStageLargeChunk
Computing result using CUDA Kernel...
done
Performance= 5289.33 GFlop/s, Time= 0.793 msec, Size= 4194304000 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,89 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Driver version is: 12.0
Running sample.
================================
Running virtual address reuse example.
Sequential allocations & frees within a single graph enable CUDA to reuse virtual addresses.

Check confirms that d_a and d_b share a virtual address.
FOOTPRINT: 67108864 bytes

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Running physical memory reuse example.
CUDA reuses the same physical memory for allocations from separate graphs when the allocation lifetimes don't overlap.

Creating the graph execs does not reserve any physical memory.
FOOTPRINT: 0 bytes

The first graph launched reserves the memory it needs.
FOOTPRINT: 67108864 bytes
A subsequent launch of the same graph in the same stream reuses the same physical memory. Thus the memory footprint does not grow here.
FOOTPRINT: 67108864 bytes

Subsequent launches of other graphs in the same stream also reuse the physical memory. Thus the memory footprint does not grow here.
01: FOOTPRINT: 67108864 bytes
02: FOOTPRINT: 67108864 bytes
03: FOOTPRINT: 67108864 bytes
04: FOOTPRINT: 67108864 bytes
05: FOOTPRINT: 67108864 bytes
06: FOOTPRINT: 67108864 bytes
07: FOOTPRINT: 67108864 bytes

Check confirms all graphs use a different virtual address.

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Running simultaneous streams example.
Graphs that can run concurrently need separate physical memory. In this example, each graph launched in a separate stream increases the total memory footprint.

When launching a new graph, CUDA may reuse physical memory from a graph whose execution has already finished -- even if the new graph is being launched in a different stream from the completed graph. Therefore, a kernel node is added to the graphs to increase runtime.

Initial footprint:
FOOTPRINT: 0 bytes

Each graph launch in a seperate stream grows the memory footprint:
01: FOOTPRINT: 67108864 bytes
02: FOOTPRINT: 134217728 bytes
03: FOOTPRINT: 201326592 bytes
04: FOOTPRINT: 268435456 bytes
05: FOOTPRINT: 335544320 bytes
06: FOOTPRINT: 402653184 bytes
07: FOOTPRINT: 402653184 bytes

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Running unfreed streams example.
CUDA cannot reuse phyiscal memory from graphs which do not free their allocations.

Despite being launched in the same stream, each graph launch grows the memory footprint. Since the allocation is not freed, CUDA keeps the memory valid for use.
00: FOOTPRINT: 67108864 bytes
01: FOOTPRINT: 134217728 bytes
02: FOOTPRINT: 201326592 bytes
03: FOOTPRINT: 268435456 bytes
04: FOOTPRINT: 335544320 bytes
05: FOOTPRINT: 402653184 bytes
06: FOOTPRINT: 469762048 bytes
07: FOOTPRINT: 536870912 bytes

Trimming does not impact the memory footprint since the un-freed allocations are still holding onto the memory.
FOOTPRINT: 536870912 bytes

Freeing the allocations does not shrink the footprint.
FOOTPRINT: 536870912 bytes

Since the allocations are now freed, trimming does reduce the footprint even when the graph execs are not yet destroyed.
FOOTPRINT: 0 bytes

Cleaning up example by trimming device memory.
FOOTPRINT: 0 bytes

================================
Sample complete.
@ -1,34 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Driver version is: 12.0
Setting up sample.
Setup complete.

Running negateSquares in a stream.
Validating negateSquares in a stream...
Validation PASSED!

Running negateSquares in a stream-captured graph.
Validating negateSquares in a stream-captured graph...
Validation PASSED!

Running negateSquares in an explicitly constructed graph.
Check verified that d_negSquare and d_input share a virtual address.
Validating negateSquares in an explicitly constructed graph...
Validation PASSED!

Running negateSquares with d_negSquare freed outside the stream.
Check verified that d_negSquare and d_input share a virtual address.
Validating negateSquares with d_negSquare freed outside the stream...
Validation PASSED!

Running negateSquares with d_negSquare freed outside the graph.
Validating negateSquares with d_negSquare freed outside the graph...
Validation PASSED!

Running negateSquares with d_negSquare freed in a different graph.
Validating negateSquares with d_negSquare freed in a different graph...
Validation PASSED!

Cleaning up sample.
Cleanup complete. Exiting sample.
@ -1,48 +0,0 @@
[[histogram]] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

CUDA device [NVIDIA H100 PCIe] has 114 Multi-Processors, Compute 9.0
Initializing data...
...allocating CPU memory.
...generating input data
...allocating GPU memory and copying input data

Starting up 64-bin histogram...

Running 64-bin GPU histogram for 67108864 bytes (16 runs)...

histogram64() time (average) : 0.00007 sec, 916944.3386 MB/sec

histogram64, Throughput = 916944.3386 MB/s, Time = 0.00007 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 64

Validating GPU results...
...reading back GPU results
...histogram64CPU()
...comparing the results...
...64-bin histograms match

Shutting down 64-bin histogram...


Initializing 256-bin histogram...
Running 256-bin GPU histogram for 67108864 bytes (16 runs)...

histogram256() time (average) : 0.00018 sec, 379951.1088 MB/sec

histogram256, Throughput = 379951.1088 MB/s, Time = 0.00018 s, Size = 67108864 Bytes, NumDevsUsed = 1, Workgroup = 192

Validating GPU results...
...reading back GPU results
...histogram256CPU()
...comparing the results
...256-bin histograms match

Shutting down 256-bin histogram...


Shutting down...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

[histogram] - Test Summary
Test passed
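The validation step compares the GPU bins against a CPU pass over the same bytes. A minimal 256-bin reference sketch (illustrative name, not the sample's `histogram256CPU()`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference for a 256-bin byte histogram: one bin per byte value.
std::vector<unsigned int> histogram256Ref(const std::vector<uint8_t> &data) {
  std::vector<unsigned int> hist(256, 0);
  for (uint8_t b : data) ++hist[b];  // count occurrences of each byte
  return hist;
}
```

Summing all bins must reproduce the input length, which is a useful invariant when comparing against a parallel implementation.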
@ -1,11 +0,0 @@
Initializing...
GPU Device 0: "Hopper" with compute capability 9.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma
Time: 0.629184 ms
TOPS: 218.44
@ -1,4 +0,0 @@
CUDA inline PTX assembler sample
GPU Device 0: "Hopper" with compute capability 9.0

Test Successful.
@ -1,5 +0,0 @@
CUDA inline PTX assembler sample
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Test Successful.
@ -1,15 +0,0 @@
[Interval Computing] starting ...

GPU Device 0: "Hopper" with compute capability 9.0

> GPU Device has Compute Capabilities SM 9.0

GPU naive implementation
Searching for roots in [0.01, 4]...
Found 2 intervals that may contain the root(s)
i[0] = [0.999655515093009, 1.00011722206639]
i[1] = [1.00011907576551, 1.00044661086269]
Number of equations solved: 65536
Time per equation: 0.616870105266571 us

Check against Host computation...
@ -1,9 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

CPU iterations : 2954
CPU error : 4.988e-03
CPU Processing time: 2525.311035 (ms)
GPU iterations : 2954
GPU error : 4.988e-03
GPU Processing time: 57.967999 (ms)
&&&& jacobiCudaGraphs PASSED
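The output above comes from a Jacobi iterative solver (the sample's point is launching it via CUDA graphs; the numerics are plain Jacobi). A minimal host-side sketch of the iteration itself, on a toy system of our own choosing:

```python
def jacobi(A, b, tol=1e-10, max_iter=10000):
    """Plain Jacobi iteration: x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii."""
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        err = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if err < tol:
            return x, it
    return x, max_iter

# Small diagonally dominant system, so the iteration is guaranteed to converge.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x, iters = jacobi(A, b)
```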
@ -1,7 +0,0 @@
[./lineOfSight] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

Line of sight
Average time: 0.020620 ms

Test passed
@ -1,10 +0,0 @@
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4756.03 GFlop/s, Time= 0.028 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
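The "Size= 131072000 Ops" field in this output is consistent with the printed shapes if (as we read it) the sample reports matrices as (width, height): C = A*B then has wB x hA entries, each costing one multiply and one add per element of the shared dimension. A quick check of that arithmetic (the interpretation of the shape tuple is our assumption):

```python
# Check the "Size= 131072000 Ops" field against the printed shapes,
# reading MatrixA(320,320) / MatrixB(640,320) as (width, height).
wA, hA = 320, 320
wB, hB = 640, 320

ops = 2 * wB * hA * wA        # 2 ops per multiply-add, wA-deep dot products
gflops = ops / 1e9 / 0.028e-3 # using the rounded "Time= 0.028 msec"
# gflops comes out near 4681; the logged 4756.03 differs only because the
# printed time is rounded to three decimals.
```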
@ -1,12 +0,0 @@
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

GPU Device 0: "NVIDIA H100 PCIe" with compute capability 9.0

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 10873.05 GFlop/s, Time= 0.018 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,11 +0,0 @@
[ matrixMulDrv (Driver API) ]
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Total amount of global memory: 85021163520 bytes
> findModulePath found file at <./matrixMul_kernel64.fatbin>
> initCUDA loading module: <./matrixMul_kernel64.fatbin>
> 32 block size selected
Processing time: 0.058000 (ms)
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,6 +0,0 @@
[ matrixMulDynlinkJIT (CUDA dynamic linking) ]
> Device 0: "NVIDIA H100 PCIe" with Compute 9.0 capability
> Compiling CUDA module
> PTX JIT log:

Test run success!
@ -1,9 +0,0 @@
[Matrix Multiply Using CUDA] - Starting...
MatrixA(320,320), MatrixB(640,320)
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Computing result using CUDA Kernel...
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,6 +0,0 @@
> findModulePath found file at <./memMapIpc_kernel64.ptx>
> initCUDA loading module: <./memMapIpc_kernel64.ptx>
> PTX JIT log:

Step 0 done
Process 0: verifying...
@ -1,17 +0,0 @@
./mergeSort Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Allocating and initializing host arrays...

Allocating and initializing CUDA arrays...

Initializing GPU merge sort...
Running GPU merge sort...
Time: 1.344000 ms
Reading back GPU merge sort results...
Inspecting the results...
...inspecting keys array: OK
...inspecting keys and values array: OK
...stability property: stable!
Shutting down...
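The "...stability property: stable!" check above verifies that equal keys keep their original relative order after the key-value sort. A minimal sketch of a stable merge sort and that stability check, in Python (the data and names are illustrative, not the sample's implementation):

```python
import random

def merge(left, right):
    """Stable merge: on equal keys, take from the left run first."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] <= right[j][0]:   # <= preserves order of equal keys
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(pairs):
    if len(pairs) <= 1:
        return pairs
    mid = len(pairs) // 2
    return merge(merge_sort(pairs[:mid]), merge_sort(pairs[mid:]))

# Keys with many duplicates; values record original positions, so
# stability means runs of equal keys carry strictly ascending values.
random.seed(0)
keys = [random.randrange(8) for _ in range(64)]
pairs = merge_sort(list(zip(keys, range(64))))
```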
@ -1,11 +0,0 @@
newdelete Starting...

GPU Device 0: "Hopper" with compute capability 9.0

> Container = Vector test OK

> Container = Vector, using placement new on SMEM buffer test OK

> Container = Vector, with user defined datatype test OK

Test Summary: 3/3 successfully run
@ -1,56 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using GPU 0 (NVIDIA H100 PCIe, 114 SMs, 2048 th/SM max, CC 9.0, ECC on)
Decoding images in directory: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/, total 8, batchsize 1
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Processing: ../../../../Samples/4_CUDA_Libraries/nvJPEG/images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Total decoding time: 3.19197
Avg decoding time per image: 0.398996
Avg images per sec: 2.50629
Avg decoding time per batch: 0.398996
@ -1,62 +0,0 @@
GPU Device 0: "Hopper" with compute capability 9.0

Using GPU 0 (NVIDIA H100 PCIe, 114 SMs, 2048 th/SM max, CC 9.0, ECC on)
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img1.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img2.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img3.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img4.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img5.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img6.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img7.jpg
Processing file: ../../../../Samples/4_CUDA_Libraries/nvJPEG_encoder/images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Writing JPEG file: encode_output/img8.jpg
Total images processed: 8
Total time spent on encoding: 1.9711
Avg time/image: 0.246388
@ -1,35 +0,0 @@
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: c1, pciDeviceID: 0, pciDomainID:0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0
     0       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0
     0 1628.72
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0
     0 1625.75
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0
     0 1668.11
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0
     0 1668.39
P2P=Disabled Latency Matrix (us)
   GPU     0
     0   2.67

   CPU     0
     0   2.04
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0
     0   2.68

   CPU     0
     0   2.02

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
@ -1,15 +0,0 @@
[PTX Just In Time (JIT) Compilation (no-qatest)] - Starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> findModulePath <./ptxjit_kernel64.ptx>
> initCUDA loading module: <./ptxjit_kernel64.ptx>
Loading ptxjit_kernel[] program
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'myKernel' for 'sm_90a'
ptxas info    : Function properties for myKernel
ptxas         . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers
info    : 0 bytes gmem
info    : Function properties for 'myKernel':
info    : used 8 registers, 0 stack, 0 bytes smem, 536 bytes cmem[0], 0 bytes lmem
CUDA kernel launched
@ -1,24 +0,0 @@
./quasirandomGenerator Starting...

Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...

Testing QRNG...

quasirandomGenerator, Throughput = 51.2334 GNumbers/s, Time = 0.00006 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384

Reading GPU results...
Comparing to the CPU results...

L1 norm: 7.275964E-12

Testing inverseCNDgpu()...

quasirandomGenerator-inverse, Throughput = 116.2931 GNumbers/s, Time = 0.00003 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...

Comparing to the CPU results...
L1 norm: 9.439909E-08

Shutting down...
@ -1,27 +0,0 @@
./quasirandomGenerator_nvrtc Starting...

> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Allocating GPU memory...
Allocating CPU memory...
Initializing QRNG tables...

Testing QRNG...

quasirandomGenerator, Throughput = 45.0355 GNumbers/s, Time = 0.00007 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 384

Reading GPU results...
Comparing to the CPU results...

L1 norm: 7.275964E-12

Testing inverseCNDgpu()...

quasirandomGenerator-inverse, Throughput = 94.7508 GNumbers/s, Time = 0.00003 s, Size = 3145728 Numbers, NumDevsUsed = 1, Workgroup = 128
Reading GPU results...

Comparing to the CPU results...
L1 norm: 9.439909E-08

Shutting down...
@ -1,9 +0,0 @@
./radixSortThrust Starting...

GPU Device 0: "Hopper" with compute capability 9.0


Sorting 1048576 32-bit unsigned int keys and values

radixSortThrust, Throughput = 2276.9744 MElements/s, Time = 0.00046 s, Size = 1048576 elements
Test passed
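The sample above sorts 32-bit unsigned keys with attached values using Thrust's radix sort. A minimal sketch of the underlying idea, a least-significant-digit radix sort over one byte per pass (illustrative only, not Thrust's implementation):

```python
import random

def radix_sort_pairs(keys, vals):
    """LSD radix sort of 32-bit unsigned keys, one byte per pass.
    Each pass is a stable bucket distribution, so the whole sort is stable."""
    pairs = list(zip(keys, vals))
    for shift in (0, 8, 16, 24):          # low byte first
        buckets = [[] for _ in range(256)]
        for k, v in pairs:
            buckets[(k >> shift) & 0xFF].append((k, v))
        pairs = [p for b in buckets for p in b]
    return pairs

random.seed(1)
keys = [random.getrandbits(32) for _ in range(1000)]
out = radix_sort_pairs(keys, range(1000))
```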
@ -1,18 +0,0 @@
./reduction Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Using Device 0: NVIDIA H100 PCIe

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

Reduction, Throughput = 49.0089 GB/s, Time = 0.00137 s, Size = 16777216 Elements, NumDevsUsed = 1, Workgroup = 256

GPU result = 2139353471
CPU result = 2139353471

Test passed
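The reduction sample's "GPU result = CPU result" check hinges on a tree-shaped sum producing the same total as a sequential one. A minimal sketch of a pairwise tree reduction (a sequential model of the parallel pattern, not the sample's kernel):

```python
def tree_reduce(xs):
    """Pairwise tree reduction: halve the active element count each step,
    mirroring how a GPU reduction runs log2(n) levels."""
    xs = list(xs)
    while len(xs) > 1:
        half = (len(xs) + 1) // 2
        xs = [xs[i] + (xs[i + half] if i + half < len(xs) else 0)
              for i in range(half)]
    return xs[0]

data = list(range(1000))
total = tree_reduce(data)
```

For integers the tree and sequential sums agree exactly; for floats they can differ in the last bits, which is why samples compare against a tolerance rather than bit equality.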
@ -1,14 +0,0 @@
reductionMultiBlockCG Starting...

GPU Device 0: "Hopper" with compute capability 9.0

33554432 elements
numThreads: 1024
numBlocks: 228

Launching SinglePass Multi Block Cooperative Groups kernel
Average time: 0.102750 ms
Bandwidth: 1306.254555 GB/s

GPU result = 1.992401599884
CPU result = 1.992401361465
@ -1,19 +0,0 @@
./scalarProd Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Initializing data...
...allocating CPU memory.
...allocating GPU memory.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing GPU kernel...
GPU time: 0.042000 msecs.
Reading back GPU result...
Checking GPU results...
..running CPU scalar product calculation
...comparing the results
Shutting down...
L1 error: 2.745062E-08
Test passed
@ -1,138 +0,0 @@
./scan Starting...

GPU Device 0: "Hopper" with compute capability 9.0

Allocating and initializing host arrays...
Allocating and initializing CUDA arrays...
Initializing CUDA-C scan...

*** Running GPU scan for short arrays (100 identical iterations)...

Running scan for 4 elements (1703936 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 8 elements (851968 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 16 elements (425984 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 32 elements (212992 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 64 elements (106496 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 128 elements (53248 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 256 elements (26624 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 512 elements (13312 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 1024 elements (6656 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match


scan, Throughput = 35.1769 MElements/s, Time = 0.00003 s, Size = 1024 Elements, NumDevsUsed = 1, Workgroup = 256

***Running GPU scan for large arrays (100 identical iterations)...

Running scan for 2048 elements (3328 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 4096 elements (1664 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 8192 elements (832 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 16384 elements (416 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 32768 elements (208 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 65536 elements (104 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 131072 elements (52 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match

Running scan for 262144 elements (26 arrays)...
Validating the results...
...reading back GPU results
...scanExclusiveHost()
...comparing the results
...Results Match


scan, Throughput = 5146.1328 MElements/s, Time = 0.00005 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

Shutting down...
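Each validation pass above compares the GPU output against a host reference (`scanExclusiveHost()`). An exclusive prefix sum is simple to state: output element i is the sum of all inputs strictly before i. A minimal host-side sketch (names are ours, not the sample's):

```python
def scan_exclusive(arr):
    """Host reference exclusive prefix sum: out[i] = sum(arr[:i]), out[0] = 0."""
    out, running = [], 0
    for x in arr:
        out.append(running)   # value *before* adding the current element
        running += x
    return out

result = scan_exclusive([3, 1, 4, 1, 5])
```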
@ -1,6 +0,0 @@
./segmentationTreeThrust Starting...

GPU Device 0: "Hopper" with compute capability 9.0

* Building segmentation tree... done in 24.6388 (ms)
* Dumping levels for each tree...
@ -1,24 +0,0 @@
Starting shfl_scan
GPU Device 0: "Hopper" with compute capability 9.0

> Detected Compute SM 9.0 hardware with 114 multi-processors
Starting shfl_scan
GPU Device 0: "Hopper" with compute capability 9.0

> Detected Compute SM 9.0 hardware with 114 multi-processors
Computing Simple Sum test
---------------------------------------------------
Initialize test data [1, 1, 1...]
Scan summation for 65536 elements, 256 partial sums
Partial summing 256 elements with 1 blocks of size 256
Test Sum: 65536
Time (ms): 0.021504
65536 elements scanned in 0.021504 ms -> 3047.619141 MegaElements/s
CPU verify result diff (GPUvsCPU) = 0
CPU sum (naive) took 0.017810 ms

Computing Integral Image Test on size 1920 x 1080 synthetic data
---------------------------------------------------
Method: Fast  Time (GPU Timer): 0.008032 ms Diff = 0
Method: Vertical Scan  Time (GPU Timer): 0.068576 ms
CheckSum: 2073600, (expect 1920x1080=2073600)
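The integral-image test above checks that, for a synthetic all-ones image, the final entry of the 2-D inclusive prefix sum equals width*height (1920x1080 = 2073600). A minimal sketch of an integral image as a row scan followed by a column scan, demonstrated at toy scale (the implementation is illustrative, not the sample's shuffle-based kernel):

```python
def integral_image(img):
    """Inclusive 2-D prefix sum: a horizontal scan pass, then a vertical one."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):                 # horizontal pass: scan each row
        for x in range(1, w):
            out[y][x] += out[y][x - 1]
    for y in range(1, h):              # vertical pass: scan each column
        for x in range(w):
            out[y][x] += out[y - 1][x]
    return out

# All-ones 4 x 5 image: bottom-right entry must be 4 * 5 = 20, the
# small-scale analogue of the sample's "CheckSum: 2073600" check.
img = [[1] * 5 for _ in range(4)]
ii = integral_image(img)
```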
@ -1,6 +0,0 @@
./simpleAWBarrier starting...
GPU Device 0: "Hopper" with compute capability 9.0

Launching normVecByDotProductAWBarrier kernel with numBlocks = 228 blockSize = 576
Result = PASSED
./simpleAWBarrier completed, returned OK
@ -1,16 +0,0 @@
simpleAssert starting...
OS_System_Type.release = 5.4.0-131-generic
OS Info: <#147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022>

GPU Device 0: "Hopper" with compute capability 9.0

Launch kernel to generate assertion failures

-- Begin assert output


-- End assert output

Device assert failed as expected, CUDA error message is: device-side assert triggered

simpleAssert completed, returned OK
@ -1,12 +0,0 @@
simpleAssert_nvrtc starting...
Launch kernel to generate assertion failures
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability

-- Begin assert output


-- End assert output

Device assert failed as expected
@ -1,5 +0,0 @@
simpleAtomicIntrinsics starting...
GPU Device 0: "Hopper" with compute capability 9.0

Processing time: 2.438000 (ms)
simpleAtomicIntrinsics completed, returned OK
@ -1,6 +0,0 @@
simpleAtomicIntrinsics_nvrtc starting...
> Using CUDA Device [0]: NVIDIA H100 PCIe
> Using CUDA Device [0]: NVIDIA H100 PCIe
> GPU Device has SM 9.0 compute capability
Processing time: 0.171000 (ms)
simpleAtomicIntrinsics_nvrtc completed, returned OK
Some files were not shown because too many files have changed in this diff.