[simpleMultiCopy] - Starting... > Using CUDA device [0]: NVIDIA H100 PCIe [NVIDIA H100 PCIe] has 114 MP(s) x 128 (Cores/MP) = 14592 (Cores) > Device name: NVIDIA H100 PCIe > CUDA Capability 9.0 hardware with 114 multi-processors > scale_factor = 1.00 > array_size = 4194304 Relevant properties of this CUDA device (X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap") (X) Can overlap two CPU<>GPU data transfers with GPU kernel execution (Compute Capability >= 2.0 AND (Tesla product OR Quadro 4000/5000/6000/K5000) Measured timings (throughput): Memcpy host to device : 0.610592 ms (27.476966 GB/s) Memcpy device to host : 0.694048 ms (24.172991 GB/s) Kernel : 0.033408 ms (5021.915549 GB/s) Theoretical limits for speedup gained from overlapped data transfers: No overlap at all (transfer-kernel-transfer): 1.338048 ms Compute can overlap with one transfer: 1.304640 ms Compute can overlap with both data transfers: 0.694048 ms Average measured timings over 10 repetitions: Avg. time when execution fully serialized : 1.325424 ms Avg. time when overlapped using 4 streams : 1.203120 ms Avg. speedup gained (serialized - overlapped) : 0.122304 ms Measured throughput: Fully serialized execution : 25.315998 GB/s Overlapped using 4 streams : 27.889513 GB/s