[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4756.03 GFlop/s, Time= 0.028 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
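
The Size and Performance figures in the log above can be cross-checked with a little arithmetic. The sketch below is not part of the sample. It assumes the operation count is two floating-point operations (one multiply, one add) per inner-product term, so Size is 2 x 320 x 320 x 640, and that GFlop/s is Size divided by the per-run time. All variable names are hypothetical. Because the log prints the time rounded to three decimals, recomputing the throughput from 0.028 msec gives a slightly lower number than the reported one.

```python
# Cross-check of the figures printed in the log (illustrative only).
# Assumption: 2 floating-point ops (multiply + add) per inner-product term.
m, k, n = 320, 320, 640          # the three distinct dimensions from the log

flops = 2 * m * k * n            # total operations for one matrix multiply

reported_gflops = 4756.03        # "Performance=" value from the log
reported_msec = 0.028            # "Time=" value from the log (rounded)

# Throughput recomputed from the rounded time shown in the log.
gflops_from_rounded_time = flops * 1e-9 / (reported_msec / 1000.0)

# Per-run time implied by the reported throughput.
implied_msec = flops * 1e-9 / reported_gflops * 1000.0

print(f"Size            = {flops} Ops")
print(f"GFlop/s @ 0.028 = {gflops_from_rounded_time:.2f}")
print(f"Implied time    = {implied_msec:.5f} msec")
```

The computed operation count matches the Size value in the log, and the implied time rounds to the 0.028 msec that the log prints, so the reported numbers are mutually consistent under these assumptions.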