[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Hopper" with compute capability 9.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4756.03 GFlop/s, Time= 0.028 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
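
The Size and Performance figures in the log above can be cross-checked with a little arithmetic. The sketch below is not taken from the sample's source. It assumes that the operation count is 2 * M * K * N (one multiply and one add per inner-product term), that the printed matrix dimensions are (width, height) pairs so that the product is a 320 x 320 matrix times a 320 x 640 matrix, and that the reported time is rounded to three decimals. The helper names matmul_ops and gflops are hypothetical.

```python
# Sanity check of the figures printed in the log (a sketch under stated assumptions).

def matmul_ops(m: int, k: int, n: int) -> int:
    """Operation count for an (m x k) by (k x n) product, assuming
    one multiply and one add per inner-product term."""
    return 2 * m * k * n

def gflops(ops: int, msec: float) -> float:
    """Convert an operation count and a time in milliseconds to GFlop/s."""
    return ops * 1e-9 / (msec / 1000.0)

# Assumption: A is 320 x 320 and B is 320 rows by 640 columns.
ops = matmul_ops(320, 320, 640)

# Throughput recomputed from the rounded time shown in the log.
rate_from_rounded_time = gflops(ops, 0.028)

# Time implied by the reported 4756.03 GFlop/s, in milliseconds.
implied_msec = ops / 4756.03e9 * 1000.0

print(ops)
print(rate_from_rounded_time)
print(implied_msec)
```

The operation count matches the Size field exactly. The throughput recomputed from 0.028 msec comes out somewhat lower than the reported figure, which is expected if the program divides by the unrounded time: the implied time is slightly under 0.028 msec and rounds to the value shown.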