immaTensorCoreGemm
-maxrregcount=255
cudaMallocManaged cudaDeviceSynchronize cudaFuncSetAttribute cudaEventCreate cudaEventRecord cudaEventSynchronize cudaEventElapsedTime cudaFree
whole
./ ../ ../../common/inc
Matrix Multiply WMMA Tensor Cores
true
immaTensorCoreGemm.cu
1:CUDA Basic Topics
sm72 sm75 sm80
x86_64 linux aarch64 windows7 ppc64le linux
7.2
Tensor Core GEMM Integer MMA
exe