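Sample: tf32TensorCoreGemm
Compiler flags: --std=c++11
CUDA APIs: cudaMalloc, cudaDeviceSynchronize, cudaFuncSetAttribute, cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaFree
Device compilation: whole
Include paths: ./, ../, ../../Common
Key concepts: Matrix Multiply, WMMA, Tensor Cores
Keywords: matrix multiply, Async copy, CPP11, GCC 5.1.0
Nsight Eclipse: true
Primary file: tf32TensorCoreGemm.cu
Scope: 1:CUDA Basic Topics
SM architectures: sm80, sm86
Supported environments: x86_64 linux, aarch64, windows7, ppc64le linux
Minimum compute capability: 8.0
Title: tf32 Tensor Core GEMM
Type: exe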