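Name: tf32TensorCoreGemm
Title: tf32 Tensor Core GEMM
Type: exe
Primary file: tf32TensorCoreGemm.cu
Compiler flags: --std=c++11
CUDA API calls: cudaMalloc, cudaDeviceSynchronize, cudaFuncSetAttribute, cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaFree
Device compilation: whole
Include paths: ./, ../, ../../common/inc
Key concepts and keywords: Matrix Multiply, WMMA, Tensor Cores, matrix multiply, Async copy, CPP11, GCC 5.0.0
Nsight Eclipse: true
Scope: 1:CUDA Basic Topics
SM architecture: sm80
Supported environments: x86_64 linux, aarch64, windows7, ppc64le linux
Supported SM architectures: 8.0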