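Name: bf16TensorCoreGemm
Title: bfloat16 Tensor Core GEMM
Type: exe
Primary file: bf16TensorCoreGemm.cu
Compiler flags: --std=c++11
CUDA API calls: cudaMallocManaged, cudaDeviceSynchronize, cudaFuncSetAttribute, cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaFree
Device compilation: whole
Include paths: ./, ../, ../../Common
Key concepts: Matrix Multiply, WMMA, Tensor Cores
Keywords: matrix multiply, Async copy, CPP11, GCC 5.1.0
Nsight Eclipse: true
Scope: 1:CUDA Basic Topics
SM architectures: sm80, sm86
Supported environments: x86_64 linux, aarch64, windows7, ppc64le linux
Minimum compute capability: 8.0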