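Name: bf16TensorCoreGemm
Compiler flags: --std=c++11
CUDA APIs: cudaMallocManaged, cudaDeviceSynchronize, cudaFuncSetAttribute, cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaFree
Device compilation: whole
Include paths: ./, ../, ../../common/inc
Key concepts: Matrix Multiply, WMMA, Tensor Cores
Keywords: matrix multiply, Async copy, CPP11, GCC 5.0.0
Nsight Eclipse: true
Primary file: bf16TensorCoreGemm.cu
Scope: 1:CUDA Basic Topics
SM architecture: sm80
Supported environments: x86_64 linux; aarch64; windows7; ppc64le linux
Supported SM architectures (from): 8.0
Title: bfloat16 Tensor Core GEMM
Type: exe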