globalToShmemTMACopy - Global Memory to Shared Memory TMA Copy

Description

This sample shows how to use the CUDA driver API and inline PTX assembly to copy a 2D tile of a tensor from global memory into shared memory with the Tensor Memory Accelerator (TMA). It also demonstrates the use of an arrive-wait barrier to synchronize on completion of the copy.
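The device side of such a copy can be sketched as follows. This is not the sample's code: the sample uses inline PTX, while this sketch swaps in the experimental libcu++ wrappers from `<cuda/barrier>` (`cuda::device::experimental`), assuming the installed toolkit ships them. The kernel name `load_tile`, the tile sizes, and the `out` buffer are hypothetical, and `tensor_map` is assumed to have been created on the host with `cuTensorMapEncodeTiled`.

```cuda
#include <cuda.h>          // CUtensorMap
#include <cuda/barrier>    // cuda::barrier and the experimental TMA wrappers
#include <utility>         // std::move

using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

// Hypothetical tile size; it must match the box size encoded in the tensor map.
constexpr int SMEM_WIDTH  = 32;
constexpr int SMEM_HEIGHT = 8;

// Assumption: `out` points to at least SMEM_HEIGHT * SMEM_WIDTH ints.
__global__ void load_tile(const __grid_constant__ CUtensorMap tensor_map,
                          int x, int y, int *out)
{
  // Destination of a bulk tensor copy; 128-byte alignment is required.
  __shared__ alignas(128) int tile[SMEM_HEIGHT][SMEM_WIDTH];

  // Arrive-wait barrier living in shared memory.
#pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;

  if (threadIdx.x == 0) {
    // All threads of the block participate in the barrier.
    init(&bar, blockDim.x);
    // Make the initialized barrier visible to the async proxy (TMA).
    cde::fence_proxy_async_shared_cta();
  }
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    // One thread starts the copy of the tile whose corner is at (x, y).
    cde::cp_async_bulk_tensor_2d_global_to_shared(&tile, &tensor_map, x, y, bar);
    // Arrive and announce how many bytes the barrier should expect.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
  } else {
    // The remaining threads only arrive.
    token = bar.arrive();
  }
  // Blocks until every thread has arrived and all bytes have landed.
  bar.wait(std::move(token));

  // The tile is now readable by every thread; copy it out so it can be inspected.
  for (int i = threadIdx.x; i < SMEM_HEIGHT * SMEM_WIDTH; i += blockDim.x) {
    out[i] = tile[i / SMEM_WIDTH][i % SMEM_WIDTH];
  }
}
```

Only one thread issues the copy because the TMA unit moves the whole tile; the barrier's transaction count is what lets the other threads wait for the data without polling. The kernel needs to be compiled for SM 9.0, for example with `-arch=sm_90`.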

Key Concepts

CUDA Runtime API, CUDA Driver API, PTX ISA, CPP11 CUDA

Supported SM Architectures

This sample requires compute capability 9.0 or higher.

SM 9.0
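A host program can check this requirement before launching any TMA kernel. The following is a minimal sketch using the runtime API; the choice of device 0 and the exit codes are assumptions, not taken from the sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  const int device = 0;  // assumption: check the first visible device
  cudaDeviceProp prop{};
  cudaError_t err = cudaGetDeviceProperties(&prop, device);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                 cudaGetErrorString(err));
    return 1;
  }

  // TMA is only available on compute capability 9.0 or higher.
  if (prop.major < 9) {
    std::printf("Device %d (%s) is SM %d.%d; SM 9.0 or higher is required.\n",
                device, prop.name, prop.major, prop.minor);
    return 2;
  }

  std::printf("Device %d (%s) is SM %d.%d and can run this sample.\n",
              device, prop.name, prop.major, prop.minor);
  return 0;
}
```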

Supported OSes

Linux, Windows, QNX

Supported CPU Architecture

x86_64, ppc64le, armv7l, aarch64

CUDA APIs involved

CUDA Runtime API

cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize

CUDA Driver API

cuTensorMapEncodeTiled
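On the host, the driver API is what describes the tensor and the 2D tile to the TMA unit, through a tensor map. A minimal sketch, assuming a row-major `int` tensor and calling `cuTensorMapEncodeTiled` directly (which requires linking against the driver library, for example with `-lcuda`); the sizes and variable names are illustrative and not taken from the sample.

```cuda
#include <cstdio>
#include <cuda.h>          // CUtensorMap, cuTensorMapEncodeTiled
#include <cuda_runtime.h>

// Illustrative sizes, not taken from the sample.
constexpr cuuint64_t GMEM_WIDTH  = 1024;
constexpr cuuint64_t GMEM_HEIGHT = 1024;
constexpr cuuint32_t SMEM_WIDTH  = 32;
constexpr cuuint32_t SMEM_HEIGHT = 8;

int main()
{
  // Allocate the tensor in global memory (also initializes the CUDA context).
  int *tensor = nullptr;
  if (cudaMalloc(&tensor, GMEM_WIDTH * GMEM_HEIGHT * sizeof(int)) != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }

  constexpr cuuint32_t rank = 2;
  // Dimensions are listed fastest-varying first.
  cuuint64_t size[rank] = {GMEM_WIDTH, GMEM_HEIGHT};
  // Bytes between consecutive rows; must be a multiple of 16.
  cuuint64_t stride[rank - 1] = {GMEM_WIDTH * sizeof(int)};
  // Size of the tile (box) that one TMA copy moves into shared memory.
  cuuint32_t box[rank] = {SMEM_WIDTH, SMEM_HEIGHT};
  // Step between elements, in elements; 1 means every element is copied.
  cuuint32_t elem_stride[rank] = {1, 1};

  CUtensorMap tensor_map{};
  CUresult res = cuTensorMapEncodeTiled(
      &tensor_map,
      CU_TENSOR_MAP_DATA_TYPE_INT32,
      rank,
      tensor,
      size,
      stride,
      box,
      elem_stride,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);

  if (res != CUDA_SUCCESS) {
    std::fprintf(stderr, "cuTensorMapEncodeTiled failed with code %d\n",
                 static_cast<int>(res));
    cudaFree(tensor);
    return 1;
  }

  std::printf("Tensor map encoded.\n");
  // A kernel would receive tensor_map as a `const __grid_constant__ CUtensorMap` parameter.
  cudaFree(tensor);
  return 0;
}
```

The tensor map is an opaque descriptor built once on the host, so the kernel only needs tile coordinates to request a copy.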

CUDA PTX ISA

Dependencies needed to build/run

CPP11

Prerequisites

Download and install the CUDA Toolkit 12.2 for your platform. Make sure the dependencies listed in the Dependencies section above are installed.
