cuBLASDx Summary
Introduction to cuBLASDx
The cuBLAS Device Extensions (cuBLASDx) library lets you execute a subset of the linear-algebra functions available in cuBLAS inside your own CUDA kernels. At the moment this is limited to general matrix multiplication (GEMM). Fusing linear algebra with other operations reduces latency and improves the overall performance of your application.
The cuBLASDx library currently provides the following features:
- BLAS GEMM routines that can be embedded in CUDA kernels.
- High performance, avoiding unnecessary data movement to and from global memory.
- Extensive customization: GEMM routines can be tailored to different needs (matrix sizes, precisions, data types, target CUDA architecture, and so on).
- Flexible accumulation and fusion: computation can take place in shared memory or in registers.
- The ability to fuse BLAS computations with other operations, reducing the number of global memory accesses.
- Compatibility with future versions of the CUDA Toolkit.
cuBLASDx (cuBLAS Device Extensions) is a lightweight, high-performance library extension from NVIDIA, shipped as part of the MathDx package and targeting recent CUDA Toolkits, that allows developers to call highly optimized BLAS routines directly inside their CUDA kernels instead of launching separate kernels through the host-side cuBLAS API.
In short: cuBLASDx = cuBLAS GEMM that you can embed in a CUDA kernel.
Prerequisites
Understanding this article requires familiarity with CuTe Layouts.
Core Concepts
- cuBLASDx currently supports only the GEMM function from BLAS.
- Three GEMM variants can be called:
- Shared memory API: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)
- Register API with accumulator: \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)
- Register API without accumulator: \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
- Note: for the shared memory API, \(\mathbf{A}/\mathbf{B}/\mathbf{C}\) must reside in SMEM; for the register APIs, \(\mathbf{A}/\mathbf{B}\) must reside in SMEM and \(\mathbf{C}\) must reside in RMEM.
- The input/output precisions of the GEMM (the precisions of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\)) are decoupled from the compute precision used inside the GEMM (the \(\times\) and \(+\)). Before the GEMM is computed, \(\mathbf{A}/\mathbf{B}/\mathbf{C}\) are converted to the compute precision defined for the GEMM. A conversion function can be supplied as an argument to GEMM::execute(); if none is given, the type conversion must be possible via implicit conversion, otherwise a compile-time error is produced.
- \(\mathbf{A}/\mathbf{B}/\mathbf{C}\) can be passed to the GEMM API either as CuTe-layout tensors or as raw pointers. A CuTe Layout already carries the memory-layout information; when pointers are used, the layout information comes from the GEMM description. TODO: add a test case showing what happens when the Layout information conflicts with the GEMM description.
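A minimal sketch of the two calling styles; the raw-pointer overload of execute() shown here is an assumption based on the cuBLASDx documentation, so check it against the version you use. Loading A/B/C from global memory into shared memory is elided:

#include <cublasdx.hpp>

template<class GEMM>
__global__ void tensor_vs_pointer_kernel(typename GEMM::c_value_type alpha,
                                         typename GEMM::c_value_type beta) {
    extern __shared__ __align__(16) char smem[];
    auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<GEMM>(smem);
    // ... fill smem_a / smem_b / smem_c from global memory here ...

    // Style 1: CuTe tensors; the layout travels with the tensor.
    auto a_s = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_s = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
    auto c_s = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());
    GEMM().execute(alpha, a_s, b_s, beta, c_s);

    // Style 2: raw pointers; arrangement and leading dimensions are taken
    // from the GEMM description (Arrangement / LeadingDimension operators).
    GEMM().execute(alpha, smem_a, smem_b, beta, smem_c);
}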
- Developers provide a logical description of the GEMM (Description Operators) and an execution description of the GEMM (Execution Operators). Each individual piece of the description is called an operator. The Description Operators and Execution Operators together form the function descriptor, also referred to as the BLAS. Code example:
#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<8, 16, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::complex>()
                      + cublasdx::Arrangement<cublasdx::col_major, cublasdx::col_major>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>());
- The logical description of the GEMM (Description Operators) consists of:
| Operator | Default | Description |
|---|---|---|
| Size<M, N, K> | none | Size of the GEMM. |
| Arrangement<ArrA, ArrB, ArrC> | row_major, col_major, col_major | Majorness of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\). |
| Precision<PA, PB, PC> | float, float, float | Compute precision of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\); must be all floating-point or all integer. |
| Type<type> | type::real | Type of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\), real or complex (type::real or type::complex). |
| LeadingDimension<LDA, LDB, LDC> | defined by Size and Arrangement | Leading dimensions of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\). |
| Alignment<AlignA, AlignB, AlignC> | alignof(BLAS::a_value_type), … | Alignments of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\) (in bytes). |
| SM<CC> | none | Target CUDA architecture (SM). |
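For illustration, a descriptor that also pins the leading dimensions and alignments might look like the sketch below; this is an assumed but typical combination, and any operator that is omitted falls back to the defaults in the table above:

#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Arrangement<cublasdx::col_major, cublasdx::col_major, cublasdx::col_major>()
                      + cublasdx::LeadingDimension<64, 64, 64>()   // LDA, LDB, LDC
                      + cublasdx::Alignment<16, 16, 16>()          // in bytes
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>());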
- The execution description of the GEMM (Execution Operators) consists of:
| Operator | Default | Description |
|---|---|---|
| Block | none | Creates a BLAS function that executes within a CUDA block. |
| BlockDim<X, Y, Z> | the value returned by BLAS::suggested_block_dim() | Configures the number of threads executing the BLAS function. X*Y*Z must be at least 32, and ideally a multiple of 32. |

- From the function descriptor, cuBLASDx computes:
  - The total SMEM size needed by the GEMM (get_shared_storage_size(), get_shared_storage_size_ab()) and the SMEM sizes occupied by \(\mathbf{A}/\mathbf{B}/\mathbf{C}\) individually (slice_shared_memory(), slice_shared_memory_ab())
  - The layouts of \(\mathbf{A}/\mathbf{B}/\mathbf{C}\) in GMEM and SMEM (the layouts returned by BLAS::get_layout_<gmem/smem>_<a/b/c>() carry no optimizations; the layouts returned by BLAS::suggest_layout_<gmem/smem>_<a/b/c>() are optimized for MMA and copies on the specific SM)
  - The choice of suitable MMA instructions
  - The tiling of the MMA instructions (the shape of the matrix computation)
  - The register fragments of the threads participating in the GEMM
  - The block dimensions needed to launch the kernel
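A small host-side sketch that prints what cuBLASDx derived from a descriptor. It only reuses names that appear elsewhere in this article (BLAS::block_dim, cublasdx::get_shared_storage_size, cublasdx::get_shared_storage_size_ab); treating block_dim as a dim3 is an assumption:

#include <cublasdx.hpp>
#include <iostream>

using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>()
                      + cublasdx::Block());

int main() {
    // Threadblock dimensions cuBLASDx expects the kernel to be launched with.
    const dim3 bd = BLAS::block_dim;
    std::cout << "block_dim: " << bd.x << " x " << bd.y << " x " << bd.z << "\n";
    // Dynamic SMEM when A, B and C all live in shared memory (shared memory API) ...
    std::cout << "smem (A+B+C): " << cublasdx::get_shared_storage_size<BLAS>()    << " bytes\n";
    // ... and when only A and B do (register APIs).
    std::cout << "smem (A+B):   " << cublasdx::get_shared_storage_size_ab<BLAS>() << " bytes\n";
    return 0;
}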
- After the function descriptor has been created, Traits can be used to query the GEMM-related information it contains, for example:
  - If the function descriptor contains Description Operators, cublasdx::is_blas<BLAS>::value is true
  - If the function descriptor contains exactly one Size, one Function and one SM operator, cublasdx::is_complete_blas<BLAS>::value is true
  - If the function descriptor contains Description Operators and Execution Operators, cublasdx::is_blas_execution<BLAS>::value is true (TODO: write an example with only Block)
  - If cublasdx::is_complete_blas<BLAS>::value and cublasdx::is_blas_execution<BLAS>::value are both true, cublasdx::is_complete_blas_execution<BLAS>::value is true
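A compile-time sanity check built only from the traits listed above (a minimal sketch; the descriptor itself is just an example):

#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<8, 16, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>()
                      + cublasdx::Block());

// Any combination of Description Operators already forms a BLAS description.
static_assert(cublasdx::is_blas<BLAS>::value, "not a BLAS description");
// Size, Function and SM each appear exactly once, so the description is complete.
static_assert(cublasdx::is_complete_blas<BLAS>::value, "description is incomplete");
// Block() is an Execution Operator, so this is an execution descriptor too.
static_assert(cublasdx::is_blas_execution<BLAS>::value, "not an execution descriptor");
// Complete description combined with execution operators.
static_assert(cublasdx::is_complete_blas_execution<BLAS>::value, "not a complete execution descriptor");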
- RMEM is private to each thread and very limited, so the inputs and outputs of a GEMM must be distributed across the RMEM of all threads participating in the GEMM. The way tensor data is distributed across the RMEM of a group of threads is called partitioning. In cuBLASDx, the partitioner holds all information related to partitioning. There are three ways to obtain a partitioner:
// #1a The default partitioner
auto partitioner = BLAS::get_partitioner();
// #1b The optimized partitioner
auto partitioner = BLAS::suggest_partitioner();
// #2 The "Register API without accumulator" GEMM returns a partitioner as part of its result
auto [c_register_fragment, partitioner] = BLAS().execute(a_shared_tensor, b_shared_tensor);

The choice of partitioner must match the layouts in use, otherwise performance suffers:
- BLAS::get_partitioner() is used together with get_layout_smem_*()
- BLAS::suggest_partitioner() is used together with suggest_layout_smem_*()
- The partitioner provides the following API:
// Partitioning properties
__device__ bool is_predicated();
__device__ bool is_thread_active();

// Accumulator creation, creates a register cublasdx::tensor
__device__ constexpr auto make_accumulator_fragment();

// This method will return a non-owning view of its argument's subtensor assigned
// to the calling thread, corresponding to its local register fragment.
template<class CTensor>
__forceinline__ __device__ auto partition_like_C(CTensor && ctensor) const;

// These 2 functions extend the functionality of is_predicated(), allowing to map a local
// register fragment index to its source (global or shared) tensor index, as well as to
// check whether that index is in bounds.
template<class ... Coords>
__forceinline__ __device__ auto map_fragment_index(Coords&& ... coords) const;

// Check if the fragment index is in bounds
template<class ... Coords>
__forceinline__ __device__ bool is_index_in_bounds(Coords&& ... coords) const;

Notes:
- is_predicated(): elements are divided among threads by tiling the problem size defined by the Size<M, N, K> operator and mapping the tiles onto multiple MMA (Matrix Multiply-Accumulate) instructions, each of which computes a sub-block of a specific shape. When the overall problem shape is not divisible by the native compute shape of the underlying MMA instruction, is_predicated() returns true (it returns false when it is divisible); the "extra" elements are not read from memory but filled with 0, and they are skipped when results are stored.
- is_thread_active(): because cuBLASDx supports execution in kernels whose CUDA threadblock size does not match the BlockDim operator, not all threads necessarily take part in the GEMM, which means some threads may not be assigned any elements. Calling the is_thread_active() member function tells the calling thread exactly whether it is in this situation (i.e., whether it has been assigned no work).
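A hedged device-side sketch that uses these two properties together with is_index_in_bounds(); the fragment's size() member and the single linear index passed to is_index_in_bounds() are assumptions based on the signatures above:

// Scale a register fragment in place, skipping padding elements and inactive threads.
template<class Fragment, class Partitioner, class Scalar>
__forceinline__ __device__ void scale_fragment(Fragment& frag, Partitioner partitioner, Scalar factor) {
    // Threads that were assigned no elements can return immediately.
    if (!partitioner.is_thread_active()) {
        return;
    }
    for (int i = 0; i < frag.size(); ++i) {
        // Under predication some fragment entries are zero padding; skip them.
        if (!partitioner.is_predicated() || partitioner.is_index_in_bounds(i)) {
            frag(i) = frag(i) * factor;
        }
    }
}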
- cuBLASDx supports two kinds of tensor copy operations: bidirectional copies between GMEM and SMEM, and bidirectional copies between SMEM/GMEM and RMEM.
- Copies between GMEM and SMEM: this copy is a cooperative operation. All threads (as specified by NumThreads or BLAS::block_dim) take part in the copy. The function honors the given memory alignments and attempts vectorized loads/stores whenever possible.
template<uint32_t NumThreads,        // Number of threads performing copy operation
         uint32_t AlignmentInBytes,  // Pointer alignment of src and dst tensor (minimum of them if they are different)
         class SrcEngine, class SrcLayout,
         class DstEngine, class DstLayout>
__forceinline__ __device__
void copy(const unsigned int tid,    // Thread index in CUDA block
          const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>& dst)

// Assumes pointers in both dst and src tensors are not extra aligned
template<uint32_t NumThreads,        // Number of threads performing copy operation
         class SrcEngine, class SrcLayout,
         class DstEngine, class DstLayout>
__forceinline__ __device__
void copy(const unsigned int tid,    // Thread index in CUDA block
          const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>& dst)

template<class BLAS,                 // BLAS description which provides the number of threads
         uint32_t AlignmentInBytes,  // Pointer alignment of src and dst tensor (minimum of them if they are different)
         class SrcEngine, class SrcLayout,
         class DstEngine, class DstLayout>
__forceinline__ __device__
void copy(const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>& dst)
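A hedged usage sketch of the first overload (explicit thread count plus thread id); the rest of this article uses the BLAS-description overload instead. The flattened thread id is an illustrative assumption:

#include <cublasdx.hpp>

template<unsigned NumThreads, class GEMM>
__global__ void load_a_to_smem(const typename GEMM::a_value_type* a) {
    extern __shared__ __align__(16) char smem[];
    auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<GEMM>(smem);

    auto a_global = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto a_shared = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());

    // Linear thread index within the block; all NumThreads threads must call the copy.
    const unsigned tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    cublasdx::copy<NumThreads, cublasdx::alignment_of<GEMM>::a>(tid, a_global, a_shared);
    cublasdx::copy_wait();
}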
- Bidirectional copies between SMEM/GMEM and RMEM (SMEM<->RMEM, GMEM<->RMEM):
// #1 Store fragment: partition and copy from a register fragment to a global / shared memory tensor
template<unsigned AlignmentInBytes,     // Alignment of source tensor pointer
         class TRC, class CFragLayout,  // Register memory fragment tensor
         class TC, class CLayout,       // Global or shared memory tensor
         class Partitioner>
__forceinline__ __device__
void copy_fragment(tensor<TRC, CFragLayout> const& tS,  // Calling thread's register fragment tensor
                   tensor<TC, CLayout>           & tD,  // Entire non-partitioned global / shared tensor
                   Partitioner const& p);

// #2 Load fragment: partition and copy from a global / shared memory tensor to a register fragment
template<unsigned AlignmentInBytes,     // Alignment of source tensor pointer
         class TRC, class CFragLayout,  // Register memory fragment tensor
         class TC, class CLayout,       // Global or shared memory tensor
         class Partitioner>
__forceinline__ __device__
void copy_fragment(tensor<TC, CLayout>      const& tS,  // Entire non-partitioned global / shared tensor
                   tensor<TRC, CFragLayout>      & tD,  // Calling thread's register fragment tensor
                   Partitioner const& p);
Steps for Invoking a GEMM
- Define the GEMM function descriptor, for example:
#include <cublasdx.hpp>
using namespace cublasdx;

using GEMM = decltype(Size<32, 32, 32>()
                      + Precision<double>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + Arrangement<cublasdx::row_major, cublasdx::col_major>()
                      + SM<700>()
                      + Block()
                      + BlockDim<256>());
- Prepare the GEMM inputs and outputs A/B/C. Typically A/B/C need to be loaded from GMEM into SMEM or RMEM, and the layouts of A/B/C in GMEM/SMEM/RMEM are all represented as tensors. The layout information contained in a tensor is computed automatically by cuBLASDx.
GMEM/SMEM: use cublasdx::make_tensor to create the tensors

template<class GEMM>
__global__ void gemm_kernel(typename GEMM::c_value_type  alpha,
                            typename GEMM::a_value_type* a,
                            typename GEMM::b_value_type* b,
                            typename GEMM::c_value_type  beta,
                            typename GEMM::c_value_type* c) {
    extern __shared__ __align__(16) char smem[];

    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Make shared memory tensors
    auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<GEMM>(smem);
    // smem_<a/b/c> are aligned to cublasdx::alignment_of<GEMM>::<a/b/c>
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
    auto c_shared_tensor = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());
}

RMEM: use the partitioner to create a fragment (also called a 1D tensor)

auto partitioner = BLAS::get_partitioner();
auto c_fragment_accumulator = partitioner.make_accumulator_fragment();
// Now you can access it as a regular 1D tensor:
auto val_0 = c_fragment_accumulator(0);
- Copy the A/B/C tensors from GMEM to SMEM (a cooperative operation)

// Load data from global memory tensors to shared memory tensors
using alignment = cublasdx::alignment_of<GEMM>;
cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
// <a/b/c>_shared_tensor, created from smem_<a/b/c>, is aligned to alignment::<a/b/c>
cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy<GEMM, alignment::c>(c_global_tensor, c_shared_tensor);
cublasdx::copy_wait();

Note: copying from SMEM to GMEM works analogously.
- Copy the A/B/C tensors from GMEM (matrix C) or SMEM (matrices A/B) to RMEM (a per-thread operation)

// Partition C and load it into a per-thread register fragment
using alignment = cublasdx::alignment_of<GEMM>;
auto partitioner = GEMM::get_partitioner();
auto c_fragment_accumulator = partitioner.make_accumulator_fragment();

// Load data from global to registers
cublasdx::copy_fragment<alignment::c>(c_global_tensor, c_fragment_accumulator, partitioner);
// Load data from shared to registers
cublasdx::copy_fragment<alignment::c>(c_shared_tensor, c_fragment_accumulator, partitioner);

Note: copying from RMEM to GMEM/SMEM works analogously.
- Invoke the GEMM
Shared memory API
#include <cublasdx.hpp>
using namespace cublasdx;

template<class GEMM>
__global__ void gemm_kernel_shared(const typename GEMM::c_value_type  alpha,
                                   const typename GEMM::a_value_type* a,
                                   const typename GEMM::b_value_type* b,
                                   const typename GEMM::c_value_type  beta,
                                   typename GEMM::c_value_type*       c) {
    extern __shared__ __align__(16) char smem[];

    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Make shared memory tensors
    auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<GEMM>(smem);
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
    auto c_shared_tensor = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());

    // Load data from global memory tensors to shared memory tensors
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy<GEMM, alignment::c>(c_global_tensor, c_shared_tensor);
    cublasdx::copy_wait();

    // Execute GEMM
    GEMM().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor);
    __syncthreads();

    // Store data from shared memory tensor to global memory tensor
    cublasdx::copy<GEMM, alignment::c>(c_shared_tensor, c_global_tensor);
}

Register API with accumulator
#include <cublasdx.hpp>
using namespace cublasdx;

template<class GEMM>
__global__ void gemm_kernel_registers_accumulation(const typename GEMM::a_value_type* a,
                                                   const typename GEMM::b_value_type* b,
                                                   typename GEMM::c_value_type*       c) {
    extern __shared__ __align__(16) char smem[];

    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Make shared memory tensors
    auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());

    // Load data from global memory tensors to shared memory tensors
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();

    // Get default data partitioner
    auto partitioner = GEMM::get_partitioner();
    // Create register fragment accumulator
    auto c_register_fragment = partitioner.make_accumulator_fragment();
    // Partition global C for GEMM and load appropriate elements into register fragment
    cublasdx::copy_fragment<alignment::c>(c_global_tensor, c_register_fragment, partitioner);

    // Execute GEMM with accumulation
    GEMM().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);

    // Partition global C for GEMM and store appropriate elements to global memory
    cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}

Register API without accumulator
#include <cublasdx.hpp>
using namespace cublasdx;

template<class GEMM>
__global__ void gemm_kernel_registers(const typename GEMM::a_value_type* a,
                                      const typename GEMM::b_value_type* b,
                                      typename GEMM::c_value_type*       c) {
    extern __shared__ __align__(16) char smem[];

    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Make shared memory tensors
    auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());

    // Load data from global memory tensors to shared memory tensors
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();

    // Execute GEMM and get register fragment results and data partitioner in return
    auto [c_register_fragment, partitioner] = GEMM().execute(a_shared_tensor, b_shared_tensor);

    // Partition global C for GEMM and store appropriate elements to global memory
    cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}
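The register APIs make it convenient to fuse element-wise work with the GEMM before the result ever leaves registers. A hedged sketch follows: the ReLU epilogue, the assumption of a real-valued GEMM, and the fragment's size() member are illustrative additions; everything else mirrors the register kernel above.

#include <cublasdx.hpp>
using namespace cublasdx;

template<class GEMM>
__global__ void gemm_kernel_registers_relu(const typename GEMM::a_value_type* a,
                                           const typename GEMM::b_value_type* b,
                                           typename GEMM::c_value_type*       c) {
    extern __shared__ __align__(16) char smem[];
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());

    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();

    // C = A * B; the result lands in a per-thread register fragment
    auto [c_register_fragment, partitioner] = GEMM().execute(a_shared_tensor, b_shared_tensor);

    // Fused epilogue: apply ReLU while the result is still in registers
    using c_t = typename GEMM::c_value_type;
    for (int i = 0; i < c_register_fragment.size(); ++i) {
        c_register_fragment(i) = c_register_fragment(i) > c_t(0) ? c_register_fragment(i) : c_t(0);
    }

    cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}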
- Launch the GEMM kernel. The kernel launch needs to know the threadblock dimensions and the SMEM size; both are computed by cuBLASDx (BLAS::block_dim, cublasdx::get_shared_storage_size, cublasdx::get_shared_storage_size_ab).

#include <cublasdx.hpp>
using namespace cublasdx;

// Kernels are unfolded in their appropriate sections above
template<class GEMM>
__global__ void gemm_kernel_shared(...);
template<class GEMM>
__global__ void gemm_kernel_registers_accumulation(...);
template<class GEMM>
__global__ void gemm_kernel_registers(...);

// CUDA_CHECK_AND_EXIT - macro that checks if a function returns cudaSuccess; if not, it prints the error code and exits the program

void introduction_example(value_type alpha, value_type *a, value_type *b, value_type beta, value_type *c) {
    using GEMM = decltype(Size<32, 32, 32>()
                          + Precision<double>()
                          + Type<type::real>()
                          + Arrangement<cublasdx::row_major, cublasdx::col_major>()
                          + Function<function::MM>()
                          + SM<700>()
                          + Block());

    // Shared memory API: C = alpha * A * B + beta * C
    // Invokes kernel with GEMM::block_dim threads in CUDA block
    gemm_kernel_shared<GEMM><<<1, GEMM::block_dim, cublasdx::get_shared_storage_size<GEMM>()>>>(1.0, a, b, 1.0, c);

    // Register fragment accumulation API: C = A * B + C
    // Invokes kernel with GEMM::block_dim threads in CUDA block
    gemm_kernel_registers_accumulation<GEMM><<<1, GEMM::block_dim, cublasdx::get_shared_storage_size_ab<GEMM>()>>>(a, b, c);

    // Register fragment API: C = A * B
    // Invokes kernel with GEMM::block_dim threads in CUDA block
    gemm_kernel_registers<GEMM><<<1, GEMM::block_dim, cublasdx::get_shared_storage_size_ab<GEMM>()>>>(a, b, c);
}
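One practical note on the launch: for larger GEMMs the shared-memory size computed by cuBLASDx can exceed the default dynamic shared-memory limit. In that case the standard CUDA opt-in applies; the sketch below is not cuBLASDx-specific, and it reuses gemm_kernel_shared, GEMM and CUDA_CHECK_AND_EXIT from the code above:

// Allow the kernel to use more dynamic shared memory than the default limit
// before launching it, if the GEMM requires it.
const auto smem_size = cublasdx::get_shared_storage_size<GEMM>();
CUDA_CHECK_AND_EXIT(cudaFuncSetAttribute(gemm_kernel_shared<GEMM>,
                                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                                         smem_size));
gemm_kernel_shared<GEMM><<<1, GEMM::block_dim, smem_size>>>(1.0, a, b, 1.0, c);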
Complete Example
#include <iostream>
#include <type_traits>
#include <cublasdx.hpp>
// CUDA_CHECK_AND_EXIT and the example:: helpers below come from the common
// headers that ship with the cuBLASDx examples.
template<class GEMM>
__global__ void gemm_kernel_shared(const typename GEMM::c_value_type alpha,
const typename GEMM::a_value_type* a,
const typename GEMM::b_value_type* b,
const typename GEMM::c_value_type beta,
typename GEMM::c_value_type* c) {
extern __shared__ __align__(16) char smem[];
// Make global memory tensor
auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
// Make shared memory tensor
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<GEMM>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());
// Load data from global memory tensor to shared memory tensor
using alignment = cublasdx::alignment_of<GEMM>;
cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy<GEMM, alignment::c>(c_global_tensor, c_shared_tensor);
cublasdx::copy_wait();
// Execute GEMM
GEMM().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor);
__syncthreads();
// Store data from shared memory tensor to global memory tensor
cublasdx::copy<GEMM, alignment::c>(c_shared_tensor, c_global_tensor);
}
template<class GEMM>
__global__ void gemm_kernel_registers_accumulation(const typename GEMM::a_value_type* a,
const typename GEMM::b_value_type* b,
typename GEMM::c_value_type* c) {
extern __shared__ __align__(16) char smem[];
// Make global memory tensor
auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
// Make shared memory tensor
auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
// Load data from global memory tensor to shared memory tensor
using alignment = cublasdx::alignment_of<GEMM>;
cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy_wait();
// Get default partitioner
auto partitioner = GEMM::get_partitioner();
// Create Register Fragment Accumulator
auto c_register_fragment = partitioner.make_accumulator_fragment();
// Partition Global C for GEMM and load appropriate elements into register fragment
cublasdx::copy_fragment<alignment::c>(c_global_tensor, c_register_fragment, partitioner);
// Execute GEMM with accumulation
GEMM().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);
// Partition Global C for GEMM and store appropriate elements to global memory
cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}
template<class GEMM>
__global__ void gemm_kernel_registers(const typename GEMM::a_value_type* a,
const typename GEMM::b_value_type* b,
typename GEMM::c_value_type* c) {
extern __shared__ __align__(16) char smem[];
// Make global memory tensor
auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
// Make shared memory tensor
auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
// Load data from global memory tensor to shared memory tensor
using alignment = cublasdx::alignment_of<GEMM>;
cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy_wait();
// Execute GEMM and get register fragment results and data partitioner in return
auto [c_register_fragment, partitioner] = GEMM().execute(a_shared_tensor, b_shared_tensor);
// Partition Global C for GEMM and store appropriate elements to global memory
cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}
template<unsigned int Arch>
int introduction_example() {
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
+ cublasdx::Precision<double>()
+ cublasdx::Type<cublasdx::type::real>()
+ cublasdx::Arrangement<cublasdx::row_major, cublasdx::col_major>()
+ cublasdx::Function<cublasdx::function::MM>()
+ cublasdx::SM<Arch>()
+ cublasdx::Block()
+ cublasdx::BlockDim<256>());
using value_type = typename example::uniform_value_type_t<GEMM>;
constexpr auto global_a_size = example::global_memory_size_of<GEMM>::a_size;
constexpr auto global_b_size = example::global_memory_size_of<GEMM>::b_size;
constexpr auto global_c_size = example::global_memory_size_of<GEMM>::c_size;
// Allocate managed memory for A, B, C matrices in one go
value_type* abc;
auto size = global_a_size + global_b_size + global_c_size;
auto size_bytes = size * sizeof(value_type);
CUDA_CHECK_AND_EXIT(cudaMallocManaged(&abc, size_bytes));
// Generate data
for (size_t i = 0; i < size; i++) {
abc[i] = double(i) / double(size);
}
value_type* a = abc;
value_type* b = abc + global_a_size;
value_type* c = abc + global_a_size + global_b_size;
// Shared Memory API: C = alpha * A * B + beta * C
// Invokes kernel with GEMM::block_dim threads in CUDA block
gemm_kernel_shared<GEMM><<<1, GEMM::block_dim, cublasdx::get_shared_storage_size<GEMM>()>>>(1.0, a, b, 1.0, c);
CUDA_CHECK_AND_EXIT(cudaPeekAtLastError());
// Register Fragment Accumulation API: C = A * B + C
// Invokes kernel with GEMM::block_dim threads in CUDA block
gemm_kernel_registers_accumulation<GEMM><<<1, GEMM::block_dim, cublasdx::get_shared_storage_size_ab<GEMM>()>>>(a, b, c);
CUDA_CHECK_AND_EXIT(cudaPeekAtLastError());
// Register Fragment API: C = A * B
// Invokes kernel with GEMM::block_dim threads in CUDA block
gemm_kernel_registers<GEMM><<<1, GEMM::block_dim, cublasdx::get_shared_storage_size_ab<GEMM>()>>>(a, b, c);
CUDA_CHECK_AND_EXIT(cudaPeekAtLastError());
CUDA_CHECK_AND_EXIT(cudaDeviceSynchronize());
CUDA_CHECK_AND_EXIT(cudaFree(abc));
std::cout << "Success" << std::endl;
return 0;
}
struct introduction_example_functor {
template<int Arch>
int operator()(std::integral_constant<int, Arch>) {
return introduction_example<Arch>();
}
};
int main(int, char**) {
return example::sm_runner(introduction_example_functor{});
}
References
- https://docs.nvidia.com/cuda/cublasdx/index.html
- https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization