Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a…
Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a GEMM kernel is determined by an array of compile-time and runtime meta-parameters: CTA, warp and instruction level tile sizes, kernel schedules, rasterization strategies, cluster dimensions, split-k factors, and so on.