<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Performance on Share what you know</title><link>https://pablodelgado.org/tags/performance/</link><description>Recent content in Performance on Share what you know</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 25 Apr 2026 12:00:00 +0000</lastBuildDate><atom:link href="https://pablodelgado.org/tags/performance/index.xml" rel="self" type="application/rss+xml"/><item><title>How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog</title><link>https://pablodelgado.org/posts/2026/04/25/first-post/</link><pubDate>Sat, 25 Apr 2026 12:00:00 +0000</pubDate><guid>https://pablodelgado.org/posts/2026/04/25/first-post/</guid><description>&lt;p&gt;In this post, I&amp;rsquo;ll iteratively optimize an implementation of matrix multiplication
written in CUDA.
&lt;span class="sidenote"&gt;This post is a recreation of &lt;a href="https://siboehm.com/articles/22/CUDA-MMM"&gt;Simon Boehm&amp;rsquo;s excellent
worklog&lt;/a&gt;, used here as a template to
demonstrate Tufte-CSS features: side notes, margin notes, side figures,
full-width figures, code blocks, and references.&lt;/span&gt;

My goal is not to build a cuBLAS replacement, but to deeply understand the most
important performance characteristics of the GPUs used for modern deep
learning. These include coalescing global memory accesses, shared memory
caching, and occupancy optimizations, among others.&lt;/p&gt;
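&lt;p&gt;As a minimal sketch of the kind of naive kernel a worklog like this typically starts from (the &lt;code&gt;sgemm_naive&lt;/code&gt; name and the SGEMM-style interface, computing C = alpha*A*B + beta*C over row-major matrices, are illustrative assumptions, not taken from the post itself), each thread computes one element of the output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical naive starting point: one thread per output element of
// C = alpha*A*B + beta*C, where A is MxK, B is KxN, C is MxN, all row-major.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  const int x = blockIdx.x * blockDim.x + threadIdx.x; // row of C
  const int y = blockIdx.y * blockDim.y + threadIdx.y; // column of C
  if (x &gt;= M || y &gt;= N) return; // guard threads outside the MxN output
  float tmp = 0.0f;
  for (int i = 0; i &amp;lt; K; ++i) {
    tmp += A[x * K + i] * B[i * N + y]; // dot product of row x of A and column y of B
  }
  C[x * N + y] = alpha * tmp + beta * C[x * N + y];
}&lt;/code&gt;&lt;/pre&gt;</description></item></channel></rss>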