
Optimizing The (Square) Matrix Multiplication

Hüseyin Tuğrul BÜYÜKIŞIK · Nov 24, 2025


Worklog

Goal: 74.5 TFLOPS With B200 (https://www.techpowerup.com/gpu-specs/b200.c4210)

Current: 56 TFLOPS

Setup

  • Input: 32-bit floating point (FP32)
  • Hardware: B200 GPU (Blackwell Architecture, SM100, 228 kB shared-memory capacity per SM unit, maximum 64 resident warps per SM unit)
  • Metric: TFLOPS

Baseline

Approach: naive

  • 5 TFLOPS.
  • Bottlenecked by everything from non-coalesced access to low compute intensity, branching, and non-cached loads.
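For reference, the naive baseline looks roughly like this — one thread per output element, every operand fetched from global memory, so each result costs O(N) global loads (an illustrative sketch, not necessarily the exact v0 kernel):

```cuda
// Naive baseline: one thread per C[row][col], no data reuse.
// Every fmaf operand comes straight from global memory, so compute
// intensity is very low and the kernel is memory-latency bound.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```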

Results

| Version | Change | TFLOPS |
|---|---|---|
| v0 | Baseline | 5 |
| v1 | Linear Shared Memory Tiling | 12 |
| v2 | Area Shared Memory Tiling | 28 |
| v3 | Register Tiling | 34 |
| v4 | Asynchronous Data Prefetching | 51 |
| v5 | Increasing Instruction-Level Parallelism + lower-latency / higher-throughput intrinsics such as `__ffma2_rn` | 56 |
| todo | use multi-casted sub-tiles with cluster-launch and add warp-tiling | would be closer to peak |

Tiling

Tiling is important because it reduces the number of fetches from slower levels of the memory hierarchy. Higher performance is achieved by tiling for every level of memory in use.

Shared Memory Tiling

Shared-memory tiling is the easiest to apply. Shared memory's capacity is shared by all CUDA threads within a block, which makes it easy to manage I/O from global memory. Instead of accessing global memory randomly, threads can load data from global memory in a coalesced manner: the data for a warp is packed together, so the warp's threads need a minimal number of hardware memory transactions.

Coalescing makes global-memory access efficient. The cost is the increased shared memory allocated per block, which may reduce the kernel's occupancy: too large an allocation reduces the number of blocks in flight, leaving less thread-level parallelism to hide latency with.
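A minimal sketch of an area shared-memory-tiled kernel, assuming square N×N matrices with N a multiple of the tile size (illustrative, not necessarily the article's exact v1/v2 kernel):

```cuda
#define TILE 32

// C = A * B for square N x N matrices; N assumed to be a multiple of TILE.
// Each block loads a TILE x TILE sub-matrix of A and B into shared memory
// once, then reuses it TILE times, cutting global-memory traffic by ~TILE.
__global__ void matmul_smem(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x reads consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc = fmaf(As[threadIdx.y][k], Bs[k][threadIdx.x], acc);
        __syncthreads();  // done reading before the next tile overwrites smem
    }
    C[row * N + col] = acc;
}
```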

If shared memory is itself accessed randomly, the accesses should be free of bank conflicts. On current GPUs, shared memory has 32 banks that serve data to threads independently. When two or more threads of a warp access different addresses that map to the same bank, the accesses are serialized, because a bank can serve only one address at a time. Identical addresses are no problem — the same value is broadcast to all requesting threads — but with n different addresses on one bank, the bank runs n times, serialized.

To avoid bank conflicts in shared-memory, there are simple tricks:

  • padding the stride between threads: pro = simple. con = the extra shared-memory allocation can decrease occupancy.
  • shuffling the access order between threads without padding: pro = no extra allocation. con = index-calculation overhead.
  • an extra kernel that preprocesses the data to change the access pattern: pro = no extra allocation, no per-access overhead. con = the extra kernel adds latency to the stream's synchronization.
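The padding trick, for instance, is a one-line change (illustrative declarations, assuming 32 banks of 4-byte words):

```cuda
// Without padding: threads reading a column tile[threadIdx.x][k] are
// 32 words apart, so they all map to the same bank and serialize 32x.
__shared__ float conflicted[32][32];

// With +1 padding the row stride becomes 33 words, so consecutive rows
// land in different banks and the same column read is conflict-free.
__shared__ float padded[32][32 + 1];
```

The cost is the extra column of shared memory per tile, which is what can nibble at occupancy on smem-heavy kernels.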

Register Tiling

Register tiling is similar to shared-memory tiling, but registers are private to the owning thread. To make sure the tile stays in registers and is not spilled to local memory (which is just global memory cached in L1, with extra latency), some conditions must be met:

  • access to each array element must use static indexing known at compile time. Dynamic indexing doesn't work for registers, so any index the compiler cannot resolve forces a spill to the slower local memory. If the index is computed inside a loop, the loop must be unrolled so that each access becomes static.
  • the size of the tile should not exceed the per-block limit (64K registers) or the per-thread limit (255 registers); these values vary across CUDA architectures. One also needs to account for the registers the rest of the kernel uses. Keep the tile's scope as narrow as possible — only where it is required — to keep register pressure low and let the compiler re-use registers efficiently.

Once the data is in registers, any instruction runs at maximum throughput. Fused multiply-add is the most important instruction for matrix multiplication, and it is best fed from registers directly. Even with shared-memory or global-memory operands, the compiler still moves the data into registers before issuing the instruction; when the data is already in registers, that extra movement disappears and FMA runs at maximum throughput.
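A sketch of a per-thread register tile performing the outer-product (rank-1) update; TM and TN are illustrative values, and `#pragma unroll` plus static indices are what keep the arrays in registers:

```cuda
#define TM 4
#define TN 4

// Each thread accumulates a TM x TN sub-tile of C held entirely in
// registers. Fully unrolled loops give static indexing, so the compiler
// keeps a[], b[] and acc[][] in registers instead of spilling them.
__device__ void rank1_update(const float (&a)[TM], const float (&b)[TN],
                             float (&acc)[TM][TN])
{
    #pragma unroll
    for (int i = 0; i < TM; ++i)
        #pragma unroll
        for (int j = 0; j < TN; ++j)
            acc[i][j] = fmaf(a[i], b[j], acc[i][j]);  // all operands in registers
}
```

Note the arithmetic intensity: each call performs TM×TN FMAs while loading only TM+TN values into registers, versus two loads per FMA in the purely shared-memory version.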

Since shared memory can serve only 32 words per cycle to the whole SM (one per bank), register tiling improves operand throughput by roughly an order of magnitude. The fmaf instruction can consume 3 register operands per cycle per thread even when all cores of the SM are active, whereas shared memory can feed only 32 threads with a single fmaf operand per cycle. Maximum TFLOPS can only be achieved through registers, so it is important to stage the data in registers before doing many-to-many computations (such as sub-matrix to sub-matrix products in the inner-product formulation, or line-to-line products in the outer-product formulation).

Notes

  • B200 has HBM memory, which adds significant latency to each global-memory access within a CUDA kernel. Hiding this latency is not easy, especially when the data fits neither the L1 cache nor the L2 cache.
  • B200 also supports extra instructions, such as asynchronous copies, to overcome the difficulty of hiding the extra HBM latency.
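The asynchronous-copy machinery mentioned in the notes (and behind the v4 prefetching step) can be sketched with cooperative-groups `memcpy_async`, available since Ampere — Blackwell layers TMA bulk copies on top. This is a hypothetical standalone demo, not the article's kernel; it assumes `blockDim.x == 256` and `n` a multiple of 256:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Double-buffered prefetch: while the block computes on one shared-memory
// buffer, the asynchronous copy fills the other, overlapping HBM latency
// with math instead of stalling on it.
__global__ void prefetch_sum(const float* __restrict__ in, float* out, int n)
{
    __shared__ float buf[2][256];
    auto block = cg::this_thread_block();
    int tiles = n / 256;

    int stage = 0;
    cg::memcpy_async(block, buf[0], in, sizeof(float) * 256);

    float acc = 0.0f;
    for (int t = 0; t < tiles; ++t) {
        int next = stage ^ 1;
        if (t + 1 < tiles) {
            // Kick off the next tile's copy before consuming the current one.
            cg::memcpy_async(block, buf[next], in + (t + 1) * 256,
                             sizeof(float) * 256);
            cg::wait_prior<1>(block);   // wait for the current tile only
        } else {
            cg::wait(block);            // last tile: drain all copies
        }
        acc += buf[stage][threadIdx.x]; // ... compute on the ready buffer ...
        block.sync();                   // finish reading before buffer reuse
        stage = next;
    }
    out[threadIdx.x] = acc;
}
```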
