Goal: 74.5 TFLOPS with B200 (https://www.techpowerup.com/gpu-specs/b200.c4210)
Approach: naive
| Version | Change | TFLOPS |
|---|---|---|
| v0 | Baseline | 5 |
| v1 | Linear Shared Memory Tiling | 12 |
| v2 | Area Shared Memory Tiling | 28 |
| v3 | Register Tiling | 34 |
| v4 | Asynchronous Data Prefetching | 51 |
| v5 | Increased instruction-level parallelism + lower-latency / higher-throughput intrinsics such as __ffma2_rn | 56 |
| todo | Use multicast sub-tiles with cluster launch and add warp-tiling | would be closer to peak |
Tiling is important because it reduces the number of fetch operations issued to slower memory subsystems. Higher performance can be achieved by tiling for every level of the memory hierarchy in use.
Shared-memory tiling is the easiest one to apply. Shared memory's capacity is shared by all CUDA threads within a block, which makes it easy to manage I/O from global memory. Instead of accessing global memory randomly, threads can load a tile from global memory in a coalesced manner. Coalescing means the data requested by a warp is packed together, so the warp's threads need the minimal number of memory fetch operations on the hardware.
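As a minimal sketch (tile size, names, and the square-matrix assumption are illustrative, not taken from the versions in the table), a block can stage sub-matrices in shared memory with coalesced loads:

```cuda
#define TILE 32

// Shared-memory-tiled matmul sketch: C = A * B, square N x N matrices,
// N assumed to be a multiple of TILE. Launch with dim3(TILE, TILE) blocks.
__global__ void matmulSmemTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Consecutive threadIdx.x values read consecutive addresses,
        // so each warp's load of one tile row is coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Each element of A and B is now fetched from global memory once per tile instead of once per output element, which is the fetch reduction described above.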
Coalescing makes global-memory access efficient. The cost is the increased shared memory allocated per block, which may reduce the kernel's occupancy. Allocating too much shared memory reduces the number of blocks in flight, leaving less thread-level parallelism to hide latency.
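The occupancy impact of a given shared-memory budget can be checked directly with the CUDA occupancy API; this is a minimal sketch with an illustrative stub kernel standing in for a real tiled one:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stub standing in for a tiled kernel; the two 32x32 float tiles cost
// 8 KiB of static shared memory per block.
__global__ void tiledKernel(const float* A, float* C) {
    __shared__ float As[32][32];
    __shared__ float Bs[32][32];
    As[threadIdx.y][threadIdx.x] = A[threadIdx.y * 32 + threadIdx.x];
    __syncthreads();
    Bs[threadIdx.x][threadIdx.y] = As[threadIdx.y][threadIdx.x];
    __syncthreads();
    C[threadIdx.y * 32 + threadIdx.x] = Bs[threadIdx.y][threadIdx.x];
}

int main() {
    int blocksPerSM = 0;
    // 1024 threads per block, no dynamic shared memory beyond the static tiles.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, tiledKernel, 1024, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

Growing the static tiles shrinks the reported blocks-per-SM count, which is exactly the latency-hiding tradeoff described above.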
If shared memory is also accessed randomly, it should be accessed without bank conflicts. On current GPUs, shared memory has 32 memory banks that serve data to threads independently. When two or more threads access different addresses that map to the same bank, the accesses are serialized, because a bank can serve only one word at a time. This is not a problem when the addresses are identical: the same word is broadcast to all requesting threads. But when the addresses differ, the bank is accessed repeatedly, serialized n times for an n-way conflict.
To avoid bank conflicts in shared memory, there are simple tricks: pad the tile's leading dimension by one element, or swizzle the indexing so that the threads of a warp map to distinct banks.
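The padding trick can be shown with the classic transpose example (a sketch; tile size and names are illustrative): a 32-float row spans all 32 banks, so reading a column of an unpadded `[32][32]` tile is a 32-way conflict, while one extra float of padding per row makes the same read conflict-free.

```cuda
__global__ void transposeTile(const float* in, float* out, int N) {
    // With [32][32], the column read below would hit one bank 32 times;
    // padding the leading dimension to 33 shifts each row by one bank,
    // so tile[threadIdx.x][threadIdx.y] touches 32 distinct banks.
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];    // coalesced row write
    __syncthreads();

    int tx = blockIdx.y * 32 + threadIdx.x;
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * N + tx] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free
}
```

The cost is one wasted float per row of the tile, which is almost always a good trade against a 32x serialization.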
Register tiling is similar to shared-memory tiling, but registers are private to the owning thread. To make sure that the tile stays in registers and is not spilled to local memory (which is just L1-cached global memory with extra caching latency), some conditions must be met: every index into the per-thread array must be resolvable at compile time (for example, by fully unrolling the loops over the tile), and the per-thread register usage must stay within the hardware limit; otherwise the compiler spills the array to local memory.
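A minimal sketch of the compile-time-index condition (sizes and names are illustrative): the per-thread array below stays in registers only because unrolling turns every index into a constant.

```cuda
__global__ void registerResidency(const float* in, float* out, int bad) {
    float tile[8];

    // Stays in registers: the loop is fully unrolled, so tile[i]
    // becomes eight independently named registers.
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        tile[i] = in[threadIdx.x * 8 + i];

    // Would spill: tile[bad] has a runtime index, so the compiler
    // would have to place the whole array in addressable local memory.
    // float x = tile[bad];

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        sum += tile[i];
    out[threadIdx.x] = sum;
}
```

Whether a spill actually happened can be confirmed by compiling with `nvcc -Xptxas -v` and checking the reported local-memory usage.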
Once the data is in registers, any instruction runs at maximum throughput. Fused multiply-add is the most important instruction for matrix multiplication, and it is best fed from registers directly. Even when its operands come from shared or global memory, the compiler still moves the data into registers before issuing the instruction. So when the data is already in registers, this extra data movement disappears and FMA runs at maximum throughput.
Since shared memory can serve only 32 reads per cycle for the whole SM, register tiling improves operand throughput by nearly 10 times. For example, the fmaf instruction can consume three register operands per cycle per thread even when all cores of the SM are active, whereas shared memory can feed only one fmaf operand to 32 threads per cycle. Maximum TFLOPS can only be achieved through registers. So it is important to stage the data in registers before doing many-to-many computations (such as sub-matrix by sub-matrix computations in an inner-product scheme, or line-by-line computations in an outer-product scheme).
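This operand-throughput argument is why the hot loop of a fast kernel looks like an outer product over register fragments. A sketch with illustrative tile sizes TM = TN = 4: each k step reads only TM + TN = 8 values from shared memory but issues TM * TN = 16 FMAs whose operands are all registers.

```cuda
#define TM 4
#define TN 4

// Per-thread outer-product step: smemA/smemB are illustrative 32x32
// tiles already staged in shared memory by the block, acc is the
// thread's register tile of partial results.
__device__ void outerProductStep(const float (&smemA)[32][32],
                                 const float (&smemB)[32][32],
                                 float (&acc)[TM][TN], int k) {
    float aFrag[TM], bFrag[TN];
    #pragma unroll
    for (int i = 0; i < TM; ++i) aFrag[i] = smemA[threadIdx.y * TM + i][k];
    #pragma unroll
    for (int j = 0; j < TN; ++j) bFrag[j] = smemB[k][threadIdx.x * TN + j];

    // All three __fmaf_rn operands are registers, so each FMA issues
    // at full throughput with no shared-memory traffic.
    #pragma unroll
    for (int i = 0; i < TM; ++i)
        #pragma unroll
        for (int j = 0; j < TN; ++j)
            acc[i][j] = __fmaf_rn(aFrag[i], bFrag[j], acc[i][j]);
}
```

Growing TM and TN raises the FMA-per-shared-memory-read ratio quadratically, until register pressure starts to cut into occupancy.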