Compute a matrix multiplication in which both matrix A and matrix B are stored in MXFP8 format. The equation below defines the reference semantics for correctness; optimized kernels should decode payloads and apply scales on the fly rather than materializing full FP32 $A_{\text{dequant}}$ or $B_{\text{dequant}}$.
$$c_{ij} = \sum_{\ell=0}^{K-1} A_{\text{dequant},\,i\ell}\, B_{\text{dequant},\,j\ell}.$$
Note: B is stored row-major as N×K (i.e. $B_{\text{dequant}}$ is N×K), so the multiplication is effectively $C = A_{\text{dequant}} B_{\text{dequant}}^T$.
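For concreteness, dequantization follows the OCP MX convention: each group of 32 consecutive elements along K shares one E8M0 scale, decoded as a power of two. A sketch of the rule in the logical (un-swizzled) view, where $s_A$ denotes A's scale array and $\mathrm{fp8}(\cdot)$ the FP8 payload decode (both symbols are our notation, not the task's):

$$A_{\text{dequant},\,i\ell} = 2^{\,s_A[i,\,\lfloor \ell/32 \rfloor] - 127}\cdot \mathrm{fp8}(qa[i,\ell]),$$

and likewise for $B_{\text{dequant}}$ from qb and scaleb.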
Input
- qa: MXFP8 payload bytes for matrix A of shape M×K (row-major)
- scalea: per-block E8M0 scale bytes for A, logical shape M×K/32
- qb: MXFP8 payload bytes for matrix B of shape N×K (row-major; transposed before multiply)
- scaleb: per-block E8M0 scale bytes for B, logical shape N×K/32
- M, N, K: matrix dimensions (K and N divisible by 32)
Output
- c: FP32 matrix of shape M×N where $C = A_{\text{dequant}} B_{\text{dequant}}^T$
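As a sanity reference, here is a minimal dequantize-then-matmul sketch of the semantics above. It is an illustration under stated assumptions, not the graded implementation: it assumes E4M3 payloads and, for readability, plain row-major M×K/32 (and N×K/32) scale bytes rather than the swizzled layout described in the notes below; `mxfp8_dequant_matmul` is our name, not part of the task.

```python
import torch

def mxfp8_dequant_matmul(qa, scalea, qb, scaleb, M, N, K):
    """Slow but unambiguous FP32 reference for the equation above.

    Assumptions (not fixed by the spec): E4M3 payload bytes and
    un-swizzled, row-major scale bytes so the block structure is visible.
    """
    BLOCK = 32
    # Reinterpret payload bytes as FP8 and widen to FP32.
    a = qa.view(torch.float8_e4m3fn).reshape(M, K).to(torch.float32)
    b = qb.view(torch.float8_e4m3fn).reshape(N, K).to(torch.float32)

    def e8m0_to_f32(s):
        # E8M0 stores only a biased exponent: value = 2^(byte - 127).
        out = torch.exp2(s.to(torch.float32) - 127.0)
        out[s == 0xFF] = float("nan")  # 0xFF is the E8M0 NaN encoding
        return out

    sa = e8m0_to_f32(scalea.reshape(M, K // BLOCK))
    sb = e8m0_to_f32(scaleb.reshape(N, K // BLOCK))
    # Each scale applies to 32 consecutive elements along K.
    a_dq = a * sa.repeat_interleave(BLOCK, dim=1)
    b_dq = b * sb.repeat_interleave(BLOCK, dim=1)
    # c_ij = sum_l A_dq[i, l] * B_dq[j, l]  ==  A_dq @ B_dq^T
    return a_dq @ b_dq.t()
```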
Notes
- Check out the MXFP8 format for more background.
- We use torch.scaled_mm as the reference implementation for correctness. scaled_mm expects the second matrix in column-major layout, so the reference passes $B_{\text{dequant}}^T$ (shape K×N) as the second argument; the result is still $C = A_{\text{dequant}} B_{\text{dequant}}^T$, logically unchanged.
- Scale tensors passed as scale_a / scale_b are assumed to already be laid out in the swizzled blockwise format that scaled_mm uses for MXFP8.
- Treat these pointers as scale storage already in the swizzled 32x4x4 layout; do not apply an additional swizzle.
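For orientation, a hedged sketch of how the scaled_mm reference path might be invoked. It assumes a recent PyTorch build that exposes the private torch._scaled_mm op with E8M0 block-scale support (Blackwell-class GPUs), the torch.float8_e8m0fnu dtype, and E4M3 payloads; exact argument conventions may differ in your environment.

```python
import torch

def scaled_mm_reference(qa, scalea, qb, scaleb, M, N, K):
    # Reinterpret payload bytes as FP8 (assumption: E4M3 payloads).
    a = qa.view(torch.float8_e4m3fn).reshape(M, K)
    b = qb.view(torch.float8_e4m3fn).reshape(N, K)
    # Scale bytes are already in the swizzled 32x4x4 blockwise layout,
    # so they are only reinterpreted as E8M0, never re-swizzled.
    sa = scalea.view(torch.float8_e8m0fnu)
    sb = scaleb.view(torch.float8_e8m0fnu)
    # b.t() is a K x N view with column-major strides, which is the
    # second-operand layout scaled_mm requires; no copy is made.
    # The product is A_dequant @ B_dequant^T, matching the spec.
    return torch._scaled_mm(a, b.t(), scale_a=sa, scale_b=sb,
                            out_dtype=torch.float32)
```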
Test Case Sizes
- 1024 x 1024 x 1024
- 2048 x 1024 x 2048
- 4096 x 2048 x 4096
- 4096 x 4096 x 4096
- 8192 x 4096 x 8192
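A possible sanity harness over these sizes, reusing the mxfp8_dequant_matmul sketch from above. Everything here is assumption-level scaffolding: my_mxfp8_gemm is a placeholder for the kernel under test, the sizes are read as M × N × K, and the random inputs avoid NaN payload encodings and use un-swizzled scales to match the sketch.

```python
import torch

SIZES = [(1024, 1024, 1024), (2048, 1024, 2048), (4096, 2048, 4096),
         (4096, 4096, 4096), (8192, 4096, 8192)]  # read as M, N, K

def random_inputs(M, N, K):
    qa = torch.randint(0, 256, (M * K,), dtype=torch.uint8)
    qb = torch.randint(0, 256, (N * K,), dtype=torch.uint8)
    for q in (qa, qb):
        q[(q & 0x7F) == 0x7F] = 0  # 0x7F / 0xFF are E4M3 NaN payloads
    # Exponents near 127 keep the 2^(s - 127) scales close to 1.0.
    scalea = torch.randint(120, 135, (M * (K // 32),), dtype=torch.uint8)
    scaleb = torch.randint(120, 135, (N * (K // 32),), dtype=torch.uint8)
    return qa, scalea, qb, scaleb

for M, N, K in SIZES:
    qa, scalea, qb, scaleb = random_inputs(M, N, K)
    ref = mxfp8_dequant_matmul(qa, scalea, qb, scaleb, M, N, K)
    # out = my_mxfp8_gemm(qa, scalea, qb, scaleb, M, N, K)  # kernel under test
    # torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
    print(f"{M}x{N}x{K}: reference shape {tuple(ref.shape)}")
```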