Quantize an input FP16 matrix into the NVFP4 format. NVFP4 uses a two-level scaling strategy: a global scale moves the entire tensor into the representable range of a block (FP4 × FP8), then a local per-block scale moves each block into FP4 range. A rough outline is to:
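1. Split the [M, K] matrix into contiguous blocks of 16 elements along K.
2. Compute a scale for each block and store it as FP8 E4M3 (this is the scale output).
3. Quantize each element to FP4 E2M1 using its block's scale.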
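The outline above can be sketched on the host for a single block of 16 values. This is only an illustration under stated assumptions, not the reference implementation: the scaling recipe used here (block scale = global scale × block max / 6, then each element multiplied by global scale / block scale) is a commonly used NVFP4 recipe that this page does not spell out, and the names `float_to_e2m1`, `quantize_block`, and `BlockResult` are hypothetical. The block scale is left in float; a real kernel would round it to FP8 E4M3 and divide by the rounded value.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// The eight non-negative magnitudes of FP4 E2M1, indexed by the 3-bit code.
constexpr std::array<float, 8> kE2M1 = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round to the nearest E2M1 value (ties go to the even code), saturating at 6.
// Returns a 4-bit code: bit 3 is the sign, bits 2..0 index kE2M1.
inline uint8_t float_to_e2m1(float v) {
    const uint8_t sign = std::signbit(v) ? 0x8 : 0x0;
    const float a = std::fabs(v);
    uint8_t best = 0;
    for (uint8_t c = 1; c < 8; ++c) {
        const float d_best = std::fabs(a - kE2M1[best]);
        const float d_c = std::fabs(a - kE2M1[c]);
        if (d_c < d_best || (d_c == d_best && c % 2 == 0)) best = c;
    }
    return static_cast<uint8_t>(sign | best);
}

struct BlockResult {
    float scale;                    // per-block scale BEFORE E4M3 rounding (simplification)
    std::array<uint8_t, 16> codes;  // one unpacked 4-bit E2M1 code per element
};

// Quantize one block of 16 values. The recipe is an assumption (see lead-in).
inline BlockResult quantize_block(const float* x, float global_scale) {
    assert(x != nullptr);
    float amax = 0.0f;
    for (int i = 0; i < 16; ++i) amax = std::fmax(amax, std::fabs(x[i]));

    BlockResult r{};
    r.scale = global_scale * amax / 6.0f;  // 6 is the largest E2M1 magnitude
    const float mul = r.scale > 0.0f ? global_scale / r.scale : 0.0f;
    for (int i = 0; i < 16; ++i) r.codes[i] = float_to_e2m1(x[i] * mul);
    return r;
}
```

With a block whose largest magnitude is 3.0 and a global scale of 2.0, the sketch gives a block scale of 1.0, so 3.0 maps to the E2M1 value 6 and -1.5 maps to -3. Packing two codes into each output byte is left out because the nibble order is not stated here.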
fp16 pointer to the row-major input tensor of shape [M, K]
fp32 scalar, the global scale factor
uint8 pointer, packed E2M1 quantized values (two 4-bit values per byte)
uint8 pointer, FP8 E4M3 per-block scale factors in the swizzled 128x4 layout (see below)

The scale factors are not stored in naive row-major order; they must be arranged in a swizzled layout for tensor core consumption.
To do this, we first tile the 2D array of scale factors into tiles of 128 rows by 4 columns. Pad M to a multiple of 128; this is needed to pass the sample. Then, within each 128-row M-tile, reorder the 128 rows as a 32×4 column-major block. That is, the rows are grouped as 0..31, 32..63, 64..95, and 96..127, but interleaved column-first so that rows 32 apart in logical space become adjacent in memory. Thus, the row order in memory is: 0, 32, 64, 96, 1, 33, 65, 97, etc. Check out the cuBLAS 1D Block Scaling Factors Layout documentation for more info.
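The reordering described above can be written as an index function. The sketch below is one reading of that layout, not code taken from cuBLAS or FlashInfer. It assumes that each 128x4 tile occupies 512 contiguous bytes, that tiles are stored row-major over (M-tile, K-tile), and that the number of scale columns is padded up to a multiple of 4. The name `swizzled_sf_offset` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of the scale factor for logical position (row, k_block), where
// k_block = k / 16 and num_k_blocks = K / 16. Layout assumptions are in the lead-in.
inline std::size_t swizzled_sf_offset(std::size_t row, std::size_t k_block,
                                      std::size_t num_k_blocks) {
    assert(k_block < num_k_blocks);
    const std::size_t k_tiles = (num_k_blocks + 3) / 4;  // assumed padding to 4 columns
    const std::size_t m_tile = row / 128;
    const std::size_t k_tile = k_block / 4;
    const std::size_t r = row % 128;     // row inside the 128-row tile
    const std::size_t c = k_block % 4;   // column inside the 4-column tile

    const std::size_t tile_base = (m_tile * k_tiles + k_tile) * 512;
    // Rows r, r+32, r+64, r+96 sit next to each other, each with its 4 columns.
    return tile_base + (r % 32) * 16 + (r / 32) * 4 + c;
}
```

For a single K-tile, rows 0, 32, 64, 96, and 1 land at byte offsets 0, 4, 8, 12, and 16, which matches the memory order listed above. Under the same assumptions, the output buffer needs room for M rounded up to a multiple of 128, times the number of scale columns rounded up to a multiple of 4.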
We use FlashInfer's nvfp4_quantize with SfLayout.layout_128x4 (the default layout) as the ground truth. Submissions are validated by dequantizing both the reference and submitted outputs via e2m1_and_ufp8sf_scale_to_float and checking closeness.
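To sanity-check an output locally before submitting, a single element can be dequantized by hand. The sketch below is not FlashInfer's `e2m1_and_ufp8sf_scale_to_float`; it is a hypothetical host-side helper that decodes one E2M1 code and one E4M3 scale byte using the standard bit layouts. It assumes the dequantized value is E2M1 value × block scale / global scale, which is an assumed convention that this page does not state.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode a 4-bit E2M1 code: bit 3 is the sign, bits 2..0 select the magnitude.
inline float e2m1_to_float(uint8_t code) {
    static const float kMag[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    assert(code < 16);
    const float mag = kMag[code & 0x7];
    return (code & 0x8) ? -mag : mag;
}

// Decode an FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
inline float e4m3_to_float(uint8_t b) {
    const int e = (b >> 3) & 0xF;
    const int m = b & 0x7;
    float mag;
    if (e == 0xF && m == 0x7) {
        mag = std::numeric_limits<float>::quiet_NaN();  // the only NaN encoding
    } else if (e == 0) {
        mag = std::ldexp(static_cast<float>(m), -9);    // subnormal: (m / 8) * 2^-6
    } else {
        mag = std::ldexp(1.0f + m / 8.0f, e - 7);       // normal: (1 + m / 8) * 2^(e - 7)
    }
    return (b & 0x80) ? -mag : mag;
}

// Assumed convention: value = E2M1 value * block scale / global scale.
inline float dequantize(uint8_t code, uint8_t block_scale_e4m3, float global_scale) {
    return e2m1_to_float(code) * e4m3_to_float(block_scale_e4m3) / global_scale;
}
```

As a quick check, the scale byte 0x38 decodes to 1.0 and 0x7E decodes to 448, the largest finite E4M3 value.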