Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training

Decorative image. In this blog post, we’ll break down the main FP8 scaling strategies—per-tensor scaling, delayed and current scaling, and per-block scaling (including the…

In this blog post, we’ll break down the main FP8 scaling strategies—per-tensor scaling, delayed and current scaling, and per-block scaling (including the Blackwell-backed MXFP8 format)—and explain why each is essential for maintaining numerical stability and accuracy during low-precision training. Understanding these approaches will help with choosing the right recipe for your own FP8 workflows.

Source

Leave a Reply Cancel reply