Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. New research from NVIDIA describes a pretraining methodology built around NVFP4, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. The research team states that this is the longest publicly documented training run in 4-bit precision to date. The resulting model attains 62.58% on MMLU-Pro 5-shot versus 62.62% for the FP8 baseline, and the method is supported in NVIDIA’s Transformer Engine.

What NVFP4 Actually Is

To understand why NVFP4 matters, it helps to revisit how microscaling formats work. In a microscaling (MX) format, a contiguous block of low-precision elements shares a single scale factor, which is used to map the block back into a wider numerical range during the matrix multiply. MXFP4 uses 32-element blocks in which each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), encoding only the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. Block scale factors are stored in UE8M0, which restricts them to powers of two.
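To make the E2M1 value set concrete, here is a minimal sketch that decodes all sixteen 4-bit codes. The function name `decode_e2m1` and the bit layout (sign in the top bit, then exponent, then mantissa, with an exponent bias of 1) are illustrative assumptions for this example, not code from the paper.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code laid out as sign | 2 exponent bits | 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only nonzero magnitude is 0.5.
        return sign * 0.5 * man
    # Normal: implicit leading 1 and an exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


values = [decode_e2m1(c) for c in range(16)]
magnitudes = sorted(set(abs(v) for v in values))
print(magnitudes)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The spacing between neighbors grows from 0.5 to 2 across the range, which is why the choice of block scale matters so much for the largest values in a block.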

NVFP4 changes three things. First, the block size drops from 32 to 16 elements, narrowing the dynamic range each scale has to cover. Second, block scale factors are stored in E4M3 rather than UE8M0, trading exponent range for mantissa precision so the per-block amax (absolute maximum) can be mapped much closer to the FP4 maximum representable value. Third, NVFP4 adds a second scaling level: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves stay in range. The result is that at least 6.25% of values in each block (one in 16, the per-block amax) are represented at near-FP8 precision, while the rest sit in FP4.
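The two-level scaling can be sketched as a NumPy quantize-dequantize simulation. This is a simplified illustration under stated assumptions, not the Transformer Engine implementation: the helper names are hypothetical, the per-tensor scale is chosen so the largest block scale lands at the E4M3 maximum (448), and ties in element rounding resolve to the lower magnitude rather than round-to-nearest-even.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX, E4M3_MAX = 6.0, 448.0


def round_to_e2m1(x):
    """Round to the nearest E2M1 magnitude, keeping the sign (ties go to the lower value)."""
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def round_to_e4m3(s):
    """Simulate rounding non-negative scales to E4M3 (3 mantissa bits, max 448)."""
    s = np.clip(s, 0.0, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(s, 2.0 ** -9)))
    exp = np.maximum(exp, -6.0)      # below 2^-6 the format is subnormal
    step = 2.0 ** (exp - 3)          # spacing between representable values
    return np.minimum(np.round(s / step) * step, E4M3_MAX)


def nvfp4_fake_quantize(x, block=16):
    """Quantize-dequantize a 2D tensor using 1x16 blocks along the last axis."""
    rows, cols = x.shape
    blocks = x.reshape(rows, cols // block, block)
    # Level 2 (assumed rule): FP32 per-tensor scale keeps block scales within E4M3 range.
    tensor_amax = max(float(np.abs(x).max()), float(np.finfo(np.float32).tiny))
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX)
    # Level 1: per-block scale maps the block amax to the FP4 maximum (6).
    block_amax = np.abs(blocks).max(axis=-1, keepdims=True)
    block_scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
    decode = block_scale * tensor_scale
    safe = np.where(decode == 0, 1.0, decode)
    q = round_to_e2m1(blocks / safe)
    return (q * decode).reshape(rows, cols), q


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
x_hat, q = nvfp4_fake_quantize(x)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative error: {rel_err:.3f}")
```

Because the block scale carries three mantissa bits, the block amax usually lands on or near the FP4 value 6 instead of being pushed down to a coarser power-of-two boundary.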

On NVIDIA Blackwell, FP4 GEMMs run at 4× BF16 throughput on GB200 and 6× on GB300, which translates to roughly 2× and 3× speedups over FP8. Operand memory footprint is roughly halved compared to FP8.
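The "roughly halved" figure follows from amortizing one 8-bit block scale over each block. A back-of-the-envelope sketch, assuming the FP32 per-tensor scale and any FP8 scale overhead are negligible, and taking FP8 at 2× BF16 throughput as listed in the explainer's throughput figures:

```python
def bits_per_element(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per element when one scale is shared by each block."""
    return element_bits + scale_bits / block_size


nvfp4_bits = bits_per_element(4, 8, 16)  # E2M1 elements + one E4M3 scale per 16
mxfp4_bits = bits_per_element(4, 8, 32)  # E2M1 elements + one UE8M0 scale per 32
fp8_bits = 8.0                           # scale overhead ignored (assumption)

print(nvfp4_bits, mxfp4_bits, nvfp4_bits / fp8_bits)  # 4.5 4.25 0.5625

# Throughput relative to FP8, with FP8 at 2x BF16.
speedup_over_fp8 = {"GB200": 4 / 2, "GB300": 6 / 2}
print(speedup_over_fp8)  # {'GB200': 2.0, 'GB300': 3.0}
```

NVFP4 pays a quarter of a bit more per element than MXFP4 for its smaller blocks, which is the cost side of the accuracy gains discussed later.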

What’s Quantized — and What Isn’t

Only the three GEMMs inside linear (fully-connected) layers, namely Fprop, Dgrad, and Wgrad, actually run in NVFP4. Embeddings, the output projection head, normalization layers, non-linearities, and all attention components (softmax and the query-key and attention score-value batched GEMMs) stay in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are kept in FP32. Tensor-parallel reductions run in BF16.

The Four-Part Training Methodology

Quantizing every linear-layer GEMM to NVFP4 with default settings (1×16 block scaling everywhere, round-to-nearest-even on every tensor, no transforms) diverges early in training. NVIDIA’s approach stabilizes it with four components, and ablation studies on the 12B model show each is necessary.

Selective high precision: Linear layers in the first two and the final eight of the 62 blocks (about 16% of all linear layers) are kept in BF16. Ablations indicated that the final blocks are the sensitive ones because they require more dynamic range than FP4 provides; keeping only the final four blocks in BF16 was also sufficient for stable convergence.
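As a quick check on the block selection, the sketch below lists which block indices would stay in BF16 for a 62-block model. The helper name is hypothetical, and equating the share of blocks with the share of linear layers assumes linear layers are spread roughly evenly across blocks.

```python
def bf16_block_indices(num_blocks: int, first: int, last: int) -> set[int]:
    """Indices of blocks whose linear layers stay in BF16."""
    return set(range(first)) | set(range(num_blocks - last, num_blocks))


kept = bf16_block_indices(62, first=2, last=8)
fraction = len(kept) / 62
print(sorted(kept), f"{fraction:.1%}")
# [0, 1, 54, 55, 56, 57, 58, 59, 60, 61] 16.1%
```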

Random Hadamard Transforms (RHT): Outliers in weight gradients are spread into an approximately Gaussian distribution by multiplying the input tiles with a 16×16 Hadamard matrix combined with a random ±1 sign vector. Because the orthogonal transforms cancel inside the dot product, no mathematical correction is needed in the GEMM. The d=16 size was chosen empirically: d=4 hurt convergence, and d=128 gave similar results. RHT is applied only to the inputs of the weight-gradient (Wgrad) GEMM, and a single random sign vector is shared across all linear layers. Randomization itself was a no-op at the 1.2B scale but measurably improved the 12B run.
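The cancellation property can be checked numerically. The sketch below builds an orthonormal 16×16 Hadamard matrix with the Sylvester construction, applies a random sign vector, and transforms both Wgrad operands in tiles of 16 along the reduction (token) dimension. Function names and tensor shapes are illustrative assumptions, and no quantization is applied here; in training, the operands would be quantized after the transform and before the GEMM.

```python
import numpy as np


def hadamard(d: int) -> np.ndarray:
    """Sylvester construction of a d x d Hadamard matrix (d a power of two), orthonormal."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(d)


def random_hadamard(d: int, rng) -> np.ndarray:
    """Hadamard matrix with rows flipped by a random +/-1 sign vector (still orthogonal)."""
    signs = rng.choice([-1.0, 1.0], size=d)
    return np.diag(signs) @ hadamard(d)


def rht_along_k(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Apply h^T to every tile of d consecutive rows of x (the reduction dimension)."""
    d = h.shape[0]
    k, n = x.shape
    tiles = x.reshape(k // d, d, n)
    return (h.T @ tiles).reshape(k, n)


rng = np.random.default_rng(0)
h = random_hadamard(16, rng)
x = rng.standard_normal((64, 8))    # activations: tokens x in_features
dy = rng.standard_normal((64, 4))   # output gradients: tokens x out_features
x[5, 0] = 100.0                     # inject a single outlier

wgrad_ref = x.T @ dy
wgrad_rht = rht_along_k(x, h).T @ rht_along_k(dy, h)
print(np.allclose(wgrad_ref, wgrad_rht))                   # True: the transforms cancel
print(np.abs(x).max(), np.abs(rht_along_k(x, h)).max())    # the outlier is spread out
```

The outlier of magnitude 100 is redistributed across the 16 entries of its tile at roughly a quarter of its size each, which is what makes the block easier to represent in FP4.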

Two-dimensional (2D) block scaling for weights: Standard NVFP4 scales 1×16 blocks along the dot-product dimension. Because the backward pass transposes the weight tensor, the forward and backward passes end up with different quantized weights, breaking the chain rule. NVIDIA’s fix is to scale weights in 16×16 blocks so the same quantized representation is used in both passes. Activations and gradients keep 1×16 scaling, since they are less sensitive to this inconsistency.
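The inconsistency is easy to reproduce in a small simulation. The sketch below fake-quantizes a weight matrix and its transpose with 1×16 and with 16×16 tiles; `fake_quant` is a hypothetical helper, and block scales are kept in full floating point (E4M3 rounding of the scale is omitted for brevity).

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_to_e2m1(x):
    """Round to the nearest E2M1 magnitude, keeping the sign."""
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def fake_quant(w, block_rows, block_cols):
    """Quantize-dequantize w with one scale per (block_rows x block_cols) tile."""
    r, c = w.shape
    tiles = w.reshape(r // block_rows, block_rows, c // block_cols, block_cols)
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scale = np.where(amax == 0, 1.0, amax / 6.0)  # float scales (simplification)
    return (round_to_e2m1(tiles / scale) * scale).reshape(r, c)


rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))

# 1x16 scaling: blocks follow the dot-product axis, which differs between passes.
fwd_1d = fake_quant(w, 1, 16)         # forward pass sees W
bwd_1d = fake_quant(w.T, 1, 16).T     # backward pass sees W^T
# 16x16 scaling: a square tile covers the same values in W and in W^T.
fwd_2d = fake_quant(w, 16, 16)
bwd_2d = fake_quant(w.T, 16, 16).T

print(np.array_equal(fwd_1d, bwd_1d))  # False: two different quantized weights
print(np.array_equal(fwd_2d, bwd_2d))  # True: one shared representation
```

The trade-off is that a 16×16 tile shares one scale across 256 values instead of 16, so each scale has to cover a wider range of magnitudes.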

Stochastic rounding on gradients: Round-to-nearest-even introduces systematic bias when applied to gradient tensors. Stochastic rounding rounds probabilistically based on the distance to the two nearest representable values, removing that bias. The research team explicitly notes in the paper that stochastic rounding is detrimental when applied to forward-pass tensors, so it is restricted to gradients.
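A small experiment illustrates the bias argument. The sketch below rounds the same value many times on the E2M1 grid, once deterministically and once stochastically; the helper names are hypothetical, and the deterministic version breaks ties toward the lower magnitude rather than to even, which does not matter for the value used here.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_nearest(x):
    """Deterministic rounding to the nearest E2M1 magnitude."""
    idx = np.abs(np.abs(x)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def round_stochastic(x, rng):
    """Round |x| up with probability equal to its relative position between grid neighbors."""
    mag = np.clip(np.abs(x), 0.0, E2M1_GRID[-1])
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, mag, side="right"), 1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    return np.sign(x) * np.where(rng.random(mag.shape) < p_up, hi, lo)


rng = np.random.default_rng(0)
g = np.full(200_000, 2.3)                 # a gradient value between grid points 2 and 3
rtn_mean = round_nearest(g).mean()
sr = round_stochastic(g, rng)
print(rtn_mean)                           # 2.0: every sample is rounded down
print(sr.mean())                          # close to 2.3: unbiased in expectation
```

Each stochastically rounded sample is noisier than its deterministic counterpart, but the error averages out across many accumulated gradient contributions instead of pushing consistently in one direction.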

Results on the 12B Hybrid Mamba-Transformer

The 12B model uses the Nemotron-Nano-12B-v2-Base architecture (62 blocks: 6 Self-Attention, 28 FFN, and 28 Mamba-2; hidden dimension 5120; FFN dimension 20480), trained with a Warmup-Stable-Decay schedule (constant LR through 80% of training, decay over the final 20%), batch size 736, and sequence length 8192. The FP8 reference baseline follows the DeepSeek-V3 methodology: E4M3 elements, 128×128 weight blocks, 1×128 activation and gradient blocks, with the first block and last two blocks kept in BF16.
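For readers unfamiliar with Warmup-Stable-Decay, here is a minimal sketch of such a schedule. Only the 80/20 split comes from the description above, and the peak LR of 4.5e-4 is the value listed in the explainer's training setup; the 1% warmup length, the linear decay shape, and the final LR of zero are assumptions made purely for illustration.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 4.5e-4,
           warmup_frac: float = 0.01, stable_end_frac: float = 0.8,
           final_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay (assumed shape)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_start = int(stable_end_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (final_lr - peak_lr) * progress


for step in (0, 500, 800, 900, 999):
    print(step, f"{wsd_lr(step, total_steps=1000):.2e}")
```

The decay phase is where the article reports the NVFP4 loss gap widening, so the position of the decay boundary is the relevant quantity for the precision-switching result.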

NVFP4 validation loss stays within 1% of the FP8 baseline during the stable phase and widens to slightly above 1.5% during decay. Downstream accuracy is comparable across most benchmarks (NVFP4 vs FP8): MMLU 76.57% vs 77.36%, GSM8K CoT 92.27% vs 89.08%, MATH 81.48% vs 83.32%, AGIEval English CoT 70.31% vs 67.01%. Coding shows the largest gap (HumanEval+ 57.43% vs 59.93%, MBPP+ 55.91% vs 59.11%), which the research team attributes partly to noisy final-checkpoint evaluation. The research team also documents a precision-switching technique: transitioning the forward pass from NVFP4 to BF16 starting at 8.2T tokens (the final 18% of the schedule) reduced relative loss error from 1.5% to 0.5%.
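The reported scores are easier to compare as signed differences. The short script below only rearranges the numbers quoted above (NVFP4 minus FP8, in percentage points) and checks the share of training that runs with a BF16 forward pass after the switch.

```python
scores = {  # benchmark: (FP8, NVFP4), as reported in the article
    "MMLU-Pro 5-shot": (62.62, 62.58),
    "MMLU": (77.36, 76.57),
    "GSM8K CoT": (89.08, 92.27),
    "MATH": (83.32, 81.48),
    "AGIEval English CoT": (67.01, 70.31),
    "HumanEval+": (59.93, 57.43),
    "MBPP+": (59.11, 55.91),
}

deltas = {name: round(nvfp4 - fp8, 2) for name, (fp8, nvfp4) in scores.items()}
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {delta:+.2f}")

switch_tokens, total_tokens = 8.2e12, 10e12
bf16_forward_fraction = 1 - switch_tokens / total_tokens
print(f"{bf16_forward_fraction:.0%} of tokens see a BF16 forward pass")  # 18%
```

The differences range from -3.20 points (MBPP+) to +3.30 points (AGIEval English CoT), with no consistent sign outside the two coding benchmarks.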

NVFP4 vs MXFP4

On a separate 8B hybrid Mamba-Transformer trained on 1T tokens, NVFP4 reached a relative loss error of about 1.5% versus BF16, while MXFP4 stayed near 2.5%. To close the gap, MXFP4 required 1.36T tokens to match the NVFP4 1T-token loss, a 36% token overhead. The research team attributes the difference to NVFP4’s smaller block size and E4M3 scales, which preserve more of the FP4 dynamic range than MXFP4’s power-of-two UE8M0 scales (which can waste up to one binade and the ±4, ±6 samples in the worst case).
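A single worked block shows how a power-of-two scale can lose the top FP4 values. The sketch assumes the power-of-two scale is rounded up so the block never saturates, and it simulates E4M3 only in its normal range; both helper functions are hypothetical simplifications, not library calls.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def nearest_e2m1(x: float) -> float:
    """Nearest representable E2M1 magnitude."""
    return min(E2M1_GRID, key=lambda v: abs(v - x))


def ue8m0_scale(amax: float) -> float:
    """Power-of-two scale, rounded up so amax / scale never exceeds 6 (assumed rule)."""
    return 2.0 ** math.ceil(math.log2(amax / 6.0))


def e4m3_scale(amax: float) -> float:
    """amax / 6 rounded to 3 mantissa bits (normal range only, for brevity)."""
    s = amax / 6.0
    step = 2.0 ** (math.floor(math.log2(s)) - 3)
    return round(s / step) * step


amax = 6.5
mx = nearest_e2m1(amax / ue8m0_scale(amax))  # 6.5 / 2     = 3.25  -> 3.0
nv = nearest_e2m1(amax / e4m3_scale(amax))   # 6.5 / 1.125 ~= 5.78 -> 6.0
print(mx, nv)  # 3.0 6.0

token_overhead = 1.36 / 1.00 - 1
print(f"{token_overhead:.0%}")  # 36%
```

In this example the MXFP4-style block never uses the ±4 and ±6 codes, so its largest element is stored with the coarser resolution of the lower half of the grid.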

Marktechpost’s Visual Explainer

● NVIDIA Technical Report

Pretraining Large Language Models with NVFP4

A 4-bit floating-point training recipe validated on a 12-billion-parameter hybrid Mamba-Transformer trained on 10 trillion tokens, the longest publicly documented 4-bit pretraining run to date.

Parameters 12B
Training tokens 10T
MMLU-Pro 62.58% (vs 62.62% FP8)

SOURCE — arXiv:2509.25149v2 · NVIDIA · Available in Transformer Engine

01 — Context

Why move from FP8 to 4-bit pretraining

FP8 training is now standard for frontier LLM pretraining. Moving to FP4 promises a 2× to 3× increase in arithmetic throughput over FP8 and roughly half the operand memory, but narrower formats compress dynamic range and amplify quantization error at long token horizons.

The challenge is to preserve training stability and downstream accuracy across multi-trillion-token runs. This report presents a recipe that does both, using NVFP4, a 4-bit microscaling format with native support on NVIDIA Blackwell Tensor Cores.

GB200 Throughput

BF16 baseline 1×
FP8 2×
FP4 (NVFP4) 4×

GB300 Throughput

BF16 baseline 1×
FP8 2×
FP4 (NVFP4) 6×

02 — The Format

What NVFP4 actually stores

Each element is encoded as E2M1 (1 sign, 2 exponent, 1 mantissa bit), representing one of: ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6.

Every block of 16 contiguous elements shares a single E4M3 scale factor. A second FP32 per-tensor scale sits on top to keep the E4M3 block scales in range. The result: at least 6.25% of values in each block (the per-block amax) sit at near-FP8 precision.

Figure: a block of 16 FP4 elements sharing one E4M3 (FP8) block scale, with the block amax mapped to the FP4 maximum.

03 — Format Comparison

How NVFP4 differs from MXFP4

NVFP4 makes three design changes to the microscaling approach that meaningfully improve representation fidelity at 4 bits.

Property	MXFP4	NVFP4
Block size	32	16
Element	E2M1	E2M1
Block scale	UE8M0	E4M3
Scale type	Power of two	Fractional
Tensor scale	None	FP32

MXFP4’s power-of-two UE8M0 scales can waste up to one binade of dynamic range and lose the ±4 and ±6 FP4 samples after scale rounding. NVFP4’s E4M3 scales map the block amax much closer to the FP4 maximum.

04 — Scope

What runs in NVFP4 — and what doesn’t

Only the three GEMMs inside linear layers (Fprop, Dgrad, and Wgrad) actually run in NVFP4. Everything else stays in higher precision.

In NVFP4

Linear Fprop GEMM
Linear Dgrad GEMM
Linear Wgrad GEMM

In BF16 / FP32

Embeddings · Output head
Normalization layers
Non-linearities
Attention (softmax, QK, score-V)
Master weights · Optimizer states
TP reductions (BF16)

The “FP4 training” label applies to the most compute-heavy GEMMs, not to the entire forward and backward graph.

05 — The Recipe

Four techniques required for convergence

Quantizing every linear-layer GEMM to NVFP4 with default settings (1×16 block scaling everywhere, round-to-nearest-even, no transforms) diverges early in training. The recipe stabilizes it with four components. Ablations show each is necessary.

Selective High Precision

Keep ~16% of linear layers in BF16, concentrated in the final blocks. For the 12B model: first 2 + final 8 of 62 blocks.

Random Hadamard Transforms (RHT)

16×16 Hadamard matrix + random ±1 sign vector, applied only to Wgrad inputs. d=4 was worse; d=128 was similar to d=16.

2D Block Scaling for Weights

16×16 block scales for weights so forward and backward see the same quantized representation. Activations and gradients keep 1×16 scaling.

Stochastic Rounding on Gradients

Probabilistic rounding removes systematic gradient bias. It is detrimental on forward-pass tensors, so it is restricted to gradients only.

06 — Training Setup

The 12B hybrid Mamba-Transformer

The model uses the Nemotron-Nano-12B-v2-Base architecture: 62 blocks consisting of 6 Self-Attention, 28 FFN, and 28 Mamba-2 blocks.

Architecture

Blocks 62
Hidden dim 5120
FFN dim 20480
Q heads 40
KV heads 8
Mamba state dim 128

Training

Tokens 10T
Batch size 736
Sequence length 8192
Schedule WSD 80/20
Peak LR 4.5e-4
Weight decay 0.1

FP8 reference baseline follows DeepSeek-V3: E4M3 elements, 128×128 weight blocks, 1×128 activation/gradient blocks, with the first block and last two in BF16.

07 — Downstream Results

NVFP4 matches FP8 across most benchmarks

Validation loss stays within 1% of FP8 during the stable phase, widening to slightly above 1.5% during decay. Downstream accuracies are listed below.

Benchmark	FP8	NVFP4
MMLU-Pro 5-shot	62.62	62.58
MMLU	77.36	76.57
AGIEval English CoT	67.01	70.31
GSM8K CoT	89.08	92.27
MATH	83.32	81.48
MGSM	81.87	85.53
HumanEval+	59.93	57.43
MBPP+	59.11	55.91
ARC Challenge	91.81	91.81

Coding shows the widest gap. Switching the forward pass to BF16 at 8.2T tokens (final 18%) reduces relative loss error from 1.5% to 0.5%.

08 — Format Efficiency

NVFP4 vs MXFP4 on the same 8B model

On an 8B hybrid Mamba-Transformer trained on the same data, NVFP4 converged to a meaningfully better loss than MXFP4 in the same token budget.

Loss vs BF16 @ 1T tokens

NVFP4 ~1.5% gap
MXFP4 ~2.5% gap

Tokens to match NVFP4 loss

NVFP4 1.00T
MXFP4 1.36T (+36%)

The 36% token overhead translates directly into longer training time. Smaller block size and E4M3 scales preserve more of the FP4 dynamic range than MXFP4’s UE8M0 design.

09 — Practitioner Takeaways

What this unlocks for AI engineers

4-bit pretraining at multi-trillion-token scale is now reproducible with a known recipe, on Blackwell hardware, via Transformer Engine.

Throughput & memory

FP4 GEMMs run 2× faster than FP8 on GB200 and 3× on GB300. Operand memory is roughly halved.

Reproducible recipe

Selective BF16 layers + 16×16 RHT on Wgrad + 2D weight scaling + stochastic rounding on gradients.

Open questions

Quantizing all linear layers, extending NVFP4 to attention and communication paths, and scaling laws for FP4 across parameter counts and horizons.

Availability

NVFP4 training is supported in NVIDIA Transformer Engine. Source: arXiv:2509.25149v2.

MARKTECHPOST · AI research, deeply explained.

Key Takeaways

NVIDIA's research team pretrained a 12B hybrid Mamba-Transformer on 10T tokens in NVFP4, the longest publicly documented 4-bit training run, matching FP8 on MMLU-Pro at 62.58% vs 62.62%.
NVFP4 uses 16-element blocks with E4M3 scales plus an FP32 per-tensor scale, preserving the ±4 and ±6 samples that MXFP4's 32-element UE8M0 design can lose to power-of-two rounding.
Four techniques are required for convergence, and none is optional: ~16% of linear layers in BF16, 16×16 Random Hadamard Transforms on Wgrad inputs, 2D 16×16 weight scaling, and stochastic rounding on gradients only.
Only linear-layer GEMMs run in NVFP4; attention, embeddings, normalization, non-linearities, master weights, accumulated weight gradients, and optimizer states all stay in BF16 or FP32.
On an 8B model, MXFP4 needed 1.36T tokens (36% more) to match NVFP4's loss at 1T tokens, while FP4 GEMMs deliver 2× FP8 throughput on GB200 and 3× on GB300.


The post NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon appeared first on MarkTechPost.
