NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B

Knowledge distillation (KD) transfers “dark knowledge” from a large teacher model to a smaller student. The student learns from the teacher’s full output probability distribution over tokens, not just the correct answers. This is done via per-position Kullback–Leibler (KL) divergence over next-token probability distributions.
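As a minimal sketch of the same-tokenizer case, the per-position loss can be written in plain Python. The function names and toy logits below are illustrative assumptions, and the KL direction (teacher relative to student) is the common KD convention rather than something the article specifies.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student) over one sequence."""
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy sequence of two positions over a shared three-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.5]]
loss = kd_loss(teacher, student)
print(loss)
```

The `zip` over positions and over vocabulary entries is exactly what breaks across tokenizers: both the sequence lengths and the vocabulary indices stop lining up.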

This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers, such as Phi-4-mini or Qwen3-4B, because token positions don’t correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families.

NVIDIA researchers introduced X-Token, a logit-distribution-based method for cross-tokenizer KD. It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes.

The Problem X-Token is Solving

Two prior approaches dominate cross-tokenizer KD. ULD (Universal Logit Distillation) sidesteps vocabulary alignment by rank-sorting both distributions and minimizing their L1 distance, which discards token identity entirely. GOLD adds span alignment and a hybrid loss. It partitions tokens into a 1-to-1 string-matched common subset, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the current state of the art.
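To see why rank matching discards token identity, here is a small sketch of a ULD-style distance. The function name and the zero-padding of the shorter vector are assumptions made for illustration; the toy distributions are invented.

```python
def uld_distance(teacher_probs, student_probs):
    """L1 distance between the two distributions after sorting each in descending order."""
    t = sorted(teacher_probs, reverse=True)
    s = sorted(student_probs, reverse=True)
    size = max(len(t), len(s))
    t = t + [0.0] * (size - len(t))  # pad the smaller vocabulary with zeros
    s = s + [0.0] * (size - len(s))
    return sum(abs(a - b) for a, b in zip(t, s))


teacher = [0.7, 0.2, 0.1]              # mass concentrated on teacher token 0
student_right = [0.7, 0.2, 0.1, 0.0]   # student favours its token 0
student_wrong = [0.0, 0.1, 0.2, 0.7]   # student favours its token 3
same = uld_distance(teacher, student_right) == uld_distance(teacher, student_wrong)
print(same)
```

Both students get the same distance because sorting erases which token carries the mass. This is the identity-agnostic signal the article refers to.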

The research team identifies two structural failures in GOLD’s design:

Failure 1: Uncommon-token failure. When tokenizers fragment text differently, critical tokens fall into the unmatched uncommon subset. Llama-3 packs multi-digit numbers into single tokens: “201” is one token. Qwen3 splits them digit by digit: “2”, “0”, “1”. Under GOLD, all 1,100 of Llama’s two- and three-digit numerals (100 two-digit, 1,000 three-digit) fall into the uncommon set when Qwen3-4B is the teacher. Those tokens receive two kinds of harmful signal: identity-agnostic noise from rank-based ULD matching, and suppressive gradients from the common-KL term acting through the full-vocabulary softmax. The result: GSM8k accuracy drops to 2.56 under GOLD with Qwen3-4B, compared to 12.89 for same-tokenizer KD from a weaker Llama-3.2-3B teacher.

Failure 2: Over-conservative matching. GOLD uses strict string equality to define the common subset. A student token Hundreds corresponds to teacher tokens Hund followed by reds under teacher-side re-tokenization, but strict matching discards this pair. Useful alignment signal is lost even when the correspondence is well-formed.

These two failures require opposite remedies: eliminate the partition when critical tokens are misaligned, and relax it when alignment is structurally sound.

How X-Token Works

X-Token has three components: span alignment, a projection matrix W, and two complementary loss formulations, P-KL and H-KL.

Span Alignment

Teacher and student tokenizers produce sequences of different lengths for the same text. X-Token uses dynamic-programming (DP) span alignment, grouping tokens into chunks where each chunk-pair decodes to the same underlying text substring. A chain-rule merge then combines per-token probabilities within each chunk into a single chunk-level distribution for use in the distillation loss. The alignment is cached per sequence and adds no per-step training overhead.
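The chunking idea can be illustrated with a greedy two-pointer pass. This is a deliberate simplification, not the DP alignment the article describes: it assumes both token lists concatenate to exactly the same string and has no gap move for mismatches. The function name and the toy token strings are hypothetical.

```python
def align_spans(student_tokens, teacher_tokens):
    """Group token indices into chunk pairs whose decoded text is identical.

    Greedy simplification: assumes both sides concatenate to the same string.
    """
    chunks = []
    i, j = 0, 0
    while i < len(student_tokens) and j < len(teacher_tokens):
        s_idx, t_idx = [i], [j]
        s_text, t_text = student_tokens[i], teacher_tokens[j]
        i, j = i + 1, j + 1
        # Extend whichever side has decoded less text until both sides agree.
        while s_text != t_text:
            if len(s_text) < len(t_text):
                s_text += student_tokens[i]
                s_idx.append(i)
                i += 1
            else:
                t_text += teacher_tokens[j]
                t_idx.append(j)
                j += 1
        chunks.append((s_idx, t_idx))
    return chunks


# Toy token strings: the student keeps " 201" whole, the teacher splits it.
student = ["The", " year", " 201", "."]
teacher = ["The", " year", " ", "2", "0", "1", "."]
chunks = align_spans(student, teacher)
print(chunks)
```

Each chunk pairs a list of student positions with a list of teacher positions, which is the unit over which a chunk-level distribution would then be formed.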

The research team also identifies a failure in TRL’s surface-substring alignment, which is used in TRL’s GOLD trainer. TRL accumulates per-side decoded buffers and flushes only when both buffers match as equal raw strings. A byte-level disagreement, such as Llama-3 auto-prepending <bos> while Qwen3 does not, prevents future flushes and forces all remaining tokens into one mis-grouped super-group at the end of the sequence. The DP approach handles this with a single gap move, regardless of sequence length.

The Projection Matrix W

After alignment, teacher and student distributions still operate over different vocabularies. The projection matrix W ∈ ℝ^{|V_S|×|V_T|} maps each student token to a weighted combination of teacher tokens, bridging the vocabulary mismatch.

W is constructed deterministically in two passes:

Pass 1 (exact-match): For each (student token, teacher token) pair whose decoded strings match after canonicalization, set W[s, t] = 1. Canonicalization unifies space prefixes (Ġ, _, ␣), newlines, byte-fallback tokens of the form <0xHH>, and model-specific special tokens across tokenizer families.

Pass 2 (multi-token rule): For each student token without an exact match, re-tokenize its decoded text under the teacher tokenizer. If the resulting sequence has length ≤ 4, assign exponentially-decayed weights: W[s, τᵢ] = β·γⁱ with (β, γ) = (0.9, 0.1). A length-2 span receives normalized weights (0.909, 0.091). A length-3 span receives (0.9009, 0.0901, 0.0090). A length-4 span receives (0.9000, 0.0900, 0.0090, 0.0009). The leading sub-token receives the highest weight because it typically carries the most informative probability mass: for example, “_inter” in [“_inter”, “national”] or “_20” in [“_20”, “24”].

Each row is truncated to its top-4 entries and row-normalized. Because every row of W is non-negative and sums to 1, left-multiplication by W⊤ is probability-preserving: if p_S is a probability vector, W^⊤p_S is also a valid probability vector over V_T. W is constructed once before training and can optionally be jointly refined with the student under P-KL.
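The two passes and the probability-preserving property can be sketched with dictionaries standing in for the sparse matrix. Everything named here (`decayed_weights`, `build_row`, `project`, the toy teacher vocabulary, and the character-splitting stand-in for a teacher tokenizer) is a hypothetical illustration, and canonicalization is omitted.

```python
def decayed_weights(span_length, beta=0.9, gamma=0.1):
    """Normalized weights beta * gamma**i for i = 0 .. span_length - 1."""
    raw = [beta * gamma ** i for i in range(span_length)]
    total = sum(raw)
    return [w / total for w in raw]


def build_row(student_text, teacher_vocab, teacher_tokenize, max_span=4):
    """Return one row of W as {teacher_token: weight}."""
    if student_text in teacher_vocab:            # Pass 1: exact match
        return {student_text: 1.0}
    pieces = teacher_tokenize(student_text)      # Pass 2: multi-token rule
    if len(pieces) > max_span:
        return {}                                # too long: the row stays zero
    row = {}
    for piece, weight in zip(pieces, decayed_weights(len(pieces))):
        row[piece] = row.get(piece, 0.0) + weight
    return row


def project(student_probs, W):
    """Compute W^T p_S: move student probability mass onto teacher tokens."""
    projected = {}
    for s, p in student_probs.items():
        for t, w in W[s].items():
            projected[t] = projected.get(t, 0.0) + w * p
    return projected


# Toy teacher: knows "the" and single digits, and splits anything else by character.
teacher_vocab = {"the"} | {str(d) for d in range(10)}
W = {s: build_row(s, teacher_vocab, list) for s in ["the", "201", "20"]}
projected = project({"the": 0.5, "201": 0.3, "20": 0.2}, W)
print([round(w, 4) for w in decayed_weights(3)])
print(round(sum(projected.values()), 6))
```

Because spans longer than four pieces are skipped, every non-empty row already has at most four entries, so the top-4 truncation is implicit in this sketch.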

P-KL: Addressing Erroneous and Suppressive Gradients

P-KL removes the partition entirely. It projects the student distribution p̂_S^(k) into teacher vocabulary space via W:

$\tilde{p}_S^{(k)}[t] = \sum_{s \in \mathcal{V}_S} W[s, t] \cdot \hat{p}_S^{(k)}[s]$

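Then it computes KL divergence directly between the teacher distribution and the projected student distribution. This avoids the gradient that GOLD’s common-KL term places on every uncommon student logit j:

$\frac{\partial \mathcal{L}_{common}}{\partial z_{j}} = p_S[j] \cdot M_{\mathcal{C}}(T)$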

There is no uncommon set, so rank-based ULD noise is eliminated. The suppressive gradient problem is also eliminated: the projection routes the student’s probability mass for “201” directly onto {2, 0, 1} in the teacher vocabulary via W.
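A rough sketch of the P-KL computation at one aligned chunk follows. It assumes the KL is taken from the teacher to the projected student, which the article does not spell out. The function names, the toy row of W, and the toy distributions are all illustrative.

```python
import math


def project_student(student_probs, W):
    """Project student probabilities into teacher vocabulary space via W."""
    projected = {}
    for s, p in student_probs.items():
        for t, w in W[s].items():
            projected[t] = projected.get(t, 0.0) + w * p
    return projected


def p_kl(teacher_probs, student_probs, W, eps=1e-12):
    """KL(teacher || projected student) over the teacher vocabulary (assumed direction)."""
    projected = project_student(student_probs, W)
    return sum(
        p_t * math.log(p_t / (projected.get(t, 0.0) + eps))
        for t, p_t in teacher_probs.items()
        if p_t > 0
    )


# Toy W: the student numeral "201" maps onto the teacher digits 2, 0, 1.
W = {"201": {"2": 0.9009, "0": 0.0901, "1": 0.0090}, "the": {"the": 1.0}}
teacher_probs = {"2": 0.8, "the": 0.2}
student_numeral = {"201": 0.8, "the": 0.2}   # student puts mass on the numeral
student_other = {"201": 0.1, "the": 0.9}     # student puts mass elsewhere
loss_numeral = p_kl(teacher_probs, student_numeral, W)
loss_other = p_kl(teacher_probs, student_other, W)
print(loss_numeral < loss_other)
```

The student that assigns mass to “201” is rewarded when the teacher expects “2”, even though the string “201” never appears in the teacher vocabulary.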

The research team formally proves (Proposition 1) that GOLD’s common-KL term induces non-negative gradients on every uncommon student logit. The gradient on an uncommon student logit j is ∂ℒ_common/∂z_j = p_S[j] · M_C(T), where M_C(T) is the teacher probability mass on the common subset. Under gradient descent, this always drives z_j downward, suppressing every uncommon token’s probability regardless of the ground-truth token.
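The stated gradient can be recovered in a few lines. This is a sketch under two assumptions not given in the article: the common-KL term is written as a KL sum over matched pairs indexed by c on both sides, and the student probabilities come from a softmax over the full student vocabulary.

```latex
\begin{align*}
\mathcal{L}_{common} &= \sum_{c \in \mathcal{C}} p_T[c] \, \log \frac{p_T[c]}{p_S[c]},
\qquad p_S[c] = \frac{e^{z_c}}{\sum_{v \in \mathcal{V}_S} e^{z_v}} \\
\frac{\partial \log p_S[c]}{\partial z_j} &= \mathbb{1}[c = j] - p_S[j] \\
\frac{\partial \mathcal{L}_{common}}{\partial z_j}
  &= -\sum_{c \in \mathcal{C}} p_T[c] \left( \mathbb{1}[c = j] - p_S[j] \right)
   = p_S[j] \sum_{c \in \mathcal{C}} p_T[c]
   = p_S[j] \cdot M_{\mathcal{C}}(T) \quad \text{for } j \notin \mathcal{C}
\end{align*}
```

For an uncommon logit the indicator term vanishes, so only the normalizer of the softmax contributes. Both factors are non-negative, which is why the update can only push z_j down.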

H-KL: Relaxing the 1-to-1 Matching

H-KL applies when the partition is structurally sound, that is, when critical tokens land in the common subset. In that case, GOLD’s direct KL on identity-aligned pairs delivers sharper per-pair supervision than P-KL’s projection, which blends student probability mass across multiple teacher tokens. The alternative is to make the partition less wasteful by relaxing the strict string-equality criterion.

H-KL keeps GOLD’s hybrid loss structure but expands the common set C using W. For each student token s, it selects the top-ranked teacher token t* = argmax_{t’∈V_T} W[s, t’], and adds (s, t*) to C. Exact matches are preserved since they receive weight 1 in W, the highest possible. Near-equivalent pairs like (Hundreds, Hund), excluded by GOLD, are now admitted. The expanded C feeds the same hybrid loss: direct KL on common pairs, ULD on the remainder.
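The expansion step is a row-wise argmax over W. In this sketch W is a dictionary of rows, and the function name and toy entries are assumptions for illustration only.

```python
def expand_common_set(W):
    """Map each student token to its top-weighted teacher token under W."""
    common = {}
    for s, row in W.items():
        if row:  # rows left empty by the length limit contribute nothing
            common[s] = max(row, key=row.get)
    return common


# Toy rows: one exact match, one two-piece span, one row that stayed zero.
W = {
    "the": {"the": 1.0},
    "Hundreds": {"Hund": 0.909, "reds": 0.091},
    "unmappable": {},
}
common = expand_common_set(W)
print(common)
```

Exact matches map to themselves, while a split token such as Hundreds is paired with its leading teacher piece instead of being dropped.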

Selecting Between P-KL and H-KL

The selection uses a coverage audit over token categories in the student vocabulary. For math tasks, multi-digit numerals are the critical category. Table 8 in the research paper shows: under Qwen3-4B, 0 out of 100 two-digit Llama numerals and 0 out of 1,000 three-digit Llama numerals appear in C. Under Phi-4-mini-Instruct, all 100 two-digit and all 1,000 three-digit numerals appear in C. ASCII punctuation and single-digit numerals are fully covered in both cases.

The rule: use P-KL when critical tokens fall outside C (Qwen3-4B), and H-KL when the partition is sound (Phi-4-mini-Instruct). Table 2 in the research paper shows the mode reversal is sharp: P-KL outperforms H-KL by +3.55 avg. on Qwen3-4B, while H-KL outperforms P-KL by +1.68 avg. on Phi-4-mini.
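The audit can be sketched as a per-category coverage check. The `threshold` parameter is a hypothetical knob: the article only reports the two extreme cases (0% and 100% coverage) and gives no cutoff. The common sets below are toy stand-ins, not the real tokenizer vocabularies.

```python
def coverage(category_tokens, common_set):
    """Fraction of a token category that appears in the common set C."""
    hits = sum(1 for tok in category_tokens if tok in common_set)
    return hits / len(category_tokens)


def choose_loss_mode(critical_categories, common_set, threshold=1.0):
    """Return 'H-KL' if every critical category meets the threshold, else 'P-KL'."""
    for tokens in critical_categories.values():
        if coverage(tokens, common_set) < threshold:
            return "P-KL"
    return "H-KL"


# 100 two-digit and 1,000 three-digit numeral strings, as in the audit counts.
numerals = {
    "two_digit": [f"{n:02d}" for n in range(100)],
    "three_digit": [f"{n:03d}" for n in range(1000)],
}
digit_only_common = {str(d) for d in range(10)}             # digit-splitting teacher
full_common = digit_only_common | set(numerals["two_digit"]) | set(numerals["three_digit"])

print(choose_loss_mode(numerals, digit_only_common))
print(choose_loss_mode(numerals, full_common))
```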

Multi-Teacher Distillation

X-Token extends to multiple teachers. Each teacher has its own projection matrix W_m and loss selection. For same-tokenizer teachers, standard token-level KL is used. The multi-teacher loss aggregates per-teacher losses with weights α_m:

$\mathcal{L}_{KD,multi} = \sum_{m=1}^{M} \alpha_{m} \frac{1}{|\mathcal{K}_{m}|} \sum_{k \in \mathcal{K}_{m}} \mathcal{L}_{*,m}^{(k)}$

The research team evaluates static and confidence-adaptive weighting schemes. Confidence-adaptive variants compute α_m from the cross-entropy, Shannon entropy, or maximum predicted probability of the teacher’s distribution. Static weighting outperforms adaptive schemes in both multi-teacher setups evaluated.
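With static weights, the aggregation reduces to a weighted sum of per-teacher chunk averages. The sketch below uses invented per-chunk loss values, and the function and dictionary key names are hypothetical. Only the α values (0.8 and 0.2) come from the article.

```python
def multi_teacher_loss(per_teacher_chunk_losses, alphas):
    """Weighted sum over teachers of the mean per-chunk distillation loss."""
    total = 0.0
    for name, chunk_losses in per_teacher_chunk_losses.items():
        total += alphas[name] * sum(chunk_losses) / len(chunk_losses)
    return total


# Each teacher can have a different number of aligned chunks for the same text.
chunk_losses = {
    "phi_mini": [0.4, 0.6],        # e.g. H-KL losses (toy values)
    "llama_3b": [1.0, 2.0, 3.0],   # e.g. standard KL losses (toy values)
}
alphas = {"phi_mini": 0.8, "llama_3b": 0.2}
total = multi_teacher_loss(chunk_losses, alphas)
print(total)
```

Averaging within each teacher before weighting keeps a teacher with more aligned chunks from dominating the sum.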

Dynamic KD/CE Scaling

Training combines the distillation loss ℒ_KD with next-token cross-entropy ℒ_CE. Because these terms differ in magnitude and shift during training, X-Token rescales the KD term at each step to match the scale of ℒ_CE:

$\mathcal{L} = \text{sg}(\mathcal{L}_{CE} / \mathcal{L}_{KD}) \cdot \mathcal{L}_{KD} + \mathcal{L}_{CE}$

where sg(·) is stop-gradient. Table 4 in the paper shows dynamic scaling outperforms three fixed-weight settings (KD-heavy, balanced, CE-heavy) on the Qwen3-4B (P-KL) pair.
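The effect of the stop-gradient can be shown without an autodiff library by carrying loss values and gradients as plain numbers. This is a sketch with a single scalar parameter and invented values. In a real training loop the ratio would be wrapped in the framework's stop-gradient operation instead.

```python
def dynamic_total(l_kd, l_ce, grad_kd, grad_ce):
    """Return (loss, gradient) for sg(l_ce / l_kd) * l_kd + l_ce.

    The ratio is treated as a constant, so it rescales the KD gradient
    but contributes no gradient of its own.
    """
    scale = l_ce / l_kd                 # sg(.): a plain number, not differentiated
    loss = scale * l_kd + l_ce
    grad = scale * grad_kd + grad_ce
    return loss, grad


# Toy step: the KD loss is 40x smaller than the CE loss before rescaling.
l_kd, l_ce = 0.05, 2.0
grad_kd, grad_ce = 0.01, 0.5
loss, grad = dynamic_total(l_kd, l_ce, grad_kd, grad_ce)
print(loss, grad)
```

Numerically the total is always twice ℒ_CE. What changes from step to step is how strongly the KD gradient is amplified relative to the CE gradient.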

Experiments and Results

Student: Llama-3.2-1B. Teachers: Llama-3.2-3B (same tokenizer), Qwen3-4B, and Phi-4-mini-Instruct. Training data: NemotronClimbMix dataset, 30,000 steps, batch size 768, context length 4096. Optimizer: AdamW, learning rate 5×10⁻⁵, 5% warmup with cosine decay, weight decay 0.1, gradient clipping 1.0. Each experiment is feasible on a single NVIDIA H100 GPU; the research team used 128 H100s to speed up iteration.

Evaluation: 3-shot accuracy on MMLU, GSM8k, MATH-Hendrycks, Winogrande, and HellaSwag.

Key results:

Setting	Method	Avg.
No distillation	Llama-1B (base)	33.96
No distillation	Continued pre-training	36.63
Same tokenizer	Llama-3B → 1B (KL)	38.40
Cross-tokenizer	Qwen-4B, ULD	36.77
Cross-tokenizer	Qwen-4B, GOLD	35.03
Cross-tokenizer	Qwen-4B, X-Token (P-KL)	38.85
Cross-tokenizer	Phi-mini, ULD	38.31
Cross-tokenizer	Phi-mini, GOLD	38.66
Cross-tokenizer	Phi-mini, X-Token (H-KL)	39.18
Multi-teacher	Phi-mini + Llama-3B (X-Token)	40.48

On Qwen-4B (P-KL regime): GOLD reaches 35.03 avg., below even continued pre-training without a teacher (36.63). This confirms the partition is actively harmful when critical tokens are misaligned. Pure ULD (36.77) already improves over GOLD, indicating the partition is the primary failure source. P-KL further improves to 38.85 avg. (+3.82 over GOLD). GSM8k alone moves from 2.56 to 15.54, surpassing same-tokenizer KD from Llama-3.2-3B (12.89) on that benchmark.

On Phi-mini (H-KL regime): GOLD reaches 38.66 avg., a reasonable baseline where the partition is structurally sound. H-KL improves to 39.18 avg. (+0.52 over GOLD). P-KL applied to Phi-mini drops to 37.50 avg., confirming that the wrong loss mode hurts even when W is available.

Multi-teacher: Phi-mini (H-KL, α=0.8) + Llama-3B (standard KL, α=0.2) under static weighting reaches 40.48 avg. This is +2.08 over same-family KD from Llama-3B alone, and +1.30 over the best single cross-tokenizer result (39.18). Combining Phi-mini + Qwen-4B, two teachers with overlapping reasoning strengths, scores only 38.49, below the best single teacher. Adding Qwen-4B as a third teacher yields 40.15, with math/reasoning degrading (GSM8k 20.39 → 19.18) while commonsense improves slightly. Teacher complementarity, not teacher count, drives gains.

Strengths and What to Watch

Strengths:

The suppressive gradient problem in GOLD’s hybrid loss is formally proved (Proposition 1), not just observed empirically
W is constructed by rule from tokenizer strings alone; no training data or learned parameters are needed at initialization
Dynamic KD/CE scaling removes the need to tune fixed loss weights; it outperforms three fixed-weight baselines in ablations
Multi-teacher extension adds no architectural changes; each teacher uses its own W_m and appropriate loss
The coverage audit for P-KL vs H-KL selection is a defined, reproducible criterion based on per-category token retention in C

What to Watch:

Experiments use only Llama-3.2-1B as the student under continued pre-training; larger students and instruction-tuned settings are not evaluated
Only three teacher pairs are tested; low-overlap tokenizer families (SentencePiece, byte-level BPE) are left for future work
Static weighting outperforms confidence-adaptive weighting in all tested multi-teacher setups, but the reason is not established
The multi-token rule in Pass 2 skips student tokens whose decoded text re-tokenizes to sequences longer than 4 under the teacher; these rows remain zero in W

Marktechpost’s Visual Explainer

01 — Background

What is Knowledge Distillation?

Knowledge distillation (KD) transfers “dark knowledge” from a large teacher model to a smaller student model. The student learns from the teacher’s full next-token probability distribution, not just the correct answer.

This is done via per-position KL divergence over the teacher’s output distribution at every token position in the sequence.

The constraint: standard KD requires a shared tokenizer. If Llama-3.2-1B is the student, it cannot learn from Qwen3-4B or Phi-4-mini because their token vocabularies don’t align. Token positions have no correspondence across different tokenizer families.

Llama (student tokenizer) ≠ Qwen / Phi (incompatible teachers): vocabulary mismatch

02 — The Problem

Two Structural Failures in GOLD

GOLD is the prior state-of-the-art cross-tokenizer KD method. It partitions tokens into a string-matched common subset (trained with KL) and an uncommon remainder (trained with ULD rank-matching).

NVIDIA researchers identified two distinct failures:

Uncommon-token failure: Critical tokens fall into the unmatched subset. Llama packs “201” as one token. Qwen splits it into “2”, “0”, “1”. All 1,100 multi-digit Llama numerals fall into the uncommon set under Qwen3-4B. They receive identity-agnostic noise and suppressive gradients, and GSM8k drops to 2.56.

Over-conservative matching: Strict string equality discards well-formed pairs. Student token Hundreds maps to teacher tokens Hund + reds, but GOLD drops this alignment entirely.

03 — Solution

X-Token: Three Core Components

X-Token is a logit-distribution-based cross-tokenizer KD method. It requires no auxiliary trainable components and no architectural changes: it is a drop-in replacement for the standard KD loss.

Span Alignment: DP-based alignment groups tokens into chunks that decode to the same text substring. Cached per sequence, with zero per-step overhead.

Projection Matrix W: A sparse matrix W ∈ ℝ^{|V_S|×|V_T|} maps each student token to a weighted combination of teacher tokens, bridging the vocabulary gap.

Two Loss Modes: P-KL removes the partition entirely. H-KL keeps the partition but relaxes matching via top-1 mappings under W. Each targets a different failure mode.

04 — Projection Matrix W

How W is Constructed

W is constructed deterministically before training in two passes. No training data or learned parameters are required at initialization.

Exact-match pass: For each (student, teacher) token pair whose decoded strings match after canonicalization, set W[s,t] = 1. Canonicalization unifies space prefixes, newlines, byte-fallback tokens, and special tokens across families.

Multi-token rule pass: For unmatched student tokens, re-tokenize their decoded text under the teacher. Assign decayed weights W[s,τᵢ] = β·γⁱ with (β,γ) = (0.9, 0.1). A 2-token span gets (0.909, 0.091). Each row is truncated to top-4 entries and row-normalized.

Because each row sums to 1, Wᵀ is probability-preserving: Wᵀp_S is a valid probability vector over V_T without additional normalization.

05 — Loss Formulations

P-KL vs H-KL: When to Use Each

Selection is based on a coverage audit: measure what fraction of critical token categories (e.g. multi-digit numerals) appear in the common set C.

Property	P-KL	H-KL
Partition	Removed entirely	Retained, relaxed
Matching	Full vocab via W	Top-1 under W
Use when	Critical tokens fall outside C	Partition is sound
Teacher example	Qwen3-4B	Phi-4-mini-Instruct
Avg. gain vs GOLD	+3.82	+0.52

Applying the wrong mode reverses results: P-KL on Phi-mini drops to 37.50 avg. vs H-KL’s 39.18.

06 — Results

Benchmark Results on Llama-3.2-1B (3-shot)

Student: Llama-3.2-1B, trained on NemotronClimbMix, 30K steps, batch 768, context 4096.

Method	GSM8k	Avg.
Llama-1B (base)	5.69	33.96
Continued pre-training	10.25	36.63
Same-tokenizer KD (Llama-3B)	12.89	38.40
Qwen-4B, GOLD	2.56	35.03
Qwen-4B, X-Token (P-KL)	15.54	38.85
Phi-mini, GOLD	16.50	38.66
Phi-mini, X-Token (H-KL)	19.11	39.18
Phi-mini + Llama-3B (Multi)	20.39	40.48

07 — Multi-Teacher Distillation

Teacher Complementarity Drives Gains

X-Token extends to multiple teachers. Each gets its own projection matrix W_m and loss mode. The aggregated loss uses per-teacher weights α_m.

Key finding: static weighting outperforms confidence-adaptive weighting in all tested setups. Phi-mini (α=0.8) + Llama-3B (α=0.2) achieves the best result.

Teacher Combination	Avg.	Note
Phi-mini only (H-KL)	39.18	Best single
Phi-mini + Llama-3B	40.48	Complementary
Phi-mini + Qwen-4B	38.49	Overlapping
Phi-mini + Qwen-4B + Llama-3B	40.15	Third teacher hurts math

Combining two reasoning-heavy teachers (Phi-mini + Qwen-4B) scores below the best single teacher. Teacher diversity matters more than teacher count.

08 — Key Takeaways

What to Remember About X-Token

GOLD’s partition actively harms training when critical tokens (e.g., multi-digit numerals) fall into the uncommon set. P-KL eliminates the partition entirely using projection matrix W.

H-KL keeps the partition but relaxes matching to top-1 mappings under W, which works best when the partition is structurally sound.

The projection matrix W is constructed by rule before training from tokenizer strings alone; no learned parameters are required at init.

Multi-teacher gains (+1.3 over single-teacher) come from teacher complementarity, not from adding more teachers with overlapping strengths.

GSM8k recovers from 2.56 (GOLD) to 15.54 (P-KL), a 6× gain that exceeds same-tokenizer KD from the Llama-3.2-3B teacher (12.89).

arXiv: 2605.21699 — Institution: NVIDIA

Key Takeaways

X-Token identifies two distinct, opposite failure modes in GOLD: uncommon-token suppression (fix: remove the partition with P-KL) and over-conservative matching (fix: relax it with H-KL).
The projection matrix W is constructed by rule from tokenizer strings before training; it can optionally be jointly refined with the student for additional gains.
P-KL on Qwen3-4B improves over GOLD by +3.82 avg. and recovers GSM8k from 2.56 to 15.54.
Multi-teacher distillation gains (+1.3 over single-teacher) come from teacher complementarity, not just from adding more teachers.
Loss mode selection (P-KL vs H-KL) is determined by a coverage audit on token categories; applying the wrong mode reverses the ranking.
