NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B
Knowledge distillation (KD) transfers “dark knowledge” from a large teacher model to a smaller student. The student learns from the teacher’s full output probability distribution over tokens, not just the correct answers. This is done via per-position Kullback–Leibler (KL) divergence over next-token probability distributions.
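As a minimal sketch of the same-tokenizer objective described above, the snippet below averages a per-position KL divergence over a short sequence. The direction KL(teacher || student), the absence of a temperature, and the toy logits are assumptions made for illustration, not details taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student); direction is an assumption."""
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)

# Toy example: 2 positions, shared 3-token vocabulary (hypothetical values).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss = kd_loss(teacher, student)
```

Note that the two logit rows at each position must index the same vocabulary, which is exactly the requirement that breaks when teacher and student use different tokenizers.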
This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers, such as Phi-4-mini or Qwen3-4B, because token positions do not correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families.
NVIDIA researchers introduced X-Token, a logit-distribution-based method for cross-tokenizer KD. It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes.
The Problem X-Token is Solving
Two prior approaches dominate cross-tokenizer KD. ULD (Universal Logit Distillation) sidesteps vocabulary alignment by rank-sorting both distributions and minimizing the L1 distance between them. It discards token identity entirely. GOLD adds span alignment and a hybrid loss. It partitions tokens into a common subset of 1-to-1 string-matched tokens, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the current state of the art.
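To make the "discards token identity" point concrete, here is a small sketch of a rank-sorted L1 loss in the spirit of ULD. Zero-padding the shorter vocabulary and the toy distributions are assumptions for illustration.

```python
def uld_loss(teacher_probs, student_probs):
    """Rank-sorted L1 distance between distributions over different vocabularies."""
    t = sorted(teacher_probs, reverse=True)
    s = sorted(student_probs, reverse=True)
    n = max(len(t), len(s))
    # Assumption: the shorter vector is zero-padded so both have equal length.
    t += [0.0] * (n - len(t))
    s += [0.0] * (n - len(s))
    return sum(abs(a - b) for a, b in zip(t, s))

# Teacher puts 0.7 on its first token; student puts 0.7 on a different token.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.1, 0.7, 0.2, 0.0]
identity_blind = uld_loss(teacher_probs, student_probs)  # sorted values coincide
```

Because only sorted probability values are compared, the loss here is zero even though the student's most likely token need not be the teacher's, which is the identity-agnostic behavior the article describes.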
The research team identifies two structural failures in GOLD’s design:
Failure 1: Uncommon-token failure. When tokenizers fragment text differently, critical tokens fall into the unmatched uncommon subset. Llama-3 packs multi-digit numbers as single tokens: “201” is one token. Qwen3 splits them digit by digit: “2”, “0”, “1”. Under GOLD, all 1,100 of Llama’s two- and three-digit numerals (100 two-digit, 1,000 three-digit) fall into the uncommon set when Qwen3-4B is the teacher. Those tokens receive two kinds of harmful signal: identity-agnostic noise from rank-based ULD matching, and suppressive gradients from the common-KL term acting through the full-vocabulary softmax. The result: GSM8k accuracy drops to 2.56 under GOLD with Qwen3-4B, compared to 12.89 for same-tokenizer KD from a weaker Llama-3.2-3B teacher.
Failure 2: Over-conservative matching. GOLD uses strict string equality to define the common subset. A student token Hundreds corresponds to teacher tokens Hund followed by reds under teacher-side re-tokenization, but strict matching discards this pair. Useful alignment signal is lost even when the correspondence is well-formed.
These two failures require opposite remedies: eliminate the partition when critical tokens are misaligned, and relax it when alignment is structurally sound.
How X-Token Works
X-Token has three components: span alignment, a projection matrix W, and two complementary loss formulations, P-KL and H-KL.
Span Alignment
Teacher and student tokenizers produce sequences of different lengths for the same text. X-Token uses dynamic-programming (DP) span alignment, grouping tokens into chunks where each chunk pair decodes to the same underlying text substring. A chain-rule merge then combines per-token probabilities within each chunk into a single chunk-level distribution for use in the distillation loss. The alignment is cached per sequence and adds no per-step training overhead.
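The chunking idea can be illustrated with a deliberately simplified sketch. It uses greedy character-offset matching rather than the DP alignment the paper describes, assumes both token lists decode to exactly the same text, and does not implement the chain-rule merge. The function name and the toy tokens are hypothetical.

```python
def align_spans(student_tokens, teacher_tokens):
    """Group two tokenizations of the same text into chunk pairs.

    Greedy offset matching: a simplified stand-in for DP span alignment.
    Only lengths are compared, so identical underlying text is assumed.
    """
    chunks = []
    i = j = 0                # next unread token on each side
    start_i = start_j = 0    # where the current chunk began
    len_s = len_t = 0        # characters consumed so far on each side
    n_s, n_t = len(student_tokens), len(teacher_tokens)
    while i < n_s or j < n_t:
        # Advance whichever side has consumed fewer characters.
        if i < n_s and (len_s <= len_t or j >= n_t):
            len_s += len(student_tokens[i])
            i += 1
        else:
            len_t += len(teacher_tokens[j])
            j += 1
        if len_s == len_t:   # both sides cover the same substring: close the chunk
            chunks.append((student_tokens[start_i:i], teacher_tokens[start_j:j]))
            start_i, start_j = i, j
    if start_i < n_s or start_j < n_t:   # leftover tokens if the texts differ
        chunks.append((student_tokens[start_i:], teacher_tokens[start_j:]))
    return chunks

# Hypothetical tokenizations of "The year 201."
student_tokens = ["The", " year", " 201", "."]
teacher_tokens = ["The", " year", " ", "2", "0", "1", "."]
chunks = align_spans(student_tokens, teacher_tokens)
```

In this toy case the numeral chunk pairs one student token with four teacher tokens, which is the kind of chunk where per-token teacher probabilities would need to be merged into a single chunk-level distribution.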
The research team also identifies a failure in TRL’s surface-substring alignment, which is used in TRL’s GOLD trainer. TRL accumulates per-side decoded buffers and flushes only when both buffers match as equal raw strings. A byte-level disagreement, such as Llama-3 auto-prepending <bos> while Qwen3 does not, prevents future flushes and forces all remaining tokens into one mis-grouped super-group at the end of the sequence. The DP approach handles this with a single gap move, regardless of sequence length.
The Projection Matrix W
After alignment, teacher and student distributions still operate over different vocabularies. The projection matrix W ∈ ℝ^{|V_S|×|V_T|} maps each student token to a weighted combination of teacher tokens, bridging the vocabulary mismatch.
W is constructed deterministically in two passes:
Pass 1 (exact-match): For each (student token, teacher token) pair whose decoded strings match after canonicalization, set W[s, t] = 1. Canonicalization unifies space prefixes (Ġ, _, ␣), newlines, byte-fallback tokens of the form <0xHH>, and model-specific special tokens across tokenizer families.
Pass 2 (multi-token rule): For each student token without an exact match, re-tokenize its decoded text under the teacher tokenizer. If the resulting sequence has length ≤ 4, assign exponentially decayed weights: W[s, τᵢ] = β·γⁱ with (β, γ) = (0.9, 0.1), where i = 0 denotes the leading sub-token. A length-2 span receives normalized weights (0.909, 0.091). A length-3 span receives (0.9009, 0.0901, 0.0090). A length-4 span receives (0.9000, 0.0900, 0.0090, 0.0009). The leading sub-token receives the highest weight because it typically carries the most informative probability mass: for example, “_inter” in [“_inter”, “national”] or “_20” in [“_20”, “24”].
Each row is truncated to its top-4 entries and row-normalized. Because each row of W is non-negative and sums to 1, left-multiplication by W⊤ is probability-preserving: if p_S is a probability vector, W⊤p_S is also a valid probability vector over V_T. W is constructed once before training and can optionally be jointly refined with the student under P-KL.
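The two-pass construction can be sketched on toy vocabularies as follows. This is an illustration under stated assumptions, not the authors' implementation: canonicalization is omitted, W is stored as sparse dictionary rows, repeated teacher pieces within one span are accumulated, and `toy_teacher_tokenize` is a hypothetical stand-in for a real teacher tokenizer.

```python
def build_projection(student_vocab, teacher_vocab, teacher_tokenize,
                     beta=0.9, gamma=0.1, max_len=4, top_k=4):
    """Return W as sparse rows: W[student_token] = {teacher_token: weight}."""
    teacher_set = set(teacher_vocab)
    W = {}
    for s in student_vocab:
        if s in teacher_set:
            row = {s: 1.0}                      # Pass 1: exact string match
        else:
            pieces = teacher_tokenize(s)        # Pass 2: re-tokenize under teacher
            if not pieces or len(pieces) > max_len:
                W[s] = {}                       # span too long: row stays zero
                continue
            row = {}
            for i, t in enumerate(pieces):      # weight beta * gamma**i, i from 0
                row[t] = row.get(t, 0.0) + beta * gamma ** i
        top = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        total = sum(w for _, w in top)
        W[s] = {t: w / total for t, w in top}   # row-normalize
    return W

def project(W, student_probs):
    """Compute W^T p_S: move student probability mass onto teacher tokens."""
    out = {}
    for s, p in student_probs.items():
        for t, w in W.get(s, {}).items():
            out[t] = out.get(t, 0.0) + p * w
    return out

teacher_vocab = ["the", "0", "1", "2"]
student_vocab = ["the", "20", "201"]

def toy_teacher_tokenize(text):
    """Hypothetical teacher tokenizer: whole token if known, else per character."""
    return [text] if text in teacher_vocab else list(text)

W = build_projection(student_vocab, teacher_vocab, toy_teacher_tokenize)
projected = project(W, {"the": 0.5, "20": 0.2, "201": 0.3})
```

Since every non-empty row sums to 1, the projected vector keeps the total probability mass of the student distribution, which is the probability-preserving property noted above.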
P-KL: Addressing Erroneous and Suppressive Gradients
P-KL removes the partition entirely. It projects the student distribution p̂_S^(k) into teacher vocabulary space via W, that is, it forms W⊤p̂_S^(k). Then it computes the KL divergence directly between the teacher distribution and the projected student distribution.
There is no uncommon set, so rank-based ULD noise is eliminated. The suppressive gradient problem is also eliminated: the projection routes the student’s probability mass for “201” directly onto {2, 0, 1} in the teacher vocabulary via W.
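A minimal sketch of a P-KL style loss at a single aligned chunk is shown below. The KL direction (teacher relative to projected student), the dense list-of-rows layout for W, and all numeric values are assumptions for illustration; the chain-rule merge that produces chunk-level distributions is not shown.

```python
import math

def project_dense(W, p_student):
    """W is |V_S| x |V_T| as a list of rows; returns W^T p_S over teacher tokens."""
    n_teacher = len(W[0])
    return [
        sum(W[s][t] * p_student[s] for s in range(len(W)))
        for t in range(n_teacher)
    ]

def p_kl(p_teacher, p_student, W, eps=1e-12):
    """KL(teacher || W^T student), computed in teacher vocabulary space."""
    q = project_dense(W, p_student)
    return sum(
        pt * math.log((pt + eps) / (qt + eps))
        for pt, qt in zip(p_teacher, q) if pt > 0
    )

# Toy vocabularies (hypothetical): student = ["the", "201"],
# teacher = ["the", "2", "0", "1"].
W = [
    [1.0, 0.0, 0.0, 0.0],            # "the" matches exactly
    [0.0, 0.9009, 0.0901, 0.0090],   # "201" spread over "2", "0", "1"
]
p_student = [0.4, 0.6]
p_teacher = [0.4, 0.55, 0.03, 0.02]

projected = project_dense(W, p_student)
loss = p_kl(p_teacher, p_student, W)
```

Every student token contributes through its row of W, so no token is routed to a rank-matching fallback.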
The research team formally proves (Proposition 1) that GOLD’s common-KL term induces non-negative gradients on every uncommon student logit. The gradient on an uncommon student logit j is ∂ℒ_common/∂z_j = p_S[j] · M_C^(T), where M_C^(T) is the teacher probability mass on the common subset. Under gradient descent, this always drives z_j downward, suppressing every uncommon token’s probability regardless of the ground-truth token.
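The gradient expression above can be checked numerically on a toy example. The sketch assumes the common-KL term reduces, up to a constant in the student logits, to a cross-entropy over the common subset with a full-vocabulary student softmax, and that common tokens share indices on both sides. All values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def common_kl(z_student, p_teacher_common, common_idx):
    """Cross-entropy part of KL on the common subset.

    The teacher-entropy term is constant in the student logits and is dropped.
    """
    p = softmax(z_student)   # softmax over the FULL student vocabulary
    return -sum(pt * math.log(p[c]) for pt, c in zip(p_teacher_common, common_idx))

z = [0.3, -0.2, 1.1, 0.5]    # student logits; indices 0, 1 common; 2, 3 uncommon
common_idx = [0, 1]
p_t_common = [0.5, 0.2]      # teacher mass on the common subset: M_C = 0.7
j = 2                        # an uncommon student logit

h = 1e-6                     # central finite difference
z_plus, z_minus = z[:], z[:]
z_plus[j] += h
z_minus[j] -= h
numeric = (common_kl(z_plus, p_t_common, common_idx)
           - common_kl(z_minus, p_t_common, common_idx)) / (2 * h)
analytic = softmax(z)[j] * sum(p_t_common)   # p_S[j] * M_C
```

The positive sign means a descent step lowers the uncommon logit regardless of which token is actually correct at that position.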
H-KL: Relaxing the 1-to-1 Matching
H-KL applies when the partition is structurally sound, that is, when critical tokens land in the common subset. In that case, GOLD’s direct KL on identity-aligned pairs delivers sharper per-pair supervision than P-KL’s projection, which blends student probability mass across multiple teacher tokens. The remedy is to make the partition less wasteful by relaxing the strict string-equality criterion.
H-KL keeps GOLD’s hybrid loss structure but expands the common set C using W. For each student token s, it selects the top-ranked teacher token t* = argmax_{t’∈V_T} W[s, t’] and adds (s, t*) to C. Exact matches are preserved since they receive weight 1 in W, the highest possible. Near-equivalent pairs like (Hundreds, Hund), excluded by GOLD, are now admitted. The expanded C feeds the same hybrid loss: direct KL on common pairs, ULD on the remainder.
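The expansion of the common set can be sketched as an argmax over each row of W. The sparse-row layout, the treatment of all-zero rows (left out of C), and the toy entries are assumptions for illustration.

```python
def expand_common_set(W):
    """Pair each student token with its highest-weight teacher token in W."""
    common = {}
    for s, row in W.items():
        if row:                        # all-zero rows have no candidate pair
            common[s] = max(row, key=row.get)
    return common

# Toy W with sparse rows (hypothetical weights).
W = {
    "the": {"the": 1.0},                          # exact match
    "Hundreds": {"Hund": 0.909, "reds": 0.091},   # multi-token rule
    "xyzzy": {},                                  # no usable mapping
}

strict_common = {s for s, row in W.items() if row.get(s) == 1.0}
expanded_common = expand_common_set(W)
```

Strict string matching keeps only "the", while the expanded set also admits the (Hundreds, Hund) pair.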
Selecting Between P-KL and H-KL
The selection uses a coverage audit over token categories in the student vocabulary. For math tasks, multi-digit numerals are the critical category. Table 8 in the research paper shows that under Qwen3-4B, 0 out of 100 two-digit Llama numerals and 0 out of 1,000 three-digit Llama numerals appear in C. Under Phi-4-mini-Instruct, all 100 two-digit and all 1,000 three-digit numerals appear in C. ASCII punctuation and single-digit numerals are fully covered in both cases.

The rule: use P-KL when critical tokens fall outside C (Qwen3-4B), and H-KL when the partition is sound (Phi-4-mini-Instruct). Table 2 in the research paper shows the mode reversal is sharp: P-KL outperforms H-KL by +3.55 avg. on Qwen3-4B, while H-KL outperforms P-KL by +1.68 avg. on Phi-4-mini.
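A coverage audit of this kind could look like the sketch below. The category definitions (numerals generated with zero padding), the full-coverage threshold `min_coverage=1.0`, and the two toy common sets are all assumptions; the article only reports the 0% and 100% cases.

```python
def coverage_audit(common_student_tokens, categories):
    """Per category: (number of tokens found in C, category size)."""
    return {
        name: (sum(tok in common_student_tokens for tok in toks), len(toks))
        for name, toks in categories.items()
    }

def choose_mode(report, critical, min_coverage=1.0):
    """P-KL if any critical category is under-covered, otherwise H-KL."""
    for name in critical:
        hit, total = report[name]
        if hit / total < min_coverage:   # threshold is a hypothetical parameter
            return "P-KL"
    return "H-KL"

categories = {
    "single_digit": [str(n) for n in range(10)],
    "two_digit": [f"{n:02d}" for n in range(100)],
    "three_digit": [f"{n:03d}" for n in range(1000)],
}
critical = ["two_digit", "three_digit"]

# Teacher that splits numerals digit by digit: only single digits are common.
digit_split_common = set(categories["single_digit"])
# Teacher that keeps multi-digit numerals whole: every numeral is common.
numeral_preserving_common = set().union(*categories.values())

report_split = coverage_audit(digit_split_common, categories)
report_whole = coverage_audit(numeral_preserving_common, categories)
mode_split = choose_mode(report_split, critical)
mode_whole = choose_mode(report_whole, critical)
```

With these toy sets the audit reproduces the qualitative pattern described above: zero coverage of multi-digit numerals selects P-KL, full coverage selects H-KL.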

Multi-Teacher Distillation
X-Token extends to multiple teachers. Each teacher has its own projection matrix W_m and loss selection. For same-tokenizer teachers, standard token-level KL is used. The multi-teacher loss aggregates the per-teacher losses with weights α_m.
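Written out under the assumption that the aggregation is a plain weighted sum over M teachers, the multi-teacher objective would take the following form. The symbol names and the non-negativity constraint are assumptions; the static weights reported in the results (0.8 and 0.2) happen to sum to 1, but normalization is not confirmed here.

```latex
\mathcal{L}_{\text{multi}} \;=\; \sum_{m=1}^{M} \alpha_m \, \mathcal{L}_{\text{KD}}^{(m)},
\qquad \alpha_m \ge 0
```

Here each per-teacher term would be P-KL, H-KL, or standard token-level KL, depending on that teacher's tokenizer and loss selection.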
The research team evaluates static and confidence-adaptive weighting schemes. Confidence-adaptive variants compute α_m from the cross-entropy, Shannon entropy, or maximum predicted probability of the teacher’s distribution. Static weighting outperforms adaptive schemes in both multi-teacher setups evaluated.

Dynamic KD/CE Scaling
Training combines the distillation loss ℒ_KD with next-token cross-entropy ℒ_CE. Because these terms differ in magnitude and shift during training, X-Token rescales the KD term at each step to match the scale of ℒ_CE, with the scaling factor computed under a stop-gradient operator sg(·).
Table 4 in the paper shows dynamic scaling outperforms three fixed-weight settings (KD-heavy, balanced, CE-heavy) on the Qwen3-4B (P-KL) pair.
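One form consistent with this description, offered as an assumption rather than the paper's exact equation, scales the KD term by the ratio of the two detached loss values:

```latex
\lambda_t \;=\; \frac{\operatorname{sg}\!\left(\mathcal{L}_{\text{CE}}\right)}{\operatorname{sg}\!\left(\mathcal{L}_{\text{KD}}\right)},
\qquad
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda_t \, \mathcal{L}_{\text{KD}}
```

Under this form the scaled KD term has the same numerical value as the CE term at every step, while gradients flow only through the un-detached loss terms and not through the scaling factor.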

Experiments and Results
Student: Llama-3.2-1B. Teachers: Llama-3.2-3B (same tokenizer), Qwen3-4B, and Phi-4-mini-Instruct. Training data: NemotronClimbMix dataset, 30,000 steps, batch size 768, context length 4096. Optimizer: AdamW, learning rate 5×10⁻⁵, 5% warmup with cosine decay, weight decay 0.1, gradient clipping 1.0. Each experiment is feasible on a single NVIDIA H100 GPU; the research team used 128 H100s to speed up iteration.
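For readers who want to reproduce the schedule, the sketch below turns the stated hyperparameters into a learning-rate function. Linear warmup, decay to zero, and fully packed 4096-token sequences are assumptions; the article only states "5% warmup with cosine decay".

```python
import math

TOTAL_STEPS = 30_000
PEAK_LR = 5e-5
WARMUP_FRAC = 0.05

def lr_at(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
          warmup_frac=WARMUP_FRAC, min_lr=0.0):
    """Linear warmup followed by cosine decay (both shapes are assumptions)."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Upper bound on tokens seen, assuming every sequence is fully packed.
tokens_per_step = 768 * 4096
max_total_tokens = tokens_per_step * TOTAL_STEPS
```

Under the full-packing assumption the run processes at most about 94 billion tokens.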
Evaluation: 3-shot accuracy on MMLU, GSM8k, MATH-Hendrycks, Winogrande, and HellaSwag.
Key results:
| Setting | Method | Avg. |
|---|---|---|
| No distillation | Llama-1B (base) | 33.96 |
| No distillation | Continued pre-training | 36.63 |
| Same tokenizer | Llama-3B → 1B (KL) | 38.40 |
| Cross-tokenizer | Qwen-4B, ULD | 36.77 |
| Cross-tokenizer | Qwen-4B, GOLD | 35.03 |
| Cross-tokenizer | Qwen-4B, X-Token (P-KL) | 38.85 |
| Cross-tokenizer | Phi-mini, ULD | 38.31 |
| Cross-tokenizer | Phi-mini, GOLD | 38.66 |
| Cross-tokenizer | Phi-mini, X-Token (H-KL) | 39.18 |
| Multi-teacher | Phi-mini + Llama-3B (X-Token) | 40.48 |
On Qwen-4B (P-KL regime): GOLD reaches 35.03 avg., below even continued pre-training without a teacher (36.63). This confirms the partition is actively harmful when critical tokens are misaligned. Pure ULD (36.77) already improves over GOLD, indicating the partition is the primary failure source. P-KL further improves to 38.85 avg. (+3.82 over GOLD). GSM8k alone moves from 2.56 to 15.54, surpassing same-tokenizer KD from Llama-3.2-3B (12.89) on that benchmark.
On Phi-mini (H-KL regime): GOLD reaches 38.66 avg., a reasonable baseline where the partition is structurally sound. H-KL improves to 39.18 avg. (+0.52 over GOLD). P-KL applied to Phi-mini drops to 37.50 avg., confirming that the wrong loss mode hurts even when W is available.
Multi-teacher: Phi-mini (H-KL, α=0.8) + Llama-3B (standard KL, α=0.2) under static weighting reaches 40.48 avg. This is +2.08 over same-family KD from Llama-3B alone, and +1.30 over the best single cross-tokenizer result (39.18). Combining Phi-mini + Qwen-4B, two teachers with overlapping reasoning strengths, scores only 38.49, below the best single teacher. Adding Qwen-4B as a third teacher yields 40.15, with math/reasoning degrading (GSM8k 20.39 → 19.18) while commonsense improves slightly. Teacher complementarity, not teacher count, drives gains.
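The deltas quoted in this section follow directly from the reported averages, as the short check below shows. The dictionary keys are hypothetical labels; the numbers are the averages stated in the article.

```python
# Average scores as reported in the article.
avg = {
    "qwen_gold": 35.03,
    "qwen_pkl": 38.85,
    "phi_gold": 38.66,
    "phi_hkl": 39.18,
    "phi_pkl": 37.50,
    "llama3b_kl": 38.40,
    "multi_phi_llama": 40.48,
}

deltas = {
    "P-KL vs GOLD (Qwen-4B)": round(avg["qwen_pkl"] - avg["qwen_gold"], 2),
    "H-KL vs GOLD (Phi-mini)": round(avg["phi_hkl"] - avg["phi_gold"], 2),
    "H-KL vs P-KL (Phi-mini)": round(avg["phi_hkl"] - avg["phi_pkl"], 2),
    "Multi vs Llama-3B KD": round(avg["multi_phi_llama"] - avg["llama3b_kl"], 2),
    "Multi vs best single": round(avg["multi_phi_llama"] - avg["phi_hkl"], 2),
}

for name, value in deltas.items():
    print(f"{name}: {value:+.2f}")
```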
Strengths and What to Watch
Strengths:
- The suppressive gradient problem in GOLD’s hybrid loss is formally proved (Proposition 1), not just observed empirically
- W is constructed by rule from tokenizer strings alone; no training data or learned parameters are needed at initialization
- Dynamic KD/CE scaling removes the need to tune fixed loss weights; it outperforms three fixed-weight baselines in ablations
- The multi-teacher extension adds no architectural changes; each teacher uses its own W_m and appropriate loss
- The coverage audit for P-KL vs H-KL selection is a defined, reproducible criterion based on per-category token retention in C
What to Watch:
- Experiments use only Llama-3.2-1B as the student under continued pre-training; larger students and instruction-tuned settings are not evaluated
- Only three teacher pairs are tested; low-overlap tokenizer families (SentencePiece, byte-level BPE) are left for future work
- Static weighting outperforms confidence-adaptive weighting in all tested multi-teacher setups, but the reason why remains unclear
- The multi-token rule in Pass 2 skips student tokens whose decoded text re-tokenizes to sequences longer than 4 under the teacher; these rows remain zero in W
Key Takeaways
- X-Token identifies two distinct, opposite failure modes in GOLD: uncommon-token suppression (fix: remove the partition with P-KL) and over-conservative matching (fix: relax it with H-KL).
- The projection matrix W is constructed by rule from tokenizer strings before training; it can optionally be jointly refined with the student for additional gains.
- P-KL on Qwen3-4B improves over GOLD by +3.82 avg. and recovers GSM8k from 2.56 to 15.54.
- Multi-teacher distillation gains (+1.30 over the best single teacher) come from teacher complementarity, not just from adding more teachers.
- Loss mode selection (P-KL vs H-KL) is determined by a coverage audit on token categories; applying the wrong mode reverses the ranking.
