Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models
Pre-training large language models is expensive enough that even modest efficiency improvements translate into meaningful cost and time savings. Nous Research is releasing Token Superposition Training (TST), a technique that substantially reduces pre-training wall-clock time at fixed compute without touching the model architecture, optimizer, tokenizer, parallelism strategy, or training data.
At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200-GPU-hours versus the baseline's 12,311, roughly a 2.5x reduction in total pre-training time.

The Problem TST is Solving
Modern LLM pre-training is heavily data-driven. Recent training regimes routinely overtrain well beyond compute-optimal estimates, and raw text throughput (how much data a model can process per FLOP) has become a key lever. Subword tokenizers like BPE already improve throughput by compressing sequences, and research suggests much of the BPE advantage over byte-level models comes simply from shorter sequences, which means the model sees more text per unit of compute.
TST asks whether that throughput lever can be pulled further during training, independently of the tokenizer and without permanently altering the model.
How TST Works: Two Phases
TST modifies the standard pre-training loop in two sequential phases:
Phase 1 — Superposition: For the first r fraction of total training steps (the paper finds r ∈ [0.2, 0.4] to be close to optimal across tested scales), the model does not receive individual tokens. Instead, the input sequence of length L is segmented into non-overlapping bags of s contiguous tokens. In the embedding layer, each bag is collapsed into a single latent "s-token" by averaging the s token embeddings. The transformer then processes a sequence of length L/s.
Crucially, every TST step is kept equal-FLOPs to a standard training step by increasing the data sequence length by a factor of s during the superposition phase. Because each latent position corresponds to s source tokens, the model ingests s times as much text per unit of compute; this is what drives the throughput gain.
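To make the input-side mechanism concrete, here is a minimal PyTorch sketch of the bagging step, written from the description above; the function name and shapes are our assumptions, not the Nous Research implementation.

```python
import torch
import torch.nn as nn

def superpose_inputs(token_ids: torch.Tensor, embed: nn.Embedding, s: int) -> torch.Tensor:
    """Collapse each bag of s contiguous tokens into one averaged 's-token'.

    token_ids: (batch, seq_len) with seq_len divisible by s.
    Returns:   (batch, seq_len // s, d_model), the shorter sequence the
               transformer actually processes during Phase 1.
    """
    batch, seq_len = token_ids.shape
    assert seq_len % s == 0, "sequence length must be a multiple of the bag size"
    x = embed(token_ids)                                    # (batch, seq_len, d_model)
    return x.view(batch, seq_len // s, s, -1).mean(dim=2)   # average within each bag
```

Because the transformer only sees seq_len // s positions, feeding sequences s times longer than the baseline keeps per-step FLOPs matched while exposing the model to s times more text.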
On the output side, each latent position predicts the next bag of s tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass 1/s to each token in the target bag. The MCE loss reduces to a simple mean of standard cross-entropy terms over the s targets; it can be implemented using the fused CE kernels already present in any major pre-training library, without writing a new kernel or adding an auxiliary head.
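A reference version of the MCE loss, under the uniform 1/s bag-mass definition above (again our naming and shapes, not the paper's code):

```python
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, target_bags: torch.Tensor) -> torch.Tensor:
    """MCE: each latent position spreads probability mass 1/s over its target bag.

    logits:      (batch, positions, vocab) -- one distribution per latent position.
    target_bags: (batch, positions, s)     -- the s token ids in the next bag.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    bag_log_probs = log_probs.gather(-1, target_bags)  # (batch, positions, s)
    # -(1/s) * sum_i log p(t_i) per position, averaged over positions:
    # a plain mean of standard cross-entropy terms, as the article notes.
    return -bag_log_probs.mean()
```

Equivalently, one can repeat each position's logits s times and call the standard fused cross-entropy kernel on the flattened (batch * positions * s, vocab) view, which is why no new kernel is needed.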
Phase 2 — Recovery: After the superposition phase, training resumes from the saved checkpoint with standard next-token prediction for the remaining 1 − r fraction of steps. The TST code path is fully removed at this boundary to avoid any experimental contamination. A transient loss spike occurs at the transition, typically between 1 and 2 nats, which resolves within a few thousand steps. After that, the recovered model crosses below the equal-FLOPs baseline and stays there.
The model produced at the end of Phase 2 is architecturally identical to one produced by conventional pre-training, with the same next-token prediction inference behavior.
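Putting the two phases together, here is a toy end-to-end schedule on a throwaway model; r and s follow the article's 3B example, while the model, random data, and step count are stand-ins for illustration (a real run would also use causal masking and the paper's WSD schedule):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, L, s, r, total_steps = 1000, 64, 48, 6, 0.3, 100
switch_step = int(r * total_steps)  # first 30% of steps run superposed

embed = nn.Embedding(vocab, d_model)
body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # stand-in transformer
head = nn.Linear(d_model, vocab)
opt = torch.optim.AdamW(
    [*embed.parameters(), *body.parameters(), *head.parameters()], lr=1e-3
)

for step in range(total_steps):
    if step < switch_step:
        # Phase 1: s-times-longer sequence, bagged inputs, bag targets under MCE.
        tokens = torch.randint(vocab, (2, s * L + s))
        x = embed(tokens[:, :-s]).view(2, L, s, -1).mean(2)  # L superposed positions
        target_bags = tokens[:, s:].view(2, L, s)            # the next bag per position
        logits = head(body(x))
        loss = -F.log_softmax(logits, -1).gather(-1, target_bags).mean()  # MCE
    else:
        # Phase 2: plain next-token prediction, continuing from the same weights.
        tokens = torch.randint(vocab, (2, L + 1))
        logits = head(body(embed(tokens[:, :-1])))
        loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```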
What the Experiments Show
TST was validated at four scales: 270M and 600M dense (SmolLM2 shapes adapted to the Llama3 modeling code, with the Llama3-8B tokenizer and untied input/output embeddings, which makes the 270M model equal in size to SmolLM2-135M and the 600M to SmolLM2-360M), 3B dense (SmolLM3 shape), and a 10B-A1B MoE in the Qwen3 family. Training used the DCLM dataset for the smaller runs and a 50/50 mixture of DCLM and FineWeb-Edu for the MoE run. All runs used AdamW with the Warmup-Stable-Decay learning rate schedule and ran in TorchTitan under FSDP parallelism, on 64 NVIDIA B200 GPUs for the larger models and 8 B200 GPUs for the smaller ones.
At the 3B scale with bag size s = 6 and step ratio r = 0.3, TST at 20,000 steps reaches a final loss of 2.676, nearly matching a 36,000-step baseline at 2.677, while using 247 B200-GPU-hours versus 443. The 20k-step TST run scores 62.4 on HellaSwag and 66.3 on ARC-Easy, versus 62.3 and 65.9 for the 36k baseline.
At the 10B-A1B MoE scale with s = 16 and r ≈ 0.25, the TST run processes 2T data tokens and achieves a final loss of 2.236, below the baseline's 2.252 after 1.05T tokens, while beating it on all four reported benchmarks: HellaSwag (71.2 vs. 70.1), ARC-Easy (74.2 vs. 73.8), ARC-Challenge (47.3 vs. 46.3), and MMLU (39.0 vs. 37.4).
The research team presents three comparison views against the baseline: equal-FLOPs, equal-loss, and equal-data. Under equal-FLOPs and equal-loss conditions, TST consistently wins. Under equal total token consumption, the baseline wins, because TST's effective compute budget per data token is smaller. This is an important boundary condition that determines where TST applies.
Two Distinct Mechanisms
An ablation study isolates the input-side and output-side components. Both independently outperform the baseline; combining them produces further improvement without signs of interference. The authors interpret this as evidence that TST is two orthogonal mechanisms rather than a single trick.
The output-side mechanism, next-bag-of-tokens prediction, is conceptually related to multi-token prediction (MTP). Unlike MTP, which adds k independent prediction heads and extra parameters, TST keeps a single output head and replaces only the target. This makes it the least expensive member of a growing class of future-signal auxiliary objectives. Unlike MTP, it shows consistent gains across all tested scales, including small models where MTP has been shown to degrade performance.
The input-side mechanism has no direct analog in the existing pre-training literature. The research team offers two plausible explanations: it may implicitly regularize the embedding geometry (since many random s-grams of tokens must remain linearly separable once averaged), or it may act as a form of pre-pre-training, exposing the model to a coarser version of the true data before fine-resolution language modeling begins.
A targeted ablation directly tests what happens when representation continuity is broken. The research team runs a 3B TST experiment where the input embedding and output LM head are randomly re-initialized at the start of Phase 2. The result: final loss jumps to 2.938, worse than both the TST run (2.676) and the standard baseline (2.808). The Phase 1 TST steps contributed nothing to the final model. This confirms that shared representations across both phases are not incidental to TST's success; they are what makes it work.
Key Takeaways
- Nous Research's Token Superposition Training (TST) cuts LLM pre-training time by up to 2.5x at matched FLOPs, with no architecture, tokenizer, or optimizer changes required.
- Phase 1 averages contiguous token embeddings into bags and predicts the next bag via multi-hot cross-entropy; Phase 2 reverts to standard next-token prediction from the same checkpoint.
- Validated at 270M, 600M, 3B dense, and 10B-A1B MoE scales: TST beats the baseline on loss and downstream evals (HellaSwag, ARC, MMLU) across all of them.
- Optimal hyperparameters: bag size s ∈ [3, 8] for smaller models, step ratio r ∈ [0.2, 0.4]; shared embeddings across both phases are essential, as re-initializing them makes TST worse than the baseline.
- Trade-off: TST consumes more raw data tokens per compute budget, so it is best suited to compute-bound training; the output-only variant is the option for data-bound settings.
Check out the Paper and Project.
