Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference
Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens. The large target model verifies them in parallel. If the proposals are accepted, inference is faster. If they are rejected, the system falls back gracefully.
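The propose-then-verify loop can be sketched in a few lines of Python. This is a toy illustration with greedy verification, not the EAGLE or vLLM implementation: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and the two "models" are simple functions over integer tokens.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    k: int,
) -> List[Token]:
    """Propose k draft tokens, then keep the longest prefix the target agrees with."""
    # Draft phase: the small model proposes k tokens one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    # Verify phase: a real system scores all k positions in a single target
    # forward pass; this sketch checks them in a loop for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            # Rejection: fall back to the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every proposal matched, so the target pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def target_next(ctx: List[Token]) -> Token:
    # Toy target model: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    # Toy draft model: agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next, k=3))
print(speculative_step([3], draft_next, target_next, k=3))
```

The number of tokens returned per call is the acceptance length for that step. The more often the drafter agrees with the target, the more tokens each target pass produces.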
The EAGLE series (EAGLE 1, EAGLE 2, and EAGLE 3) has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Now the EAGLE Team, vLLM Team, and TorchSpec Team have given that family a targeted reliability upgrade with the introduction of EAGLE 3.1.
What Was Going Wrong
While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.
The EAGLE team traced this fragility to a phenomenon called attention drift: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.
In simpler terms: the drafter is a small model that predicts future tokens. As speculation gets deeper, it starts attending to its own prior outputs instead of the original context. This degrades acceptance length and output stability.
Two underlying issues were identified. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps because of the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.
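The second issue, magnitude growth along an unnormalized residual path, can be seen in a small numerical sketch. The update rule below is a hypothetical stand-in (`toy_block` simply scales its input by 0.5) and is not the drafter's real layer. It only shows how repeatedly adding a layer output back onto the hidden state, with no normalization in between, makes the norm grow at every step.

```python
import math
from typing import List


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def toy_block(h: List[float]) -> List[float]:
    # Hypothetical stand-in for the drafter's layer update, not the real network.
    return [0.5 * x for x in h]


def unnormalized_rollout(h: List[float], steps: int) -> List[float]:
    """Apply h <- h + block(h) repeatedly and record the L2 norm after each step."""
    norms = []
    for _ in range(steps):
        update = toy_block(h)
        h = [a + b for a, b in zip(h, update)]  # residual add, no normalization
        norms.append(l2_norm(h))
    return norms


norms = unnormalized_rollout([1.0, 0.0, 0.0, 0.0], steps=4)
print(norms)
```

In this toy setup the norm is multiplied by 1.5 at every step, so deeper speculation steps see inputs on a very different scale from the first step.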
Two Architectural Fixes in EAGLE 3.1
To address attention drift, EAGLE 3.1 introduces two key architectural improvements: FC normalization, applied after each target hidden state and before the FC layer, and feeding post-norm hidden states into the next decoding step.
FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, making the drafter increasingly unreliable. Applying normalization at each step keeps the inputs bounded.
The post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than merely appending more layers to the target model.
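A minimal sketch of both ideas, under stated assumptions: the article does not say which normalization is used, so the code assumes an RMS-style norm without a learned gain, and the vectors, function names, and the 0.5 update factor are all illustrative. The first part measures how much of the fused input comes from the highest layer with and without per-layer normalization. The second part feeds a normalized state back into the next step.

```python
import math
from typing import List


def rms_norm(v: List[float], eps: float = 1e-6) -> List[float]:
    """Scale a vector to roughly unit root-mean-square (assumed norm, no learned gain)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def fuse(hidden_states: List[List[float]], normalize: bool) -> List[float]:
    """Concatenate per-layer target hidden states, optionally normalizing each first."""
    fused: List[float] = []
    for h in hidden_states:
        fused.extend(rms_norm(h) if normalize else h)
    return fused


def share_of_last(hidden_states: List[List[float]], normalize: bool) -> float:
    """Fraction of the fused squared magnitude contributed by the highest layer."""
    fused = fuse(hidden_states, normalize)
    d = len(hidden_states[-1])
    total = sum(x * x for x in fused)
    return sum(x * x for x in fused[-d:]) / total


def post_norm_step(h: List[float]) -> List[float]:
    # Hypothetical drafter update followed by normalization; the normalized
    # state is what the next speculation step receives.
    updated = [x + 0.5 * x for x in h]
    return rms_norm(updated)


# Illustrative low, middle, and high layer states with very different scales.
layers = [
    [0.1, -0.2, 0.1, 0.2],
    [1.0, -1.0, 0.5, 0.5],
    [10.0, -8.0, 9.0, 7.0],
]
print(share_of_last(layers, normalize=False))  # highest layer dominates
print(share_of_last(layers, normalize=True))   # roughly one third each

state = [1.0, 0.0, 0.0, 0.0]
for _ in range(5):
    state = post_norm_step(state)
final_rms = math.sqrt(sum(x * x for x in state) / len(state))
print(final_rms)  # stays near 1 regardless of the number of steps
```

In the real model the fused vector would then pass through the FC layer. The sketch stops before that, because only the scale of the inputs matters for the point being made.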

What These Fixes Deliver
Compared with EAGLE 3, EAGLE 3.1 demonstrates better training-time to inference-time extrapolation, stronger long-context robustness, greater resilience to chat template and system prompt variation, and more stable acceptance length across diverse serving environments.
In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3.
Training Infrastructure: TorchSpec
TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.
Based on TorchSpec and vLLM, the research team also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on HuggingFace. The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model.
vLLM Integration: Config-Driven and Backward-Compatible
EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-norm hidden-state feedback, and removal of hardcoded assumptions around target hidden states.
Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models can be plugged in directly through the same speculative-decoding code path.
vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only
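Because the speculative-decoding settings travel as a JSON string inside a shell argument, quoting mistakes are easy to make. One way to avoid them, sketched here with only the Python standard library, is to build the JSON from a dictionary and let `shlex.quote` handle the shell quoting. The values are the ones from the serve command in this article, and the sketch assumes the `model` and `method` key names that vLLM's speculative config uses.

```python
import json
import shlex

# Values taken from the serve command in this article; nothing here is a new option.
speculative_config = {
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

# Compact separators keep the JSON free of spaces; shlex.quote makes it shell-safe.
config_json = json.dumps(speculative_config, separators=(",", ":"))
flag = "--speculative-config " + shlex.quote(config_json)
print(flag)
```

Changing the speculation depth then means editing one integer in the dictionary rather than a hand-written JSON string.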
Benchmark Results on Kimi K2.6
The research team benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1. The speedup remains meaningful as concurrency scales: 1.71× at C=4 and 1.66× at C=16.
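To put the scaling in perspective, the reported factors can be expressed relative to the concurrency-1 figure. The short calculation below uses only the three numbers quoted above. The helper name `retained_fraction` is illustrative, and the result is simple arithmetic on the reported values, not an additional measurement.

```python
from typing import Dict

# Per-user output throughput speedups as reported, keyed by concurrency.
speedups: Dict[int, float] = {1: 2.03, 4: 1.71, 16: 1.66}


def retained_fraction(values: Dict[int, float], baseline_concurrency: int = 1) -> Dict[int, float]:
    """Express each speedup as a fraction of the speedup at the baseline concurrency."""
    base = values[baseline_concurrency]
    return {c: s / base for c, s in values.items()}


retained = retained_fraction(speedups)
for c, frac in retained.items():
    print(f"C={c}: {speedups[c]:.2f}x speedup, {frac:.0%} of the C=1 speedup")
```

By this measure, the speedup factor at C=16 is a little over four fifths of the factor at C=1.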
Key Takeaways
- EAGLE 3.1 fixes attention drift, a newly identified instability in which the drafter loses focus on sink tokens at deeper speculation depths.
- Two architectural changes, FC normalization and post-norm hidden-state feedback, stabilize the drafter across speculation steps.
- In long-context workloads, EAGLE 3.1 delivers up to 2× longer acceptance length compared with EAGLE 3.
- Benchmarks on Kimi-K2.6-NVFP4 show 2.03× per-user output throughput at concurrency 1, dropping to 1.66× at C=16.
- EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM main, shipping in v0.22.0.
The post Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference appeared first on MarkTechPost.
