
Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

https://vllm.ai/blog/2026-05-26-eagle-3-1

Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens, and the large target model verifies them in parallel. Accepted tokens make inference faster; when a token is rejected, the system falls back gracefully to the target model's own output.
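The propose-then-verify loop can be sketched in a few lines of Python. This is a toy greedy variant for illustration only, not the EAGLE or vLLM implementation: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and the two "models" are simple integer functions standing in for real networks.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Run one greedy propose-then-verify round; return the newly emitted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target checks each position. A real system scores all positions
    #    in one batched forward pass; the loop here is only for readability.
    emitted: List[Token] = []
    for tok in proposed:
        expected = target_next(prefix + emitted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the remaining drafts.
            emitted.append(expected)
            return emitted
        emitted.append(tok)

    # 3. Every draft was accepted, so the target adds one bonus token.
    emitted.append(target_next(prefix + emitted))
    return emitted


def target_next(ctx: List[Token]) -> Token:
    # Toy target model: always counts upward.
    return ctx[-1] + 1


def draft_next(ctx: List[Token]) -> Token:
    # Toy draft model: agrees with the target except right after a token
    # whose value is 3 modulo 4.
    return ctx[-1] + 1 if ctx[-1] % 4 != 3 else 0


print(speculative_step([0], draft_next, target_next, k=3))        # all drafts accepted
print(speculative_step([0, 1, 2], draft_next, target_next, k=3))  # rejection at the second draft
```

In both calls the emitted tokens are exactly what the target would have produced on its own, which is why greedy verification leaves the output unchanged while saving target steps.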

The EAGLE series, which includes EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems. Today, the EAGLE team, vLLM team, and TorchSpec team are giving that family a targeted reliability upgrade with the introduction of EAGLE 3.1.

What Was Going Wrong

While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.

The EAGLE team traced this fragility to a phenomenon called attention drift: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.

In simpler terms: the drafter is a small model that predicts future tokens. As speculation gets deeper, it starts attending to its own prior outputs instead of the original context. This degrades acceptance length and output stability.

Two underlying issues were identified. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps because of the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.

Two Architectural Fixes in EAGLE 3.1

To address attention drift, EAGLE 3.1 comes with two key architectural improvements: FC normalization applied after each target hidden state and before the FC layer, and feeding post-norm hidden states into the next decoding step.

FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, making the drafter increasingly unreliable. Applying normalization at each step keeps the inputs bounded.

The post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than simply appending more layers to the target model.
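A small numeric sketch can make the effect of normalizing before the FC layer concrete. It assumes RMS normalization and three target hidden states labeled `low`, `mid`, and `high`; the article specifies neither the norm type nor the number of layers, and the vectors are made-up values chosen so that the higher layer has a much larger magnitude.

```python
import math
from typing import Dict, List

Vector = List[float]


def rms_norm(v: Vector, eps: float = 1e-6) -> Vector:
    """Scale a vector so that its root-mean-square value is about 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / (rms + eps) for x in v]


def energy_share(parts: Dict[str, Vector]) -> Dict[str, float]:
    """Fraction of the concatenated vector's squared norm owned by each part."""
    energy = {name: sum(x * x for x in v) for name, v in parts.items()}
    total = sum(energy.values())
    return {name: e / total for name, e in energy.items()}


# Hypothetical target hidden states (illustrative values, not measured data).
target_states: Dict[str, Vector] = {
    "low": [0.2, -0.1, 0.3, 0.1],
    "mid": [1.0, -0.8, 0.5, 0.9],
    "high": [9.0, -7.5, 8.0, 6.5],
}

# Without normalization, the high layer dominates the fused input.
raw_share = energy_share(target_states)

# Normalizing each hidden state separately gives every layer an equal share.
normed_states = {name: rms_norm(v) for name, v in target_states.items()}
normed_share = energy_share(normed_states)

# The concatenation of the normalized states is what an FC layer would consume.
fc_input = [x for name in ("low", "mid", "high") for x in normed_states[name]]

print({k: round(v, 3) for k, v in raw_share.items()})
print({k: round(v, 3) for k, v in normed_share.items()})
```

With these values the unnormalized `high` state accounts for nearly all of the fused input's energy, while after per-state normalization each layer contributes about one third.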
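The difference between carrying the raw residual stream forward and carrying a post-norm state can be illustrated with a toy rollout. Everything here is an assumption made for illustration: `toy_block` is a hypothetical stand-in for a drafter layer, RMS normalization is assumed, and the numbers say nothing about real EAGLE 3.1 activations.

```python
import math
from typing import List

Vector = List[float]


def rms(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_norm(v: Vector, eps: float = 1e-6) -> Vector:
    scale = rms(v) + eps
    return [x / scale for x in v]


def toy_block(v: Vector) -> Vector:
    # Hypothetical drafter layer: a fixed elementwise update.
    return [0.5 * x + 0.1 for x in v]


def rollout(h0: Vector, steps: int, post_norm: bool) -> List[float]:
    """Return the RMS magnitude of the state carried into each next step."""
    h = list(h0)
    magnitudes = []
    for _ in range(steps):
        # Residual path: add the block output to its input.
        h = [a + b for a, b in zip(h, toy_block(h))]
        if post_norm:
            # Post-norm feedback: the next step sees the normalized state.
            h = rms_norm(h)
        magnitudes.append(rms(h))
    return magnitudes


h0 = [1.0, -0.5, 0.25, 2.0]
raw = rollout(h0, steps=5, post_norm=False)
normed = rollout(h0, steps=5, post_norm=True)

print([round(m, 2) for m in raw])     # grows step after step
print([round(m, 2) for m in normed])  # stays close to 1
```

Because every step in the post-norm rollout starts from an input of the same scale, each step looks like a fresh invocation of the same drafter, which matches the "recursive invocation" intuition above.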


What These Fixes Deliver

Compared with EAGLE 3, EAGLE 3.1 demonstrates better training-time to inference-time extrapolation, stronger long-context robustness, greater resilience to chat template and system prompt variation, and more stable acceptance length across diverse serving environments.

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3.
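For readers unfamiliar with the metric, acceptance length is commonly computed as the average number of tokens emitted per target verification step. The sketch below uses one common convention (accepted draft tokens plus the one token the target always contributes); the article does not state which convention its numbers use, and the per-step counts are hypothetical, not benchmark data.

```python
from typing import Sequence


def mean_acceptance_length(accepted_drafts: Sequence[int]) -> float:
    """Average tokens emitted per verification step.

    Each step emits the accepted draft tokens plus one token from the target
    (a correction on rejection, or a bonus token when all drafts are accepted).
    """
    if not accepted_drafts:
        raise ValueError("need at least one verification step")
    return sum(a + 1 for a in accepted_drafts) / len(accepted_drafts)


# Hypothetical accepted-draft counts per step with num_speculative_tokens = 3.
stable_run = [3, 2, 3, 1]
drifting_run = [1, 0, 1, 0]

print(mean_acceptance_length(stable_run))
print(mean_acceptance_length(drifting_run))
```

A drafter that drifts gets fewer drafts accepted per step, so its acceptance length falls toward 1, the value obtained with no speculation at all.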

Training Infrastructure: TorchSpec

TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms. By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.

Based on TorchSpec and vLLM, the research team also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on HuggingFace. The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model.

vLLM Integration: Config-Driven and Backward-Compatible

EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-norm hidden-state feedback, and removal of hardcoded assumptions around target hidden states.

Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models can be plugged in directly through the same speculative-decoding code path.

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only
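When the launch command is generated from a script, building the `--speculative-config` value with a JSON serializer avoids hand-quoting mistakes. The sketch below only assembles and prints the command using the flags and model names from the article; the `model` and `method` keys are the field names vLLM's speculative config expects, and nothing here has been run against a live server.

```python
import json
import shlex

# Draft-model settings passed to vLLM as a single JSON string.
speculative_config = {
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

command = [
    "vllm", "serve", "nvidia/Kimi-K2.6-NVFP4",
    "--trust-remote-code",
    "--tensor-parallel-size", "4",
    "--tool-call-parser", "kimi_k2",
    "--enable-auto-tool-choice",
    "--reasoning-parser", "kimi_k2",
    "--attention-backend", "tokenspeed_mla",
    # Compact separators keep the JSON free of spaces.
    "--speculative-config", json.dumps(speculative_config, separators=(",", ":")),
    "--language-model-only",
]

# shlex.join quotes the JSON argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The same `command` list can be passed to `subprocess.run` directly, which skips shell quoting altogether.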

Benchmark Results on Kimi K2.6

The research team benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disaggregated) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1. The speedup remains meaningful as concurrency scales: 1.71× at C=4 and 1.66× at C=16.

Marktechpost’s Visual Explainer

01 / 07

vLLM · May 26, 2026

Meet EAGLE 3.1


The EAGLE team, vLLM team, and TorchSpec team jointly released EAGLE 3.1, a targeted fix for speculative decoding instability in production LLM serving.

#speculative-decoding
#vLLM
#LLM inference
#efficiency

02 / 07

Background

What is Speculative Decoding?


A technique for speeding up LLM inference using two models working together.

  • A small, fast draft model proposes several tokens ahead
  • The large target model verifies all proposed tokens in one pass
  • Accepted tokens are kept; rejected tokens fall back gracefully
  • Result: higher output throughput with no change in output quality

03 / 07

The Problem

Attention Drift in EAGLE 3


EAGLE 3 performance degraded in real-world deployments under three conditions:

  • Different chat templates
  • Long-context inputs
  • Out-of-distribution system prompts

Root cause: attention drift. As speculation depth increases, the drafter shifts attention away from sink tokens toward its own generated tokens.

04 / 07

Root Cause

Two Underlying Issues

  • The fused input representation becomes increasingly imbalanced: higher-layer hidden states dominate the drafter input
  • Hidden-state magnitude grows across speculation steps because of the unnormalized residual path
  • Together, these make the drafter progressively less stable at deeper speculation depths

05 / 07

Architecture

Two Architectural Fixes

Fix 1
FC normalization applied after each target hidden state and before the FC layer. Keeps hidden-state magnitude bounded across decoding steps.
Fix 2
Post-norm hidden-state feedback: normalized hidden states are fed into the next decoding step, making the drafter behave like recursive invocation rather than appended layers.

06 / 07

Benchmarks · SPEED-Bench Coding · GB200 TP=4

Per-User Throughput vs. No-Spec Baseline

2.03× at concurrency 1
1.71× at concurrency 4
1.66× at concurrency 16

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3. Tested on Kimi-K2.6-NVFP4 with vLLM.

07 / 07

Deployment · vLLM v0.22.0

How to Deploy EAGLE 3.1


Backward-compatible with EAGLE 3 checkpoints. Already merged in vLLM main. Stable release: v0.22.0.

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config \
    '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla",
      "method":"eagle3",
      "num_speculative_tokens":3}' \
  --language-model-only


Key Takeaways

  • EAGLE 3.1 fixes attention drift, a newly identified instability in which the drafter loses focus on sink tokens at deeper speculation depths.
  • Two architectural changes, FC normalization and post-norm hidden-state feedback, stabilize the drafter across speculation steps.
  • In long-context workloads, EAGLE 3.1 delivers up to 2× longer acceptance length compared with EAGLE 3.
  • Benchmarks on Kimi-K2.6-NVFP4 show 2.03× per-user output throughput at concurrency 1, dropping to 1.66× at C=16.
  • EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM main, shipping in v0.22.0.



The post Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference appeared first on MarkTechPost.
