
DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts

DeepSeek-AI has released a preview of the DeepSeek-V4 series: two Mixture-of-Experts (MoE) language models built around a single core problem, making one-million-token context windows practical and affordable at inference time.

The series consists of DeepSeek-V4-Pro, with 1.6T total parameters and 49B activated per token, and DeepSeek-V4-Flash, with 284B total parameters and 13B activated per token. Both models natively support a context length of 1 million tokens. DeepSeek-V4-Pro was pre-trained on 33T tokens and DeepSeek-V4-Flash on 32T tokens. Model checkpoints for all four variants (DeepSeek-V4-Pro, DeepSeek-V4-Pro-Base, DeepSeek-V4-Flash, and DeepSeek-V4-Flash-Base) are publicly available on Hugging Face.

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

Architectural Challenges of Long Context

The vanilla attention mechanism in a standard Transformer has quadratic computational complexity with respect to sequence length: doubling the context roughly quadruples attention compute and memory. At a million tokens, this becomes prohibitive without architectural intervention. DeepSeek-V4 addresses this through four coordinated innovations: a hybrid attention architecture, a new residual connection design, a different optimizer, and FP4 quantization-aware training.
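
To make the scaling concrete, here is a quick back-of-envelope sketch contrasting the linear growth of the KV cache with the quadratic growth of the attention score matrix. The layer count, head configuration, and FP16 storage are illustrative assumptions, not DeepSeek-V4's published configuration:

```python
# Back-of-envelope illustration of why vanilla attention breaks down at 1M
# tokens. Layer count, heads, head_dim, and FP16 storage are illustrative
# assumptions, not DeepSeek-V4's actual configuration.

def attention_cost(seq_len, n_layers=61, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # KV cache grows linearly in seq_len: 2 tensors (K and V) per layer
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    # The attention score matrix per layer grows quadratically in seq_len
    score_elems_per_layer = seq_len ** 2
    return kv_cache_bytes, score_elems_per_layer

for n in (250_000, 500_000, 1_000_000):
    kv, scores = attention_cost(n)
    print(f"{n:>9} tokens: KV cache ~ {kv / 2**30:7.1f} GiB, "
          f"score matrix ~ {scores:.2e} elements/layer")
# Doubling seq_len doubles the KV cache but quadruples the score matrix.
```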


Hybrid Attention: CSA and HCA

The central architectural innovation is a hybrid mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), interleaved across Transformer layers.

CSA compresses the Key-Value (KV) cache of every m tokens into one entry using a learned token-level compressor, then applies DeepSeek Sparse Attention (DSA), in which each query token attends only to the top-k selected compressed KV entries. A component called the Lightning Indexer handles sparse selection by scoring queries against compressed KV blocks. Both CSA and HCA include a sliding-window attention branch covering the most recent n_win tokens for local dependency modeling.
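
The following toy, single-head sketch traces the CSA data flow under stated simplifications: mean pooling stands in for the learned token-level compressor, a raw dot product stands in for the Lightning Indexer's scoring, and m, top_k, and n_win take arbitrary illustrative values:

```python
import torch
import torch.nn.functional as F

def csa_sketch(q, k, v, m=16, top_k=8, n_win=128):
    """Toy single-head Compressed Sparse Attention.

    q: (d,) query for the current token; k, v: (T, d) cached keys/values.
    Mean pooling stands in for the learned token-level compressor, and a
    plain dot product stands in for the Lightning Indexer's scoring.
    """
    T, d = k.shape
    # 1) Compress every m tokens of the KV cache into one entry.
    n_blocks = T // m
    k_c = k[: n_blocks * m].view(n_blocks, m, d).mean(dim=1)   # (n_blocks, d)
    v_c = v[: n_blocks * m].view(n_blocks, m, d).mean(dim=1)
    # 2) Lightning-Indexer-style scoring: query vs. compressed KV blocks.
    scores = k_c @ q                                           # (n_blocks,)
    idx = scores.topk(min(top_k, n_blocks)).indices
    # 3) Sparse attention over only the selected compressed entries,
    #    plus a sliding window over the most recent n_win raw tokens.
    k_sel = torch.cat([k_c[idx], k[-n_win:]])
    v_sel = torch.cat([v_c[idx], v[-n_win:]])
    attn = F.softmax(k_sel @ q / d**0.5, dim=0)
    return attn @ v_sel

out = csa_sketch(torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64))
```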

HCA is more aggressive: it consolidates the KV entries of every m′ tokens, where m′ ≫ m, into a single compressed entry, then applies dense attention over these representations. No sparse selection step is required; the compression ratio itself reduces the KV cache size.
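
Under the same toy setup, HCA reduces to pooling followed by plain dense attention; again, mean pooling is only a stand-in for the learned compressor:

```python
import torch
import torch.nn.functional as F

def hca_sketch(q, k, v, m_prime=64):
    """Toy single-head Heavily Compressed Attention: no sparse selection.

    Every m_prime tokens (m_prime >> the m used by CSA) are pooled into one
    KV entry, and the query attends densely over the much shorter compressed
    sequence. Mean pooling again stands in for the learned compressor.
    """
    T, d = k.shape
    n_blocks = T // m_prime
    k_c = k[: n_blocks * m_prime].view(n_blocks, m_prime, d).mean(dim=1)
    v_c = v[: n_blocks * m_prime].view(n_blocks, m_prime, d).mean(dim=1)
    # Dense attention, but over T // m_prime entries instead of T.
    attn = F.softmax(k_c @ q / d**0.5, dim=0)
    return attn @ v_c

out = hca_sketch(torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64))
```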

The efficiency gains are substantial. In the one-million-token setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs (in equivalent FP8 FLOPs) and 10% of the KV cache size of DeepSeek-V3.2. DeepSeek-V4-Flash achieves 10% of single-token FLOPs and 7% of KV cache relative to DeepSeek-V3.2.

Manifold-Constrained Hyper-Connections (mHC)

DeepSeek-V4 replaces conventional residual connections with Manifold-Constrained Hyper-Connections (mHC). Hyper-Connections (HC) generalize residual connections by expanding the residual stream width by a factor of n_hc (set to 4 in both models), introducing learned input, residual, and output mapping matrices. Naive HC suffers from numerical instability when many layers are stacked.

mHC resolves this by constraining the residual mapping matrix B_l to the Birkhoff polytope, the manifold of doubly stochastic matrices in which all rows and columns sum to one and all entries are non-negative. This bounds the spectral norm of the mapping at 1, preventing signal amplification in both the forward pass and backpropagation. The constraint is enforced via the Sinkhorn-Knopp algorithm with t_max = 20 iterations. Mapping parameters are generated dynamically per input for expressivity.
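
A minimal sketch of the projection step, assuming the standard Sinkhorn-Knopp formulation of alternating row and column normalization (the paper's exact parameterization of B_l may differ):

```python
import torch

def sinkhorn_knopp(logits, t_max=20):
    """Project an (n, n) matrix of raw scores toward the Birkhoff polytope.

    Alternating row/column normalization of exp(logits) converges toward a
    doubly stochastic matrix (non-negative, rows and columns sum to 1), whose
    spectral norm is bounded by 1. t_max = 20 matches the reported iteration
    count; everything else is a minimal sketch.
    """
    B = logits.exp()                         # ensure all entries are positive
    for _ in range(t_max):
        B = B / B.sum(dim=1, keepdim=True)   # normalize rows
        B = B / B.sum(dim=0, keepdim=True)   # normalize columns
    return B

B = sinkhorn_knopp(torch.randn(4, 4))        # n_hc = 4 in both models
print(B.sum(dim=1), B.sum(dim=0))            # both ~ ones(4)
print(torch.linalg.matrix_norm(B, ord=2))    # spectral norm bounded at ~1
```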

Muon Optimizer and FP4 QAT

DeepSeek-V4 adopts the Muon optimizer for the majority of its parameters. Muon uses Newton-Schulz iterations to approximately orthogonalize the gradient update matrix before applying it as a weight update. The implementation uses a hybrid two-stage schedule: 8 iterations with coefficients (3.4445, −4.7750, 2.0315) for rapid convergence, then 2 stabilization iterations with coefficients (2, −1.5, 0.5). AdamW is retained for the embedding module, the prediction head, the static biases and gating factors of the mHC modules, and all RMSNorm weights.
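
A minimal sketch of the Newton-Schulz orthogonalization at the heart of Muon, using the two coefficient sets and iteration counts reported above; the surrounding optimizer logic (momentum, weight decay, distributed sharding) is omitted:

```python
import torch

def newton_schulz_orthogonalize(G, n_fast=8, n_stable=2):
    """Approximately orthogonalize a gradient matrix, Muon-style.

    Each iteration applies X <- a*X + b*(X X^T) X + c*(X X^T)^2 X. The
    two-stage schedule (8 fast iterations with one coefficient set, then 2
    stabilization iterations with another) follows the article; everything
    else is a minimal sketch, not DeepSeek's implementation.
    """
    coeffs = [(3.4445, -4.7750, 2.0315)] * n_fast + [(2.0, -1.5, 0.5)] * n_stable
    X = G / (G.norm() + 1e-7)          # scale so singular values are <= 1
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

Q = newton_schulz_orthogonalize(torch.randn(256, 128))
print((Q.T @ Q - torch.eye(128)).norm())   # near-zero => columns ~ orthonormal
```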

For deployment efficiency, FP4 (MXFP4) Quantization-Aware Training (QAT) is applied to the MoE expert weights and to the Query-Key (QK) path in the Lightning Indexer of CSA. During inference and RL rollouts, real FP4 weights are used directly rather than simulated quantization, reducing memory traffic and sampling latency.
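
The training-time half of this recipe can be sketched as blockwise fake quantization with a straight-through estimator. This is a simplified stand-in for MXFP4 (blocks of 32 elements sharing a power-of-two scale, FP4 E2M1 element grid), not DeepSeek's actual QAT code:

```python
import torch

def fake_quant_fp4_blockwise(w, block=32):
    """Simulated 4-bit quantization with a straight-through estimator.

    Simplified stand-in for MXFP4 QAT: weights are grouped into blocks of 32
    with one shared power-of-two scale per block (as in the MX format), and
    each element is snapped to the nearest FP4 (E2M1) representable value.
    The deployment path (true FP4 kernels at inference) is not shown.
    """
    grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
    flat = w.reshape(-1, block)          # assumes w.numel() is a multiple of block
    scale = (flat.abs().amax(dim=1, keepdim=True) / 6.0).clamp(min=1e-12)
    scale = 2.0 ** scale.log2().ceil()   # shared power-of-two scale per block
    x = flat / scale
    # Snap |x| to the nearest grid point, keeping the sign.
    idx = (x.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = (grid[idx] * x.sign() * scale).reshape(w.shape)
    # Straight-through estimator: forward uses q, gradient flows as identity.
    return w + (q - w).detach()

w_q = fake_quant_fp4_blockwise(torch.randn(64, 64, requires_grad=True))
```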

Training Stability at Scale

Training trillion-parameter MoE models introduced notable instabilities. Two techniques proved effective. Anticipatory Routing decouples the backbone and routing-network updates: routing indices at step t are computed using historical parameters θ_{t−Δt}, breaking the cycle in which routing decisions reinforce outlier values in MoE layers. SwiGLU Clamping constrains the linear component of SwiGLU to [−10, 10] and caps the gate component at an upper bound of 10, directly suppressing anomalous activations. Both techniques were applied throughout the training of both models.
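
SwiGLU Clamping is simple enough to show directly. The sketch below applies the reported bounds; the weight shapes and function layout are illustrative, not DeepSeek-V4's actual module:

```python
import torch
import torch.nn.functional as F

def swiglu_clamped(x, w_gate, w_lin, w_out):
    """SwiGLU with the clamping described in the article.

    The linear branch is constrained to [-10, 10] and the gate branch is
    capped at an upper bound of 10, suppressing the rare extreme activations
    that destabilize large-scale MoE training. Weight shapes and naming are
    illustrative assumptions.
    """
    gate = F.silu(x @ w_gate).clamp(max=10.0)      # cap gate branch at 10
    lin = (x @ w_lin).clamp(min=-10.0, max=10.0)   # clamp linear branch
    return (gate * lin) @ w_out

x = torch.randn(4, 512)
y = swiglu_clamped(x, torch.randn(512, 1024), torch.randn(512, 1024),
                   torch.randn(1024, 512))
```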

Post-Training: Specialist Experts and On-Policy Distillation

The post-training pipeline replaces the mixed RL stage of DeepSeek-V3.2 with On-Policy Distillation (OPD). Independent domain experts are first trained on mathematics, coding, agent tasks, and instruction following via Supervised Fine-Tuning (SFT) followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO). More than ten teacher models then distill a single unified student model by minimizing the reverse KL divergence between the student's and each teacher's output distributions on the student's own generated trajectories, using full-vocabulary logit distillation for stable gradient estimates.
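
The core OPD objective is compact: a full-vocabulary reverse KL between student and teacher distributions, evaluated on student-sampled trajectories. A minimal sketch, with teacher batching and trajectory sampling omitted:

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """Per-token reverse KL, KL(student || teacher), over the full vocabulary.

    In on-policy distillation the trajectories are sampled from the student
    itself, and this loss pulls the student's distribution toward a teacher's
    at every generated position. Minimal sketch; logits have shape
    (batch, seq, vocab), and looping over 10+ teachers is omitted.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    # KL(p_s || p_t) = sum_v p_s * (log p_s - log p_t), full vocabulary
    return (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()

loss = reverse_kl_distill_loss(torch.randn(2, 16, 1000), torch.randn(2, 16, 1000))
```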

The resulting model supports three reasoning-effort modes: Non-think (fast, no explicit chain-of-thought), Think High (deliberate reasoning), and Think Max (maximum reasoning effort, with a dedicated system prompt and reduced length penalties during RL training).

Benchmark Results

DeepSeek-V4-Pro-Max achieves a Codeforces rating of 3206, ahead of GPT-5.4-xHigh (3168) and Gemini-3.1-Pro-High (3052). On SimpleQA Verified, it scores 57.9 Pass@1, outperforming Claude Opus 4.6 Max (46.2) and GPT-5.4-xHigh (45.3), though trailing Gemini-3.1-Pro-High (75.6). On SWE-Verified, DeepSeek-V4-Pro-Max achieves 80.6% resolved, marginally behind Claude Opus 4.6 Max (80.8%), while Gemini-3.1-Pro-High also scores 80.6%.

On long-context benchmarks, DeepSeek-V4-Pro-Max scores 83.5 MMR on OpenAI MRCR 1M and 62.0 accuracy on CorpusQA 1M, surpassing Gemini-3.1-Pro-High (76.3 and 53.8, respectively) but trailing Claude Opus 4.6 Max (92.9 and 71.7) on both.

Key Takeaways

  • Hybrid CSA and HCA attention cuts the KV cache to 10% of DeepSeek-V3.2's at 1M tokens.
  • Manifold-Constrained Hyper-Connections (mHC) replace residual connections for more stable deep-layer training.
  • The Muon optimizer replaces AdamW for most parameters, delivering faster convergence and training stability.
  • Post-training uses On-Policy Distillation from 10+ domain experts instead of a conventional mixed RL stage.
  • DeepSeek-V4-Flash-Base outperforms DeepSeek-V3.2-Base despite having 3x fewer activated parameters.
