
MiniMax Sparse Attention (MSA): a Two-Branch Block-Sparse Attention Trained on a 109B-Parameter MoE With a 3T-Token Budget

MiniMax has released MSA (MiniMax Sparse Attention), a sparse attention method built directly on Grouped Query Attention (GQA). It targets one bottleneck: the quadratic cost of softmax attention at long context. The MiniMax research team tested it inside a 109B-parameter Mixture-of-Experts model trained with native multimodal data. They also open-sourced an inference kernel and shipped a production model, MiniMax-M3.

What is MSA (MiniMax Sparse Attention)

MSA (MiniMax Sparse Attention) factors attention into two stages: an Index Branch and a Main Branch. The Index Branch decides which key-value blocks each query should read. The Main Branch then runs exact softmax attention over only those blocks.

Selection happens at block granularity, not per token. The default block size is B_k = 128 tokens. Each query and GQA group keeps k = 16 blocks. That fixes the per-query budget at k × B_k = 16 × 128 = 2,048 key-value tokens.

The two cost structures differ. Dense GQA attention scales per query as O(N), the full context. MSA scales as O(k·B_k), which stays fixed as N grows. The compute gap therefore widens as context length increases.
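The gap between the two cost structures can be made concrete with a few lines of arithmetic. This is a minimal sketch that counts only the key-value tokens the Main Branch reads per query; it ignores the cost of the Index Branch's own scoring pass, and the function names are hypothetical.

```python
# Per-query key-value tokens read: dense GQA vs. MSA's fixed budget.
B_K = 128  # block size in tokens (default stated in the text)
K = 16     # blocks kept per query and GQA group (default stated in the text)


def dense_tokens(n: int) -> int:
    """Dense causal attention reads up to the full context of n tokens."""
    return n


def msa_tokens(n: int) -> int:
    """MSA reads at most K * B_K tokens, or the whole context if it is shorter."""
    return min(n, K * B_K)


for n in (2_048, 8_192, 32_768, 131_072):
    ratio = dense_tokens(n) / msa_tokens(n)
    print(f"N={n:>7}  dense={dense_tokens(n):>7}  msa={msa_tokens(n):>5}  ratio={ratio:.0f}x")
```

Under these assumptions the ratio is 1x at 2,048 tokens and 64x at 131,072 tokens, which is the sense in which the gap widens with context length.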

Selection is shared within each GQA group but independent across groups. One key-value head serves several query heads, and they share one block set. Different groups can attend to different long-range regions.

How the Two Branches Work

The Index Branch adds only two projection matrices to a standard GQA layer. It defines one index query head per GQA group and one shared index key head. It scores visible key tokens, then max-pools those scores to the block level.

A Top-k operator then selects the highest-scoring blocks per query and group. The local block containing the query is always included. This prevents the selector from dropping the query's immediate neighborhood.
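The selection step described above (score tokens, max-pool to blocks, take Top-k, force the local block) can be sketched in NumPy for a single query of a single GQA group. This is an illustrative sketch, not the released kernel: `select_blocks` and its arguments are hypothetical names, the index scores are taken as given rather than computed from projections, and forcing the local block by assigning it an infinite score is one possible way to guarantee its inclusion.

```python
import numpy as np


def select_blocks(index_scores, query_pos, block_size=128, k=16):
    """Return sorted block ids chosen for one query of one GQA group.

    index_scores: 1-D array of raw index scores, one per key token.
    query_pos:    position of the query; keys after it are not visible.
    """
    n = index_scores.shape[0]
    n_blocks = -(-n // block_size)  # ceiling division
    # Causal mask plus padding to whole blocks: hidden slots get -inf.
    padded = np.full(n_blocks * block_size, -np.inf)
    padded[: query_pos + 1] = index_scores[: query_pos + 1]
    # Max-pool token scores to one score per block.
    block_scores = padded.reshape(n_blocks, block_size).max(axis=1)
    # Force the local block (assumption: done by giving it the top score).
    local = query_pos // block_size
    block_scores[local] = np.inf
    # Only blocks up to the local one contain visible keys.
    n_visible = local + 1
    k_eff = min(k, n_visible)
    top = np.argpartition(-block_scores[:n_visible], k_eff - 1)[:k_eff]
    return np.sort(top)
```

In this sketch the local block occupies one of the k slots, so the per-query budget stays at k blocks; early queries that see fewer than k blocks simply keep all of them.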

The Main Branch gathers causally visible tokens from the selected blocks. It applies scaled dot-product softmax attention restricted to those tokens. Each query head keeps its own query projection but shares the group's block set.
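A minimal NumPy sketch of the Main Branch for one query head follows. It assumes the block ids have already been chosen and that the local block is among them, so at least one visible token always exists; `sparse_attention` and its parameter names are hypothetical. Every query head in a group would call it with its own `q` but the same `block_ids`.

```python
import numpy as np


def sparse_attention(q, keys, values, block_ids, query_pos, block_size=128):
    """Softmax attention for one query head, restricted to selected blocks.

    q: (d,) query vector; keys, values: (n, d) arrays for the group's KV head.
    block_ids: block indices shared by all query heads of the group.
    """
    n = keys.shape[0]
    # Gather token indices of the selected blocks.
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in block_ids
    ])
    # Keep only causally visible tokens.
    idx = idx[idx <= query_pos]
    # Scaled dot-product scores with a numerically stable softmax.
    scores = keys[idx] @ q / np.sqrt(q.shape[0])
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ values[idx]
```

When `block_ids` covers every visible block, the result matches dense causal attention, which makes a convenient sanity check for a sketch like this.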

A visualization in the report shows what the learned indexer selects. Heads concentrate on the local diagonal and the first block. They reserve the rest of the budget for a few long-range stripes.

https://arxiv.org/pdf/2606.13392v1

How MSA is Trained

Top-k selection is non-differentiable, so the language-modeling loss can't train the index projections. MSA solves this with a KL alignment loss. The loss matches the Index Branch distribution to the Main Branch attention pattern. The teacher is the group-averaged Main Branch distribution over the selected tokens.
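The shape of such an alignment loss can be sketched in NumPy for one query and one GQA group. Several details here are assumptions rather than statements from the report: the KL direction (teacher to student), the use of a plain softmax over the selected tokens for the index distribution, and the `eps` smoothing. NumPy has no autograd, so treating the teacher as a constant is only noted in a comment. The function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def kl_alignment_loss(main_logits, index_logits, eps=1e-9):
    """KL(teacher || student) over the selected tokens of one query.

    main_logits:  (G, T) Main Branch logits for the G query heads of a group.
    index_logits: (T,)   Index Branch scores for the same T selected tokens.
    """
    # Teacher: Main Branch attention averaged over the group's query heads.
    # In a real training setup this would be treated as a constant target.
    teacher = softmax(main_logits, axis=-1).mean(axis=0)
    # Student: Index Branch distribution over the same tokens (assumption).
    student = softmax(index_logits)
    return float(np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps))))
```

The loss is zero when the index distribution already equals the group-averaged attention pattern and positive otherwise, so minimizing it pulls the indexer's ranking toward the tokens the Main Branch actually attends to.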

Three mechanisms stabilize sparse training. Gradient Detach applies stop-gradient to the Index Branch input. This confines the KL loss to the index projections, not the backbone. Without it, larger KL coefficients caused gradient spikes and loss divergence.

Indexer Warmup runs full attention in both branches for the first iterations. The indexer learns from the KL loss before it controls routing. The forced Local Block reserves one slot for nearby context.

Ablations shaped the final recipe. An early variant added an Index Branch value head with its own output. Once warmup is used, that value head is no longer necessary. The final design drops it on efficiency grounds.

MSA supports two training routes. MSA-PT trains from scratch after a 40B-token indexer warmup. MSA-CPT converts a dense GQA checkpoint trained on 2.6T tokens. It then continues for 400B tokens, including 40B tokens of warmup.

The Kernel Co-Design

Theoretical sparsity doesn't become speed without a matching GPU path. MSA pairs the algorithm with two kernel ideas.

The first is exp-free Top-k selection. Softmax preserves order, so ranking raw scores yields identical indices. The kernel skips the max, exp, and sum steps before selection. At 128K context with k = 16, it ran 5.1× faster than torch.topk. It also beat the TileLang radix-select kernel by 3.7×.
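The order-preservation argument is easy to check numerically. The sketch below uses random scores as a stand-in for real block scores (an assumption) and picks 1,024 blocks because 128K tokens divided by a 128-token block size gives 1,024; the helper names are hypothetical. It compares the Top-k set taken from raw scores with the one taken after softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1024)  # stand-in raw block scores for one query
k = 16


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def topk_indices(x, k):
    """Unordered set of the k largest entries' indices."""
    return set(np.argpartition(-x, k - 1)[:k].tolist())


# Softmax is strictly increasing in each score, so the Top-k set is unchanged.
same = topk_indices(scores, k) == topk_indices(softmax(scores), k)
print(same)
```

Because the two sets agree, a selection kernel can rank the raw scores directly and avoid the max, exp, and sum passes entirely.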

The second is KV-outer sparse attention with query gather. Iterating over KV blocks raises arithmetic intensity versus iterating over queries. The kernel packs ⌈128/G⌉ query positions into one 128×128 score MMA. A two-phase forward splits the attention and combine steps across CTAs.

The open-source kernel, fmha_sm100, targets NVIDIA SM100 GPUs. It ships dense FlashAttention plus sparse Top-k kernels under an MIT license. It supports BF16, FP8, NVFP4, and FP4 precision.

How MSA Compares To Other Sparse Methods

The research team positions MSA against four natively trained sparse designs.

The table below summarizes the differences it describes.

| Method | Backbone | Selection granularity | Indexer / selection signal |
|---|---|---|---|
| MSA | GQA | Block-level (B_k = 128), per-GQA-group Top-k | KL alignment loss |
| NSA | MQA / MHA | Compressed + selected blocks + sliding window | Native (end-to-end) training |
| InfLLM-V2 | Dense↔sparse switchable | Parameter-free block selection + sliding window | Parameter-free (no trained indexer) |
| MoBA | GQA | Very large KV blocks (block-averaged keys) | LM gradient only |
| DSA | MLA (MQA mode) | Token-level; single Top-k shared across heads | ReLU lightning indexer |

MSA's distinguishing pair is per-GQA-group Top-k sharing combined with block-level selection. This keeps KV reads contiguous while giving each group its own retrieval.

The quality side holds up. Both sparse models stay broadly competitive with the Full-Attention baseline.

The table below shows representative results under the 3T-token budget.

| Benchmark | Full | MSA-PT | MSA-CPT |
|---|---|---|---|
| MMLU | 67.0 | 67.2 | 66.8 |
| GSM8K | 76.2 | 77.7 | 73.7 |
| HumanEval | 61.0 | 64.0 | 57.9 |
| RULER-8K | 79.8 | 84.2 | 77.2 |
| RULER-32K | 75.0 | 77.5 | 75.7 |
| VideoMME | 41.11 | 45.48 | 39.65 |

After long-context extension, MSA-CPT stayed close to Full on HELMET-128K and RULER-128K. Each query still attends to only 2,048 key-value tokens.
