MiniMax Sparse Attention (MSA): a Two-Branch Block-Sparse Attention Trained on a 109B-Parameter MoE With a 3T-Token Budget
MiniMax released MSA (MiniMax Sparse Attention), a sparse attention method built directly on Grouped Query Attention (GQA). It targets one bottleneck: the quadratic cost of softmax attention at long context. The MiniMax research team tested it inside a 109B-parameter Mixture-of-Experts model trained with native multimodal data. They also open-sourced an inference kernel and shipped a production model, MiniMax-M3.
What is MSA (MiniMax Sparse Attention)
MSA (MiniMax Sparse Attention) factors attention into two stages: an Index Branch and a Main Branch. The Index Branch decides which key-value blocks each query should read. The Main Branch then runs exact softmax attention over only those blocks.
Selection happens at block granularity, not per token. The default block size is B_k = 128 tokens. Each query and GQA group keeps k = 16 blocks. That fixes the per-query budget at k·B_k = 2,048 key-value tokens.
The two cost structures differ. Dense GQA attention scales per query as O(N), the full context. MSA scales as O(k·B_k), which stays fixed as N grows. The compute gap therefore widens as context length increases.
Selection is shared within each GQA group but independent across groups. One key-value head serves several query heads, and they share one block set. Different groups can attend to different long-range regions.
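The fixed per-query budget described above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and only B_k = 128 and k = 16 come from the text. It counts Main Branch key-value reads for the last query only and ignores the Index Branch, which still scores visible keys, so the ratio is not the reported end-to-end compute reduction.

```python
# Hypothetical helper names; only B_k = 128 and k = 16 come from the article.
BLOCK_SIZE = 128  # B_k
TOP_K = 16        # k


def dense_kv_tokens(context_len: int) -> int:
    """Key-value tokens the last query reads under dense causal attention."""
    return context_len


def msa_kv_tokens(context_len: int, block_size: int = BLOCK_SIZE,
                  top_k: int = TOP_K) -> int:
    """Key-value tokens the last query reads in the Main Branch (at most k * B_k)."""
    return min(context_len, top_k * block_size)


for n in (8_192, 131_072, 1_048_576):
    ratio = dense_kv_tokens(n) / msa_kv_tokens(n)
    print(f"N={n:>9,}  dense={dense_kv_tokens(n):>9,}  "
          f"msa={msa_kv_tokens(n):,}  ratio={ratio:.0f}x")
```

The Main Branch read count stays at 2,048 once the context exceeds 2,048 tokens, so the ratio grows linearly with N.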
How the Two Branches Work
The Index Branch adds only two projection matrices to a standard GQA layer. It defines one index query head per GQA group and one shared index key head. It scores visible key tokens, then max-pools those scores to the block level.
A Top-k operator then selects the highest-scoring blocks per query and group. The local block containing the query is always included. This prevents the selector from dropping the query's immediate neighborhood.
The Main Branch gathers causally visible tokens from the selected blocks. It applies scaled dot-product softmax attention restricted to those tokens. Each query head keeps its own query projection but shares the group's block set.
A visualization in the report shows what the learned indexer selects. Heads concentrate on the local diagonal and the first block. They reserve the rest of the budget for a few long-range stripes.
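The two branches described above can be sketched for a single query position and a single GQA group. This is a toy reference in plain Python, not the released kernel: the function names, the tie-breaking rule, and the toy data are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def select_blocks(index_q, index_keys, query_pos, block_size, top_k):
    """Index Branch: score visible keys, max-pool per block, keep top_k blocks."""
    local_block = query_pos // block_size
    block_scores = {}
    for t in range(query_pos + 1):  # causal: only keys 0..query_pos are visible
        b = t // block_size
        score = dot(index_q, index_keys[t])
        block_scores[b] = max(block_scores.get(b, -math.inf), score)
    # The local block always takes one slot; the rest go to the best-scoring blocks.
    others = sorted((b for b in block_scores if b != local_block),
                    key=lambda b: block_scores[b], reverse=True)
    return sorted([local_block] + others[: top_k - 1])


def main_branch(q, keys, values, blocks, query_pos, block_size):
    """Main Branch: exact softmax attention over visible tokens in the chosen blocks."""
    tokens = [t for b in blocks
              for t in range(b * block_size,
                             min((b + 1) * block_size, query_pos + 1))]
    scale = 1.0 / math.sqrt(len(q))
    probs = softmax([scale * dot(q, keys[t]) for t in tokens])
    return [sum(p * values[t][d] for p, t in zip(probs, tokens))
            for d in range(len(values[0]))]


# Toy data: 8 tokens, block size 2, keep 2 of 4 blocks.
keys = [[float(t % 3), 1.0] for t in range(8)]
values = [[float(t), 0.0] for t in range(8)]
index_keys = [[1.0 if t == 1 else 0.0] for t in range(8)]  # token 1 scores highest
blocks = select_blocks([1.0], index_keys, query_pos=7, block_size=2, top_k=2)
out = main_branch([0.5, 0.5], keys, values, blocks, query_pos=7, block_size=2)
print(blocks)  # [0, 3]
```

In a full layer, every query head in the group would call `main_branch` with its own `q` but the same `blocks`, which is the per-group sharing the text describes.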


How MSA is Trained
Top-k selection is non-differentiable, so the language-modeling loss cannot train the index projections. MSA solves this with a KL alignment loss. The loss matches the Index Branch distribution to the Main Branch attention pattern. The teacher is the group-averaged Main Branch distribution over the selected tokens.
Three mechanisms stabilize sparse training. Gradient Detach applies stop-gradient to the Index Branch input. This confines the KL loss to the index projections, not the backbone. Without it, larger KL coefficients caused gradient spikes and loss divergence.
Indexer Warmup runs full attention in both branches for the first iterations. The indexer learns from the KL loss before it controls routing. The forced Local Block reserves one slot for nearby context.
Ablations shaped the final recipe. An early variant added an Index Branch value head with its own output. Once warmup is used, that value head is no longer necessary. The final design drops it on efficiency grounds.
MSA supports two training routes. MSA-PT trains from scratch after a 40B-token indexer warmup. MSA-CPT converts a dense GQA checkpoint trained on 2.6T tokens. It then continues for 400B tokens, including 40B tokens of warmup.
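The alignment loss described above can be sketched for one query and one GQA group. This is a sketch under stated assumptions, not the report's exact formula: the KL direction (teacher to student), the averaging of per-head probabilities rather than logits, and the function name are all assumptions. The article specifies only that the teacher is the group-averaged Main Branch distribution over the selected tokens.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def kl_alignment_loss(index_scores, head_scores):
    """KL(teacher || student) over the selected tokens (assumed direction).

    index_scores: Index Branch logits for the selected tokens.
    head_scores:  one list of Main Branch logits per query head in the group.
    """
    student = softmax(index_scores)
    head_probs = [softmax(h) for h in head_scores]
    n_heads = len(head_probs)
    # Teacher: Main Branch probabilities averaged over the heads of the group.
    teacher = [sum(p[i] for p in head_probs) / n_heads
               for i in range(len(student))]
    return sum(t * (math.log(t) - math.log(s))
               for t, s in zip(teacher, student) if t > 0.0)


# The loss vanishes when the indexer already reproduces the attention pattern.
print(kl_alignment_loss([1.0, 2.0, 3.0], [[1.0, 2.0, 3.0]]))
# A uniform indexer is penalized when the heads prefer specific tokens.
print(kl_alignment_loss([0.0, 0.0, 0.0], [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]))
```

In a real training loop the teacher would be treated as a constant target, so that only the index projections receive this gradient. Plain Python has no autograd, so that part is not shown.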
The Kernel Co-Design
Theoretical sparsity does not become speed without a matching GPU path. MSA pairs the algorithm with two kernel ideas.
The first is exp-free Top-k selection. Softmax preserves order, so ranking raw scores yields identical indices. The kernel skips the max, exp, and sum steps before selection. At 128K context with k = 16, it ran 5.1× faster than torch.topk. It also beat the TileLang radix-select kernel by 3.7×.
The second is KV-outer sparse attention with query gather. Iterating over KV blocks raises arithmetic intensity versus iterating over queries. The kernel packs ⌈128/G⌉ query positions into one 128×128 score MMA. A two-phase forward splits the attention and combine steps across CTAs.
The open-source kernel, fmha_sm100, targets NVIDIA SM100 GPUs. It ships dense FlashAttention plus sparse Top-k kernels under an MIT license. It supports BF16, FP8, NVFP4, and FP4 precision.
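The order-preservation argument behind exp-free selection can be checked directly. The sketch below is plain Python rather than a GPU kernel; `topk_indices` is a hypothetical helper that breaks ties by lower index.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def topk_indices(scores, k):
    """Indices of the k largest scores, ties broken by lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


raw = [0.3, 2.5, -1.0, 2.4, 0.0, 1.7]

# exp is strictly increasing and the softmax denominator is shared by all
# entries, so ranking the raw scores gives the same indices as ranking the
# normalized probabilities.
from_raw = topk_indices(raw, 3)
from_probs = topk_indices(softmax(raw), 3)
print(from_raw, from_probs)  # [1, 3, 5] [1, 3, 5]
```

Ranking raw scores also sidesteps a floating-point edge case: softmax can round very negative but distinct scores to the same value, while the raw scores remain distinguishable.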
How MSA Compares To Other Sparse Methods
The research team positions MSA against four natively trained sparse designs.
The table below summarizes the differences it describes.
| Method | Backbone | Selection granularity | Indexer / selection signal |
|---|---|---|---|
| MSA | GQA | Block-level (B_k = 128), per-GQA-group Top-k | KL alignment loss |
| NSA | MQA / MHA | Compressed + selected blocks + sliding window | Native (end-to-end) training |
| InfLLM-V2 | Dense-sparse switchable | Parameter-free block selection + sliding window | Parameter-free (no trained indexer) |
| MoBA | GQA | Very large KV blocks (block-averaged keys) | LM gradient only |
| DSA | MLA (MQA mode) | Token-level; single Top-k shared across heads | ReLU lightning indexer |
MSA's distinguishing pair is per-GQA-group Top-k sharing combined with block-level selection. This keeps KV reads contiguous while giving each group its own retrieval.
The quality side holds up. Both sparse models stay broadly competitive with the Full-Attention baseline.
The table below shows representative results under the 3T-token budget.
| Benchmark | Full | MSA-PT | MSA-CPT |
|---|---|---|---|
| MMLU | 67.0 | 67.2 | 66.8 |
| GSM8K | 76.2 | 77.7 | 73.7 |
| HumanEval | 61.0 | 64.0 | 57.9 |
| RULER-8K | 79.8 | 84.2 | 77.2 |
| RULER-32K | 75.0 | 77.5 | 75.7 |
| VideoMME | 41.11 | 45.48 | 39.65 |
After long-context extension, MSA-CPT stayed close to Full on HELMET-128K and RULER-128K. Each query still attends to only 2,048 key-value tokens.
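To make "broadly competitive" concrete, the benchmark table above can be turned into per-benchmark deltas against the Full-Attention baseline. The scores are copied from that table; the dictionary layout and function name are illustrative only.

```python
# Scores copied from the benchmark table: (Full, MSA-PT, MSA-CPT).
SCORES = {
    "MMLU": (67.0, 67.2, 66.8),
    "GSM8K": (76.2, 77.7, 73.7),
    "HumanEval": (61.0, 64.0, 57.9),
    "RULER-8K": (79.8, 84.2, 77.2),
    "RULER-32K": (75.0, 77.5, 75.7),
    "VideoMME": (41.11, 45.48, 39.65),
}


def deltas_vs_full(scores):
    """Return {benchmark: (MSA-PT minus Full, MSA-CPT minus Full)}, 2 decimals."""
    return {name: (round(pt - full, 2), round(cpt - full, 2))
            for name, (full, pt, cpt) in scores.items()}


DELTAS = deltas_vs_full(SCORES)
for name, (d_pt, d_cpt) in DELTAS.items():
    print(f"{name:<10} MSA-PT {d_pt:+.2f}   MSA-CPT {d_cpt:+.2f}")
```

On these six rows, MSA-PT is above Full on every benchmark, while MSA-CPT is below Full on five of six, with RULER-32K the exception.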
Use Cases With Examples
MSA targets workloads where context length is the binding deployment constraint.
- Long-horizon agents: An agent that spans hundreds of reasoning and action steps accumulates a large transcript. Dense attention over that history grows quadratically. MSA holds the per-query budget at 2,048 tokens regardless of length.
- Repository-scale code reasoning: A coding agent loading a full repository can exceed hundreds of thousands of tokens. The indexer routes each query to the few relevant blocks. Irrelevant files stay outside the selected set.
- Persistent memory: A long-running assistant keeps growing conversational state. MSA reads a fixed-size slice of the most relevant blocks per query. The decoding cost stays roughly flat as memory grows.
- Long video understanding: The model is natively multimodal and trained on image and video data. MSA-PT scored highest of the three runs on several video benchmarks, including VideoMME and TemporalBench. Sparse selection scales to long visual token sequences.
Running the Kernel
The fastest path uses the Hugging Face kernels library.
# pip install -U kernels
from kernels import get_kernel
kernel_module = get_kernel("MiniMaxAI/msa", version=0)
sparse_atten_func = kernel_module.sparse_atten_func
sparse_atten_func(...)
The repository also shows the planner, indexer, and attention call directly.
import torch
from fmha_sm100 import fmha_sm100, fmha_sm100_plan, sparse_topk_select
# qo_lens, kv_lens, kv_indices, num_pages, and the q/k/v page tensors
# are prepared by the caller.
page_size, topk = 128, 16
# Dense proxy pass: per-block max score from a cheap Q slice.
proxy_plan = fmha_sm100_plan(
    qo_lens, kv_lens, proxy_q.shape[1],
    num_kv_heads=1, page_size=page_size, output_maxscore=True,
)
_, max_score = fmha_sm100(
    proxy_q, proxy_k_pages, proxy_v_pages, proxy_plan,
    kv_indices=kv_indices, output_o=False, output_maxscore=True,
)
# Block scores -> selected KV block indexes.
kv_block_indexes = sparse_topk_select(
    max_score.contiguous(), topk, num_valid_pages=num_pages,
)
# Sparse attention over the selected blocks.
sparse_plan = fmha_sm100_plan(
    qo_lens, kv_lens, q.shape[1],
    num_kv_heads=k_pages.shape[1], page_size=page_size, kv_block_num=topk,
)
out, _ = fmha_sm100(
    q, k_pages, v_pages, sparse_plan,
    kv_indices=kv_indices, kv_block_indexes=kv_block_indexes,
)
These are the repository's official usage examples. The inputs are paged key-value tensors that the caller prepares. The first run JIT-compiles the indexer, which can take a few minutes. Requirements are an SM100 GPU, CUDA Toolkit, and Python 3.10 or higher.
Strengths and Weaknesses
Strengths
- Per-token attention compute drops 28.4× at 1M context in the reported setting.
- Measured wall-clock speedups reach 14.2× for prefill and 7.6× for decoding at 1M on H800.
- The design adds only two projection matrices to a standard GQA layer.
- It supports both from-scratch training and conversion from dense checkpoints.
- The inference kernel is released under an MIT license.
Weaknesses and open questions
- The released kernel targets NVIDIA SM100; other architectures need separate work.
- A residual long-context retrieval gap remains versus full attention on some subtasks.
- Reported speedups assume a specific head configuration and the H800 setup.
- The KL loss adds training-time complexity over a plain dense layer.
- Results come from MiniMax's own evaluation suite, not third-party reproduction.
The post MiniMax Sparse Attention (MSA): a Two-Branch Block-Sparse Attention Trained on a 109B-Parameter MoE With a 3T-Token Budget appeared first on MarkTechPost.
