Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory-Efficient Version of MiniMax-M2 for Long Context Coding Agents
Cerebras has released MiniMax-M2-REAP-162B-A10B, a compressed Sparse Mixture-of-Experts (SMoE) causal language model derived from MiniMax-M2 using the new Router-weighted Expert Activation Pruning (REAP) technique. The model retains the behavior of the original 230B-total, 10B-active MiniMax-M2 while pruning experts and reducing memory for deployment-focused workloads such as coding agents and tool calling.
Architecture and core specs
MiniMax-M2-REAP-162B-A10B has these key properties:
- Base model: MiniMax-M2
- Compression method: REAP, Router-weighted Expert Activation Pruning
- Total parameters: 162B
- Active parameters per token: 10B
- Layers: 62 transformer blocks
- Attention heads per layer: 48
- Experts: 180 experts, obtained by pruning a 256-expert configuration
- Activated experts per token: 8
- Context length: 196,608 tokens
- License: modified MIT, derived from MiniMaxAI MiniMax-M2
The SMoE design means that the model stores 162B parameters, but each token routes through only a small set of experts, so the effective compute cost per token is similar to that of a 10B dense model. MiniMax-M2 itself is positioned as an MoE model built for coding and agentic workflows, with 230B total parameters and 10B active, which this checkpoint inherits.
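To make the compute argument concrete, here is a minimal, self-contained sketch of top-k expert routing in a sparse MoE layer. This is an illustration under stated assumptions, not the MiniMax-M2 implementation: the expert count (180) and activated experts per token (8) mirror this checkpoint, while the hidden size and all weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 180   # experts remaining after REAP pruning
TOP_K = 8           # experts activated per token
D_MODEL = 64        # toy hidden size for the sketch

def moe_layer(x, router_w, expert_ws):
    """Route a single token vector x through its top-k experts."""
    logits = router_w @ x                 # one router logit per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the k highest-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # renormalized gate weights over the top-k
    # Only TOP_K expert matmuls actually run, so compute per token
    # scales with TOP_K, not with NUM_EXPERTS.
    return sum(g * (expert_ws[e] @ x) for g, e in zip(gates, top))

x = rng.standard_normal(D_MODEL)
router_w = rng.standard_normal((NUM_EXPERTS, D_MODEL))
expert_ws = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_MODEL))
y = moe_layer(x, router_w, expert_ws)
print(y.shape)  # (64,)
```

All 180 expert weight matrices must sit in memory, but each token pays for only 8 of them, which is why the active-parameter count, not the total, drives per-token compute.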
How REAP compresses MiniMax-M2
MiniMax-M2-REAP-162B-A10B is created by applying REAP uniformly across all MoE blocks of MiniMax-M2, at a 30% expert pruning rate.
The REAP method defines a saliency score for each expert that combines:
- Router gate values: how often and how strongly the router selects that expert
- Expert activation norms: the magnitude of the expert's output when it is active
Experts that contribute minimally to the layer output under this combined criterion are removed. The remaining experts keep their original weights, and the router retains separate gates for each of them. This is one-shot compression; the method involves no additional fine-tuning after pruning.
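The pruning step can be sketched as follows. This is a hedged illustration of the criterion as summarized above, not the reference REAP implementation: the variable names, the gate-times-norm product, and the averaging over a calibration batch are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_EXPERTS = 256        # a MiniMax-M2 MoE layer before pruning
PRUNE_FRACTION = 0.30    # 30% expert pruning rate
num_tokens = 1000        # tokens in a hypothetical calibration set

# Per-layer statistics a real run would collect on calibration data:
gate_vals = rng.random((num_tokens, NUM_EXPERTS))  # router gate per token, per expert
act_norms = rng.random((num_tokens, NUM_EXPERTS))  # ||expert output|| when routed

# Saliency: expected gate-weighted activation magnitude for each expert.
saliency = (gate_vals * act_norms).mean(axis=0)

# One-shot pruning: keep the highest-saliency experts, drop the rest.
num_keep = NUM_EXPERTS - int(PRUNE_FRACTION * NUM_EXPERTS)
keep = np.sort(np.argsort(saliency)[-num_keep:])   # surviving expert indices
print(len(keep))  # 180
```

Surviving experts keep their original weights and their own router gates; only the pruned experts' rows disappear, which is why no post-pruning fine-tuning is required in the method definition.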
A core theoretical result in the REAP research paper is that expert merging with summed gates causes functional subspace collapse. When experts are merged, the router loses its independent, input-dependent control over them, so a single merged expert must approximate an input-dependent mixture that was originally expressed through multiple experts. The research team proves that, whenever the router policy depends on the input and the experts are not identical, this introduces irreducible error. In contrast, pruning removes some experts but preserves independent control of the survivors, so the error scales with the gate weight of the removed experts.
Across a set of SMoE models in the 20B to 1T parameter range, REAP consistently outperforms expert merging and other pruning criteria on generative benchmarks such as code generation, mathematical reasoning, and tool calling, especially at 50% compression.
Accuracy under 30% expert pruning
The evaluation compares three checkpoints on standard coding, reasoning, and agentic benchmarks:
- MiniMax-M2 (230B, base model)
- MiniMax-M2-REAP-172B-A10B, 25% pruning
- MiniMax-M2-REAP-162B-A10B, 30% pruning

On coding benchmarks such as HumanEval, HumanEval+, MBPP, and MBPP+, the 162B REAP model stays very close to the base model. HumanEval scores sit in the 90% range and MBPP in the 80% range, with the 172B and 162B models essentially tracking the original MiniMax-M2 within a few points.
On reasoning benchmarks such as AIME25 and MATH-500, there are small shifts between the three models, but there is no collapse at 30% pruning, and the 162B checkpoint remains competitive with the base model.
On tool-calling and agentic evaluation, represented by τ²-Bench in a telecom setting, the 162B REAP model again matches the base model within small variance. The model card explicitly states that this checkpoint retains almost identical performance while being about 30% lighter in parameter count.
These results line up with the broader REAP study, which reports near-lossless compression for code generation and tool calling on several large SMoE architectures when experts are pruned using the REAP criterion.
Deployment, memory usage and observed throughput
Cerebras provides a direct vLLM serve example and positions MiniMax-M2-REAP-162B-A10B as a drop-in model for the existing MiniMax-M2 integration.
```shell
vllm serve cerebras/MiniMax-M2-REAP-162B-A10B \
    --tensor-parallel-size 8 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --enable_expert_parallel \
    --enable-auto-tool-choice
```
If the run hits memory limits, the model card recommends lowering --max-num-seqs, for example to 64, to keep the batch size in check on a given GPU.
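Applied to the serve command above, that suggestion might look like the following. The value 64 comes from the model card; everything else simply repeats the published command.

```shell
vllm serve cerebras/MiniMax-M2-REAP-162B-A10B \
    --tensor-parallel-size 8 \
    --max-num-seqs 64 \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --trust-remote-code \
    --enable_expert_parallel \
    --enable-auto-tool-choice
```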
Key Takeaways
- SMoE architecture with efficient compute: MiniMax-M2-REAP-162B-A10B is a Sparse Mixture-of-Experts model with 162B total parameters and 10B active parameters per token, so the compute cost per token is close to that of a 10B dense model while retaining frontier-scale capacity.
- REAP expert pruning retains the behavior of MiniMax-M2: The model is produced by applying REAP, Router-weighted Expert Activation Pruning, to MiniMax-M2 at roughly 30% expert pruning, removing experts based on router gate values and expert activation norms while leaving surviving experts and the router structure intact.
- Near-lossless accuracy at 30% compression: On coding benchmarks such as HumanEval and MBPP, and on reasoning benchmarks such as AIME25 and MATH-500, the 162B REAP variant tracks the 230B MiniMax-M2 and a 172B REAP variant within a few points, showing near-lossless compression for code, reasoning, and tool use.
- Pruning outperforms expert merging for generative SMoE: The REAP study shows that pruning experts using a saliency criterion avoids the functional subspace collapse seen with expert merging on generative tasks, and performs better across large SMoE models in the 22B to roughly 1T parameter range.
Comparison Table

Editorial Comments
Cerebras' release of MiniMax-M2-REAP-162B-A10B is a strong signal that Router-weighted Expert Activation Pruning is ready for real workloads, not just a research curiosity. The checkpoint shows that a 30% expert pruning schedule can keep MiniMax-M2 230B-A10B behavior almost intact while cutting memory and preserving long-context coding, reasoning, and tool-calling performance, which is exactly what SMoE researchers need for practical deployment. Overall, Cerebras is quietly turning expert pruning into production infrastructure for frontier-class SMoE models.
Check out the Model Weights.
The post Cerebras Releases MiniMax-M2-REAP-162B-A10B: A Memory-Efficient Version of MiniMax-M2 for Long Context Coding Agents appeared first on MarkTechPost.
