
MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Advanced AI Reasoning that Outperforms 20x Larger Reasoning Models

A team of researchers from MBZUAI’s Institute of Foundation Models and G42 has released K2 Think, a 32B-parameter open reasoning system for advanced AI reasoning. It pairs long chain-of-thought supervised fine-tuning with reinforcement learning from verifiable rewards, agentic planning, test-time scaling, and inference optimizations (speculative decoding + wafer-scale hardware). The result is frontier-level math performance at a markedly lower parameter count, competitive results on code and science, and a transparent, fully open release spanning weights, data, and code.

System overview

K2 Think is built by post-training an open-weight Qwen2.5-32B base model and adding a lightweight test-time compute scaffold. The design emphasizes parameter efficiency: a 32B backbone is deliberately chosen to enable fast iteration and deployment while leaving headroom for post-training gains. The core recipe combines six “pillars”: (1) long chain-of-thought (CoT) supervised fine-tuning; (2) Reinforcement Learning with Verifiable Rewards (RLVR); (3) agentic planning before solving; (4) test-time scaling via best-of-N selection with verifiers; (5) speculative decoding; and (6) inference on a wafer-scale engine.

The goals are straightforward: raise pass@1 on competition-grade math benchmarks, maintain strong code/science performance, and keep response length and wall-clock latency under control via plan-before-you-think prompting and hardware-aware inference.

Pillar 1: Long CoT SFT

Phase-1 SFT uses curated long chain-of-thought traces and instruction/response pairs spanning math, code, science, instruction following, and general chat (AM-Thinking-v1-Distilled). The effect is to teach the base model to externalize intermediate reasoning and adopt a structured output format. Rapid pass@1 gains appear early (≈0.5 epoch), with AIME’24 stabilizing around ~79% and AIME’25 around ~72% at the SFT checkpoint before RL, indicating convergence.
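
To make the idea of a structured output format concrete, here is a minimal, hypothetical sketch of how one long-CoT SFT example could be serialized. The <think>/<answer> tags and field names are illustrative assumptions, not the exact template from the report; the report only confirms a structured reasoning format and an </answer>-style stop marker at evaluation time.

```python
# Hypothetical serialization of one long-CoT SFT example (illustrative only).
# The <think>/<answer> tags and field names are assumptions; the report only
# confirms a structured reasoning format and an </answer>-style stop marker.

def format_cot_example(question: str, reasoning: str, final_answer: str) -> dict:
    """Serialize an instruction/response pair that externalizes reasoning."""
    prompt = f"Solve the following problem step by step.\n\n{question}"
    response = (
        f"<think>\n{reasoning}\n</think>\n"
        f"<answer>\n{final_answer}\n</answer>"
    )
    return {"prompt": prompt, "response": response}

example = format_cot_example(
    question="What is the sum of the first 10 positive integers?",
    reasoning="The sum of 1..n is n(n+1)/2, so for n=10 it is 10*11/2 = 55.",
    final_answer="55",
)
print(example["response"])
```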

Pillar 2: RL with Verifiable Rewards

K2 Think then trains with RLVR on Guru, a ~92k-prompt, six-domain dataset (Math, Code, Science, Logic, Simulation, Tabular) designed for verifiable end-to-end correctness. The implementation uses the verl library with a GRPO-style policy-gradient algorithm. A notable observation: starting RL from a strong SFT checkpoint yields modest absolute gains and can plateau or degenerate, whereas applying the same RL recipe directly to the base model shows large relative improvements (e.g., ~40% on AIME’24 over the course of training), supporting a trade-off between SFT strength and RL headroom.
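
The core ingredient of RLVR is a programmatic reward that checks end-to-end correctness. The sketch below is a minimal, hypothetical math reward of the kind that could be plugged into a GRPO-style trainer; the actual K2 Think reward shaping and answer-extraction logic are not reproduced here and the <answer> convention is carried over from the SFT sketch above as an assumption.

```python
import re
from typing import Optional

# Minimal, hypothetical verifiable reward for math prompts. Assumptions: the
# gold label is a single string and completions end with an <answer>...</answer>
# span (as in the SFT sketch above). Not the actual K2 Think reward code.

def extract_answer(completion: str) -> Optional[str]:
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else None

def math_reward(completion: str, gold: str) -> float:
    """1.0 for a verifiably correct final answer, 0.0 otherwise."""
    predicted = extract_answer(completion)
    return 1.0 if predicted is not None and predicted == gold.strip() else 0.0

# One rollout scored against its gold answer:
print(math_reward("<think>10*11/2 = 55</think>\n<answer>55</answer>", "55"))  # 1.0
```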

A second ablation shows that multi-stage RL with a reduced initial context window (e.g., 16k → 32k) underperforms, failing to recover the SFT baseline, which suggests that dropping the maximum sequence length below the SFT regime can disrupt learned reasoning patterns.
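
Read as a configuration choice, the ablation compares a staged response-length schedule against keeping the full SFT-era window throughout RL. The sketch below is purely illustrative; the dictionary keys are not verl’s actual configuration schema.

```python
# Purely illustrative length schedules (not verl's real config keys): the
# staged variant starts below the SFT-era window and underperformed; keeping
# the full window from the start preserved the learned reasoning patterns.

staged_rl_schedule = [
    {"stage": 1, "max_response_tokens": 16_384},  # below the SFT regime
    {"stage": 2, "max_response_tokens": 32_768},  # restored later
]

single_stage_rl_schedule = [
    {"stage": 1, "max_response_tokens": 32_768},  # matches the SFT regime throughout
]
```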

Pillars 3–4: Agentic “Plan-Before-You-Think” and Test-time Scaling

At inference, the system first elicits a compact plan before generating a full solution, then performs best-of-N (e.g., N=3) sampling with verifiers to select the most likely-correct answer. Two effects are reported: (i) consistent quality gains from the combined scaffold; and (ii) shorter final responses despite the added plan: average token counts drop across benchmarks, with reductions of up to ~11.7% (e.g., Omni-HARD), and overall lengths comparable to much larger open models. This matters for both latency and cost.
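
The scaffold is straightforward to express in code. Below is a minimal, hypothetical sketch of the plan-then-solve plus best-of-N loop; `generate` and `verifier_score` are placeholders for whatever serving stack and verifier a deployment uses, not K2 Think’s released implementation.

```python
from typing import Callable, List

# Hypothetical plan-before-you-think + best-of-N scaffold. `generate` and
# `verifier_score` are placeholder interfaces, not the released K2 Think code.

def plan_then_solve_best_of_n(
    question: str,
    generate: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    n: int = 3,
) -> str:
    # 1) Elicit a compact plan before producing any full solution.
    plan = generate(f"Outline a short plan for solving this problem:\n{question}")

    # 2) Sample N candidate solutions conditioned on the plan.
    solve_prompt = (
        f"Problem:\n{question}\n\nPlan:\n{plan}\n\nNow solve it step by step."
    )
    candidates: List[str] = [generate(solve_prompt) for _ in range(n)]

    # 3) Let the verifier pick the most likely-correct candidate.
    return max(candidates, key=lambda c: verifier_score(question, c))
```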

Table-level analysis shows K2 Think’s response lengths are shorter than Qwen3-235B-A22B’s and in the same range as GPT-OSS-120B’s on math; after adding plan-before-you-think and verifiers, K2 Think’s average token counts fall relative to its own post-training checkpoint (e.g., AIME’24 −6.7%, AIME’25 −3.9%, HMMT25 −7.2%, Omni-HARD −11.7%, LCBv5 −10.5%, GPQA-D −2.1%).

Pillars 5–6: Speculative decoding and wafer-scale inference

K2 Think targets Cerebras Wafer-Scale Engine inference with speculative decoding, promising per-request throughput upwards of 2,000 tokens/sec, which makes the test-time scaffold practical for production and research loops. The hardware-aware inference path is a central part of the release and aligns with the system’s “small-but-fast” philosophy.
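
For intuition, here is a highly simplified greedy speculative-decoding loop: a small draft model proposes a few tokens, and the large target model verifies them in one pass, keeping the matching prefix. `draft_next_tokens` and `target_next_tokens` are hypothetical interfaces; the production path on Cerebras hardware uses probabilistic acceptance rules and fused kernels not shown here.

```python
# Highly simplified greedy speculative decoding (intuition only). Real systems
# use probabilistic accept/reject rules and hardware-specific kernels.
# `draft_next_tokens` and `target_next_tokens` are hypothetical interfaces.

def speculative_decode(prompt_tokens, draft_next_tokens, target_next_tokens,
                       k=4, max_new_tokens=64):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # The cheap draft model proposes k tokens.
        proposal = draft_next_tokens(tokens, k)
        # The large target scores the whole proposal in one forward pass and
        # returns its own greedy choice at each proposed position.
        target_choices = target_next_tokens(tokens, proposal)
        accepted = []
        for drafted, wanted in zip(proposal, target_choices):
            if drafted == wanted:
                accepted.append(drafted)   # keep tokens the target agrees with
            else:
                accepted.append(wanted)    # take the target's correction...
                break                      # ...and discard the rest of the draft
        tokens.extend(accepted)
    return tokens
```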

https://k2think-about.pages.dev/belongings/tech-report/K2-Think_Tech-Report.pdf

Evaluation protocol

Benchmarking covers competition-level math (AIME’24, AIME’25, HMMT’25, Omni-MATH-HARD), code (LiveCodeBench v5; SciCode sub/main), and science knowledge/reasoning (GPQA-Diamond; HLE). The research team reports a standardized setup: max generation length of 64k tokens, temperature 1.0, top-p 0.95, stop marker </answer>, and each score averaged over 16 independent pass@1 evaluations to reduce run-to-run variance.
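
In code, this protocol amounts to averaging repeated pass@1 runs under fixed sampling settings. The sketch below is a minimal illustration; `run_benchmark_pass1` stands in for a real evaluation harness and is an assumption.

```python
from statistics import mean

# Minimal sketch of the reported protocol: each score is the mean of 16
# independent pass@1 runs under fixed sampling settings.
# `run_benchmark_pass1` is a placeholder for an actual evaluation harness.

SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 64_000,     # max generation length of 64k tokens
    "stop": ["</answer>"],    # stop marker from the standardized setup
}

def averaged_pass1(run_benchmark_pass1, benchmark: str, repeats: int = 16) -> float:
    scores = [run_benchmark_pass1(benchmark, **SAMPLING) for _ in range(repeats)]
    return mean(scores)  # averaging reduces run-to-run variance of pass@1
```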

https://k2think-about.pages.dev/belongings/tech-report/K2-Think_Tech-Report.pdf

Results

Math (micro-average across AIME’24/’25, HMMT25, Omni-HARD). K2 Think reaches 67.99, leading the open-weight cohort and comparing favorably even to much larger systems; it posts 90.83 (AIME’24), 81.24 (AIME’25), 73.75 (HMMT25), and 60.73 on Omni-HARD, the latter being the most difficult split. The positioning is consistent with strong parameter efficiency relative to DeepSeek V3.1 (671B) and GPT-OSS-120B (120B).
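
Note that 67.99 is a micro-average, i.e., computed over individual problems rather than as a plain mean of the four headline scores (which would be roughly 76.6); since the micro-average sits below every per-benchmark score except Omni-HARD’s, that split must contribute the bulk of the problems. The sketch below illustrates the distinction with hypothetical problem counts.

```python
# Macro vs. micro averaging over the math benchmarks. Scores are from the
# report; the problem counts are hypothetical placeholders, used only to show
# why an instance-weighted (micro) average sits near the largest, hardest split.

scores = {"AIME'24": 90.83, "AIME'25": 81.24, "HMMT25": 73.75, "Omni-HARD": 60.73}
counts = {"AIME'24": 30, "AIME'25": 30, "HMMT25": 30, "Omni-HARD": 200}  # hypothetical

macro = sum(scores.values()) / len(scores)
micro = sum(scores[b] * counts[b] for b in scores) / sum(counts.values())
print(f"macro ≈ {macro:.2f}, micro ≈ {micro:.2f}")  # ≈ 76.64 vs. ≈ 67.31 here
```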

Code. The LiveCodeBench v5 score is 63.97, exceeding similarly sized peers and even larger open models (e.g., > Qwen3-235B-A22B at 56.64). On SciCode, K2 Think scores 39.2/12.0 (sub/main), closely tracking the best open systems on sub-problem accuracy.

Science. GPQA-Diamond reaches 71.08; HLE is 9.95. The model is not just a math specialist: it remains competitive across knowledge-heavy tasks.

https://k2think-about.pages.dev/belongings/tech-report/K2-Think_Tech-Report.pdf

Key numbers at a glance

  • Backbone: Qwen2.5-32B (open weight), post-trained with long CoT SFT + RLVR (GRPO via verl).
  • RL data: Guru (~92k prompts) across Math/Code/Science/Logic/Simulation/Tabular.
  • Inference scaffold: plan-before-you-think + BoN with verifiers; shorter outputs (e.g., −11.7% tokens on Omni-HARD) at higher accuracy.
  • Throughput target: ~2,000 tok/s on Cerebras WSE with speculative decoding.
  • Math micro-avg: 67.99 (AIME’24 90.83, AIME’25 81.24, HMMT’25 73.75, Omni-HARD 60.73).
  • Code/Science: LCBv5 63.97; SciCode 39.2/12.0; GPQA-D 71.08; HLE 9.95.
  • Safety-4 macro: 0.75 (Refusal 0.83, Conv. Robustness 0.89, Cybersecurity 0.56, Jailbreak 0.72).

Summary

K2 Think demonstrates that integrative post-training + test-time compute + hardware-aware inference can close much of the gap to larger, proprietary reasoning systems. At 32B, it is tractable to fine-tune and serve; with plan-before-you-think and BoN-with-verifiers, it controls token budgets; with speculative decoding on wafer-scale hardware, it reaches ~2k tok/s per request. K2 Think is presented as a fully open system: weights, training data, deployment code, and test-time optimization code.


Check out the Paper, Model on Hugging Face, GitHub and Direct Access. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Advanced AI Reasoning that Outperforms 20x Larger Reasoning Models appeared first on MarkTechPost.
