Moonshot AI Releases Kimi K2 Thinking: An Impressive Thinking Model that can Execute up to 200–300 Sequential Tool Calls without Human Intervention
How do we design AI systems that can plan, reason, and act over long sequences of decisions without constant human guidance? Moonshot AI has released Kimi K2 Thinking, an open-source thinking-agent model that exposes the full reasoning stream of the Kimi K2 Mixture of Experts architecture. It targets workloads that need deep reasoning, long-horizon tool use, and stable agent behavior across many steps.

What is Kimi K2 Thinking?
Kimi K2 Thinking is described as the latest and most capable version of Moonshot's open-source thinking model. It is built as a thinking agent that reasons step by step and dynamically invokes tools during inference. The model is designed to interleave chain of thought with function calls, so it can read, think, call a tool, think again, and repeat for hundreds of steps.
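To make the interleaving concrete, here is a minimal sketch of that think-act loop in Python. The `query_model` and `run_tool` functions are hypothetical stubs, not Moonshot APIs; they stand in for a call to the model and a tool dispatcher.

```python
# Minimal sketch of the interleaved think / tool-call loop described above.
# `query_model` and `run_tool` are hypothetical stand-ins, not Moonshot APIs.

def query_model(messages):
    """Stub: a real implementation would call the K2 Thinking endpoint and
    return the reasoning text plus an optional tool request."""
    return {"reasoning": "…", "tool_call": None, "answer": "done"}

def run_tool(name, args):
    """Stub: dispatch to a search, browser, or code-execution tool."""
    return f"result of {name}({args})"

def agent_loop(task, max_steps=300):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # K2 Thinking sustains ~200-300 such steps
        step = query_model(messages)
        messages.append({"role": "assistant", "content": step["reasoning"]})
        if step["tool_call"] is None:           # model decided it is finished
            return step["answer"]
        result = run_tool(**step["tool_call"])  # act, then feed the result back
        messages.append({"role": "tool", "content": result})
    return None  # step budget exhausted
```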
The model sets a new state of the art on Humanity's Last Exam and BrowseComp while sustaining coherent behavior across roughly 200 to 300 sequential tool calls without human intervention.
At the same time, K2 Thinking is released as an open-weights model with a 256K-token context window and native INT4 inference, which reduces latency and GPU memory usage while preserving benchmark performance.
K2 Thinking is already live on kimi.com in chat mode and is accessible through the Moonshot platform API, with a dedicated agentic mode planned to expose the full tool-using behavior.
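For readers who want to try the API route, the snippet below shows what a call might look like through an OpenAI-compatible client. The base URL and model identifier are assumptions and should be checked against Moonshot's platform documentation.

```python
# Hedged example: calling K2 Thinking through an OpenAI-compatible endpoint.
# The base URL and model name below are assumptions; consult Moonshot's
# platform docs for the exact values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.moonshot.ai/v1",  # assumed platform endpoint
    api_key="YOUR_MOONSHOT_API_KEY",
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",  # assumed model identifier
    messages=[{"role": "user", "content": "Plan a multi-step web research task."}],
)
print(response.choices[0].message.content)
```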
Architecture, MoE design, and context length
Kimi K2 Thinking inherits the Kimi K2 Mixture of Experts design. The model uses a MoE architecture with 1T total parameters and 32B activated parameters per token. It has 61 layers including 1 dense layer, 384 experts with 8 experts selected per token, 1 shared expert, 64 attention heads, and an attention hidden dimension of 7168. The MoE hidden dimension is 2048 per expert.
The vocabulary size is 160K tokens and the context length is 256K. The attention mechanism is Multi-head Latent Attention (MLA), and the activation function is SwiGLU.
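For reference, the sketch below collects these stated hyperparameters into one plain config object. The values are copied from the description above; the class itself is illustrative, not an official config format.

```python
# K2 Thinking hyperparameters as reported above, gathered in a plain config
# object. Descriptive only; not an official Moonshot config schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class K2ThinkingConfig:
    total_params: str = "1T"
    active_params_per_token: str = "32B"
    num_layers: int = 61            # includes 1 dense layer
    num_experts: int = 384
    experts_per_token: int = 8      # routed experts chosen per token
    shared_experts: int = 1
    attention_heads: int = 64
    attention_hidden_dim: int = 7168
    moe_hidden_dim: int = 2048      # per expert
    vocab_size: str = "160K"
    context_length: str = "256K"
    attention: str = "Multi-head Latent Attention"
    activation: str = "SwiGLU"
```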
Test-time scaling and long-horizon thinking
Kimi K2 Thinking is explicitly optimized for test-time scaling. The model is trained to expand its reasoning length and tool-call depth when handling harder tasks, rather than relying on a fixed short chain of thought.

On Humanity's Last Exam in the no-tools setting, K2 Thinking scores 23.9. With tools, the score rises to 44.9, and in the heavy setting it reaches 51.0. On AIME25 with Python, it reports 99.1, and on HMMT25 with Python it reports 95.1. On IMO-AnswerBench it scores 78.6, and on GPQA it scores 84.5.
The testing protocol caps thinking-token budgets at 96K for HLE, AIME25, HMMT25, and GPQA. It uses 128K thinking tokens for IMO-AnswerBench, LiveCodeBench, and OJ-Bench, and 32K completion tokens for Longform Writing. On HLE, the maximum step limit is 120 with a 48K reasoning budget per step. On agentic search tasks, the limit is 300 steps with a 24K reasoning budget per step.
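Expressed as data, the protocol looks roughly like the following. This is a paraphrase of the reported settings, not an official harness config, and it assumes "K" means thousands of tokens.

```python
# Paraphrase of the reported evaluation budgets (assumes 1K = 1,000 tokens).
THINKING_TOKEN_BUDGETS = {
    "HLE": 96_000, "AIME25": 96_000, "HMMT25": 96_000, "GPQA": 96_000,
    "IMO-AnswerBench": 128_000, "LiveCodeBench": 128_000, "OJ-Bench": 128_000,
}
COMPLETION_TOKEN_BUDGETS = {"Longform Writing": 32_000}
STEP_LIMITS = {
    "HLE": {"max_steps": 120, "reasoning_budget_per_step": 48_000},
    "agentic_search": {"max_steps": 300, "reasoning_budget_per_step": 24_000},
}
```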
Benchmarks in agentic search and coding
On agentic search tasks with tools, K2 Thinking reports 60.2 on BrowseComp, 62.3 on BrowseComp-ZH, 56.3 on Seal-0, 47.4 on FinSearchComp-T3, and 87.0 on Frames.
On general knowledge benchmarks, it reports 84.6 on MMLU-Pro, 94.4 on MMLU-Redux, 73.8 on Longform Writing, and 58.0 on HealthBench.
For coding, K2 Thinking achieves 71.3 on SWE-bench Verified with tools, 61.1 on SWE-bench Multilingual with tools, 41.9 on Multi-SWE-bench with tools, 44.8 on SciCode, 83.1 on LiveCodeBenchV6, 48.7 on OJ-Bench in the C++ setting, and 47.1 on Terminal-Bench with simulated tools.
The Moonshot team also defines a Heavy Mode that runs eight trajectories in parallel, then aggregates them to produce a final answer. This is used on some reasoning benchmarks to squeeze extra accuracy out of the same base model.
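The report does not spell out the aggregation rule, but a Heavy-Mode-style setup can be sketched as parallel rollouts followed by a vote over final answers, as below. `run_trajectory` is a hypothetical stub, and majority voting is an assumed aggregation strategy, not confirmed as Moonshot's method.

```python
# Sketch of Heavy-Mode-style aggregation: run several independent
# trajectories, then combine their final answers. Majority voting is one
# plausible rule; Moonshot's exact aggregation method is not specified.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_trajectory(task, seed):
    """Stub: one full reasoning-plus-tools rollout; a real version would
    sample the model with a distinct seed or temperature."""
    return f"answer-{seed % 2}"

def heavy_mode(task, n_trajectories=8):
    with ThreadPoolExecutor(max_workers=n_trajectories) as pool:
        answers = list(pool.map(lambda s: run_trajectory(task, s),
                                range(n_trajectories)))
    # Aggregate: pick the most common final answer across trajectories.
    return Counter(answers).most_common(1)[0][0]

print(heavy_mode("hard reasoning problem"))
```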
Native INT4 quantization and deployment
K2 Thinking is trained as a native INT4 model. The research team applies Quantization-Aware Training during the post-training stage and uses INT4 weight-only quantization on the MoE components. This enables INT4 inference with roughly a 2x generation-speed improvement in low-latency mode while maintaining state-of-the-art performance. All reported benchmark scores were obtained under INT4 precision.
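To illustrate what INT4 weight-only quantization does to a weight tensor, here is a generic symmetric per-group sketch in NumPy. It shows the storage idea only; it is not Moonshot's QAT recipe, and the group size is an assumption.

```python
# Illustrative INT4 weight-only quantization (symmetric, per-group).
# Generic sketch of the idea, not Moonshot's actual QAT pipeline.
import numpy as np

def quantize_int4(w, group_size=128):
    """Quantize a 1-D weight slice to INT4 levels in [-8, 7], one scale per group."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # per-group scale
    scale = np.maximum(scale, 1e-8)                     # guard all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```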
The checkpoints are saved in the compressed-tensors format and can be unpacked to higher-precision formats such as FP8 or BF16 using the official compressed-tensors tools. Recommended inference engines include vLLM, SGLang, and KTransformers.
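As a hedged example of the self-hosting path, the snippet below uses vLLM's offline `LLM` API. The Hugging Face model ID and parallelism setting are assumptions; a 1T-parameter MoE needs a multi-GPU node, so treat this as the shape of the call rather than a tested recipe.

```python
# Hedged vLLM sketch: load the open weights and generate. The model ID and
# tensor_parallel_size are assumptions; check the official model card and
# size them to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2-Thinking",  # assumed Hugging Face repo ID
    tensor_parallel_size=8,               # a 1T MoE requires many GPUs
    trust_remote_code=True,
)
params = SamplingParams(temperature=1.0, max_tokens=1024)
outputs = llm.generate(["Solve step by step: ..."], params)
print(outputs[0].outputs[0].text)
```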
Key Takeaways
- Kimi K2 Thinking is an open-weights thinking agent that extends the Kimi K2 Mixture of Experts architecture with explicit long-horizon reasoning and tool use, not just short chat-style responses.
- The model uses a trillion-parameter MoE design with 32B active parameters per token and a 256K context window, and it is trained as a native INT4 model with Quantization-Aware Training, which gives roughly 2x faster inference while keeping benchmark performance stable.
- K2 Thinking is optimized for test-time scaling: it can carry out hundreds of sequential tool calls in a single task and is evaluated under large thinking-token budgets and strict step caps, which matters when you try to reproduce its reasoning and agentic results.
- On public benchmarks, it leads or is competitive on reasoning, agentic search, and coding tasks such as HLE with tools, BrowseComp, and SWE-bench Verified with tools, showing that the thinking-oriented variant delivers clear gains over the base non-thinking K2 model.
Editorial Comments
Kimi K2 Thinking is a strong signal that test-time scaling is now a first-class design goal for open-source reasoning models. Moonshot AI is not only exposing a 1T-parameter Mixture of Experts system with 32B active parameters and a 256K context window, it is doing so with native INT4 quantization, Quantization-Aware Training, and tool orchestration that runs for hundreds of steps in production-like settings. Overall, Kimi K2 Thinking shows that open-weights reasoning agents with long-horizon planning and tool use are becoming practical infrastructure, not just research demos.
