
Xiaomi Releases MiMo-V2.5-Pro and MiMo-V2.5: Matching Frontier Model Benchmarks at Significantly Lower Token Cost

The Xiaomi MiMo team has publicly launched two new models: MiMo-V2.5-Pro and MiMo-V2.5. The benchmarks, combined with some genuinely striking real-world task demos, make a compelling case that open agentic AI is catching up to the frontier faster than most anticipated. Both models are available immediately via API and are priced competitively.

What is an Agentic Model, and Why Does It Matter?

Most LLM benchmarks test a model's ability to answer a single, self-contained question. Agentic benchmarks test something much harder: whether a model can complete a multi-step goal autonomously, using tools (web search, code execution, file I/O, API calls) over many turns, without losing track of the original objective.
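The multi-turn tool loop described above can be sketched in a few lines. This is a minimal illustration, not Xiaomi's harness; all names (`run_agent`, the message format, the `tool`/`args` reply shape) are hypothetical, and real scaffolds like Claude Code add memory management, sandboxing, and retries on top of this basic pattern.

```python
# Minimal sketch of an agentic loop: the model repeatedly chooses a tool,
# observes the result, and stops when it judges the goal complete.

def run_agent(model, tools, goal, max_turns=50):
    """Drive `model` toward `goal` over many tool-using turns."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        reply = model(history)            # model decides: call a tool, or finish
        if reply.get("tool") is None:
            return reply["content"]       # model declares the goal complete
        result = tools[reply["tool"]](**reply["args"])
        history.append({"role": "tool", "content": result})
    raise TimeoutError("goal not reached within the turn budget")
```

Agentic benchmarks essentially score how often a loop like this terminates with the objective actually satisfied, across hundreds or thousands of turns.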

Think of it as the difference between a model that can answer "how do I write a lexer?" and one that can actually write a complete compiler, run tests against it, catch regressions, and fix them, all with no human in the loop. The latter is exactly what the Xiaomi MiMo team is demonstrating here.

MiMo-V2.5-Pro: The Flagship

MiMo-V2.5-Pro is Xiaomi's most capable model to date, delivering significant improvements over its predecessor, MiMo-V2-Pro, in general agentic capabilities, complex software engineering, and long-horizon tasks.

The key benchmark numbers are competitive with top closed-source models: SWE-bench Pro 57.2, Claw-Eval 63.8, and τ3-Bench 72.9, placing it alongside Claude Opus 4.6 and GPT-5.4 across most evaluations. V2.5-Pro can sustain complex, long-horizon tasks spanning more than a thousand tool calls, with substantial improvements in instruction following within agentic scenarios: it reliably adheres to subtle requirements embedded in context and maintains strong coherence across ultra-long contexts.

One behavioral property that distinguishes V2.5-Pro from earlier models is what the Xiaomi MiMo team calls "harness awareness": it makes full use of the affordances of its harness environment, manages its memory, and shapes how its own context is populated toward the final objective. In other words, the model doesn't just execute instructions mechanically; it actively optimizes its own working environment to stay on track across very long tasks.
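One concrete behavior this kind of context self-management could take is compaction: keeping the goal and recent turns verbatim while collapsing older turns into a summary. The sketch below is an assumption about what such a mechanism might look like, not Xiaomi's implementation; `compact_context` and the message format are invented, and `summarize` stands in for a model call.

```python
# Hedged sketch of one "harness awareness" behavior: the agent prunes its
# own context, keeping the goal and recent turns and summarizing the middle.

def compact_context(history, budget, summarize, keep_recent=4):
    """Keep the goal and recent turns; collapse the rest into a summary note."""
    if sum(len(m["content"]) for m in history) <= budget:
        return history                       # still within budget: no change
    goal = history[0]
    middle = history[1:-keep_recent]         # oldest working turns
    recent = history[-keep_recent:]
    note = {"role": "system", "content": summarize(middle)}
    return [goal, note] + recent
```

A loop that calls something like this between turns is one plausible way a model sustains coherence across a thousand-plus tool calls without drowning in its own transcript.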

The three real-world task demos Xiaomi published illustrate exactly what "long-horizon agentic capability" means in practice.

Demo 1 — SysY Compiler in Rust: Adapted from Peking University's Compiler Principles course project, this task asks the model to implement a complete SysY compiler in Rust from scratch: lexer, parser, AST, Koopa IR codegen, RISC-V assembly backend, and performance optimization. The reference project typically takes a PKU CS major several weeks. MiMo-V2.5-Pro finished in 4.3 hours across 672 tool calls, scoring a perfect 233/233 against the course's hidden test suite.

What's notable isn't just the final score; it's the architecture of execution. Rather than thrashing through trial and error, the model built the compiler layer by layer: scaffold the full pipeline first, perfect the Koopa IR stage (110/110), then the RISC-V backend (103/103), then performance (20/20). The first compile alone passed 137/233 tests, a 59% cold start that suggests the architecture was designed correctly before a single test was run. When a refactoring step later caused regressions, the model pinpointed the failures, recovered, and pushed on. This is structured, self-correcting engineering behavior, not pattern-matched code generation.
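The regression-recovery step implied above reduces to a simple comparison: which tests passed before a refactor but fail after it? A tiny sketch under invented assumptions (test names and the dict-of-bools result format are illustrative, not from the course harness):

```python
# Compare test results before and after a refactor and report
# newly failing tests, i.e. regressions the agent must fix.

def find_regressions(before, after):
    """Return tests that passed before but fail (or vanished) after."""
    return sorted(t for t, passed in before.items()
                  if passed and not after.get(t, False))
```

An agent that reruns the suite after every structural change and feeds this diff back into its own context is doing exactly the kind of self-correction the demo describes.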

Demo 2 — Full-Featured Desktop Video Editor: With just a few simple prompts, MiMo-V2.5-Pro delivered a working desktop app: multi-track timeline, clip trimming, cross-fades, audio mixing, and an export pipeline. The final build is 8,192 lines of code, produced over 1,868 tool calls across 11.5 hours of autonomous work.

Demo 3 — Analog EDA: FVF-LDO Design: This is the most technically specialized demo: a graduate-level analog-circuit EDA task requiring the design and optimization of a complete FVF-LDO (flipped-voltage-follower low-dropout regulator) from scratch in the TSMC 180nm CMOS process. The model had to size the power transistor, tune the compensation network, and determine bias voltages so that six metrics land within spec simultaneously: phase margin, line regulation, load regulation, quiescent current, PSRR, and transient response. Wired into an ngspice simulation loop, the model spent about an hour in closed-loop iteration (calling the simulator, reading waveforms, tweaking parameters) and produced a design in which every target metric is met, with four key metrics improved by an order of magnitude over its own initial attempt.
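The closed-loop pattern here is simulate, read metrics, nudge a parameter, repeat. The sketch below illustrates that control flow only; it is not Xiaomi's setup. `simulate` stands in for an ngspice run, the parameter names and one-knob-per-metric coupling are invented, and for simplicity every metric is treated as higher-is-better (real LDO specs mix upper and lower bounds).

```python
# Closed-loop tuning sketch: rerun the simulator, find the metric
# furthest below spec, and nudge its associated knob until all pass.

def tune(params, simulate, specs, step=0.1, max_iters=100):
    for _ in range(max_iters):
        metrics = simulate(params)
        failing = {k: v for k, v in metrics.items() if v < specs[k]}
        if not failing:
            return params, metrics            # every target metric within spec
        worst = min(failing, key=lambda k: failing[k] / specs[k])
        params[worst] *= 1 + step             # nudge the knob tied to the worst metric
    raise RuntimeError("did not converge within the iteration budget")
```

Real analog optimization is far less tidy (metrics interact, and the agent reads waveforms rather than scalars), but the loop structure — simulator in, parameter update out — is the same.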

Token Efficiency: Frontier-level intelligence is only useful if it's cost-effective. On ClawEval, V2.5-Pro lands at 64% Pass^3 using only ~70K tokens per trajectory, roughly 40–60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 at comparable capability levels. For engineers building production agent pipelines, this is a material cost reduction, not just a marketing stat.
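A back-of-envelope calculation shows what the efficiency claim means in dollars. Only the ~70K tokens per trajectory and the 40–60% reduction come from the announcement; the $/Mtok price and the assumption of equal per-token pricing across models are illustrative simplifications.

```python
# Cost per trajectory at the midpoint of the claimed 40-60% reduction:
# a 50% reduction implies a comparable frontier run uses ~140K tokens.

def trajectory_cost_usd(tokens, usd_per_million_tokens):
    return tokens / 1_000_000 * usd_per_million_tokens

PRICE = 10.0                                      # hypothetical $/Mtok, same for both
mimo = trajectory_cost_usd(70_000, PRICE)         # ~70K tokens per trajectory
frontier = trajectory_cost_usd(140_000, PRICE)    # midpoint frontier estimate
savings = 1 - mimo / frontier                     # fraction saved per trajectory
```

At equal per-token prices, the token reduction translates one-to-one into cost: halving tokens halves the bill, and the gap compounds across the thousands of trajectories a production pipeline runs.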

https://mimo.xiaomi.com/mimo-v2-5-pro/

MiMo Coding Bench is Xiaomi's in-house evaluation suite designed to assess models on real-world developer tasks within agentic frameworks like Claude Code. It covers repo understanding, project building, code review, structured artifact generation, planning, SWE, and more. V2.5-Pro leads the field on this benchmark, and Xiaomi explicitly positions it as a drop-in backend for scaffolds including Claude Code, OpenCode, and Kilo.

MiMo-V2.5: Native Omnimodal at Half the Cost

While V2.5-Pro targets the hardest long-horizon agentic tasks, MiMo-V2.5 is a major step forward in agentic capability and multimodal understanding. With native visual and audio understanding, MiMo-V2.5 reasons seamlessly across modalities, surpasses MiMo-V2-Pro in agentic performance, and supports up to 1 million tokens of context.

The model is designed with perception and action unified from scratch. MiMo-V2.5 is trained from the start to see, hear, and act on what it perceives, resulting in a single model that understands everything and gets things done. This is architecturally significant: earlier multimodal models often bolted vision on top of a text backbone, creating capability gaps at the perception-action boundary.

On the coding side, the value proposition is clear: on MiMo Coding Bench, MiMo-V2.5 delivers strong results on everyday coding tasks, closing the gap with frontier models and matching MiMo-V2.5-Pro at half the cost. For teams that don't need the extreme long-horizon depth of V2.5-Pro, this is a compelling operating point.

https://mimo.xiaomi.com/mimo-v2-5/

On multimodal benchmarks, MiMo-V2.5 achieves a 62.3 on the Claw-Eval general subset, placing it at the Pareto frontier of performance and efficiency. On the multimodal agentic subset, MiMo-V2.5 reaches 23.8 on Claw-Eval Multimodal, matching Claude Sonnet 4.6, leading MiMo-V2-Omni by eight points, and trailing Claude Opus 4.6 by a single point.

On video understanding, MiMo-V2.5 scores 87.7 on Video-MME, effectively tied with Gemini 3 Pro (88.4) and well ahead of Gemini 3 Flash. Long-horizon video comprehension (scene tracking, temporal reasoning, visual grounding over minutes of footage) is now in frontier territory. On image understanding, MiMo-V2.5 lands at 81.0 on CharXiv RQ and 77.9 on MMMU-Pro, closing in on Gemini 3 Pro.

Pricing is straightforward: MiMo-V2.5 runs at 1x (1 token = 1 credit), while MiMo-V2.5-Pro runs at 2x (1 token = 2 credits). Token Plans no longer charge a multiplier for the 1M-token context window, previously a common cost friction for long-context agentic workloads.
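The credit arithmetic above is simple enough to sketch directly. The model identifiers are illustrative; the 1x/2x multipliers and the absence of a long-context surcharge come from the pricing note.

```python
# Credit cost per request: tokens times the model's multiplier.
# Per the pricing note, the 1M-token context adds no extra multiplier.

CREDIT_MULTIPLIER = {"mimo-v2.5": 1, "mimo-v2.5-pro": 2}

def credits_used(model, tokens):
    return tokens * CREDIT_MULTIPLIER[model]
```

So a 100K-token trajectory costs 100K credits on MiMo-V2.5 and 200K on the Pro model, regardless of how much of the context window it occupies.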

Key Takeaways

  • MiMo-V2.5-Pro matches frontier closed-source models on key agentic benchmarks (SWE-bench Pro 57.2, Claw-Eval 63.8, τ3-Bench 72.9) while using 40–60% fewer tokens per trajectory than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4.
  • Long-horizon autonomy is real and measurable: V2.5-Pro autonomously built a complete SysY compiler in Rust (233/233 tests, 672 tool calls, 4.3 hours) and a full-featured desktop video editor (8,192 lines of code, 1,868 tool calls, 11.5 hours).
  • MiMo-V2.5 is natively omnimodal: trained from scratch to see, hear, and act across modalities with a native 1M-token context window, matching Claude Sonnet 4.6 on Claw-Eval Multimodal and nearly tying Gemini 3 Pro on Video-MME (87.7 vs. 88.4).
  • Pro-level coding performance at half the cost: on MiMo Coding Bench, MiMo-V2.5 matches MiMo-V2.5-Pro on everyday coding tasks at 1x token pricing, making it the practical choice for most production agent pipelines.
  • Both models are already compatible with popular agentic scaffolds like Claude Code, OpenCode, and Kilo, giving AI devs a drop-in, auditable, self-hostable path to frontier-level agentic AI.

Check out the technical details for MiMo-V2.5 and MiMo-V2.5-Pro.


The post Xiaomi Releases MiMo-V2.5-Pro and MiMo-V2.5: Matching Frontier Model Benchmarks at Significantly Lower Token Cost appeared first on MarkTechPost.
