Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs
Inference speed is turning into a competitive metric for large language models. Xiaomi’s MiMo team just released MiMo-V2.5-Pro-ExtremelySpeed, built in collaboration with the TileRT systems group. It decodes faster than 1000 tokens per second on a 1-trillion-parameter model. The Xiaomi team describes this as a first at trillion-parameter scale. Demos show generation peaks near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.
What is MiMo-V2.5-Pro-ExtremelySpeed
ExtremelySpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. ExtremelySpeed targets generation speed rather than model capability: it changes how fast the model produces output tokens, not what it can do. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. Crucially, the whole stack runs on a single standard 8-GPU commodity node.
The Speed Case: Three Layers Working Together
The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most of the parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone isn’t enough. The 1000 TPS result needs all three aligned tightly.
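To make the MXFP4 format concrete, here is a minimal pure-Python sketch of block-wise fake quantization. It assumes the layout of the OCP Microscaling format (4-bit E2M1 elements, one shared power-of-two scale per 32 elements) and one common rule for picking the scale. The function names are hypothetical, and this is not Xiaomi's QAT pipeline, which is not described in the article.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # MXFP4 shares one 8-bit power-of-two scale across 32 elements


def quantize_block(block):
    """Fake-quantize one block; returns (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumed scale rule: a power of two that puts amax near the top of the
    # grid. The largest E2M1 exponent is 2, because 6 = 1.5 * 2**2.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_GRID[-1])  # clamp to the largest code
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * scale, x))
    return scale, out


def bits_per_weight():
    """Storage cost: 4 bits per element plus one 8-bit scale per block."""
    return 4 + 8 / BLOCK


if __name__ == "__main__":
    scale, deq = quantize_block([1.0, -0.5, 0.26] + [0.0] * 29)
    print(scale, deq[:3], bits_per_weight())
```

Under this layout each expert weight costs 4.25 bits, against 8 bits for FP8 and 16 for FP16, which is where the bandwidth saving during decode comes from.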
DFlash: Parallel Drafting Without a Serial Bottleneck
Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies these guesses in parallel. Rejection sampling keeps the output distribution identical to normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction: the draft model fills an entire block of masked positions in a single forward pass.
Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. Block size is capped at 8 to limit verification cost and boost concurrency.
Acceptance length measures how many draft tokens survive verification each round.
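The verification step can be sketched as follows. This is the textbook rejection-sampling rule for speculative decoding, not DFlash's own code: a draft token is accepted with probability min(1, p/q), and on the first rejection a replacement is drawn from the renormalized residual max(0, p − q). The function name and the dict-based distributions are assumptions for illustration, and the bonus token that is normally sampled when every draft token is accepted is omitted for brevity.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the accepted prefix, plus one corrected token on first rejection.

    draft_probs[i] and target_probs[i] map token -> probability at position i
    (q for the draft model, p for the large model).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized
        # implicitly by random.choices.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens = list(residual)
        weights = [residual[t] for t in tokens]
        accepted.append(rng.choices(tokens, weights=weights)[0])
        break  # everything after the first rejection is discarded
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    same = {"a": 0.5, "b": 0.5}
    print(verify_draft(["a", "b"], [same, same], [same, same], rng))
```

When the draft and target distributions agree, every draft token is accepted. When the target assigns zero probability to a draft token, it is always rejected and replaced.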
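Why sliding-window attention keeps per-prediction compute constant can be shown with a small sketch. The helper names and the window size of 128 are hypothetical; the article does not state the draft model's actual window.

```python
def attended_positions(pos, window):
    """Key positions visible to query `pos` under causal sliding-window attention."""
    return range(max(0, pos - window + 1), pos + 1)


def keys_per_query(pos, window):
    """Number of keys one query attends to; bounded by the window size."""
    return len(attended_positions(pos, window))


if __name__ == "__main__":
    for pos in (3, 1_000, 100_000):
        print(pos, keys_per_query(pos, window=128))
```

Once the context is longer than the window, each query attends to exactly `window` keys, so drafting cost stops growing with context length.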
| Scenario | Acceptance Length |
|---|---|
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |
In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14.
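A rough way to read these numbers is to ask how many verification rounds per second are needed to hit 1000 TPS. The arithmetic below is a back-of-the-envelope sketch that assumes each round emits exactly the reported acceptance length in tokens and folds drafting time into the round. The function names are hypothetical and the results are not measurements.

```python
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def rounds_per_second(target_tps, acceptance_length):
    """Verification rounds needed per second to sustain target_tps."""
    return target_tps / acceptance_length


def round_budget_ms(target_tps, acceptance_length):
    """Wall-clock time available for one draft-plus-verify round."""
    return 1000.0 * acceptance_length / target_tps


if __name__ == "__main__":
    for scenario, length in ACCEPTANCE_LENGTH.items():
        rps = rounds_per_second(1000, length)
        budget = round_budget_ms(1000, length)
        print(f"{scenario}: {rps:.0f} rounds/s, {budget:.2f} ms per round")
```

Under these assumptions, a lower acceptance length leaves less time per round, which is why the Agent scenario is the hardest of the three to serve at the same token rate.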
TileRT: Squeezing the Microseconds
At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fragment the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward.
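The launch-gap argument can be illustrated with simple arithmetic. Every input below is hypothetical: the operator count and per-launch cost are placeholders rather than TileRT measurements, and the 6.3 ms round budget assumes 1000 TPS with about 6.3 tokens emitted per verification round.

```python
def launch_overhead_share(ops_per_round, launch_us, round_budget_ms):
    """Fraction of one decode round spent purely on kernel launches."""
    overhead_ms = ops_per_round * launch_us / 1000.0
    return overhead_ms / round_budget_ms


if __name__ == "__main__":
    # Hypothetical: 600 operator launches per round at 5 microseconds each.
    share = launch_overhead_share(ops_per_round=600, launch_us=5.0, round_budget_ms=6.3)
    print(f"{share:.0%} of the round would go to launch overhead")
```

With placeholder numbers of this order, launch gaps alone would consume a large share of the round, which is the motivation for a kernel that stays resident instead of being relaunched per operator.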
Use Cases
The release targets latency-sensitive work where waiting breaks the loop:
- Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
- Coding agents: faster code generation cuts the wait between agent steps.
- Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
- Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.
These are throughput-bound workloads where raw token speed is the binding constraint.
How It Compares
The first table contrasts the two routes to high decode speed.
| Approach | Hardware | How speed is achieved |
|---|---|---|
| Cerebras | Wafer-scale integration (custom) | Scale on a single custom wafer |
| Groq | Custom architecture | Pure on-chip SRAM |
| MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT |
The second table compares the standard model with the ExtremelySpeed mode.
| Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-ExtremelySpeed |
|---|---|---|
| Decode speed | Baseline | ~10× faster (1000+ TPS) |
| Price | 1× | 3× |
| Weight precision | Standard | FP4 MoE Experts via QAT |
| Decoding | Standard autoregressive | DFlash speculative decoding |
| Access | Standard model plans | API only, application-based trial |
| Token Plan | Supported | Not supported |
Access, Pricing, and Open Source
ExtremelySpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for roughly 10× the speed. It is API only, and the Token Plan isn’t supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.
Strengths and Limitations
Strengths
- 1000+ TPS on a 1T model without custom silicon.
- Lossless decoding through rejection sampling in DFlash.
- FP4 applied only where tolerance is highest, preserving quality.
- An open checkpoint lets the community test the claims.
Limitations
- Access is gated, short, and approval-based at launch.
- Pricing triples per token versus the standard model.
- Acceptance length drops in open-ended conversation.
- Independent third-party speed verification isn’t yet public.
Key Takeaways
- Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
- The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
- FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
- DFlash predicts an entire masked block per forward pass, hitting 6.30 average acceptance length in coding.
- ExtremelySpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.
