Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has just released MiMo-V2.5-Pro-ExtremelySpeed, built in collaboration with the TileRT systems group. It decodes faster than 1000 tokens per second on a 1-trillion-parameter model. Xiaomi’s team describes this as a first at trillion-parameter scale. Demos show generation peaks near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.

What is MiMo-V2.5-Pro-ExtremelySpeed

ExtremelySpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. ExtremelySpeed targets generation speed rather than model capability: it changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node.

The Speed Case: Three Layers Working Together

The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
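To make the format concrete, here is a minimal sketch of MXFP4-style block quantization as laid out in the OCP Microscaling formats: 32 elements share one power-of-two scale, and each element is a 4-bit E2M1 value. The function names are hypothetical, rounding is simplified to nearest-value with ties going to the smaller magnitude, and nothing here reflects Xiaomi's actual quantizer or QAT recipe, which the article does not describe.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements that share one scale in the MX formats
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block(block):
    """Return (shared scale exponent, list of E2M1 values) for one block."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest element lands near the top
    # of the E2M1 range; values that still overflow are clamped to 6.0.
    scale_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** scale_exp
    quantized = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - magnitude))
        quantized.append(math.copysign(nearest, x))
    return scale_exp, quantized


def dequantize_block(scale_exp, quantized):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** scale_exp
    return [q * scale for q in quantized]


def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per weight, including the amortized shared scale."""
    return element_bits + scale_bits / block_size


weights = [0.9, -0.31, 0.07, 1.2] + [0.0] * (BLOCK_SIZE - 4)
exponent, codes = quantize_block(weights)
print(exponent, codes[:4], dequantize_block(exponent, codes)[:4])
print(bits_per_weight(), "bits per weight")
```

With an 8-bit scale shared by 32 elements, storage comes to 4.25 bits per weight. As an upper-bound simplification, if all 10^12 parameters were expert weights, that would be about 531 GB, against roughly 1 TB at FP8.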

The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone isn’t enough. The 1000 TPS result needs all three aligned tightly.

DFlash: Parallel Drafting Without a Serial Bottleneck

Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps the output distribution identical to normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction: the draft model fills an entire block of masked positions in a single forward pass.
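The verification step can be illustrated with the standard speculative sampling rule, which is independent of how the draft block was produced. The sketch below is a toy over a tiny vocabulary, assuming the draft and target distributions are already available as plain lists; `verify_draft` and `sample` are hypothetical names, and this is not the DFlash or MiMo implementation.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject a block of draft tokens with speculative sampling.

    draft_probs[i] and target_probs[i] are distributions over the vocabulary
    at draft position i; target_probs carries one extra entry that is used
    for the bonus token. Returns the tokens emitted in this round.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p = target_probs[i][token]  # target model probability
        q = draft_probs[i][token]   # draft model probability (nonzero)
        if rng.random() < min(1.0, p / q):
            emitted.append(token)   # the draft token survives verification
            continue
        # Rejected: resample from the residual max(0, p - q). The weights
        # do not need to be normalized for random.choices.
        residual = [max(0.0, pt - qt)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        emitted.append(sample(residual, rng))
        return emitted              # everything after a rejection is dropped
    # Every draft token was accepted: take one bonus token from the target.
    emitted.append(sample(target_probs[len(draft_tokens)], rng))
    return emitted


def sample(weights, rng):
    """Draw one token index in proportion to the given weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
target = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
draft = [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]
print(verify_draft([2, 0], draft, target, rng))
```

Accepting with probability min(1, p/q) and resampling from the residual is what makes the emitted tokens follow the target distribution, which is why a better draft changes only speed, not output quality.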

Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. Block size is capped at 8 to limit verification cost and boost concurrency.
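The constant-compute point can be seen by counting how many past positions a new token attends to. This is a small sketch, assuming a hypothetical window of 4096 tokens; the article does not state the draft model's actual window size.

```python
def attended_positions(context_len, window=None):
    """Number of past positions one new token attends to."""
    if window is None:
        return context_len           # full attention grows with the context
    return min(context_len, window)  # sliding window is capped at the window


WINDOW = 4096  # hypothetical value, not taken from the source

for n in (1_000, 10_000, 100_000):
    print(n, attended_positions(n), attended_positions(n, WINDOW))
```

Once the context is longer than the window, the per-token attention work of the draft stops growing, so drafting cost stays flat on long prompts.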

Acceptance length measures how many draft tokens survive verification each round.

| Scenario | Acceptance Length |
|---|---|
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |

In coding, six to seven of the eight draft tokens are accepted per round on average. Some samples reach a maximum of 7.14.
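The acceptance lengths above translate directly into a time budget per draft-plus-verify round. The sketch below assumes each round emits, on average, exactly the acceptance length in tokens; whether the reported figure also counts a bonus token from the target model is not stated, so treat the numbers as approximate.

```python
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def round_budget_ms(tokens_per_round, target_tps=1000.0):
    """Wall-clock time one draft-plus-verify round may take to hit target_tps."""
    rounds_per_second = target_tps / tokens_per_round
    return 1000.0 / rounds_per_second


for scenario, length in ACCEPTANCE_LENGTH.items():
    print(f"{scenario}: {round_budget_ms(length):.2f} ms per round")
```

At 1000 TPS the budget in milliseconds equals the acceptance length numerically, so a coding round has about 6.3 ms for drafting and verification combined, while an agent round has only about 4.3 ms.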

TileRT: Squeezing the Microseconds

At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fragment the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward.
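A back-of-the-envelope model shows why launch gaps matter at this speed. Every number in the sketch is hypothetical and chosen only to show the scaling: the operator count and per-launch cost are not from the source, and the 6.3 ms budget is roughly what 1000 TPS allows when about 6.3 tokens are emitted per round.

```python
def launch_overhead_ms(ops_per_round, launch_overhead_us):
    """Time spent only on launching kernels during one decode round."""
    return ops_per_round * launch_overhead_us / 1000.0


# Hypothetical values, for illustration only.
OPS_PER_ROUND = 1500      # operators launched per draft-plus-verify round
LAUNCH_OVERHEAD_US = 4.0  # per-launch cost in microseconds
ROUND_BUDGET_MS = 6.3     # time available per round

per_op = launch_overhead_ms(OPS_PER_ROUND, LAUNCH_OVERHEAD_US)
persistent = launch_overhead_ms(1, LAUNCH_OVERHEAD_US)
print(f"per-operator launches: {per_op:.1f} ms of {ROUND_BUDGET_MS} ms")
print(f"one resident kernel:   {persistent:.3f} ms of {ROUND_BUDGET_MS} ms")
```

Under these assumed numbers, launching each operator separately would consume almost the whole round before any useful work runs, whereas a kernel that stays resident pays the launch cost once.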

Use Cases

The release targets latency-sensitive work where waiting breaks the loop:

  • Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
  • Coding agents: faster code generation cuts the wait between agent steps.
  • Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
  • Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.

These are throughput-bound workloads where raw token speed is the binding constraint.

How It Compares

The first table contrasts the two routes to high decode speed: custom silicon and commodity GPUs.

| Approach | Hardware | How speed is achieved |
|---|---|---|
| Cerebras | Wafer-scale integration (custom) | Scale on a single custom wafer |
| Groq | Custom architecture | Pure on-chip SRAM |
| MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT |

The second table compares the standard model with the ExtremelySpeed mode.

| Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-ExtremelySpeed |
|---|---|---|
| Decode speed | Baseline | ~10× faster (1000+ TPS) |
| Price | Standard rate | 3× the standard rate |
| Weight precision | Standard | FP4 MoE Experts via QAT |
| Decoding | Standard autoregressive | DFlash speculative decoding |
| Access | Standard model plans | API only, application-based trial |
| Token Plan | Supported | Not supported |

Access, Pricing, and Open Source

ExtremelySpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for roughly 10× the speed. It is API only, and the Token Plan isn’t supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and release after 5 minutes idle. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.

Strengths and Limitations

Strengths

  • 1000+ TPS on a 1T model without custom silicon.
  • Lossless decoding through rejection sampling in DFlash.
  • FP4 applied only where tolerance is highest, preserving quality.
  • An open checkpoint lets the community test the claims.

Limitations

  • Access is gated, short, and approval-based at launch.
  • Pricing triples per token versus the standard model.
  • Acceptance length drops in open-ended conversation.
  • Independent third-party speed verification isn’t yet public.

Key Takeaways

  • Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
  • The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
  • FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
  • DFlash predicts an entire masked block per forward pass, hitting 6.30 average acceptance length in coding.
  • ExtremelySpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.

Marktechpost’s Visual Explainer

GUIDE • INFERENCE SYSTEMS

MiMo-V2.5-Pro-ExtremelySpeed: 1000+ Tokens Per Second on a 1T Model

Xiaomi MiMo & TileRT — FP4 quantization, DFlash speculative decoding, and a microsecond-scale runtime.

What It Is

  • Xiaomi’s MiMo team built it with the TileRT systems group.
  • It decodes over 1000 tokens/s on a 1-trillion-parameter model.
  • Demos show generation peaks near 1200 tokens/s.
  • It runs on commodity GPUs, a single standard 8-GPU node.
  • Released June 8, 2026.

1000+ tokens / second
1T parameters (MoE)
8 commodity GPUs

Three Layers Working Together

  • FP4 quantization shrinks weights and eases bandwidth pressure.
  • DFlash speculative decoding predicts many tokens in parallel.
  • TileRT executes the whole pipeline at microsecond scale.
  • Xiaomi calls this approach extreme model-system codesign.
  • No single technique is enough; all three must align.

Layer 1 — FP4 Quantization

  • Uses the MXFP4 format to lower memory and bandwidth cost.
  • Applied selectively to the MoE Experts only.
  • Other modules keep higher precision (FP8, per TileRT).
  • Experts hold most parameters and tolerate quantization best.
  • QAT keeps capability essentially on par with the original.

Layer 2 — DFlash Speculative Decoding

  • A research-community method using block-level masked parallel prediction.
  • The draft model fills an entire block in a single forward pass.
  • It uses Sliding Window Attention; block size is capped at 8.
  • Rejection sampling keeps the output lossless.

| Scenario | Acceptance Length |
|---|---|
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |

Layer 3 — TileRT Runtime

  • At 1000 TPS, each operator runs for only microseconds.
  • A Persistent Engine Kernel stays resident on the GPU.
  • Warp Specialization splits data movement, compute, and communication.
  • Small ops like RMSNorm and RoPE become bottlenecks here.
  • The runtime was co-designed with the FP4 and DFlash choices.

Where It Fits

  • Parallel reasoning: many Best-of-N or tree-search paths at once.
  • Coding agents: less wait between agent steps.
  • Real-time loops: trading signals, fraud interception, live dialogue.
  • Interactive prototyping: a Snake game in about 10 seconds.

Standard vs ExtremelySpeed

| Dimension | MiMo-V2.5-Pro | ExtremelySpeed |
|---|---|---|
| Decode speed | Baseline | ~10× (1000+ TPS) |
| Price | Standard rate | 3× the standard rate |
| Weights | Standard | FP4 MoE Experts (QAT) |
| Decoding | Autoregressive | DFlash speculative |
| Access | Standard plans | API only, by application |

Access, Pricing & Open Source

  • API trial runs June 9 to June 23, 2026 (Beijing time).
  • Pricing is 3× the standard rate for roughly 10× speed.
  • API only; the Token Plan isn’t supported.
  • Checkpoint open-sourced: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.
  • TileRT has open-sourced select modules on GitHub.


The post Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs appeared first on MarkTechPost.
