Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has released MiMo-V2.5-Pro-UltraSpeed, developed in collaboration with the TileRT team. It reportedly decodes at more than 1000 tokens per second on a trillion-parameter model, which the Xiaomi team describes as a first at that scale. Demos show a peak of around 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.
What is MiMo-V2.5-Pro-UltraSpeed
UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at the trillion-parameter scale. UltraSpeed targets decoding speed rather than model capability: it changes how fast the model generates output tokens, not what the model can do. The speedup comes from three strategies integrated across the model and the serving system. Xiaomi calls this approach model-system codesign. Most importantly, the entire stack runs on a single standard 8-GPU node.
The Speed Stack: Three Layers Working Together
The first layer is FP4 quantization. At the trillion-parameter scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly raises decode speed. Xiaomi uses the MXFP4 format and applies it only to the MoE experts. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most of the parameters and tolerate quantization better, so the tradeoff is attractive. Quantization-Aware Training (QAT) keeps benchmark quality in line with the original model.
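To make the format concrete, here is a minimal sketch of MXFP4-style block quantization. It assumes the common microscaling convention of 32-element blocks that share one power-of-two scale, with each element stored as a 4-bit E2M1 value. The function names are illustrative, the article does not describe Xiaomi's actual quantizer, and QAT is not shown.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (a sign bit is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements sharing one scale (assumed microscaling convention)
ELEMENT_EMAX = 2  # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands near the top of E2M1.
    shared_exp = math.floor(math.log2(amax)) - ELEMENT_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # clamp to the largest magnitude
        nearest = min(E2M1_VALUES, key=lambda c: abs(c - mag))  # simple nearest-value rounding
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from the shared exponent and element codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE):
    """4 bits per element plus one 8-bit shared scale per block."""
    return (4 * block_size + 8) / block_size


if __name__ == "__main__":
    weights = [0.1 * i for i in range(-16, 16)]
    exp, codes = quantize_block(weights)
    restored = dequantize_block(exp, codes)
    print("shared exponent:", exp)
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
    print("bits per weight:", bits_per_weight())
```

Under these assumptions each weight costs 4.25 bits including the shared scale, compared with 8 for FP8 and 16 for FP16, which is where the bandwidth saving comes from.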
The second layer is the DFlash speculative decoding architecture, covered in more detail below. The third layer is TileRT, a runtime that keeps the whole execution on the GPU. No single layer is enough on its own. The 1000 TPS result requires all three to be tightly co-designed.
DFlash: Parallel Drafting Without the Serial Bottleneck
Standard speculative decoding uses a small draft model to predict future tokens. The larger model then verifies those predictions in parallel. Rejection sampling keeps the output distribution identical to normal decoding, so quality is not lost. The problem is that the draft model still generates tokens one at a time. DFlash, an approach from the research community, removes that limitation. It uses block-level masked parallel prediction: the draft model fills an entire block of masked positions in one forward pass.
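The draft-then-verify loop can be sketched with toy stand-ins for both models. This is an illustration under stated assumptions, not DFlash itself: `target_next` and `draft_block` are hypothetical toy functions, and verification here uses greedy prefix matching rather than the rejection sampling described above. Greedy matching is the simplest variant that still reproduces plain greedy decoding exactly.

```python
def target_next(context):
    """Toy 'large model': a deterministic next token from the context (hypothetical)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context, block_size):
    """Toy block drafter (hypothetical): proposes block_size tokens in one call.

    A real block drafter would fill all positions in a single forward pass.
    Here we imitate an imperfect drafter by corrupting every fifth position.
    """
    ctx = list(context)
    proposal = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % 50  # injected draft error
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def greedy_decode(prompt, num_tokens):
    """Reference: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, num_tokens, block_size=8):
    """Draft a block, keep the matching prefix, fix the first mismatch."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_block(out, block_size)
        rounds += 1
        for tok in proposal:
            # A real system scores every drafted position in one parallel target pass.
            expected = target_next(out)
            out.append(expected)  # equals tok when the draft token is accepted
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
    return out[len(prompt):len(prompt) + num_tokens], rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode([1, 2, 3], 40)
    assert tokens == greedy_decode([1, 2, 3], 40)
    print("rounds used for 40 tokens:", rounds)
```

The output always matches plain decoding, and the number of verification rounds is what shrinks as the drafter gets more accurate.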
Xiaomi has tuned DFlash with the second-order Muon optimizer and a self-distillation scheme. The draft model uses only Sliding Window Attention (SWA), which is compatible with the MiMo-V2 design. This keeps the cost of each prediction constant rather than growing with context length. The block size is set to 8 to limit verification cost and stay compatible with high concurrency.
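The constant-cost property of sliding window attention can be shown with a small counting sketch. The window size of 128 used here is a placeholder, since the article does not state the drafter's actual window, and `attended_keys` is an illustrative helper rather than part of any released code.

```python
def attended_keys(position, window=None):
    """Number of keys visible to the query at 0-based `position` under causal attention.

    With window=None every earlier token is visible (full attention).
    With a sliding window only the most recent `window` tokens are visible.
    """
    full = position + 1
    return full if window is None else min(full, window)


if __name__ == "__main__":
    for pos in (10, 1_000, 100_000):
        print(pos, attended_keys(pos), attended_keys(pos, window=128))
```

Full attention work grows with the position, while the windowed count stops growing once the context exceeds the window.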
Acceptance length measures how many draft tokens survive verification each round.
| Scenario | Acceptance length |
|---|---|
| Coding | 6.30 |
| Statistics / Consulting | 5.56 |
| Agent | 4.29 |
In coding, six to seven of every eight draft tokens are accepted each round. Some samples reach 7.14.
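A quick way to read these numbers is to convert acceptance length into target-model verification passes. The sketch below assumes each round yields exactly the acceptance length in output tokens; whether the reported figure includes a bonus token from the verifier is not stated in the article, so treat the results as approximate.

```python
BLOCK_SIZE = 8  # draft block size reported in the article

# Two of the reported acceptance lengths.
ACCEPTANCE = {"coding": 6.30, "agent": 4.29}


def verification_passes(output_tokens, acceptance_length):
    """Approximate target-model passes needed to emit `output_tokens`."""
    return output_tokens / acceptance_length


def accepted_fraction(acceptance_length, block_size=BLOCK_SIZE):
    """Share of each drafted block that survives verification."""
    return acceptance_length / block_size


if __name__ == "__main__":
    for name, length in ACCEPTANCE.items():
        passes = verification_passes(1000, length)
        share = accepted_fraction(length)
        print(f"{name}: {passes:.1f} passes per 1000 tokens, {share:.0%} of each block kept")
```

Plain autoregressive decoding would need 1000 passes for the same 1000 tokens, so a higher acceptance length translates directly into fewer expensive passes.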
TileRT: Compressing Microseconds
At 1000 TPS, each operator has only microseconds to run. Traditional runtimes launch operators one by one, and each launch costs time. Those gaps break up the execution stream and become a real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to separate data movement, computation, and communication into coordinated roles. Even small operations like RMSNorm, RoPE, and KV cache writes become bottlenecks at this scale. The runtime was co-designed with the FP4 and DFlash choices, not added later.
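A back-of-the-envelope budget shows why launch gaps matter. The sketch below derives the time available per draft-and-verify round from the throughput and acceptance length quoted in this article. The launch count and per-launch overhead in the example are hypothetical placeholders chosen only for illustration, not measurements of any real system.

```python
def round_budget_us(tokens_per_second, acceptance_length):
    """Microseconds available for one draft-and-verify round."""
    return acceptance_length * 1_000_000 / tokens_per_second


def launch_overhead_share(launches_per_round, launch_overhead_us, budget_us):
    """Fraction of the round budget spent purely on launching operators."""
    return launches_per_round * launch_overhead_us / budget_us


if __name__ == "__main__":
    budget = round_budget_us(1000, 6.30)  # 1000 TPS at the coding acceptance length
    # Hypothetical: 1000 separate operator launches per round at 5 microseconds each.
    share = launch_overhead_share(1000, 5.0, budget)
    print(f"round budget: {budget:.0f} us, launch overhead share: {share:.0%}")
```

Under those placeholder numbers most of the round would go to launch overhead alone, which is the kind of cost a kernel that stays resident on the GPU is meant to avoid.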
Use Cases
The release targets latency-sensitive work where waiting breaks the loop:
- Parallel thinking: run multiple Best-of-N or tree search paths in the same wall-clock time.
- Code agents: generating code faster reduces the wait between agent steps.
- Real-time decisions: trading-signal generation, fraud detection, and live conversation.
- Interactive prototyping: demos show a Snake game generated in about 10 seconds and a macOS-style interface in about a minute.
These are output-bound workloads, where token speed is the limiting factor.
How It Compares
The first table compares the routes to extreme decoding speed.
| Approach | Hardware | How speed is achieved |
|---|---|---|
| Cerebras | Wafer-scale integration (custom) | Scale-up on a single custom wafer |
| Groq | Custom silicon | Pure on-chip SRAM |
| MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT |
The second table compares the standard model with the UltraSpeed mode.
| Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-UltraSpeed |
|---|---|---|
| Decoding speed | Baseline | ~10× faster (1000+ TPS) |
| Price | 1× | 3× |
| Weight precision | Standard | FP4 MoE experts with QAT |
| Decoding | Standard autoregressive | DFlash speculative decoding |
| Access | Standard model plans | API only, application-based trial |
| Token Plan | Supported | Not supported |
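The speed and price multipliers can be turned into a rough latency-versus-cost estimate. In the sketch below the baseline of 100 tokens per second is an assumption implied by "about 10× faster" and "1000+ TPS", not a published figure, and `ultraspeed_tradeoff` is an illustrative helper.

```python
def ultraspeed_tradeoff(output_tokens, base_tps, speedup=10.0, price_multiplier=3.0):
    """Compare wall-clock time and relative cost for one generation."""
    base_seconds = output_tokens / base_tps
    fast_seconds = output_tokens / (base_tps * speedup)
    return {
        "base_seconds": base_seconds,
        "fast_seconds": fast_seconds,
        "relative_cost": price_multiplier,  # cost per token relative to the standard model
    }


if __name__ == "__main__":
    # Assumed baseline of 100 TPS (hypothetical, see the paragraph above).
    result = ultraspeed_tradeoff(output_tokens=2000, base_tps=100.0)
    print(result)
```

For a 2000-token response under these assumptions, the wait drops from about 20 seconds to about 2 seconds while the token bill triples, which is the trade a latency-sensitive workload would be paying for.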
Access, Pricing, and Open Source
UltraSpeed is available in a limited, application-based window. The API trial runs from June 9 to June 23, 2026. The price is 3× the standard MiMo-V2.5-Pro price, for about 10× the speed. It is API only, and the Token Plan is not supported. Approved users also get free chat access during the trial. Chat limits apply: 10 daily queue entries, 30-minute sessions, and a 5-minute idle timeout. Xiaomi has open-sourced MiMo-V2.5-Pro-FP4-DFlash on Hugging Face for testing. TileRT has released a selection of modules as open source on GitHub.
Strengths and Limitations
Strengths
- 1000+ TPS on a 1T model without custom silicon.
- Lossless decoding via rejection sampling in DFlash.
- FP4 is applied only where tolerance is highest, preserving quality.
- Open-sourced test artifacts let the public check the claims.
Limitations
- Access is gated, short, and application-based at launch.
- The per-token price is triple that of the standard model.
- Acceptance length drops in open-ended conversation.
- Independent third-party speed verification is not yet public.
Key Takeaways
- Xiaomi MiMo and TileRT push a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
- The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
- FP4 (MXFP4) is applied only to the MoE experts; QAT preserves quality.
- DFlash predicts an entire masked block in one forward pass, reaching an average acceptance length of 6.30 for coding.
- UltraSpeed runs on a single 8-GPU node and is offered through an application-based API trial, June 9–23, 2026.