Xiaomi MiMo and TileRT Push 1-Trillion-Parameter Model Past 1000 Tokens per Second on Commodity GPUs

Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has just released MiMo-V2.5-Pro-UltraSpeed, developed in collaboration with the TileRT team. It records more than 1000 tokens per second on a 1-trillion-parameter model. The Xiaomi team describes this as a first at the trillion-parameter scale. Demos show a peak generation speed of around 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at the trillion-parameter scale. UltraSpeed targets generation speed rather than model capability: it changes how fast the model emits output tokens, not what the model can do. The speedup comes from three strategies integrated across the model and the serving system. Xiaomi calls this approach model-system codesign. Most importantly, the entire stack runs on a single standard 8-GPU node.

The Speed Stack: Three Layers Working Together

The first layer is FP4 quantization. At the trillion-parameter scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Weights with a lower bit width move through memory faster, which directly increases decode speed. Xiaomi uses the MXFP4 format and applies it to the MoE experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most of the parameters and tolerate quantization better, so the tradeoff is attractive. Quantization-Aware Training (QAT) keeps benchmark quality in line with the original model.
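To make the bandwidth argument concrete, here is a minimal sketch of block quantization in the style of MXFP4 as defined by the OCP Microscaling format: 32 four-bit E2M1 elements share one 8-bit power-of-two scale. The scale rule, the round-to-nearest choice, and all function names are illustrative assumptions, not Xiaomi's or TileRT's actual kernels.

```python
import math

# Magnitudes a 4-bit E2M1 element can represent (the sign bit is separate).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements that share one scale in MXFP4
ELEM_EMAX = 2     # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block(block):
    """Quantize one block; return (shared_scale, dequantized_values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale: align the block maximum with the top of the element range.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - ELEM_EMAX)
    dequantized = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to the largest element
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        dequantized.append(math.copysign(nearest, x) * scale)
    return scale, dequantized


def bits_per_weight(elem_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Average storage per weight, including the amortized shared scale."""
    return elem_bits + scale_bits / block_size


def weight_gigabytes(n_params, bits):
    """Storage for n_params weights at the given bit width, in GB (1e9 bytes)."""
    return n_params * bits / 8 / 1e9
```

Under these assumptions `bits_per_weight()` is 4.25, so a hypothetical 1e12 weights stored entirely in this format would take about 531 GB, against 1000 GB at FP8. The real saving is smaller, since only the MoE expert weights are quantized this way.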

The second layer is the DFlash speculative decoding architecture, which is covered in more detail below. The third layer is TileRT, a runtime that runs the entire pipeline on the GPU. No single technique is enough on its own. The 1000 TPS result requires all three to be tightly aligned.

DFlash: Parallel Drafting Without the Serial Bottleneck

Standard speculative decoding uses a small draft model to predict future tokens. The larger model then verifies those predictions in parallel. Rejection sampling keeps the output distribution the same as normal decoding, so quality is not lost. The problem is that the draft model still generates tokens one at a time. DFlash, an approach from the research community, removes that limitation. It uses block-level masked parallel prediction: the draft model fills an entire block of masked positions in one forward pass.
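The draft-then-verify step can be sketched as follows. This is the textbook speculative-sampling acceptance rule, not DFlash's or Xiaomi's actual code; the function name and the dictionary-based probability tables are hypothetical, and the bonus token normally sampled after a fully accepted block is omitted for brevity.

```python
import random


def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Verify one drafted block; return the tokens emitted this round.

    draft_tokens: token ids proposed by the draft model, in order.
    draft_probs[i] / target_probs[i]: dict mapping token id -> probability
    at block position i under the draft and target model respectively.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]              # draft probability of its own proposal
        p = target_probs[i].get(tok, 0.0)    # target probability of the same token
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)              # accepted: keep going through the block
            continue
        # Rejected: resample from the residual max(0, p - q), then stop the round.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        tokens = list(residual)
        weights = [residual[t] for t in tokens]
        emitted.append(rng.choices(tokens, weights=weights)[0])
        break
    return emitted
```

Accepting with probability min(1, p/q) and resampling from the residual on rejection is what makes the emitted tokens follow the target model's distribution, which is why the scheme is described as lossless.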

Xiaomi has tuned DFlash with a second-order Muon optimizer and model self-distillation. The draft model uses only Sliding Window Attention (SWA), which is compatible with the MiMo-V2 design. This keeps the cost of each prediction step constant rather than growing with context length. The block size is set to 8 to limit verification cost and to stay friendly to high concurrency.
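A small sketch shows why a sliding window keeps per-step attention cost flat. The window and context sizes below are hypothetical, since the source does not state the draft model's window length.

```python
def sliding_window_mask(seq_len, window):
    """Causal mask where position i attends only to the last `window` positions."""
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_per_token(context_len, window=None):
    """Keys a newly generated token attends to: full attention vs. SWA."""
    return context_len if window is None else min(context_len, window)


mask = sliding_window_mask(seq_len=6, window=3)
# Every row has at most `window` True entries, however long the sequence gets.
row_sizes = [sum(row) for row in mask]
```

With full attention, `attended_per_token` grows linearly with the context; with a window it is capped, so the draft model's work per step stops depending on how long the conversation already is.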

Acceptance length measures how many draft tokens survive validation each round.

Scenario	Acceptance length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

In coding, six to seven out of eight draft tokens are accepted each round. Some samples reach 7.14.
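As a rough worked example, the acceptance lengths above translate into far fewer passes through the large model. The sketch assumes each verification round emits exactly the reported acceptance length on average and ignores the draft model's own cost; both function names are hypothetical.

```python
import math


def verification_rounds(n_tokens, acceptance_length):
    """Target-model passes needed if each round emits acceptance_length tokens on average."""
    return math.ceil(n_tokens / acceptance_length)


def acceptance_ratio(acceptance_length, block_size=8):
    """Fraction of a drafted block that survives verification on average."""
    return acceptance_length / block_size


# Rounds needed for 1000 output tokens at the three reported acceptance lengths.
rounds = [verification_rounds(1000, a) for a in (6.30, 5.56, 4.29)]
```

Under that assumption, 1000 output tokens need 159, 180, and 234 verification rounds respectively, instead of 1000 sequential decode steps.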

TileRT: Compressing Microseconds

At 1000 TPS, each operator has only microseconds to run. Traditional runtimes launch operators one by one, and each launch costs time. Those gaps break the execution stream and become a real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to separate data movement, computation, and communication into coordinated roles. Small operators like RMSNorm, RoPE, and KV cache writes become bottlenecks at this scale. The runtime was co-designed with the FP4 and DFlash choices, not added later.
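A back-of-the-envelope budget shows why launch gaps matter at this speed. The layer count, operators per layer, and per-launch gap below are made-up illustrative values, not figures from Xiaomi or TileRT; only the 1000 TPS target and the 6.30 acceptance length come from the article.

```python
def round_budget_us(tokens_per_second, tokens_per_round):
    """Wall-clock time available for one draft-and-verify round, in microseconds."""
    return tokens_per_round / tokens_per_second * 1e6


def launch_gap_share(n_launches, gap_us, budget_us):
    """Fraction of the round budget spent purely on launch gaps."""
    return n_launches * gap_us / budget_us


budget = round_budget_us(1000, 6.30)   # about 6300 microseconds per round
# Hypothetical workload: 60 layers x 20 operators, 5 microseconds of gap per launch.
share = launch_gap_share(60 * 20, 5.0, budget)
```

With these hypothetical numbers, launch gaps alone would consume roughly 95% of the round budget before any useful work is done, which is the kind of overhead a kernel that stays resident on the GPU is meant to remove.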

Use Cases

The release targets latency-sensitive work where waiting breaks the loop:

Parallel thinking: run multiple Best-of-N or tree-search paths in the same wall-clock time.
Code agents: faster code generation reduces the wait between agent steps.
Real-time decisions: trading signal generation, fraud detection, and live conversation.
Interactive prototyping: demos show a Snake game built in about 10 seconds and a macOS-style interface in about a minute.

These are output-bound workloads, where token generation speed is the limiting factor.

How It Compares

The first table compares different routes to extreme decoding speed.

Approach	Hardware	How speed is achieved
Cerebras	Wafer-scale integration (custom)	Scale-up on a single custom wafer
Groq	Custom chips	Pure on-chip SRAM
MiMo × TileRT	Commodity GPUs (8-GPU node)	Model-system codesign: FP4 + DFlash + TileRT

The second table compares the standard model with the UltraSpeed mode.

Dimension	MiMo-V2.5-Pro	MiMo-V2.5-Pro-UltraSpeed
Decode speed	Baseline	~10× faster (1000+ TPS)
Price	1×	3×
Weight precision	Standard	FP4 MoE experts with QAT
Decoding	Standard autoregressive	DFlash speculative decoding
Access	Standard model plans	API only, application-based trial
Token Plan	Supported	Not supported

Access, Pricing, and Open Source

UltraSpeed runs in a limited, application-based window. The API trial runs from June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for about 10× the speed. It is API only, and the Token Plan is not supported. Approved users also get free chat access during the trial. Chat limits apply: 10 daily queue entries, 30-minute sessions, and a 5-minute idle timeout. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced selected modules on GitHub.

Strengths and Limitations

Strengths

1000+ TPS on a 1T model without custom silicon.
Lossless decoding through rejection sampling in DFlash.
FP4 is applied only where tolerance is highest, preserving quality.
An open checkpoint lets the public test the claims.

Limitations

Access is gated, short, and application-based at launch.
The price per token is triple that of the standard model.
Acceptance length drops on open-ended conversation.
Independent third-party speed verification is not yet public.

Key Takeaways

Xiaomi MiMo and TileRT push a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
FP4 (MXFP4) is applied to the MoE experts only; QAT preserves capability.
DFlash predicts an entire masked block in one forward pass, reaching an average acceptance length of 6.30 on coding.
UltraSpeed runs on a single 8-GPU node, with an application-based API trial from June 9–23, 2026.

Marktechpost Visual Explainer

01 / 08

What It Is

Built by Xiaomi’s MiMo team together with the TileRT team.
Records more than 1000 tokens/s on a 1-trillion-parameter model.
Demos show a peak generation speed of around 1200 tokens/s.
Runs on commodity GPUs: one standard 8-GPU node.
Released on June 8, 2026.

1000+ tokens / second

1T parameters (MoE)

8 commodity GPUs

02 / 08

Three Layers Work Together

FP4 quantization shrinks weights and reduces bandwidth pressure.
DFlash speculative decoding predicts multiple tokens in parallel.
TileRT runs the entire pipeline at microsecond scale.
Xiaomi calls this approach model-system codesign.
No single technique is sufficient; all three must align.

03 / 08

Layer 1 – FP4 Quantization

Uses the MXFP4 format to reduce memory and bandwidth costs.
Applied selectively, to the MoE experts only.
Other modules keep higher precision (FP8, per TileRT).
Experts hold most of the parameters and tolerate quantization best.
QAT keeps capability on par with the original.

04 / 08

Layer 2 – DFlash Speculative Decoding

A research-community approach using block-level masked parallel prediction.
The draft model fills the entire block in one forward pass.
Uses Sliding Window Attention; the block size is set to 8.
Rejection sampling keeps the output lossless.

Scenario	Acceptance length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

05 / 08

Layer 3 – TileRT Runtime

At 1000 TPS, each operator has only microseconds to run.
The Persistent Engine Kernel stays resident on the GPU.
Warp Specialization separates data movement, computation, and communication.
Small ops like RMSNorm and RoPE become bottlenecks here.
The runtime was co-designed with the FP4 and DFlash choices.

06 / 08

Where It Fits

Parallel thinking: multiple Best-of-N or tree-search paths at once.
Code agents: less waiting between agent steps.
Real-time loops: trading, fraud detection, live chat.
Interactive prototyping: Snake game in about 10 seconds.

07 / 08

Standard vs UltraSpeed

Dimension	MiMo-V2.5-Pro	UltraSpeed
Decode speed	Baseline	~10× (1000+ TPS)
Price	1×	3×
Weights	Standard	FP4 MoE experts (QAT)
Decoding	Autoregressive	DFlash speculative
Access	Standard plans	API only, by application

08 / 08

Access, Pricing, and Open Source

The API trial runs from June 9 to June 23, 2026 (Beijing time).
Pricing is 3× the standard rate at about 10× the speed.
API only; the Token Plan is not supported.
Open-source checkpoint: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.
TileRT has open-sourced selected modules on GitHub.

