DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

Autoregressive large language models generate text one token at a time. Each token waits for the one before it. This serial loop leaves modern GPUs underutilized and keeps inference slow. The cost gets worse with long Chain-of-Thought reasoning models, whose long outputs make latency a dominant part of generation.

Speculative decoding is a standard fix. A small draft model proposes future tokens. A large target model verifies those tokens in parallel. Only accepted tokens are kept, so output quality is unchanged. But many methods, including the state-of-the-art EAGLE-3, still draft autoregressively. That serial drafting caps real-world speedup at around 2–3×.
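The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration under greedy decoding, not DFlash or EAGLE-3 code: `verify_block`, `toy_target`, and `toy_draft_block` are hypothetical names, and the toy target simply counts upward modulo 10. It shows why the output is lossless: every kept token is one the target would have produced itself.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def verify_block(prefix: List[Token], block: List[Token],
                 target_next: NextToken) -> List[Token]:
    """Return the tokens to append after checking a drafted block against the target."""
    context = list(prefix)
    accepted: List[Token] = []
    for proposed in block:
        # A real engine scores every block position in one parallel target pass;
        # this loop only mimics the resulting accept/reject decisions.
        expected = target_next(context)
        if proposed != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    accepted.append(target_next(context))  # whole block accepted: target adds one more token
    return accepted


def toy_target(context: List[Token]) -> Token:
    """Hypothetical target model: the next token is the previous one plus 1, modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical drafter: right for two tokens, deliberately wrong afterwards."""
    block, last = [], context[-1]
    for i in range(block_size):
        last = (last + 1) % 10 if i < 2 else (last + 2) % 10
        block.append(last)
    return block


def speculative_decode(prefix: List[Token], n_new: int, block_size: int) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(verify_block(out, toy_draft_block(out, block_size), toy_target))
    return out[:len(prefix) + n_new]


def greedy_decode(prefix: List[Token], n_new: int) -> List[Token]:
    out = list(prefix)
    for _ in range(n_new):
        out.append(toy_target(out))
    return out


# Speculative decoding reproduces plain greedy decoding token for token.
assert speculative_decode([3], 12, 4) == greedy_decode([3], 12)
```

In this toy run each verification step keeps three tokens (two accepted drafts plus the target's correction), so the loop needs fewer target steps than token-by-token decoding while producing the same sequence.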

DFlash, presented by a team of researchers from UC San Diego (z-lab), takes a different route. It is a lightweight block diffusion model built for drafting. Instead of drafting tokens one at a time, it proposes a complete block in one pass. The target model then verifies that block in parallel.

The research team reports more than 6× lossless speedup across a range of models and tasks. It reaches 2.5× higher speedup than EAGLE-3. On NVIDIA Blackwell, the NVIDIA engineering team reports up to 15× higher throughput on gpt-oss-120b. That number is measured at the same user interactivity target.

How block diffusion changes drafting

Block diffusion models denoise a block of masked tokens at a time. They combine parallel generation with an autoregressive block structure. DFlash uses this approach only in the drafting phase. Verification remains with a trusted autoregressive target model.

This separation matters for quality. Standalone diffusion LLMs tend to trail autoregressive models in accuracy. They also require multiple denoising steps, which slows down their raw inference speed. DFlash sidesteps both problems. The draft only needs to be good enough to be accepted. Parallel verification by the target guarantees the distribution of the final output.

The second benefit is drafting cost. The cost of an autoregressive drafter grows in proportion to the number of speculative tokens. The diffusion drafter generates all tokens in a single forward pass, so drafting latency stays nearly flat as the block grows. This frees DFlash to use deeper, more expressive draft models without adding latency.
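The cost argument above can be made concrete with a back-of-the-envelope model. The sketch below is an idealization, not a measurement: all latencies are hypothetical placeholder numbers, the function names are invented for illustration, and the speedup estimate assumes that verifying a block costs the same as one ordinary target decoding step.

```python
def autoregressive_draft_ms(num_tokens: int, ms_per_token: float) -> float:
    """Serial drafter: one forward pass per speculative token."""
    return num_tokens * ms_per_token


def block_draft_ms(num_tokens: int, ms_per_pass: float) -> float:
    """Block drafter: one forward pass regardless of block size (idealized)."""
    return ms_per_pass


def estimated_speedup(acceptance_length: float, draft_ms: float, target_ms: float) -> float:
    """Tokens gained per cycle divided by the cycle's cost in target-step units.

    Assumes one cycle = one draft call + one parallel verification pass, and that
    the verification pass costs the same as a single target decoding step.
    """
    return acceptance_length * target_ms / (draft_ms + target_ms)


# Hypothetical latencies, chosen only to show the trend.
for k in (4, 8, 16):
    serial = autoregressive_draft_ms(k, ms_per_token=1.0)
    block = block_draft_ms(k, ms_per_pass=2.0)
    print(f"k={k:2d}  serial draft: {serial:4.1f} ms   block draft: {block:4.1f} ms")
```

Under this model, a serial drafter's cost eats into the gain as the speculation length grows, while a block drafter can spend its fixed budget on more layers instead.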

This separates DFlash from previous diffusion-based work. Methods such as DiffuSpec and SpecDiff-2 used large 7B drafters, capping speedups near 3–4×. DFlash instead uses a small five-layer drafter (eight layers for Qwen3-Coder).

The “target knows best” insight

The main idea of DFlash is simple: the target knows best. The hidden features of large autoregressive models encode information about multiple future tokens. DFlash extracts hidden states from several target layers and fuses them into a single target context feature. This feature then conditions the draft model.

DFlash injects this feature differently than EAGLE-3. EAGLE-3 fuses target features into the draft input embeddings only. As the draft depth increases, that signal is diluted. DFlash instead injects the feature into the Key and Value projections of every draft layer. The context features stay in the draft KV cache and persist throughout drafting.
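One way to picture the KV injection described above is a draft attention layer whose keys and values come from both the target context feature and the draft block. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses a single head, no masking, plain Python lists, shared projection weights for context and draft tokens, and hypothetical names throughout.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention_with_injected_kv(draft_h: Matrix, target_ctx: Matrix,
                               w_q: Matrix, w_k: Matrix, w_v: Matrix
                               ) -> Tuple[Matrix, Matrix]:
    """Draft tokens attend over [target context ; draft block] keys and values."""
    q = matmul(draft_h, w_q)
    # List concatenation stacks the context rows ahead of the draft rows.
    k = matmul(target_ctx, w_k) + matmul(draft_h, w_k)
    v = matmul(target_ctx, w_v) + matmul(draft_h, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights


def run_draft_layers(draft_h: Matrix, target_ctx: Matrix,
                     layers: List[Tuple[Matrix, Matrix, Matrix]]) -> Matrix:
    """Every layer sees the same target context, so the signal is not diluted with depth."""
    for w_q, w_k, w_v in layers:
        out, _ = attention_with_injected_kv(draft_h, target_ctx, w_q, w_k, w_v)
        draft_h = [[h + o for h, o in zip(h_row, o_row)]  # residual connection
                   for h_row, o_row in zip(draft_h, out)]
    return draft_h


identity = [[1.0, 0.0], [0.0, 1.0]]
target_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 context positions (toy values)
draft_h = [[0.5, 0.5], [1.0, 0.0]]                  # 2 draft positions (toy values)
hidden = run_draft_layers(draft_h, target_ctx, [(identity,) * 3, (identity,) * 3])
```

The contrast with an embedding-only scheme is that `target_ctx` is an argument to every layer here, rather than being mixed into `draft_h` once before the first layer.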

This KV injection allows the acceptance length to scale with the draft depth. A five-layer DFlash drafter generating 16 tokens outperforms EAGLE-3 generating 8 tokens, with both lower latency and higher acceptance in the paper's tests. The draft model effectively becomes a diffusion adapter over the target.

Two speedup numbers, measured differently

The 6× figure from the DFlash paper is a lossless single-stream speedup. On Qwen3-8B with greedy decoding (Transformers backend), DFlash averages a 4.86× speedup. EAGLE-3 averages 1.76× for tree size 16 and 2.02× for tree size 60. DFlash peaks at 6.08× on MATH-500 with an average acceptance length of τ = 7.87, and averages τ = 6.49 across all tasks.

NVIDIA’s 15× is a throughput result at a fixed level of interactivity. It comes from running gpt-oss-120b on eight NVIDIA Blackwell GPUs in a DGX B300 system, using TensorRT-LLM. In the range of 500–600 tokens/second per user, DFlash delivers more than 15× the throughput of autoregressive decoding. That is about 1.5× more than EAGLE-3 at the same operating point.

The table below shows the paper's per-task speedups on Qwen3-8B at temperature 0 (Transformers backend).

Task (Qwen3-8B, temp=0) | Baseline | EAGLE-3 (16) | DFlash (16) | DFlash τ
GSM8K                   | 1.00×    | 1.94×        | 5.15×       | 6.54
MATH-500                | 1.00×    | 1.81×        | 6.08×       | 7.87
AIME25                  | 1.00×    | 1.79×        | 5.62×       | 7.08
HumanEval               | 1.00×    | 1.89×        | 5.14×       | 6.50
MBPP                    | 1.00×    | 1.69×        | 4.65×       | 5.95
LiveCodeBench           | 1.00×    | 1.57×        | 5.51×       | 7.27
MT-Bench                | 1.00×    | 1.63×        | 2.75×       | 4.24
Average                 | 1.00×    | 1.76×        | 4.86×       | 6.49
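To read the table as a head-to-head comparison, the DFlash column can be divided by the EAGLE-3 column. The short sketch below does only that arithmetic on the figures reported above; the `rows` dictionary and `relative_gain` helper are illustrative names, and the ratios are derived values, not numbers reported by the paper.

```python
# (EAGLE-3 speedup, DFlash speedup) per task, copied from the table above.
rows = {
    "GSM8K": (1.94, 5.15),
    "MATH-500": (1.81, 6.08),
    "AIME25": (1.79, 5.62),
    "HumanEval": (1.89, 5.14),
    "MBPP": (1.69, 4.65),
    "LiveCodeBench": (1.57, 5.51),
    "MT-Bench": (1.63, 2.75),
    "Average": (1.76, 4.86),
}


def relative_gain(eagle3_speedup: float, dflash_speedup: float) -> float:
    """How many times faster DFlash is than EAGLE-3 on the same task."""
    return dflash_speedup / eagle3_speedup


for task, (eagle3, dflash) in rows.items():
    print(f"{task:14s} {relative_gain(eagle3, dflash):.2f}x")
```

The ratio is smallest on MT-Bench, the one conversational benchmark in the list, and largest on the math and coding tasks.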

A separate NVIDIA Speed-Bench comparison measures speedup against the same autoregressive baseline. On gpt-oss-120b, DFlash averages 2.3× compared to EAGLE-3’s 1.7×. On Llama 3.1 8B Instruct, DFlash averages 2.8× compared to EAGLE-3’s 2.2×.

Use cases with examples

DFlash targets latency-sensitive workloads where token-by-token generation is the bottleneck. Three patterns fit well:

  • Coding agents: Coding requires fast, interactive responses. On Gemma 4 31B with vLLM, NVIDIA reports up to 5.8× on Math500 at concurrency 1, and up to 5.6× on HumanEval. Faster drafts mean shorter wait times within agent loops.
  • Reasoning models: Long Chain-of-Thought traces dominate generation time. With thinking mode enabled, DFlash delivers about a 4.5× speedup under greedy decoding on Qwen3-4B and Qwen3-8B. Under sampling, it holds at about 3.9×. This cuts the cost of long outputs.
  • Serving and throughput: DFlash also raises serving throughput. In SGLang on a B200 GPU, it reaches 5.1× on Qwen3-8B (Math500, concurrency 1). Gains narrow as concurrency rises but stay positive, so serving costs go down.

Using DFlash

DFlash ships with checkpoints and framework support, so adoption requires little code. In vLLM, you swap EAGLE-3 for DFlash in the speculative config. No application redesign is required.

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

The Transformers backend supports Qwen3 and LLaMA-3.1 models. It exposes a spec_generate call that pairs the draft model with the target model.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Load the DFlash drafter; trust_remote_code=True lets the checkpoint's custom model code run.
draft = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True,
    dtype="auto", device_map="cuda:0").eval()
# Load the unmodified target model and its tokenizer.
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Build the chat prompt with thinking mode turned off.
messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    enable_thinking=False).to(draft.device)

# Draft with the DFlash model and verify with the target (temperature 0, i.e. greedy).
output = draft.spec_generate(
    input_ids=input_ids, max_new_tokens=2048, temperature=0.0,
    target=target, stop_token_ids=[tokenizer.eos_token_id])
print(tokenizer.decode(output[0], skip_special_tokens=False))

Key Takeaways

  • DFlash drafts an entire token block in one forward pass, not one token at a time.
  • It injects the target's hidden features into the KV cache of every draft layer, so acceptance length scales with depth.
  • Research paper metrics: up to 6.08× lossless speedup on Qwen3-8B. NVIDIA test: up to 15× throughput on Blackwell at a fixed level of interactivity.
  • The lightweight five-layer drafter replaces the 7B drafters that held earlier diffusion-based methods to around 3–4×.
