🏎️ Speculative Decoding
Small model guesses 5 tokens, big model checks all 5 at once
Autoregressive generation is inherently sequential — each token needs the previous one. Speculative decoding breaks this bottleneck: a small, fast draft model proposes K tokens, and the big target model verifies all K in a single forward pass. Accepted tokens are kept, rejected ones are resampled. The output distribution is mathematically identical to the target model — lossless speedup.
Draft → Verify → Accept
What you're seeing: The full draft-verify pipeline. The small model drafts 5 tokens sequentially (fast, one at a time). The big model verifies all 5 in a single parallel forward pass. 4 are accepted; the 5th diverges and gets resampled.
Key insight: Verification is parallel (1 forward pass) while generation is sequential (K passes) — net speedup whenever enough drafts are accepted to outweigh the drafting overhead.
Standard vs. Speculative Decoding
What you're seeing: Side-by-side comparison of standard decoding (one token per big-model forward pass) vs. speculative decoding (draft 5 tokens with a small model, verify all at once with the big model).
What to notice: The draft model guesses "famous" but the target model prefers "known" — that token gets rejected and resampled. The first 4 tokens are accepted for free.
Standard Decoding (1 token/step)
Speculative Decoding (draft + verify)
The Intuition
The problem: Generating one token from a 70B model requires reading all 140 GB of weights (FP16) from GPU memory. At 2 TB/s bandwidth, that's ~70 ms per token. The GPU's compute units are 95% idle, just waiting for data.
The key insight: Verifying K tokens costs almost the same as generating 1 token. Why? The bottleneck is reading weights from memory, and you read them once regardless of whether you're processing 1 token or K tokens. The extra compute for K tokens is free because the GPU was underutilized.
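A back-of-the-envelope check of those numbers (a sketch under stated assumptions: 70B FP16 parameters, 2 TB/s of HBM bandwidth, roughly 2 FLOPs per parameter per token, and an assumed ~1 PFLOP/s of usable compute):

# Back-of-the-envelope: why verifying K tokens is ~free at batch size 1.
# Assumed numbers: 70B params in FP16, 2 TB/s HBM, ~1 PFLOP/s of usable compute.
weights_bytes = 70e9 * 2          # 140 GB of FP16 weights
bandwidth = 2e12                  # 2 TB/s
read_time = weights_bytes / bandwidth   # ~0.07 s = 70 ms per forward pass

flops_per_token = 2 * 70e9        # ~2 FLOPs per parameter per token (matmuls)
compute_rate = 1e15               # assumed usable FLOP/s
for k in (1, 5):
    compute_time = k * flops_per_token / compute_rate
    print(f"K={k}: weight read {read_time*1e3:.0f} ms, compute {compute_time*1e3:.2f} ms")
# Weight reads dominate either way, so processing 5 tokens costs about the same as 1.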
Draft-then-verify pipeline: (1) A small, fast draft model generates K candidate tokens autoregressively. (2) The target model (70B+) processes the prompt + K candidates in one forward pass, producing probability distributions for each position. (3) Accept/reject: for each candidate, compare the draft and target distributions using rejection sampling. Accepted tokens are kept; the first rejected token is resampled from the target distribution, and remaining candidates are discarded.
Why it's lossless: This is not an approximation. The acceptance-rejection scheme is mathematically equivalent to sampling directly from the target model. If the draft model perfectly matches the target, all tokens are accepted (maximum speedup). If the draft model is completely wrong, every token is rejected and resampled — you still get the correct output, just without speedup.
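A quick empirical check of the losslessness claim on a toy three-token vocabulary (a sketch added for illustration; the distributions are made up): sampling via the draft-propose / accept-or-resample rule reproduces the target distribution p even when the draft q is badly mismatched.

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # target distribution (toy 3-token vocab)
q = np.array([0.2, 0.3, 0.5])   # deliberately mismatched draft distribution

def speculative_sample(p, q, rng):
    x = rng.choice(len(q), p=q)                    # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):       # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)              # otherwise resample from
    return rng.choice(len(p), p=residual / residual.sum())  # norm(max(0, p - q))

samples = [speculative_sample(p, q, rng) for _ in range(100_000)]
print(np.bincount(samples) / len(samples))   # ~[0.5, 0.3, 0.2], matches p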
Variants: Beyond the standard draft-model approach, several self-speculative methods eliminate the need for a separate draft model. Medusa adds multiple parallel prediction heads to the target model, each predicting a future token position. EAGLE trains an autoregressive head on the target's hidden states for higher acceptance rates. Lookahead decoding uses Jacobi iteration to generate multiple tokens in parallel without any draft model at all — treating autoregressive generation as a fixed-point problem.
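A hedged sketch of the Medusa head idea (simplified from Cai et al., 2024; layer sizes and shapes here are illustrative assumptions): a few extra heads sit on top of the target model's last hidden state, each predicting the token at a different future offset.

import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Sketch of Medusa-style parallel prediction heads (simplified).
    Head i predicts the token at offset +(i+1) from the last generated position.
    """
    def __init__(self, d_model: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                          nn.Linear(d_model, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (B, d_model), the target model's hidden state at the last position
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)  # (B, num_heads, vocab)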
Prompt lookup decoding (Saxena 2023) is a zero-overhead variant that needs no draft model at all. Instead of sampling from a smaller model, it scans the input prompt for n-grams that match the most recently generated tokens and copies candidate tokens directly from the prompt. For tasks where the output re-uses input text verbatim — code editing, document summarization, retrieval-augmented generation, fill-in-the-middle — acceptance rates can reach 60–80%, giving 2–3× speedup with zero additional parameters or memory. The entire draft phase costs only a string search over the prompt. This is now available as prompt_lookup_num_tokens in HuggingFace generate(). For general text where the output diverges from the prompt, it degrades gracefully to standard decoding with negligible overhead.
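A minimal sketch of the lookup step, assuming plain token-ID lists as input (illustrative only; the HuggingFace implementation differs in detail): find the trailing n-gram of the generated text inside the prompt and copy the tokens that follow it as the draft.

def prompt_lookup_draft(prompt_ids, generated_ids, ngram_size=3, num_draft=5):
    """Draft tokens by matching the trailing n-gram of the output against the prompt.
    Returns up to num_draft candidate tokens copied from the prompt, or [] if no match.
    """
    context = prompt_ids + generated_ids
    key = tuple(context[-ngram_size:])            # most recent n-gram
    # Scan the prompt right-to-left for the same n-gram
    for start in range(len(prompt_ids) - ngram_size - 1, -1, -1):
        if tuple(prompt_ids[start:start + ngram_size]) == key:
            copy_from = start + ngram_size
            return prompt_ids[copy_from:copy_from + num_draft]
    return []   # no match: fall back to standard decoding for this step

# Example: the model is re-emitting a span that already appears in the prompt
prompt = [11, 42, 7, 99, 23, 8, 15]
generated = [42, 7, 99]
print(prompt_lookup_draft(prompt, generated))   # -> [23, 8, 15]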
Quick check
A 70B FP16 model on a 2 TB/s GPU takes ~70 ms per token. What hardware property makes this the same cost as verifying K=5 draft tokens simultaneously?
Why is speculative decoding lossless (identical output distribution to the target model)?
Step-by-Step Derivation
Acceptance Probability
For a draft token x with draft probability q(x) and target probability p(x), the acceptance probability is min(1, p(x) / q(x)).
If p(x) ≥ q(x), the token is always accepted. The target model "likes" it at least as much as the draft.
Rejection Resampling Distribution
When a draft token is rejected, sample a replacement from the residual distribution p'(x) ∝ max(0, p(x) − q(x)), i.e. max(0, p − q) renormalized to sum to 1.
This corrects for the draft model's bias: tokens the target prefers more than the draft get boosted in the resampling distribution.
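A tiny worked example over a three-token vocabulary (numbers chosen for illustration only), tying the two formulas together:

import numpy as np

p = np.array([0.2, 0.5, 0.3])    # target distribution
q = np.array([0.4, 0.4, 0.2])    # draft distribution; suppose the draft sampled token 0

# Acceptance probability for token 0: min(1, p/q) = min(1, 0.2/0.4) = 0.5
accept_prob = min(1.0, p[0] / q[0])

# If rejected, resample from max(0, p - q), renormalized:
residual = np.maximum(p - q, 0.0)          # [0.0, 0.1, 0.1]
resample_dist = residual / residual.sum()  # [0.0, 0.5, 0.5]
print(accept_prob, resample_dist)
# Tokens the target likes more than the draft (1 and 2) get all the resampling mass;
# token 0, which the draft over-proposed, can never be chosen on a rejection.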
Expected Speedup
With average acceptance rate α, K draft tokens, and draft-to-target latency ratio c, the expected speedup is (1 − α^(K+1)) / ((1 − α)(1 + cK)).
When the draft model is very fast (c ≈ 0) and acceptance is high (α → 1), the speedup approaches K + 1. In practice, expect roughly 2–3× with α ≈ 0.7 and K = 5.
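Plugging in numbers (a quick sanity check of the formula, not a benchmark):

def expected_speedup(alpha, K, c):
    """Expected speedup = (1 - alpha^(K+1)) / ((1 - alpha) * (1 + c*K))."""
    return (1 - alpha ** (K + 1)) / ((1 - alpha) * (1 + c * K))

for alpha, K, c in [(0.7, 5, 0.0), (0.7, 5, 0.05), (0.9, 5, 0.05), (0.2, 5, 0.0)]:
    print(f"alpha={alpha}, K={K}, c={c}: {expected_speedup(alpha, K, c):.2f}x")
# alpha=0.7 with a near-free draft gives ~2.9x; a poorly matched draft (alpha=0.2)
# gives only ~1.25x, barely better than standard decoding.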
Speculative Decoding Loop (Pseudo-code)
def speculative_decode(target_model, draft_model, prompt, K=5):
    tokens = prompt
    while not done:
        # 1. Draft: small model generates K candidates autoregressively
        draft_tokens, draft_dists = [], []
        for _ in range(K):
            p_draft = draft_model(tokens + draft_tokens)   # distribution over vocab
            t = sample(p_draft)
            draft_tokens.append(t)
            draft_dists.append(p_draft)                    # keep the full distribution
        # 2. Verify: target model scores ALL candidates in one forward pass
        target_dists = target_model(tokens + draft_tokens)  # parallel!
        # 3. Accept/reject via rejection sampling
        accepted = []
        for i, t in enumerate(draft_tokens):
            r = random.uniform(0, 1)
            if r < min(1, target_dists[i][t] / draft_dists[i][t]):
                accepted.append(t)                         # accept draft token
            else:
                # Resample from adjusted distribution max(0, p - q), renormalized
                residual = max(0, target_dists[i] - draft_dists[i])
                accepted.append(sample(residual / residual.sum()))
                break                                      # discard remaining drafts
        tokens.extend(accepted)
    return tokens
PyTorch implementation
# Speculative decoding acceptance step: rejection sampling
import torch
def acceptance_step(p_target: torch.Tensor, p_draft: torch.Tensor, draft_token: int):
    """
    p_target, p_draft: probability vectors over vocab (shape: [vocab_size])
    Returns: (accepted: bool, next_token: int)
    """
    alpha = min(1.0, p_target[draft_token].item() / p_draft[draft_token].item())
    if torch.rand(1).item() < alpha:
        return True, draft_token  # accept draft token
    # Resample from residual distribution: max(0, p_target - p_draft)
    residual = torch.clamp(p_target - p_draft, min=0.0)
    total = residual.sum()
    # Degenerate case: p_draft >= p_target everywhere → fall back to p_target
    residual = residual / total if total > 0 else p_target
    return False, torch.multinomial(residual, num_samples=1).item()
Quick check
A draft token x has q(x)=0.4 but p(x)=0.2. It is rejected. Why must the resampled token come from max(0, p−q) normalized — not from p directly?
Real-World Numbers
| System | Method | Speedup | Details |
|---|---|---|---|
| Google PaLM | Standard spec. decoding | 2–3× | Leviathan et al. (2023), first large-scale deployment |
| Medusa | Parallel heads + tree verify | 2.2–3.6× | No separate draft model — adds heads to the target model, tree-structured attention for verification |
| EAGLE | Feature-level drafting | 2.5–3.5× | Drafts in feature space (hidden states), not token space — higher acceptance rate than Medusa |
| EAGLE-2 | Dynamic draft trees | 3.5–4.5× | Context-aware draft tree selection — expands high-confidence branches, prunes low-confidence ones |
| Typical: Llama-70B | Llama-7B draft | 2–3× | Same model family, ~70% acceptance on general text |
Quick check
Medusa reports 2.2–3.6× speedup; EAGLE reports 2.5–3.5×. Both eliminate external draft models. What accounts for EAGLE's higher acceptance rate?
SOTA 2024–2025: EAGLE-2 and EAGLE-3
The EAGLE line (SafeAI Lab, github.com/SafeAILab/EAGLE) has become the production standard for self-speculative decoding, integrated natively in both vLLM and SGLang.
EAGLE-2 (EMNLP 2024, arxiv:2406.16858)
EAGLE-1 used a static draft tree — a fixed branching structure decided at model-init time. EAGLE-2 replaces this with dynamic draft trees: at each step, the draft head scores candidate tokens and expands only high-confidence branches, pruning low-confidence ones. This context-aware pruning raises the effective acceptance rate. Result: 3.5–4.5× over baseline autoregressive decoding, up from EAGLE-1's 2.5–3.5×.
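A heavily simplified sketch of this idea (the draft_topk interface, budget, and branching factor are illustrative assumptions, not the paper's actual algorithm): expand the draft tree best-first by cumulative draft probability, so confident continuations get deep branches while unlikely ones are never expanded.

import heapq

def expand_draft_tree(draft_topk, budget=8, branch=2):
    """Best-first expansion of a draft tree by cumulative draft probability (sketch).
    draft_topk(prefix) is assumed to return the draft head's (token, prob) candidates
    for a prefix, sorted by prob. Returns up to `budget` drafted prefixes, best first.
    """
    frontier = [(-1.0, ())]          # max-heap via negated cumulative probability
    drafted = []
    while frontier and len(drafted) < budget:
        neg_p, prefix = heapq.heappop(frontier)
        if prefix:
            drafted.append((prefix, -neg_p))
        for token, prob in draft_topk(prefix)[:branch]:
            heapq.heappush(frontier, (neg_p * prob, prefix + (token,)))
    return drafted

# Toy usage with a fake draft head that always proposes the same two candidates:
print(expand_draft_tree(lambda prefix: [(1, 0.6), (2, 0.3)], budget=5))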
EAGLE-3 (NeurIPS 2025, arxiv:2503.01840) — current SOTA
EAGLE-3 changes the draft head architecture. EAGLE-1/2 feed only the final-layer hidden state into the draft head. EAGLE-3 uses multi-layer feature fusion: it combines hidden states from multiple transformer layers (early + late), giving the draft head richer semantic context. This significantly improves draft acceptance rates without changing the verification protocol. Result: 4.1–6.5× speedup with no output quality loss — the rejection sampling guarantee remains fully intact.
Speculative decoding speedup evolution (2023–2025)
| System | Year | Speedup vs base | Key innovation |
|---|---|---|---|
| Medusa | 2024 | 2.2–3.6× | Parallel prediction heads on target model, tree verify |
| EAGLE-1 | 2024 | 2.5–3.5× | Feature-space drafting (final hidden state), static tree |
| EAGLE-2 | EMNLP 2024 | 3.5–4.5× | Dynamic draft trees, context-aware branch pruning |
| EAGLE-3 | NeurIPS 2025 | 4.1–6.5× | Multi-layer feature fusion in draft head, zero quality loss |
EAGLE is integrated in vLLM (--speculative-model eagle) and SGLang. For batch inference on Llama-3 70B, EAGLE-3 is the default recommended choice. The speedup is latency-reducing — wall-clock time to first token is unchanged (prefill is not accelerated), but tokens-per-second decode throughput increases 4–6× on decode-heavy workloads.
EAGLE-3 multi-layer fusion — architecture sketch
import torch
import torch.nn as nn
class Eagle3DraftHead(nn.Module):
    """EAGLE-3: fuses hidden states from multiple layers into the draft head.
    Reference: arxiv:2503.01840 (simplified).
    """
    def __init__(self, d_model: int, vocab_size: int, fusion_layers: list[int]):
        super().__init__()
        self.fusion_layers = fusion_layers  # e.g. [0, 16, 31] for early/mid/late
        n = len(fusion_layers)
        # Project each layer's hidden state to d_model, then fuse
        self.layer_projections = nn.ModuleList([
            nn.Linear(d_model, d_model, bias=False) for _ in range(n)
        ])
        self.fusion = nn.Linear(n * d_model, d_model, bias=False)
        # Draft transformer block (1–2 layers)
        self.draft_attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, layer_hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_hidden_states[i]: (B, T, d_model) from layer fusion_layers[i]
        projected = [proj(h) for proj, h in zip(self.layer_projections, layer_hidden_states)]
        fused = self.fusion(torch.cat(projected, dim=-1))  # (B, T, d_model)
        draft_features = self.draft_attn(fused)
        return self.lm_head(draft_features)  # (B, T, vocab_size) — draft logits
Key Takeaways
What to remember for interviews
1. Verifying K draft tokens costs nearly the same as generating 1 token because the bottleneck is reading model weights from HBM — and you read them once regardless of K.
2. The acceptance-rejection scheme is mathematically lossless: accepted tokens use min(1, p_target/p_draft), rejected tokens are resampled from max(0, p_target - p_draft) normalized.
3. Expected speedup = (1 - alpha^(K+1)) / ((1 - alpha)(1 + Kc)). With alpha=0.7, K=5, and a fast draft model, practical speedup is 2-3x.
4. EAGLE-3 (NeurIPS 2025, current SOTA): multi-layer feature fusion in the draft head achieves 4.1–6.5× speedup with zero quality loss. Integrated in vLLM and SGLang.
5. Prompt lookup decoding achieves 2-3x speedup with zero extra parameters by copying n-grams from the input prompt as draft tokens — ideal for code editing and RAG.
Recap quiz
With α=0.7, K=5, and a negligibly fast draft model (c≈0), what is the expected number of tokens accepted per speculative decoding step?
Why does verifying K draft tokens cost nearly the same wall time as generating 1 token on a 70B model at batch size 1?
A team deploys a code-focused draft model to accelerate a general-purpose 70B chat model. Acceptance rate drops to α≈0.2. What is the expected speedup with K=5 and c≈0?
Prompt lookup decoding achieves 60–80% acceptance rates on code-editing tasks with zero extra parameters. Why does it degrade gracefully on open-ended generation?
Medusa replaces the external draft model with prediction heads on the target model itself. What is the key serving advantage, and what does it require?
With α=0.7, why is K=5 typically optimal rather than K=20 for speculative decoding?
Further Reading
- Fast Inference from Transformers via Speculative Decoding — Leviathan et al., 2023. The foundational paper proving speculative decoding is lossless via rejection sampling.
- Medusa: Simple LLM Inference Acceleration Framework — Cai et al., 2024. Adds parallel prediction heads to the target model with tree-structured verification.
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al., 2024. Drafts in feature space for higher acceptance rates than token-level methods.
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Li et al., EMNLP 2024. Context-aware dynamic draft trees replace static trees, achieving 3.5–4.5× speedup over base decoding.
- EAGLE-3: Scaling up Inference Acceleration of LLMs via Training-Time Test — Li et al., NeurIPS 2025. Multi-layer feature fusion in the draft head; 4.1–6.5× speedup with no output quality loss — current SOTA. Integrated in vLLM and SGLang.
Interview Questions
Explain speculative decoding step by step. Why is it mathematically lossless? ★★★
What determines the speedup of speculative decoding? When does it fail to provide speedup? ★★☆
Compare speculative decoding approaches: standard (draft model), Medusa, and EAGLE. What are the tradeoffs? ★★★
How does the acceptance-rejection sampling in speculative decoding work? Derive the adjusted distribution for rejected tokens. ★★★
Why is speculative decoding particularly effective for LLMs but not for small models? What hardware property makes it work? ★★☆