
Transformer Math

Module 27 · Inference

🏎️ Speculative Decoding

Small model guesses 5 tokens, big model checks all 5 at once


Autoregressive generation is inherently sequential — each token needs the previous one. Speculative decoding breaks this bottleneck: a small, fast draft model proposes K tokens, and the big target model verifies all K in a single forward pass. Accepted tokens are kept, rejected ones are resampled. The output distribution is mathematically identical to the target model — lossless speedup.

🏎️

Draft → Verify → Accept

What you're seeing: The full draft-verify pipeline. The small model drafts 5 tokens sequentially (fast, one at a time). The big model verifies all 5 in a single parallel forward pass. 4 are accepted; the 5th diverges and gets resampled.

Key insight: Verification is parallel (1 forward pass) while generation is sequential (K passes) — net speedup when acceptance rate > 1/K.

[Animation: the 7B draft model generates tokens t1–t5 sequentially (fast, one at a time); the 70B target then scores all 5 candidate positions in ONE parallel forward pass; t1–t4 are accepted, t5 is rejected and resampled as t5′.]
🎮

Standard vs. Speculative Decoding

What you're seeing: Side-by-side comparison of standard decoding (one token per big-model forward pass) vs. speculative decoding (draft 5 tokens with a small model, verify all at once with the big model).

What to notice: The draft model guesses "famous" but the target model prefers "known" — that token gets rejected and resampled. The first 4 tokens are accepted for free.

Standard Decoding (1 token/step)

The capital of France is


Speculative Decoding (draft + verify)

The capital of France is


💡

The Intuition

The problem: Generating one token from a 70B model requires reading all 140 GB of weights (FP16) from GPU memory. At 2 TB/s of bandwidth, that's ~70 ms per token. The GPU's compute units sit ~95% idle, just waiting for data.

The key insight: Verifying K tokens costs almost the same as generating 1 token. Why? The bottleneck is reading weights from memory, and you read them once regardless of whether you're processing 1 token or K tokens. The extra compute for K tokens is free because the GPU was underutilized.
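A quick back-of-envelope sketch makes the claim concrete (numbers from above; the variable names are illustrative):

```python
# Back-of-envelope: at batch size 1, decode latency is dominated by
# streaming the weights from HBM, not by compute.
weights_gb = 140        # 70B params x 2 bytes (FP16)
bandwidth_gb_s = 2000   # 2 TB/s HBM bandwidth

read_time_ms = weights_gb / bandwidth_gb_s * 1000
print(f"weight-read time per forward pass: {read_time_ms:.0f} ms")  # 70 ms

# Whether the pass processes 1 token or K=5 candidates, the same 140 GB
# is streamed once; only the (comparatively tiny) activation compute
# scales with K, so verifying K tokens is nearly free.
```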

Draft-then-verify pipeline: (1) Draft model (small and fast, e.g. a 7B companion to a 70B target) generates K candidate tokens autoregressively. (2) Target model (70B+) processes the prompt + K candidates in one forward pass, producing probability distributions for each position. (3) Accept/reject: for each candidate, compare the draft and target distributions using rejection sampling. Accepted tokens are kept; the first rejected token is resampled from the target distribution, and remaining candidates are discarded.

Why it's lossless: This is not an approximation. The acceptance-rejection scheme is mathematically equivalent to sampling directly from the target model. If the draft model perfectly matches the target, all tokens are accepted (maximum speedup). If the draft model is completely wrong, every token is rejected and resampled — you still get the correct output, just without speedup.

Variants: Beyond the standard draft-model approach, several self-speculative methods eliminate the need for a separate draft model. Medusa adds multiple parallel prediction heads to the target model, each predicting a future token position. EAGLE trains an autoregressive head on the target's hidden states for higher acceptance rates. Lookahead decoding uses Jacobi iteration to generate multiple tokens in parallel without any draft model at all — treating autoregressive generation as a fixed-point problem.

✨ Insight · Think of it like a fast junior developer who writes draft code, and a senior expert who reviews it. Approving correct code is faster than writing it from scratch. Even when the junior gets some parts wrong, the senior only needs to rewrite those parts.

Prompt lookup decoding (Saxena 2023) is a zero-overhead variant that needs no draft model at all. Instead of sampling from a smaller model, it scans the input prompt for n-grams that match the most recently generated tokens and copies candidate tokens directly from the prompt. For tasks where the output re-uses input text verbatim — code editing, document summarization, retrieval-augmented generation, fill-in-the-middle — acceptance rates can reach 60–80%, giving 2–3× speedups with zero additional parameters or memory. The entire draft phase costs only a string search over the prompt. This is now available as prompt_lookup_num_tokens in HuggingFace generate(). For general text where the output diverges from the prompt, it degrades gracefully to standard decoding with negligible overhead.
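The core of the idea fits in a few lines. A toy sketch — `prompt_lookup_draft` is a hypothetical helper, not the HuggingFace implementation:

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_drafts=4):
    """Toy prompt-lookup drafter: find the most recent n-gram earlier in
    the sequence and copy the tokens that followed it as draft candidates."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards so the most recent earlier match wins; skip the tail itself
    for i in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[i:i + ngram_size] == tail:
            return tokens[i + ngram_size : i + ngram_size + num_drafts]
    return []  # no match -> fall back to standard one-token decoding

# Token IDs standing in for "... France is Paris . The capital of"
seq = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
print(prompt_lookup_draft(seq))  # -> [4, 5, 6, 7]
```

The draft phase is a pure string search — no model weights are read at all, which is why the overhead is negligible when no match is found.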

Quick check

Trade-off

A 70B FP16 model on a 2 TB/s GPU takes ~70 ms per token. What hardware property makes this the same cost as verifying K=5 draft tokens simultaneously?

Quick Check

Why is speculative decoding lossless (identical output distribution to the target model)?

📐

Step-by-Step Derivation

Acceptance Probability

For a draft token x with draft probability q(x) and target probability p(x), accept with probability

min(1, p(x) / q(x))

If p(x) ≥ q(x), the token is always accepted. The target model "likes" it at least as much as the draft.

Rejection Resampling Distribution

When a draft token is rejected, sample a replacement from the residual distribution

p′(x) = max(0, p(x) − q(x)) / Σ_x′ max(0, p(x′) − q(x′))

This corrects for the draft model's bias: tokens the target prefers more than the draft get boosted in the resampling distribution.
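The correction can be checked numerically: accept-or-resample reproduces the target distribution exactly. A sketch over a hypothetical 3-token vocabulary:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # target distribution (toy 3-token vocab)
q = np.array([0.2, 0.6, 0.2])  # draft distribution

accept = np.minimum(1.0, p / q)       # P(accept | x drafted) = min(1, p/q)
p_reject = 1.0 - np.sum(q * accept)   # total probability of rejection
residual = np.maximum(0.0, p - q)
residual = residual / residual.sum()  # normalized resampling distribution

# P(output = x) = P(draft x, accept) + P(reject) * residual(x)
combined = q * accept + p_reject * residual
print(combined)  # [0.5 0.3 0.2] -- exactly the target distribution
```

Try any pair of distributions: the identity q(x)·min(1, p/q) + P(reject)·p′(x) = p(x) always holds, which is the losslessness guarantee.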

Expected Speedup

With average acceptance rate α, K draft tokens, and draft-to-target latency ratio c:

speedup = (1 − α^(K+1)) / ((1 − α)(1 + Kc))

When the draft model is very fast (c → 0) and acceptance is high (α → 1), speedup approaches K + 1. In practice, with α ≈ 0.7 and K = 5, speedup is roughly 2–3×.
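Plugging numbers into this formula (the latency ratios chosen here are illustrative):

```python
def expected_speedup(alpha: float, K: int, c: float) -> float:
    """Expected speedup = (1 - alpha^(K+1)) / ((1 - alpha) * (1 + K*c)).
    alpha: per-token acceptance rate, K: draft length,
    c: draft-to-target latency ratio."""
    return (1 - alpha ** (K + 1)) / ((1 - alpha) * (1 + K * c))

print(f"{expected_speedup(0.7, 5, 0.0):.2f}x")   # 2.94x -- the c -> 0 limit
print(f"{expected_speedup(0.7, 5, 0.05):.2f}x")  # 2.35x -- draft isn't free
print(f"{expected_speedup(0.2, 5, 0.0):.2f}x")   # 1.25x -- poor draft, little gain
```

Note how sensitive the gain is to α: dropping from 0.7 to 0.2 erases most of the speedup even with a free draft model.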

Speculative Decoding Loop (Pseudo-code)

python
def speculative_decode(target_model, draft_model, prompt, K=5):
    tokens = list(prompt)
    while not done:                      # pseudo-code: loop until stop condition
        # 1. Draft: small model generates K candidates autoregressively
        draft_tokens, draft_dists = [], []
        for _ in range(K):
            q = draft_model(tokens + draft_tokens)   # full draft distribution
            t = sample(q)
            draft_tokens.append(t)
            draft_dists.append(q)

        # 2. Verify: target model scores ALL candidates in one forward pass
        target_dists = target_model(tokens + draft_tokens)  # parallel!

        # 3. Accept/reject via rejection sampling
        accepted = []
        for i, t in enumerate(draft_tokens):
            p, q = target_dists[i], draft_dists[i]
            if random.uniform(0, 1) < min(1, p[t] / q[t]):
                accepted.append(t)       # accept draft token
            else:
                # Resample from the residual distribution max(0, p - q), normalized
                residual = max(0, p - q)  # elementwise over the vocab
                accepted.append(sample(residual / residual.sum()))
                break                    # discard remaining drafts
        else:
            # All K accepted: the verify pass also scored position K,
            # so one bonus token from the target comes for free
            accepted.append(sample(target_dists[K]))

        tokens.extend(accepted)
    return tokens
PyTorch implementation
# Speculative decoding acceptance step: rejection sampling
import torch

def acceptance_step(p_target: torch.Tensor, p_draft: torch.Tensor, draft_token: int):
    """
    p_target, p_draft: probability vectors over vocab (shape: [vocab_size])
    Returns: (accepted: bool, next_token: int)
    """
    alpha = min(1.0, p_target[draft_token].item() / p_draft[draft_token].item())
    if torch.rand(1).item() < alpha:
        return True, draft_token  # accept draft token

    # Resample from residual distribution: max(0, p_target - p_draft)
    residual = torch.clamp(p_target - p_draft, min=0.0)
    total = residual.sum()
    # Degenerate case: p_draft >= p_target everywhere → fall back to p_target
    residual = residual / total if total > 0 else p_target
    return False, torch.multinomial(residual, num_samples=1).item()

Quick check

Derivation

A draft token x has q(x)=0.4 but p(x)=0.2. It is rejected. Why must the resampled token come from max(0, p−q) normalized — not from p directly?

🔧

Break It — See What Happens

Draft model too different from target
Generate too many draft tokens (K=50)
📊

Real-World Numbers

| System | Method | Speedup | Details |
| --- | --- | --- | --- |
| Google PaLM | Standard spec. decoding | 2–3× | Leviathan et al. (2023), first large-scale deployment |
| Medusa | Parallel heads + tree verify | 2.2–3.6× | No separate draft model — adds heads to the target model, tree-structured attention for verification |
| EAGLE | Feature-level drafting | 2.5–3.5× | Drafts in feature space (hidden states), not token space — higher acceptance rate than Medusa |
| EAGLE-2 | Dynamic draft trees | — | Context-aware draft tree selection — expands high-confidence branches, prunes low-confidence ones |
| Typical: Llama-70B | Llama-7B draft | 2–3× | Same model family, α ≈ 0.7 acceptance on general text |
✨ Insight · The field is moving from external draft models to self-speculative methods (Medusa, EAGLE) that augment the target model itself. This avoids loading two separate models and often achieves higher acceptance rates because the draft shares the target's internal representations.

Quick check

Trade-off

Medusa reports 2.2–3.6× speedup; EAGLE reports 2.5–3.5×. Both eliminate external draft models. What accounts for EAGLE's higher acceptance rate?

🚀

SOTA 2024–2025: EAGLE-2 and EAGLE-3

The EAGLE line (SafeAI Lab, github.com/SafeAILab/EAGLE) has become the production standard for self-speculative decoding, integrated natively in both vLLM and SGLang.

EAGLE-2 (EMNLP 2024, arxiv:2406.16858)

EAGLE-1 used a static draft tree — a fixed branching structure decided at model-init time. EAGLE-2 replaces this with dynamic draft trees: at each step, the draft head scores candidate tokens and expands only high-confidence branches, pruning low-confidence ones. This context-aware pruning raises the effective acceptance rate, pushing speedups over baseline autoregressive decoding beyond EAGLE-1's 2.5–3.5×.

EAGLE-3 (NeurIPS 2025, arxiv:2503.01840) — current SOTA

EAGLE-3 changes the draft head architecture. EAGLE-1/2 feed only the final-layer hidden state into the draft head. EAGLE-3 uses multi-layer feature fusion: it combines hidden states from multiple transformer layers (early + late), giving the draft head richer semantic context. This significantly improves draft acceptance rates without changing the verification protocol. Result: 4.1–6.5× speedup with no output quality loss — the rejection sampling guarantee remains fully intact.

Speculative decoding speedup evolution (2023–2025)

| System | Year | Speedup vs. base | Key innovation |
| --- | --- | --- | --- |
| Medusa | 2024 | 2.2–3.6× | Parallel prediction heads on target model, tree verify |
| EAGLE-1 | 2024 | 2.5–3.5× | Feature-space drafting (final hidden state), static tree |
| EAGLE-2 | EMNLP 2024 | — | Dynamic draft trees, context-aware branch pruning |
| EAGLE-3 | NeurIPS 2025 | 4.1–6.5× | Multi-layer feature fusion in draft head, zero quality loss |
✨ Insight · Production status (2025): EAGLE-2 and EAGLE-3 are integrated in vLLM (--speculative-model eagle) and SGLang. For batch inference on Llama-3 70B, EAGLE-3 is the default recommended choice. Note what the speedup does and does not change: time to first token is unchanged (drafting only helps during decode, after the prompt has been processed), but tokens-per-second throughput increases 4–6× on decode-heavy workloads.
EAGLE-3 multi-layer fusion — architecture sketch
import torch
import torch.nn as nn

class Eagle3DraftHead(nn.Module):
    """EAGLE-3: fuses hidden states from multiple layers into the draft head.
    Reference: arxiv:2503.01840 (simplified).
    """
    def __init__(self, d_model: int, vocab_size: int, fusion_layers: list[int]):
        super().__init__()
        self.fusion_layers = fusion_layers  # e.g. [0, 16, 31] for early/mid/late
        n = len(fusion_layers)
        # Project each layer's hidden state to d_model, then fuse
        self.layer_projections = nn.ModuleList([
            nn.Linear(d_model, d_model, bias=False) for _ in range(n)
        ])
        self.fusion = nn.Linear(n * d_model, d_model, bias=False)
        # Draft transformer block (1–2 layers)
        self.draft_attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, layer_hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_hidden_states[i]: (B, T, d_model) from layer fusion_layers[i]
        projected = [proj(h) for proj, h in zip(self.layer_projections, layer_hidden_states)]
        fused = self.fusion(torch.cat(projected, dim=-1))  # (B, T, d_model)
        draft_features = self.draft_attn(fused)
        return self.lm_head(draft_features)  # (B, T, vocab_size) — draft logits
🧠

Key Takeaways

What to remember for interviews

  1. Verifying K draft tokens costs nearly the same as generating 1 token because the bottleneck is reading model weights from HBM — and you read them once regardless of K.
  2. The acceptance-rejection scheme is mathematically lossless: accepted tokens use min(1, p_target/p_draft), rejected tokens are resampled from max(0, p_target - p_draft) normalized.
  3. Expected speedup = (1 - alpha^(K+1)) / ((1 - alpha)(1 + Kc)). With alpha=0.7, K=5, and a fast draft model, practical speedup is 2–3×.
  4. EAGLE-3 (NeurIPS 2025, current SOTA): multi-layer feature fusion in the draft head achieves 4.1–6.5× speedup with zero quality loss. Integrated in vLLM and SGLang.
  5. Prompt lookup decoding achieves 2–3× speedup with zero extra parameters by copying n-grams from the input prompt as draft tokens — ideal for code editing and RAG.
🧠

Recap quiz

Derivation

With α=0.7, K=5, and a negligibly fast draft model (c≈0), what is the expected number of tokens accepted per speculative decoding step?

Trade-off

Why does verifying K draft tokens cost nearly the same wall time as generating 1 token on a 70B model at batch size 1?

Derivation

A team deploys a code-focused draft model to accelerate a general-purpose 70B chat model. Acceptance rate drops to α≈0.2. What is the expected speedup with K=5 and c≈0?

Trade-off

Prompt lookup decoding achieves 60–80% acceptance rates on code-editing tasks with zero extra parameters. Why does it degrade gracefully on open-ended generation?

Trade-off

Medusa replaces the external draft model with prediction heads on the target model itself. What is the key serving advantage, and what does it require?

Derivation

With α=0.7, why is K=5 typically optimal rather than K=20 for speculative decoding?

📚

Further Reading

🎯

Interview Questions


Explain speculative decoding step by step. Why is it mathematically lossless?

★★★
Google · Meta

What determines the speedup of speculative decoding? When does it fail to provide speedup?

★★☆
Google · OpenAI

Compare speculative decoding approaches: standard (draft model), Medusa, and EAGLE. What are the tradeoffs?

★★★
Google · Meta

How does the acceptance-rejection sampling in speculative decoding work? Derive the adjusted distribution for rejected tokens.

★★★
Google · Anthropic

Why is speculative decoding particularly effective for LLMs but not for small models? What hardware property makes it work?

★★☆
Google · Meta