
Transformer Math

Module 27 · Inference

🏎️ Speculative Decoding

Small model guesses 5 tokens, big model checks all 5 at once


Autoregressive generation is inherently sequential — each token needs the previous one. Speculative decoding breaks this bottleneck: a small, fast draft model proposes K tokens, and the big target model verifies all K in a single forward pass. Accepted tokens are kept, rejected ones are resampled. The output distribution is mathematically identical to the target model — lossless speedup.

🏎️

Draft → Verify → Accept

What you're seeing: The full draft-verify pipeline. The small model drafts 5 tokens sequentially (fast, one at a time). The big model verifies all 5 in a single parallel forward pass. 4 are accepted; the 5th diverges and gets resampled.

Key insight: Verification is parallel (1 forward pass) while generation is sequential (K passes) — net speedup when acceptance rate > 1/K.

[Animation: the 7B draft model generates tokens t1–t5 sequentially (fast, one at a time); the 70B target then scores all 5 candidate positions in ONE parallel forward pass; t1–t4 are accepted, t5 is rejected and resampled as t5′.]
🎮

Standard vs. Speculative Decoding

What you're seeing: Side-by-side comparison of standard decoding (one token per big-model forward pass) vs. speculative decoding (draft 5 tokens with a small model, verify all at once with the big model).

What to notice: The draft model guesses "famous" but the target model prefers "known" — that token gets rejected and resampled. The first 4 tokens are accepted for free.

Standard Decoding (1 token/step)

The capital of France is


Speculative Decoding (draft + verify)

The capital of France is


💡

The Intuition

The problem: Generating one token from a 70B model requires reading all 140 GB of weights (FP16) from GPU memory. At 2 TB/s of bandwidth, that's ~70 ms per token. The GPU's compute units sit ~95% idle, just waiting for data.

The key insight: Verifying K tokens costs almost the same as generating 1 token. Why? The bottleneck is reading weights from memory, and you read them once regardless of whether you're processing 1 token or K tokens. The extra compute for K tokens is free because the GPU was underutilized.
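A quick back-of-envelope sketch makes the claim concrete (numbers from above; the variable names are illustrative):

```python
# Back-of-envelope: at batch size 1, decode latency is dominated by
# streaming the weights from HBM, not by compute.
weights_gb = 140        # 70B params x 2 bytes (FP16)
bandwidth_gb_s = 2000   # 2 TB/s HBM bandwidth

read_time_ms = weights_gb / bandwidth_gb_s * 1000
print(f"weight-read time per forward pass: {read_time_ms:.0f} ms")  # 70 ms

# Whether the pass processes 1 token or K=5 candidates, the same 140 GB
# is streamed once; only the (comparatively tiny) activation compute
# scales with K, so verifying K tokens is nearly free.
```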

Draft-then-verify pipeline: (1) Draft model (small and fast, e.g. a 7B companion to a 70B target) generates K candidate tokens autoregressively. (2) Target model (70B+) processes the prompt + K candidates in one forward pass, producing probability distributions for each position. (3) Accept/reject: for each candidate, compare the draft and target distributions using rejection sampling. Accepted tokens are kept; the first rejected token is resampled from the target distribution, and remaining candidates are discarded.

Why it's lossless: This is not an approximation. The acceptance-rejection scheme is mathematically equivalent to sampling directly from the target model. If the draft model perfectly matches the target, all tokens are accepted (maximum speedup). If the draft model is completely wrong, every token is rejected and resampled — you still get the correct output, just without speedup.

Variants: Beyond the standard draft-model approach, several self-speculative methods eliminate the need for a separate draft model. Medusa adds multiple parallel prediction heads to the target model, each predicting a future token position. EAGLE trains an autoregressive head on the target's hidden states for higher acceptance rates. Lookahead decoding uses Jacobi iteration to generate multiple tokens in parallel without any draft model at all — treating autoregressive generation as a fixed-point problem.

✨ Insight · Think of it like a fast junior developer who writes draft code, and a senior expert who reviews it. Approving correct code is faster than writing it from scratch. Even when the junior gets some parts wrong, the senior only needs to rewrite those parts.

Prompt lookup decoding (Saxena 2023) is a zero-overhead variant that needs no draft model at all. Instead of sampling from a smaller model, it scans the input prompt for n-grams that match the most recently generated tokens and copies candidate tokens directly from the prompt. For tasks where the output re-uses input text verbatim — code editing, document summarization, retrieval-augmented generation, fill-in-the-middle — acceptance rates can reach 60–80%, giving 2–3× speedups with zero additional parameters or memory. The entire draft phase costs only a string search over the prompt. This is now available as prompt_lookup_num_tokens in HuggingFace generate(). For general text where the output diverges from the prompt, it degrades gracefully to standard decoding with negligible overhead.
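The core of the idea fits in a few lines. A toy sketch — `prompt_lookup_draft` is a hypothetical helper, not the HuggingFace implementation:

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_drafts=4):
    """Toy prompt-lookup drafter: find the most recent n-gram earlier in
    the sequence and copy the tokens that followed it as draft candidates."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards so the most recent earlier match wins; skip the tail itself
    for i in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[i:i + ngram_size] == tail:
            return tokens[i + ngram_size : i + ngram_size + num_drafts]
    return []  # no match -> fall back to standard one-token decoding

# Token IDs standing in for "... France is Paris . The capital of"
seq = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
print(prompt_lookup_draft(seq))  # -> [4, 5, 6, 7]
```

The draft phase is a pure string search — no model weights are read at all, which is why the overhead is negligible when no match is found.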

Quick check

Trade-off

A 70B FP16 model on a 2 TB/s GPU takes ~70 ms per token. What hardware property makes this the same cost as verifying K=5 draft tokens simultaneously?

Quick Check

Why is speculative decoding lossless (identical output distribution to the target model)?

📐

Step-by-Step Derivation

Acceptance Probability

For a draft token x with draft probability q(x) and target probability p(x), accept with probability

min(1, p(x) / q(x))

If p(x) ≥ q(x), the token is always accepted. The target model "likes" it at least as much as the draft.

Rejection Resampling Distribution

When a draft token is rejected, sample a replacement from the residual distribution

p′(x) = max(0, p(x) − q(x)) / Σ_x′ max(0, p(x′) − q(x′))

This corrects for the draft model's bias: tokens the target prefers more than the draft get boosted in the resampling distribution.
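The correction can be checked numerically: accept-or-resample reproduces the target distribution exactly. A sketch over a hypothetical 3-token vocabulary:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # target distribution (toy 3-token vocab)
q = np.array([0.2, 0.6, 0.2])  # draft distribution

accept = np.minimum(1.0, p / q)       # P(accept | x drafted) = min(1, p/q)
p_reject = 1.0 - np.sum(q * accept)   # total probability of rejection
residual = np.maximum(0.0, p - q)
residual = residual / residual.sum()  # normalized resampling distribution

# P(output = x) = P(draft x, accept) + P(reject) * residual(x)
combined = q * accept + p_reject * residual
print(combined)  # [0.5 0.3 0.2] -- exactly the target distribution
```

Try any pair of distributions: the identity q(x)·min(1, p/q) + P(reject)·p′(x) = p(x) always holds, which is the losslessness guarantee.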

Expected Speedup

With average acceptance rate α, K draft tokens, and draft-to-target latency ratio c:

speedup = (1 − α^(K+1)) / ((1 − α)(1 + Kc))

When the draft model is very fast (c → 0) and acceptance is high (α → 1), speedup approaches K + 1. In practice, with α ≈ 0.7 and K = 5, speedup is roughly 2–3×.
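Plugging numbers into this formula (the latency ratios chosen here are illustrative):

```python
def expected_speedup(alpha: float, K: int, c: float) -> float:
    """Expected speedup = (1 - alpha^(K+1)) / ((1 - alpha) * (1 + K*c)).
    alpha: per-token acceptance rate, K: draft length,
    c: draft-to-target latency ratio."""
    return (1 - alpha ** (K + 1)) / ((1 - alpha) * (1 + K * c))

print(f"{expected_speedup(0.7, 5, 0.0):.2f}x")   # 2.94x -- the c -> 0 limit
print(f"{expected_speedup(0.7, 5, 0.05):.2f}x")  # 2.35x -- draft isn't free
print(f"{expected_speedup(0.2, 5, 0.0):.2f}x")   # 1.25x -- poor draft, little gain
```

Note how sensitive the gain is to α: dropping from 0.7 to 0.2 erases most of the speedup even with a free draft model.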

Speculative Decoding Loop (Pseudo-code)

python
def speculative_decode(target_model, draft_model, prompt, K=5):
    tokens = list(prompt)
    while not done:                      # pseudo-code: loop until stop condition
        # 1. Draft: small model generates K candidates autoregressively
        draft_tokens, draft_dists = [], []
        for _ in range(K):
            q = draft_model(tokens + draft_tokens)   # full draft distribution
            t = sample(q)
            draft_tokens.append(t)
            draft_dists.append(q)

        # 2. Verify: target model scores ALL candidates in one forward pass
        target_dists = target_model(tokens + draft_tokens)  # parallel!

        # 3. Accept/reject via rejection sampling
        accepted = []
        for i, t in enumerate(draft_tokens):
            p, q = target_dists[i], draft_dists[i]
            if random.uniform(0, 1) < min(1, p[t] / q[t]):
                accepted.append(t)       # accept draft token
            else:
                # Resample from the residual distribution max(0, p - q), normalized
                residual = max(0, p - q)  # elementwise over the vocab
                accepted.append(sample(residual / residual.sum()))
                break                    # discard remaining drafts
        else:
            # All K accepted: the verify pass also scored position K,
            # so one bonus token from the target comes for free
            accepted.append(sample(target_dists[K]))

        tokens.extend(accepted)
    return tokens
PyTorch implementation
# Speculative decoding acceptance step: rejection sampling
import torch

def acceptance_step(p_target: torch.Tensor, p_draft: torch.Tensor, draft_token: int):
    """
    p_target, p_draft: probability vectors over vocab (shape: [vocab_size])
    Returns: (accepted: bool, next_token: int)
    """
    alpha = min(1.0, p_target[draft_token].item() / p_draft[draft_token].item())
    if torch.rand(1).item() < alpha:
        return True, draft_token  # accept draft token

    # Resample from residual distribution: max(0, p_target - p_draft)
    residual = torch.clamp(p_target - p_draft, min=0.0)
    total = residual.sum()
    # Degenerate case: p_draft >= p_target everywhere → fall back to p_target
    residual = residual / total if total > 0 else p_target
    return False, torch.multinomial(residual, num_samples=1).item()

Quick check

Derivation

A draft token x has q(x)=0.4 but p(x)=0.2. It is rejected. Why must the resampled token come from max(0, p−q) normalized — not from p directly?

🔧

Break It — See What Happens

Draft model too different from target
Generate too many draft tokens (K=50)
📊

Real-World Numbers

| System | Method | Speedup | Details |
| --- | --- | --- | --- |
| Google PaLM | Standard spec. decoding | 2–3× | Leviathan et al. (2023), first large-scale deployment |
| Medusa | Parallel heads + tree verify | 2.2–3.6× | No separate draft model — adds heads to the target model, tree-structured attention for verification |
| EAGLE | Feature-level drafting | 2.5–3.5× | Drafts in feature space (hidden states), not token space — higher acceptance rate than Medusa |
| EAGLE-2 | Dynamic draft trees | — | Context-aware draft tree selection — expands high-confidence branches, prunes low-confidence ones |
| Typical: Llama-70B | Llama-7B draft | 2–3× | Same model family, α ≈ 0.7 acceptance on general text |
✨ Insight · The field is moving from external draft models to self-speculative methods (Medusa, EAGLE) that augment the target model itself. This avoids loading two separate models and often achieves higher acceptance rates because the draft shares the target's internal representations.

Quick check

Trade-off

Medusa reports 2.2–3.6× speedup; EAGLE reports 2.5–3.5×. Both eliminate external draft models. What accounts for EAGLE's higher acceptance rate?

🚀

SOTA 2024–2025: EAGLE-2 and EAGLE-3

The EAGLE line (SafeAI Lab, github.com/SafeAILab/EAGLE) has become the production standard for self-speculative decoding, integrated natively in both vLLM and SGLang.

EAGLE-2 (EMNLP 2024, arxiv:2406.16858)

EAGLE-1 used a static draft tree — a fixed branching structure decided at model-init time. EAGLE-2 replaces this with dynamic draft trees: at each step, the draft head scores candidate tokens and expands only high-confidence branches, pruning low-confidence ones. This context-aware pruning raises the effective acceptance rate, pushing speedups over baseline autoregressive decoding beyond EAGLE-1's 2.5–3.5×.

EAGLE-3 (NeurIPS 2025, arxiv:2503.01840) — current SOTA

EAGLE-3 changes the draft head architecture. EAGLE-1/2 feed only the final-layer hidden state into the draft head. EAGLE-3 uses multi-layer feature fusion: it combines hidden states from multiple transformer layers (early + late), giving the draft head richer semantic context. This significantly improves draft acceptance rates without changing the verification protocol. Result: 4.1–6.5× speedup with no output quality loss — the rejection sampling guarantee remains fully intact.

Speculative decoding speedup evolution (2023–2025)

| System | Year | Speedup vs. base | Key innovation |
| --- | --- | --- | --- |
| Medusa | 2024 | 2.2–3.6× | Parallel prediction heads on target model, tree verify |
| EAGLE-1 | 2024 | 2.5–3.5× | Feature-space drafting (final hidden state), static tree |
| EAGLE-2 | EMNLP 2024 | — | Dynamic draft trees, context-aware branch pruning |
| EAGLE-3 | NeurIPS 2025 | 4.1–6.5× | Multi-layer feature fusion in draft head, zero quality loss |
✨ Insight · Production status (2025): EAGLE-2 and EAGLE-3 are integrated in vLLM (--speculative-model eagle) and SGLang. For batch inference on Llama-3 70B, EAGLE-3 is the default recommended choice. Note what the speedup does and does not change: time to first token is unchanged (drafting only helps during decode, after the prompt has been processed), but tokens-per-second throughput increases 4–6× on decode-heavy workloads.
EAGLE-3 multi-layer fusion — architecture sketch
import torch
import torch.nn as nn

class Eagle3DraftHead(nn.Module):
    """EAGLE-3: fuses hidden states from multiple layers into the draft head.
    Reference: arxiv:2503.01840 (simplified).
    """
    def __init__(self, d_model: int, vocab_size: int, fusion_layers: list[int]):
        super().__init__()
        self.fusion_layers = fusion_layers  # e.g. [0, 16, 31] for early/mid/late
        n = len(fusion_layers)
        # Project each layer's hidden state to d_model, then fuse
        self.layer_projections = nn.ModuleList([
            nn.Linear(d_model, d_model, bias=False) for _ in range(n)
        ])
        self.fusion = nn.Linear(n * d_model, d_model, bias=False)
        # Draft transformer block (1–2 layers)
        self.draft_attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, layer_hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # layer_hidden_states[i]: (B, T, d_model) from layer fusion_layers[i]
        projected = [proj(h) for proj, h in zip(self.layer_projections, layer_hidden_states)]
        fused = self.fusion(torch.cat(projected, dim=-1))  # (B, T, d_model)
        draft_features = self.draft_attn(fused)
        return self.lm_head(draft_features)  # (B, T, vocab_size) — draft logits
🧠

Key Takeaways

What to remember for interviews

  1. Verifying K draft tokens costs nearly the same as generating 1 token because the bottleneck is reading model weights from HBM — and you read them once regardless of K.
  2. The acceptance-rejection scheme is mathematically lossless: accepted tokens use min(1, p_target/p_draft), rejected tokens are resampled from max(0, p_target - p_draft) normalized.
  3. Expected speedup = (1 - alpha^(K+1)) / ((1 - alpha)(1 + Kc)). With alpha=0.7, K=5, and a fast draft model, practical speedup is 2–3×.
  4. EAGLE-3 (NeurIPS 2025, current SOTA): multi-layer feature fusion in the draft head achieves 4.1–6.5× speedup with zero quality loss. Integrated in vLLM and SGLang.
  5. Prompt lookup decoding achieves 2–3× speedup with zero extra parameters by copying n-grams from the input prompt as draft tokens — ideal for code editing and RAG.
🧠

Recap quiz

Derivation

With α=0.7, K=5, and a negligibly fast draft model (c≈0), what is the expected number of tokens accepted per speculative decoding step?

Trade-off

Why does verifying K draft tokens cost nearly the same wall time as generating 1 token on a 70B model at batch size 1?

Derivation

A team deploys a code-focused draft model to accelerate a general-purpose 70B chat model. Acceptance rate drops to α≈0.2. What is the expected speedup with K=5 and c≈0?

Trade-off

Prompt lookup decoding achieves 60–80% acceptance rates on code-editing tasks with zero extra parameters. Why does it degrade gracefully on open-ended generation?

Trade-off

Medusa replaces the external draft model with prediction heads on the target model itself. What is the key serving advantage, and what does it require?

Derivation

With α=0.7, why is K=5 typically optimal rather than K=20 for speculative decoding?

📚

Further Reading

🎯

Interview Questions


Explain speculative decoding step by step. Why is it mathematically lossless?

★★★
Google · Meta

What determines the speedup of speculative decoding? When does it fail to provide speedup?

★★☆
Google · OpenAI

Compare speculative decoding approaches: standard (draft model), Medusa, and EAGLE. What are the tradeoffs?

★★★
Google · Meta

How does the acceptance-rejection sampling in speculative decoding work? Derive the adjusted distribution for rejected tokens.

★★★
Google · Anthropic

Why is speculative decoding particularly effective for LLMs but not for small models? What hardware property makes it work?

★★☆
Google · Meta