
Transformer Math

Module 25 · Inference

🎲 Sampling & Decoding

Temperature 0 = always 'the', temperature 2 = sometimes 'banana'


After the model computes logits for every token in the vocabulary, it needs to pick ONE token. This is where sampling strategies come in — they control creativity vs. consistency.

🎮

Interactive Sandbox

What you’re seeing: how temperature, top-k, and top-p each reshape the next-token probability distribution before the final sample is drawn. What to try: set temperature to 0.0 and watch the distribution collapse to a single bar — then raise it past 1.5 to see the tail tokens become competitive.

Sampling Pipeline Visualization

Pipeline stages: raw logits → ÷ temperature → softmax → top-k → top-p → sample.
Controls: temperature from 0.1 (sharp) to 2.0 (flat) · top-k from 1 (greedy) to 20 (all) · top-p from 0.10 (tight) to 1.00 (all).

Example next-token distribution at the default settings (T=1.0, k=20, p=1.00 — nothing is filtered, so every stage shows the same probabilities):

the 29.1% · cat 19.5% · dog 14.4% · sat 7.9% · on 6.5% · mat 4.3% · and 3.6% · ran 2.9% · big 2.4% · red 2.0% · blue 1.6% · fish 1.3% · ate 1.1% · had 0.9% · was 0.7% · not 0.6% · very 0.5% · good 0.4% · bad 0.3% · day 0.2%

💡

The Intuition

Why softmax? Raw logits from the model can be any real number — +50, -30, +2. You can't sample from that directly. Softmax converts logits into a probability distribution that sums to 1. Temperature scales the logits before softmax: low T sharpens the distribution (model becomes confident and repetitive), high T flattens it (model becomes creative and unpredictable).

Setting     T     Output style                Example
Greedy      0     Always picks highest prob   “The cat sat on the mat. The cat sat on the mat.” (repetitive)
Balanced    1.0   Natural distribution        “The cat sat on the windowsill, watching birds.”
Creative    2.0   Flat distribution           “The quantum cat philosophized about string theory.”

Temperature is like adjusting confidence. Low temperature = the model acts very sure and picks the most likely token. High temperature = the model weighs many options almost equally.

Top-k and top-p filter out unlikely tokens before sampling. Top-k keeps a fixed number of candidates. Top-p (nucleus sampling) keeps the smallest set whose cumulative probability exceeds a threshold — it adapts to the shape of the distribution.
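To make the adaptive behaviour concrete, here is a minimal sketch (PyTorch, with two made-up 10-token distributions, one peaked and one nearly flat) that counts how many candidates top-p keeps in each case:

import torch

def count_top_p(probs, p=0.9):
    # Size of the smallest set whose cumulative probability reaches p
    sorted_probs, _ = probs.sort(descending=True)
    return int((sorted_probs.cumsum(0) < p).sum().item()) + 1

peaked = torch.tensor([0.85, 0.06, 0.03, 0.02, 0.02, 0.01, 0.005, 0.003, 0.001, 0.001])
flat = torch.tensor([0.12, 0.11, 0.11, 0.10, 0.10, 0.10, 0.10, 0.09, 0.09, 0.08])

print(count_top_p(peaked))  # 2  -> confident model: only a couple of tokens survive
print(count_top_p(flat))    # 9  -> uncertain model: nearly the whole (toy) vocabulary survives
# Top-k=5 would keep exactly 5 tokens in both cases, regardless of confidence.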

✨ Insight · Think of it as a funnel: the model produces a full distribution over the vocabulary, then temperature reshapes it, top-k/top-p trim it, and finally one token is randomly drawn from what remains.

Min-p sampling (Nguyen et al., 2024) fixes a fundamental fragility of top-p. Top-p uses a fixed cumulative threshold (e.g., $p = 0.9$), but that threshold interacts differently with peaked vs. flat distributions — at high temperature, 0.9 cumulative probability might include hundreds of tokens including nonsense; at low temperature, it might admit just one. Min-p instead sets a dynamic floor: keep all tokens whose probability exceeds $p_{\text{base}} \cdot p_{\max}$, where $p_{\max}$ is the highest token probability at this step. When the model is confident (high $p_{\max}$), the floor rises and only the top candidates survive. When the model is uncertain (low $p_{\max}$), the floor drops and more tokens pass. A typical value of $p_{\text{base}} = 0.05$–$0.1$ outperforms top-p across creative writing benchmarks and is now available in llama.cpp and many inference frameworks.

Quick check

Derivation

A peaked distribution (top token p=0.85) and a flat distribution (top token p=0.02) both run with top-p=0.9 and top-k=50. Which filter is binding on the peaked distribution, and approximately how many tokens survive?

Quick Check

What happens when temperature approaches 0?

📐

Step-by-Step Derivation

Temperature Scaling

Divide logits by temperature before applying softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

As $T \to 0$, the distribution becomes one-hot (greedy). As $T \to \infty$, it becomes uniform.
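A quick numerical check of the limiting behaviour (PyTorch, with an arbitrary 4-token logit vector):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # made-up raw logits

for T in (0.05, 1.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    print(T, [round(p, 2) for p in probs.tolist()])

# T=0.05 -> roughly [1.00, 0.00, 0.00, 0.00]   (one-hot: greedy)
# T=1.0  -> roughly [0.61, 0.22, 0.14, 0.03]   (the unmodified softmax)
# T=5.0  -> roughly [0.32, 0.26, 0.24, 0.18]   (drifting toward uniform)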

Top-k Filtering

Keep only the $k$ highest-probability tokens, set the rest to zero, and renormalize:

$$p_i' = \begin{cases} \dfrac{p_i}{\sum_{j \in V_k} p_j} & \text{if } i \in V_k \\[4pt] 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of the $k$ most probable tokens.

Top-p (Nucleus) Sampling

Find the smallest set of tokens $V_p$ whose cumulative probability is at least $p$, then renormalize:

$$V_p = \operatorname*{arg\,min}_{V'} |V'| \ \ \text{s.t.}\ \sum_{i \in V'} p_i \geq p, \qquad p_i' = \begin{cases} \dfrac{p_i}{\sum_{j \in V_p} p_j} & \text{if } i \in V_p \\[4pt] 0 & \text{otherwise} \end{cases}$$

💡 Tip · Top-p adapts to the distribution shape. If the model is confident (peaked distribution), fewer tokens pass the threshold. If uncertain (flat), more tokens are included. This is why top-p is generally preferred over a fixed top-k.

Min-p Sampling (Nguyen et al., 2024)

Keep all tokens whose probability exceeds a dynamic floor scaled by the current maximum probability. Unlike top-p's fixed cumulative threshold, min-p automatically tightens when the model is confident and relaxes when uncertain:

$$\text{keep token } i \iff p_i \geq p_{\text{base}} \cdot p_{\max}, \qquad p_{\max} = \max_j p_j$$

At $p_{\text{base}} = 0.05$, if the top token has probability 0.8, only tokens with $p_i \geq 0.04$ survive — roughly the top 2–3. If the top token has probability 0.1 (uncertain step), the floor drops to 0.005 and dozens of tokens remain.
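A minimal min-p filter sketch (PyTorch; `p_base` is the min-p hyperparameter and the probabilities are invented for illustration):

import torch

def min_p_filter(probs, p_base=0.05):
    """Zero out tokens below the dynamic floor p_base * p_max, then renormalize."""
    p_max = probs.max(dim=-1, keepdim=True).values
    floor = p_base * p_max                                 # dynamic threshold for this step
    kept = torch.where(probs >= floor, probs, torch.zeros_like(probs))
    return kept / kept.sum(dim=-1, keepdim=True)

probs = torch.tensor([0.80, 0.10, 0.05, 0.03, 0.015, 0.005])   # confident step: p_max = 0.8
print(min_p_filter(probs))
# floor = 0.05 * 0.8 = 0.04 -> only the first three tokens survive, then get renormalized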

Repetition Penalty

Apply a penalty factor $\theta > 1$ to any previously-generated token, using the sign-dependent rule from Hugging Face's implementation so that both positive and negative logits become less likely:

$$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{if } z_i \leq 0 \end{cases} \qquad \text{for every token } i \text{ already generated}$$

⚠ Warning · Set $\theta$ too high (e.g., 1.5+) and the model avoids common function words like “the” and “is”, producing grammatically broken output. Typical safe range: 1.0 (off) to 1.3.
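A short sketch of that sign-dependent rule (PyTorch; `generated_ids` is a hypothetical list of token ids emitted so far):

import torch

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """Make every previously generated token less likely:
    positive logits are divided by theta, negative ones multiplied by it."""
    for token_id in set(generated_ids):
        score = logits[token_id].item()
        logits[token_id] = score / theta if score > 0 else score * theta
    return logits

logits = torch.tensor([3.0, -1.0, 0.5, 2.0])
print(apply_repetition_penalty(logits, generated_ids=[0, 1], theta=1.2))
# -> tensor([ 2.5000, -1.2000,  0.5000,  2.0000])  (token 0 shrinks, token 1 sinks further)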

Greedy Decoding & Beam Search

Greedy: always pick the most likely token. Beam search: maintain the top-$b$ sequences, scored by log-probability sum:

$$\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i \mid y_{<i}, x)$$
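For contrast with sampling, here is a toy beam-search sketch; `next_token_logprobs` is a hypothetical callback standing in for a real model's next-token log-probabilities:

import torch
import torch.nn.functional as F

def beam_search(next_token_logprobs, bos_id, eos_id, beam_width=4, max_len=20):
    """Keep the beam_width highest-scoring prefixes, scored by summed log-probability."""
    beams = [([bos_id], 0.0)]                       # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            logprobs = next_token_logprobs(seq)     # 1-D tensor over the vocabulary
            top_lp, top_ids = logprobs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [tok], score + lp))
        # Prune back down to the beam_width best-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0]

# Toy usage: a fake 5-token "model" that always prefers token 3, with token 4 as EOS
fake_logits = torch.tensor([0.1, 0.2, 0.3, 2.0, 1.0])
best_seq, best_score = beam_search(lambda seq: F.log_softmax(fake_logits, dim=-1),
                                   bos_id=0, eos_id=4, beam_width=2, max_len=5)
print(best_seq, round(best_score, 3))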

PyTorch implementation
# Temperature + top-k + top-p (nucleus) sampling in PyTorch
import torch
import torch.nn.functional as F

def sample(logits, temperature=1.0, top_k=50, top_p=0.9):
    logits = logits / max(temperature, 1e-8)  # temperature scaling

    # Top-k: zero out everything below the k-th highest logit
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: keep smallest set whose cumulative prob >= p
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum_probs = sorted_probs.cumsum(dim=-1)
    mask = (cum_probs - sorted_probs) > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    probs.scatter_(-1, sorted_idx, sorted_probs)

    return torch.multinomial(probs, num_samples=1)
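As a quick usage sketch (reusing the imports above; the vocabulary size and seed are arbitrary), the function can be called on a single logit vector like this:

torch.manual_seed(0)                       # make the random draw reproducible
logits = torch.randn(50_257)               # hypothetical GPT-2-sized vocabulary
next_id = sample(logits, temperature=0.8, top_k=50, top_p=0.9)
print(next_id.item())                      # a single sampled token id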

Quick check

Derivation

Step: top token p_max = 0.6. Min-p = 0.05. What is the exact probability floor, and does a token with p = 0.025 survive the filter?


PyTorch: Sampling Strategies

python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, top-k, and top-p filtering, then sample.

    Expects logits of shape (batch, vocab_size).
    """
    # Temperature scaling
    if temperature != 1.0:
        logits = logits / temperature

    # Top-k filtering: mask everything below the k-th highest logit
    if top_k > 0:
        top_k_values, _ = torch.topk(logits, top_k)
        min_top_k = top_k_values[:, -1, None]
        logits = logits.masked_fill(logits < min_top_k, float('-inf'))

    # Top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        # Mask tokens once the cumulative prob *before* them already exceeds top_p
        sorted_mask = cumulative_probs - sorted_probs > top_p
        sorted_logits[sorted_mask] = float('-inf')
        # Scatter the filtered logits back to their original vocabulary positions
        logits = torch.full_like(logits, float('-inf')).scatter(1, sorted_idx, sorted_logits)

    # Sample from the filtered distribution
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
🔧

Break It — See What Happens

Temperature = 0 (Greedy)
Temperature = very high (approaches uniform)
Top-k = 1 (also Greedy)

Try these in the sandbox above. Set temperature to 0.01 and watch all bars collapse to one. Set top-k to 1 and only one bar remains — same result, different mechanism.
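A tiny check (PyTorch, arbitrary logits) that top-k=1 forces the same choice as greedy argmax, no matter the temperature:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

# Greedy: just take the argmax
greedy_id = logits.argmax()

# Top-k = 1 at high temperature: only the top logit survives, so sampling must pick it
scaled = logits / 2.0
kth = torch.topk(scaled, k=1).values[-1]
filtered = scaled.masked_fill(scaled < kth, float("-inf"))
sampled_id = torch.multinomial(F.softmax(filtered, dim=-1), 1)

print(greedy_id.item(), sampled_id.item())  # same token id both times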

Quick check

Derivation

Top-k=1 with temperature=2.0 is applied. What is the output distribution, and is it identical to temperature=0 (greedy)?

📊

Real-World Numbers

API Provider Defaults

Provider                 Default Temp   Default top_p    Max output tokens
OpenAI (GPT-4o)          1.0            1.0              16,384 max output (128K context window)
Anthropic (Claude 3.5)   1.0            –                8,192 (max)
Google (Gemini 1.5)      1.0            0.95 (typical)   Varies by model — read Model.output_token_limit at runtime

Recommended Settings by Task

Use Case               Temperature   Top-p   Notes
Code generation        0 – 0.2       –       Near-deterministic — syntax correctness matters most
Creative writing       0.8 – 1.2     0.95    Higher for diversity; top-p guards against nonsense
Factual QA             0.0           –       Greedy — commit to the most likely answer
Summarization / chat   0.3 – 0.7     –       Balanced: coherent but not robotic
✨ Insight · Most production LLM APIs default to T=1.0 and let the user lower it. A common mistake is setting both low temperature AND tight top-p — they compound, making output extremely deterministic and repetitive. Pick one primary lever; the other should stay near its default.
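As one concrete way these knobs are exposed in practice, the Hugging Face transformers generate() API accepts them as keyword arguments; the model name and prompt below are placeholders, and 0.7 is just an example mid-range setting:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The cat sat on the", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,          # enable sampling (False = greedy / beam search)
    temperature=0.7,         # primary lever for this use case
    top_p=1.0,               # leave the secondary lever near its default
    max_new_tokens=30,
)
print(tok.decode(out[0], skip_special_tokens=True))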

Quick check

Trade-off

You are building a production factual QA endpoint. A user reports that identical prompts sometimes get different answers. You are currently using T=0.7, top-p=0.95. What is the minimal change to make outputs deterministic?

🧠

Key Takeaways

What to remember for interviews

  1. Temperature scales logits before softmax: T<1 sharpens, T>1 flattens the distribution
  2. Top-k keeps exactly k tokens; top-p adapts to distribution shape (preferred in practice)
  3. Min-p sets a dynamic floor relative to the max probability — fixes top-p's fragility
  4. Code: T=0-0.2, Creative: T=0.8-1.2, Factual QA: T=0 (greedy)
  5. Pipeline: logits → /T → softmax → top-k → top-p → sample
🧠

Recap quiz

Derivation

A model's output distribution at T=0.5 has entropy H. What happens to H at T=2.0, and why does this matter for creative writing tasks?

Derivation

You set top-k=50 and top-p=0.9 on the same step. The model's distribution is very peaked: one token has p=0.92. Which filter is more restrictive here, and what token set does the final sampling see?

Derivation

At high temperature (T=1.5), top-p=0.9 admits hundreds of low-quality tokens. Min-p=0.05 on the same step, with the top token at p=0.3, admits only tokens above what threshold?

Trade-off

Greedy decoding on a language model trained on text without a repetition penalty often degenerates into loops. What is the root cause?

Trade-off

Beam search (b=4) vs. top-p sampling (p=0.95) on a code-completion task — which produces lower perplexity outputs and which produces more diverse outputs? Which is better for a latency-sensitive autocomplete API?

Trade-off

A repetition penalty of theta=1.5 is applied to all previously generated tokens. A user notices the model stops using common words like “the” and “is” after a few sentences. What is the root cause?

Derivation

Speculative decoding uses a small draft model to generate k tokens, then verifies with the large target model in a single forward pass. If the draft model uses top-p=0.9, must the target model use the same top-p to preserve output distribution correctness?


🎯

Interview Questions


Explain temperature, top-k, and top-p. How do they interact?

★★☆
OpenAI · Anthropic

When would you use beam search vs sampling? What are the tradeoffs?

★★☆
Google · Meta

What is nucleus sampling (top-p) and why is it preferred over top-k alone?

★★☆
OpenAI

How does repetition penalty work? What are its failure modes?

★★☆
Anthropic

Why is greedy decoding suboptimal for open-ended generation?

★★☆
OpenAI · Google

What is the relationship between temperature and entropy of the output distribution?

★★★
Anthropic · OpenAI

How would you set sampling parameters for: (a) code generation, (b) creative writing, (c) factual QA?

★★☆
Google · Databricks

What is speculative decoding and how does it interact with sampling?

★★★
OpenAI · Databricks