
Transformer Math

Module 25 · Inference

🎲 Sampling & Decoding

Temperature 0 = always 'the', temperature 2 = sometimes 'banana'


After the model computes logits for every token in the vocabulary, it needs to pick ONE token. This is where sampling strategies come in — they control creativity vs. consistency.

🎮

Interactive Sandbox

What you’re seeing: how temperature, top-k, and top-p each reshape the next-token probability distribution before the final sample is drawn. What to try: set temperature to 0.0 and watch the distribution collapse to a single bar — then raise it past 1.5 to see the tail tokens become competitive.

Sampling Pipeline Visualization

Pipeline stages: raw logits → ÷ temperature → softmax → top-k → top-p → sample.
Controls: temperature from 0.1 (sharp) to 2.0 (flat) · top-k from 1 (greedy) to 20 (all) · top-p from 0.10 (tight) to 1.00 (all).

Example next-token distribution at the default settings (T=1.0, k=20, p=1.00 — nothing is filtered, so every stage shows the same probabilities):

the 29.1% · cat 19.5% · dog 14.4% · sat 7.9% · on 6.5% · mat 4.3% · and 3.6% · ran 2.9% · big 2.4% · red 2.0% · blue 1.6% · fish 1.3% · ate 1.1% · had 0.9% · was 0.7% · not 0.6% · very 0.5% · good 0.4% · bad 0.3% · day 0.2%

💡

The Intuition

Why softmax? Raw logits from the model can be any real number — +50, -30, +2. You can't sample from that directly. Softmax converts logits into a probability distribution that sums to 1. Temperature scales the logits before softmax: low T sharpens the distribution (model becomes confident and repetitive), high T flattens it (model becomes creative and unpredictable).

Setting     T     Output style                Example
Greedy      0     Always picks highest prob   “The cat sat on the mat. The cat sat on the mat.” (repetitive)
Balanced    1.0   Natural distribution        “The cat sat on the windowsill, watching birds.”
Creative    2.0   Flat distribution           “The quantum cat philosophized about string theory.”

Temperature is like adjusting confidence. Low temperature = the model acts very sure and picks the most likely token. High temperature = the model weighs many options almost equally.

Top-k and top-p filter out unlikely tokens before sampling. Top-k keeps a fixed number of candidates. Top-p (nucleus sampling) keeps the smallest set whose cumulative probability exceeds a threshold — it adapts to the shape of the distribution.
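To make the adaptive behaviour concrete, here is a minimal sketch (PyTorch, with two made-up 10-token distributions, one peaked and one nearly flat) that counts how many candidates top-p keeps in each case:

import torch

def count_top_p(probs, p=0.9):
    # Size of the smallest set whose cumulative probability reaches p
    sorted_probs, _ = probs.sort(descending=True)
    return int((sorted_probs.cumsum(0) < p).sum().item()) + 1

peaked = torch.tensor([0.85, 0.06, 0.03, 0.02, 0.02, 0.01, 0.005, 0.003, 0.001, 0.001])
flat = torch.tensor([0.12, 0.11, 0.11, 0.10, 0.10, 0.10, 0.10, 0.09, 0.09, 0.08])

print(count_top_p(peaked))  # 2  -> confident model: only a couple of tokens survive
print(count_top_p(flat))    # 9  -> uncertain model: nearly the whole (toy) vocabulary survives
# Top-k=5 would keep exactly 5 tokens in both cases, regardless of confidence.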

✨ Insight · Think of it as a funnel: the model produces a full distribution over the vocabulary, then temperature reshapes it, top-k/top-p trim it, and finally one token is randomly drawn from what remains.

Min-p sampling (Nguyen et al., 2024) fixes a fundamental fragility of top-p. Top-p uses a fixed cumulative threshold (e.g., $p = 0.9$), but that threshold interacts differently with peaked vs. flat distributions — at high temperature, 0.9 cumulative probability might include hundreds of tokens including nonsense; at low temperature, it might admit just one. Min-p instead sets a dynamic floor: keep all tokens whose probability exceeds $p_{\text{base}} \cdot p_{\max}$, where $p_{\max}$ is the highest token probability at this step. When the model is confident (high $p_{\max}$), the floor rises and only the top candidates survive. When the model is uncertain (low $p_{\max}$), the floor drops and more tokens pass. A typical value of $p_{\text{base}} = 0.05$–$0.1$ outperforms top-p across creative writing benchmarks and is now available in llama.cpp and many inference frameworks.

Quick check

Derivation

A peaked distribution (top token p=0.85) and a flat distribution (top token p=0.02) both run with top-p=0.9 and top-k=50. Which filter is binding on the peaked distribution, and approximately how many tokens survive?

Quick Check

What happens when temperature approaches 0?

📐

Step-by-Step Derivation

Temperature Scaling

Divide logits by temperature before applying softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

As $T \to 0$, the distribution becomes one-hot (greedy). As $T \to \infty$, it becomes uniform.
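A quick numerical check of the limiting behaviour (PyTorch, with an arbitrary 4-token logit vector):

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # made-up raw logits

for T in (0.05, 1.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    print(T, [round(p, 2) for p in probs.tolist()])

# T=0.05 -> roughly [1.00, 0.00, 0.00, 0.00]   (one-hot: greedy)
# T=1.0  -> roughly [0.61, 0.22, 0.14, 0.03]   (the unmodified softmax)
# T=5.0  -> roughly [0.32, 0.26, 0.24, 0.18]   (drifting toward uniform)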

Top-k Filtering

Keep only the $k$ highest-probability tokens, set the rest to zero, and renormalize:

$$p_i' = \begin{cases} \dfrac{p_i}{\sum_{j \in V_k} p_j} & \text{if } i \in V_k \\[4pt] 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of the $k$ most probable tokens.

Top-p (Nucleus) Sampling

Find the smallest set of tokens $V_p$ whose cumulative probability is at least $p$, then renormalize:

$$V_p = \operatorname*{arg\,min}_{V'} |V'| \ \ \text{s.t.}\ \sum_{i \in V'} p_i \geq p, \qquad p_i' = \begin{cases} \dfrac{p_i}{\sum_{j \in V_p} p_j} & \text{if } i \in V_p \\[4pt] 0 & \text{otherwise} \end{cases}$$

💡 Tip · Top-p adapts to the distribution shape. If the model is confident (peaked distribution), fewer tokens pass the threshold. If uncertain (flat), more tokens are included. This is why top-p is generally preferred over a fixed top-k.

Min-p Sampling (Nguyen et al., 2024)

Keep all tokens whose probability exceeds a dynamic floor scaled by the current maximum probability. Unlike top-p's fixed cumulative threshold, min-p automatically tightens when the model is confident and relaxes when uncertain:

$$\text{keep token } i \iff p_i \geq p_{\text{base}} \cdot p_{\max}, \qquad p_{\max} = \max_j p_j$$

At $p_{\text{base}} = 0.05$, if the top token has probability 0.8, only tokens with $p_i \geq 0.04$ survive — roughly the top 2–3. If the top token has probability 0.1 (uncertain step), the floor drops to 0.005 and dozens of tokens remain.
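A minimal min-p filter sketch (PyTorch; `p_base` is the min-p hyperparameter and the probabilities are invented for illustration):

import torch

def min_p_filter(probs, p_base=0.05):
    """Zero out tokens below the dynamic floor p_base * p_max, then renormalize."""
    p_max = probs.max(dim=-1, keepdim=True).values
    floor = p_base * p_max                                 # dynamic threshold for this step
    kept = torch.where(probs >= floor, probs, torch.zeros_like(probs))
    return kept / kept.sum(dim=-1, keepdim=True)

probs = torch.tensor([0.80, 0.10, 0.05, 0.03, 0.015, 0.005])   # confident step: p_max = 0.8
print(min_p_filter(probs))
# floor = 0.05 * 0.8 = 0.04 -> only the first three tokens survive, then get renormalized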

Repetition Penalty

Apply a penalty factor $\theta > 1$ to any previously-generated token, using the sign-dependent rule from Hugging Face's implementation so that both positive and negative logits become less likely:

$$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{if } z_i \leq 0 \end{cases} \qquad \text{for every token } i \text{ already generated}$$

⚠ Warning · Set $\theta$ too high (e.g., 1.5+) and the model avoids common function words like “the” and “is”, producing grammatically broken output. Typical safe range: 1.0 (off) to 1.3.
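A short sketch of that sign-dependent rule (PyTorch; `generated_ids` is a hypothetical list of token ids emitted so far):

import torch

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    """Make every previously generated token less likely:
    positive logits are divided by theta, negative ones multiplied by it."""
    for token_id in set(generated_ids):
        score = logits[token_id].item()
        logits[token_id] = score / theta if score > 0 else score * theta
    return logits

logits = torch.tensor([3.0, -1.0, 0.5, 2.0])
print(apply_repetition_penalty(logits, generated_ids=[0, 1], theta=1.2))
# -> tensor([ 2.5000, -1.2000,  0.5000,  2.0000])  (token 0 shrinks, token 1 sinks further)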

Greedy Decoding & Beam Search

Greedy: always pick the most likely token. Beam search: maintain the top-$b$ sequences, scored by log-probability sum:

$$\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i \mid y_{<i}, x)$$
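For contrast with sampling, here is a toy beam-search sketch; `next_token_logprobs` is a hypothetical callback standing in for a real model's next-token log-probabilities:

import torch
import torch.nn.functional as F

def beam_search(next_token_logprobs, bos_id, eos_id, beam_width=4, max_len=20):
    """Keep the beam_width highest-scoring prefixes, scored by summed log-probability."""
    beams = [([bos_id], 0.0)]                       # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            logprobs = next_token_logprobs(seq)     # 1-D tensor over the vocabulary
            top_lp, top_ids = logprobs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [tok], score + lp))
        # Prune back down to the beam_width best-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0]

# Toy usage: a fake 5-token "model" that always prefers token 3, with token 4 as EOS
fake_logits = torch.tensor([0.1, 0.2, 0.3, 2.0, 1.0])
best_seq, best_score = beam_search(lambda seq: F.log_softmax(fake_logits, dim=-1),
                                   bos_id=0, eos_id=4, beam_width=2, max_len=5)
print(best_seq, round(best_score, 3))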

PyTorch implementation
# Temperature + top-k + top-p (nucleus) sampling in PyTorch
import torch
import torch.nn.functional as F

def sample(logits, temperature=1.0, top_k=50, top_p=0.9):
    logits = logits / max(temperature, 1e-8)  # temperature scaling

    # Top-k: zero out everything below the k-th highest logit
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: keep smallest set whose cumulative prob >= p
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum_probs = sorted_probs.cumsum(dim=-1)
    mask = (cum_probs - sorted_probs) > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    probs.scatter_(-1, sorted_idx, sorted_probs)

    return torch.multinomial(probs, num_samples=1)
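As a quick usage sketch (reusing the imports above; the vocabulary size and seed are arbitrary), the function can be called on a single logit vector like this:

torch.manual_seed(0)                       # make the random draw reproducible
logits = torch.randn(50_257)               # hypothetical GPT-2-sized vocabulary
next_id = sample(logits, temperature=0.8, top_k=50, top_p=0.9)
print(next_id.item())                      # a single sampled token id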

Quick check

Derivation

Step: top token p_max = 0.6. Min-p = 0.05. What is the exact probability floor, and does a token with p = 0.025 survive the filter?


PyTorch: Sampling Strategies

python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, top-k, and top-p filtering, then sample.

    Expects logits of shape (batch, vocab_size).
    """
    # Temperature scaling
    if temperature != 1.0:
        logits = logits / temperature

    # Top-k filtering: mask everything below the k-th highest logit
    if top_k > 0:
        top_k_values, _ = torch.topk(logits, top_k)
        min_top_k = top_k_values[:, -1, None]
        logits = logits.masked_fill(logits < min_top_k, float('-inf'))

    # Top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        # Mask tokens once the cumulative prob *before* them already exceeds top_p
        sorted_mask = cumulative_probs - sorted_probs > top_p
        sorted_logits[sorted_mask] = float('-inf')
        # Scatter the filtered logits back to their original vocabulary positions
        logits = torch.full_like(logits, float('-inf')).scatter(1, sorted_idx, sorted_logits)

    # Sample from the filtered distribution
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
🔧

Break It — See What Happens

Temperature = 0 (Greedy)
Temperature = very high (approaches uniform)
Top-k = 1 (also Greedy)

Try these in the sandbox above. Set temperature to 0.01 and watch all bars collapse to one. Set top-k to 1 and only one bar remains — same result, different mechanism.
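A tiny check (PyTorch, arbitrary logits) that top-k=1 forces the same choice as greedy argmax, no matter the temperature:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

# Greedy: just take the argmax
greedy_id = logits.argmax()

# Top-k = 1 at high temperature: only the top logit survives, so sampling must pick it
scaled = logits / 2.0
kth = torch.topk(scaled, k=1).values[-1]
filtered = scaled.masked_fill(scaled < kth, float("-inf"))
sampled_id = torch.multinomial(F.softmax(filtered, dim=-1), 1)

print(greedy_id.item(), sampled_id.item())  # same token id both times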

Quick check

Derivation

Top-k=1 with temperature=2.0 is applied. What is the output distribution, and is it identical to temperature=0 (greedy)?

📊

Real-World Numbers

API Provider Defaults

Provider                 Default Temp   Default top_p    Max output tokens
OpenAI (GPT-4o)          1.0            1.0              16,384 max output (128K context window)
Anthropic (Claude 3.5)   1.0            –                8,192 (max)
Google (Gemini 1.5)      1.0            0.95 (typical)   Varies by model — read Model.output_token_limit at runtime

Recommended Settings by Task

Use Case               Temperature   Top-p   Notes
Code generation        0 – 0.2       –       Near-deterministic — syntax correctness matters most
Creative writing       0.8 – 1.2     0.95    Higher for diversity; top-p guards against nonsense
Factual QA             0.0           –       Greedy — commit to the most likely answer
Summarization / chat   0.3 – 0.7     –       Balanced: coherent but not robotic
✨ Insight · Most production LLM APIs default to T=1.0 and let the user lower it. A common mistake is setting both low temperature AND tight top-p — they compound, making output extremely deterministic and repetitive. Pick one primary lever; the other should stay near its default.
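As one concrete way these knobs are exposed in practice, the Hugging Face transformers generate() API accepts them as keyword arguments; the model name and prompt below are placeholders, and 0.7 is just an example mid-range setting:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The cat sat on the", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,          # enable sampling (False = greedy / beam search)
    temperature=0.7,         # primary lever for this use case
    top_p=1.0,               # leave the secondary lever near its default
    max_new_tokens=30,
)
print(tok.decode(out[0], skip_special_tokens=True))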

Quick check

Trade-off

You are building a production factual QA endpoint. A user reports that identical prompts sometimes get different answers. You are currently using T=0.7, top-p=0.95. What is the minimal change to make outputs deterministic?

🧠

Key Takeaways

What to remember for interviews

  1. Temperature scales logits before softmax: T<1 sharpens, T>1 flattens the distribution
  2. Top-k keeps exactly k tokens; top-p adapts to distribution shape (preferred in practice)
  3. Min-p sets a dynamic floor relative to the max probability — fixes top-p's fragility
  4. Code: T=0-0.2, Creative: T=0.8-1.2, Factual QA: T=0 (greedy)
  5. Pipeline: logits → /T → softmax → top-k → top-p → sample
🧠

Recap quiz

Derivation

A model's output distribution at T=0.5 has entropy H. What happens to H at T=2.0, and why does this matter for creative writing tasks?

Derivation

You set top-k=50 and top-p=0.9 on the same step. The model's distribution is very peaked: one token has p=0.92. Which filter is more restrictive here, and what token set does the final sampling see?

Derivation

At high temperature (T=1.5), top-p=0.9 admits hundreds of low-quality tokens. Min-p=0.05 on the same step, with the top token at p=0.3, admits only tokens above what threshold?

Trade-off

Greedy decoding on a language model trained on text without a repetition penalty often degenerates into loops. What is the root cause?

Trade-off

Beam search (b=4) vs. top-p sampling (p=0.95) on a code-completion task — which produces lower perplexity outputs and which produces more diverse outputs? Which is better for a latency-sensitive autocomplete API?

Trade-off

A repetition penalty of theta=1.5 is applied to all previously generated tokens. A user notices the model stops using common words like “the” and “is” after a few sentences. What is the root cause?

Derivation

Speculative decoding uses a small draft model to generate k tokens, then verifies with the large target model in a single forward pass. If the draft model uses top-p=0.9, must the target model use the same top-p to preserve output distribution correctness?


🎯

Interview Questions


Explain temperature, top-k, and top-p. How do they interact?

★★☆
OpenAI · Anthropic

When would you use beam search vs sampling? What are the tradeoffs?

★★☆
Google · Meta

What is nucleus sampling (top-p) and why is it preferred over top-k alone?

★★☆
OpenAI

How does repetition penalty work? What are its failure modes?

★★☆
Anthropic

Why is greedy decoding suboptimal for open-ended generation?

★★☆
OpenAI · Google

What is the relationship between temperature and entropy of the output distribution?

★★★
Anthropic · OpenAI

How would you set sampling parameters for: (a) code generation, (b) creative writing, (c) factual QA?

★★☆
Google · Databricks

What is speculative decoding and how does it interact with sampling?

★★★
OpenAI · Databricks