🎲 Sampling & Decoding
Temperature 0 = always 'the', temperature 2 = sometimes 'banana'
After the model computes logits for every token in the vocabulary, it needs to pick ONE token. This is where sampling strategies come in — they control creativity vs. consistency.
Interactive Sandbox
What you’re seeing: how temperature, top-k, and top-p each reshape the next-token probability distribution before the final sample is drawn. What to try: set temperature to 0.0 and watch the distribution collapse to a single bar — then raise it past 1.5 to see the tail tokens become competitive.
Sampling Pipeline Visualization
[Interactive visualization: a toy 20-token distribution shown at each pipeline stage — raw logits (softmax), temperature (T=1.0), top-k (k=20), top-p (p=1.00) — plus the final distribution after all filters. At these default settings every filter is a no-op, so all stages show the same distribution: the 29.1%, cat 19.5%, dog 14.4%, sat 7.9%, on 6.5%, mat 4.3%, and 3.6%, ran 2.9%, big 2.4%, red 2.0%, blue 1.6%, fish 1.3%, ate 1.1%, had 0.9%, was 0.7%, not 0.6%, very 0.5%, good 0.4%, bad 0.3%, day 0.2%.]
The Intuition
Why softmax? Raw logits from the model can be any real number — +50, -30, +2. You can't sample from that directly. Softmax converts logits into a probability distribution that sums to 1. Temperature scales the logits before softmax: low T sharpens the distribution (model becomes confident and repetitive), high T flattens it (model becomes creative and unpredictable).
| Setting | T | Output style | Example |
|---|---|---|---|
| Greedy | 0 | Always picks highest prob | “The cat sat on the mat. The cat sat on the mat.” (repetitive) |
| Balanced | 1.0 | Natural distribution | “The cat sat on the windowsill, watching birds.” |
| Creative | 2.0 | Flat distribution | “The quantum cat philosophized about string theory.” |
Temperature is like adjusting confidence. Low temperature = the model is very sure, it picks the most likely token. High temperature = the model considers many options equally.
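To make this concrete, here is a minimal sketch (plain PyTorch; the three logit values are invented for illustration) of the same logits pushed through softmax at three temperatures:

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 3.0, 1.0])  # toy logits for three candidate tokens

for T in (0.2, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs.tolist()}")

# T=0.2: ~[0.99, 0.01, 0.00]  nearly one-hot (greedy-like)
# T=1.0: ~[0.71, 0.26, 0.04]  the raw distribution
# T=2.0: ~[0.55, 0.33, 0.12]  flattened; the tail becomes competitive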
Top-k and top-p filter out unlikely tokens before sampling. Top-k keeps a fixed number of candidates. Top-p (nucleus sampling) keeps the smallest set whose cumulative probability exceeds a threshold — it adapts to the shape of the distribution.
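To see that adaptivity numerically, here is a small sketch (toy distributions invented for illustration) that counts how many tokens land inside a p=0.9 nucleus:

import torch

def nucleus_size(probs, p=0.9):
    # size of the smallest prefix (sorted by descending prob) with cumulative prob >= p
    sorted_probs, _ = probs.sort(descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    return int((cum < p).sum().item()) + 1

peaked = torch.tensor([0.85, 0.05, 0.04, 0.03, 0.02, 0.01])  # confident step
flat = torch.full((100,), 0.01)                              # uncertain step

print(nucleus_size(peaked))  # 2: tiny nucleus when the model is confident
print(nucleus_size(flat))    # ~90: the nucleus balloons when the model is uncertain

With top-k=50, both steps would keep exactly 50 candidates: too many for the peaked step, and an arbitrary cutoff for the flat one.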
Min-p sampling (Nguyen et al., 2024) fixes a fundamental fragility of top-p. Top-p uses a fixed cumulative threshold (e.g., $p = 0.9$), but that threshold interacts differently with peaked vs. flat distributions — at high temperature, 0.9 cumulative probability might include hundreds of tokens including nonsense; at low temperature, it might admit just one. Min-p instead sets a dynamic floor: keep all tokens whose probability exceeds $p_{\text{base}} \cdot p_{\max}$, where $p_{\max}$ is the highest token probability at this step. When the model is confident (high $p_{\max}$), the floor rises and only the top candidates survive. When the model is uncertain (low $p_{\max}$), the floor drops and more tokens pass. A typical value of $p_{\text{base}} = 0.05$ outperforms top-p across creative writing benchmarks and is now available in llama.cpp and many inference frameworks.
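The rule takes only a few lines. A minimal sketch of min-p (ours, not the llama.cpp implementation), operating on the probabilities directly:

import torch
import torch.nn.functional as F

def min_p_filter(logits, p_base=0.05):
    # keep tokens with p_i >= p_base * p_max; zero out the rest and renormalize
    probs = F.softmax(logits, dim=-1)
    floor = p_base * probs.max(dim=-1, keepdim=True).values
    probs = torch.where(probs >= floor, probs, torch.zeros_like(probs))
    return probs / probs.sum(dim=-1, keepdim=True)

# Confident step (p_max = 0.8): floor = 0.04, only the head survives.
# Uncertain step (p_max = 0.1): floor = 0.005, dozens of tokens survive.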
Quick check
A peaked distribution (top token p=0.85) and a flat distribution (top token p=0.02) both run with top-p=0.9 and top-k=50. Which filter is binding on the peaked distribution, and approximately how many tokens survive?
What happens when temperature approaches 0?
Step-by-Step Derivation
Temperature Scaling
Divide the logits $z_i$ by temperature $T$ before applying softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

As $T \to 0$, the distribution becomes one-hot (greedy). As $T \to \infty$, it becomes uniform.
Top-k Filtering
Keep only the $k$ highest-probability tokens, set the rest to zero, and renormalize:

$$p_i' = \begin{cases} p_i \,/\, \sum_{j \in V_k} p_j & \text{if } i \in V_k \\ 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of the $k$ most probable tokens.
Top-p (Nucleus) Sampling
Find the smallest set $S$ of tokens whose cumulative probability satisfies $\sum_{i \in S} p_i \ge p$, then renormalize:

$$p_i' = \begin{cases} p_i \,/\, \sum_{j \in S} p_j & \text{if } i \in S \\ 0 & \text{otherwise} \end{cases}$$
Min-p Sampling (Nguyen et al., 2024)
Keep all tokens whose probability exceeds a dynamic floor scaled by the current maximum probability. Unlike top-p's fixed cumulative threshold, min-p automatically tightens when the model is confident and relaxes when uncertain:

$$\text{keep token } i \iff p_i \ge p_{\text{base}} \cdot p_{\max}, \qquad p_{\max} = \max_j p_j$$

At $p_{\text{base}} = 0.05$, if the top token has probability 0.8, only tokens with $p_i \ge 0.04$ survive — roughly the top 2–3. If the top token has probability 0.1 (an uncertain step), the floor drops to 0.005 and dozens of tokens remain.
Repetition Penalty
Apply a penalty factor $\theta > 1$ to any previously-generated token, using the sign-dependent rule from Hugging Face's implementation so that the token becomes less likely whatever the sign of its logit:

$$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{if } z_i \le 0 \end{cases}$$
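A sketch of that rule in PyTorch (the function and parameter names are ours):

import torch

def apply_repetition_penalty(logits, generated_ids, theta=1.2):
    # gather the logits of already-generated tokens, penalize, write them back:
    # positive logits are divided by theta, negative logits multiplied by theta,
    # so the token becomes less likely in both cases
    scores = logits.gather(-1, generated_ids)
    scores = torch.where(scores > 0, scores / theta, scores * theta)
    return logits.scatter(-1, generated_ids, scores)

The recap quiz below probes the failure mode: with a large theta, frequent function words like "the" get penalized out of the distribution after a few repetitions.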
Greedy Decoding & Beam Search
Greedy: always pick the most likely token, $y_t = \arg\max_i \, p(i \mid y_{<t})$. Beam search: maintain the top-$b$ sequences, scored by log-probability sum:

$$\text{score}(y_{1:t}) = \sum_{s=1}^{t} \log p(y_s \mid y_{<s})$$
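A minimal beam search sketch, assuming a step(prefix) callable that returns a 1-D tensor of next-token log-probabilities (this model interface is a stand-in, not a real API):

import torch

def beam_search(step, bos_id, beam_width=4, max_len=20):
    beams = [([bos_id], 0.0)]  # (token sequence, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step(seq)
            top = torch.topk(log_probs, beam_width)  # only the top-b children can matter
            for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep the b best sequences overall
    return beams[0][0]

Real implementations also retire hypotheses that emit EOS and usually length-normalize scores, since summed log-probabilities inherently favor shorter sequences.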
PyTorch implementation
# Temperature + top-k + top-p (nucleus) sampling in PyTorch
import torch
import torch.nn.functional as F

def sample(logits, temperature=1.0, top_k=50, top_p=0.9):
    logits = logits / max(temperature, 1e-8)  # temperature scaling
    # Top-k: zero out everything below the k-th highest logit
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: keep smallest set whose cumulative prob >= p
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # mask tokens whose cumulative prob *before* them already exceeds top_p
    mask = (cum_probs - sorted_probs) > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    probs.scatter_(-1, sorted_idx, sorted_probs)
    return torch.multinomial(probs, num_samples=1)

Quick check
Step: top token p_max = 0.6. Min-p = 0.05. What is the exact probability floor, and does a token with p = 0.025 survive the filter?
PyTorch: Sampling Strategies
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, top-k, and top-p filtering, then sample."""
    # Temperature scaling
    if temperature != 1.0:
        logits = logits / temperature
    # Top-k filtering
    if top_k > 0:
        top_k_values, _ = torch.topk(logits, top_k)
        min_top_k = top_k_values[:, -1, None]
        logits = torch.where(logits < min_top_k, float('-inf'), logits)
    # Top-p (nucleus) filtering
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        # Remove tokens whose cumulative prob *before* them exceeds the threshold
        sorted_mask = cumulative_probs - sorted_probs > top_p
        sorted_logits[sorted_mask] = float('-inf')
        # Scatter the filtered logits back to their original vocabulary positions
        logits = torch.full_like(logits, float('-inf')).scatter(1, sorted_idx, sorted_logits)
    # Sample from the filtered distribution
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Break It — See What Happens
Try these in the sandbox above. Set temperature to 0.01 and watch all bars collapse to one. Set top-k to 1 and only one bar remains — same result, different mechanism.
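You can reproduce both collapses in code with sample_next_token from above (toy logits; the seed is arbitrary):

import torch

torch.manual_seed(0)
logits = torch.randn(1, 100)  # one batch row over a toy 100-token vocabulary

# Near-zero temperature: scaled logits explode and softmax collapses to (almost) one-hot
draws = [sample_next_token(logits, temperature=0.01) for _ in range(20)]
print(torch.cat(draws).unique())  # effectively one token id across all 20 draws

# top_k=1 at any temperature: only one candidate survives the filter
draws = [sample_next_token(logits, temperature=2.0, top_k=1) for _ in range(20)]
print(torch.cat(draws).unique())  # exactly one token id, guaranteed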
Quick check
Top-k=1 with temperature=2.0 is applied. What is the output distribution, and is it identical to temperature=0 (greedy)?
Real-World Numbers
API Provider Defaults
| Provider | Default Temp | Default top_p | Max output tokens |
|---|---|---|---|
| OpenAI (GPT-4o) | 1.0 | 1.0 | 16,384 max output (128K context window) |
| Anthropic (Claude 3.5) | 1.0 | – | 8,192 (max) |
| Google (Gemini 1.5) | 1.0 | 0.95 (typical) | Varies by model — read Model.output_token_limit at runtime |
Recommended Settings by Task
| Use Case | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0 – 0.2 | – | Near-deterministic — syntax correctness matters most |
| Creative writing | 0.8 – 1.2 | 0.95 | Higher for diversity; top-p guards against nonsense |
| Factual QA | 0.0 | – | Greedy — commit to the most likely answer |
| Summarization / chat | 0.3 – 0.7 | – | Balanced: coherent but not robotic |
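Applied to the factual-QA row, a production call might pin the decoding parameters explicitly. A sketch using the OpenAI Python client (the model name and prompt are placeholders):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "In what year did Apollo 11 land on the Moon?"}],
    temperature=0.0,  # greedy: commit to the most likely answer
    top_p=1.0,        # leave nucleus filtering off; adjust one knob, not both
    seed=1234,        # best-effort determinism across calls
)
print(response.choices[0].message.content)

Even at temperature=0, outputs are not guaranteed bit-identical on every backend (nondeterministic GPU kernels can break ties differently), which is why seed is documented as best-effort.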
Quick check
You are building a production factual QA endpoint. A user reports that identical prompts sometimes get different answers. You are currently using T=0.7, top-p=0.95. What is the minimal change to make outputs deterministic?
Key Takeaways
What to remember for interviews
1. Temperature scales logits before softmax: T<1 sharpens, T>1 flattens the distribution
2. Top-k keeps exactly k tokens; top-p adapts to distribution shape (preferred in practice)
3. Min-p sets a dynamic floor relative to the max probability — fixes top-p's fragility
4. Code: T = 0–0.2, Creative: T = 0.8–1.2, Factual QA: T = 0 (greedy)
5. Pipeline: logits → /T → softmax → top-k → top-p → sample
Recap quiz
A model's output distribution at T=0.5 has entropy H. What happens to H at T=2.0, and why does this matter for creative writing tasks?
You set top-k=50 and top-p=0.9 on the same step. The model's distribution is very peaked: one token has p=0.92. Which filter is more restrictive here, and what token set does the final sampling see?
At high temperature (T=1.5), top-p=0.9 admits hundreds of low-quality tokens. Min-p=0.05 on the same step, with the top token at p=0.3, admits only tokens above what threshold?
Greedy decoding on a language model trained on text without a repetition penalty often degenerates into loops. What is the root cause?
Beam search (b=4) vs. top-p sampling (p=0.95) on a code-completion task — which produces lower perplexity outputs and which produces more diverse outputs? Which is better for a latency-sensitive autocomplete API?
A repetition penalty of theta=1.5 is applied to all previously generated tokens. A user notices the model stops using common words like “the” and “is” after a few sentences. What is the root cause?
Speculative decoding uses a small draft model to generate k tokens, then verifies with the large target model in a single forward pass. If the draft model uses top-p=0.9, must the target model use the same top-p to preserve output distribution correctness?
Further Reading
- The Curious Case of Neural Text Degeneration (Holtzman et al. 2020) — Introduces nucleus sampling (top-p) — dynamically truncates the vocabulary to the smallest set covering probability p.
- Typical Decoding for Natural Language Generation — Typical sampling — selects tokens whose information content is close to the expected information, producing more human-like text.
- Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs (Nguyen et al. 2024) — min-p sets a dynamic floor at p_base × p_max, automatically adapting to the distribution without the fixed-cutoff fragility of top-p.
- Transformer Explainer (Georgia Tech) — Interactive GPT-2 demo — adjust temperature and sampling settings and see their effect on next-token probability distributions live.
- Andrej Karpathy — Let's Build GPT — Implements temperature scaling and greedy/sampling decoding from scratch — best way to internalize how sampling parameters affect generation.
- Lilian Weng — Controllable Text Generation — Comprehensive survey of decoding strategies including temperature, top-k, top-p, and beam search — with analysis of quality tradeoffs.
Interview Questions
Explain temperature, top-k, and top-p. How do they interact? ★★☆
When would you use beam search vs sampling? What are the tradeoffs? ★★☆
What is nucleus sampling (top-p) and why is it preferred over top-k alone? ★★☆
How does repetition penalty work? What are its failure modes? ★★☆
Why is greedy decoding suboptimal for open-ended generation? ★★☆
What is the relationship between temperature and entropy of the output distribution? ★★★
How would you set sampling parameters for: (a) code generation, (b) creative writing, (c) factual QA? ★★☆
What is speculative decoding and how does it interact with sampling? ★★★