💭 Reasoning Models
DeepSeek-R1-Zero discovered chain-of-thought without being taught
Language models can do more than predict the next token — they can reason. Chain-of-thought prompting, test-time compute scaling, and RL-trained reasoning (o1, DeepSeek-R1) represent a paradigm shift: instead of making models bigger, we let them think longer. Process reward models score each reasoning step, not just the final answer.
Think Before You Answer
What you're seeing: The top row shows a standard LLM — one forward pass, no intermediate steps. The bottom row shows a reasoning model (o1/CoT): the model generates a chain of thinking tokens before emitting the final answer. The scaling curve shows accuracy improving with more thinking tokens, with diminishing returns past a saturation point.
Key insight: The gap between the two rows is test-time compute — you pay extra tokens per query, but accuracy jumps without retraining the model.
Chain-of-Thought Reasoning Trace
What you're seeing: A model answering a question with and without chain-of-thought reasoning. The naive model jumps to an answer (often wrong). The CoT model thinks step by step.
What to try: Click "Think Step by Step" to watch the reasoning trace unfold. Switch between problems to see how CoT catches common reasoning traps.
Without CoT (direct answer)
With CoT (think step by step)
Click "Think Step by Step" to start reasoning...
The Intuition
Chain-of-Thought (Wei et al., 2022) showed that few-shot prompting with worked reasoning exemplars dramatically improves reasoning: the model decomposes complex problems into simpler sub-steps it can solve individually. Kojima et al. (2022) later found that even the zero-shot trigger "let's think step by step" elicits the same behavior. On GSM8K (grade school math), CoT raised PaLM 540B from ~18% (direct answer) to ~57%, and self-consistency (Wang et al., 2022) pushes that to ~74%.
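Self-consistency is easy to layer on top of CoT prompting. A minimal sketch, assuming a hypothetical model.generate API and a toy extract_answer helper that looks for an "Answer:" line (neither is part of any real library):

from collections import Counter

def extract_answer(chain: str) -> str:
    """Pull the final answer out of a sampled chain (assumes an 'Answer:' line)."""
    for line in reversed(chain.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return chain.strip().splitlines()[-1] if chain.strip() else ""

def self_consistency_answer(model, prompt: str, n: int = 16, temperature: float = 0.7) -> str:
    """Sample n chain-of-thought completions and majority-vote on the final answer
    (Wang et al., 2022). model.generate is an illustrative stand-in, not a real API."""
    answers = [extract_answer(model.generate(prompt, temperature=temperature)) for _ in range(n)]
    # Majority vote over final answers; ties resolve arbitrarily in this sketch
    return Counter(answers).most_common(1)[0][0]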
Test-Time Compute Scaling flips the scaling paradigm. Instead of training a bigger model (which is a fixed cost), you let a smaller model think longer at inference. Snell et al. (2024) showed a smaller model with extended reasoning can match a 14× larger model on medium-difficulty problems.
OpenAI o1 trains reasoning via RL. The model learns to generate internal "thinking" tokens before the final answer. The reward comes from verifiable outcomes: correct math, passing tests. This is not prompting; the reasoning policy is baked into the weights.
DeepSeek-R1-Zero discovered reasoning without any human chain-of-thought data. Using GRPO (Group Relative Policy Optimization) on the base model with only correctness rewards, reasoning emerged spontaneously: the model learned that showing intermediate steps leads to more correct answers. This "aha moment" proved reasoning is an emergent RL behavior, not something that must be explicitly taught. (The shipped DeepSeek-R1 adds a cold-start SFT stage on thousands of long-CoT examples before RL.)
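The reward in this setup can be as simple as an automatic correctness check. A minimal sketch of a verifiable outcome reward, assuming the model is instructed to end with an "Answer:" line; the grading logic is illustrative, not DeepSeek's actual grader:

def outcome_reward(model_output: str, gold_answer: str) -> float:
    """Verifiable outcome reward: 1.0 if the final answer matches the reference, else 0.0."""
    final = ""
    for line in model_output.splitlines():
        if line.strip().lower().startswith("answer:"):
            final = line.split(":", 1)[1].strip()   # keep the last "Answer:" line
    # Exact-match grading; real graders normalize number formats or run unit tests
    return 1.0 if final == gold_answer.strip() else 0.0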
Test-Time Compute Scaling reframes the scaling question: instead of spending more on training a bigger model (fixed cost), spend more at inference via chain-of-thought, search, and verification (per-query cost). o1 and o3 demonstrated that doubling inference compute can outperform training a substantially larger model for reasoning tasks. Budget forcing lets the model allocate a "thinking budget" flexibly, spending more tokens on hard sub-problems and fewer on trivial ones, rather than using a fixed chain length.
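One way to picture budget forcing is a decoding loop that caps the number of thinking tokens and then forces the final answer. A minimal sketch, assuming hypothetical model.generate_token / model.generate methods and a "</think>" end-of-thinking marker, all illustrative rather than any vendor's actual interface:

def generate_with_thinking_budget(model, prompt: str, max_think_tokens: int = 1024) -> str:
    """Budget forcing sketch: let the model think up to a token budget, then force the answer."""
    thinking = []
    for _ in range(max_think_tokens):
        token = model.generate_token(prompt + "".join(thinking))
        thinking.append(token)
        if token == "</think>":          # model signals it is done thinking
            break
    else:
        thinking.append("</think>")      # budget exhausted: cut thinking short
    # Continue decoding the final answer after the (possibly truncated) thinking block
    return model.generate(prompt + "".join(thinking) + "\nFinal answer:")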
DeepSeek-R1-Zero (2025) discovered chain-of-thought reasoning purely through RL, without any supervised CoT training data. Using GRPO with only outcome correctness as the reward signal, the model spontaneously learned to "think", producing self-verification, backtracking, and multi-step decomposition as emergent behaviors. The key insight: RL can discover reasoning strategies without explicit instruction, because showing intermediate steps is simply what maximizes the correctness reward.
Search-Augmented Reasoning goes further by combining best-of-N sampling with process reward models (PRMs) to do beam search over reasoning paths. Rather than generating N independent full chains (wasteful if a chain fails at step 2), you score each reasoning step, prune low-scoring branches early, and expand only the most promising continuations — effectively "beam search over thoughts." This approach outperforms simple majority voting at the same compute budget.
Process Reward Models (PRM) score each reasoning step, not just the final answer. Outcome reward models (ORM) only know if the answer is right; PRMs know which step went wrong. This enables denser training signal and early pruning of bad reasoning paths.
Tree of Thoughts: structured search over reasoning. Best-of-N generates N independent full reasoning chains — wasteful because a chain that goes wrong at step 2 still runs all remaining steps. Tree of Thoughts (Yao et al., 2023) generalizes CoT into a deliberate search process: at each step, generate k candidate continuations, evaluate their promise (with a value function or majority vote), and expand only the most promising branches — pruning the rest. This is BFS or DFS over a reasoning tree, where the PRM acts as the heuristic. The result: the same compute budget explores far more diverse reasoning strategies rather than re-running similar chains from scratch. On the “Game of 24” task (combine four numbers with arithmetic to reach 24), ToT with BFS solved 74% of problems vs. CoT's 4% — because many problems require backtracking, which flat CoT cannot do.
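A minimal sketch of this step-level search, assuming hypothetical model.propose_steps (sample k candidate next steps) and prm_model.score (score a partial trace) interfaces; it illustrates the pruning pattern rather than reproducing Yao et al.'s exact algorithm:

def beam_search_over_thoughts(model, prm_model, question: str,
                              beam_width: int = 4, branch: int = 4, max_depth: int = 6):
    """Expand reasoning step by step, keeping only the highest-scoring partial traces."""
    beams = [[]]  # each beam is a list of reasoning steps (starts empty)
    for _ in range(max_depth):
        candidates = []
        for trace in beams:
            # Propose `branch` candidate next steps for this partial trace
            for step in model.propose_steps(question, trace, k=branch):
                new_trace = trace + [step]
                score = prm_model.score(new_trace)  # PRM value of the partial trace
                candidates.append((score, new_trace))
        # Prune: keep only the top `beam_width` partial traces, discard the rest early
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [trace for _, trace in candidates[:beam_width]]
    return beams[0]  # best reasoning trace found within the budget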
Quick check
PaLM 540B jumps from ~18% to ~57% on GSM8K with CoT prompting. What explains this gain without any weight update?
Why did reasoning emerge in DeepSeek-R1-Zero without any human chain-of-thought training data?
Step-by-Step Derivation
Test-Time Compute Tradeoff
Given a compute budget $C$, allocate between model size $N$ (fixed training cost) and test-time thinking tokens $T$ (per-query inference cost):

$$C_{\text{total}} = \underbrace{C_{\text{train}}(N)}_{\text{fixed}} + \underbrace{Q \cdot C_{\text{infer}}(N, T)}_{\text{per-query}}$$

where $Q$ is the number of queries served. For easy questions, a small $T$ suffices (direct answer). For medium questions, increasing $T$ (more thinking) is more cost-effective than increasing $N$ (a bigger model). For the hardest questions, no amount of $T$ helps; you need a bigger model.
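A back-of-the-envelope comparison, using the common approximation that decoding costs roughly 2 FLOPs per parameter per generated token; the model sizes and token counts below are illustrative only:

def inference_flops(params: float, tokens: int) -> float:
    """Approximate per-query decoding cost: ~2 FLOPs per parameter per generated token."""
    return 2.0 * params * tokens

# Small model thinking long vs. big model answering directly (illustrative sizes)
small_long = inference_flops(params=7e9, tokens=4096)   # 7B model, long CoT
big_short = inference_flops(params=70e9, tokens=256)    # 70B model, direct answer
print(f"7B + 4096 thinking tokens: {small_long:.2e} FLOPs/query")
print(f"70B + 256 answer tokens:   {big_short:.2e} FLOPs/query")
# ~5.7e13 vs ~3.6e13: per-query costs are in the same ballpark, but the small model
# avoids the fixed cost of training the larger model in the first place.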
Process Reward Model (PRM) Scoring
Given a reasoning trace with steps $s_1, s_2, \dots, s_T$, the PRM scores each step. The overall correctness score:

$$\text{score}(s_{1:T}) = \prod_{t=1}^{T} p_{\mathrm{PRM}}\left(s_t \text{ is correct} \mid s_1, \dots, s_{t-1}\right)$$
If any single step has low probability, the whole trace is flagged. This is used for best-of-N sampling: generate N reasoning traces, score each with PRM, return the highest-scoring one.
GRPO Group Advantage
For prompt $q$, sample a group of $G$ outputs $o_1, \dots, o_G$ with rewards $r_1, \dots, r_G$. Compute each advantage relative to the group:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
No critic network needed. Outputs better than the group mean are reinforced; worse ones are suppressed. The normalization makes advantages zero-mean, stabilizing training. DeepSeek-R1 uses G = 64 completions per prompt.
PyTorch: Chain-of-Thought Prompting
def cot_prompt(question: str) -> str:
    """Add chain-of-thought instruction to a question."""
    return f"""{question}
Let's think step by step:
1. First, identify what we need to find.
2. Break the problem into smaller parts.
3. Solve each part.
4. Combine and verify the answer."""
def process_reward_score(
    prm_model,
    reasoning_steps: list[str],
) -> float:
    """Score a reasoning trace with a Process Reward Model.

    Each step is scored conditioned on previous steps.
    Overall score = product of step-level probabilities.
    """
    score = 1.0
    context = []
    for step in reasoning_steps:
        context.append(step)
        # PRM predicts P(step is correct | previous steps)
        step_score = prm_model.score(context)  # -> float in [0, 1]
        score *= step_score
    return score
def best_of_n_with_prm(
    model, prm_model, prompt: str, n: int = 8
) -> str:
    """Generate N reasoning traces, return the best one."""
    traces = [model.generate(prompt) for _ in range(n)]
    scores = [
        process_reward_score(prm_model, trace.steps)
        for trace in traces
    ]
    best_idx = scores.index(max(scores))
    return traces[best_idx].final_answer

PyTorch implementation
# GRPO-style policy update (Group Relative Policy Optimization)
import torch

def grpo_loss(log_probs_new, log_probs_old, rewards, clip_eps=0.2):
    """
    log_probs_new / log_probs_old: (G,) log-probs of each output under new/old policy
    rewards: (G,) scalar reward per output
    """
    # Normalize rewards within the group -> advantage
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    # PPO-style clipped ratio
    ratio = (log_probs_new - log_probs_old).exp()  # (G,)
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    # Policy gradient loss (negative because we maximize reward)
    loss = -torch.min(ratio * adv, clipped * adv).mean()
    return loss

Quick check
GRPO normalizes advantages within a group of G outputs. If all G outputs receive the same reward (all correct or all wrong), what happens to the policy update?
Break It — See What Happens
Quick check
You replace PRM with ORM (outcome only) in a best-of-64 setup for MATH. Lightman et al. show ORM achieves 72.4% vs PRM's 78.2%. What is the mechanism of the 5.8-point loss?
Real-World Numbers
| Model / Technique | Benchmark | Score | Notes |
|---|---|---|---|
| PaLM 540B (no CoT) | GSM8K | ~18% | Direct answer prompting |
| PaLM 540B + CoT | GSM8K | ~57% | +39 pts from step-by-step prompting (Wei et al., 2022) |
| PaLM 540B + CoT + self-consistency | GSM8K | ~74% | Wang et al. 2022, majority vote over sampled CoTs |
| GPT-4o | AIME 2024 | 13.4% | Competition math, no reasoning training (OpenAI o1 report) |
| o1 (full) | AIME 2024 | 74.4% pass@1 / 83.3% consensus@64 | RL-trained reasoning, test-time compute |
| o1 (full) | MATH | — | Near-perfect on competition math |
| DeepSeek-R1 | AIME 2024 | 79.8% pass@1 | Open-source, GRPO-trained reasoning |
| PRM + best-of-N | MATH | 78.2% | vs. 72.4% for ORM (Lightman et al.) |
Quick check
o1 achieves 83.3% AIME 2024 at consensus@64 and 74.4% pass@1. What is the implied accuracy gain per additional sample in this consensus scheme?
SOTA 2024–2025: Test-Time Compute Scaling
Snell et al. 2024 established the second scaling axis formally: a smaller model given more test-time compute can match a 14× larger model on medium-difficulty problems (per arxiv:2408.03314, 2024-08). The year following saw every major lab ship a “reasoning mode.”
| Model | Benchmark | Score | Notes |
|---|---|---|---|
| o3 (high compute) | ARC-AGI | 87.5% | vs. GPT-4o 5%; ~$100–200/task at high compute (per openai.com/index/introducing-o3-and-o4-mini, 2025-04) |
| o3 / o4-mini | AIME | 96.7% (o3) | per openai.com/index/introducing-o3-and-o4-mini, 2025-04 |
| o3 | SWE-bench | — | per openai.com/index/introducing-o3-and-o4-mini, 2025-04 |
| Claude 3.7 Sonnet (extended thinking) | SWE-bench | 62.3% | Toggleable budget up to 100K thinking tokens (per anthropic.com/news/visible-extended-thinking, 2025-02) |
| Gemini 2.5 Pro (Deep Think) | IMO 2024 | — | 1M context, native multimodal hybrid thinking mode (per deepmind.google blog, 2025) |
| Grok 3 | AIME / reasoning | — | 10× compute scale on Colossus cluster; RL at scale for reasoning (per x.ai/news/grok-3, 2025-02) |
| Claude Opus 4 / Sonnet 4 | SWE-bench | — | Extended thinking, 200K context, Constitutional AI v2; hybrid reasoning mode selectable per-request (per anthropic.com/news/claude-4-0, 2025-05) |
| Grok 4 | AIME 2025 | — | Colossus 200K H100 cluster, 1M context, real-time X data integration; deep reasoning mode (per x.ai/blog/grok-4, 2025-08) |
| Kimi K2 | AIME 2025 | — | Agentic-tuned, MCP-native; open weights (per github.com/MoonshotAI/Kimi-K2 + moonshot.ai blog, 2025-07) |
| MiniMax-M1 | MATH-500 | — | Lightning attention + 1M native context; hybrid linear+softmax attention (per arxiv:2506.13585, 2025-06) |
| DeepSeek-R1-0528 | AIME 2024 | — | Up from 70% on base R1; distilled into Qwen-3 8B; improved search and verification (per huggingface.co/deepseek-ai/DeepSeek-R1-0528, 2025-05) |
| DeepSeek-V3.1 | LiveCodeBench | — | Hybrid thinking/non-thinking modes; open weights (per huggingface.co/deepseek-ai/DeepSeek-V3-1, 2025-08) |
| Gemini 2.5 Deep Think | IMO 2025 | 35/42 (gold) | Parallel sampling across multiple reasoning chains; extended thinking budget (per blog.google, 2025-08) |
Key Takeaways
What to remember for interviews
1. Chain-of-thought prompting forces the model to generate intermediate reasoning steps before answering, improving accuracy on multi-step tasks by allocating more compute (tokens) to the problem.
2. Test-time compute scaling lets a smaller model think longer at inference rather than training a bigger model — a smaller model with extended reasoning can match a 14× larger model on medium-difficulty problems.
3. OpenAI o1 trains reasoning via RL: the model learns a 'thinking' policy where the reward comes from verifiable outcomes (correct math, passing tests) rather than human-written CoT examples.
4. DeepSeek-R1-Zero discovered chain-of-thought reasoning purely through outcome-based RL (GRPO) — no human CoT data was needed. Reasoning is an emergent behavior when RL optimizes for correctness. (The shipped DeepSeek-R1 adds a cold-start SFT stage before RL.)
5. Process Reward Models (PRM) score each reasoning step individually rather than just the final answer, enabling early pruning of bad reasoning paths and catching flawed logic that produces lucky correct answers.
6. SOTA 2025: o3 scores 87.5% on ARC-AGI (high compute) and 96.7% AIME; Claude 3.7 Sonnet reaches 62.3% SWE-bench with extended thinking; Gemini 2.5 Pro Deep Think achieves IMO gold at 35/42.
Recap quiz
o1 achieves 83.3% on AIME 2024 at consensus@64, vs GPT-4o's 13.4%. What does the 64-sample gap reveal about o1's approach?
PRM achieves 78.2% on MATH vs ORM's 72.4% with best-of-N sampling. Why does the PRM gap increase as N grows?
GRPO samples G=64 outputs per prompt to estimate advantages. What problem does this solve compared to PPO's approach?
Tree of Thoughts solved 74% of Game-of-24 tasks vs CoT's 4%. What specific capability does ToT have that flat CoT lacks?
A team wants to use test-time compute scaling to improve accuracy on a coding benchmark. Snell et al. say this works best for “medium-difficulty” problems. How should they select which problems get more thinking budget?
DeepSeek-R1-Zero discovered chain-of-thought reasoning with no CoT training data, using only outcome-correctness rewards. What is the key implication for future alignment research?
PRM requires step-level human annotations, which are ~10× more expensive than outcome labels. When is this cost justified?
Further Reading
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Wei et al. (2022). The foundational paper showing step-by-step prompting dramatically improves reasoning.
- Let's Verify Step by Step — Lightman et al. (2023). Process reward models outperform outcome reward models by scoring each reasoning step.
- Scaling LLM Test-Time Compute Optimally — Snell et al. (2024). When to think longer vs. use a bigger model.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek (2025). GRPO discovers reasoning without human CoT data.
- OpenAI o1 System Card — Technical details on RL-trained reasoning and safety evaluations.
- Lilian Weng — Prompt Engineering — Covers CoT, self-consistency, tree-of-thought, and least-to-most prompting with empirical comparisons across benchmarks.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Yao et al. (2023). Structured search over reasoning paths using BFS/DFS — the algorithmic foundation for o1-style test-time search.
- Self-Consistency Improves Chain of Thought Reasoning — Wang et al. (2022). Sample multiple reasoning paths and take majority vote — a verifier-free approach to test-time scaling.
- The Illustrated DeepSeek-R1 — Visual walkthrough of DeepSeek-R1's training pipeline: cold-start SFT, GRPO RL, rejection sampling, and how the model learns to produce long chain-of-thought reasoning.
Interview Questions
★☆☆ What is chain-of-thought prompting and why does it improve reasoning accuracy?
★★☆ How does OpenAI's o1 differ from standard CoT prompting? What is its training approach?
★★★ Explain DeepSeek-R1's 'aha moment.' How did reasoning emerge without human CoT data?
★★☆ What is the difference between outcome reward models (ORM) and process reward models (PRM)?
★★★ Explain GRPO (Group Relative Policy Optimization) and how it computes advantages without a critic.
★★☆ When should you scale test-time compute (think longer) vs. use a bigger model? What is the tradeoff?