
Transformer Math

Module 32 · Architectures

💭 Reasoning Models

DeepSeek-R1 discovered chain-of-thought without being taught


Language models can do more than predict the next token — they can reason. Chain-of-thought prompting, test-time compute scaling, and RL-trained reasoning (o1, DeepSeek-R1) represent a paradigm shift: instead of making models bigger, we let them think longer. Process reward models score each reasoning step, not just the final answer.

🧠

Think Before You Answer

What you're seeing: The top row shows a standard LLM — one forward pass, no intermediate steps. The bottom row shows a reasoning model (o1/CoT): the model generates a chain of thinking tokens before emitting the final answer. The scaling curve shows accuracy improving with more thinking tokens, with diminishing returns past a saturation point.

Key insight: The gap between the two rows is test-time compute — you pay extra tokens per query, but accuracy jumps without retraining the model.

[Diagram: top row, standard LLM (Question → LLM → Answer in one pass); bottom row, CoT/o1 (Question → LLM → "Let me… First… Then…" → Answer ✓); inset plot of test-time compute scaling, accuracy vs. thinking tokens, flattening into diminishing returns.]
🎮

Chain-of-Thought Reasoning Trace

What you're seeing: A model answering a question with and without chain-of-thought reasoning. The naive model jumps to an answer (often wrong). The CoT model thinks step by step.

What to try: Click "Think Step by Step" to watch the reasoning trace unfold. Switch between problems to see how CoT catches common reasoning traps.

Q: What is 17 × 23?

Without CoT (direct answer)

A: 381 ✗ Wrong

With CoT (think step by step)

Click "Think Step by Step" to start reasoning...

💡

The Intuition

Chain-of-Thought (Wei et al., 2022) showed that prompting with few-shot exemplars containing worked reasoning dramatically improves accuracy — the model decomposes complex problems into simpler sub-steps it can solve individually. Kojima et al. (2022) later found that even the zero-shot trigger "let's think step by step" elicits the same behavior. On GSM8K (grade school math), CoT raised PaLM 540B from ~18% (direct answer) to ~57%, and self-consistency (Wang et al., 2022) pushes that to ~74%.
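
A minimal sketch of self-consistency, under two assumptions: a hypothetical model.generate(prompt, temperature) that returns a completion string, and a hypothetical extract_answer helper that pulls the final answer out of a chain of thought.

from collections import Counter

def self_consistency(model, question: str, n: int = 16, temperature: float = 0.7) -> str:
    """Self-consistency (Wang et al., 2022): sample several CoT completions
    at nonzero temperature, then majority-vote over their final answers.

    model.generate and extract_answer are assumed helpers, not a real API.
    """
    prompt = f"{question}\nLet's think step by step."   # zero-shot CoT trigger
    answers = [
        extract_answer(model.generate(prompt, temperature=temperature))
        for _ in range(n)
    ]
    # Reasoning paths differ, but the modal final answer is usually right
    return Counter(answers).most_common(1)[0][0]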

Test-Time Compute Scaling flips the scaling paradigm. Instead of training a bigger model (which is a fixed cost), you let a smaller model think longer at inference. Snell et al. (2024) showed a smaller model with extended reasoning can match a 14× larger model on medium-difficulty problems.

OpenAI o1 trains reasoning via RL. The model learns to generate internal "thinking" tokens before the final answer. The reward comes from verifiable outcomes: correct math, passing tests. This is not prompting — the reasoning policy is baked into the weights.

DeepSeek-R1-Zero discovered reasoning without any human chain-of-thought data. Using GRPO (Group Relative Policy Optimization) on the base model with only correctness rewards, reasoning emerged spontaneously — the model learned that showing intermediate steps leads to more correct answers. This "aha moment" proved reasoning is an emergent RL behavior, not something that must be explicitly taught. (The shipped DeepSeek-R1 adds a cold-start SFT stage on thousands of long-CoT examples before RL.)

Test-time Compute Scaling reframes the scaling question: instead of spending more on training a bigger model (fixed cost), spend more at inference via chain-of-thought, search, and verification (per-query cost). o1 and o3 demonstrated that doubling inference compute can outperform training a substantially larger model for reasoning tasks. Budget forcing lets the model allocate a "thinking budget" flexibly — spending more tokens on hard sub-problems and less on trivial ones — rather than a fixed chain length.
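
A hedged sketch of one way to impose a thinking budget at decode time. Real implementations differ per lab; model.step, the <think>/</think> delimiters, and <eos> here are assumptions for illustration, not any specific API.

def generate_with_thinking_budget(model, prompt: str,
                                  max_think_tokens: int = 4096,
                                  max_answer_tokens: int = 512) -> str:
    """Cap the number of 'thinking' tokens, then force the final answer.

    Assumes a hypothetical model.step(context) -> next-token API and a model
    that wraps its reasoning in <think> ... </think> delimiters.
    """
    context = prompt + "<think>"
    for _ in range(max_think_tokens):
        token = model.step(context)
        context += token
        if token == "</think>":      # model decided it has thought enough
            break
    else:
        context += "</think>"        # budget exhausted: force the end of thinking

    answer = []
    for _ in range(max_answer_tokens):
        token = model.step(context)
        if token == "<eos>":
            break
        context += token
        answer.append(token)
    return "".join(answer)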

DeepSeek-R1-Zero (2025) discovered chain-of-thought reasoning purely through RL, without any supervised CoT training data. Using GRPO with only outcome correctness as the reward signal, the model spontaneously learned to "think" — producing self-verification, backtracking, and multi-step decomposition as emergent behaviors. The key insight: RL can discover reasoning strategies without explicit instruction, because showing intermediate steps is simply what maximizes the correctness reward.
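
To make "only correctness rewards" concrete, here is a sketch of a rule-based outcome reward in the spirit of R1-Zero: it checks the final boxed answer and the <think> format, with no step-level supervision. The tag names, answer format, and the 1.0/0.1 weights are illustrative assumptions, not the paper's exact recipe.

import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: final-answer correctness plus a small format bonus.

    No learned reward model and no step labels; the only signal GRPO sees is
    whether the outcome was right and whether the CoT format was followed.
    (The weights below are illustrative, not the paper's values.)
    """
    # Format reward: reasoning must be wrapped in <think> ... </think>
    format_ok = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))

    # Accuracy reward: the answer inside \boxed{...} matches the gold label
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = match.group(1).strip() if match else None
    correct = predicted is not None and predicted == gold_answer.strip()

    return 1.0 * float(correct) + 0.1 * float(format_ok)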

Search-Augmented Reasoning goes further by combining best-of-N sampling with process reward models (PRMs) to do beam search over reasoning paths. Rather than generating N independent full chains (wasteful if a chain fails at step 2), you score each reasoning step, prune low-scoring branches early, and expand only the most promising continuations — effectively "beam search over thoughts." This approach outperforms simple majority voting at the same compute budget.

Process Reward Models (PRM) score each reasoning step, not just the final answer. Outcome reward models (ORM) only know if the answer is right; PRMs know which step went wrong. This enables denser training signal and early pruning of bad reasoning paths.

Tree of Thoughts: structured search over reasoning. Best-of-N generates N independent full reasoning chains — wasteful because a chain that goes wrong at step 2 still runs all remaining steps. Tree of Thoughts (Yao et al., 2023) generalizes CoT into a deliberate search process: at each step, generate k candidate continuations, evaluate their promise (with a value function or majority vote), and expand only the most promising branches — pruning the rest. This is BFS or DFS over a reasoning tree, where the PRM acts as the heuristic. The result: the same compute budget explores far more diverse reasoning strategies rather than re-running similar chains from scratch. On the “Game of 24” task (combine four numbers with arithmetic to reach 24), ToT with BFS solved 74% of problems vs. CoT's 4% — because many problems require backtracking, which flat CoT cannot do.
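
The two search variants above share the same skeleton: propose a few candidate next steps, score the partial traces, keep only the best branches. A sketch under assumed helpers (model.propose_steps, model.is_final, and prm.score are hypothetical):

def reasoning_beam_search(model, prm, question: str,
                          beam_width: int = 4, branch_factor: int = 4,
                          max_depth: int = 8) -> list[str]:
    """Beam search over reasoning steps (Tree-of-Thoughts style, PRM as heuristic).

    Assumed helpers:
      model.propose_steps(question, steps, k) -> k candidate next steps (strings)
      model.is_final(step)                    -> True if the step states a final answer
      prm.score(question, steps)              -> P(last step correct | earlier steps), in [0, 1]
    """
    beams = [([], 1.0)]                        # (partial trace, cumulative PRM score)
    for _ in range(max_depth):
        candidates = []
        for steps, score in beams:
            if steps and model.is_final(steps[-1]):
                candidates.append((steps, score))        # finished branch, keep as-is
                continue
            for step in model.propose_steps(question, steps, k=branch_factor):
                new_steps = steps + [step]
                candidates.append((new_steps, score * prm.score(question, new_steps)))
        # Prune: keep only the most promising partial traces
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(steps and model.is_final(steps[-1]) for steps, _ in beams):
            break
    return beams[0][0]                         # best-scoring reasoning trace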

✨ Insight · Think of ORM as grading an exam by the final answer only. PRM is like a teacher who checks each line of work — partial credit for correct steps, immediate feedback on mistakes. PRM catches "right answer, wrong reasoning" cases that ORM misses.

Quick check

Derivation

PaLM 540B jumps from ~18% to ~57% on GSM8K with CoT prompting. What explains this gain without any weight update?

Quick Check

Why did reasoning emerge in DeepSeek-R1 without human chain-of-thought training data?

📐

Step-by-Step Derivation

Test-Time Compute Tradeoff

Given a total compute budget $C$, allocate between model size $N$ (fixed training cost) and test-time thinking tokens $T$ (per-query cost):

$$C \;=\; \underbrace{C_{\text{train}}(N)}_{\text{fixed}} \;+\; \underbrace{Q \cdot c(N)\, T}_{\text{per query}}$$

where $Q$ is the number of queries served and $c(N)$ is the per-token inference cost of an $N$-parameter model. For easy questions, a small $T$ suffices (direct answer). For medium questions, increasing $T$ (more thinking) is more cost-effective than increasing $N$ (bigger model). For the hardest questions, no amount of $T$ helps — you need a bigger model.

💡 Tip · Snell et al. (2024): "A smaller model with test-time compute can match a 14x larger model" — but only on medium-difficulty problems. The optimal strategy adapts compute allocation per question.
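
One hedged way to "adapt compute allocation per question": probe with a few cheap, short completions first, and spend the large thinking budget only when they disagree. model.generate(question, max_think_tokens=...) and extract_answer are assumptions, and the agreement threshold is arbitrary.

from collections import Counter

def adaptive_thinking(model, question: str,
                      probe_samples: int = 4,
                      small_budget: int = 256,
                      large_budget: int = 8192) -> str:
    """Allocate test-time compute per question (in the spirit of Snell et al., 2024).

    Easy questions are answered by near-unanimous cheap probes; only questions
    where the probes disagree get the expensive long-thinking pass.
    """
    probes = [
        extract_answer(model.generate(question, max_think_tokens=small_budget))
        for _ in range(probe_samples)
    ]
    answer, votes = Counter(probes).most_common(1)[0]
    if votes >= probe_samples - 1:      # near-unanimous: treat as easy, return cheaply
        return answer
    # Disagreement: likely medium/hard, so think longer on this one question
    return extract_answer(model.generate(question, max_think_tokens=large_budget))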

Process Reward Model (PRM) Scoring

Given a reasoning trace with steps $s_1, s_2, \ldots, s_n$, the PRM scores each step conditioned on the steps before it. The overall correctness score:

$$P(\text{trace correct}) \;=\; \prod_{t=1}^{n} p_{\text{PRM}}\!\left(s_t\ \text{correct} \mid s_1, \ldots, s_{t-1}\right)$$

If any single step has low probability, the whole trace is flagged. This is used for best-of-N sampling: generate N reasoning traces, score each with PRM, return the highest-scoring one.

GRPO Group Advantage

For prompt $q$, sample a group of $G$ outputs $o_1, \ldots, o_G$ with rewards $r_1, \ldots, r_G$. Compute each output's advantage relative to the group:

$$A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

No critic network needed. Outputs better than the group mean are reinforced; worse ones are suppressed. The normalization makes advantages zero-mean, stabilizing training. DeepSeek-R1 uses $G = 64$ completions per prompt.

Python: Chain-of-Thought Prompting and PRM Scoring

python
def cot_prompt(question: str) -> str:
    """Add chain-of-thought instruction to a question."""
    return f"""{question}

Let's think step by step:
1. First, identify what we need to find.
2. Break the problem into smaller parts.
3. Solve each part.
4. Combine and verify the answer."""


def process_reward_score(
    prm_model,
    reasoning_steps: list[str],
) -> float:
    """Score a reasoning trace with a Process Reward Model.

    Each step is scored conditioned on previous steps.
    Overall score = product of step-level probabilities.
    """
    score = 1.0
    context = []
    for step in reasoning_steps:
        context.append(step)
        # PRM predicts P(step is correct | previous steps)
        step_score = prm_model.score(context)  # -> float in [0, 1]
        score *= step_score
    return score


def best_of_n_with_prm(
    model, prm_model, prompt: str, n: int = 8
) -> str:
    """Generate N reasoning traces, return the best one."""
    traces = [model.generate(prompt) for _ in range(n)]
    scores = [
        process_reward_score(prm_model, trace.steps)
        for trace in traces
    ]
    best_idx = scores.index(max(scores))
    return traces[best_idx].final_answer
PyTorch implementation
# GRPO-style policy update (Group Relative Policy Optimization)
import torch

def grpo_loss(log_probs_new, log_probs_old, rewards, clip_eps=0.2):
    """
    log_probs_new / log_probs_old: (G,) log-probs of each output under new/old policy
    rewards: (G,) scalar reward per output
    """
    # Normalize rewards within the group -> advantage
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)

    # PPO-style clipped ratio
    ratio = (log_probs_new - log_probs_old).exp()  # (G,)
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)

    # Policy gradient loss (negative because we maximize reward)
    loss = -torch.min(ratio * adv, clipped * adv).mean()
    return loss
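
A quick smoke test of grpo_loss on dummy data; the rewards below are made-up binary correctness scores, just to show the expected shapes.

# Toy usage: G = 8 sampled outputs for one prompt, binary correctness rewards
G = 8
log_probs_old = torch.randn(G)
log_probs_new = (log_probs_old + 0.05 * torch.randn(G)).requires_grad_()
rewards = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])   # 1 = correct answer

loss = grpo_loss(log_probs_new, log_probs_old, rewards)
loss.backward()        # gradients flow only through the new policy's log-probs
print(f"GRPO loss: {loss.item():.4f}")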

Quick check

Derivation

GRPO normalizes advantages within a group of G outputs. If all G outputs receive the same reward (all correct or all wrong), what happens to the policy update?

🔧

Break It — See What Happens

Force short reasoning (limit thinking tokens)
Only reward final answer (no process reward)

Quick check

Trade-off

You replace PRM with ORM (outcome only) in a best-of-64 setup for MATH. Lightman et al. show ORM achieves 72.4% vs PRM's 78.2%. What is the mechanism of the 5.8-point loss?

📊

Real-World Numbers

| Model / Technique | Benchmark | Score | Notes |
|---|---|---|---|
| PaLM 540B (no CoT) | GSM8K | ~18% | Direct answer prompting |
| PaLM 540B + CoT | GSM8K | ~57% | +39 pts from step-by-step prompting (Wei et al., 2022) |
| PaLM 540B + CoT + self-consistency | GSM8K | ~74% | Wang et al. 2022, majority vote over sampled CoTs |
| GPT-4o | AIME 2024 | 13.4% | Competition math, no reasoning training (OpenAI o1 report) |
| o1 (full) | AIME 2024 | 74.4% pass@1 (83.3% cons@64) | RL-trained reasoning, test-time compute |
| o1 (full) | MATH | — | Near-perfect on competition math |
| DeepSeek-R1 | AIME 2024 | — | Open-source, GRPO-trained reasoning |
| PRM + best-of-N | MATH | 78.2% | vs. 72.4% for ORM (Lightman et al.) |
✨ Insight · The reasoning revolution: CoT gives a large gain on math with zero training. RL-trained reasoning (o1 full model) pushes AIME 2024 from ~13% (GPT-4o) to 74.4% pass@1 (83.3% with 64-sample consensus). And DeepSeek-R1-Zero showed you don't even need human CoT data — reasoning emerges from outcome-based RL alone (DeepSeek-R1 itself adds a cold-start SFT stage before RL).

Quick check

Derivation

o1 achieves 83.3% AIME 2024 at consensus@64 and 74.4% pass@1. What is the implied accuracy gain per additional sample in this consensus scheme?

🚀

SOTA 2024–2025: Test-Time Compute Scaling

Snell et al. 2024 established the second scaling axis formally: a smaller model given extra test-time compute can match a 14× larger model on medium-difficulty problems (per arxiv:2408.03314, 2024-08). The year following saw every major lab ship a “reasoning mode.”

| Model | Benchmark | Score | Notes |
|---|---|---|---|
| o3 (high compute) | ARC-AGI | 87.5% | vs. GPT-4o 5%; ~$100–200/task at high compute (per openai.com/index/introducing-o3-and-o4-mini, 2025-04) |
| o3 / o4-mini | AIME | 96.7% (o3) | per openai.com/index/introducing-o3-and-o4-mini, 2025-04 |
| o3 | SWE-bench | — | per openai.com/index/introducing-o3-and-o4-mini, 2025-04 |
| Claude 3.7 Sonnet (extended thinking) | SWE-bench | 62.3% | Toggleable budget up to 100K thinking tokens (per anthropic.com/news/visible-extended-thinking, 2025-02) |
| Gemini 2.5 Pro (Deep Think) | IMO 2024 | — | 1M context, native multimodal hybrid thinking mode (per deepmind.google blog, 2025) |
| Grok 3 | AIME / reasoning | — | 10× compute scale on Colossus cluster; RL at scale for reasoning (per x.ai/news/grok-3, 2025-02) |
| Claude Opus 4 / Sonnet 4 | SWE-bench | — | Extended thinking, 200K context, Constitutional AI v2; hybrid reasoning mode selectable per-request (per anthropic.com/news/claude-4-0, 2025-05) |
| Grok 4 | AIME 2025 | — | Colossus 200K H100 cluster, 1M context, real-time X data integration; deep reasoning mode (per x.ai/blog/grok-4, 2025-08) |
| Kimi K2 | AIME 2025 | — | Agentic-tuned, MCP-native; open weights (per github.com/MoonshotAI/Kimi-K2 + moonshot.ai blog, 2025-07) |
| MiniMax-M1 | MATH-500 | — | Lightning attention + 1M native context; hybrid linear+softmax attention (per arxiv:2506.13585, 2025-06) |
| DeepSeek-R1-0528 | AIME 2024 | 87.5% | Up from 70% on base R1; distilled into Qwen-3 8B; improved search and verification (per huggingface.co/deepseek-ai/DeepSeek-R1-0528, 2025-05) |
| DeepSeek-V3.1 | LiveCodeBench | — | Hybrid thinking/non-thinking modes; open weights (per huggingface.co/deepseek-ai/DeepSeek-V3-1, 2025-08) |
| Gemini 2.5 Deep Think | IMO 2025 | Gold (35/42) | Parallel sampling across multiple reasoning chains; extended thinking budget (per blog.google, 2025-08) |
✨ Insight · The “test-time compute scaling” axis is now a first-class design choice: o3 at high compute costs ~$100–200/ARC-AGI task but scores 87.5%, while the same model at low compute scores ~55%. Claude 3.7 Sonnet exposes this as an explicit token budget knob. Gemini 2.5 Pro's Deep Think mode reaches IMO gold (35/42) — a task that requires multi-step symbolic reasoning beyond pattern-matching. By mid-2025, every major lab ships a reasoning mode: DeepSeek-R1-0528 hits 87.5% AIME 2024 (up from 70%), Kimi K2 (1T/32B MoE) is MCP-native, and MiniMax-M1 runs 1M-context with hybrid linear attention.
🧠

Key Takeaways

What to remember for interviews

  1. Chain-of-thought prompting forces the model to generate intermediate reasoning steps before answering, improving accuracy on multi-step tasks by allocating more compute (tokens) to the problem.
  2. Test-time compute scaling lets a smaller model think longer at inference rather than training a bigger model — a smaller model with extended reasoning can match a 14× larger model on medium-difficulty problems.
  3. OpenAI o1 trains reasoning via RL: the model learns a 'thinking' policy where the reward comes from verifiable outcomes (correct math, passing tests) rather than human-written CoT examples.
  4. DeepSeek-R1-Zero discovered chain-of-thought reasoning purely through outcome-based RL (GRPO) — no human CoT data was needed. Reasoning is an emergent behavior when RL optimizes for correctness. (The shipped DeepSeek-R1 adds a cold-start SFT stage before RL.)
  5. Process Reward Models (PRM) score each reasoning step individually rather than just the final answer, enabling early pruning of bad reasoning paths and catching flawed logic that produces lucky correct answers.
  6. SOTA 2025: o3 scores 87.5% on ARC-AGI (high compute) and 96.7% AIME; Claude 3.7 Sonnet reaches 62.3% SWE-bench with extended thinking; Gemini 2.5 Pro Deep Think achieves IMO gold at 35/42.
🧠

Recap quiz

Trade-off

o1 achieves 83.3% on AIME 2024 at consensus@64, vs GPT-4o's 13.4%. What does the 64-sample gap reveal about o1's approach?

Trade-off

PRM achieves 78.2% on MATH vs ORM's 72.4% with best-of-N sampling. Why does the PRM gap increase as N grows?

Trade-off

GRPO samples G=64 outputs per prompt to estimate advantages. What problem does this solve compared to PPO's approach?

Trade-off

Tree of Thoughts solved 74% of Game-of-24 tasks vs CoT's 4%. What specific capability does ToT have that flat CoT lacks?

Trade-off

A team wants to use test-time compute scaling to improve accuracy on a coding benchmark. Snell et al. say this works best for “medium-difficulty” problems. How should they select which problems get more thinking budget?

Trade-off

DeepSeek-R1-Zero discovered chain-of-thought reasoning with no CoT training data, using only outcome-correctness rewards. What is the key implication for future alignment research?

Trade-off

PRM requires step-level human annotations, which are ~10× more expensive than outcome labels. When is this cost justified?

📚

Further Reading

🎯

Interview Questions


What is chain-of-thought prompting and why does it improve reasoning accuracy?

★☆☆
Google · OpenAI

How does OpenAI's o1 differ from standard CoT prompting? What is its training approach?

★★☆
OpenAI

Explain DeepSeek-R1's 'aha moment.' How did reasoning emerge without human CoT data?

★★★
OpenAI · Meta

What is the difference between outcome reward models (ORM) and process reward models (PRM)?

★★☆
OpenAI · Google

Explain GRPO (Group Relative Policy Optimization) and how it computes advantages without a critic.

★★★
OpenAI · Meta

When should you scale test-time compute (think longer) vs. use a bigger model? What is the tradeoff?

★★☆
Google · Anthropic