
Transformer Math

Module 32 · Architectures

💭 Reasoning Models

DeepSeek-R1 discovered chain-of-thought without being taught


Language models can do more than predict the next token — they can reason. Chain-of-thought prompting, test-time compute scaling, and RL-trained reasoning (o1, DeepSeek-R1) represent a paradigm shift: instead of making models bigger, we let them think longer. Process reward models score each reasoning step, not just the final answer.

🧠

Think Before You Answer

What you're seeing: The top row shows a standard LLM — one forward pass, no intermediate steps. The bottom row shows a reasoning model (o1/CoT): the model generates a chain of thinking tokens before emitting the final answer. The scaling curve shows accuracy improving with more thinking tokens, with diminishing returns past a saturation point.

Key insight: The gap between the two rows is test-time compute — you pay extra tokens per query, but accuracy jumps without retraining the model.

[Diagram: top row, standard LLM (Question → LLM → Answer in one pass); bottom row, CoT/o1 (Question → LLM → "Let me… First… Then…" → Answer ✓); inset plot of test-time compute scaling, accuracy vs. thinking tokens, flattening into diminishing returns.]
🎮

Chain-of-Thought Reasoning Trace

What you're seeing: A model answering a question with and without chain-of-thought reasoning. The naive model jumps to an answer (often wrong). The CoT model thinks step by step.

What to try: Click "Think Step by Step" to watch the reasoning trace unfold. Switch between problems to see how CoT catches common reasoning traps.

Q: What is 17 × 23?

Without CoT (direct answer)

A: 381 ✗ Wrong

With CoT (think step by step)

Click "Think Step by Step" to start reasoning...

💡

The Intuition

Chain-of-Thought (Wei et al., 2022) showed that prompting with few-shot exemplars containing worked reasoning dramatically improves accuracy — the model decomposes complex problems into simpler sub-steps it can solve individually. Kojima et al. (2022) later found that even the zero-shot trigger "let's think step by step" elicits the same behavior. On GSM8K (grade school math), CoT raised PaLM 540B from ~18% (direct answer) to ~57%, and self-consistency (Wang et al., 2022) pushes that to ~74%.
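
A minimal sketch of self-consistency, under two assumptions: a hypothetical model.generate(prompt, temperature) that returns a completion string, and a hypothetical extract_answer helper that pulls the final answer out of a chain of thought.

from collections import Counter

def self_consistency(model, question: str, n: int = 16, temperature: float = 0.7) -> str:
    """Self-consistency (Wang et al., 2022): sample several CoT completions
    at nonzero temperature, then majority-vote over their final answers.

    model.generate and extract_answer are assumed helpers, not a real API.
    """
    prompt = f"{question}\nLet's think step by step."   # zero-shot CoT trigger
    answers = [
        extract_answer(model.generate(prompt, temperature=temperature))
        for _ in range(n)
    ]
    # Reasoning paths differ, but the modal final answer is usually right
    return Counter(answers).most_common(1)[0][0]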

Test-Time Compute Scaling flips the scaling paradigm. Instead of training a bigger model (which is a fixed cost), you let a smaller model think longer at inference. Snell et al. (2024) showed a smaller model with extended reasoning can match a 14× larger model on medium-difficulty problems.

OpenAI o1 trains reasoning via RL. The model learns to generate internal "thinking" tokens before the final answer. The reward comes from verifiable outcomes: correct math, passing tests. This is not prompting — the reasoning policy is baked into the weights.

DeepSeek-R1-Zero discovered reasoning without any human chain-of-thought data. Using GRPO (Group Relative Policy Optimization) on the base model with only correctness rewards, reasoning emerged spontaneously — the model learned that showing intermediate steps leads to more correct answers. This "aha moment" proved reasoning is an emergent RL behavior, not something that must be explicitly taught. (The shipped DeepSeek-R1 adds a cold-start SFT stage on thousands of long-CoT examples before RL.)

Test-time Compute Scaling reframes the scaling question: instead of spending more on training a bigger model (fixed cost), spend more at inference via chain-of-thought, search, and verification (per-query cost). o1 and o3 demonstrated that doubling inference compute can outperform training a substantially larger model for reasoning tasks. Budget forcing lets the model allocate a "thinking budget" flexibly — spending more tokens on hard sub-problems and less on trivial ones — rather than a fixed chain length.
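
A hedged sketch of one way to impose a thinking budget at decode time. Real implementations differ per lab; model.step, the <think>/</think> delimiters, and <eos> here are assumptions for illustration, not any specific API.

def generate_with_thinking_budget(model, prompt: str,
                                  max_think_tokens: int = 4096,
                                  max_answer_tokens: int = 512) -> str:
    """Cap the number of 'thinking' tokens, then force the final answer.

    Assumes a hypothetical model.step(context) -> next-token API and a model
    that wraps its reasoning in <think> ... </think> delimiters.
    """
    context = prompt + "<think>"
    for _ in range(max_think_tokens):
        token = model.step(context)
        context += token
        if token == "</think>":      # model decided it has thought enough
            break
    else:
        context += "</think>"        # budget exhausted: force the end of thinking

    answer = []
    for _ in range(max_answer_tokens):
        token = model.step(context)
        if token == "<eos>":
            break
        context += token
        answer.append(token)
    return "".join(answer)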

DeepSeek-R1-Zero (2025) discovered chain-of-thought reasoning purely through RL, without any supervised CoT training data. Using GRPO with only outcome correctness as the reward signal, the model spontaneously learned to "think" — producing self-verification, backtracking, and multi-step decomposition as emergent behaviors. The key insight: RL can discover reasoning strategies without explicit instruction, because showing intermediate steps is simply what maximizes the correctness reward.
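
To make "only correctness rewards" concrete, here is a sketch of a rule-based outcome reward in the spirit of R1-Zero: it checks the final boxed answer and the <think> format, with no step-level supervision. The tag names, answer format, and the 1.0/0.1 weights are illustrative assumptions, not the paper's exact recipe.

import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: final-answer correctness plus a small format bonus.

    No learned reward model and no step labels; the only signal GRPO sees is
    whether the outcome was right and whether the CoT format was followed.
    (The weights below are illustrative, not the paper's values.)
    """
    # Format reward: reasoning must be wrapped in <think> ... </think>
    format_ok = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))

    # Accuracy reward: the answer inside \boxed{...} matches the gold label
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = match.group(1).strip() if match else None
    correct = predicted is not None and predicted == gold_answer.strip()

    return 1.0 * float(correct) + 0.1 * float(format_ok)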

Search-Augmented Reasoning goes further by combining best-of-N sampling with process reward models (PRMs) to do beam search over reasoning paths. Rather than generating N independent full chains (wasteful if a chain fails at step 2), you score each reasoning step, prune low-scoring branches early, and expand only the most promising continuations — effectively "beam search over thoughts." This approach outperforms simple majority voting at the same compute budget.

Process Reward Models (PRM) score each reasoning step, not just the final answer. Outcome reward models (ORM) only know if the answer is right; PRMs know which step went wrong. This enables denser training signal and early pruning of bad reasoning paths.

Tree of Thoughts: structured search over reasoning. Best-of-N generates N independent full reasoning chains — wasteful because a chain that goes wrong at step 2 still runs all remaining steps. Tree of Thoughts (Yao et al., 2023) generalizes CoT into a deliberate search process: at each step, generate k candidate continuations, evaluate their promise (with a value function or majority vote), and expand only the most promising branches — pruning the rest. This is BFS or DFS over a reasoning tree, where the PRM acts as the heuristic. The result: the same compute budget explores far more diverse reasoning strategies rather than re-running similar chains from scratch. On the “Game of 24” task (combine four numbers with arithmetic to reach 24), ToT with BFS solved 74% of problems vs. CoT's 4% — because many problems require backtracking, which flat CoT cannot do.
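
The two search variants above share the same skeleton: propose a few candidate next steps, score the partial traces, keep only the best branches. A sketch under assumed helpers (model.propose_steps, model.is_final, and prm.score are hypothetical):

def reasoning_beam_search(model, prm, question: str,
                          beam_width: int = 4, branch_factor: int = 4,
                          max_depth: int = 8) -> list[str]:
    """Beam search over reasoning steps (Tree-of-Thoughts style, PRM as heuristic).

    Assumed helpers:
      model.propose_steps(question, steps, k) -> k candidate next steps (strings)
      model.is_final(step)                    -> True if the step states a final answer
      prm.score(question, steps)              -> P(last step correct | earlier steps), in [0, 1]
    """
    beams = [([], 1.0)]                        # (partial trace, cumulative PRM score)
    for _ in range(max_depth):
        candidates = []
        for steps, score in beams:
            if steps and model.is_final(steps[-1]):
                candidates.append((steps, score))        # finished branch, keep as-is
                continue
            for step in model.propose_steps(question, steps, k=branch_factor):
                new_steps = steps + [step]
                candidates.append((new_steps, score * prm.score(question, new_steps)))
        # Prune: keep only the most promising partial traces
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(steps and model.is_final(steps[-1]) for steps, _ in beams):
            break
    return beams[0][0]                         # best-scoring reasoning trace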

✨ Insight · Think of ORM as grading an exam by the final answer only. PRM is like a teacher who checks each line of work — partial credit for correct steps, immediate feedback on mistakes. PRM catches "right answer, wrong reasoning" cases that ORM misses.

Quick check

Derivation

PaLM 540B jumps from ~18% to ~57% on GSM8K with CoT prompting. What explains this gain without any weight update?

Quick Check

Why did reasoning emerge in DeepSeek-R1 without human chain-of-thought training data?

📐

Step-by-Step Derivation

Test-Time Compute Tradeoff

Given a total compute budget $C$, allocate between model size $N$ (fixed training cost) and test-time thinking tokens $T$ (per-query cost):

$$C \;=\; \underbrace{C_{\text{train}}(N)}_{\text{fixed}} \;+\; \underbrace{Q \cdot c(N)\, T}_{\text{per query}}$$

where $Q$ is the number of queries served and $c(N)$ is the per-token inference cost of an $N$-parameter model. For easy questions, a small $T$ suffices (direct answer). For medium questions, increasing $T$ (more thinking) is more cost-effective than increasing $N$ (bigger model). For the hardest questions, no amount of $T$ helps — you need a bigger model.

💡 Tip · Snell et al. (2024): "A smaller model with test-time compute can match a 14x larger model" — but only on medium-difficulty problems. The optimal strategy adapts compute allocation per question.
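
One hedged way to "adapt compute allocation per question": probe with a few cheap, short completions first, and spend the large thinking budget only when they disagree. model.generate(question, max_think_tokens=...) and extract_answer are assumptions, and the agreement threshold is arbitrary.

from collections import Counter

def adaptive_thinking(model, question: str,
                      probe_samples: int = 4,
                      small_budget: int = 256,
                      large_budget: int = 8192) -> str:
    """Allocate test-time compute per question (in the spirit of Snell et al., 2024).

    Easy questions are answered by near-unanimous cheap probes; only questions
    where the probes disagree get the expensive long-thinking pass.
    """
    probes = [
        extract_answer(model.generate(question, max_think_tokens=small_budget))
        for _ in range(probe_samples)
    ]
    answer, votes = Counter(probes).most_common(1)[0]
    if votes >= probe_samples - 1:      # near-unanimous: treat as easy, return cheaply
        return answer
    # Disagreement: likely medium/hard, so think longer on this one question
    return extract_answer(model.generate(question, max_think_tokens=large_budget))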

Process Reward Model (PRM) Scoring

Given a reasoning trace with steps $s_1, s_2, \ldots, s_n$, the PRM scores each step conditioned on the steps before it. The overall correctness score:

$$P(\text{trace correct}) \;=\; \prod_{t=1}^{n} p_{\text{PRM}}\!\left(s_t\ \text{correct} \mid s_1, \ldots, s_{t-1}\right)$$

If any single step has low probability, the whole trace is flagged. This is used for best-of-N sampling: generate N reasoning traces, score each with PRM, return the highest-scoring one.

GRPO Group Advantage

For prompt $q$, sample a group of $G$ outputs $o_1, \ldots, o_G$ with rewards $r_1, \ldots, r_G$. Compute each output's advantage relative to the group:

$$A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

No critic network needed. Outputs better than the group mean are reinforced; worse ones are suppressed. The normalization makes advantages zero-mean, stabilizing training. DeepSeek-R1 uses $G = 64$ completions per prompt.

Python: Chain-of-Thought Prompting and PRM Scoring

python
def cot_prompt(question: str) -> str:
    """Add chain-of-thought instruction to a question."""
    return f"""{question}

Let's think step by step:
1. First, identify what we need to find.
2. Break the problem into smaller parts.
3. Solve each part.
4. Combine and verify the answer."""


def process_reward_score(
    prm_model,
    reasoning_steps: list[str],
) -> float:
    """Score a reasoning trace with a Process Reward Model.

    Each step is scored conditioned on previous steps.
    Overall score = product of step-level probabilities.
    """
    score = 1.0
    context = []
    for step in reasoning_steps:
        context.append(step)
        # PRM predicts P(step is correct | previous steps)
        step_score = prm_model.score(context)  # -> float in [0, 1]
        score *= step_score
    return score


def best_of_n_with_prm(
    model, prm_model, prompt: str, n: int = 8
) -> str:
    """Generate N reasoning traces, return the best one."""
    traces = [model.generate(prompt) for _ in range(n)]
    scores = [
        process_reward_score(prm_model, trace.steps)
        for trace in traces
    ]
    best_idx = scores.index(max(scores))
    return traces[best_idx].final_answer
PyTorch implementation
# GRPO-style policy update (Group Relative Policy Optimization)
import torch

def grpo_loss(log_probs_new, log_probs_old, rewards, clip_eps=0.2):
    """
    log_probs_new / log_probs_old: (G,) log-probs of each output under new/old policy
    rewards: (G,) scalar reward per output
    """
    # Normalize rewards within the group -> advantage
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)

    # PPO-style clipped ratio
    ratio = (log_probs_new - log_probs_old).exp()  # (G,)
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)

    # Policy gradient loss (negative because we maximize reward)
    loss = -torch.min(ratio * adv, clipped * adv).mean()
    return loss
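
A quick smoke test of grpo_loss on dummy data; the rewards below are made-up binary correctness scores, just to show the expected shapes.

# Toy usage: G = 8 sampled outputs for one prompt, binary correctness rewards
G = 8
log_probs_old = torch.randn(G)
log_probs_new = (log_probs_old + 0.05 * torch.randn(G)).requires_grad_()
rewards = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])   # 1 = correct answer

loss = grpo_loss(log_probs_new, log_probs_old, rewards)
loss.backward()        # gradients flow only through the new policy's log-probs
print(f"GRPO loss: {loss.item():.4f}")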

Quick check

Derivation

GRPO normalizes advantages within a group of G outputs. If all G outputs receive the same reward (all correct or all wrong), what happens to the policy update?

🔧

Break It — See What Happens

Force short reasoning (limit thinking tokens)
Only reward final answer (no process reward)

Quick check

Trade-off

You replace PRM with ORM (outcome only) in a best-of-64 setup for MATH. Lightman et al. show ORM achieves 72.4% vs PRM's 78.2%. What is the mechanism of the 5.8-point loss?

📊

Real-World Numbers

| Model / Technique | Benchmark | Score | Notes |
|---|---|---|---|
| PaLM 540B (no CoT) | GSM8K | ~18% | Direct answer prompting |
| PaLM 540B + CoT | GSM8K | ~57% | +39 pts from step-by-step prompting (Wei et al., 2022) |
| PaLM 540B + CoT + self-consistency | GSM8K | ~74% | Wang et al. 2022, majority vote over sampled CoTs |
| GPT-4o | AIME 2024 | 13.4% | Competition math, no reasoning training (OpenAI o1 report) |
| o1 (full) | AIME 2024 | 74.4% pass@1 (83.3% cons@64) | RL-trained reasoning, test-time compute |
| o1 (full) | MATH | — | Near-perfect on competition math |
| DeepSeek-R1 | AIME 2024 | — | Open-source, GRPO-trained reasoning |
| PRM + best-of-N | MATH | 78.2% | vs. 72.4% for ORM (Lightman et al.) |
✨ Insight · The reasoning revolution: CoT gives a large gain on math with zero training. RL-trained reasoning (o1 full model) pushes AIME 2024 from ~13% (GPT-4o) to 74.4% pass@1 (83.3% with 64-sample consensus). And DeepSeek-R1-Zero showed you don't even need human CoT data — reasoning emerges from outcome-based RL alone (DeepSeek-R1 itself adds a cold-start SFT stage before RL).

Quick check

Derivation

o1 achieves 83.3% AIME 2024 at consensus@64 and 74.4% pass@1. What is the implied accuracy gain per additional sample in this consensus scheme?

🚀

SOTA 2024–2025: Test-Time Compute Scaling

Snell et al. 2024 established the second scaling axis formally: a smaller model given extra test-time compute can match a 14× larger model on medium-difficulty problems (per arxiv:2408.03314, 2024-08). The year following saw every major lab ship a “reasoning mode.”

| Model | Benchmark | Score | Notes |
|---|---|---|---|
| o3 (high compute) | ARC-AGI | 87.5% | vs. GPT-4o 5%; ~$100–200/task at high compute (per openai.com/index/introducing-o3-and-o4-mini, 2025-04) |
| o3 / o4-mini | AIME | 96.7% (o3) | per openai.com/index/introducing-o3-and-o4-mini, 2025-04 |
| o3 | SWE-bench | — | per openai.com/index/introducing-o3-and-o4-mini, 2025-04 |
| Claude 3.7 Sonnet (extended thinking) | SWE-bench | 62.3% | Toggleable budget up to 100K thinking tokens (per anthropic.com/news/visible-extended-thinking, 2025-02) |
| Gemini 2.5 Pro (Deep Think) | IMO 2024 | — | 1M context, native multimodal hybrid thinking mode (per deepmind.google blog, 2025) |
| Grok 3 | AIME / reasoning | — | 10× compute scale on Colossus cluster; RL at scale for reasoning (per x.ai/news/grok-3, 2025-02) |
| Claude Opus 4 / Sonnet 4 | SWE-bench | — | Extended thinking, 200K context, Constitutional AI v2; hybrid reasoning mode selectable per-request (per anthropic.com/news/claude-4-0, 2025-05) |
| Grok 4 | AIME 2025 | — | Colossus 200K H100 cluster, 1M context, real-time X data integration; deep reasoning mode (per x.ai/blog/grok-4, 2025-08) |
| Kimi K2 | AIME 2025 | — | Agentic-tuned, MCP-native; open weights (per github.com/MoonshotAI/Kimi-K2 + moonshot.ai blog, 2025-07) |
| MiniMax-M1 | MATH-500 | — | Lightning attention + 1M native context; hybrid linear+softmax attention (per arxiv:2506.13585, 2025-06) |
| DeepSeek-R1-0528 | AIME 2024 | 87.5% | Up from 70% on base R1; distilled into Qwen-3 8B; improved search and verification (per huggingface.co/deepseek-ai/DeepSeek-R1-0528, 2025-05) |
| DeepSeek-V3.1 | LiveCodeBench | — | Hybrid thinking/non-thinking modes; open weights (per huggingface.co/deepseek-ai/DeepSeek-V3-1, 2025-08) |
| Gemini 2.5 Deep Think | IMO 2025 | Gold (35/42) | Parallel sampling across multiple reasoning chains; extended thinking budget (per blog.google, 2025-08) |
✨ Insight · The “test-time compute scaling” axis is now a first-class design choice: o3 at high compute costs ~$100–200/ARC-AGI task but scores 87.5%, while the same model at low compute scores ~55%. Claude 3.7 Sonnet exposes this as an explicit token budget knob. Gemini 2.5 Pro's Deep Think mode reaches IMO gold (35/42) — a task that requires multi-step symbolic reasoning beyond pattern-matching. By mid-2025, every major lab ships a reasoning mode: DeepSeek-R1-0528 hits 87.5% AIME 2024 (up from 70%), Kimi K2 (1T/32B MoE) is MCP-native, and MiniMax-M1 runs 1M-context with hybrid linear attention.
🧠

Key Takeaways

What to remember for interviews

  1. Chain-of-thought prompting forces the model to generate intermediate reasoning steps before answering, improving accuracy on multi-step tasks by allocating more compute (tokens) to the problem.
  2. Test-time compute scaling lets a smaller model think longer at inference rather than training a bigger model — a smaller model with extended reasoning can match a 14× larger model on medium-difficulty problems.
  3. OpenAI o1 trains reasoning via RL: the model learns a 'thinking' policy where the reward comes from verifiable outcomes (correct math, passing tests) rather than human-written CoT examples.
  4. DeepSeek-R1-Zero discovered chain-of-thought reasoning purely through outcome-based RL (GRPO) — no human CoT data was needed. Reasoning is an emergent behavior when RL optimizes for correctness. (The shipped DeepSeek-R1 adds a cold-start SFT stage before RL.)
  5. Process Reward Models (PRM) score each reasoning step individually rather than just the final answer, enabling early pruning of bad reasoning paths and catching flawed logic that produces lucky correct answers.
  6. SOTA 2025: o3 scores 87.5% on ARC-AGI (high compute) and 96.7% AIME; Claude 3.7 Sonnet reaches 62.3% SWE-bench with extended thinking; Gemini 2.5 Pro Deep Think achieves IMO gold at 35/42.
🧠

Recap quiz

Trade-off

o1 achieves 83.3% on AIME 2024 at consensus@64, vs GPT-4o's 13.4%. What does the 64-sample gap reveal about o1's approach?

Trade-off

PRM achieves 78.2% on MATH vs ORM's 72.4% with best-of-N sampling. Why does the PRM gap increase as N grows?

Trade-off

GRPO samples G=64 outputs per prompt to estimate advantages. What problem does this solve compared to PPO's approach?

Trade-off

Tree of Thoughts solved 74% of Game-of-24 tasks vs CoT's 4%. What specific capability does ToT have that flat CoT lacks?

Trade-off

A team wants to use test-time compute scaling to improve accuracy on a coding benchmark. Snell et al. say this works best for “medium-difficulty” problems. How should they select which problems get more thinking budget?

Trade-off

DeepSeek-R1-Zero discovered chain-of-thought reasoning with no CoT training data, using only outcome-correctness rewards. What is the key implication for future alignment research?

Trade-off

PRM requires step-level human annotations, which are ~10× more expensive than outcome labels. When is this cost justified?

📚

Further Reading

🎯

Interview Questions


What is chain-of-thought prompting and why does it improve reasoning accuracy?

★☆☆
Google · OpenAI

How does OpenAI's o1 differ from standard CoT prompting? What is its training approach?

★★☆
OpenAI

Explain DeepSeek-R1's 'aha moment.' How did reasoning emerge without human CoT data?

★★★
OpenAI · Meta

What is the difference between outcome reward models (ORM) and process reward models (PRM)?

★★☆
OpenAI · Google

Explain GRPO (Group Relative Policy Optimization) and how it computes advantages without a critic.

★★★
OpenAI · Meta

When should you scale test-time compute (think longer) vs. use a bigger model? What is the tradeoff?

★★☆
Google · Anthropic