Verifiers & Process Reward Models
An ORM picking the best of 1,860 sampled solutions reaches 47.5% MATH accuracy. A PRM hits the same accuracy with only 400 — by scoring each reasoning step instead of just the final answer.
A model generates an answer. But how do you know it reasoned correctly? Verifiers and Process Reward Models score each reasoning step — not just the final answer. Combined with best-of-N sampling and tree search, this unlocks a powerful idea: spending more compute at inference time to get better results without retraining.
Process vs Outcome Reward Scoring
What you are seeing: Four candidate solutions for 17 x 23 = 391. Each shows step-by-step reasoning scored by a verifier. Green steps are correct, red steps have errors.
What to try: Switch between PRM and ORM scoring. Notice how ORM gives Trajectory 2 a perfect score despite flawed reasoning — it only checks the final answer. PRM catches the error. Click any trajectory to expand details.
The Intuition
ORM vs PRM: An Outcome Reward Model checks the final answer — right or wrong. A Process Reward Model scores every intermediate step. For math and code, PRM often outperforms ORM because it catches flawed reasoning that happens to reach the right answer. It also provides denser training signal — one label per step rather than one per solution.
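A toy numeric sketch of that contrast (the step scores are made up for illustration): on a trajectory that reaches 391 through a flawed middle step, an ORM still awards full credit, while a PRM's product of step scores collapses.

# Made-up step scores for the 17 x 23 = 391 demo above: the middle step is flawed,
# but the trajectory still lands on the right final answer.
step_scores = [0.95, 0.20, 0.90]          # hypothetical P(step correct) from a verifier
final_answer_correct = True

orm_score = 1.0 if final_answer_correct else 0.0   # outcome-only: right answer = full credit
prm_score = 1.0
for p in step_scores:                              # process: every step multiplies in
    prm_score *= p

print(f"ORM: {orm_score:.2f}   PRM: {prm_score:.2f}")   # ORM: 1.00   PRM: 0.17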
Best-of-N sampling: Generate N candidate solutions, score each with a verifier, return the best one. No training needed — pure inference-time strategy. Quality improves with N, but with diminishing returns (logarithmic scaling).
Self-consistency: Generate N reasoning chains, take a majority vote on the final answer. No verifier needed — the signal comes from agreement. Correct reasoning converges; errors are random.
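A minimal self-consistency sketch, assuming you already have some way to sample one reasoning chain and extract its final answer — the sample_answer callable and the toy model below are hypothetical stand-ins, not a specific library API.

from collections import Counter
import random

def self_consistency(sample_answer, n=40):
    """Majority vote over n independently sampled final answers.

    sample_answer is any callable that runs one stochastic reasoning chain and
    returns its final answer (e.g. a temperature-sampled LLM call followed by
    answer extraction).
    """
    votes = Counter(sample_answer() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n               # winning answer and its vote share

# Toy demo: a "model" that is right 60% of the time, else makes a random error.
toy = lambda: 391 if random.random() < 0.6 else random.choice([381, 401, 351])
print(self_consistency(toy, n=40))         # usually (391, ~0.6)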
When to think longer vs use a bigger model: Test-time compute (more samples, deeper search) helps most when (1) you have a reliable verifier, (2) the task is about execution not knowledge, and (3) difficulty varies. For knowledge-heavy or open-ended tasks, a bigger model wins.
Verifier-guided tree search: Instead of generating N complete solutions, generate step-by-step. At each step, branch into K candidates, score with PRM, prune bad branches, expand good ones. More efficient than best-of-N because you don't waste compute finishing clearly wrong reasoning chains.
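A sketch of that idea as step-level beam search. Here propose_steps (the generator), score_step (the PRM), and is_final (an answer detector) are hypothetical callables you would supply; this is an illustrative skeleton, not any particular system's implementation.

def prm_guided_search(prompt, propose_steps, score_step, is_final,
                      beam_width=4, expand_k=4, max_depth=10):
    """Step-level beam search guided by a PRM.

    propose_steps(prompt, steps, k) -> k candidate next steps (hypothetical generator call)
    score_step(prompt, steps)       -> P(latest step correct) from a PRM (hypothetical)
    is_final(step)                  -> True if the step states a final answer
    """
    beam = [([], 1.0)]                                   # (steps so far, product of step scores)
    for _ in range(max_depth):
        candidates = []
        for steps, score in beam:
            if steps and is_final(steps[-1]):            # finished trajectory: carry forward
                candidates.append((steps, score))
                continue
            for step in propose_steps(prompt, steps, expand_k):
                new_steps = steps + [step]
                candidates.append((new_steps, score * score_step(prompt, new_steps)))
        # Prune: keep only the top-scoring (partial) trajectories
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(steps and is_final(steps[-1]) for steps, _ in beam):
            break
    return beam[0]                                       # best trajectory found and its PRM score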
Quick check
ORM rewards a trajectory that reaches the correct answer through a flawed intermediate step. What is the first-order consequence for a model trained on those labels?
Why does PRM outperform ORM on math reasoning tasks?
Step-by-Step Derivation
Best-of-N Expected Quality
For N i.i.d. samples from a distribution with mean μ and standard deviation σ, the expected maximum scales as E[max of N samples] ≈ μ + σ√(2 ln N). The √(ln N) growth is why doubling N yields smaller and smaller gains.
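A quick Monte Carlo check of that scaling, assuming per-sample quality is roughly Gaussian; the closed form is an asymptotic approximation, so it is loose at small N.

import math
import random

mu, sigma = 60.0, 10.0        # per-sample quality: mean and standard deviation
for n in [1, 2, 4, 16, 64, 256]:
    # Empirical E[max of n samples], averaged over 2000 trials
    sim = sum(max(random.gauss(mu, sigma) for _ in range(n)) for _ in range(2000)) / 2000
    approx = mu + sigma * math.sqrt(2 * math.log(n))
    print(f"N={n:>3}  simulated E[max]={sim:5.1f}  mu + sigma*sqrt(2 ln N)={approx:5.1f}")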
PRM Scoring (Process Reward)
A PRM assigns a correctness score p(step t correct) to each step. The overall trajectory score is typically the product (or minimum) of step scores. For a trajectory with T steps: score = ∏ p(step t correct), with the product taken over t = 1…T.
Using the product means one bad step tanks the whole score. An alternative is the minimum, min over t of p(step t correct), which focuses on the weakest link. Both penalize trajectories with any flawed step, unlike ORM which only sees the endpoint.
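The two aggregation choices side by side (illustrative scores, not from any real verifier):

def product_score(step_scores):
    score = 1.0
    for p in step_scores:
        score *= p            # every step multiplies in; one weak step tanks the total
    return score

def min_score(step_scores):
    return min(step_scores)   # only the weakest link matters

steps = [0.95, 0.92, 0.35, 0.94]   # one flawed step in an otherwise solid trajectory
print(product_score(steps))        # ≈ 0.29
print(min_score(steps))            # 0.35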
Majority Vote Accuracy Scaling
If each of N independent samples has probability p > 0.5 of being correct, majority-vote accuracy approaches 1 as N grows (Condorcet's Jury Theorem): P(majority correct) = Σ_{k > N/2} C(N, k) p^k (1 − p)^(N − k), which tends to 1 as N → ∞.
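A direct computation of that probability from the binomial distribution (odd N is assumed so there are no ties):

from math import comb

def majority_vote_accuracy(p, n):
    """P(majority of n independent votes is correct), each vote correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in [1, 5, 15, 41]:
    print(n, round(majority_vote_accuracy(0.6, n), 3))
# With p = 0.6 the accuracy climbs toward 1 as N grows (~0.68 at N=5, ~0.79 at N=15)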
PyTorch: Best-of-N with Reward Model
import torch

def best_of_n(prompt, generator, reward_model, tokenizer, n=8):
    """Generate N completions, return the one with the highest reward."""
    # Generate N candidate completions
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = generator.generate(
        **inputs,
        num_return_sequences=n,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=512,
    )
    completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Score each completion with the reward model
    scores = []
    for completion in completions:
        reward_input = tokenizer(completion, return_tensors="pt")
        score = reward_model(**reward_input).logits.squeeze()
        scores.append(score.item())

    # Return the highest-scoring completion
    best_idx = max(range(n), key=lambda i: scores[i])
    return completions[best_idx], scores[best_idx]

def prm_score(steps, prm_model, tokenizer):
    """Score a reasoning trajectory step-by-step with a PRM."""
    step_scores = []
    context = ""
    for step in steps:
        context += step + " "
        inputs = tokenizer(context, return_tensors="pt")
        # PRM outputs P(correct) for the latest step
        score = torch.sigmoid(prm_model(**inputs).logits[:, -1])
        step_scores.append(score.item())

    # Trajectory score = product of step scores
    trajectory_score = 1.0
    for s in step_scores:
        trajectory_score *= s
    return trajectory_score, step_scores

PyTorch implementation
# Process Reward Model: token-level reward prediction head
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Attach a scalar reward head to a transformer for step-level scoring."""
    def __init__(self, base_model, d_model: int):
        super().__init__()
        self.base = base_model
        self.reward_head = nn.Linear(d_model, 1)  # P(step correct)

    def forward(self, input_ids, step_token_mask):
        """
        input_ids: (B, T) full reasoning trace
        step_token_mask: (B, T) 1 at the last token of each step, 0 elsewhere
        Returns step_scores: (total_steps,) flat reward per step boundary
        (use step_token_mask to recover per-sequence grouping).
        """
        hidden = self.base(input_ids).last_hidden_state  # (B, T, D)
        logits = self.reward_head(hidden).squeeze(-1)     # (B, T)
        # Boolean indexing flattens across batch → (total_steps,)
        step_scores = logits[step_token_mask.bool()]
        return torch.sigmoid(step_scores)
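A minimal usage sketch, reusing the class and imports above. The toy encoder is a stand-in so the snippet runs end-to-end; in practice base_model would be a pretrained transformer whose final hidden states feed the reward head.

from types import SimpleNamespace

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained transformer, just so the sketch runs."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
    def forward(self, input_ids):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))

prm = ProcessRewardModel(ToyEncoder(), d_model=32)
input_ids = torch.randint(0, 100, (1, 12))        # fake 12-token reasoning trace
step_token_mask = torch.zeros(1, 12)
step_token_mask[0, [3, 7, 11]] = 1                # last token of each of 3 steps
print(prm(input_ids, step_token_mask))            # three P(step correct) values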
Quick check
Best-of-N quality scales as E[max] ≈ μ + σ√(2 ln N). You run N=4 and see accuracy = 60%. Roughly what N is needed to reach 70%, assuming σ=10 and a linear mapping from E[max] to accuracy?
Break It — See What Happens
Quick check
A PRM verifier is trained on human ratings of “confident and detailed” steps rather than logical correctness. What failure mode emerges?
Real-World Numbers
| Result | Source | Details |
|---|---|---|
| PRM 78.2% vs ORM 72.4% (best-of-1860) | Let's Verify Step by Step (Lightman et al., 2023) | MATH benchmark, best-of-1860 with GPT-4 base. PRM selects better solutions by scoring reasoning steps, not just answers. |
| Best-of-N: log scaling | Scaling Test-Time Compute (Snell et al., 2024) | N=1 to N=256: accuracy improves ~15 points on MATH but flattens. Compute-optimal strategies outperform brute-force best-of-N. |
| Small model matches a 14x larger one | Snell et al., 2024 | A small model with optimal test-time compute allocation matches a 14x larger model on math tasks with a good verifier. |
| Learned test-time compute allocation | OpenAI o1 (2024) | Uses internal chain-of-thought with test-time compute scaling. Thinks longer on harder problems via a learned search strategy. |
| GSM8K ~56% → ~74% via self-consistency | Wang et al., 2022 | Majority vote over 40 CoT samples — no verifier needed, just sample more and vote. |
Quick check
o1 scored 94.8% on MATH (pass@1) while PRM best-of-N (same base model class) peaks around 78.2%. Self-consistency at N=40 reaches ~74% on GSM8K. Rank these strategies by inference cost per correct answer.
Key Takeaways
What to remember for interviews
1. ORM scores only the final answer; PRM scores every intermediate step — PRM catches flawed reasoning that gets lucky, improving MATH accuracy from 72.4% (ORM) to 78.2% (PRM) with the same base model.
2. Best-of-N sampling improves quality with diminishing log-returns: E[max] scales as μ + σ√(2 ln N), so going from N=64 to N=128 yields far less gain than N=1 to N=2.
3. Self-consistency (majority vote over N chains) requires no verifier — correct reasoning paths converge, errors are random. Gains up to +15% on GSM8K with N=40 samples.
4. Verifier-guided tree search is more compute-efficient than best-of-N: it prunes bad reasoning branches early instead of wasting compute completing clearly wrong trajectories.
5. A small model with optimal test-time compute allocation can match a 14x larger model on math tasks — but only when a reliable verifier exists.
6. SOTA 2025: o3 ARC-AGI 87.5% (high compute) validates test-time compute + verifier at production scale; AlphaProof+AlphaGeometry-2 reached IMO 2024 silver (28/42); ThinkPRM matches large discriminative PRMs with only 1% of labels via CoT verification.
SOTA 2024–2025: PRM Advances
o3 ARC-AGI: 87.5% at High Compute (2025)
The most public validation of test-time compute scaling via search + verifier: o3 scored 87.5% on ARC-AGI in the high-compute setting, vs. GPT-4o at 5%. The gap is almost entirely attributable to extended test-time compute with internal verification — not a larger model (per arcprize.org/blog/oai-o3-pub-breakthrough, 2024-12).
ThinkPRM — CoT Verifier with 1% Labels (arxiv:2501.07301, 2025-01)
Traditional discriminative PRMs require step-level labels at scale (~800K for PRM800K). ThinkPRM instead trains a verifier to produce a chain-of-thought explanation of why a step is correct or incorrect. Result: it matches large discriminative PRMs while using only ~1% of the step-level labels — the CoT explanation substitutes for labeled supervision.
ProcessBench — 3,400 Step-Error Cases (arxiv:2504.16828, 2025-04)
A systematic evaluation benchmark with 3,400 math reasoning traces annotated at the step level for error location. Reveals that most current PRMs fail to pinpoint the exact error step even when they correctly score the overall solution as wrong — motivating CoT-based verifiers like ThinkPRM.
AlphaProof + AlphaGeometry-2: IMO 2024 Silver (DeepMind, 2024)
AlphaProof (RL on formal Lean proofs) + AlphaGeometry-2 (neural symbolic geometry) combined for a silver-medal score at IMO 2024: 28/42 points. AlphaGeometry-2 alone solves 83% of historical IMO geometry problems (per deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level, 2024-07). These systems use formal verifiers (Lean proof checker, geometry solver) as the reward signal — a production RLVR deployment.
Recap quiz
Verifier & PRM recap
Lightman et al. tested PRM vs ORM at best-of-1860 on MATH. Which outcome most accurately describes the result?
Self-consistency on GSM8K requires no trained verifier but still lifts accuracy from ~56% to ~74% at N=40. What condition must hold for majority vote to keep improving with more samples?
Snell et al. showed a small model with optimal test-time compute can match a 14x larger model on math tasks. What is the critical prerequisite for this to hold?
PRM800K required ~800K step-level correctness labels across 4,500 MATH problems (~200 labeled steps per problem). What does this annotation cost imply for production PRM deployment?
o1 reached 94.8% on MATH (pass@1) and 83.3% on AIME (consensus@64). What best explains why o1's MATH score is so much higher than best-of-N PRM (78.2%) using the same base model class?
Why does verifier-guided tree search outperform best-of-N at the same total token budget?
A PRM uses the product of step scores: score = ∏ p(step t correct). A trajectory has 5 steps with scores [0.9, 0.9, 0.4, 0.9, 0.9]. What is the trajectory score, and what does it reveal about the product aggregation choice?
Further Reading
- Let's Verify Step by Step — Lightman et al., 2023. The foundational PRM paper. Shows process supervision outperforms outcome supervision on MATH with ~800K step-level human labels (the PRM800K dataset).
- Scaling LLM Test-Time Compute Optimally — Snell et al., 2024. Analyzes compute-optimal strategies for test-time scaling: when to search vs when to use a bigger model.
- OpenAI o1 Technical Report — OpenAI, 2024. Demonstrates learned test-time compute allocation via internal chain-of-thought reasoning.
- Self-Consistency Improves Chain of Thought Reasoning — Wang et al., 2022. Majority vote over sampled reasoning paths — simple, verifier-free, surprisingly effective.
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision — Luo et al., 2024. Uses Monte Carlo tree search to automatically generate step-level labels for PRM training without human annotation.
- DeepSeek-R1: Incentivizing Reasoning Capability via Reinforcement Learning — DeepSeek, 2025. Outcome-based RL (GRPO) with no PRM; shows that dense process supervision is not always necessary when the reward signal is sufficiently clear.
- Lilian Weng — Self-Consistency and Process Rewards. Covers the spectrum from self-consistency (no verifier) through ORM to PRM — useful for understanding the tradeoffs in test-time compute strategies.
Interview Questions
★★☆ Explain the difference between ORM and PRM. When does PRM significantly outperform ORM?
★★☆ How does best-of-N sampling work? What are its computational tradeoffs?
★★☆ What is self-consistency and how does it differ from best-of-N with a reward model?
★★★ When should you invest in test-time compute vs. training a bigger model?
★★★ How does verifier-guided tree search work? Compare it to best-of-N.
★★★ How would you train a Process Reward Model? What data do you need?