Verifiers & Process Reward Models
An ORM picking the best of 1,860 sampled solutions reaches 47.5% MATH accuracy. A PRM hits the same accuracy with only 400 — by scoring each reasoning step instead of just the final answer.
A model generates an answer. But how do you know it reasoned correctly? Verifiers and Process Reward Models score each reasoning step — not just the final answer. Combined with best-of-N sampling and tree search, this unlocks a powerful idea: spending more compute at inference time to get better results without retraining.
Process vs Outcome Reward Scoring
What you are seeing: Four candidate solutions for 17 x 23 = 391. Each shows step-by-step reasoning scored by a verifier. Green steps are correct, red steps have errors.
What to try: Switch between PRM and ORM scoring. Notice how ORM gives Trajectory 2 a perfect score despite flawed reasoning — it only checks the final answer. PRM catches the error. Click any trajectory to expand details.
The Intuition
ORM vs PRM: An Outcome Reward Model checks the final answer — right or wrong. A Process Reward Model scores every intermediate step. For math and code, PRM often outperforms ORM because it catches flawed reasoning that happens to reach the right answer. It also provides denser training signal — one label per step rather than one per solution.
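A toy numeric sketch of that contrast (the step scores are made up for illustration): on a trajectory that reaches 391 through a flawed middle step, an ORM still awards full credit, while a PRM's product of step scores collapses.

# Made-up step scores for the 17 x 23 = 391 demo above: the middle step is flawed,
# but the trajectory still lands on the right final answer.
step_scores = [0.95, 0.20, 0.90]          # hypothetical P(step correct) from a verifier
final_answer_correct = True

orm_score = 1.0 if final_answer_correct else 0.0   # outcome-only: right answer = full credit
prm_score = 1.0
for p in step_scores:                              # process: every step multiplies in
    prm_score *= p

print(f"ORM: {orm_score:.2f}   PRM: {prm_score:.2f}")   # ORM: 1.00   PRM: 0.17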
Best-of-N sampling: Generate N candidate solutions, score each with a verifier, return the best one. No training needed — pure inference-time strategy. Quality improves with N, but with diminishing returns (logarithmic scaling).
Self-consistency: Generate N reasoning chains, take a majority vote on the final answer. No verifier needed — the signal comes from agreement. Correct reasoning converges; errors are random.
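A minimal self-consistency sketch, assuming you already have some way to sample one reasoning chain and extract its final answer — the sample_answer callable and the toy model below are hypothetical stand-ins, not a specific library API.

from collections import Counter
import random

def self_consistency(sample_answer, n=40):
    """Majority vote over n independently sampled final answers.

    sample_answer is any callable that runs one stochastic reasoning chain and
    returns its final answer (e.g. a temperature-sampled LLM call followed by
    answer extraction).
    """
    votes = Counter(sample_answer() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n               # winning answer and its vote share

# Toy demo: a "model" that is right 60% of the time, else makes a random error.
toy = lambda: 391 if random.random() < 0.6 else random.choice([381, 401, 351])
print(self_consistency(toy, n=40))         # usually (391, ~0.6)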
When to think longer vs use a bigger model: Test-time compute (more samples, deeper search) helps most when (1) you have a reliable verifier, (2) the task is about execution not knowledge, and (3) difficulty varies. For knowledge-heavy or open-ended tasks, a bigger model wins.
Verifier-guided tree search: Instead of generating N complete solutions, generate step-by-step. At each step, branch into K candidates, score with PRM, prune bad branches, expand good ones. More efficient than best-of-N because you don't waste compute finishing clearly wrong reasoning chains.
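A sketch of that idea as step-level beam search. Here propose_steps (the generator), score_step (the PRM), and is_final (an answer detector) are hypothetical callables you would supply; this is an illustrative skeleton, not any particular system's implementation.

def prm_guided_search(prompt, propose_steps, score_step, is_final,
                      beam_width=4, expand_k=4, max_depth=10):
    """Step-level beam search guided by a PRM.

    propose_steps(prompt, steps, k) -> k candidate next steps (hypothetical generator call)
    score_step(prompt, steps)       -> P(latest step correct) from a PRM (hypothetical)
    is_final(step)                  -> True if the step states a final answer
    """
    beam = [([], 1.0)]                                   # (steps so far, product of step scores)
    for _ in range(max_depth):
        candidates = []
        for steps, score in beam:
            if steps and is_final(steps[-1]):            # finished trajectory: carry forward
                candidates.append((steps, score))
                continue
            for step in propose_steps(prompt, steps, expand_k):
                new_steps = steps + [step]
                candidates.append((new_steps, score * score_step(prompt, new_steps)))
        # Prune: keep only the top-scoring (partial) trajectories
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(steps and is_final(steps[-1]) for steps, _ in beam):
            break
    return beam[0]                                       # best trajectory found and its PRM score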
Quick check
ORM rewards a trajectory that reaches the correct answer through a flawed intermediate step. What is the first-order consequence for a model trained on those labels?
Why does PRM outperform ORM on math reasoning tasks?
Step-by-Step Derivation
Best-of-N Expected Quality
For N i.i.d. samples from a distribution with mean μ and standard deviation σ, the expected maximum scales as E[max of N samples] ≈ μ + σ√(2 ln N). The √(ln N) growth is why doubling N yields smaller and smaller gains.
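A quick Monte Carlo check of that scaling, assuming per-sample quality is roughly Gaussian; the closed form is an asymptotic approximation, so it is loose at small N.

import math
import random

mu, sigma = 60.0, 10.0        # per-sample quality: mean and standard deviation
for n in [1, 2, 4, 16, 64, 256]:
    # Empirical E[max of n samples], averaged over 2000 trials
    sim = sum(max(random.gauss(mu, sigma) for _ in range(n)) for _ in range(2000)) / 2000
    approx = mu + sigma * math.sqrt(2 * math.log(n))
    print(f"N={n:>3}  simulated E[max]={sim:5.1f}  mu + sigma*sqrt(2 ln N)={approx:5.1f}")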
PRM Scoring (Process Reward)
A PRM assigns a correctness score p(step t correct) to each step. The overall trajectory score is typically the product (or minimum) of step scores. For a trajectory with T steps: score = ∏ p(step t correct), with the product taken over t = 1…T.
Using the product means one bad step tanks the whole score. An alternative is the minimum, min over t of p(step t correct), which focuses on the weakest link. Both penalize trajectories with any flawed step, unlike ORM which only sees the endpoint.
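The two aggregation choices side by side (illustrative scores, not from any real verifier):

def product_score(step_scores):
    score = 1.0
    for p in step_scores:
        score *= p            # every step multiplies in; one weak step tanks the total
    return score

def min_score(step_scores):
    return min(step_scores)   # only the weakest link matters

steps = [0.95, 0.92, 0.35, 0.94]   # one flawed step in an otherwise solid trajectory
print(product_score(steps))        # ≈ 0.29
print(min_score(steps))            # 0.35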
Majority Vote Accuracy Scaling
If each of N independent samples has probability p > 0.5 of being correct, majority-vote accuracy approaches 1 as N grows (Condorcet's Jury Theorem): P(majority correct) = Σ_{k > N/2} C(N, k) p^k (1 − p)^(N − k), which tends to 1 as N → ∞.
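A direct computation of that probability from the binomial distribution (odd N is assumed so there are no ties):

from math import comb

def majority_vote_accuracy(p, n):
    """P(majority of n independent votes is correct), each vote correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in [1, 5, 15, 41]:
    print(n, round(majority_vote_accuracy(0.6, n), 3))
# With p = 0.6 the accuracy climbs toward 1 as N grows (~0.68 at N=5, ~0.79 at N=15)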
PyTorch: Best-of-N with Reward Model
import torch

def best_of_n(prompt, generator, reward_model, tokenizer, n=8):
    """Generate N completions, return the one with the highest reward."""
    # Generate N candidate completions
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = generator.generate(
        **inputs,
        num_return_sequences=n,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=512,
    )
    completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Score each completion with the reward model
    scores = []
    for completion in completions:
        reward_input = tokenizer(completion, return_tensors="pt")
        score = reward_model(**reward_input).logits.squeeze()
        scores.append(score.item())

    # Return the highest-scoring completion
    best_idx = max(range(n), key=lambda i: scores[i])
    return completions[best_idx], scores[best_idx]

def prm_score(steps, prm_model, tokenizer):
    """Score a reasoning trajectory step-by-step with a PRM."""
    step_scores = []
    context = ""
    for step in steps:
        context += step + " "
        inputs = tokenizer(context, return_tensors="pt")
        # PRM outputs P(correct) for the latest step
        score = torch.sigmoid(prm_model(**inputs).logits[:, -1])
        step_scores.append(score.item())

    # Trajectory score = product of step scores
    trajectory_score = 1.0
    for s in step_scores:
        trajectory_score *= s
    return trajectory_score, step_scores

PyTorch implementation
# Process Reward Model: token-level reward prediction head
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Attach a scalar reward head to a transformer for step-level scoring."""
    def __init__(self, base_model, d_model: int):
        super().__init__()
        self.base = base_model
        self.reward_head = nn.Linear(d_model, 1)  # P(step correct)

    def forward(self, input_ids, step_token_mask):
        """
        input_ids: (B, T) full reasoning trace
        step_token_mask: (B, T) 1 at the last token of each step, 0 elsewhere
        Returns step_scores: (total_steps,) flat reward per step boundary
        (use step_token_mask to recover per-sequence grouping).
        """
        hidden = self.base(input_ids).last_hidden_state  # (B, T, D)
        logits = self.reward_head(hidden).squeeze(-1)     # (B, T)
        # Boolean indexing flattens across batch → (total_steps,)
        step_scores = logits[step_token_mask.bool()]
        return torch.sigmoid(step_scores)
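A minimal usage sketch, reusing the class and imports above. The toy encoder is a stand-in so the snippet runs end-to-end; in practice base_model would be a pretrained transformer whose final hidden states feed the reward head.

from types import SimpleNamespace

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained transformer, just so the sketch runs."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
    def forward(self, input_ids):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))

prm = ProcessRewardModel(ToyEncoder(), d_model=32)
input_ids = torch.randint(0, 100, (1, 12))        # fake 12-token reasoning trace
step_token_mask = torch.zeros(1, 12)
step_token_mask[0, [3, 7, 11]] = 1                # last token of each of 3 steps
print(prm(input_ids, step_token_mask))            # three P(step correct) values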
Quick check
Best-of-N quality scales as E[max] ≈ μ + σ√(2 ln N). You run N=4 and see accuracy = 60%. Roughly what N is needed to reach 70%, assuming σ=10 and a linear mapping from E[max] to accuracy?
Break It — See What Happens
Quick check
A PRM verifier is trained on human ratings of “confident and detailed” steps rather than logical correctness. What failure mode emerges?
Real-World Numbers
| Result | Source | Details |
|---|---|---|
| PRM 78.2% vs ORM 72.4% (best-of-1860) | Let's Verify Step by Step (Lightman et al., 2023) | MATH benchmark, best-of-1860 with GPT-4 base. PRM selects better solutions by scoring reasoning steps, not just answers. |
| Best-of-N: log scaling | Scaling Test-Time Compute (Snell et al., 2024) | N=1 to N=256: accuracy improves ~15 points on MATH but flattens. Compute-optimal strategies outperform brute-force best-of-N. |
| Small model matches a 14x larger one | Snell et al., 2024 | A small model with optimal test-time compute allocation matches a 14x larger model on math tasks with a good verifier. |
| Learned test-time compute allocation | OpenAI o1 (2024) | Uses internal chain-of-thought with test-time compute scaling. Thinks longer on harder problems via a learned search strategy. |
| GSM8K ~56% → ~74% via self-consistency | Wang et al., 2022 | Majority vote over 40 CoT samples — no verifier needed, just sample more and vote. |
Quick check
o1 scored 94.8% on MATH (pass@1) while PRM best-of-N (same base model class) peaks around 78.2%. Self-consistency at N=40 reaches ~74% on GSM8K. Rank these strategies by inference cost per correct answer.
Key Takeaways
What to remember for interviews
1. ORM scores only the final answer; PRM scores every intermediate step — PRM catches flawed reasoning that gets lucky, improving MATH accuracy from 72.4% (ORM) to 78.2% (PRM) with the same base model.
2. Best-of-N sampling improves quality with diminishing log-returns: E[max] scales as μ + σ√(2 ln N), so going from N=64 to N=128 yields far less gain than N=1 to N=2.
3. Self-consistency (majority vote over N chains) requires no verifier — correct reasoning paths converge, errors are random. Gains up to +15% on GSM8K with N=40 samples.
4. Verifier-guided tree search is more compute-efficient than best-of-N: it prunes bad reasoning branches early instead of wasting compute completing clearly wrong trajectories.
5. A small model with optimal test-time compute allocation can match a 14x larger model on math tasks — but only when a reliable verifier exists.
6. SOTA 2025: o3 ARC-AGI 87.5% (high compute) validates test-time compute + verifier at production scale; AlphaProof+AlphaGeometry-2 reached IMO 2024 silver (28/42); ThinkPRM matches large discriminative PRMs with only 1% of labels via CoT verification.
SOTA 2024–2025: PRM Advances
o3 ARC-AGI: 87.5% at High Compute (2025)
The most public validation of test-time compute scaling via search + verifier: o3 scored 87.5% on ARC-AGI in the high-compute setting, vs. GPT-4o at 5%. The gap is almost entirely attributable to extended test-time compute with internal verification — not a larger model (per arcprize.org/blog/oai-o3-pub-breakthrough, 2024-12).
ThinkPRM — CoT Verifier with 1% Labels (arxiv:2501.07301, 2025-01)
Traditional discriminative PRMs require step-level labels at scale (~800K for PRM800K). ThinkPRM instead trains a verifier to produce a chain-of-thought explanation of why a step is correct or incorrect. Result: it matches large discriminative PRMs while using only ~1% of the step-level labels — the CoT explanation substitutes for labeled supervision.
ProcessBench — 3,400 Step-Error Cases (arxiv:2504.16828, 2025-04)
A systematic evaluation benchmark with 3,400 math reasoning traces annotated at the step level for error location. Reveals that most current PRMs fail to pinpoint the exact error step even when they correctly score the overall solution as wrong — motivating CoT-based verifiers like ThinkPRM.
AlphaProof + AlphaGeometry-2: IMO 2024 Silver (DeepMind, 2024)
AlphaProof (RL on formal Lean proofs) + AlphaGeometry-2 (neural symbolic geometry) combined for a silver-medal score at IMO 2024: 28/42 points. AlphaGeometry-2 alone solves 83% of historical IMO geometry problems (per deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level, 2024-07). These systems use formal verifiers (Lean proof checker, geometry solver) as the reward signal — a production RLVR deployment.
Recap quiz
Verifier & PRM recap
Lightman et al. tested PRM vs ORM at best-of-1860 on MATH. Which outcome most accurately describes the result?
Self-consistency on GSM8K requires no trained verifier but still lifts accuracy from ~56% to ~74% at N=40. What condition must hold for majority vote to keep improving with more samples?
Snell et al. showed a small model with optimal test-time compute can match a 14x larger model on math tasks. What is the critical prerequisite for this to hold?
PRM800K required ~800K step-level correctness labels across 4,500 MATH problems (~200 labeled steps per problem). What does this annotation cost imply for production PRM deployment?
o1 reached 94.8% on MATH (pass@1) and 83.3% on AIME (consensus@64). What best explains why o1's MATH score is so much higher than best-of-N PRM (78.2%) using the same base model class?
Why does verifier-guided tree search outperform best-of-N at the same total token budget?
A PRM uses the product of step scores: score = ∏ p(step t correct). A trajectory has 5 steps with scores [0.9, 0.9, 0.4, 0.9, 0.9]. What is the trajectory score, and what does it reveal about the product aggregation choice?
Further Reading
- Let's Verify Step by Step — Lightman et al., 2023. The foundational PRM paper. Shows process supervision outperforms outcome supervision on MATH with ~800K step-level human labels (the PRM800K dataset).
- Scaling LLM Test-Time Compute Optimally — Snell et al., 2024. Analyzes compute-optimal strategies for test-time scaling: when to search vs when to use a bigger model.
- OpenAI o1 Technical Report — OpenAI, 2024. Demonstrates learned test-time compute allocation via internal chain-of-thought reasoning.
- Self-Consistency Improves Chain of Thought Reasoning — Wang et al., 2022. Majority vote over sampled reasoning paths — simple, verifier-free, surprisingly effective.
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision — Luo et al., 2024. Uses Monte Carlo tree search to automatically generate step-level labels for PRM training without human annotation.
- DeepSeek-R1: Incentivizing Reasoning Capability via Reinforcement Learning — DeepSeek, 2025. Outcome-based RL (GRPO) with no PRM; shows that dense process supervision is not always necessary when the reward signal is sufficiently clear.
- Lilian Weng — Self-Consistency and Process Rewards. Covers the spectrum from self-consistency (no verifier) through ORM to PRM — useful for understanding the tradeoffs in test-time compute strategies.
Interview Questions
★★☆ Explain the difference between ORM and PRM. When does PRM significantly outperform ORM?
★★☆ How does best-of-N sampling work? What are its computational tradeoffs?
★★☆ What is self-consistency and how does it differ from best-of-N with a reward model?
★★★ When should you invest in test-time compute vs. training a bigger model?
★★★ How does verifier-guided tree search work? Compare it to best-of-N.
★★★ How would you train a Process Reward Model? What data do you need?