
Transformer Math

Module 20 · Training

🎯 RLHF & Reward Models

OpenAI fine-tuned InstructGPT (1.3B) on roughly 13,000 labeled examples, and it beat the 175B GPT-3 base model in human preference evaluations. Smaller dataset, smaller model, better outputs — because RLHF teaches preference, not prediction.

Pre-training teaches language. SFT teaches instruction-following. RLHF and DPO teach the model what humans actually prefer — helpful, harmless, and honest outputs. This is the alignment stage that separates a base model from a chatbot.

🎮

RLHF Pipeline

What you’re seeing: the four stages of RLHF — pretraining, supervised fine-tuning, reward model training on human preference pairs, and PPO optimization against the reward signal — with the same base model serving different roles in stages 3 and 4. What to try: notice how the KL penalty in stage 4 prevents the policy from drifting too far from the SFT model while still maximizing reward.

[Diagram: Alignment Pipeline. Standard RLHF path: Pre-training (next-token prediction) → SFT (supervised fine-tuning) → Reward Model (scores outputs, trained on human-ranked preference data, y_w > y_l) → RLHF (PPO policy optimization) → Deployed. DPO shortcut: skips the reward model entirely, going straight from SFT to DPO.]
💡

The Intuition

Pre-training teaches the model language — how to predict the next token. SFT teaches it to follow instructions by training on (instruction, response) pairs. But neither teaches the model which responses humans prefer.

RLHF adds a reward model trained on human preferences. The reward model scores outputs, and PPO optimizes the LLM to maximize that score. The critical constraint: KL penalty keeps the model close to the SFT policy, preventing reward hacking.

How the reward model is trained: Labelers are shown a prompt x and two responses, and asked which is better. These pairwise comparisons are modeled with the Bradley-Terry model — the probability that response y_w is preferred over y_l is:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

where r(x, y) is the reward model. Training minimizes the negative log-likelihood of the observed preferences: L_RM = −E[ log σ( r(x, y_w) − r(x, y_l) ) ]. This is just binary cross-entropy over all pairs — the reward model learns a scalar score that ranks responses consistently with human judgments.
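This pairwise loss is a one-liner in PyTorch. A minimal sketch (the function name `reward_model_loss` and the toy scalar scores are illustrative; a real reward model would produce these scores from a transformer with a scalar head):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood over preference pairs.

    r_chosen / r_rejected: reward-model scores r(x, y_w), r(x, y_l),
    one scalar per comparison. Minimizing this pushes the score of the
    chosen response above the rejected one.
    """
    # -log sigmoid(r_w - r_l) == binary cross-entropy with label "chosen wins"
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores: the reward model prefers the first response in each pair
r_w = torch.tensor([3.7, 2.5])
r_l = torch.tensor([2.1, 2.4])
loss = reward_model_loss(r_w, r_l)
```

Note that only the score *difference* matters — adding a constant to every reward leaves the loss unchanged, which is why reward-model scores have no absolute scale.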

DPO skips the reward model entirely. It directly optimizes the policy using preference pairs. DPO is derived from the same KL-constrained reward maximization objective as RLHF, but bypasses the reward model and PPO training — it's a closed-form solution under specific assumptions, not a blanket equivalent to PPO-based RLHF.

✨ Insight · Think of RLHF as teaching a student with a rubric (reward model) and a tutor (PPO). DPO is like giving the student paired examples: “this answer is better than that one” — and letting them learn the rubric implicitly.

Worked Example — Reward Scoring in Practice

Prompt: “Explain quantum entanglement.”

Response A — Jargon-heavy

“Quantum entanglement is a phenomenon in which the quantum states of two or more particles are described by a non-separable joint wavefunction, such that measuring one particle instantaneously collapses the superposition of the entangled partner regardless of spatial separation.”

r(x, A) = 2.1 — accurate but inaccessible

Response B — Clear and concrete

“Imagine two magic coins that always land opposite each other, no matter how far apart they are. Flip one in New York and instantly the other in Tokyo lands the opposite way. Quantum entanglement is that link — two particles whose fates are connected even across vast distances.”

r(x, B) = 3.7 — clear, relatable, preferred

How PPO uses this signal

  1. The reward model scores both responses: the advantage of B over A is 3.7 − 2.1 = +1.6.
  2. PPO computes the policy gradient: increase the log-probability of tokens in Response B (the high-reward trajectory).
  3. The KL penalty keeps the update small — the model shifts toward B-like responses without forgetting how to produce A-like technical detail.
  4. After thousands of such updates, the model consistently chooses clarity over jargon when the reward model reflects that preference.
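Under the Bradley-Terry model, the scores above translate directly into a preference probability — a quick numeric check using the illustrative scores from this worked example (`preference_prob` is just a name for this sketch):

```python
import math

def preference_prob(r_preferred, r_other):
    """Bradley-Terry: P(preferred beats other) = sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_other)))

# Scores from the worked example: r(x, B) = 3.7, r(x, A) = 2.1
p_b_over_a = preference_prob(3.7, 2.1)  # sigmoid(1.6) ≈ 0.83
```

A gap of +1.6 means the reward model predicts humans would pick Response B about 83% of the time — confident, but not certain, which matches the noisy ~73% inter-labeler agreement discussed later in this module.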

Quick check

Trade-off

A colleague proposes setting β = 0 to maximize reward without constraint. What is the first failure mode you would observe?

🚀

SOTA 2024–2025: GRPO, DPO Variants & RLAIF at Scale

GRPO (Group Relative Policy Optimization) — introduced in DeepSeekMath (Shao et al., 2024) and since adopted widely, GRPO eliminates the critic/value network that PPO requires. Rather than learning a value function to estimate baseline advantages, GRPO samples a group of outputs for each prompt, scores every output with the reward model, and computes each output's advantage relative to the group mean. This removes an entire model from the training pipeline, significantly reducing training resources versus PPO, and training is more stable — reward normalization happens naturally within each group rather than relying on a learned value head that can diverge early in training.

✨ Insight · GRPO is essentially “best-of-N sampling turned into a gradient signal.” The group mean acts as a dynamic baseline — outputs above average get positive advantage, below average get negative. No critic required.
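The group-relative baseline is a few lines of code. A minimal sketch (the function name `grpo_advantages` is mine; whitening by the group standard deviation follows common practice, though DeepSeek's exact normalization may differ):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for GRPO.

    rewards: tensor of shape (G,) — reward-model scores for G completions
    sampled from the SAME prompt. The group mean is the baseline, so no
    learned critic/value network is needed.
    """
    # Center on the group mean, whiten by the group std (unbiased, PyTorch default)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled completions for one prompt
rewards = torch.tensor([1.0, 3.0, 2.0, 2.0])
adv = grpo_advantages(rewards)  # above-average outputs get positive advantage
```

The advantages always sum to (approximately) zero within a group: every update pushes probability mass from below-average completions toward above-average ones for the same prompt.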

Online DPO Variants — two notable extensions address DPO's original failure modes. IPO (Identity Preference Optimization, Azar et al. 2023) replaces DPO's log-sigmoid with a squared loss, removing the implicit assumption that preferences are generated by a Bradley-Terry model. This makes IPO more robust to label noise and prevents the model from overfitting to the margin — a known failure mode when the preference dataset contains contradictory labels. KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024) goes further: it works with binary feedback (good / bad) instead of contrastive pairs, making data collection dramatically cheaper at scale. KTO is inspired by prospect theory — humans are more sensitive to losses than gains — and optimizes a utility function that mirrors this asymmetry.
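The IPO change is small but consequential: swap the log-sigmoid for a squared loss toward a fixed target margin of 1/(2β). A sketch under that reading of Azar et al. (the function name `ipo_loss` is mine; inputs are the same per-sequence log-probabilities a DPO loss takes):

```python
import torch

def ipo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """IPO: squared loss pulling the implicit reward margin toward 1/(2*beta).

    Unlike DPO's -log sigmoid, which rewards pushing the margin to infinity,
    the squared loss caps it at a finite target — this is what makes IPO
    robust to noisy or contradictory preference labels.
    """
    margin = (pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l)
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()

# If the achieved margin already equals the 1/(2*beta) target, the loss is zero
loss = ipo_loss(torch.tensor([5.0]), torch.tensor([0.0]),
                torch.tensor([0.0]), torch.tensor([0.0]), beta=0.1)
```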

RLAIF at Scale — Constitutional AI (Bai et al., 2022) showed that AI feedback can scale harmlessness training with far fewer human labels. The core insight: use a strong LLM to critique and rank outputs against a written “constitution” of principles, then train on those AI-generated preferences. The Llama 3 technical report describes using both human and AI-generated preference data for reward model training, reflecting the emerging consensus: expensive human annotation is best reserved for calibration and red-teaming, while AI feedback handles scale.

💡 Tip · The practical split emerging across labs: AI feedback for volume (millions of examples), human feedback for calibration and adversarial edge cases. Neither alone is sufficient — human feedback anchors the AI's judgment, and AI feedback scales it.
Quick Check

What does the KL penalty in RLHF prevent?

📐

Step-by-Step Derivation

PPO Objective (RLHF)

Maximize expected reward while staying close to the reference policy; β controls the KL penalty strength:

max_θ  E_{x∼D, y∼π_θ(·|x)} [ r(x, y) ]  −  β · D_KL( π_θ(·|x) ‖ π_ref(·|x) )

💡 Tip · When β = 0, there is no constraint — pure reward maximization (leads to reward hacking). When β → ∞, the model can't deviate from the reference at all (no learning).

DPO Loss

Direct Preference Optimization reformulates the RLHF objective into a simple classification loss. Given preferred output y_w and dispreferred y_l:

L_DPO = −E_{(x, y_w, y_l)} [ log σ( β ( log π_θ(y_w|x)/π_ref(y_w|x) − log π_θ(y_l|x)/π_ref(y_l|x) ) ) ]

The key insight: the log-probability ratio log( π_θ(y|x) / π_ref(y|x) ) acts as an implicit reward. DPO increases this ratio for preferred outputs and decreases it for dispreferred ones.

PyTorch: DPO Loss

python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Direct Preference Optimization loss."""
    # Log-probability ratios (implicit rewards)
    pi_ratio_w = pi_logps_w - ref_logps_w   # preferred
    pi_ratio_l = pi_logps_l - ref_logps_l   # dispreferred

    # DPO loss: -log sigmoid(beta * (preferred - dispreferred))
    logits = beta * (pi_ratio_w - pi_ratio_l)
    loss = -F.logsigmoid(logits).mean()
    return loss

PPO-Clip Objective (RLHF)

PPO clips the probability ratio r_t(θ) = π_θ(y|x) / π_old(y|x) to prevent large policy updates, and the KL penalty term keeps the policy anchored to the reference model. ε is the clip range (typically 0.2); β is the KL coefficient (typically 0.01–0.1 in RLHF):

L_PPO = −E[ min( r_t(θ) · Â, clip(r_t(θ), 1−ε, 1+ε) · Â ) ] + β · D_KL( π_θ ‖ π_ref )

PyTorch: PPO-Clip Loss for RLHF

python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, ref_log_probs,
                  advantages, eps=0.2, beta=0.05):
    """PPO-Clip objective for RLHF.
    new_log_probs: log π_θ(y|x) under current policy
    old_log_probs: log π_old(y|x) from the rollout policy
    ref_log_probs: log π_ref(y|x) from frozen SFT checkpoint
    advantages:    r(x,y) - V(x) — reward minus value baseline
    """
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL divergence penalty against reference (SFT) policy
    kl_penalty = beta * (new_log_probs - ref_log_probs).mean()
    return policy_loss + kl_penalty
PyTorch implementation

python
# PPO clipped surrogate objective for RLHF
import torch, torch.nn.functional as F

def ppo_rlhf_step(policy, ref_policy, reward_model, prompts, eps=0.2, beta=0.05):
    """One PPO update step in the RLHF loop."""
    # 1. Sample completions from current policy
    with torch.no_grad():
        completions = policy.generate(prompts, do_sample=True, max_new_tokens=256)
        rewards = reward_model(prompts, completions)          # scalar per sample
        old_log_probs = policy.log_prob(completions)          # log π_old
        ref_log_probs = ref_policy.log_prob(completions)      # log π_ref (frozen SFT)

    # 2. Compute advantages (reward - value baseline; simplified: reward - mean)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 3. PPO-clip + KL penalty
    new_log_probs = policy.log_prob(completions)
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_loss = beta * (new_log_probs - ref_log_probs).mean()
    return policy_loss + kl_loss

Quick check

Derivation

In the DPO loss, the term log(π_θ(y|x) / π_ref(y|x)) acts as an implicit reward. If this ratio is > 1 for the preferred output, what does that mean?

⚖️

Alignment Methods Compared

|                          | PPO (RLHF)           | DPO           | GRPO                | RLAIF        |
|--------------------------|----------------------|---------------|---------------------|--------------|
| Reward model needed?     | Yes                  | No            | Yes (or rule-based) | AI-generated |
| Critic/value network?    | Yes                  | No            | No                  | Varies       |
| Online data generation?  | Yes                  | No (offline)  | Yes                 | Yes          |
| Complexity               | High                 | Low           | Medium              | Medium       |
| Used by                  | InstructGPT, Llama-2 | Zephyr, Tulu  | DeepSeek-V2/V3      | Claude (CAI) |
🔧

Break It — See What Happens

Remove KL Penalty (β = 0)
Remove Reward Model (Random Rewards)
Train Reward Model on Biased Data

Quick check

Trade-off

You observe that after 20K PPO steps, reward model score is at an all-time high but human evaluators rank outputs worse than the SFT baseline. What is the most likely diagnosis?

📊

Real-World Numbers

InstructGPT (Ouyang et al. 2022) — Key Numbers

1.3B

model params (RLHF model that beat 175B GPT-3 in human evals)

40

labelers (contractors with strong English skills, not ML experts)

~33K

comparison pairs used to train the reward model

~73%

inter-labeler agreement — human preferences are noisy

0.01–0.1

KL coefficient β range (InstructGPT used β ≈ 0.02)

256K

PPO episodes (batch size 64, ~4K steps total)

6–10 KL

units from reference before quality degrades (overoptimization cliff)

✨ Insight · The 1.3B InstructGPT model outperformed the 175B GPT-3 base model in human preference studies — alignment quality matters more than raw scale. This is the core argument for RLHF as a capability multiplier, not just a safety technique.
| Model        | Alignment Method  | Details                                              |
|--------------|-------------------|------------------------------------------------------|
| InstructGPT  | RLHF (PPO)        | 40 labelers, β ≈ 0.02, ~256K PPO episodes            |
| Claude       | Constitutional AI | RLAIF — AI-generated feedback from a constitution    |
| Llama-2 Chat | RLHF (PPO)        | 2 reward models (helpfulness + safety)               |
| Zephyr       | DPO               | DPO on AI-ranked preference data (UltraFeedback)     |
| DeepSeek-V2  | GRPO              | Group-relative optimization, no critic network needed |

Compute & Labeling Costs

| Model              | Labeling Scale      | Estimated Cost / Notes                                                                                  |
|--------------------|---------------------|---------------------------------------------------------------------------------------------------------|
| InstructGPT        | ~40 labelers        | Est. $1–10M for human labeling (Ouyang et al. 2022); small scale, high-quality curation                 |
| Llama-2 Chat       | ~1M preference pairs | 2 separate reward models (helpfulness + safety); largest public RLHF dataset at the time               |
| DeepSeek-V2 (GRPO) | Rule-based + RM     | No critic network → significant memory reduction vs. PPO; single-model pipeline simplifies infrastructure |
✨ Insight · The trend is clear: PPO-based RLHF is being replaced by simpler alternatives. DPO for offline training, GRPO for online training, and Constitutional AI for scaling feedback. The core idea — learning from preferences — remains the same.

Quick check

Trade-off

InstructGPT's 1.3B RLHF model beat GPT-3 175B in human preference. A product team concludes they should always use the smallest model with RLHF. What critical caveat does the paper data reveal?

🧠

Key Takeaways

What to remember for interviews

  1. RLHF has 3 stages: SFT → Reward Model → PPO optimization
  2. The reward model learns a scalar score from human preference pairs (Bradley-Terry model)
  3. KL penalty prevents reward hacking — β controls the constraint strength
  4. DPO skips the reward model by directly optimizing on preference pairs
  5. GRPO (DeepSeek) eliminates the critic network using group-relative advantages
🧠

Recap quiz

Trade-off

You raise β (the KL penalty coefficient) in PPO-RLHF from 0.02 to 0.5. What is the primary effect?

Trade-off

InstructGPT's 1.3B RLHF model outperformed the 175B GPT-3 base in human evals. What does this imply about the compute budget for alignment?

Trade-off

DPO eliminates the reward model from the training pipeline. What is the main practical limitation this introduces compared to PPO-RLHF?

Trade-off

GRPO removes the critic network by using group-mean advantage estimation. What failure mode does this trade away compared to PPO?

Derivation

Reward overoptimization peaks around 6–10 KL units from the reference policy. What causes quality to DROP beyond this point even as reward model score keeps rising?

Trade-off

Constitutional AI (RLAIF) replaces human preference labelers with AI-generated feedback. Which failure mode does this introduce that human labeling avoids?

Derivation

InstructGPT labelers agreed on pairwise preferences ~73% of the time. How should this noise be handled when training the reward model?

📚

Further Reading

🎯

Interview Questions


Walk through the full RLHF pipeline. What are the three stages and why is each necessary?

★★☆
AnthropicOpenAI

Explain DPO and how it differs from PPO-based RLHF. What are its advantages?

★★★
AnthropicMeta

What is reward hacking and how does the KL penalty prevent it?

★★☆
OpenAIAnthropic

What is Constitutional AI (CAI) and how does it reduce reliance on human labelers?

★★☆
Anthropic

How does GRPO (Group Relative Policy Optimization) differ from PPO?

★★★
OpenAIMeta

How is preference data collected? What are the challenges and biases?

★★☆
OpenAIGoogle

What is the role of the reference model in RLHF/DPO and what happens if you remove it?

★★☆
AnthropicOpenAI

Compare RLHF, DPO, and RLAIF. When would you choose each approach?

★★★
GoogleMeta

What is sycophancy in RLHF and how does it emerge?

★★☆
AnthropicOpenAI

What is reward overoptimization? At what point does optimizing against a reward model hurt quality?

★★★
OpenAIAnthropic