
Transformer Math

Module 20 · Training

🎯 RLHF & Reward Models

OpenAI fine-tuned InstructGPT (1.3B) on roughly 13,000 labeled examples, and it beat the 175B GPT-3 base model in human preference evaluations. Smaller dataset, smaller model, better outputs — because RLHF teaches preference, not prediction.

Pre-training teaches language. SFT teaches instruction-following. RLHF and DPO teach the model what humans actually prefer — helpful, harmless, and honest outputs. This is the alignment stage that separates a base model from a chatbot.

🎮

RLHF Pipeline

What you’re seeing: the four stages of RLHF — pretraining, supervised fine-tuning, reward model training on human preference pairs, and PPO optimization against the reward signal — with the same base model serving different roles in stages 3 and 4. What to try: notice how the KL penalty in stage 4 prevents the policy from drifting too far from the SFT model while still maximizing reward.

[Diagram: Alignment Pipeline. Standard RLHF path: Pre-training (next-token prediction) → SFT (supervised fine-tuning) → Reward Model (scores outputs, trained on human-ranked preference data, y_w > y_l) → RLHF (PPO policy optimization) → Deployed. DPO shortcut: skips the reward model entirely, going straight from SFT to DPO.]
💡

The Intuition

Pre-training teaches the model language — how to predict the next token. SFT teaches it to follow instructions by training on (instruction, response) pairs. But neither teaches the model which responses humans prefer.

RLHF adds a reward model trained on human preferences. The reward model scores outputs, and PPO optimizes the LLM to maximize that score. The critical constraint: KL penalty keeps the model close to the SFT policy, preventing reward hacking.

How the reward model is trained: Labelers are shown a prompt x and two responses, and asked which is better. These pairwise comparisons are modeled with the Bradley-Terry model — the probability that response y_w is preferred over y_l is:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

where r(x, y) is the reward model. Training minimizes the negative log-likelihood of the observed preferences: L_RM = −E[ log σ( r(x, y_w) − r(x, y_l) ) ]. This is just binary cross-entropy over all pairs — the reward model learns a scalar score that ranks responses consistently with human judgments.
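This pairwise loss is a one-liner in PyTorch. A minimal sketch (the function name `reward_model_loss` and the toy scalar scores are illustrative; a real reward model would produce these scores from a transformer with a scalar head):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood over preference pairs.

    r_chosen / r_rejected: reward-model scores r(x, y_w), r(x, y_l),
    one scalar per comparison. Minimizing this pushes the score of the
    chosen response above the rejected one.
    """
    # -log sigmoid(r_w - r_l) == binary cross-entropy with label "chosen wins"
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores: the reward model prefers the first response in each pair
r_w = torch.tensor([3.7, 2.5])
r_l = torch.tensor([2.1, 2.4])
loss = reward_model_loss(r_w, r_l)
```

Note that only the score *difference* matters — adding a constant to every reward leaves the loss unchanged, which is why reward-model scores have no absolute scale.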

DPO skips the reward model entirely. It directly optimizes the policy using preference pairs. DPO is derived from the same KL-constrained reward maximization objective as RLHF, but bypasses the reward model and PPO training — it's a closed-form solution under specific assumptions, not a blanket equivalent to PPO-based RLHF.

✨ Insight · Think of RLHF as teaching a student with a rubric (reward model) and a tutor (PPO). DPO is like giving the student paired examples: “this answer is better than that one” — and letting them learn the rubric implicitly.

Worked Example — Reward Scoring in Practice

Prompt: “Explain quantum entanglement.”

Response A — Jargon-heavy

“Quantum entanglement is a phenomenon in which the quantum states of two or more particles are described by a non-separable joint wavefunction, such that measuring one particle instantaneously collapses the superposition of the entangled partner regardless of spatial separation.”

r(x, A) = 2.1 — accurate but inaccessible

Response B — Clear and concrete

“Imagine two magic coins that always land opposite each other, no matter how far apart they are. Flip one in New York and instantly the other in Tokyo lands the opposite way. Quantum entanglement is that link — two particles whose fates are connected even across vast distances.”

r(x, B) = 3.7 — clear, relatable, preferred

How PPO uses this signal

  1. The reward model scores both responses: the advantage of B over A is 3.7 − 2.1 = +1.6.
  2. PPO computes the policy gradient: increase the log-probability of tokens in Response B (the high-reward trajectory).
  3. The KL penalty keeps the update small — the model shifts toward B-like responses without forgetting how to produce A-like technical detail.
  4. After thousands of such updates, the model consistently chooses clarity over jargon when the reward model reflects that preference.
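Under the Bradley-Terry model, the scores above translate directly into a preference probability — a quick numeric check using the illustrative scores from this worked example (`preference_prob` is just a name for this sketch):

```python
import math

def preference_prob(r_preferred, r_other):
    """Bradley-Terry: P(preferred beats other) = sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_other)))

# Scores from the worked example: r(x, B) = 3.7, r(x, A) = 2.1
p_b_over_a = preference_prob(3.7, 2.1)  # sigmoid(1.6) ≈ 0.83
```

A gap of +1.6 means the reward model predicts humans would pick Response B about 83% of the time — confident, but not certain, which matches the noisy ~73% inter-labeler agreement discussed later in this module.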

Quick check

Trade-off

A colleague proposes setting β = 0 to maximize reward without constraint. What is the first failure mode you would observe?

🚀

SOTA 2024–2025: GRPO, DPO Variants & RLAIF at Scale

GRPO (Group Relative Policy Optimization) — introduced in DeepSeekMath (Shao et al., 2024) and since adopted widely, GRPO eliminates the critic/value network that PPO requires. Rather than learning a value function to estimate baseline advantages, GRPO samples a group of outputs for each prompt, scores every output with the reward model, and computes each output's advantage relative to the group mean. This removes an entire model from the training pipeline, significantly reducing training resources versus PPO, and training is more stable — reward normalization happens naturally within each group rather than relying on a learned value head that can diverge early in training.

✨ Insight · GRPO is essentially “best-of-N sampling turned into a gradient signal.” The group mean acts as a dynamic baseline — outputs above average get positive advantage, below average get negative. No critic required.
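The group-relative baseline is a few lines of code. A minimal sketch (the function name `grpo_advantages` is mine; whitening by the group standard deviation follows common practice, though DeepSeek's exact normalization may differ):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for GRPO.

    rewards: tensor of shape (G,) — reward-model scores for G completions
    sampled from the SAME prompt. The group mean is the baseline, so no
    learned critic/value network is needed.
    """
    # Center on the group mean, whiten by the group std (unbiased, PyTorch default)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled completions for one prompt
rewards = torch.tensor([1.0, 3.0, 2.0, 2.0])
adv = grpo_advantages(rewards)  # above-average outputs get positive advantage
```

The advantages always sum to (approximately) zero within a group: every update pushes probability mass from below-average completions toward above-average ones for the same prompt.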

Online DPO Variants — two notable extensions address DPO's original failure modes. IPO (Identity Preference Optimization, Azar et al. 2023) replaces DPO's log-sigmoid with a squared loss, removing the implicit assumption that preferences are generated by a Bradley-Terry model. This makes IPO more robust to label noise and prevents the model from overfitting to the margin — a known failure mode when the preference dataset contains contradictory labels. KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024) goes further: it works with binary feedback (good / bad) instead of contrastive pairs, making data collection dramatically cheaper at scale. KTO is inspired by prospect theory — humans are more sensitive to losses than gains — and optimizes a utility function that mirrors this asymmetry.
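The IPO change is small but consequential: swap the log-sigmoid for a squared loss toward a fixed target margin of 1/(2β). A sketch under that reading of Azar et al. (the function name `ipo_loss` is mine; inputs are the same per-sequence log-probabilities a DPO loss takes):

```python
import torch

def ipo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """IPO: squared loss pulling the implicit reward margin toward 1/(2*beta).

    Unlike DPO's -log sigmoid, which rewards pushing the margin to infinity,
    the squared loss caps it at a finite target — this is what makes IPO
    robust to noisy or contradictory preference labels.
    """
    margin = (pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l)
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()

# If the achieved margin already equals the 1/(2*beta) target, the loss is zero
loss = ipo_loss(torch.tensor([5.0]), torch.tensor([0.0]),
                torch.tensor([0.0]), torch.tensor([0.0]), beta=0.1)
```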

RLAIF at Scale — Constitutional AI (Bai et al., 2022) showed that AI feedback can scale harmlessness training with far fewer human labels. The core insight: use a strong LLM to critique and rank outputs against a written “constitution” of principles, then train on those AI-generated preferences. The Llama 3 technical report describes using both human and AI-generated preference data for reward model training, reflecting the emerging consensus: expensive human annotation is best reserved for calibration and red-teaming, while AI feedback handles scale.

💡 Tip · The practical split emerging across labs: AI feedback for volume (millions of examples), human feedback for calibration and adversarial edge cases. Neither alone is sufficient — human feedback anchors the AI's judgment, and AI feedback scales it.
Quick Check

What does the KL penalty in RLHF prevent?

📐

Step-by-Step Derivation

PPO Objective (RLHF)

Maximize expected reward while staying close to the reference policy; β controls the KL penalty strength:

max_θ  E_{x∼D, y∼π_θ(·|x)} [ r(x, y) ]  −  β · D_KL( π_θ(·|x) ‖ π_ref(·|x) )

💡 Tip · When β = 0, there is no constraint — pure reward maximization (leads to reward hacking). When β → ∞, the model can't deviate from the reference at all (no learning).

DPO Loss

Direct Preference Optimization reformulates the RLHF objective into a simple classification loss. Given preferred output y_w and dispreferred y_l:

L_DPO = −E_{(x, y_w, y_l)} [ log σ( β ( log π_θ(y_w|x)/π_ref(y_w|x) − log π_θ(y_l|x)/π_ref(y_l|x) ) ) ]

The key insight: the log-probability ratio log( π_θ(y|x) / π_ref(y|x) ) acts as an implicit reward. DPO increases this ratio for preferred outputs and decreases it for dispreferred ones.

PyTorch: DPO Loss

python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Direct Preference Optimization loss."""
    # Log-probability ratios (implicit rewards)
    pi_ratio_w = pi_logps_w - ref_logps_w   # preferred
    pi_ratio_l = pi_logps_l - ref_logps_l   # dispreferred

    # DPO loss: -log sigmoid(beta * (preferred - dispreferred))
    logits = beta * (pi_ratio_w - pi_ratio_l)
    loss = -F.logsigmoid(logits).mean()
    return loss

PPO-Clip Objective (RLHF)

PPO clips the probability ratio r_t(θ) = π_θ(y|x) / π_old(y|x) to prevent large policy updates, and the KL penalty term keeps the policy anchored to the reference model. ε is the clip range (typically 0.2); β is the KL coefficient (typically 0.01–0.1 in RLHF):

L_PPO = −E[ min( r_t(θ) · Â, clip(r_t(θ), 1−ε, 1+ε) · Â ) ] + β · D_KL( π_θ ‖ π_ref )

PyTorch: PPO-Clip Loss for RLHF

python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, ref_log_probs,
                  advantages, eps=0.2, beta=0.05):
    """PPO-Clip objective for RLHF.
    new_log_probs: log π_θ(y|x) under current policy
    old_log_probs: log π_old(y|x) from the rollout policy
    ref_log_probs: log π_ref(y|x) from frozen SFT checkpoint
    advantages:    r(x,y) - V(x) — reward minus value baseline
    """
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL divergence penalty against reference (SFT) policy
    kl_penalty = beta * (new_log_probs - ref_log_probs).mean()
    return policy_loss + kl_penalty
PyTorch implementation

python
# PPO clipped surrogate objective for RLHF
import torch, torch.nn.functional as F

def ppo_rlhf_step(policy, ref_policy, reward_model, prompts, eps=0.2, beta=0.05):
    """One PPO update step in the RLHF loop."""
    # 1. Sample completions from current policy
    with torch.no_grad():
        completions = policy.generate(prompts, do_sample=True, max_new_tokens=256)
        rewards = reward_model(prompts, completions)          # scalar per sample
        old_log_probs = policy.log_prob(completions)          # log π_old
        ref_log_probs = ref_policy.log_prob(completions)      # log π_ref (frozen SFT)

    # 2. Compute advantages (reward - value baseline; simplified: reward - mean)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 3. PPO-clip + KL penalty
    new_log_probs = policy.log_prob(completions)
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_loss = beta * (new_log_probs - ref_log_probs).mean()
    return policy_loss + kl_loss

Quick check

Derivation

In the DPO loss, the term log(π_θ(y|x) / π_ref(y|x)) acts as an implicit reward. If this ratio is > 1 for the preferred output, what does that mean?

⚖️

Alignment Methods Compared

|                          | PPO (RLHF)           | DPO           | GRPO                | RLAIF        |
|--------------------------|----------------------|---------------|---------------------|--------------|
| Reward model needed?     | Yes                  | No            | Yes (or rule-based) | AI-generated |
| Critic/value network?    | Yes                  | No            | No                  | Varies       |
| Online data generation?  | Yes                  | No (offline)  | Yes                 | Yes          |
| Complexity               | High                 | Low           | Medium              | Medium       |
| Used by                  | InstructGPT, Llama-2 | Zephyr, Tulu  | DeepSeek-V2/V3      | Claude (CAI) |
🔧

Break It — See What Happens

Remove KL Penalty (β = 0)
Remove Reward Model (Random Rewards)
Train Reward Model on Biased Data

Quick check

Trade-off

You observe that after 20K PPO steps, reward model score is at an all-time high but human evaluators rank outputs worse than the SFT baseline. What is the most likely diagnosis?

📊

Real-World Numbers

InstructGPT (Ouyang et al. 2022) — Key Numbers

1.3B

model params (RLHF model that beat 175B GPT-3 in human evals)

40

labelers (contractors with strong English skills, not ML experts)

~33K

comparison pairs used to train the reward model

~73%

inter-labeler agreement — human preferences are noisy

0.01–0.1

KL coefficient β range (InstructGPT used β ≈ 0.02)

256K

PPO episodes (batch size 64, ~4K steps total)

6–10 KL

units from reference before quality degrades (overoptimization cliff)

✨ Insight · The 1.3B InstructGPT model outperformed the 175B GPT-3 base model in human preference studies — alignment quality matters more than raw scale. This is the core argument for RLHF as a capability multiplier, not just a safety technique.
| Model        | Alignment Method  | Details                                              |
|--------------|-------------------|------------------------------------------------------|
| InstructGPT  | RLHF (PPO)        | 40 labelers, β ≈ 0.02, ~256K PPO episodes            |
| Claude       | Constitutional AI | RLAIF — AI-generated feedback from a constitution    |
| Llama-2 Chat | RLHF (PPO)        | 2 reward models (helpfulness + safety)               |
| Zephyr       | DPO               | DPO on AI-ranked preference data (UltraFeedback)     |
| DeepSeek-V2  | GRPO              | Group-relative optimization, no critic network needed |

Compute & Labeling Costs

| Model              | Labeling Scale      | Estimated Cost / Notes                                                                                  |
|--------------------|---------------------|---------------------------------------------------------------------------------------------------------|
| InstructGPT        | ~40 labelers        | Est. $1–10M for human labeling (Ouyang et al. 2022); small scale, high-quality curation                 |
| Llama-2 Chat       | ~1M preference pairs | 2 separate reward models (helpfulness + safety); largest public RLHF dataset at the time               |
| DeepSeek-V2 (GRPO) | Rule-based + RM     | No critic network → significant memory reduction vs. PPO; single-model pipeline simplifies infrastructure |
✨ Insight · The trend is clear: PPO-based RLHF is being replaced by simpler alternatives. DPO for offline training, GRPO for online training, and Constitutional AI for scaling feedback. The core idea — learning from preferences — remains the same.

Quick check

Trade-off

InstructGPT's 1.3B RLHF model beat GPT-3 175B in human preference. A product team concludes they should always use the smallest model with RLHF. What critical caveat does the paper data reveal?

🧠

Key Takeaways

What to remember for interviews

  1. RLHF has 3 stages: SFT → Reward Model → PPO optimization
  2. The reward model learns a scalar score from human preference pairs (Bradley-Terry model)
  3. KL penalty prevents reward hacking — β controls the constraint strength
  4. DPO skips the reward model by directly optimizing on preference pairs
  5. GRPO (DeepSeek) eliminates the critic network using group-relative advantages
🧠

Recap quiz

Trade-off

You raise β (the KL penalty coefficient) in PPO-RLHF from 0.02 to 0.5. What is the primary effect?

Trade-off

InstructGPT's 1.3B RLHF model outperformed the 175B GPT-3 base in human evals. What does this imply about the compute budget for alignment?

Trade-off

DPO eliminates the reward model from the training pipeline. What is the main practical limitation this introduces compared to PPO-RLHF?

Trade-off

GRPO removes the critic network by using group-mean advantage estimation. What failure mode does this trade away compared to PPO?

Derivation

Reward overoptimization peaks around 6–10 KL units from the reference policy. What causes quality to DROP beyond this point even as reward model score keeps rising?

Trade-off

Constitutional AI (RLAIF) replaces human preference labelers with AI-generated feedback. Which failure mode does this introduce that human labeling avoids?

Derivation

InstructGPT labelers agreed on pairwise preferences ~73% of the time. How should this noise be handled when training the reward model?

📚

Further Reading

🎯

Interview Questions


Walk through the full RLHF pipeline. What are the three stages and why is each necessary?

★★☆
AnthropicOpenAI

Explain DPO and how it differs from PPO-based RLHF. What are its advantages?

★★★
AnthropicMeta

What is reward hacking and how does the KL penalty prevent it?

★★☆
OpenAIAnthropic

What is Constitutional AI (CAI) and how does it reduce reliance on human labelers?

★★☆
Anthropic

How does GRPO (Group Relative Policy Optimization) differ from PPO?

★★★
OpenAIMeta

How is preference data collected? What are the challenges and biases?

★★☆
OpenAIGoogle

What is the role of the reference model in RLHF/DPO and what happens if you remove it?

★★☆
AnthropicOpenAI

Compare RLHF, DPO, and RLAIF. When would you choose each approach?

★★★
GoogleMeta

What is sycophancy in RLHF and how does it emerge?

★★☆
AnthropicOpenAI

What is reward overoptimization? At what point does optimizing against a reward model hurt quality?

★★★
OpenAIAnthropic