🎯 RLHF & Reward Models
OpenAI aligned a 1.3B-parameter InstructGPT model with roughly 13,000 labeled demonstrations plus RLHF, and human evaluators preferred its outputs to those of the 175B GPT-3 base model. Smaller dataset, smaller model, better outputs — because RLHF teaches preference, not prediction.
Pre-training teaches language. SFT teaches instruction-following. RLHF and DPO teach the model what humans actually prefer — helpful, harmless, and honest outputs. This is the alignment stage that separates a base model from a chatbot.
RLHF Pipeline
What you’re seeing: the four stages of RLHF — pretraining, supervised fine-tuning, reward model training on human preference pairs, and PPO optimization against the reward signal — with the same base model serving different roles in stages 3 and 4. What to try: notice how the KL penalty in stage 4 prevents the policy from drifting too far from the SFT model while still maximizing reward.
The Intuition
Pre-training teaches the model language — how to predict the next token. SFT teaches it to follow instructions by training on (instruction, response) pairs. But neither teaches the model which responses humans prefer.
RLHF adds a reward model trained on human preferences. The reward model scores outputs, and PPO optimizes the LLM to maximize that score. The critical constraint: KL penalty keeps the model close to the SFT policy, preventing reward hacking.
How the reward model is trained: Labelers are shown a prompt x and two responses, and asked which is better. These pairwise comparisons are modeled with the Bradley-Terry model — the probability that response y_w is preferred over y_l is:

P(y_w ≻ y_l | x) = σ( r_φ(x, y_w) − r_φ(x, y_l) )

where r_φ(x, y) is the reward model. Training minimizes the negative log-likelihood of the observed preferences:

L_RM(φ) = −E_{(x, y_w, y_l)} [ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]

This is just binary cross-entropy over all pairs — the reward model learns a scalar score that ranks responses consistently with human judgments (r_φ(x, y_w) > r_φ(x, y_l) whenever labelers prefer y_w).
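A minimal sketch of this pairwise loss in PyTorch (the scores are assumed to come from a separate reward model with a scalar head; the function name, tensor shapes, and toy values below are illustrative, not from any specific codebase):

import torch
import torch.nn.functional as F

def reward_model_pairwise_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    chosen_rewards / rejected_rewards: reward-model scores for the preferred
    and dispreferred responses, shape (batch,).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: the loss shrinks as preferred responses are scored higher
chosen = torch.tensor([3.7, 1.2])
rejected = torch.tensor([2.1, 0.9])
loss = reward_model_pairwise_loss(chosen, rejected)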
DPO skips the reward model entirely. It directly optimizes the policy using preference pairs. DPO is derived from the same KL-constrained reward maximization objective as RLHF, but bypasses the reward model and PPO training — it's a closed-form solution under specific assumptions, not a blanket equivalent to PPO-based RLHF.
Worked Example — Reward Scoring in Practice
Prompt: “Explain quantum entanglement.”
Response A — Jargon-heavy
“Quantum entanglement is a phenomenon in which the quantum states of two or more particles are described by a non-separable joint wavefunction, such that measuring one particle instantaneously collapses the superposition of the entangled partner regardless of spatial separation.”
Response B — Clear and concrete
“Imagine two magic coins that always land opposite each other, no matter how far apart they are. Flip one in New York and instantly the other in Tokyo lands the opposite way. Quantum entanglement is that link — two particles whose fates are connected even across vast distances.”
How PPO uses this signal
1. The reward model scores both responses (say A = 2.1, B = 3.7): the advantage of B over A is 3.7 − 2.1 = +1.6.
2. PPO computes the policy gradient: increase the log-probability of tokens in Response B, the high-reward trajectory (see the sketch after this list).
3. The KL penalty keeps the update small — the model shifts toward B-like responses without forgetting how to produce A-like technical detail.
4. After thousands of such updates, the model consistently chooses clarity over jargon when the reward model reflects that preference.
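A toy numeric sketch of steps 1–3, assuming sequence-level log-probs for the two responses (all values below are made up for illustration):

import torch

# Hypothetical sequence log-probs for [Response A, Response B]
new_log_probs = torch.tensor([-42.0, -37.5])   # current policy
old_log_probs = torch.tensor([-42.0, -38.0])   # rollout policy
rewards       = torch.tensor([2.1, 3.7])       # reward model scores

# Mean-baseline advantages: A gets -0.8, B gets +0.8 (the 1.6 gap from step 1)
advantages = rewards - rewards.mean()

# PPO-clip surrogate: push up B's log-prob, push down A's, but only a little
ratio = (new_log_probs - old_log_probs).exp()
clipped = torch.clamp(ratio, 0.8, 1.2)          # eps = 0.2
policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()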
Quick check
A colleague proposes setting β = 0 to maximize reward without constraint. What is the first failure mode you would observe?
SOTA 2024–2025: GRPO, DPO Variants & RLAIF at Scale
GRPO (Group Relative Policy Optimization) — introduced in DeepSeekMath (Shao et al., 2024) and later used to train DeepSeek-V2/V3 — eliminates the critic/value network that PPO requires. Rather than learning a value function to estimate baseline advantages, GRPO samples a group of outputs for each prompt, scores every output with the reward model, and computes each output's advantage relative to the group mean. This removes an entire model from the training pipeline, significantly reducing training resources versus PPO, and training is often more stable — because reward normalization happens naturally within each group rather than relying on a learned value head that can diverge early in training.
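A minimal sketch of group-relative advantage estimation (the group size, normalization by the group standard deviation, and the epsilon constant follow common open-source GRPO implementations; exact details vary by codebase):

import torch

def grpo_advantages(rewards_per_group):
    """Compute advantages relative to each prompt's own group of samples.

    rewards_per_group: tensor of shape (num_prompts, group_size) with reward
    model scores for G sampled completions per prompt.
    """
    mean = rewards_per_group.mean(dim=1, keepdim=True)
    std = rewards_per_group.std(dim=1, keepdim=True)
    # Each output's advantage is its reward relative to its own group,
    # replacing PPO's learned value baseline
    return (rewards_per_group - mean) / (std + 1e-8)

# Toy usage: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 2.5, 0.5, 3.0],
                        [0.2, 0.1, 0.4, 0.3]])
adv = grpo_advantages(rewards)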
Online DPO Variants — two notable extensions address DPO's original failure modes. IPO (Identity Preference Optimization, Azar et al. 2023) replaces DPO's log-sigmoid with a squared loss, removing the implicit assumption that preferences are generated by a Bradley-Terry model. This makes IPO more robust to label noise and prevents the model from over-fitting to the preference margin — a known failure mode when the preference dataset contains contradictory labels. KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024) goes further: it works with binary feedback (good / bad) instead of contrastive pairs, making data collection dramatically cheaper at scale. KTO is inspired by prospect theory — humans are more sensitive to losses than gains — and optimizes a utility function that mirrors this asymmetry.
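For concreteness, a hedged sketch of the IPO loss, written in the same style as the DPO loss later in this section (the squared-error form below follows common open-source implementations; tau plays the role of DPO's β, and conventions differ slightly across papers and codebases):

import torch

def ipo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, tau=0.1):
    """IPO: squared loss on the implicit-reward margin instead of log-sigmoid."""
    pi_ratio_w = pi_logps_w - ref_logps_w    # preferred log-ratio
    pi_ratio_l = pi_logps_l - ref_logps_l    # dispreferred log-ratio
    margin = pi_ratio_w - pi_ratio_l
    # Regress the margin toward 1/(2*tau) rather than pushing it to infinity,
    # which is what makes IPO less prone to over-fitting the preference margin
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()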
RLAIF at Scale — Anthropic's Constitutional AI work showed that AI feedback can scale harmlessness training with far fewer human labels. The core insight: use a strong LLM to critique and rank outputs against a written “constitution” of principles, then train on those AI-generated preferences. The Llama 3 technical report describes using both human and AI-generated preference data for reward model training, reflecting the emerging consensus: expensive human annotation is best reserved for calibration and red-teaming, while AI feedback handles scale.
Quick check
What does the KL penalty in RLHF prevent?
Step-by-Step Derivation
PPO Objective (RLHF)
Maximize expected reward while staying close to the reference policy; β controls the KL penalty strength:

max_θ  E_{x∼D, y∼π_θ(·|x)} [ r_φ(x, y) ] − β · D_KL( π_θ(·|x) ‖ π_ref(·|x) )
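As an aside, many RLHF implementations fold the KL term into a per-token reward instead of adding it as a separate loss term; below is a minimal sketch of that formulation (shapes and names are illustrative, not from the InstructGPT codebase):

import torch

def kl_shaped_rewards(rm_scores, policy_logps, ref_logps, beta=0.02):
    """Combine reward-model scores with a per-token KL penalty.

    rm_scores:               (batch,) scalar reward-model score per completion
    policy_logps, ref_logps: (batch, seq_len) per-token log-probs
    """
    # Per-token KL estimate between the policy and the frozen reference
    kl = policy_logps - ref_logps                 # (batch, seq_len)
    shaped = -beta * kl                           # penalize drift at every token
    # Add the scalar reward-model score to the final token of each sequence
    shaped[:, -1] += rm_scores
    return shaped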
DPO Loss
Direct Preference Optimization reformulates the RLHF objective into a simple classification loss. Given preferred output y_w and dispreferred output y_l:

L_DPO(θ) = −E_{(x, y_w, y_l)∼D} [ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]
The key insight: the log-probability ratio log(π_θ(y|x) / π_ref(y|x)), scaled by β, acts as an implicit reward. DPO increases this ratio for preferred outputs and decreases it for dispreferred ones.
PyTorch: DPO Loss
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Direct Preference Optimization loss."""
    # Log-probability ratios (implicit rewards)
    pi_ratio_w = pi_logps_w - ref_logps_w  # preferred
    pi_ratio_l = pi_logps_l - ref_logps_l  # dispreferred
    # DPO loss: -log sigmoid(beta * (preferred - dispreferred))
    logits = beta * (pi_ratio_w - pi_ratio_l)
    loss = -F.logsigmoid(logits).mean()
    return loss

PPO-Clip Objective (RLHF)
PPO clips the probability ratio to prevent large policy updates. The KL penalty term keeps the policy anchored to the reference model. ε is the clip range; β is the KL coefficient (typically 0.01–0.1 in RLHF):

L_PPO(θ) = −E [ min( r(θ) · Â, clip(r(θ), 1 − ε, 1 + ε) · Â ) ] + β · D_KL( π_θ ‖ π_ref ),  where r(θ) = π_θ(y|x) / π_old(y|x)
PyTorch: PPO-Clip Loss for RLHF
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, ref_log_probs,
                  advantages, eps=0.2, beta=0.05):
    """PPO-Clip objective for RLHF.

    new_log_probs: log π_θ(y|x) under current policy
    old_log_probs: log π_old(y|x) from the rollout policy
    ref_log_probs: log π_ref(y|x) from frozen SFT checkpoint
    advantages:    r(x,y) - V(x) — reward minus value baseline
    """
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # KL divergence penalty against reference (SFT) policy
    kl_penalty = beta * (new_log_probs - ref_log_probs).mean()
    return policy_loss + kl_penalty

PyTorch implementation
# PPO clipped surrogate objective for RLHF
import torch, torch.nn.functional as F

def ppo_rlhf_step(policy, ref_policy, reward_model, prompts, eps=0.2, beta=0.05):
    """One PPO update step in the RLHF loop."""
    # 1. Sample completions from current policy
    with torch.no_grad():
        completions = policy.generate(prompts, do_sample=True, max_new_tokens=256)
        rewards = reward_model(prompts, completions)       # scalar per sample
        old_log_probs = policy.log_prob(completions)       # log π_old
        ref_log_probs = ref_policy.log_prob(completions)   # log π_ref (frozen SFT)
    # 2. Compute advantages (reward - value baseline; simplified: reward - mean)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # 3. PPO-clip + KL penalty
    new_log_probs = policy.log_prob(completions)
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_loss = beta * (new_log_probs - ref_log_probs).mean()
    return policy_loss + kl_loss

Quick check
In the DPO loss, the term log(π_θ(y|x) / π_ref(y|x)) acts as an implicit reward. If this ratio is > 1 for the preferred output, what does that mean?
Alignment Methods Compared
| | PPO (RLHF) | DPO | GRPO | RLAIF |
|---|---|---|---|---|
| Reward model needed? | Yes | No | Yes (or rule-based) | AI-generated |
| Critic/value network? | Yes | No | No | Varies |
| Online data generation? | Yes | No (offline) | Yes | Yes |
| Complexity | High | Low | Medium | Medium |
| Used by | InstructGPT, Llama-2 | Zephyr, Tulu | DeepSeek-V2/V3 | Claude (CAI) |
Break It — See What Happens
Quick check
You observe that after 20K PPO steps, reward model score is at an all-time high but human evaluators rank outputs worse than the SFT baseline. What is the most likely diagnosis?
Real-World Numbers
InstructGPT (Ouyang et al. 2022) — Key Numbers
- 1.3B model params (RLHF model that beat 175B GPT-3 in human evals)
- 40 labelers (contractors with strong English skills, not ML experts)
- ~33K training prompts' worth of comparison pairs used to train the reward model
- ~73% inter-labeler agreement — human preferences are noisy
- 0.01–0.1 typical KL coefficient β range (InstructGPT used β ≈ 0.02)
- 256K PPO episodes (batch size 64, ~4K steps total)
- 6–10 KL units from reference before quality degrades (overoptimization cliff)
| Model | Alignment Method | Details |
|---|---|---|
| InstructGPT | RLHF (PPO) | 40 labelers, ~33K RM comparison prompts, β ≈ 0.02, ~256K PPO episodes |
| Claude | Constitutional AI | RLAIF — AI-generated feedback from a constitution |
| Llama-2 Chat | RLHF (PPO) | ~1M preference pairs, 2 reward models (helpfulness + safety) |
| Zephyr | DPO | |
| DeepSeek-V2 | GRPO | Group-relative optimization, no critic network needed |
Compute & Labeling Costs
| Model | Labeling Scale | Estimated Cost / Notes |
|---|---|---|
| InstructGPT | ~40 labelers | Est. $1–10M for human labeling (Ouyang et al. 2022); small scale, high-quality curation |
| Llama-2 Chat | ~1M preference pairs | 2 separate reward models (helpfulness + safety); largest public RLHF dataset at the time |
| DeepSeek-V2 (GRPO) | Rule-based + RM | No critic network → significant memory reduction vs. PPO; single model pipeline simplifies infrastructure |
Quick check
InstructGPT's 1.3B RLHF model beat GPT-3 175B in human preference. A product team concludes they should always use the smallest model with RLHF. What critical caveat does the paper data reveal?
Key Takeaways
What to remember for interviews
1. RLHF has 3 stages: SFT → Reward Model → PPO optimization
2. The reward model learns a scalar score from human preference pairs (Bradley-Terry model)
3. KL penalty prevents reward hacking — beta controls the constraint strength
4. DPO skips the reward model by directly optimizing on preference pairs
5. GRPO (DeepSeek) eliminates the critic network using group-relative advantages
Recap quiz
You raise β (the KL penalty coefficient) in PPO-RLHF from 0.02 to 0.5. What is the primary effect?
InstructGPT's 1.3B RLHF model outperformed the 175B GPT-3 base in human evals. What does this imply about the compute budget for alignment?
DPO eliminates the reward model from the training pipeline. What is the main practical limitation this introduces compared to PPO-RLHF?
GRPO removes the critic network by using group-mean advantage estimation. What failure mode does this trade away compared to PPO?
Reward overoptimization peaks around 6–10 KL units from the reference policy. What causes quality to DROP beyond this point even as reward model score keeps rising?
Constitutional AI (RLAIF) replaces human preference labelers with AI-generated feedback. Which failure mode does this introduce that human labeling avoids?
InstructGPT labelers agreed on pairwise preferences ~73% of the time. How should this noise be handled when training the reward model?
Further Reading
- Training language models to follow instructions with human feedback (InstructGPT) — OpenAI's InstructGPT paper — the SFT + reward model + PPO pipeline that launched the RLHF era.
- Constitutional AI: Harmlessness from AI Feedback — Anthropic's approach using AI-generated critiques and revisions to reduce reliance on human labelers.
- Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019) — The original RLHF paper applying reward learning from human preferences to language models.
- Lilian Weng's Blog — RLHF, reward modeling, and alignment — Comprehensive posts covering reward hacking, Goodhart's Law applied to reward models, and mitigation strategies including KL penalties and ensemble methods.
- Reward Model Ensembles Help Mitigate Overoptimization — Coste et al. 2023 — shows that reward hacking can be reduced by ensembling multiple reward models, with quantitative analysis of the overoptimization curve.
- Andrej Karpathy — The State of GPT (Microsoft Build 2023) — Practical walkthrough of the full RLHF pipeline from SFT through reward modeling to PPO — includes real data requirements and compute costs.
Interview Questions
Walk through the full RLHF pipeline. What are the three stages and why is each necessary? ★★☆
Explain DPO and how it differs from PPO-based RLHF. What are its advantages? ★★★
What is reward hacking and how does the KL penalty prevent it? ★★☆
What is Constitutional AI (CAI) and how does it reduce reliance on human labelers? ★★☆
How does GRPO (Group Relative Policy Optimization) differ from PPO? ★★★
How is preference data collected? What are the challenges and biases? ★★☆
What is the role of the reference model in RLHF/DPO and what happens if you remove it? ★★☆
Compare RLHF, DPO, and RLAIF. When would you choose each approach? ★★★
What is sycophancy in RLHF and how does it emerge? ★★☆
What is reward overoptimization? At what point does optimizing against a reward model hurt quality? ★★★