⚖️ DPO, GRPO & Alternatives
DPO skips the reward model and PPO entirely — yet matches RLHF quality. Here's the math that makes 2 stages collapse into 1.
RLHF works, but it requires training a reward model and running PPO — two complex, unstable steps. DPO collapses both into a single loss function. GRPO takes a different shortcut: keep the reward model, drop the critic. This module covers the evolution from PPO to DPO to GRPO — simpler alignment, same goal.
DPO vs RLHF Pipeline
What you’re seeing: RLHF’s four-model pipeline (supervised fine-tune → reward model → PPO policy + critic) collapsed into DPO’s single offline loss — same KL-constrained preference objective, no reward model or rollout sampler needed. What to try: Follow the arrow from the preference dataset: in RLHF it feeds a separate reward model; in DPO it feeds the loss directly — trace which boxes disappear.
RLHF vs DPO Pipeline
Side-by-side comparison:
RLHF (PPO) — 4 Models
Policy Model
being optimized
Reference Model
frozen SFT checkpoint
Reward Model
scores outputs
Value Model (Critic)
estimates advantages
DPO — 2 Models
Policy Model
being optimized
Reference Model
frozen SFT checkpoint
No Reward Model
implicit in log-prob ratios
No Critic
no advantage estimation needed
Evolution of Alignment Methods
PPO
2022
4 models, complex
DPO
2023
2 models, offline
GRPO
2024
no critic, online
The Intuition
The DPO insight: In RLHF, the optimal policy under KL-constrained reward maximization has a closed-form solution. The reward can be expressed as a function of the policy's log-probabilities and the reference model's log-probabilities. DPO inverts this — instead of learning a reward then optimizing, it directly optimizes the policy using preference pairs.
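Written out, the KL-constrained objective $\max_\pi \; \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta\, \mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\big)$ is maximized by

$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y|x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

Solving this for $r(x, y)$ expresses the reward purely through policy and reference log-probabilities, which is the substitution DPO bakes into its loss (see the implicit-reward equation below).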
GRPO (DeepSeek): Group Relative Policy Optimization takes a different approach. Instead of eliminating the reward signal (like DPO), it eliminates the critic/value network. For each prompt, sample K outputs, score them all with a reward model or rule-based verifiable rewards (as in DeepSeek-R1), and compute advantages relative to the group mean. No value network needed — the group itself provides the baseline.
Constitutional AI (RLAIF): Instead of human labelers ranking outputs, an AI critiques its own outputs against a set of explicit principles — the "constitution." This produces preference data at scale without human annotation. Anthropic uses this for Claude's alignment.
SimPO takes simplification one step further by also removing the reference model. DPO's implicit reward uses log-probability ratios against a frozen reference: $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$. SimPO replaces this with a reference-free reward based on the sequence-average log-probability of the policy itself:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$$

Dividing by the sequence length $|y|$ corrects the length bias that plagues standard DPO (longer outputs accumulate more log-prob mass). The margin $\gamma$ enforces a minimum gap between preferred and dispreferred scores. SimPO trains with only one model in memory — no reference forward pass — which reduces memory and makes training roughly 20% faster than DPO (Meng et al., 2024).
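As a rough sketch, the SimPO objective can be computed directly from summed policy log-probabilities. The helper below is illustrative only; the argument names and default beta/gamma values are assumptions, not the paper's tuned settings.

import torch.nn.functional as F

def simpo_loss(pi_logps_w, pi_logps_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO-style reference-free, length-normalized preference loss (sketch).
    pi_logps_*: (batch,) log-probs summed over response tokens, policy only.
    len_*: (batch,) response token counts for sequence-average normalization.
    """
    reward_w = beta * pi_logps_w / len_w  # sequence-average log-prob, preferred
    reward_l = beta * pi_logps_l / len_l  # sequence-average log-prob, dispreferred
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()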
Why is DPO simpler than PPO-based RLHF?
Step-by-Step Derivation
DPO Loss (The Key Equation)
Given preferred output $y_w$ and dispreferred output $y_l$ for prompt $x$, DPO minimizes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
The term $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ is the implicit reward. DPO increases this ratio for preferred outputs and decreases it for dispreferred outputs — no explicit reward model needed.
GRPO Advantage Estimation
For each prompt, sample $K$ outputs $y_1, \dots, y_K$. Score each with the reward model to get $r_1, \dots, r_K$, then compute group-relative advantages:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)}$$
No critic network. The group of samples provides the baseline. Normalization across the group handles reward scale differences between prompts automatically.
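A minimal sketch of this computation, assuming a (K,) tensor of reward-model or verifier scores for one prompt's samples (the epsilon guards against zero variance when all rewards tie):

import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: baseline = group mean, scale = group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled outputs for the same prompt
rewards = torch.tensor([0.2, 0.9, 0.4, 0.5])
print(grpo_advantages(rewards))  # above-average samples get positive advantages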
Implicit Reward (DPO vs Explicit)
The closed-form relationship between the optimal policy and the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$
The partition function $Z(x)$ cancels in the preference comparison — which is why DPO never needs to compute it. This cancellation is the mathematical trick that makes DPO work.
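Written out for a preference pair, the comparison depends only on the reward difference; the $\beta \log Z(x)$ term appears in both rewards and subtracts away:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$$

Plugging this difference into the Bradley–Terry preference probability $\sigma\big(r(x, y_w) - r(x, y_l)\big)$ recovers exactly the DPO loss above, with no $Z(x)$ left to estimate.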
PyTorch: DPO Loss in 10 Lines
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Direct Preference Optimization loss.
    All inputs: (batch,) log-probabilities summed over tokens.
    """
    # Implicit rewards: log-ratio of policy vs reference
    reward_w = pi_logps_w - ref_logps_w  # preferred
    reward_l = pi_logps_l - ref_logps_l  # dispreferred
    # DPO loss = -log sigmoid(beta * (preferred_reward - dispreferred_reward))
    logits = beta * (reward_w - reward_l)
    return -F.logsigmoid(logits).mean()

PyTorch implementation
# DPO training loop — no reward model, no PPO
import torch, torch.nn.functional as F

def compute_log_probs(model, input_ids, labels):
    """Sum log-probs over assistant tokens (labels != -100)."""
    logits = model(input_ids).logits[:, :-1]  # (B, T-1, V)
    shift_labels = labels[:, 1:]  # (B, T-1)
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, shift_labels.clamp(0).unsqueeze(-1)).squeeze(-1)
    mask = (shift_labels != -100).float()
    return (token_lp * mask).sum(-1)  # (B,)

def dpo_loss(policy, ref_policy, batch, beta=0.1):
    """batch contains: chosen_ids, rejected_ids, chosen_labels, rejected_labels."""
    pi_w = compute_log_probs(policy, batch["chosen_ids"], batch["chosen_labels"])
    pi_l = compute_log_probs(policy, batch["rejected_ids"], batch["rejected_labels"])
    ref_w = compute_log_probs(ref_policy, batch["chosen_ids"], batch["chosen_labels"])
    ref_l = compute_log_probs(ref_policy, batch["rejected_ids"], batch["rejected_labels"])
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

Quick check
In the DPO loss, beta multiplies the log-ratio difference before the sigmoid. If beta is doubled while keeping all else equal, what happens to the gradient magnitude?
Break It — See What Happens
Quick check
If the same model instance (shared weights) is used for both pi_theta and pi_ref in DPO, what happens to the gradient?
Real-World Numbers
| Model | Method | Details |
|---|---|---|
| Zephyr-7B | DPO | MT-Bench 7.34 with DPO on a Mistral-7B base, vs 6.27 for Llama-2-Chat-70B. |
| DeepSeek-V2 | GRPO | 236B MoE model. |
| Llama-2 Chat | PPO (RLHF) | Training required 4 models simultaneously in memory. |
| InstructGPT | PPO (RLHF) | OpenAI, 2022. |
| Claude | Constitutional AI | RLAIF — AI generates preference data from principles. Scales to millions of preference pairs without human labelers. |
| Tulu-2 | DPO | Allen AI. DPO training on 70B was substantially cheaper than PPO-style RLHF — no reward model or value network inference during training. |
Quick check
PPO-based RLHF requires 4 models in GPU memory simultaneously. DPO requires 2. For a 7B parameter model in BF16 (2 bytes/param, roughly 14 GB per model), what is the approximate peak VRAM difference?
SOTA 2024–2025: DPO Successors
DPO spawned a family of reference-model-free and SFT-stage-free algorithms. Each removes another component from the RLHF pipeline while matching or beating DPO on win-rate benchmarks.
DPO Successor Landscape (2024–2025)
| Algorithm | Reward model? | Reference model? | SFT stage? | Memory cost |
|---|---|---|---|---|
| DPO | No | Yes (frozen) | Yes | 2 models |
| SimPO (May 2024) | No | No | Yes | 1 model (~20% faster) |
| ORPO (2024) | No | No | No (merged) | 1 model |
| KTO (2024) | No | Yes (frozen) | Yes | 2 models |
| GRPO (DeepSeek) | Yes (rule-based) | Optional | Yes | No critic; lower than PPO |
| DAPO (ByteDance, 2025) | Yes (rule-based) | Optional | Yes | GRPO + dynamic sampling |
SimPO — Simple Preference Optimization (May 2024)
Removes the reference model by using sequence-average log-probability as a reference-free reward, and adds a length-normalization term and a margin $\gamma$. Result: +6.4 points over DPO on AlpacaEval 2 (per arxiv:2405.14734, 2024-05).
ORPO — Odds-Ratio Preference Optimization (2024)
Fuses SFT and preference optimization into a single loss by adding an odds-ratio penalty on dispreferred outputs — no reference model, no separate SFT stage. It still uses chosen/rejected preference pairs, but keeps only the policy in memory (per arxiv:2403.07691, 2024-03).
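A rough sketch of the odds-ratio penalty, assuming length-normalized sequence probabilities under the policy; the function name, argument layout, and the lambda default are illustrative, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def orpo_loss(logps_w, logps_l, len_w, len_l, nll_w, lam=0.1):
    """ORPO-style loss sketch: SFT NLL on the chosen response plus an odds-ratio penalty.
    logps_*: (batch,) summed log-probs of chosen/rejected under the policy (no reference model).
    len_*: (batch,) response lengths; nll_w: (batch,) mean token NLL of the chosen response.
    """
    logp_w = logps_w / len_w  # length-normalized log-probability of chosen
    logp_l = logps_l / len_l  # length-normalized log-probability of rejected
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))  # log odds = log p - log(1 - p)
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))
    or_penalty = -F.logsigmoid(log_odds_w - log_odds_l)  # push chosen odds above rejected
    return (nll_w + lam * or_penalty).mean()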
KTO — Kahneman-Tversky Optimization (2024)
Frames preference learning as prospect theory: losses loom larger than gains. Trains on unpaired binary feedback (thumbs-up / thumbs-down) rather than preference pairs — dramatically cheaper to collect. Matches DPO quality at scale without paired comparisons.
GRPO — Group Relative Policy Optimization (DeepSeek-R1, 2025)
Samples $K$ outputs per prompt (16 per question in DeepSeek-R1), scores each with a rule-based verifier, and uses the group mean/std as the baseline — no critic network. DeepSeek-R1 training hyperparameters: lr=3e-6, KL coefficient=0.001, batch=512 (per arxiv:2501.12948, 2025-01). Memory is substantially lower than PPO because the value network is eliminated entirely.
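For checkable tasks, the rule-based reward itself can be a few lines. Below is a hypothetical verifier that compares a boxed final answer against ground truth; the answer format and function name are assumptions for illustration, not DeepSeek's actual implementation.

import re

def verifiable_reward(completion, ground_truth):
    r"""Hypothetical rule-based reward: 1.0 if the final \boxed{...} answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

# Score one group of samples for the same prompt, then normalize with the GRPO advantage code above
samples = ["... so the answer is \\boxed{42}", "... giving \\boxed{41}"]
rewards = [verifiable_reward(s, "42") for s in samples]  # [1.0, 0.0]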
DAPO — Dynamic Sampling Policy Optimization (ByteDance, 2025)
Extends GRPO with dynamic token-level sampling to stabilize training at scale. Removes samples where all outputs have the same reward (no gradient signal) and clips token-level loss rather than just the policy ratio. Addresses entropy collapse and reward hacking at long-horizon reasoning scale (per arxiv:2503.14476, 2025-03).
Key Takeaways
What to remember for interviews
1. DPO collapses PPO's reward model + RL optimizer into a single binary cross-entropy loss over preference pairs — reducing from 4 models to 2.
2. The implicit reward is the log-probability ratio log(π_θ / π_ref). DPO increases this ratio for preferred outputs and decreases it for dispreferred ones.
3. Beta controls KL divergence from the reference model. Too low: mode collapse and noisy gradients. Too high: training is too conservative. Typical range: 0.1–0.5.
4. GRPO eliminates the critic network by using group-relative advantages: sample K outputs per prompt, normalize rewards by group mean and std.
5. SimPO removes the reference model entirely, using sequence-average log-probability as a reference-free reward with a length-normalization and margin term.
6. SOTA 2024–2025: SimPO +6.4 AlpacaEval 2 over DPO; ORPO fuses SFT + preference into one pass; KTO trains on binary feedback (no pairs); GRPO (DeepSeek-R1) uses rule-based verifiers with 16 samples/question; DAPO (ByteDance) stabilizes GRPO at scale via dynamic token sampling.
Recap quiz
DPO is derived from the same KL-constrained objective as PPO-based RLHF. What mathematical property lets DPO skip the reward model entirely?
A team sets DPO beta=0.01 hoping for aggressive preference learning. What is the most likely failure mode?
What does the quantity beta * log(pi_theta(y|x) / pi_ref(y|x)) represent in DPO?
A team is building a code-generation assistant and has 10K preference pairs but limited human labelers. They can generate unlimited test cases programmatically. Which method best exploits this setup?
Offline DPO shows good loss but degrades at eval time after 5 epochs. What is the most likely root cause?
Zephyr-7B (DPO on Mistral 7B) achieved MT-Bench 7.34 vs Llama-2-Chat-70B at 6.27. What does this benchmark result imply about DPO vs PPO training at 7B scale?
Standard DPO often produces longer outputs even when the preferred example is shorter. SimPO fixes this with length normalization. What is the root cause in DPO?
Further Reading
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — The DPO paper — eliminates the reward model by directly optimizing policy from preference pairs.
- SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al. 2024 — removes the reference model from DPO entirely, using sequence-average log-probability as the implicit reward.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek's GRPO approach — group relative policy optimization without a critic model.
- Constitutional AI: Harmlessness from AI Feedback — Anthropic's RLAIF method generating preference data from AI feedback instead of human annotators.
- Lilian Weng's Blog — In-depth posts on preference learning, RLHF variants, and alignment techniques including DPO analysis.
- A General Theoretical Paradigm to Understand Learning from Human Feedback (IPO) — Azar et al. 2024 — introduces IPO to address DPO's implicit assumption about the data generation process, fixing overfitting on deterministic preferences.
- ORPO: Monolithic Preference Optimization without Reference Model — Hong et al. 2024 — combines SFT and preference optimization in a single pass using an odds ratio penalty, eliminating the need for a reference model entirely.
- Zephyr: Direct Distillation of LM Alignment — Tunstall et al. 2023 — the first widely-adopted DPO model (Mistral-7B base), with a concrete recipe for distilled preference data generation.
Interview Questions
Why is DPO simpler than PPO-based RLHF? What does it eliminate? ★★☆
When would you choose PPO-based RLHF over DPO? When would you choose DPO? ★★★
What are the advantages of GRPO over both PPO and DPO? ★★★
What is Constitutional AI (RLAIF) and why does it matter for scaling alignment? ★★☆
What does beta control in DPO, and what happens if you set it too high or too low? ★★☆
Compare offline DPO vs online DPO. What is the distribution mismatch problem? ★★★
What is the offline-to-online gap in DPO? When does DPO underperform PPO? ★★★
How does iterative DPO / online DPO address DPO's weaknesses? ★★★