
Transformer Math

Module 21 · Training

⚖️ DPO, GRPO & Alternatives

DPO skips the reward model and PPO entirely — yet matches RLHF quality. Here's the math that makes 2 stages collapse into 1.


RLHF works, but it requires training a reward model and running PPO — two complex, unstable steps. DPO collapses both into a single loss function. GRPO takes a different shortcut: keep the reward model, drop the critic. This module covers the evolution from PPO to DPO to GRPO — simpler alignment, same goal.

DPO vs RLHF Pipeline

What you’re seeing: RLHF’s four-stage pipeline (supervised fine-tune → reward model → PPO policy + critic) collapsed into DPO’s single offline loss — same KL-constrained preference objective, no reward model or rollout sampler needed. What to try: Follow the arrow from the preference dataset: in RLHF it feeds a separate reward model; in DPO it feeds the loss directly — trace which boxes disappear.

[Diagram] PPO/RLHF path: Preference Data → Train Reward Model → RL Training (PPO + KL penalty) → Aligned Model (4 steps, complex). DPO path: Preference Data → Direct Optimization → Aligned Model (2 steps, simple; the reward model and RL training stages are skipped). Caption: DPO derives the optimal policy directly from preferences — no reward model, no RL.
🎮

RLHF vs DPO Pipeline

Side-by-side comparison of the two pipelines:

RLHF (PPO) — 4 Models

Policy Model

being optimized

Reference Model

frozen SFT checkpoint

Reward Model

scores outputs

Value Model (Critic)

estimates advantages

+ PPO clipping, GAE, multiple epochs per batch

DPO — 2 Models

Policy Model

being optimized

Reference Model

frozen SFT checkpoint

No Reward Model

implicit in log-prob ratios

No Critic

no advantage estimation needed

Just binary cross-entropy on preference pairs

Evolution of Alignment Methods

PPO

2022

4 models, complex

DPO

2023

2 models, offline

GRPO

2024

no critic, online

💡

The Intuition

The DPO insight: In RLHF, the optimal policy under KL-constrained reward maximization has a closed-form solution. The reward can be expressed as a function of the policy's log-probabilities and the reference model's log-probabilities. DPO inverts this — instead of learning a reward then optimizing, it directly optimizes the policy using preference pairs.

GRPO (DeepSeek): Group Relative Policy Optimization takes a different approach. Instead of eliminating the reward signal (like DPO), it eliminates the critic/value network. For each prompt, sample K outputs, score them all with a reward model or rule-based verifiable rewards (as in DeepSeek-R1), and compute advantages relative to the group mean. No value network needed — the group itself provides the baseline.

Constitutional AI (RLAIF): Instead of human labelers ranking outputs, an AI critiques its own outputs against a set of explicit principles — the "constitution." This produces preference data at scale without human annotation. Anthropic uses this for Claude's alignment.

✨ Insight · The evolution: PPO asks "how good is this output?" (reward model) then "how do I improve?" (PPO). DPO asks "which output is better?" directly. GRPO asks "how does this output compare to its siblings?" Each step removes a component while preserving the core signal: human preferences.

SimPO takes simplification one step further by also removing the reference model. DPO's implicit reward uses log-probability ratios against a frozen reference:

$$r_{\text{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

SimPO replaces this with a reference-free reward based on the sequence-average log-probability of the policy itself:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)$$

Dividing by sequence length $|y|$ corrects the length bias that plagues standard DPO (longer outputs accumulate more log-prob mass). The margin $\gamma$ enforces a minimum gap between preferred and dispreferred scores. SimPO trains with only one model in memory — no reference forward pass — which makes training roughly 20% faster (Meng et al., 2024).

Quick Check

Why is DPO simpler than PPO-based RLHF?

📐

Step-by-Step Derivation

DPO Loss (The Key Equation)

Given preferred output $y_w$ and dispreferred output $y_l$ for prompt $x$, DPO minimizes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

The term $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward. DPO increases this ratio for preferred outputs and decreases it for dispreferred outputs — no explicit reward model needed.

💡 Tip · $\beta$ controls divergence from the reference. This is the single most important hyperparameter in DPO.

GRPO Advantage Estimation

For each prompt, sample $K$ outputs. Score each with the reward model, then compute group-relative advantages:

$$A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_K)}{\text{std}(r_1, \ldots, r_K)}$$

No critic network. The group of samples provides the baseline. Normalization across the group handles reward scale differences between prompts automatically.
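The group-relative advantage above can be sketched in a few lines of PyTorch (the tensor layout and function name here are illustrative, not from a specific library):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its own group (one group = K samples for one prompt).

    rewards: (num_prompts, K) tensor of scalar rewards.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, K=4 samples scored by a binary verifier (1 = correct)
r = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
adv = grpo_advantages(r)
# Correct samples get positive advantage, incorrect get negative;
# the advantages within each group sum to (approximately) zero.
```

The `eps` guard also handles degenerate groups where every sample got the same reward (std = 0), which would otherwise divide by zero — exactly the case DAPO later filters out.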

Implicit Reward (DPO vs Explicit)

The closed-form relationship between the optimal policy and the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

The partition function $Z(x)$ cancels in the preference comparison — which is why DPO never needs to compute it. This cancellation is the mathematical trick that makes DPO work.
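Writing out the comparison makes the cancellation explicit. Substituting the closed-form reward into the Bradley–Terry preference probability:

```latex
% Closed-form reward for the KL-constrained optimum:
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley–Terry preference probability for a pair (y_w, y_l):
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

% Both rewards condition on the same prompt x, so the
% \beta \log Z(x) terms subtract away:
p(y_w \succ y_l \mid x)
  = \sigma\!\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
                 - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)
```

Replacing $\pi^*$ with the trainable $\pi_\theta$ and maximizing the log of this probability over the preference dataset gives exactly the DPO loss above.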

PyTorch: DPO Loss in 10 Lines

python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Direct Preference Optimization loss.
    All inputs: (batch,) log-probabilities summed over tokens.
    """
    # Implicit rewards: log-ratio of policy vs reference
    reward_w = pi_logps_w - ref_logps_w  # preferred
    reward_l = pi_logps_l - ref_logps_l  # dispreferred

    # DPO loss = -log sigmoid(beta * (preferred_reward - dispreferred_reward))
    logits = beta * (reward_w - reward_l)
    return -F.logsigmoid(logits).mean()
PyTorch implementation
# DPO training loop — no reward model, no PPO
import torch, torch.nn.functional as F

def compute_log_probs(model, input_ids, labels):
    """Sum log-probs over assistant tokens (labels != -100)."""
    logits = model(input_ids).logits[:, :-1]           # (B, T-1, V)
    shift_labels = labels[:, 1:]                        # (B, T-1)
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, shift_labels.clamp(0).unsqueeze(-1)).squeeze(-1)
    mask = (shift_labels != -100).float()
    return (token_lp * mask).sum(-1)                    # (B,)

def dpo_loss(policy, ref_policy, batch, beta=0.1):
    """batch contains: chosen_ids, rejected_ids, chosen_labels, rejected_labels."""
    pi_w = compute_log_probs(policy,     batch["chosen_ids"],   batch["chosen_labels"])
    pi_l = compute_log_probs(policy,     batch["rejected_ids"], batch["rejected_labels"])
    ref_w = compute_log_probs(ref_policy, batch["chosen_ids"],  batch["chosen_labels"])
    ref_l = compute_log_probs(ref_policy, batch["rejected_ids"], batch["rejected_labels"])

    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

Quick check

Derivation

In the DPO loss, beta multiplies the log-ratio difference before the sigmoid. If beta is doubled while keeping all else equal, what happens to the gradient magnitude?

🔧

Break It — See What Happens

Set beta too low (beta → 0)
Use the same (updating) model as both reference and policy
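The second failure mode can be verified with a toy example: if the "reference" shares the policy's weights and computation graph (nothing detached or frozen), the log-ratios are identically zero, so the loss is stuck at log 2 and no gradient flows. A minimal sketch with scalar stand-ins for the summed log-probs:

```python
import torch
import torch.nn.functional as F

# Toy summed log-probs for one preference pair
pi_w = torch.tensor([-5.0], requires_grad=True)  # preferred
pi_l = torch.tensor([-7.0], requires_grad=True)  # dispreferred

# Broken setup: the reference IS the policy (same tensors, not detached),
# so each implicit reward (pi - ref) is identically zero.
logits = 0.1 * ((pi_w - pi_w) - (pi_l - pi_l))
loss = -F.logsigmoid(logits).mean()
loss.backward()
# loss == -log(sigmoid(0)) == log(2) ≈ 0.6931, a constant
# pi_w.grad and pi_l.grad are exactly 0.0: no learning signal
```

The fix is the standard one: hold a frozen copy of the SFT checkpoint (or detach its log-probs) as `ref_policy`.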

Quick check

Derivation

If the same model instance (shared weights) is used for both pi_theta and pi_ref in DPO, what happens to the gradient?

📊

Real-World Numbers

| Model | Method | Details |
| --- | --- | --- |
| Zephyr-7B | DPO | DPO on Mistral 7B; MT-Bench 7.34. |
| DeepSeek-V2 | GRPO | 236B MoE model. |
| Llama-2 Chat | PPO (RLHF) | Training required 4 models simultaneously in memory. |
| InstructGPT | PPO (RLHF) | OpenAI, 2022. |
| Claude | Constitutional AI | RLAIF — AI generates preference data from principles. Scales to millions of preference pairs without human labelers. |
| Tulu-2 | DPO | Allen AI. DPO training on 70B was substantially cheaper than PPO-style RLHF — no reward model or value network inference during training. |

Quick check

Derivation

PPO-based RLHF requires 4 models in GPU memory simultaneously. DPO requires 2. For a 7B parameter model in BF16 (2 bytes per parameter, so ~14 GB per model), what is the approximate peak VRAM difference?

🚀

SOTA 2024–2025: DPO Successors

DPO spawned a family of reference-model-free and SFT-stage-free algorithms. Each removes another component from the RLHF pipeline while matching or beating DPO on win-rate benchmarks.

DPO Successor Landscape (2024–2025)

| Algorithm | Reward model? | Reference model? | SFT stage? | Memory cost |
| --- | --- | --- | --- | --- |
| DPO | No | Yes (frozen) | Yes | 2 models |
| SimPO (May 2024) | No | No | Yes | 1 model (~20% faster) |
| ORPO (2024) | No | No | No (merged) | 1 model |
| KTO (2024) | No | Yes (frozen) | Yes | 2 models |
| GRPO (DeepSeek) | Yes (rule-based) | Optional | Yes | No critic; lower than PPO |
| DAPO (ByteDance, 2025) | Yes (rule-based) | Optional | Yes | GRPO + dynamic sampling |

SimPO — Simple Preference Optimization (May 2024)

Removes the reference model by using sequence-average log-probability as a reference-free reward, and adds a length-normalization term and a margin $\gamma$. Result: up to +6.4 points over DPO on AlpacaEval 2 (per arxiv:2405.14734, 2024-05).
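A minimal sketch of the SimPO loss, mirroring the `dpo_loss` style above. The defaults `beta=2.0` and `gamma=0.5` are illustrative values, not prescriptions from the paper:

```python
import torch
import torch.nn.functional as F

def simpo_loss(pi_logps_w, pi_logps_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO loss sketch: reference-free, length-normalized rewards.
    pi_logps_*: (batch,) log-probs summed over tokens, policy model only.
    len_*:      (batch,) response lengths in tokens.
    """
    # Reward = beta * average per-token log-prob (no reference model)
    reward_w = beta * pi_logps_w / len_w
    reward_l = beta * pi_logps_l / len_l
    # Margin gamma enforces a minimum gap between chosen and rejected
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()

# Example: the preferred response has the higher average per-token log-prob,
# so the loss is small even though it is the shorter sequence
loss = simpo_loss(torch.tensor([-10.0]), torch.tensor([-30.0]),
                  torch.tensor([10.0]), torch.tensor([15.0]))
```

Note the contrast with `dpo_loss`: no `ref_logps_*` inputs at all, which is where the memory and speed savings come from.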

ORPO — Odds-Ratio Preference Optimization (2024)

Fuses SFT and preference optimization into a single loss by adding an odds-ratio penalty on dispreferred outputs — no reference model, no separate SFT stage. Still trains on chosen/rejected preference pairs, but needs only one model in memory (per arxiv:2403.07691, 2024-03).
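A sketch of the odds-ratio penalty term, under the description above (the function name, the `lam` weight, and the `log_odds` helper are illustrative):

```python
import torch
import torch.nn.functional as F

def orpo_penalty(avg_logp_w, avg_logp_l, lam=0.1):
    """Odds-ratio penalty sketch.
    avg_logp_*: (batch,) length-averaged log-probs of chosen/rejected.
    The full ORPO loss adds this penalty to the ordinary NLL (SFT) loss
    on the chosen response; lam weights the penalty.
    """
    def log_odds(avg_logp):
        # log( p / (1 - p) ) with p = exp(avg_logp), computed stably
        return avg_logp - torch.log1p(-avg_logp.exp())

    return -lam * F.logsigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l)).mean()

# Example: chosen has per-token prob 0.5, rejected 0.25
penalty = orpo_penalty(torch.tensor([0.5]).log(), torch.tensor([0.25]).log())
```

Because the penalty rides on top of the SFT loss, one training pass both imitates the chosen responses and pushes their odds above the rejected ones.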

KTO — Kahneman-Tversky Optimization (2024)

Frames preference learning as prospect theory: losses loom larger than gains. Trains on unpaired binary feedback (thumbs-up / thumbs-down) rather than preference pairs — dramatically cheaper to collect. Matches DPO quality at scale without paired comparisons.

GRPO — Group Relative Policy Optimization (DeepSeek-R1, 2025)

Samples $K$ outputs per prompt (16 per question for DeepSeek-R1), scores each with a rule-based verifier, and uses the group mean/std as the baseline — no critic network. DeepSeek-R1 training hyperparameters: lr=3e-6, KL coefficient=0.001, batch=512 (per arxiv:2501.12948, 2025-01). Memory is substantially lower than PPO because the value network is eliminated entirely.

DAPO — Dynamic Sampling Policy Optimization (ByteDance, 2025)

Extends GRPO with dynamic token-level sampling to stabilize training at scale. Removes samples where all outputs have the same reward (no gradient signal) and clips token-level loss rather than just the policy ratio. Addresses entropy collapse and reward hacking at long-horizon reasoning scale (per arxiv:2503.14476, 2025-03).
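The "drop zero-signal samples" part of dynamic sampling is easy to sketch: if all $K$ outputs in a group got the same reward, the group-relative advantages are all zero and the group contributes no gradient, so it is filtered out. A minimal mask-based sketch (function name is illustrative):

```python
import torch

def keep_informative_groups(rewards: torch.Tensor) -> torch.Tensor:
    """DAPO-style dynamic-sampling filter sketch.
    Drop prompt groups whose K sampled outputs all received the same
    reward (e.g. all-correct or all-wrong under a binary verifier).

    rewards: (num_prompts, K). Returns a boolean keep-mask (num_prompts,).
    """
    return rewards.max(dim=-1).values != rewards.min(dim=-1).values

# Example: group 0 is all-correct (no gradient signal, dropped),
# group 1 is mixed (kept)
r = torch.tensor([[1.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])
keep = keep_informative_groups(r)
```

In practice DAPO keeps sampling until the batch is filled with informative groups, so every optimizer step carries a nonzero signal.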

✨ Insight · The evolution from DPO → SimPO → ORPO → GRPO/DAPO is a progressive ablation: each step removes a component (reference model, SFT stage, critic network) and replaces it with a simpler signal. The culmination is GRPO's rule-based verifiers: for math and code, you don't need human preferences at all — just check correctness.
🧠

Key Takeaways

What to remember for interviews

  1. DPO collapses PPO's reward model + RL optimizer into a single binary cross-entropy loss over preference pairs — reducing from 4 models to 2.
  2. The implicit reward is the log-probability ratio log(π_θ / π_ref). DPO increases this ratio for preferred outputs and decreases it for dispreferred ones.
  3. Beta controls KL divergence from the reference model. Too low: mode collapse and noisy gradients. Too high: training is too conservative. Typical range: 0.1–0.5.
  4. GRPO eliminates the critic network by using group-relative advantages: sample K outputs per prompt, normalize rewards by group mean and std.
  5. SimPO removes the reference model entirely, using sequence-average log-probability as a reference-free reward with a length-normalization and margin term.
  6. SOTA 2024–2025: SimPO +6.4 AlpacaEval 2 over DPO; ORPO fuses SFT + preference into one pass; KTO trains on binary feedback (no pairs); GRPO (DeepSeek-R1) uses rule-based verifiers with 16 samples/question; DAPO (ByteDance) stabilizes GRPO at scale via dynamic token sampling.
🧠

Recap quiz

Derivation

DPO is derived from the same KL-constrained objective as PPO-based RLHF. What mathematical property lets DPO skip the reward model entirely?

Trade-off

A team sets DPO beta=0.01 hoping for aggressive preference learning. What is the most likely failure mode?

Derivation

What does the quantity beta * log(pi_theta(y|x) / pi_ref(y|x)) represent in DPO?

Trade-off

A team is building a code-generation assistant and has 10K preference pairs but limited human labelers. They can generate unlimited test cases programmatically. Which method best exploits this setup?

Trade-off

Offline DPO shows good loss but degrades at eval time after 5 epochs. What is the most likely root cause?

Trade-off

Zephyr-7B (DPO on Mistral 7B) achieved MT-Bench 7.34 vs Llama-2-Chat-70B at 6.27. What does this benchmark result imply about DPO vs PPO training at 7B scale?

Derivation

Standard DPO often produces longer outputs even when the preferred example is shorter. SimPO fixes this with length normalization. What is the root cause in DPO?

📚

Further Reading

🎯

Interview Questions


Why is DPO simpler than PPO-based RLHF? What does it eliminate?

★★☆
Anthropic · Meta

When would you choose PPO-based RLHF over DPO? When would you choose DPO?

★★★
OpenAI · Meta

What are the advantages of GRPO over both PPO and DPO?

★★★
OpenAI · Meta

What is Constitutional AI (RLAIF) and why does it matter for scaling alignment?

★★☆
Anthropic

What does beta control in DPO, and what happens if you set it too high or too low?

★★☆
Anthropic · Meta

Compare offline DPO vs online DPO. What is the distribution mismatch problem?

★★★
Google · Meta

What is the offline-to-online gap in DPO? When does DPO underperform PPO?

★★★
Meta · Google

How does iterative DPO / online DPO address DPO's weaknesses?

★★★
Google · Meta