⚖️ DPO, GRPO & Alternatives
DPO skips the reward model and PPO entirely — yet matches RLHF quality. Here's the math that makes 2 stages collapse into 1.
RLHF works, but it requires training a reward model and running PPO — two complex, unstable steps. DPO collapses both into a single loss function. GRPO takes a different shortcut: keep the reward model, drop the critic. This module covers the evolution from PPO to DPO to GRPO — simpler alignment, same goal.
DPO vs RLHF Pipeline
What you’re seeing: RLHF’s four-model pipeline (supervised fine-tune → reward model → PPO policy + critic) collapsed into DPO’s single offline loss — same KL-constrained preference objective, no reward model or rollout sampler needed. What to try: Follow the arrow from the preference dataset: in RLHF it feeds a separate reward model; in DPO it feeds the loss directly — trace which boxes disappear.
RLHF vs DPO Pipeline
Side-by-side comparison:
RLHF (PPO) — 4 Models
Policy Model
being optimized
Reference Model
frozen SFT checkpoint
Reward Model
scores outputs
Value Model (Critic)
estimates advantages
DPO — 2 Models
Policy Model
being optimized
Reference Model
frozen SFT checkpoint
No Reward Model
implicit in log-prob ratios
No Critic
no advantage estimation needed
Evolution of Alignment Methods
PPO
2022
4 models, complex
DPO
2023
2 models, offline
GRPO
2024
no critic, online
The Intuition
The DPO insight: In RLHF, the optimal policy under KL-constrained reward maximization has a closed-form solution. The reward can be expressed as a function of the policy's log-probabilities and the reference model's log-probabilities. DPO inverts this — instead of learning a reward then optimizing, it directly optimizes the policy using preference pairs.
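Written out, the KL-constrained objective $\max_\pi \; \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta\, \mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\big)$ is maximized by

$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y|x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

Solving this for $r(x, y)$ expresses the reward purely through policy and reference log-probabilities, which is the substitution DPO bakes into its loss (see the implicit-reward equation below).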
GRPO (DeepSeek): Group Relative Policy Optimization takes a different approach. Instead of eliminating the reward signal (like DPO), it eliminates the critic/value network. For each prompt, sample K outputs, score them all with a reward model or rule-based verifiable rewards (as in DeepSeek-R1), and compute advantages relative to the group mean. No value network needed — the group itself provides the baseline.
Constitutional AI (RLAIF): Instead of human labelers ranking outputs, an AI critiques its own outputs against a set of explicit principles — the "constitution." This produces preference data at scale without human annotation. Anthropic uses this for Claude's alignment.
SimPO takes simplification one step further by also removing the reference model. DPO's implicit reward uses log-probability ratios against a frozen reference: $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$. SimPO replaces this with a reference-free reward based on the sequence-average log-probability of the policy itself:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$$

Dividing by the sequence length $|y|$ corrects the length bias that plagues standard DPO (longer outputs accumulate more log-prob mass). The margin $\gamma$ enforces a minimum gap between preferred and dispreferred scores. SimPO trains with only one model in memory — no reference forward pass — which reduces memory and makes training roughly 20% faster than DPO (Meng et al., 2024).
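As a rough sketch, the SimPO objective can be computed directly from summed policy log-probabilities. The helper below is illustrative only; the argument names and default beta/gamma values are assumptions, not the paper's tuned settings.

import torch.nn.functional as F

def simpo_loss(pi_logps_w, pi_logps_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO-style reference-free, length-normalized preference loss (sketch).
    pi_logps_*: (batch,) log-probs summed over response tokens, policy only.
    len_*: (batch,) response token counts for sequence-average normalization.
    """
    reward_w = beta * pi_logps_w / len_w  # sequence-average log-prob, preferred
    reward_l = beta * pi_logps_l / len_l  # sequence-average log-prob, dispreferred
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()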
Why is DPO simpler than PPO-based RLHF?
Step-by-Step Derivation
DPO Loss (The Key Equation)
Given preferred output $y_w$ and dispreferred output $y_l$ for prompt $x$, DPO minimizes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
The term $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ is the implicit reward. DPO increases this ratio for preferred outputs and decreases it for dispreferred outputs — no explicit reward model needed.
GRPO Advantage Estimation
For each prompt, sample $K$ outputs $y_1, \dots, y_K$. Score each with the reward model to get $r_1, \dots, r_K$, then compute group-relative advantages:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_K)}{\operatorname{std}(r_1, \dots, r_K)}$$
No critic network. The group of samples provides the baseline. Normalization across the group handles reward scale differences between prompts automatically.
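A minimal sketch of this computation, assuming a (K,) tensor of reward-model or verifier scores for one prompt's samples (the epsilon guards against zero variance when all rewards tie):

import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: baseline = group mean, scale = group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled outputs for the same prompt
rewards = torch.tensor([0.2, 0.9, 0.4, 0.5])
print(grpo_advantages(rewards))  # above-average samples get positive advantages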
Implicit Reward (DPO vs Explicit)
The closed-form relationship between the optimal policy and the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$
The partition function $Z(x)$ cancels in the preference comparison — which is why DPO never needs to compute it. This cancellation is the mathematical trick that makes DPO work.
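Written out for a preference pair, the comparison depends only on the reward difference; the $\beta \log Z(x)$ term appears in both rewards and subtracts away:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$$

Plugging this difference into the Bradley–Terry preference probability $\sigma\big(r(x, y_w) - r(x, y_l)\big)$ recovers exactly the DPO loss above, with no $Z(x)$ left to estimate.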
PyTorch: DPO Loss in 10 Lines
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Direct Preference Optimization loss.
    All inputs: (batch,) log-probabilities summed over tokens.
    """
    # Implicit rewards: log-ratio of policy vs reference
    reward_w = pi_logps_w - ref_logps_w  # preferred
    reward_l = pi_logps_l - ref_logps_l  # dispreferred
    # DPO loss = -log sigmoid(beta * (preferred_reward - dispreferred_reward))
    logits = beta * (reward_w - reward_l)
    return -F.logsigmoid(logits).mean()

PyTorch implementation
# DPO training loop — no reward model, no PPO
import torch, torch.nn.functional as F

def compute_log_probs(model, input_ids, labels):
    """Sum log-probs over assistant tokens (labels != -100)."""
    logits = model(input_ids).logits[:, :-1]  # (B, T-1, V)
    shift_labels = labels[:, 1:]  # (B, T-1)
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, shift_labels.clamp(0).unsqueeze(-1)).squeeze(-1)
    mask = (shift_labels != -100).float()
    return (token_lp * mask).sum(-1)  # (B,)

def dpo_loss(policy, ref_policy, batch, beta=0.1):
    """batch contains: chosen_ids, rejected_ids, chosen_labels, rejected_labels."""
    pi_w = compute_log_probs(policy, batch["chosen_ids"], batch["chosen_labels"])
    pi_l = compute_log_probs(policy, batch["rejected_ids"], batch["rejected_labels"])
    ref_w = compute_log_probs(ref_policy, batch["chosen_ids"], batch["chosen_labels"])
    ref_l = compute_log_probs(ref_policy, batch["rejected_ids"], batch["rejected_labels"])
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

Quick check
In the DPO loss, beta multiplies the log-ratio difference before the sigmoid. If beta is doubled while keeping all else equal, what happens to the gradient magnitude?
Break It — See What Happens
Quick check
If the same model instance (shared weights) is used for both pi_theta and pi_ref in DPO, what happens to the gradient?
Real-World Numbers
| Model | Method | Details |
|---|---|---|
| Zephyr-7B | DPO | MT-Bench 7.34 with DPO on a Mistral-7B base, vs 6.27 for Llama-2-Chat-70B. |
| DeepSeek-V2 | GRPO | 236B MoE model. |
| Llama-2 Chat | PPO (RLHF) | Training required 4 models simultaneously in memory. |
| InstructGPT | PPO (RLHF) | OpenAI, 2022. |
| Claude | Constitutional AI | RLAIF — AI generates preference data from principles. Scales to millions of preference pairs without human labelers. |
| Tulu-2 | DPO | Allen AI. DPO training on 70B was substantially cheaper than PPO-style RLHF — no reward model or value network inference during training. |
Quick check
PPO-based RLHF requires 4 models in GPU memory simultaneously. DPO requires 2. For a 7B parameter model in BF16 (2 bytes/param, roughly 14 GB per model), what is the approximate peak VRAM difference?
SOTA 2024–2025: DPO Successors
DPO spawned a family of reference-model-free and SFT-stage-free algorithms. Each removes another component from the RLHF pipeline while matching or beating DPO on win-rate benchmarks.
DPO Successor Landscape (2024–2025)
| Algorithm | Reward model? | Reference model? | SFT stage? | Memory cost |
|---|---|---|---|---|
| DPO | No | Yes (frozen) | Yes | 2 models |
| SimPO (May 2024) | No | No | Yes | 1 model (~20% faster) |
| ORPO (2024) | No | No | No (merged) | 1 model |
| KTO (2024) | No | Yes (frozen) | Yes | 2 models |
| GRPO (DeepSeek) | Yes (rule-based) | Optional | Yes | No critic; lower than PPO |
| DAPO (ByteDance, 2025) | Yes (rule-based) | Optional | Yes | GRPO + dynamic sampling |
SimPO — Simple Preference Optimization (May 2024)
Removes the reference model by using sequence-average log-probability as a reference-free reward, and adds a length-normalization term and a margin $\gamma$. Result: +6.4 points over DPO on AlpacaEval 2 (per arxiv:2405.14734, 2024-05).
ORPO — Odds-Ratio Preference Optimization (2024)
Fuses SFT and preference optimization into a single loss by adding an odds-ratio penalty on dispreferred outputs — no reference model, no separate SFT stage. It still uses chosen/rejected preference pairs, but keeps only the policy in memory (per arxiv:2403.07691, 2024-03).
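A rough sketch of the odds-ratio penalty, assuming length-normalized sequence probabilities under the policy; the function name, argument layout, and the lambda default are illustrative, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def orpo_loss(logps_w, logps_l, len_w, len_l, nll_w, lam=0.1):
    """ORPO-style loss sketch: SFT NLL on the chosen response plus an odds-ratio penalty.
    logps_*: (batch,) summed log-probs of chosen/rejected under the policy (no reference model).
    len_*: (batch,) response lengths; nll_w: (batch,) mean token NLL of the chosen response.
    """
    logp_w = logps_w / len_w  # length-normalized log-probability of chosen
    logp_l = logps_l / len_l  # length-normalized log-probability of rejected
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))  # log odds = log p - log(1 - p)
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))
    or_penalty = -F.logsigmoid(log_odds_w - log_odds_l)  # push chosen odds above rejected
    return (nll_w + lam * or_penalty).mean()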
KTO — Kahneman-Tversky Optimization (2024)
Frames preference learning as prospect theory: losses loom larger than gains. Trains on unpaired binary feedback (thumbs-up / thumbs-down) rather than preference pairs — dramatically cheaper to collect. Matches DPO quality at scale without paired comparisons.
GRPO — Group Relative Policy Optimization (DeepSeek-R1, 2025)
Samples $K$ outputs per prompt (16 per question in DeepSeek-R1), scores each with a rule-based verifier, and uses the group mean/std as the baseline — no critic network. DeepSeek-R1 training hyperparameters: lr=3e-6, KL coefficient=0.001, batch=512 (per arxiv:2501.12948, 2025-01). Memory is substantially lower than PPO because the value network is eliminated entirely.
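For checkable tasks, the rule-based reward itself can be a few lines. Below is a hypothetical verifier that compares a boxed final answer against ground truth; the answer format and function name are assumptions for illustration, not DeepSeek's actual implementation.

import re

def verifiable_reward(completion, ground_truth):
    r"""Hypothetical rule-based reward: 1.0 if the final \boxed{...} answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

# Score one group of samples for the same prompt, then normalize with the GRPO advantage code above
samples = ["... so the answer is \\boxed{42}", "... giving \\boxed{41}"]
rewards = [verifiable_reward(s, "42") for s in samples]  # [1.0, 0.0]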
DAPO — Dynamic Sampling Policy Optimization (ByteDance, 2025)
Extends GRPO with dynamic token-level sampling to stabilize training at scale. Removes samples where all outputs have the same reward (no gradient signal) and clips token-level loss rather than just the policy ratio. Addresses entropy collapse and reward hacking at long-horizon reasoning scale (per arxiv:2503.14476, 2025-03).
Key Takeaways
What to remember for interviews
1. DPO collapses PPO's reward model + RL optimizer into a single binary cross-entropy loss over preference pairs — reducing from 4 models to 2.
2. The implicit reward is the log-probability ratio log(π_θ / π_ref). DPO increases this ratio for preferred outputs and decreases it for dispreferred ones.
3. Beta controls KL divergence from the reference model. Too low: mode collapse and noisy gradients. Too high: training is too conservative. Typical range: 0.1–0.5.
4. GRPO eliminates the critic network by using group-relative advantages: sample K outputs per prompt, normalize rewards by group mean and std.
5. SimPO removes the reference model entirely, using sequence-average log-probability as a reference-free reward with a length-normalization and margin term.
6. SOTA 2024–2025: SimPO +6.4 AlpacaEval 2 over DPO; ORPO fuses SFT + preference into one pass; KTO trains on binary feedback (no pairs); GRPO (DeepSeek-R1) uses rule-based verifiers with 16 samples/question; DAPO (ByteDance) stabilizes GRPO at scale via dynamic token sampling.
Recap quiz
DPO is derived from the same KL-constrained objective as PPO-based RLHF. What mathematical property lets DPO skip the reward model entirely?
A team sets DPO beta=0.01 hoping for aggressive preference learning. What is the most likely failure mode?
What does the quantity beta * log(pi_theta(y|x) / pi_ref(y|x)) represent in DPO?
A team is building a code-generation assistant and has 10K preference pairs but limited human labelers. They can generate unlimited test cases programmatically. Which method best exploits this setup?
Offline DPO shows good loss but degrades at eval time after 5 epochs. What is the most likely root cause?
Zephyr-7B (DPO on Mistral 7B) achieved MT-Bench 7.34 vs Llama-2-Chat-70B at 6.27. What does this benchmark result imply about DPO vs PPO training at 7B scale?
Standard DPO often produces longer outputs even when the preferred example is shorter. SimPO fixes this with length normalization. What is the root cause in DPO?
Further Reading
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model — The DPO paper — eliminates the reward model by directly optimizing policy from preference pairs.
- SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al. 2024 — removes the reference model from DPO entirely, using sequence-average log-probability as the implicit reward.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek's GRPO approach — group relative policy optimization without a critic model.
- Constitutional AI: Harmlessness from AI Feedback — Anthropic's RLAIF method generating preference data from AI feedback instead of human annotators.
- Lilian Weng's Blog — In-depth posts on preference learning, RLHF variants, and alignment techniques including DPO analysis.
- A General Theoretical Paradigm to Understand Learning from Human Feedback (IPO) — Azar et al. 2024 — introduces IPO to address DPO's implicit assumption about the data generation process, fixing overfitting on deterministic preferences.
- ORPO: Monolithic Preference Optimization without Reference Model — Hong et al. 2024 — combines SFT and preference optimization in a single pass using an odds ratio penalty, eliminating the need for a reference model entirely.
- Zephyr: Direct Distillation of LM Alignment — Tunstall et al. 2023 — the first widely-adopted DPO model (Mistral-7B base), with a concrete recipe for distilled preference data generation.
Interview Questions
Why is DPO simpler than PPO-based RLHF? What does it eliminate? ★★☆
When would you choose PPO-based RLHF over DPO? When would you choose DPO? ★★★
What are the advantages of GRPO over both PPO and DPO? ★★★
What is Constitutional AI (RLAIF) and why does it matter for scaling alignment? ★★☆
What does beta control in DPO, and what happens if you set it too high or too low? ★★☆
Compare offline DPO vs online DPO. What is the distribution mismatch problem? ★★★
What is the offline-to-online gap in DPO? When does DPO underperform PPO? ★★★
How does iterative DPO / online DPO address DPO's weaknesses? ★★★