🎮 RL Foundations
REINFORCE was invented in 1992. 30 years later it’s training the world’s most capable AI — with one fundamental addition: a baseline.
Reinforcement Learning is how we teach LLMs to optimize for human preferences. The model (agent) generates tokens (actions), receives a reward signal, and updates its policy to maximize future reward. PPO is the workhorse algorithm behind RLHF — and understanding it requires knowing the RL fundamentals it builds on.
The RL Loop
What you're seeing: The Markov Decision Process (MDP) — the formal framework underlying all RL. At each timestep the agent observes a state, takes an action, and the environment returns a new state plus a reward.
What to notice: The loop is cyclic — output from the environment becomes input to the agent. In RLHF every component maps directly onto LLM training.
The RL Loop
What you're seeing: The core RL loop applied to LLM training. The agent (LLM) generates a token, the environment (reward model) provides feedback, and the loop repeats.
Diagram: Agent (the LLM, policy π) ⇄ Environment (the reward model)
Concrete Example:
State: "Explain quantum computing in simple terms"
Action: LLM generates "Quantum computing uses qubits..."
Reward: Reward model scores 0.85 (helpful, clear)
Update: Policy gradient increases probability of similar responses
The Intuition
MDP (Markov Decision Process): The formal framework behind RL. An MDP has states (prompt + generated tokens), actions (next token from vocabulary), transitions (deterministic — append token), and rewards (from human/reward model at end of generation).
Policy (π): A mapping from states to a probability distribution over actions. For LLMs, the policy IS the model — π_θ(a_t | s_t) is the probability the LLM assigns to token a_t given context s_t.
REINFORCE: The simplest policy gradient algorithm. Generate a complete trajectory, compute total reward, then increase the probability of actions that led to high reward. Problem: high variance — the same prompt can get very different rewards depending on the random sampling.
PPO (Proximal Policy Optimization): Fixes REINFORCE's problems with two key ideas: (1) use advantages instead of raw returns (subtract a baseline to reduce variance), (2) clip the policy ratio to prevent destructively large updates. The "proximal" means "stay close to the current policy."
The LLM connection: Policy = LLM, Action = next token, State = prompt + tokens so far, Reward = human preference score (via reward model). RLHF trains the LLM-as-policy to maximize expected reward while staying close to the SFT reference policy (KL penalty).
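To make "Policy = LLM" concrete, here is a minimal sketch of reading π(next token | context) off a causal LM's logits. It assumes the Hugging Face `transformers` API; the `gpt2` checkpoint is only a stand-in, not anything prescribed above.

```python
# Minimal sketch: the LLM *is* the policy pi_theta.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

state = "Explain quantum computing in simple terms"        # state s_t: prompt + tokens so far
input_ids = tokenizer(state, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]             # scores for the next token
probs = torch.softmax(logits, dim=-1)                      # pi_theta(a | s): distribution over the vocabulary

# The "action" is sampling one token from this distribution
action = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(action[0]), probs[0, action].item())
```

During RLHF, the policy gradient nudges exactly these per-token probabilities up or down according to the reward.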
Reward Hacking and Goodhart's Law: The reward model is a proxy for human preferences, not the real thing. As soon as you optimize hard against a proxy, the policy finds inputs that score highly on the proxy while violating the underlying intent — this is Goodhart's Law applied to RL. Concrete examples from InstructGPT and later RLHF work: models learned to produce very long responses (reward models often correlate length with quality), to repeat the user's words back (surface-level similarity scores high), and to generate confident-sounding but factually wrong answers (the reward model can't verify facts). The KL penalty between the policy and the SFT reference model is the primary defense — it limits how far the policy can drift before the structural constraints of the SFT model dominate. In practice, labs also run multiple independent reward models and take the minimum score (conservative ensemble), periodically retrain the reward model on data from the current policy, and monitor proxy-real reward correlation throughout training.
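A minimal sketch of that primary defense, assuming per-token log-probabilities from the policy and from the frozen SFT reference are already available. The function name and the β value are illustrative, not any particular lab's settings.

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Combine the reward-model score with a per-token KL penalty.

    policy_logprobs, ref_logprobs: [T] log pi(a_t|s_t) under the policy / frozen SFT reference
    rm_score: scalar reward-model score for the whole completion
    beta: KL coefficient (illustrative value)
    """
    # Per-token KL estimate: log pi(a_t|s_t) - log pi_ref(a_t|s_t); detached because
    # these shaped rewards are optimization targets, not differentiated quantities.
    kl = (policy_logprobs - ref_logprobs).detach()
    rewards = -beta * kl                     # penalize drifting away from the SFT reference
    rewards[-1] = rewards[-1] + rm_score     # the reward-model score arrives at the final token
    return rewards
```

The larger β is, the tighter the policy is tethered to the reference model, trading raw reward for robustness against reward hacking.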
Quick check
REINFORCE generates a full trajectory and uses the total return R as the gradient signal. What single change in PPO most reduces gradient variance?
Why does PPO clip the objective?
Step-by-Step Derivation
Expected Return
The RL objective: find a policy that maximizes the expected discounted sum of rewards over a trajectory:

J(θ) = E_{τ ∼ π_θ} [ Σ_t γ^t r_t ]

For LLM training, γ = 1 (no discounting — all tokens matter equally) and reward is typically given only at the final token.
Gradient
The policy gradient theorem gives us a way to compute ∇_θ J(θ) without differentiating through the environment:

∇_θ J(θ) = E_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ]

Where G_t = Σ_{t′ ≥ t} γ^{t′−t} r_{t′} is the return from time step t. High return? Increase the probability of those actions. Low return? Decrease it. Simple but high variance.
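The step that makes this possible is the log-derivative trick; a sketch of the standard derivation:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

Here R(τ) is the total return of the trajectory. The environment's transition probabilities inside log p_θ(τ) do not depend on θ, so only the policy terms survive; replacing R(τ) with the reward-to-go G_t for each action gives the expression above.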
PPO Clipped Surrogate Objective
PPO constrains the policy update using a clipped probability ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t):

L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t ) ]

A_t is the advantage (how much better this action was vs. average). ε is typically 0.2. The min + clip ensures: if the advantage is positive, don't increase the ratio beyond 1 + ε. If negative, don't decrease beyond 1 − ε. Conservative updates.
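A worked example with ε = 0.2: suppose the advantage is positive and the new policy has already raised the action's probability by 50%.

```latex
A_t = +1, \qquad r_t(\theta) = 1.5: \qquad
\min\bigl(\,1.5 \times 1,\;\; \operatorname{clip}(1.5,\, 0.8,\, 1.2) \times 1\,\bigr)
  = \min(1.5,\ 1.2) = 1.2
```

The clipped branch wins, and it is constant in θ, so the gradient for this token is zero: once the ratio leaves [1 − ε, 1 + ε], PPO simply stops pushing in that direction.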
GAE (Generalized Advantage Estimation) computes advantages using TD errors δ_t = r_t + γ V(s_{t+1}) − V(s_t):

A_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

λ = 0 gives the 1-step TD advantage (low variance, high bias). λ = 1 gives the Monte Carlo advantage (high variance, low bias). Standard choice: λ = 0.95.
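No GAE code appears in the implementations below, so here is a minimal reference implementation in the same PyTorch style, under the usual assumptions: a critic supplies V(s_t) per token, and the value past the final token is treated as 0.

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: [T] per-step rewards (for RLHF, often 0 except the last step)
    values:  [T] value estimates V(s_t) from the critic
    Returns advantages [T]; lam=0 -> 1-step TD, lam=1 -> Monte Carlo.
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0      # no bootstrapping past the end
        delta = rewards[t] + gamma * next_value - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                       # exponentially weighted sum of deltas
        advantages[t] = gae
    return advantages
```

With lam=0 the loop reduces to a single TD error per step; with lam=1 it accumulates the full discounted sum of future deltas, matching the bias/variance trade-off described above.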
PyTorch: REINFORCE

import torch

def reinforce_loss(log_probs, rewards, gamma=1.0):
    """REINFORCE with baseline (mean return)."""
    # log_probs: [T] log-probabilities of chosen actions
    # rewards: [T] reward at each step (often 0 except last)
    # Compute returns (cumulative future reward)
    returns = torch.zeros_like(rewards)
    R = 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    # Subtract baseline to reduce variance
    baseline = returns.mean()
    advantages = returns - baseline
    # Policy gradient: -log_prob * advantage
    loss = -(log_probs * advantages).mean()
    return loss

PyTorch: PPO Clipped Loss
def ppo_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate objective."""
    # Probability ratio: pi_new / pi_old
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped ratio
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # PPO loss: min(ratio * A, clipped_ratio * A)
    loss = -torch.min(
        ratio * advantages,
        clipped_ratio * advantages,
    ).mean()
    return loss

PyTorch implementation
# REINFORCE with advantage (policy gradient for LLM fine-tuning)
import torch

def reinforce_step(model, prompt_ids, reward_fn):
    """Single REINFORCE update on a batch of prompts.

    reward_fn: scores each full generation (e.g. a reward model), returning [B] rewards.
    """
    # Sample completions from the current policy
    output = model.generate(prompt_ids, do_sample=True, max_new_tokens=200)
    gen_ids = output[:, prompt_ids.size(1):]  # generated tokens only
    # Score each completion (the reward only exists after generation finishes)
    rewards = reward_fn(output)  # [B]
    # Get log-probabilities for each generated token
    logits = model(output).logits[:, prompt_ids.size(1) - 1:-1]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, gen_ids.unsqueeze(-1)).squeeze(-1)
    # Advantage = reward - batch-mean baseline (reduces variance)
    advantages = rewards - rewards.mean()
    # Policy gradient loss: -E[log pi(a|s) * A]
    loss = -(token_log_probs.sum(-1) * advantages).mean()
    loss.backward()  # optimizer.step() is applied outside this function
    return loss.item()

Quick check
A practitioner switches GAE lambda from 0.95 to 0.0 in an LLM PPO run. What is the concrete change in the advantage estimate for token t?
Break It — See What Happens
Quick check
Training without a reward baseline means every action gets a positive gradient when all rewards > 0. Why does this cause slow, oscillating convergence rather than fast convergence?
Real-World Numbers
| System | Algorithm | Key Hyperparameters |
|---|---|---|
| InstructGPT | PPO | PPO clip ε = 0.2; other hyperparameters (lr, epochs, KL schedule) vary by run — see Appendix C of the paper for full sweep details |
| Llama-2 Chat | PPO + Rejection Sampling (iterated) | 2 reward models (helpfulness + safety); iterated RLHF rounds with rejection sampling between SFT and PPO stages; KL penalty applied (exact beta varies by stage) |
| DeepSeek-R1 | GRPO | No value network, group size G = 64, rule-based rewards for reasoning, multi-stage RL pipeline |
| Claude | Constitutional AI (RLAIF) | AI-generated preference pairs from constitutional principles, PPO-based optimization |
Quick check
DeepSeek-R1 uses GRPO with G=64 completions per prompt instead of PPO. For a prompt where all 64 completions score 0.9–0.95 (narrow reward spread), what happens to the advantage estimates?
Key Takeaways
What to remember for interviews
1. LLM training is an MDP: state = prompt + tokens so far, action = next token, reward = human preference score at the end of generation.
2. REINFORCE has high variance because it uses raw returns. PPO fixes this with advantages (reward minus baseline) and clips the policy ratio to prevent large updates.
3. GAE (lambda=0.95) trades off bias and variance: lambda=0 is low-variance TD, lambda=1 is high-variance Monte Carlo.
4. The KL penalty between the policy and the SFT reference model prevents reward hacking — the model exploiting the reward proxy rather than the underlying human preference.
5. GRPO (DeepSeek) drops the value network entirely, using the group of sampled outputs as a self-normalizing baseline (see the sketch after this list) — simpler and more stable than PPO for LLMs.
6. RLVR (2024–2025): rule-based verifiers for math/code replace the neural reward model entirely. Eliminates reward hacking but only works in domains with verifiable answers.
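A minimal sketch of the group-normalized baseline mentioned in point 5. Treat this as illustrative rather than DeepSeek's exact code; the eps term only guards against division by zero.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages for G completions of the same prompt.

    group_rewards: [G] scalar rewards, one per sampled completion.
    The group mean acts as the baseline, so no value network is needed.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

When the group's rewards are all nearly identical (the quick-check scenario above), the numerator shrinks toward zero and the normalized advantages carry mostly noise, which is the trade-off GRPO accepts for dropping the critic.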
SOTA 2024–2025: RLVR
RLVR — Reinforcement Learning with Verifiable Rewards (2024–2025): Instead of training a neural reward model on human preferences, RLVR uses rule-based verifiers — exact match checkers, unit-test runners, or formal proof validators — as the reward signal. Used in DeepSeek-R1 (GRPO with rule-based math/code verifiers) and Qwen2.5. Eliminates reward hacking by making the reward non-gameable, but is limited to domains where correctness is objectively verifiable (per arxiv:2501.12599, 2025-01).
Contrast with standard RLHF: the neural reward model can be hacked (reward hacking, specification gaming), requires expensive human annotation, and can overfit to annotation artifacts. RLVR avoids all three — at the cost of being restricted to verifiable tasks. See also the DPO module for GRPO details and the Verifier / PRM module for how learned verifiers extend coverage to open-ended domains.
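A toy example of what "verifiable reward" means in practice: an exact-match checker for a math answer. The `<answer>` tag format is made up for illustration and is not taken from any of the papers above.

```python
import re

def math_verifier_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches exactly, else 0.0.

    Assumes completions end with an answer wrapped as <answer>...</answer>
    (an illustrative convention, not from a specific paper).
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is a deterministic rule rather than a learned model, there is no proxy to game; the cost is that only tasks with a checkable gold answer can be trained this way.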
Recap quiz
LLM token generation is modeled as an MDP. What makes credit assignment harder here than in a game like Chess?
PPO clips the probability ratio at epsilon=0.2. Without this clip, what specific failure mode emerges during LLM training?
GAE computes advantages using lambda. A practitioner sets lambda=0 for one run and lambda=1 for another. What is the concrete difference in gradient quality?
A team trains an LLM with RLHF and finds responses drift toward very long, confident-sounding but factually wrong answers — the reward model scores these highly. What is the primary architectural defense?
PPO requires 4 models in GPU memory (policy, reference, reward, value). GRPO drops the value model and uses group size G=64. What is the memory saving and what does GRPO trade for it?
An LLM trainer sets the learning rate 10× too high for a PPO run. Even with epsilon=0.2 clipping, what failure mode remains?
Further Reading
- Proximal Policy Optimization Algorithms (Schulman et al. 2017) — The PPO paper — clipped surrogate objective that became the default RL algorithm for RLHF.
- Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.) — The definitive RL textbook covering MDPs, policy gradients, temporal-difference learning, and more.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al. 2016 — introduces GAE, the variance-reduction technique that PPO relies on for stable RLHF training.
- OpenAI Spinning Up in Deep RL — Practical deep RL resource covering policy gradients, PPO, and SAC with working implementations — the best entry point before diving into RLHF.
- Lilian Weng — Policy Gradient Algorithms — Comprehensive derivation of REINFORCE, Actor-Critic, and PPO — essential prerequisite for understanding the RLHF training loop.
Interview Questions
Formulate language generation as an MDP. What are the states, actions, transitions, and rewards? ★★☆
Compare REINFORCE and PPO. Why is vanilla REINFORCE impractical for LLM training? ★★☆
Explain advantage estimation. Why use GAE (Generalized Advantage Estimation)? ★★★
What is the role of the KL penalty in RLHF, and how does it relate to PPO's clipping? ★★☆
What is the value function's role in PPO for RLHF, and why did GRPO remove it? ★★★
Why is RL training for LLMs unstable, and what techniques stabilize it? ★★★