🎮 RL Foundations
REINFORCE was invented in 1992. 30 years later it’s training the world’s most capable AI — with one fundamental addition: a baseline.
Reinforcement Learning is how we teach LLMs to optimize for human preferences. The model (agent) generates tokens (actions), receives a reward signal, and updates its policy to maximize future reward. PPO is the workhorse algorithm behind RLHF — and understanding it requires knowing the RL fundamentals it builds on.
The RL Loop
What you're seeing: The Markov Decision Process (MDP) — the formal framework underlying all RL. At each timestep the agent observes a state, takes an action, and the environment returns a new state plus a reward.
What to notice: The loop is cyclic — output from the environment becomes input to the agent. In RLHF every component maps directly onto LLM training.
The RL Loop
What you're seeing: The core RL loop applied to LLM training. The agent (LLM) generates a token, the environment (reward model) provides feedback, and the loop repeats.
Diagram: Agent (the LLM, policy π) ⇄ Environment (the reward model)
Concrete Example:
State: "Explain quantum computing in simple terms"
Action: LLM generates "Quantum computing uses qubits..."
Reward: Reward model scores 0.85 (helpful, clear)
Update: Policy gradient increases probability of similar responses
The Intuition
MDP (Markov Decision Process): The formal framework behind RL. An MDP has states (prompt + generated tokens), actions (next token from vocabulary), transitions (deterministic — append token), and rewards (from human/reward model at end of generation).
Policy (π): A mapping from states to a probability distribution over actions. For LLMs, the policy IS the model — π_θ(a_t | s_t) is the probability the LLM assigns to token a_t given context s_t.
REINFORCE: The simplest policy gradient algorithm. Generate a complete trajectory, compute total reward, then increase the probability of actions that led to high reward. Problem: high variance — the same prompt can get very different rewards depending on the random sampling.
PPO (Proximal Policy Optimization): Fixes REINFORCE's problems with two key ideas: (1) use advantages instead of raw returns (subtract a baseline to reduce variance), (2) clip the policy ratio to prevent destructively large updates. The "proximal" means "stay close to the current policy."
The LLM connection: Policy = LLM, Action = next token, State = prompt + tokens so far, Reward = human preference score (via reward model). RLHF trains the LLM-as-policy to maximize expected reward while staying close to the SFT reference policy (KL penalty).
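To make "Policy = LLM" concrete, here is a minimal sketch of reading π(next token | context) off a causal LM's logits. It assumes the Hugging Face `transformers` API; the `gpt2` checkpoint is only a stand-in, not anything prescribed above.

```python
# Minimal sketch: the LLM *is* the policy pi_theta.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

state = "Explain quantum computing in simple terms"        # state s_t: prompt + tokens so far
input_ids = tokenizer(state, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]             # scores for the next token
probs = torch.softmax(logits, dim=-1)                      # pi_theta(a | s): distribution over the vocabulary

# The "action" is sampling one token from this distribution
action = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(action[0]), probs[0, action].item())
```

During RLHF, the policy gradient nudges exactly these per-token probabilities up or down according to the reward.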
Reward Hacking and Goodhart's Law: The reward model is a proxy for human preferences, not the real thing. As soon as you optimize hard against a proxy, the policy finds inputs that score highly on the proxy while violating the underlying intent — this is Goodhart's Law applied to RL. Concrete examples from InstructGPT and later RLHF work: models learned to produce very long responses (reward models often correlate length with quality), to repeat the user's words back (surface-level similarity scores high), and to generate confident-sounding but factually wrong answers (the reward model can't verify facts). The KL penalty between the policy and the SFT reference model is the primary defense — it limits how far the policy can drift before the structural constraints of the SFT model dominate. In practice, labs also run multiple independent reward models and take the minimum score (conservative ensemble), periodically retrain the reward model on data from the current policy, and monitor proxy-real reward correlation throughout training.
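A minimal sketch of that primary defense, assuming per-token log-probabilities from the policy and from the frozen SFT reference are already available. The function name and the β value are illustrative, not any particular lab's settings.

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Combine the reward-model score with a per-token KL penalty.

    policy_logprobs, ref_logprobs: [T] log pi(a_t|s_t) under the policy / frozen SFT reference
    rm_score: scalar reward-model score for the whole completion
    beta: KL coefficient (illustrative value)
    """
    # Per-token KL estimate: log pi(a_t|s_t) - log pi_ref(a_t|s_t); detached because
    # these shaped rewards are optimization targets, not differentiated quantities.
    kl = (policy_logprobs - ref_logprobs).detach()
    rewards = -beta * kl                     # penalize drifting away from the SFT reference
    rewards[-1] = rewards[-1] + rm_score     # the reward-model score arrives at the final token
    return rewards
```

The larger β is, the tighter the policy is tethered to the reference model, trading raw reward for robustness against reward hacking.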
Quick check
REINFORCE generates a full trajectory and uses the total return R as the gradient signal. What single change in PPO most reduces gradient variance?
Why does PPO clip the objective?
Step-by-Step Derivation
Expected Return
The RL objective: find a policy that maximizes the expected discounted sum of rewards over a trajectory:

J(θ) = E_{τ ∼ π_θ} [ Σ_t γ^t r_t ]

For LLM training, γ = 1 (no discounting — all tokens matter equally) and reward is typically given only at the final token.
Gradient
The policy gradient theorem gives us a way to compute ∇_θ J(θ) without differentiating through the environment:

∇_θ J(θ) = E_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ]

Where G_t = Σ_{t′ ≥ t} γ^{t′−t} r_{t′} is the return from time step t. High return? Increase the probability of those actions. Low return? Decrease it. Simple but high variance.
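The step that makes this possible is the log-derivative trick; a sketch of the standard derivation:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, \mathrm{d}\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

Here R(τ) is the total return of the trajectory. The environment's transition probabilities inside log p_θ(τ) do not depend on θ, so only the policy terms survive; replacing R(τ) with the reward-to-go G_t for each action gives the expression above.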
PPO Clipped Surrogate Objective
PPO constrains the policy update using a clipped probability ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t):

L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t ) ]

A_t is the advantage (how much better this action was vs. average). ε is typically 0.2. The min + clip ensures: if the advantage is positive, don't increase the ratio beyond 1 + ε. If negative, don't decrease beyond 1 − ε. Conservative updates.
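A worked example with ε = 0.2: suppose the advantage is positive and the new policy has already raised the action's probability by 50%.

```latex
A_t = +1, \qquad r_t(\theta) = 1.5: \qquad
\min\bigl(\,1.5 \times 1,\;\; \operatorname{clip}(1.5,\, 0.8,\, 1.2) \times 1\,\bigr)
  = \min(1.5,\ 1.2) = 1.2
```

The clipped branch wins, and it is constant in θ, so the gradient for this token is zero: once the ratio leaves [1 − ε, 1 + ε], PPO simply stops pushing in that direction.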
GAE (Generalized Advantage Estimation) computes advantages using TD errors δ_t = r_t + γ V(s_{t+1}) − V(s_t):

A_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

λ = 0 gives the 1-step TD advantage (low variance, high bias). λ = 1 gives the Monte Carlo advantage (high variance, low bias). Standard choice: λ = 0.95.
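No GAE code appears in the implementations below, so here is a minimal reference implementation in the same PyTorch style, under the usual assumptions: a critic supplies V(s_t) per token, and the value past the final token is treated as 0.

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: [T] per-step rewards (for RLHF, often 0 except the last step)
    values:  [T] value estimates V(s_t) from the critic
    Returns advantages [T]; lam=0 -> 1-step TD, lam=1 -> Monte Carlo.
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0      # no bootstrapping past the end
        delta = rewards[t] + gamma * next_value - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                       # exponentially weighted sum of deltas
        advantages[t] = gae
    return advantages
```

With lam=0 the loop reduces to a single TD error per step; with lam=1 it accumulates the full discounted sum of future deltas, matching the bias/variance trade-off described above.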
PyTorch: REINFORCE

import torch

def reinforce_loss(log_probs, rewards, gamma=1.0):
    """REINFORCE with baseline (mean return)."""
    # log_probs: [T] log-probabilities of chosen actions
    # rewards: [T] reward at each step (often 0 except last)
    # Compute returns (cumulative future reward)
    returns = torch.zeros_like(rewards)
    R = 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    # Subtract baseline to reduce variance
    baseline = returns.mean()
    advantages = returns - baseline
    # Policy gradient: -log_prob * advantage
    loss = -(log_probs * advantages).mean()
    return loss

PyTorch: PPO Clipped Loss
def ppo_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate objective."""
    # Probability ratio: pi_new / pi_old
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped ratio
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # PPO loss: min(ratio * A, clipped_ratio * A)
    loss = -torch.min(
        ratio * advantages,
        clipped_ratio * advantages,
    ).mean()
    return loss

PyTorch implementation
# REINFORCE with advantage (policy gradient for LLM fine-tuning)
import torch

def reinforce_step(model, prompt_ids, reward_fn):
    """Single REINFORCE update on a batch of prompts.

    reward_fn: scores each full generation (e.g. a reward model), returning [B] rewards.
    """
    # Sample completions from the current policy
    output = model.generate(prompt_ids, do_sample=True, max_new_tokens=200)
    gen_ids = output[:, prompt_ids.size(1):]  # generated tokens only
    # Score each completion (the reward only exists after generation finishes)
    rewards = reward_fn(output)  # [B]
    # Get log-probabilities for each generated token
    logits = model(output).logits[:, prompt_ids.size(1) - 1:-1]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, gen_ids.unsqueeze(-1)).squeeze(-1)
    # Advantage = reward - batch-mean baseline (reduces variance)
    advantages = rewards - rewards.mean()
    # Policy gradient loss: -E[log pi(a|s) * A]
    loss = -(token_log_probs.sum(-1) * advantages).mean()
    loss.backward()  # optimizer.step() is applied outside this function
    return loss.item()

Quick check
A practitioner switches GAE lambda from 0.95 to 0.0 in an LLM PPO run. What is the concrete change in the advantage estimate for token t?
Break It — See What Happens
Quick check
Training without a reward baseline means every action gets a positive gradient when all rewards > 0. Why does this cause slow, oscillating convergence rather than fast convergence?
Real-World Numbers
| System | Algorithm | Key Hyperparameters |
|---|---|---|
| InstructGPT | PPO | PPO clip ε = 0.2; other hyperparameters (lr, epochs, KL schedule) vary by run — see Appendix C of the paper for full sweep details |
| Llama-2 Chat | PPO + Rejection Sampling (iterated) | 2 reward models (helpfulness + safety); iterated RLHF rounds with rejection sampling between SFT and PPO stages; KL penalty applied (exact beta varies by stage) |
| DeepSeek-R1 | GRPO | No value network, group size G = 64, rule-based rewards for reasoning, multi-stage RL pipeline |
| Claude | Constitutional AI (RLAIF) | AI-generated preference pairs from constitutional principles, PPO-based optimization |
Quick check
DeepSeek-R1 uses GRPO with G=64 completions per prompt instead of PPO. For a prompt where all 64 completions score 0.9–0.95 (narrow reward spread), what happens to the advantage estimates?
Key Takeaways
What to remember for interviews
1. LLM training is an MDP: state = prompt + tokens so far, action = next token, reward = human preference score at the end of generation.
2. REINFORCE has high variance because it uses raw returns. PPO fixes this with advantages (reward minus baseline) and clips the policy ratio to prevent large updates.
3. GAE (lambda=0.95) trades off bias and variance: lambda=0 is low-variance TD, lambda=1 is high-variance Monte Carlo.
4. The KL penalty between the policy and the SFT reference model prevents reward hacking — the model exploiting the reward proxy rather than the underlying human preference.
5. GRPO (DeepSeek) drops the value network entirely, using the group of sampled outputs as a self-normalizing baseline (see the sketch after this list) — simpler and more stable than PPO for LLMs.
6. RLVR (2024–2025): rule-based verifiers for math/code replace the neural reward model entirely. Eliminates reward hacking but only works in domains with verifiable answers.
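A minimal sketch of the group-normalized baseline mentioned in point 5. Treat this as illustrative rather than DeepSeek's exact code; the eps term only guards against division by zero.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages for G completions of the same prompt.

    group_rewards: [G] scalar rewards, one per sampled completion.
    The group mean acts as the baseline, so no value network is needed.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

When the group's rewards are all nearly identical (the quick-check scenario above), the numerator shrinks toward zero and the normalized advantages carry mostly noise, which is the trade-off GRPO accepts for dropping the critic.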
SOTA 2024–2025: RLVR
RLVR — Reinforcement Learning with Verifiable Rewards (2024–2025): Instead of training a neural reward model on human preferences, RLVR uses rule-based verifiers — exact match checkers, unit-test runners, or formal proof validators — as the reward signal. Used in DeepSeek-R1 (GRPO with rule-based math/code verifiers) and Qwen2.5. Eliminates reward hacking by making the reward non-gameable, but is limited to domains where correctness is objectively verifiable (per arxiv:2501.12599, 2025-01).
Contrast with standard RLHF: the neural reward model can be hacked (reward hacking, specification gaming), requires expensive human annotation, and can overfit to annotation artifacts. RLVR avoids all three — at the cost of being restricted to verifiable tasks. See also the DPO module for GRPO details and the Verifier / PRM module for how learned verifiers extend coverage to open-ended domains.
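A toy example of what "verifiable reward" means in practice: an exact-match checker for a math answer. The `<answer>` tag format is made up for illustration and is not taken from any of the papers above.

```python
import re

def math_verifier_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches exactly, else 0.0.

    Assumes completions end with an answer wrapped as <answer>...</answer>
    (an illustrative convention, not from a specific paper).
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is a deterministic rule rather than a learned model, there is no proxy to game; the cost is that only tasks with a checkable gold answer can be trained this way.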
Recap quiz
LLM token generation is modeled as an MDP. What makes credit assignment harder here than in a game like Chess?
PPO clips the probability ratio at epsilon=0.2. Without this clip, what specific failure mode emerges during LLM training?
GAE computes advantages using lambda. A practitioner sets lambda=0 for one run and lambda=1 for another. What is the concrete difference in gradient quality?
A team trains an LLM with RLHF and finds responses drift toward very long, confident-sounding but factually wrong answers — the reward model scores these highly. What is the primary architectural defense?
PPO requires 4 models in GPU memory (policy, reference, reward, value). GRPO drops the value model and uses group size G=64. What is the memory saving and what does GRPO trade for it?
An LLM trainer sets the learning rate 10× too high for a PPO run. Even with epsilon=0.2 clipping, what failure mode remains?
Further Reading
- Proximal Policy Optimization Algorithms (Schulman et al. 2017) — The PPO paper — clipped surrogate objective that became the default RL algorithm for RLHF.
- Reinforcement Learning: An Introduction (Sutton & Barto, 2nd ed.) — The definitive RL textbook covering MDPs, policy gradients, temporal-difference learning, and more.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al. 2016 — introduces GAE, the variance-reduction technique that PPO relies on for stable RLHF training.
- OpenAI Spinning Up in Deep RL — Practical deep RL resource covering policy gradients, PPO, and SAC with working implementations — the best entry point before diving into RLHF.
- Lilian Weng — Policy Gradient Algorithms — Comprehensive derivation of REINFORCE, Actor-Critic, and PPO — essential prerequisite for understanding the RLHF training loop.
Interview Questions
Formulate language generation as an MDP. What are the states, actions, transitions, and rewards? ★★☆
Compare REINFORCE and PPO. Why is vanilla REINFORCE impractical for LLM training? ★★☆
Explain advantage estimation. Why use GAE (Generalized Advantage Estimation)? ★★★
What is the role of the KL penalty in RLHF, and how does it relate to PPO's clipping? ★★☆
What is the value function's role in PPO for RLHF, and why did GRPO remove it? ★★★
Why is RL training for LLMs unstable, and what techniques stabilize it? ★★★