🐛 PyTorch Debugging
optimizer.zero_grad() is missing — can you spot it in 5 minutes?
Every ML engineer has spent hours debugging silent PyTorch bugs — models that train without errors but produce garbage results. All three major AI labs (Anthropic, OpenAI, Google DeepMind) test debugging skills in interviews. This module covers the bugs you will actually encounter.
Spot the Bug
What you’re seeing: three side-by-side code cards, each with a subtle production bug — missing zero_grad(), wrong model.eval() placement, and an in-place operation that silently breaks autograd. What to try: read the buggy snippet first and form a hypothesis before expanding the fix — the bugs are deliberately minimal, so the error isn’t obvious at a glance.
Three of the most common PyTorch bugs in production and interviews. Can you spot what is wrong in each snippet before reading the fix?
Spot the bug
for batch in loader:
    logits = model(batch)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    # What's missing?
Reveal answer
Missing optimizer.zero_grad(). Gradients accumulate across batches, so the model trains on increasingly stale gradient sums instead of the current batch gradient.
Spot the bug
probs = F.softmax(logits, dim=-1)
loss = F.cross_entropy(
    probs, targets
)
Reveal answer
Double softmax! CrossEntropyLoss already applies softmax internally. Passing softmax output means it applies softmax(softmax(logits)), collapsing the probability distribution toward uniform.
Spot the bug
# logits: [batch, seq, vocab]
probs = F.softmax(logits, dim=0)
next_token = probs.argmax(dim=-1)
Reveal answer
Wrong dimension! dim=0 normalizes across the batch, not across the vocabulary. The fix is dim=-1 (or dim=2) to normalize across vocab for each position.
The Most Common PyTorch Bugs
Category 1: Training Bugs
1. Missing optimizer.zero_grad()
PyTorch accumulates gradients by default. Without zeroing, each backward() adds to the existing gradients. After N steps, you are optimizing on the sum of all past gradients, not the current batch.
# BUGGY: gradients accumulate
loss.backward()
optimizer.step()

# FIXED: zero gradients first
optimizer.zero_grad()
loss.backward()
optimizer.step()
2. Double softmax with CrossEntropyLoss
F.cross_entropy expects raw logits — it applies LogSoftmax internally. If you apply softmax first, the loss effectively computes softmax(softmax(x)), which collapses the distribution toward uniform and kills gradients.
# BUGGY: double softmax
probs = F.softmax(logits, dim=-1)
loss = F.cross_entropy(probs, targets)

# FIXED: pass raw logits
loss = F.cross_entropy(logits, targets)
3. Forgetting model.train() / model.eval()
BatchNorm and Dropout behave differently in train vs. eval mode. Without model.eval(), Dropout still randomly zeros neurons during inference, and BatchNorm uses batch statistics instead of running averages — causing inconsistent, noisy predictions.
# BUGGY: dropout active during eval
predictions = model(test_input)
accuracy = compute_accuracy(predictions)

# FIXED: switch to eval mode
model.eval()
with torch.no_grad():
    predictions = model(test_input)
accuracy = compute_accuracy(predictions)
model.train()  # switch back
Category 2: Tensor Bugs
4. Wrong dimension in softmax/loss
For a tensor of shape [batch, seq_len, vocab_size], dim=-1 normalizes across vocab (correct), dim=0 normalizes across batch (wrong — makes tokens compete across samples), and dim=1 normalizes across sequence (wrong — makes positions compete).
# BUGGY: softmax across batch
probs = F.softmax(logits, dim=0)

# FIXED: softmax across vocab
probs = F.softmax(logits, dim=-1)
5. Silent broadcasting errors
PyTorch broadcasts silently. If you add tensors of shape [32, 1] and [1, 64], you get [32, 64] — no error, but possibly wrong semantics. This is especially dangerous with loss masks.
# BUGGY: mask shape [B] not [B, S]
mask = (labels != -100)  # shape [B]
loss = (loss_per_token * mask).mean()

# FIXED: ensure matching shapes
mask = (labels != -100)  # [B, S]
assert mask.shape == loss_per_token.shape
loss = (loss_per_token * mask).sum()
loss = loss / mask.sum()
6. In-place operations breaking autograd
In-place operations (like relu_, x += 1, x[0] = val) modify tensors that autograd may need for the backward pass. This can cause cryptic errors or silently wrong gradients.
# BUGGY: in-place modifies graph
x = F.relu_(self.linear(x))
# May error: "modified by inplace op"

# FIXED: out-of-place operation
x = F.relu(self.linear(x))
Category 3: Data Bugs
7. No shuffle in DataLoader
If your data is sorted by class (all 0s then all 1s), the model sees only one class per batch. It learns to always predict the most recent class and never converges. DataLoader defaults to shuffle=False.
# BUGGY: data may be ordered
loader = DataLoader(dataset, batch_size=32)

# FIXED: always shuffle training
loader = DataLoader(
    dataset, batch_size=32, shuffle=True
)
8. Data leakage (test in training)
Computing normalization statistics, fitting tokenizers, or selecting features on the full dataset before splitting leaks test information into training. The model appears to generalize but fails on truly unseen data.
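A minimal sketch of the safe ordering (the tensors and split sizes here are illustrative, not taken from any incident above): split first, then fit statistics on the training portion only.

```python
import torch

data = torch.randn(10_000, 32)             # illustrative dataset
train, test = data[:8_000], data[8_000:]   # split FIRST

mean, std = train.mean(dim=0), train.std(dim=0)  # statistics from train only
train_norm = (train - mean) / std
test_norm = (test - mean) / std            # reuse train stats; never refit on test
```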
9. Wrong normalization
Using ImageNet mean/std on non-ImageNet data, normalizing per-batch instead of per-dataset, or forgetting to normalize at all. The model can still train but converges slower and to worse minima.
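A sketch of deriving per-dataset channel statistics instead of reusing the ImageNet constants — the dataset below is a random stand-in for your own images:

```python
import torch
from torch.utils.data import DataLoader

# Stand-in for a real dataset of (image, label) pairs with [3, H, W] float images
my_image_dataset = [(torch.rand(3, 224, 224), 0) for _ in range(512)]
loader = DataLoader(my_image_dataset, batch_size=256)

pixels, channel_sum, channel_sq = 0, torch.zeros(3), torch.zeros(3)
for images, _ in loader:                         # images: [B, 3, H, W]
    pixels += images.numel() // images.size(1)   # pixels per channel
    channel_sum += images.sum(dim=(0, 2, 3))
    channel_sq += (images ** 2).sum(dim=(0, 2, 3))

mean = channel_sum / pixels
std = (channel_sq / pixels - mean ** 2).sqrt()   # Var[x] = E[x^2] - E[x]^2
# Then use these in transforms.Normalize(mean.tolist(), std.tolist())
```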
Category 4: GPU Bugs
10. Tensors on different devices
# BUGGY: device mismatch
model = Model().cuda()
x = torch.randn(32, 768)  # CPU!
out = model(x)  # RuntimeError

# FIXED: match devices
device = next(model.parameters()).device
x = torch.randn(32, 768, device=device)
out = model(x)
11. OOM debugging
Training uses 2–4x more memory than inference (activations + optimizer states). Fixes: gradient checkpointing (below), mixed precision, a smaller batch size, or the Adafactor optimizer.
12. Gradient checkpointing
# Trade compute for memory
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def forward(self, x):
        # Recompute activations during backward
        return checkpoint(self._forward, x, use_reentrant=False)

    def _forward(self, x):
        return self.ffn(self.attn(self.norm(x)) + x)
Category 5: NaN Debugging
13. Learning rate too high
Gradients overshoot, weights explode, loss goes to inf then NaN. Fix: reduce the learning rate by 10x, add warmup, or use gradient clipping (see bug 15 below).
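One common warmup recipe is a LambdaLR schedule that ramps the learning rate linearly over the first steps — a minimal sketch, where the model and step counts are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)                        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1_000                               # illustrative
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, (step + 1) / warmup_steps),  # linear ramp, then constant
)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step()
```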
14. log(0) and division by zero
# BUGGY: log(0) = -inf → NaN
log_probs = torch.log(probs)
# BUGGY: division by zero
normalized = x / x.norm(dim=-1)

# FIXED: clamp to avoid log(0)
log_probs = torch.log(probs + 1e-8)
# FIXED: add epsilon
normalized = x / (x.norm(dim=-1) + 1e-8)
15. Exploding gradients
# Detect and fix exploding gradients
# max_norm=1.0 is the standard LLM pretraining default (Llama-2, GPT-3)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Debug: print gradient norms
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name}: grad_norm={p.grad.norm():.4f}")
Enable torch.autograd.set_detect_anomaly(True) during development. It pinpoints exactly which operation produced the NaN, at the cost of slower training.
Quick check
A model gets 90% accuracy during training but wildly fluctuating accuracy during inference (sometimes 60%, sometimes 89%). Dropout rate is 0.3. What is the most likely cause?
What happens if you forget optimizer.zero_grad() in a training loop?
Why These Bugs Matter — The Math
Why zero_grad() Matters: Gradient Accumulation
Without zeroing, the gradient applied at step t is the sum of all past gradients: g_t(buggy) = ∇L_1 + ∇L_2 + … + ∇L_t, instead of just ∇L_t.
The buggy gradient grows linearly with step count. After 100 steps, you are effectively using a learning rate 100x too large, weighted toward early batches.
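You can watch the accumulation on a toy parameter — a minimal sketch:

```python
import torch

w = torch.ones(1, requires_grad=True)
for step in range(3):
    loss = (2.0 * w).sum()       # d(loss)/dw = 2 at every step
    loss.backward()              # no zero_grad(): gradients keep adding up
    print(step, w.grad.item())   # 2.0, 4.0, 6.0 — grows linearly with step count
```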
Why Double Softmax Kills Training
Softmax compresses logits into a probability distribution. Applying it twice flattens the distribution dramatically. If the model predicts softmax(z) = [0.7, 0.2, 0.1], each extra softmax pushes the distribution closer to uniform:
The confident prediction [0.7, 0.2, 0.1] collapses to near-uniform [0.38, 0.32, 0.30]. The loss gradient becomes tiny because the model appears already uncertain, so it barely updates. Training stalls.
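The collapse is easy to reproduce directly (numbers rounded):

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.7, 0.2, 0.1])
once = F.softmax(p, dim=-1)       # ≈ [0.46, 0.28, 0.25]
twice = F.softmax(once, dim=-1)   # ≈ [0.38, 0.32, 0.31] — nearly uniform
print(once, twice)
```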
Why Training Uses So Much More Memory Than Inference
Two compounding factors:
- Activations: every layer's activations must be kept alive for the backward pass, while inference can discard them layer by layer.
- Optimizer state: Adam stores two extra fp32 tensors (momentum and variance) per parameter, on top of the gradients themselves.
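A back-of-the-envelope for a 7B-parameter model trained in fp32 with plain Adam (activations excluded; numbers approximate):

```python
params = 7e9
gb = 1e9
weights   = 4 * params / gb   # fp32 weights            ≈ 28 GB
gradients = 4 * params / gb   # fp32 gradients          ≈ 28 GB
adam_m_v  = 8 * params / gb   # Adam momentum + variance ≈ 56 GB
print(f"~{weights + gradients + adam_m_v:.0f} GB before activations")  # ~112 GB
```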
PyTorch: Systematic NaN Debugging
def debug_nan_loss(model, batch, criterion):
    """Systematic NaN debugging checklist."""
    # Step 1: Check inputs
    assert not torch.isnan(batch['x']).any(), "NaN in inputs!"
    assert not torch.isinf(batch['x']).any(), "Inf in inputs!"

    # Step 2: Check forward pass
    with torch.autograd.detect_anomaly():
        logits = model(batch['x'])
        print(f"Logits: min={logits.min():.4f}, max={logits.max():.4f}")
        loss = criterion(logits, batch['y'])
        print(f"Loss: {loss.item():.4f}")

        # Step 3: Check backward
        loss.backward()

    # Step 4: Check gradients
    for name, p in model.named_parameters():
        if p.grad is not None:
            if torch.isnan(p.grad).any():
                print(f"NaN gradient in: {name}")
            grad_norm = p.grad.norm()
            if grad_norm > 100:
                print(f"Exploding gradient in: {name} ({grad_norm:.1f})")
Quick check
You run 50 training steps without zero_grad(). At step 50, what is the effective learning rate multiplier compared to a correct single-step update?
Break It — See What Happens
Quick check
You replace F.relu(x) with F.relu_(x) in a residual block. Under what condition does PyTorch raise an error versus silently compute wrong gradients?
Advanced Incidents — Research Engineer Interview Scenarios
These are production-grade debugging scenarios tested at top AI labs. Each requires understanding distributed systems, numerical precision, or subtle API misuse. Try to diagnose before revealing the fix.
Incident 1: AMP Overflow — loss=NaN after 500 steps (fp16)
Mixed precision training with fp16. Loss looks normal for 500 steps, then suddenly becomes NaN. Works perfectly in fp32. Root cause: no GradScaler — gradient magnitudes grow until they exceed fp16's maximum representable value (~65504) and overflow to inf.
Without GradScaler, there is no dynamic loss scaling: small gradients underflow to zero, large ones overflow to inf, and nothing detects the overflow or skips the bad step — so the NaN propagates into the weights.
Buggy Code (no GradScaler):
model = LargeTransformer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(batch['input_ids'].cuda())
        loss = F.cross_entropy(logits.view(-1, vocab_size),
                               batch['labels'].cuda().view(-1))
    # BUG: no GradScaler — fp16 gradients overflow silently
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After ~500 steps, gradient magnitudes exceed fp16 max (65504)
# loss → inf → NaN, training is irrecoverable
Fixed Code (with GradScaler + dynamic loss scaling):
model = LargeTransformer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for batch in dataloader:
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(batch['input_ids'].cuda())
        loss = F.cross_entropy(logits.view(-1, vocab_size),
                               batch['labels'].cuda().view(-1))

    # GradScaler scales loss up before backward (e.g., 2^16)
    # so small gradients don't underflow in fp16
    scaler.scale(loss).backward()

    # Unscales gradients, skips step if inf/NaN detected
    scaler.step(optimizer)
    scaler.update()  # adjusts scale factor dynamically
    optimizer.zero_grad()

# If overflow detected: scale is halved, step is skipped
# Training self-heals instead of crashing
Incident 2: FSDP Deadlock — training hangs, all GPUs at 0%
Distributed training with FSDP/DDP across 8 GPUs. After some steps, all GPUs drop to 0% utilization. Processes are alive but no progress. No error message — just a silent hang.
Buggy Code (rank skips forward pass):
# Distributed training with data filtering
for batch in dataloader:
    # BUG: some ranks skip batches that don't meet criteria
    if batch['length'].max() < MIN_SEQ_LEN:
        continue  # rank 0 skips, but ranks 1-7 proceed
    logits = model(batch['input_ids'])
    loss = F.cross_entropy(logits.view(-1, V), batch['labels'].view(-1))
    loss.backward()  # triggers all-reduce across ALL ranks
    optimizer.step()
    optimizer.zero_grad()

# Ranks 1-7 call all-reduce in backward()
# Rank 0 skipped → all-reduce waits forever → DEADLOCK
# nvidia-smi shows all GPUs at 0%, processes alive but blocked
Fixed Code (all ranks process same number of batches):
# Fix 1: Filter data BEFORE distributed sampler
filtered_dataset = [x for x in dataset if x['length'] >= MIN_SEQ_LEN]
sampler = DistributedSampler(filtered_dataset, shuffle=True)
dataloader = DataLoader(filtered_dataset, sampler=sampler)

for batch in dataloader:
    # All ranks always execute forward + backward
    logits = model(batch['input_ids'])
    loss = F.cross_entropy(logits.view(-1, V), batch['labels'].view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Fix 2: If you must filter dynamically, run a dummy forward+backward
# so this rank still joins the gradient all-reduce
for batch in dataloader:
    if batch['length'].max() < MIN_SEQ_LEN:
        dummy = model(batch['input_ids'][:1])
        (dummy.sum() * 0).backward()  # zero gradient, but the all-reduce still fires
        optimizer.step()              # step on the averaged gradients to stay in sync
        optimizer.zero_grad()
        continue
    # ... normal training ...
Memory note: with plain DDP, every rank holds a full copy of the optimizer state; sharding it requires ZeRO Stage 2+ (or FSDP's full-sharding mode, which also shards parameters and gradients).
Use torch.distributed.monitored_barrier() to debug which rank is stuck.
Incident 3: Attention Mask Bug — padding tokens attend to everything
Custom attention implementation. Model trains but performs worse than expected on variable-length sequences. Loss is valid but accuracy degrades with more padding in the batch.
Buggy Code (mask applied AFTER softmax):
def attention(Q, K, V, mask):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    attn_weights = F.softmax(scores, dim=-1)
    # BUG: masking AFTER softmax — zeroes out weights but
    # doesn't redistribute probability mass
    attn_weights = attn_weights * mask.unsqueeze(1)
    # Remaining weights no longer sum to 1.0
    # Padding tokens already influenced the softmax denominator
    # Output vectors are scaled down proportional to padding ratio
    return torch.matmul(attn_weights, V)
Fixed Code (mask applied BEFORE softmax as -inf):
def attention(Q, K, V, mask):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # FIXED: apply mask BEFORE softmax as -inf in logits
    # softmax(-inf) = 0, and remaining weights naturally sum to 1
    scores = scores.masked_fill(
        mask.unsqueeze(1) == 0,  # True where padding
        float('-inf')
    )
    attn_weights = F.softmax(scores, dim=-1)
    # Now: padding positions get exactly 0 attention weight
    # Non-padding weights sum to 1.0 (proper probability distribution)
    return torch.matmul(attn_weights, V)
Incident 4: Profiler Interpretation — GPU util at 30%, what is the bottleneck?
You profile your training job and see this output. GPU utilization is stuck at 30%. Where is the bottleneck?
$ python -m torch.utils.bottleneck train.py
----- autograd profiler results -----
Name CPU time CUDA time Calls
----------------------------------------------------
aten::linear 12.3ms 8.1ms 48
aten::batch_norm 3.1ms 1.9ms 16
aten::relu 0.8ms 0.4ms 16
aten::cross_entropy 1.2ms 0.9ms 1
aten::backward 18.7ms 14.2ms 1
----------------------------------------------------
Total model time: 36.1ms 25.5ms
DataLoader time: 82.4ms (per batch)
GPU idle time: 67.3ms (per step)
GPU utilization: 30.1%
Breakdown per step:
[========---------] data loading: 82.4ms (68.2%)
[====] forward: 12.3ms (10.2%)
[======] backward: 18.7ms (15.5%)
[==] optimizer: 7.4ms (6.1%)
Reveal diagnosis
Bottleneck: Data loading, not the model. The DataLoader takes 82.4ms per batch (68% of step time) while the GPU finishes forward+backward in 31ms. The GPU sits idle for 67.3ms every step waiting for the next batch.
This is a CPU-bound pipeline. Fixes:
# Fix: parallelize data loading
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # parallel data loading processes
    pin_memory=True,          # faster CPU→GPU transfer
    persistent_workers=True,  # don't respawn workers each epoch
    prefetch_factor=3,        # prefetch 3 batches per worker
)
# Also: move preprocessing (tokenization, augmentation)
# into the Dataset.__getitem__ or use NVIDIA DALI for
# GPU-accelerated data preprocessing
Quick check
GradScaler shows ‘skipped steps’ in training logs every few hundred steps. Should you stop training?
Interview Frequency — Most Common Debugging Questions
| Bug | Frequency | Labs That Ask |
|---|---|---|
| NaN/loss debugging | █████████░ 90% | Anthropic, OpenAI, Google |
| Model not learning / wrong loss | ████████░░ 80% | Anthropic, OpenAI, Google |
| Data leakage / wrong eval | ███████░░░ 70% | Google, Meta |
| OOM / memory bugs | ██████░░░░ 60% | Google, Meta |
| Device mismatch / GPU bugs | █████░░░░░ 50% | OpenAI, Google |
| Slow training / GPU utilization | ████░░░░░░ 40% | Meta, OpenAI |
Emerging Bug Category: torch.compile Graph Breaks
torch.compile (introduced in PyTorch 2.0) can speed up training substantially by compiling your model into optimized kernels (the quick check below cites 1.5–2×). But it works by tracing your model into a computation graph — any Python control flow that depends on tensor values (not shapes) causes a graph break: the compiler falls back to eager execution at that point, losing the speedup. Common causes: if tensor.item() > 0, print(tensor), unsupported ops, and data-dependent shapes. Diagnose with torch._dynamo.explain(model)(inputs) — it lists every graph break and its cause. In production training runs at scale, a single graph break in a tight loop can eliminate the entire compile speedup. The fix is to push value-dependent logic outside the compiled region or replace it with tensor-friendly alternatives (e.g., torch.where instead of if/else).
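A minimal sketch of the failure mode and the tensor-friendly rewrite (the threshold and shapes are illustrative; the explain call mirrors the one quoted above):

```python
import torch

def step_bad(x, threshold=0.5):
    # Data-dependent Python branch: .item() forces a graph break here
    if x.mean().item() > threshold:
        return x * 2
    return x

def step_good(x, threshold=0.5):
    # torch.where keeps the decision inside the traced graph — no break
    return torch.where(x.mean() > threshold, x * 2, x)

compiled = torch.compile(step_good)
out = compiled(torch.randn(8, 16))
# Diagnose breaks in the bad version:
# torch._dynamo.explain(step_bad)(torch.randn(8, 16))
```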
Quick check
torch.compile promises 1.5–2× speedup. You add `if loss.item() < threshold: break` inside the compiled training loop. What happens to the speedup?
Key Takeaways
What to remember for interviews
1. Missing optimizer.zero_grad() causes gradient accumulation across batches — the model optimizes on a noisy sum of all past gradients, not the current batch.
2. Double softmax destroys training: F.cross_entropy() already applies LogSoftmax internally — passing pre-softmaxed probabilities collapses the distribution toward uniform and kills gradients.
3. model.eval() is mandatory before inference: without it, Dropout randomly zeros neurons and BatchNorm uses live batch statistics, producing inconsistent and noisy predictions.
4. AMP NaN explosions happen because fp16's max value is ~65504 — gradient magnitudes can overflow after hundreds of steps. GradScaler with dynamic loss scaling detects overflows and skips those steps.
5. GPU underutilization (20%) is usually a data pipeline bottleneck: num_workers=0 starves the GPU; adding num_workers=4 + pin_memory=True + non_blocking=True typically restores full utilization.
Recap quiz
PyTorch Debugging recap
After 100 training steps without optimizer.zero_grad(), the effective gradient magnitude is roughly how many times larger than the single-batch gradient?
F.cross_entropy(F.softmax(logits), targets) vs F.cross_entropy(logits, targets). Which training symptom best distinguishes the buggy call?
A 7B parameter model in fp32 uses ~28 GB for parameters. Approximately how much total GPU memory does Adam training require (params + optimizer state, ignoring activations)?
AMP training (fp16) with no GradScaler works for 500 steps then NaN appears. Which mechanism best explains the delayed failure?
In DDP training on 8 GPUs, rank 0 encounters an empty batch and skips it with `continue`. All other ranks proceed normally. What happens next?
A custom attention implementation applies `attn_weights = attn_weights * mask` AFTER softmax, zeroing padding positions. Why does this still corrupt the output?
Gradient checkpointing reduces activation memory from O(n) to O(√n) for an n-layer model. What is the primary cost?
Further Reading
- A Recipe for Training Neural Networks — Karpathy 2019 — systematic approach to debugging and training neural networks from scratch
- PyTorch Frequently Asked Questions — PyTorch docs — common issues with memory, parallelism, and reproducibility
- PyTorch Autograd Mechanics — Official deep-dive into how autograd builds the computation graph, handles in-place ops, and propagates gradients — essential for debugging gradient issues
- Karpathy — micrograd: building autograd from scratch (YouTube) — Building a scalar-valued autograd engine from scratch — the best way to develop intuition for what PyTorch is doing under the hood
- PyTorch Compile Troubleshooting Guide — Debugging torch.compile graph breaks, dynamic shapes, and recompilations — increasingly important for modern training pipelines
Interview Questions
★★☆ Debug NaN Loss: Your training loss becomes NaN after a few hundred steps. The model was training fine initially. Here's your training loop:
```python
for batch in dataloader:
    logits = model(batch['input_ids'])
    loss = -torch.log(F.softmax(logits, dim=-1))
    loss = loss.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
What's wrong and how do you fix it?

★★☆ Debug Model Not Learning: Your model's training loss barely decreases. Validation loss stays flat. Here's the code:
```python
model = TransformerLM(config)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    for batch in train_loader:
        logits = model(batch['x'])
        loss = F.cross_entropy(logits, batch['y'])
        loss.backward()
        # optimizer.step() is called every 4 batches
        if step % 4 == 0:
            optimizer.step()
```
What's wrong?

★★☆ Debug OOM: Your model fits in GPU memory during eval but crashes with OOM during training. The model uses 8GB and you have 16GB free. Why?
```python
model = BigModel().cuda()  # 8GB
for batch in dataloader:
    outputs = model(batch.cuda())
    loss = criterion(outputs, targets.cuda())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

★★☆ Debug Wrong Accuracy: Your model gets 99% train accuracy but only 52% test accuracy (binary classification). The dataset is balanced. Here's your data pipeline:
```python
all_data = load_dataset()  # 10k samples

# Normalize using all data
mean = all_data.mean()
std = all_data.std()
all_data = (all_data - mean) / std

# Split after normalization
train = all_data[:8000]
test = all_data[8000:]
train_loader = DataLoader(train, batch_size=32)
test_loader = DataLoader(test, batch_size=32)
```

★★☆ Debug Data Leakage: You're fine-tuning a model for sentiment analysis. It gets 97% accuracy on your test set but only 60% in production. Your preprocessing:
```python
# Load and preprocess
df = pd.read_csv('reviews.csv')
df['text'] = df['text'].apply(clean_text)

# Feature engineering on FULL dataset
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['text'])  # fit on ALL data
y = df['label']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
What's causing the production gap?

★★★ Debug Slow Training: Your training is 5x slower than expected. GPU utilization is only 20%. Here's your setup:
```python
train_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=0,
    pin_memory=False
)
for batch in train_loader:
    x = batch['input'].cuda()
    y = batch['label'].cuda()
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f'Loss: {loss.item()}')
```
What's causing the slowdown?

★★★ Debug: Your distributed training job hangs after 100 steps. All GPUs show 0% utilization. `nvidia-smi` shows processes alive but idle. What do you check and how do you diagnose?

★★★ Debug: Your AMP (automatic mixed precision) training shows loss=NaN after 500 steps but works fine in fp32. Training loss looks normal for the first 499 steps. Diagnose the issue.