🐛 PyTorch Debugging
optimizer.zero_grad() is missing — can you spot it in 5 minutes?
Every ML engineer has spent hours debugging silent PyTorch bugs — models that train without errors but produce garbage results. All three major AI labs (Anthropic, OpenAI, Google DeepMind) test debugging skills in interviews. This module covers the bugs you will actually encounter.
Spot the Bug
What you’re seeing: three side-by-side code cards, each with a subtle production bug — missing zero_grad(), wrong model.eval() placement, and an in-place operation that silently breaks autograd. What to try: read the buggy snippet first and form a hypothesis before expanding the fix — the bugs are deliberately minimal, so the error isn’t obvious at a glance.
Three of the most common PyTorch bugs in production and interviews. Can you spot what is wrong in each snippet before reading the fix?
Spot the bug
for batch in loader:
    logits = model(batch)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    # What's missing?
Reveal answer
Missing optimizer.zero_grad(). Gradients accumulate across batches, so the model trains on increasingly stale gradient sums instead of the current batch gradient.
Spot the bug
probs = F.softmax(logits, dim=-1)
loss = F.cross_entropy(
    probs, targets
)
Reveal answer
Double softmax! CrossEntropyLoss already applies softmax internally. Passing softmax output means it applies softmax(softmax(logits)), collapsing the probability distribution toward uniform.
Spot the bug
# logits: [batch, seq, vocab]
probs = F.softmax(logits, dim=0)
next_token = probs.argmax(dim=-1)
Reveal answer
Wrong dimension! dim=0 normalizes across the batch, not across the vocabulary. The fix is dim=-1 (or dim=2) to normalize across vocab for each position.
The Most Common PyTorch Bugs
Category 1: Training Bugs
1. Missing optimizer.zero_grad()
PyTorch accumulates gradients by default. Without zeroing, each backward() adds to the existing gradients. After N steps, you are optimizing on the sum of all past gradients, not the current batch.
# BUGGY: gradients accumulate
loss.backward()
optimizer.step()

# FIXED: zero gradients first
optimizer.zero_grad()
loss.backward()
optimizer.step()
2. Double softmax with CrossEntropyLoss
F.cross_entropy expects raw logits — it applies LogSoftmax internally. If you apply softmax first, the loss effectively computes softmax(softmax(x)), which collapses the distribution toward uniform and kills gradients.
# BUGGY: double softmax
probs = F.softmax(logits, dim=-1)
loss = F.cross_entropy(probs, targets)

# FIXED: pass raw logits
loss = F.cross_entropy(logits, targets)
3. Forgetting model.train() / model.eval()
BatchNorm and Dropout behave differently in train vs. eval mode. Without model.eval(), Dropout still randomly zeros neurons during inference, and BatchNorm uses batch statistics instead of running averages — causing inconsistent, noisy predictions.
# BUGGY: dropout active during eval
predictions = model(test_input)
accuracy = compute_accuracy(predictions)

# FIXED: switch to eval mode
model.eval()
with torch.no_grad():
    predictions = model(test_input)
accuracy = compute_accuracy(predictions)
model.train()  # switch back
Category 2: Tensor Bugs
4. Wrong dimension in softmax/loss
For a tensor of shape [batch, seq_len, vocab_size], dim=-1 normalizes across vocab (correct), dim=0 normalizes across batch (wrong — makes tokens compete across samples), and dim=1 normalizes across sequence (wrong — makes positions compete).
# BUGGY: softmax across batch
probs = F.softmax(logits, dim=0)

# FIXED: softmax across vocab
probs = F.softmax(logits, dim=-1)
5. Silent broadcasting errors
PyTorch broadcasts silently. If you add tensors of shape [32, 1] and [1, 64], you get [32, 64] — no error, but possibly wrong semantics. This is especially dangerous with loss masks.
# BUGGY: mask shape [B] not [B, S]
mask = (labels != -100)  # shape [B]
loss = (loss_per_token * mask).mean()

# FIXED: ensure matching shapes
mask = (labels != -100)  # [B, S]
assert mask.shape == loss_per_token.shape
loss = (loss_per_token * mask).sum()
loss = loss / mask.sum()
6. In-place operations breaking autograd
In-place operations (like relu_, x += 1, x[0] = val) modify tensors that autograd may need for the backward pass. This can cause cryptic errors or silently wrong gradients.
# BUGGY: in-place modifies graph
x = F.relu_(self.linear(x))
# May error: "modified by inplace op"

# FIXED: out-of-place operation
x = F.relu(self.linear(x))
Category 3: Data Bugs
7. No shuffle in DataLoader
If your data is sorted by class (all 0s then all 1s), the model sees only one class per batch. It learns to always predict the most recent class and never converges. DataLoader defaults to shuffle=False.
# BUGGY: data may be ordered
loader = DataLoader(dataset, batch_size=32)

# FIXED: always shuffle training
loader = DataLoader(
    dataset, batch_size=32, shuffle=True
)
8. Data leakage (test in training)
Computing normalization statistics, fitting tokenizers, or selecting features on the full dataset before splitting leaks test information into training. The model appears to generalize but fails on truly unseen data.
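A minimal sketch of the safe ordering (the tensors and split sizes here are illustrative, not taken from any incident above): split first, then fit statistics on the training portion only.

```python
import torch

data = torch.randn(10_000, 32)             # illustrative dataset
train, test = data[:8_000], data[8_000:]   # split FIRST

mean, std = train.mean(dim=0), train.std(dim=0)  # statistics from train only
train_norm = (train - mean) / std
test_norm = (test - mean) / std            # reuse train stats; never refit on test
```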
9. Wrong normalization
Using ImageNet mean/std on non-ImageNet data, normalizing per-batch instead of per-dataset, or forgetting to normalize at all. The model can still train but converges slower and to worse minima.
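A sketch of deriving per-dataset channel statistics instead of reusing the ImageNet constants — the dataset below is a random stand-in for your own images:

```python
import torch
from torch.utils.data import DataLoader

# Stand-in for a real dataset of (image, label) pairs with [3, H, W] float images
my_image_dataset = [(torch.rand(3, 224, 224), 0) for _ in range(512)]
loader = DataLoader(my_image_dataset, batch_size=256)

pixels, channel_sum, channel_sq = 0, torch.zeros(3), torch.zeros(3)
for images, _ in loader:                         # images: [B, 3, H, W]
    pixels += images.numel() // images.size(1)   # pixels per channel
    channel_sum += images.sum(dim=(0, 2, 3))
    channel_sq += (images ** 2).sum(dim=(0, 2, 3))

mean = channel_sum / pixels
std = (channel_sq / pixels - mean ** 2).sqrt()   # Var[x] = E[x^2] - E[x]^2
# Then use these in transforms.Normalize(mean.tolist(), std.tolist())
```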
Category 4: GPU Bugs
10. Tensors on different devices
# BUGGY: device mismatch
model = Model().cuda()
x = torch.randn(32, 768)  # CPU!
out = model(x)  # RuntimeError

# FIXED: match devices
device = next(model.parameters()).device
x = torch.randn(32, 768, device=device)
out = model(x)
11. OOM debugging
Training uses 2–4x more memory than inference (activations + optimizer states). Fixes: gradient checkpointing (below), mixed precision, a smaller batch size, or the Adafactor optimizer.
12. Gradient checkpointing
# Trade compute for memory
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def forward(self, x):
        # Recompute activations during backward
        return checkpoint(self._forward, x, use_reentrant=False)

    def _forward(self, x):
        return self.ffn(self.attn(self.norm(x)) + x)
Category 5: NaN Debugging
13. Learning rate too high
Gradients overshoot, weights explode, loss goes to inf then NaN. Fix: reduce the learning rate by 10x, add warmup, or use gradient clipping (see bug 15 below).
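One common warmup recipe is a LambdaLR schedule that ramps the learning rate linearly over the first steps — a minimal sketch, where the model and step counts are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)                        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 1_000                               # illustrative
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, (step + 1) / warmup_steps),  # linear ramp, then constant
)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step()
```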
14. log(0) and division by zero
# BUGGY: log(0) = -inf → NaN
log_probs = torch.log(probs)
# BUGGY: division by zero
normalized = x / x.norm(dim=-1)

# FIXED: clamp to avoid log(0)
log_probs = torch.log(probs + 1e-8)
# FIXED: add epsilon
normalized = x / (x.norm(dim=-1) + 1e-8)
15. Exploding gradients
# Detect and fix exploding gradients
# max_norm=1.0 is the standard LLM pretraining default (Llama-2, GPT-3)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Debug: print gradient norms
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name}: grad_norm={p.grad.norm():.4f}")
Enable torch.autograd.set_detect_anomaly(True) during development. It pinpoints exactly which operation produced the NaN, at the cost of slower training.
Quick check
A model gets 90% accuracy during training but wildly fluctuating accuracy during inference (sometimes 60%, sometimes 89%). Dropout rate is 0.3. What is the most likely cause?
What happens if you forget optimizer.zero_grad() in a training loop?
Why These Bugs Matter — The Math
Why zero_grad() Matters: Gradient Accumulation
Without zeroing, the gradient applied at step t is the sum of all past gradients: g_t(buggy) = ∇L_1 + ∇L_2 + … + ∇L_t, instead of just ∇L_t.
The buggy gradient grows linearly with step count. After 100 steps, you are effectively using a learning rate 100x too large, weighted toward early batches.
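You can watch the accumulation on a toy parameter — a minimal sketch:

```python
import torch

w = torch.ones(1, requires_grad=True)
for step in range(3):
    loss = (2.0 * w).sum()       # d(loss)/dw = 2 at every step
    loss.backward()              # no zero_grad(): gradients keep adding up
    print(step, w.grad.item())   # 2.0, 4.0, 6.0 — grows linearly with step count
```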
Why Double Softmax Kills Training
Softmax compresses logits into a probability distribution. Applying it twice flattens the distribution dramatically. If the model predicts softmax(z) = [0.7, 0.2, 0.1], each extra softmax pushes the distribution closer to uniform:
The confident prediction [0.7, 0.2, 0.1] collapses to near-uniform [0.38, 0.32, 0.30]. The loss gradient becomes tiny because the model appears already uncertain, so it barely updates. Training stalls.
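The collapse is easy to reproduce directly (numbers rounded):

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.7, 0.2, 0.1])
once = F.softmax(p, dim=-1)       # ≈ [0.46, 0.28, 0.25]
twice = F.softmax(once, dim=-1)   # ≈ [0.38, 0.32, 0.31] — nearly uniform
print(once, twice)
```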
Why Training Uses So Much More Memory Than Inference
Two compounding factors:
- Activations: every layer's activations must be kept alive for the backward pass, while inference can discard them layer by layer.
- Optimizer state: Adam stores two extra fp32 tensors (momentum and variance) per parameter, on top of the gradients themselves.
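A back-of-the-envelope for a 7B-parameter model trained in fp32 with plain Adam (activations excluded; numbers approximate):

```python
params = 7e9
gb = 1e9
weights   = 4 * params / gb   # fp32 weights            ≈ 28 GB
gradients = 4 * params / gb   # fp32 gradients          ≈ 28 GB
adam_m_v  = 8 * params / gb   # Adam momentum + variance ≈ 56 GB
print(f"~{weights + gradients + adam_m_v:.0f} GB before activations")  # ~112 GB
```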
PyTorch: Systematic NaN Debugging
def debug_nan_loss(model, batch, criterion):
    """Systematic NaN debugging checklist."""
    # Step 1: Check inputs
    assert not torch.isnan(batch['x']).any(), "NaN in inputs!"
    assert not torch.isinf(batch['x']).any(), "Inf in inputs!"

    # Step 2: Check forward pass
    with torch.autograd.detect_anomaly():
        logits = model(batch['x'])
        print(f"Logits: min={logits.min():.4f}, max={logits.max():.4f}")
        loss = criterion(logits, batch['y'])
        print(f"Loss: {loss.item():.4f}")

        # Step 3: Check backward
        loss.backward()

    # Step 4: Check gradients
    for name, p in model.named_parameters():
        if p.grad is not None:
            if torch.isnan(p.grad).any():
                print(f"NaN gradient in: {name}")
            grad_norm = p.grad.norm()
            if grad_norm > 100:
                print(f"Exploding gradient in: {name} ({grad_norm:.1f})")
Quick check
You run 50 training steps without zero_grad(). At step 50, what is the effective learning rate multiplier compared to a correct single-step update?
Break It — See What Happens
Quick check
You replace F.relu(x) with F.relu_(x) in a residual block. Under what condition does PyTorch raise an error versus silently compute wrong gradients?
Advanced Incidents — Research Engineer Interview Scenarios
These are production-grade debugging scenarios tested at top AI labs. Each requires understanding distributed systems, numerical precision, or subtle API misuse. Try to diagnose before revealing the fix.
Incident 1: AMP Overflow — loss=NaN after 500 steps (fp16)
Mixed precision training with fp16. Loss looks normal for 500 steps, then suddenly becomes NaN. Works perfectly in fp32. Root cause: no GradScaler — gradient magnitudes grow until they exceed fp16's maximum representable value (~65504) and overflow to inf.
Without GradScaler, there is no dynamic loss scaling: small gradients underflow to zero, large ones overflow to inf, and nothing detects the overflow or skips the bad step — so the NaN propagates into the weights.
Buggy Code (no GradScaler):
model = LargeTransformer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(batch['input_ids'].cuda())
        loss = F.cross_entropy(logits.view(-1, vocab_size),
                               batch['labels'].cuda().view(-1))
    # BUG: no GradScaler — fp16 gradients overflow silently
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After ~500 steps, gradient magnitudes exceed fp16 max (65504)
# loss → inf → NaN, training is irrecoverable
Fixed Code (with GradScaler + dynamic loss scaling):
model = LargeTransformer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for batch in dataloader:
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(batch['input_ids'].cuda())
        loss = F.cross_entropy(logits.view(-1, vocab_size),
                               batch['labels'].cuda().view(-1))

    # GradScaler scales loss up before backward (e.g., 2^16)
    # so small gradients don't underflow in fp16
    scaler.scale(loss).backward()

    # Unscales gradients, skips step if inf/NaN detected
    scaler.step(optimizer)
    scaler.update()  # adjusts scale factor dynamically
    optimizer.zero_grad()

# If overflow detected: scale is halved, step is skipped
# Training self-heals instead of crashing
Incident 2: FSDP Deadlock — training hangs, all GPUs at 0%
Distributed training with FSDP/DDP across 8 GPUs. After some steps, all GPUs drop to 0% utilization. Processes are alive but no progress. No error message — just a silent hang.
Buggy Code (rank skips forward pass):
# Distributed training with data filtering
for batch in dataloader:
    # BUG: some ranks skip batches that don't meet criteria
    if batch['length'].max() < MIN_SEQ_LEN:
        continue  # rank 0 skips, but ranks 1-7 proceed
    logits = model(batch['input_ids'])
    loss = F.cross_entropy(logits.view(-1, V), batch['labels'].view(-1))
    loss.backward()  # triggers all-reduce across ALL ranks
    optimizer.step()
    optimizer.zero_grad()

# Ranks 1-7 call all-reduce in backward()
# Rank 0 skipped → all-reduce waits forever → DEADLOCK
# nvidia-smi shows all GPUs at 0%, processes alive but blocked
Fixed Code (all ranks process same number of batches):
# Fix 1: Filter data BEFORE distributed sampler
filtered_dataset = [x for x in dataset if x['length'] >= MIN_SEQ_LEN]
sampler = DistributedSampler(filtered_dataset, shuffle=True)
dataloader = DataLoader(filtered_dataset, sampler=sampler)

for batch in dataloader:
    # All ranks always execute forward + backward
    logits = model(batch['input_ids'])
    loss = F.cross_entropy(logits.view(-1, V), batch['labels'].view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Fix 2: If you must filter dynamically, run a dummy forward+backward
# so this rank still joins the gradient all-reduce
for batch in dataloader:
    if batch['length'].max() < MIN_SEQ_LEN:
        dummy = model(batch['input_ids'][:1])
        (dummy.sum() * 0).backward()  # zero gradient, but the all-reduce still fires
        optimizer.step()              # step on the averaged gradients to stay in sync
        optimizer.zero_grad()
        continue
    # ... normal training ...
Memory note: with plain DDP, every rank holds a full copy of the optimizer state; sharding it requires ZeRO Stage 2+ (or FSDP's full-sharding mode, which also shards parameters and gradients).
Use torch.distributed.monitored_barrier() to debug which rank is stuck.
Incident 3: Attention Mask Bug — padding tokens attend to everything
Custom attention implementation. Model trains but performs worse than expected on variable-length sequences. Loss is valid but accuracy degrades with more padding in the batch.
Buggy Code (mask applied AFTER softmax):
def attention(Q, K, V, mask):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    attn_weights = F.softmax(scores, dim=-1)
    # BUG: masking AFTER softmax — zeroes out weights but
    # doesn't redistribute probability mass
    attn_weights = attn_weights * mask.unsqueeze(1)
    # Remaining weights no longer sum to 1.0
    # Padding tokens already influenced the softmax denominator
    # Output vectors are scaled down proportional to padding ratio
    return torch.matmul(attn_weights, V)
Fixed Code (mask applied BEFORE softmax as -inf):
def attention(Q, K, V, mask):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # FIXED: apply mask BEFORE softmax as -inf in logits
    # softmax(-inf) = 0, and remaining weights naturally sum to 1
    scores = scores.masked_fill(
        mask.unsqueeze(1) == 0,  # True where padding
        float('-inf')
    )
    attn_weights = F.softmax(scores, dim=-1)
    # Now: padding positions get exactly 0 attention weight
    # Non-padding weights sum to 1.0 (proper probability distribution)
    return torch.matmul(attn_weights, V)
Incident 4: Profiler Interpretation — GPU util at 30%, what is the bottleneck?
You profile your training job and see this output. GPU utilization is stuck at 30%. Where is the bottleneck?
$ python -m torch.utils.bottleneck train.py
----- autograd profiler results -----
Name CPU time CUDA time Calls
----------------------------------------------------
aten::linear 12.3ms 8.1ms 48
aten::batch_norm 3.1ms 1.9ms 16
aten::relu 0.8ms 0.4ms 16
aten::cross_entropy 1.2ms 0.9ms 1
aten::backward 18.7ms 14.2ms 1
----------------------------------------------------
Total model time: 36.1ms 25.5ms
DataLoader time: 82.4ms (per batch)
GPU idle time: 67.3ms (per step)
GPU utilization: 30.1%
Breakdown per step:
[========---------] data loading: 82.4ms (68.2%)
[====] forward: 12.3ms (10.2%)
[======] backward: 18.7ms (15.5%)
[==] optimizer: 7.4ms (6.1%)
Reveal diagnosis
Bottleneck: Data loading, not the model. The DataLoader takes 82.4ms per batch (68% of step time) while the GPU finishes forward+backward in 31ms. The GPU sits idle for 67.3ms every step waiting for the next batch.
This is a CPU-bound pipeline. Fixes:
# Fix: parallelize data loading
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # parallel data loading processes
    pin_memory=True,          # faster CPU→GPU transfer
    persistent_workers=True,  # don't respawn workers each epoch
    prefetch_factor=3,        # prefetch 3 batches per worker
)
# Also: move preprocessing (tokenization, augmentation)
# into the Dataset.__getitem__ or use NVIDIA DALI for
# GPU-accelerated data preprocessing
Quick check
GradScaler shows ‘skipped steps’ in training logs every few hundred steps. Should you stop training?
Interview Frequency — Most Common Debugging Questions
| Bug | Frequency | Labs That Ask |
|---|---|---|
| NaN/loss debugging | █████████░ 90% | Anthropic, OpenAI, Google |
| Model not learning / wrong loss | ████████░░ 80% | Anthropic, OpenAI, Google |
| Data leakage / wrong eval | ███████░░░ 70% | Google, Meta |
| OOM / memory bugs | ██████░░░░ 60% | Google, Meta |
| Device mismatch / GPU bugs | █████░░░░░ 50% | OpenAI, Google |
| Slow training / GPU utilization | ████░░░░░░ 40% | Meta, OpenAI |
Emerging Bug Category: torch.compile Graph Breaks
torch.compile (introduced in PyTorch 2.0) can speed up training substantially by compiling your model into optimized kernels (the quick check below cites 1.5–2×). But it works by tracing your model into a computation graph — any Python control flow that depends on tensor values (not shapes) causes a graph break: the compiler falls back to eager execution at that point, losing the speedup. Common causes: if tensor.item() > 0, print(tensor), unsupported ops, and data-dependent shapes. Diagnose with torch._dynamo.explain(model)(inputs) — it lists every graph break and its cause. In production training runs at scale, a single graph break in a tight loop can eliminate the entire compile speedup. The fix is to push value-dependent logic outside the compiled region or replace it with tensor-friendly alternatives (e.g., torch.where instead of if/else).
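A minimal sketch of the failure mode and the tensor-friendly rewrite (the threshold and shapes are illustrative; the explain call mirrors the one quoted above):

```python
import torch

def step_bad(x, threshold=0.5):
    # Data-dependent Python branch: .item() forces a graph break here
    if x.mean().item() > threshold:
        return x * 2
    return x

def step_good(x, threshold=0.5):
    # torch.where keeps the decision inside the traced graph — no break
    return torch.where(x.mean() > threshold, x * 2, x)

compiled = torch.compile(step_good)
out = compiled(torch.randn(8, 16))
# Diagnose breaks in the bad version:
# torch._dynamo.explain(step_bad)(torch.randn(8, 16))
```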
Quick check
torch.compile promises 1.5–2× speedup. You add `if loss.item() < threshold: break` inside the compiled training loop. What happens to the speedup?
Key Takeaways
What to remember for interviews
1. Missing optimizer.zero_grad() causes gradient accumulation across batches — the model optimizes on a noisy sum of all past gradients, not the current batch.
2. Double softmax destroys training: F.cross_entropy() already applies LogSoftmax internally — passing pre-softmaxed probabilities collapses the distribution toward uniform and kills gradients.
3. model.eval() is mandatory before inference: without it, Dropout randomly zeros neurons and BatchNorm uses live batch statistics, producing inconsistent and noisy predictions.
4. AMP NaN explosions happen because fp16's max value is ~65504 — gradient magnitudes can overflow after hundreds of steps. GradScaler with dynamic loss scaling detects overflows and skips those steps.
5. GPU underutilization (20%) is usually a data pipeline bottleneck: num_workers=0 starves the GPU; adding num_workers=4 + pin_memory=True + non_blocking=True typically restores full utilization.
Recap quiz
PyTorch Debugging recap
After 100 training steps without optimizer.zero_grad(), the effective gradient magnitude is roughly how many times larger than the single-batch gradient?
F.cross_entropy(F.softmax(logits), targets) vs F.cross_entropy(logits, targets). Which training symptom best distinguishes the buggy call?
A 7B parameter model in fp32 uses ~28 GB for parameters. Approximately how much total GPU memory does Adam training require (params + optimizer state, ignoring activations)?
AMP training (fp16) with no GradScaler works for 500 steps then NaN appears. Which mechanism best explains the delayed failure?
In DDP training on 8 GPUs, rank 0 encounters an empty batch and skips it with `continue`. All other ranks proceed normally. What happens next?
A custom attention implementation applies `attn_weights = attn_weights * mask` AFTER softmax, zeroing padding positions. Why does this still corrupt the output?
Gradient checkpointing reduces activation memory from O(n) to O(√n) for an n-layer model. What is the primary cost?
Further Reading
- A Recipe for Training Neural Networks — Karpathy 2019 — systematic approach to debugging and training neural networks from scratch
- PyTorch Frequently Asked Questions — PyTorch docs — common issues with memory, parallelism, and reproducibility
- PyTorch Autograd Mechanics — Official deep-dive into how autograd builds the computation graph, handles in-place ops, and propagates gradients — essential for debugging gradient issues
- Karpathy — micrograd: building autograd from scratch (YouTube) — Building a scalar-valued autograd engine from scratch — the best way to develop intuition for what PyTorch is doing under the hood
- PyTorch Compile Troubleshooting Guide — Debugging torch.compile graph breaks, dynamic shapes, and recompilations — increasingly important for modern training pipelines
Interview Questions
★★☆ Debug NaN Loss: Your training loss becomes NaN after a few hundred steps. The model was training fine initially. Here's your training loop:
```python
for batch in dataloader:
    logits = model(batch['input_ids'])
    loss = -torch.log(F.softmax(logits, dim=-1))
    loss = loss.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
What's wrong and how do you fix it?

★★☆ Debug Model Not Learning: Your model's training loss barely decreases. Validation loss stays flat. Here's the code:
```python
model = TransformerLM(config)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    for batch in train_loader:
        logits = model(batch['x'])
        loss = F.cross_entropy(logits, batch['y'])
        loss.backward()
        # optimizer.step() is called every 4 batches
        if step % 4 == 0:
            optimizer.step()
```
What's wrong?

★★☆ Debug OOM: Your model fits in GPU memory during eval but crashes with OOM during training. The model uses 8GB and you have 16GB free. Why?
```python
model = BigModel().cuda()  # 8GB
for batch in dataloader:
    outputs = model(batch.cuda())
    loss = criterion(outputs, targets.cuda())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

★★☆ Debug Wrong Accuracy: Your model gets 99% train accuracy but only 52% test accuracy (binary classification). The dataset is balanced. Here's your data pipeline:
```python
all_data = load_dataset()  # 10k samples

# Normalize using all data
mean = all_data.mean()
std = all_data.std()
all_data = (all_data - mean) / std

# Split after normalization
train = all_data[:8000]
test = all_data[8000:]
train_loader = DataLoader(train, batch_size=32)
test_loader = DataLoader(test, batch_size=32)
```

★★☆ Debug Data Leakage: You're fine-tuning a model for sentiment analysis. It gets 97% accuracy on your test set but only 60% in production. Your preprocessing:
```python
# Load and preprocess
df = pd.read_csv('reviews.csv')
df['text'] = df['text'].apply(clean_text)

# Feature engineering on FULL dataset
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['text'])  # fit on ALL data
y = df['label']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
What's causing the production gap?

★★★ Debug Slow Training: Your training is 5x slower than expected. GPU utilization is only 20%. Here's your setup:
```python
train_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=0,
    pin_memory=False
)
for batch in train_loader:
    x = batch['input'].cuda()
    y = batch['label'].cuda()
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f'Loss: {loss.item()}')
```
What's causing the slowdown?

★★★ Debug: Your distributed training job hangs after 100 steps. All GPUs show 0% utilization. `nvidia-smi` shows processes alive but idle. What do you check and how do you diagnose?

★★★ Debug: Your AMP (automatic mixed precision) training shows loss=NaN after 500 steps but works fine in fp32. Training loss looks normal for the first 499 steps. Diagnose the issue.