
Transformer Math

Module 75 · Design Reviews

🔥 Case: Design Llama Training Infra

At 16K GPUs, a GPU fails every 3 hours — design for it


Training a Llama-class model on 16K GPUs is not mainly a parallelism problem. It is a reliability problem. At that scale, the mean time between hardware failures is measured in hours, not days. A single slow rank blocks every other rank's all-reduce. A corrupted tokenizer batch can poison 2B tokens before anyone notices. A checkpoint written without an integrity check is a recovery plan that fails exactly when you need it most.

The senior insight is simple: failure is the normal operating mode. Goodput, not utilization, is the north-star metric. This case study designs the data pipeline, orchestration, checkpointing, and eval fleet around that fact. For a cross-system comparison of SLO trade-offs and failure taxonomies, see Distributed Training and Failure Taxonomy Comparison and the Cost & Eval module.

📋

Requirements & SLOs

Working backwards from the training researcher

“I submit a training run. Over the next 90 days, the cluster makes progress on a 2-trillion-token dataset without requiring me to manually restart jobs, monitor for silent failures, or wonder whether the checkpoint I'm resuming from is corrupt. When a failure happens, the system detects it within minutes and recovers to a consistent state in under 30 minutes. My eval dashboard shows me loss curves, downstream benchmark scores, and data quality metrics on a continuous basis — not just at the end of the run.”

SLO table

Metric | Target | Why this value
Goodput | ≥ 85% | Industry benchmark for well-tuned large clusters. Below 80% implies a systemic failure-detection or checkpointing gap (Dubey et al., arXiv:2407.21783, §3.3).
Cluster MTBF assumption | 4–8 h | Empirical at 16K-GPU scale (inferred from the Llama 3 paper, Dubey et al. 2024, §3); with 16K GPUs that is roughly 3–6 failures/day cluster-wide.
Checkpoint RTO | < 30 min | From failure detection to resumed training. Includes node replacement + checkpoint load + NCCL ring re-initialization.
Data pipeline throughput floor | > GPU token consumption rate × 1.2 | Pipeline must never stall training. 20% headroom absorbs IO jitter from the object store.
Eval cadence | Loss: continuous; benchmarks: every 5K steps | Loss spikes are detected in real time. Full benchmark runs gate checkpoint commits at coarser intervals.
Data quality proxy refresh | Every 500 steps | Unique n-gram rate + language distribution drift checked frequently enough to catch a tokenizer bug before it poisons 2B+ tokens.
✨ Insight · Goodput is not GPU utilization. A GPU running at 100% utilization on repeated identical micro-batches because the data pipeline stalled has 0% goodput for those steps. Goodput counts only steps whose output advances the accepted training trajectory — it penalizes failures, stalls, and corrupted batches equally.
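To make the distinction concrete, here is a minimal sketch over hypothetical step records; `StepRecord` and its fields are illustrative names, with `accepted` meaning the step advanced the committed training trajectory (not rolled back, not a replayed batch).

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    gpu_seconds: float   # wall-clock GPU-seconds spent on this step (all ranks)
    busy: bool           # GPUs showed non-idle utilization during the step
    accepted: bool       # step advanced the committed training trajectory

def utilization(steps: list[StepRecord]) -> float:
    """Fraction of GPU-seconds where the GPUs were busy -- the misleading metric."""
    total = sum(s.gpu_seconds for s in steps)
    return sum(s.gpu_seconds for s in steps if s.busy) / total

def goodput(steps: list[StepRecord]) -> float:
    """Fraction of GPU-seconds that produced accepted progress -- the north-star metric."""
    total = sum(s.gpu_seconds for s in steps)
    return sum(s.gpu_seconds for s in steps if s.accepted) / total

# A stalled data pipeline replaying identical micro-batches: busy but not accepted.
steps = [StepRecord(30.0, busy=True, accepted=False) for _ in range(100)]
print(utilization(steps), goodput(steps))  # -> 1.0, 0.0
```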
🧪

Eval Harness (continuous, not post-hoc)

For inference, the eval harness measures quality. For training, it serves a second purpose: it is the gating mechanism that decides whether a checkpoint is safe to commit. Every checkpoint that gets promoted to the object store should have passed the eval harness — otherwise a corrupted resume will go undetected until the loss curve diverges hours later.

Signals monitored

  • Continuous loss curves — training loss and gradient norm per step. Spikes beyond 3 standard deviations of the rolling window trigger a data-pipeline audit. This is the fastest signal: milliseconds to detect vs. hours for benchmark regressions.
  • Periodic held-out eval — a frozen held-out split, never seen by the model, evaluated every 5,000 steps. Divergence between training and held-out loss flags overfitting or data leakage early.
  • Downstream benchmark gating — MMLU, HumanEval, GSM8K, and domain-specific benchmarks run on the eval fleet after every 5K-step checkpoint. A checkpoint fails promotion if any benchmark regresses more than a threshold from the previous promoted checkpoint. This is the safety net that catches silent quality degradation — loss can look fine while a downstream capability collapses.
  • Loss spike detection — a streaming detector looking for step-level loss spikes. When detected, it cross-references the data loader position and flags the batch for inspection. Most spikes come from a single bad sequence, not a systematic problem.
  • Data quality proxy metrics — unique n-gram rate (diversity proxy) and language distribution drift (checks that the token bucket hasn't silently shifted to a different language mix). These are cheap to compute per-batch and catch tokenizer bugs before they propagate (see the sketch after this list).
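A per-batch sketch of the two proxies, assuming each batch record carries a language tag attached at ingest; `REFERENCE_MIX` and the 0.10 drift threshold are illustrative values, not figures from this case study.

```python
from collections import Counter

def unique_ngram_rate(token_ids: list[int], n: int = 3) -> float:
    """Diversity proxy: fraction of n-grams in the batch that are distinct.
    A sudden drop suggests duplicated or repeated data."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def language_drift(batch_langs: list[str], reference: dict[str, float]) -> float:
    """Total-variation distance between this batch's language mix and the
    reference mix fixed at ingest time. Cheap: one Counter per batch, on CPU."""
    counts = Counter(batch_langs)
    total = sum(counts.values()) or 1
    langs = set(reference) | set(counts)
    return 0.5 * sum(abs(counts.get(l, 0) / total - reference.get(l, 0.0)) for l in langs)

# Hypothetical reference mix and threshold; real values would be tuned on a healthy run.
REFERENCE_MIX = {"en": 0.65, "code": 0.20, "other": 0.15}
if language_drift(["en"] * 900 + ["code"] * 100, REFERENCE_MIX) > 0.10:
    print("ALERT: language distribution drift -- audit tokenizer and shard weights")
```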
💡 Tip · Eval-before-commit is load-bearing. The eval fleet gates checkpoint promotion — it is not a reporting dashboard. A checkpoint that writes successfully to the object store but has not passed the eval harness is stored as pending, not committed. Recovery rolls back only to the last committed checkpoint, avoiding the silent-corruption failure mode.
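The downstream-benchmark gate, as a minimal check the eval fleet could run before promoting a pending checkpoint; the benchmark names, scores, and per-benchmark regression budgets are hypothetical placeholders, not values from the source.

```python
# Hypothetical promotion gate: a pending checkpoint is committed only if no tracked
# benchmark regresses by more than its allowed delta versus the last committed one.
REGRESSION_BUDGET = {"mmlu": 0.5, "humaneval": 1.0, "gsm8k": 1.0}   # max allowed drop, points

def passes_gate(pending: dict, committed: dict) -> bool:
    return all(
        pending[name] >= committed[name] - max_drop
        for name, max_drop in REGRESSION_BUDGET.items()
    )

# Illustrative scores: a small MMLU dip within budget, other benchmarks improved.
print(passes_gate({"mmlu": 64.8, "humaneval": 41.0, "gsm8k": 55.2},
                  {"mmlu": 65.1, "humaneval": 40.2, "gsm8k": 55.0}))   # -> True
```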
🧮

Back-of-Envelope — Training vs Inference

The calculator below is seeded for an inference workload, which is what it was designed for. Read the gotcha card after adjusting the sliders — it reveals the critical difference between inference capacity planning and training capacity planning.

Scenario: 16,384 H100s, 70B model, 2T token dataset, 90-day budget. A rough compute check (sketched below): 6 × 70B parameters × 2T tokens ≈ 8.4 × 10²³ FLOPs, which at ~40% MFU on H100s is on the order of 600K GPU-hours, a few days of wall-clock on the full cluster at perfect goodput. The 90-day budget is consumed not by raw FLOPs but by everything that erodes goodput: failures, restarts, checkpoint overhead, and data stalls. The relevant axis is not “how many GPUs can serve this QPS” but how much of those GPU-hours produce accepted tokens.
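The same sanity check as a small script; the 6·N·D FLOPs-per-token rule, the 989 TFLOP/s H100 BF16 dense peak, and the 40% MFU are standard rules of thumb assumed here, not measured values for this cluster.

```python
# Back-of-envelope training compute, using the 6 * params * tokens FLOPs rule.
PARAMS     = 70e9      # model parameters
TOKENS     = 2e12      # dataset size
GPUS       = 16_384
PEAK_FLOPS = 989e12    # H100 BF16 dense peak, FLOP/s (assumed)
MFU        = 0.40      # assumed model FLOPs utilization

total_flops = 6 * PARAMS * TOKENS                       # ~8.4e23 FLOPs
gpu_hours   = total_flops / (PEAK_FLOPS * MFU) / 3600   # ~5.9e5 GPU-hours of accepted work

for goodput in (0.90, 0.70):
    wall_days = gpu_hours / (GPUS * goodput) / 24
    print(f"goodput {goodput:.0%}: ~{gpu_hours:,.0f} GPU-hours, ~{wall_days:.1f} days wall-clock")
```

The gap between this few-day compute floor and the 90-day budget is the space the rest of the case study is about: failure recovery, restarts, and everything else that erodes goodput.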

Training scenario: 16,384 H100s, 70B model, 2T token dataset, 90-day budget. Each 'request' is one training micro-batch of 4K tokens; QPS represents micro-batches/sec per model replica.

Model Size: 70B
GPU Type: H100-80GB
QPS Target: 50
Input Tokens: 4096
Output Tokens: 1
Cache Hit Rate: 0%
Model Weights (FP16): 140 GB
KV Cache / Request: 1374.7 MB (4097 tokens)
Tokens/sec per GPU: 600
Effective QPS (after cache): 50
GPUs Needed: 3
Cost / Month: $7,665
Est. p95 Latency: 0.23 s
Bottleneck: Memory (model too large for single GPU)

GPU Memory Usage: 100%
Compute Utilization: 33%
Monthly Cost: $7,665

⚠ Warning · Gotcha: This calculator measures inference capacity. Training cost is dominated by effective goodput, not raw flops. If goodput is 70% vs 90%, you are paying 30% more GPU-hours for the same tokens-seen — roughly $10–15M extra on a large cluster run. The inference calculator cannot show you this; you need a goodput model that accounts for failure rate, checkpoint overhead, and pipeline bubble.
✨ Insight · Goodput is the real axis. A 16,384-GPU cluster at $3.50/GPU-hour burns ~$57,500/hour. The difference between 70% and 90% goodput on a 90-day run is roughly 432 wasted cluster-hours × 16,384 GPUs × $3.50/h ≈ $25M, on the order of tens of millions of dollars. This is why the SLO table lists goodput first, before any latency metric.

Baseline: Llama 3 405B pretraining cluster: 16,000 GPUs @ $3.50/hr, p99 latency target 600,000 ms (10 min per step), 1 QPS, 0% cache hit.

Here the “SLO” is training throughput: iteration latency (time per step) and cost per step. Adjust GPU count or hourly rate to see how cluster spend scales. A 10% goodput drop on this cluster costs roughly $12M extra over a 90-day run (10% of the ~$121M 90-day burn).

p99 Latency Target: 600,000 ms
Peak QPS: 1
Cache Hit Rate: 0%
Effective QPS (after cache): 1
Latency-batch factor: 1.00×
GPUs Needed: 16,000 (+0% vs baseline)
Hourly Burn: $56,000 (+0% vs baseline)
Cost / Request: $15.56
Monthly Burn (24×7): $40,880,000
Bottleneck: Balanced
⚠ Warning · Gotcha: Pretraining has no cache hits and no QPS scaling — cost is entirely a function of step time × GPU count × hourly rate. Latency improvements come from reducing pipeline bubbles and collective communication overhead, not from caching or request batching.
🏛️

Architecture

Eight components, each justified by one failure mode it prevents.

Llama-Scale Training Infrastructure

Data flows left to right; the health monitor and eval fleet form feedback loops back into the orchestrator.

Data Ingest · Tokenizer Fleet · Streaming Loader · Training Orchestrator · Async Ckpt Offload · Ring-Health Monitor · Eval Fleet · Checkpoint Store

Justification table

Component | Why it exists | Failure if removed
Data Ingest | Dedup, quality filter, license gate before tokenization | Duplicate sequences inflate benchmark scores without improving generalization; license violations create legal risk
Tokenizer Fleet | Parallel BPE; deterministic seed per shard for reproducibility | Non-deterministic tokenization makes loss-spike root-cause analysis impossible — you can't replay the bad step
Streaming Loader | Maintains 1.2× headroom above GPU consumption rate; parity monitor tracks language/domain distribution | Data stall wastes GPU-hours; distribution drift causes silent capability collapse without a loss-curve signal
Training Orchestrator | 3D parallelism (DP × TP × PP); matches comms topology to hardware | Pure DP all-reduce at 70B scale saturates inter-rack bandwidth; pure PP maximizes pipeline bubble
Async Ckpt Offload | CPU shadow copy streams to object store while GPU continues training; checkpoint overhead is < 3 min per save | Synchronous checkpointing pauses training for 15–30 min per save; teams respond by checkpointing less often, raising recovery cost dramatically
Ring-Health Monitor | Detects slow/flapping ranks before the training step fails; triggers preemptive replacement from warm-standby pool | Without pre-step detection, one slow rank causes the entire step to time out — hours of wasted compute per day at 16K GPUs
Eval Fleet | Independent compute for loss curves, held-out evals, and downstream benchmark gating; gates checkpoint commits | Training on the same GPUs as eval creates contention; eval is skipped under pressure, allowing corrupted checkpoints to propagate
Checkpoint Store | Object store with dual-write + eval-gated promotion; stores pending and committed states separately | Single-write without eval gate means a corrupted checkpoint becomes the recovery target — detected only when resume fails hours later

Quick check

Trade-off

In 3D parallelism, why is tensor parallelism (TP) placed intra-node while pipeline parallelism (PP) crosses nodes?

🔬

Deep Dives — the two load-bearing components


Two components dominate the failure cost here: checkpointing and ring-health detection.

Deep dive A — Async offloaded checkpointing + resume

  • Mechanism: when the orchestrator reaches a checkpoint step, it copies model shards, optimizer states, and the data-loader cursor to CPU RAM on each node. A background process streams those buffers to the object store while the GPU continues the next training step. The CPU copy is an in-memory snapshot, so the GPU never pauses.
  • Parity with training state: the checkpoint must include (1) all model shards in their sharded form, (2) Adam first and second moments, (3) the exact data loader position (dataset shard, sequence offset, RNG state) so resumed training sees no repeated or skipped tokens.
  • Trade-off: checkpoint frequency vs recovery cost. At ~0.5 min per step, checkpointing every 100 steps with 3 min of overhead per save ≈ 5% overhead. Recovery from a failure at step 99 re-computes only 99 steps. Checkpointing every 500 steps ≈ 1% overhead but up to 499 steps of re-computation per failure.
  • Failure mode: a checkpoint that was written successfully but is actually corrupted (partial write, bit flip, truncated optimizer state). If this checkpoint is the recovery target, resume fails — and failure may not be detected until training diverges several thousand steps later. Mitigation: dual-write to two independent object store prefixes; run the eval harness on the checkpoint before promoting it to committed. Recovery always pulls from committed, never from pending.
  • Detection metric: checkpoint-overhead ratio = (total time spent in checkpoint saves) / (total training wall time). Target: < 2%. If this rises above 5%, the async offload pipeline has stalled — typically because the object store write bandwidth is saturated. Alert at 3%, page at 7%.
  • Real-world example: PyTorch's distributed checkpoint library (pytorch.org/docs/stable/distributed.checkpoint) formalizes exactly this pattern — sharded state dicts saved asynchronously per rank, with a two-phase commit (write to staging, rename to final) that prevents partial checkpoints from poisoning the recovery target; a minimal version of the pattern is sketched after this list. The Llama 3 paper (Dubey et al., 2024) reports that async checkpointing allowed saves every few hundred steps on the 405B run without measurable impact on training throughput, and that checkpoint integrity was verified before each commit step to guard against the corrupted-file failure mode.
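A single-node sketch of the pending-to-committed pattern, with the local filesystem standing in for the object store. Sharded multi-rank state dicts, the real S3 client, and the eval-fleet call are all elided, and every path and function name here is illustrative.

```python
import hashlib
import json
import os
import shutil
import threading

def _sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def async_checkpoint(cpu_snapshot_path: str, store_root: str, step: int) -> threading.Thread:
    """Stream an already-materialized CPU snapshot to the store in the background.
    The GPU has already moved on to the next training step when this is called."""
    def _upload():
        pending = os.path.join(store_root, "pending", f"step_{step}")
        os.makedirs(pending, exist_ok=True)
        dst = os.path.join(pending, "shard.pt")
        shutil.copyfile(cpu_snapshot_path, dst)            # stand-in for the object-store write
        manifest = {"step": step, "sha256": _sha256(dst)}   # integrity record stored with the data
        with open(os.path.join(pending, "manifest.json"), "w") as f:
            json.dump(manifest, f)
    t = threading.Thread(target=_upload, daemon=True)
    t.start()
    return t

def promote_if_valid(store_root: str, step: int, eval_passed: bool) -> bool:
    """Two-phase commit: called after the upload thread has joined and the eval
    fleet has scored the pending checkpoint. Recovery only ever reads committed/."""
    pending = os.path.join(store_root, "pending", f"step_{step}")
    with open(os.path.join(pending, "manifest.json")) as f:
        manifest = json.load(f)
    intact = _sha256(os.path.join(pending, "shard.pt")) == manifest["sha256"]
    if not (intact and eval_passed):
        return False                                        # stays pending, never a recovery target
    committed_dir = os.path.join(store_root, "committed")
    os.makedirs(committed_dir, exist_ok=True)
    os.replace(pending, os.path.join(committed_dir, f"step_{step}"))  # atomic rename
    return True
```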

Deep dive B — Ring-health detection + preemptive node replacement

  • Mechanism: before each training step, every rank broadcasts a heartbeat through the NCCL ring. A rank that misses two consecutive heartbeats is flagged as slow/flapping. The health monitor signals the orchestrator, which pulls a replacement node from the warm-standby pool, re-initializes its model shard from the last committed checkpoint, and reinserts it into the ring — all before the training step is attempted.
  • Trade-off: false-positive replacements (a healthy node kicked due to transient network noise) waste a standby node and incur ~5 minutes of re-initialization overhead. Cascade failures from an undetected slow rank waste hours. The asymmetry strongly favors replacing on suspicion rather than waiting for confirmation.
  • Failure mode: correlated failures. A rack-level power event drops 64 GPUs simultaneously — they all miss the heartbeat at the same time. The ring-health monitor must distinguish “one flapping rank” (replace silently) from “64 ranks gone at once” (declare a cluster event, roll back to last committed checkpoint, bring up 64 replacements in parallel). Mitigation: correlated-failure threshold (if N > 8 ranks go silent simultaneously, escalate to cluster-event recovery). Rack-level spread ensures that no single rack holds more than 1/16 of the pipeline stages, so a rack failure loses at most 6% of the cluster rather than the entire pipeline.
  • Cross-rack checkpoint replication: the last committed checkpoint is replicated to at least two storage endpoints in different failure domains. This prevents the pathological case where the rack that failed also held the only copy of the checkpoint.
  • Detection metric: step-time histogram p99 (monitor sketch after this list). On a healthy 16K-GPU cluster, p99 step time should be within 10% of median. A flapping rank inflates p99 by 5–20× before the NCCL timeout fires. Alert threshold: p99 > 1.5× median for three consecutive steps. Page threshold: p99 > 3× median or any step exceeding 10 minutes (indicating a hung collective).
  • Real-world example: Meta's engineering blog post “Building Meta's GenAI Infrastructure” describes how the Llama 3 training runs used automated fault-detection and node-replacement pipelines that handled thousands of individual hardware failures across the 54-day 405B training run. The system distinguished single-node failures (handled via warm-standby replacement within minutes) from correlated rack-level events (triggering cluster-level checkpointed rollback). The Dubey et al. (2024) paper additionally attributes part of the run's high effective training time to the ring-health infrastructure keeping undetected-failure windows under 30 minutes — small enough that the re-computation cost per incident stayed well below 1% of total compute budget.
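A sketch of the decision logic, assuming heartbeats arrive on an out-of-band channel and that replacement and rollback are carried out elsewhere by the orchestrator. `MISS_LIMIT` and `CORRELATED_THRESHOLD` mirror the thresholds above, but the class itself and the simplified p99 check are illustrative.

```python
from collections import defaultdict

MISS_LIMIT = 2             # consecutive missed heartbeats before a rank is flagged
CORRELATED_THRESHOLD = 8   # more than this many silent ranks at once => cluster event

class RingHealthMonitor:
    """Per-step heartbeat tracking. Decides between a quiet warm-standby
    replacement and a cluster-level rollback to the last committed checkpoint."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self.missed = defaultdict(int)   # rank -> consecutive missed heartbeats

    def on_step(self, ranks_heard_from: set) -> dict:
        for rank in range(self.world_size):
            self.missed[rank] = 0 if rank in ranks_heard_from else self.missed[rank] + 1
        flagged = [r for r, misses in self.missed.items() if misses >= MISS_LIMIT]
        if len(flagged) > CORRELATED_THRESHOLD:
            return {"action": "cluster_event", "ranks": flagged}   # rack/power event: roll back
        if flagged:
            return {"action": "replace", "ranks": flagged}         # pull nodes from warm standby
        return {"action": "none", "ranks": []}

def step_time_alert(last_steps_s: list) -> str:
    """Simplified p99-vs-median check from the detection-metric bullet
    (the real rule requires three consecutive breaches before alerting)."""
    s = sorted(last_steps_s)
    median, p99 = s[len(s) // 2], s[min(int(len(s) * 0.99), len(s) - 1)]
    if p99 > 3 * median or p99 > 600:     # 600 s = hung-collective page threshold
        return "page"
    if p99 > 1.5 * median:
        return "alert"
    return "ok"
```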
⚠ Warning · Both components are operationally invisible when working. The training researcher sees a smooth loss curve and doesn't know that the ring-health monitor replaced three nodes overnight and the async checkpoint offloader ran 840 saves without pausing training. This is the correct outcome — failure should be handled below the researcher's attention layer. The cost of getting it wrong is that the researcher notices, which means days of investigation and millions of dollars of wasted compute.
Quick Check

Your cluster is running at 88% goodput. A teammate proposes disabling the ring-health monitor to eliminate its ~0.5% overhead on step time. What is the correct response?

🔧

Break It

Three removals, each with a cascade through goodput, cost, and recovery time.

Remove async checkpointing — use synchronous saves

Synchronous checkpointing pauses training for 15–30 minutes per save on a 70B model. Teams respond by reducing save frequency to every 1,000+ steps to keep overhead below 3%. But now a failure at step 999 re-computes 999 steps — at $3.50/GPU-hour that is on the order of 16,384 GPUs × $3.50/h × (999 steps × ~0.5 min/step ≈ 8.3 h) ≈ $475K per incident. Goodput drops ~20% during hardware-churn periods when failures cluster. Detection: checkpoint-overhead metric exceeds 3% of total compute budget. Mitigation: async offload restores low overhead at high frequency — 3 min per save regardless of model size, because the GPU is never paused.
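A rough model of the trade-off, under the same illustrative assumptions used elsewhere in this case (0.5 min/step, 6 h cluster MTBF, 16,384 GPUs at $3.50/h); the intervals and save costs are inputs to play with, not recommendations.

```python
# Expected daily cost of checkpointing = save overhead + re-computation after failures.
STEP_MIN   = 0.5        # minutes per training step (assumed)
MTBF_H     = 6.0        # cluster-wide mean time between failures, hours (assumed)
GPUS, RATE = 16_384, 3.50
CLUSTER_HOURLY = GPUS * RATE            # ~$57K per wall-clock hour

def daily_cost(interval_steps: int, save_min: float) -> float:
    steps_per_day    = 24 * 60 / STEP_MIN                # first-order: ignore overhead feedback
    overhead_h       = (steps_per_day / interval_steps) * save_min / 60
    failures_per_day = 24 / MTBF_H
    # A failure lands, on average, halfway through an interval: recompute interval/2 steps.
    recompute_h      = failures_per_day * (interval_steps / 2) * STEP_MIN / 60
    return (overhead_h + recompute_h) * CLUSTER_HOURLY

for label, save_min in (("async (3 min/save)", 3.0), ("sync (20 min/save)", 20.0)):
    for interval in (100, 500, 1000):
        print(f"{label:>20}  every {interval:>4} steps: ${daily_cost(interval, save_min):>9,.0f}/day")
```

Under these assumptions, frequent async saves dominate every synchronous configuration: stretching the interval to hide the synchronous save cost just moves the bill to re-computation.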

Remove the ring-health monitor

Without pre-step detection, the first signal of a flapping rank is an all-reduce timeout — a collective that hangs until the NCCL timeout fires (typically 30–60 min). At 16K GPUs a single timeout wastes 16,384 × 0.5 h × $3.50/h ≈ $28,700 of compute per incident. With a cluster MTBF of 6 hours, this happens 4× per day — effectively eliminating any goodput above 70%. The monitor adds ~0.5% to step time; the absence of it costs 18–20% of goodput. Detection: step-time histogram develops a fat tail of 30+ min outliers. Mitigation: re-enable the monitor; tune the heartbeat threshold to reduce false positives without removing the detector.

Skip data-pipeline parity monitoring

Without language-distribution and unique-n-gram monitoring, a tokenizer bug that doubles the weight of one language domain goes undetected. Training loss may actually improve (the model over-fits the over-represented domain), making this a silent failure. The signal only surfaces when downstream benchmarks show capability collapse in under-represented domains — by which time, 1–2 weeks of compute may be wasted. Detection: per-batch language distribution drift alert, computed with a cheap sliding histogram — zero cost on the GPU, runs on the streaming loader's CPU. Mitigation: halt the loader, patch the tokenizer, backfill the affected shard, resume. Total cost: 1–2 days of backfill compute vs. 1–2 weeks of wasted training.

Quick check

Trade-off

When synchronous checkpointing is used instead of async offload, teams respond by reducing checkpoint frequency. Why does this make recovery worse, not better?

💸

What does a bad day cost?

Three incident modes, each with a different detection window and a very different cost curve. The key asymmetry: detection speed dominates cost more than the severity of the event.

  • Rack power event — 64 GPUs lost simultaneously. With ring-health detection and a warm standby pool: detected in ~2 min, replacement nodes online in ~10 min, checkpoint restored in ~20 min. Total re-computation cost: ~30 min × 16,320 remaining GPUs × $3.50/h ≈ $28,560. Without the monitor: NCCL timeout fires after 30–60 min, then manual investigation, then recovery. Total cost: 3–6 h × 16,384 GPUs × $3.50/h ≈ $170K–$340K. Same event, roughly 10× cost difference based purely on detection speed (costed in the sketch after this list).
  • Tokenizer bug corrupts 1% of data. Detected by the parity monitor within 500 steps: backfill 1% of one domain's shard, ~1–2 days of preprocessing compute. Not detected until evals show capability collapse: 1–2 weeks of training compute discarded plus backfill. Cost ratio: ~50× difference between early detection and late detection.
  • Cluster hang undetected for 3 hours — monitoring gap. A network partition isolates 512 GPUs. The ring-health monitor is disabled during a maintenance window and not re-enabled. All 16,384 GPUs hang waiting for the collective. No alarm fires because the monitoring gap also disabled the step-completion watchdog. 3 hours × 16,384 GPUs × $3.50/h ≈ $172,000 of pure wasted compute — training makes zero progress. No data is corrupted, no checkpoint is lost. Just compute burned. This is the most straightforward failure: it produces a flat loss curve with a 3-hour gap that is immediately obvious in retrospect.
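The same asymmetry as a two-line calculation; the detection and recovery times are the illustrative figures from the bullets above.

```python
GPUS, RATE = 16_384, 3.50

def incident_cost(detect_min: float, recover_min: float) -> float:
    """While a collective hangs, every rank in the job burns money, not just the failed rack."""
    return GPUS * RATE * (detect_min + recover_min) / 60

print(f"ring-health + warm standby:    ${incident_cost(2, 30):,.0f}")    # ~$30K
print(f"NCCL timeout + manual recovery: ${incident_cost(45, 180):,.0f}")  # ~$215K
```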
⚠ Warning · Fast-detected failures are cheap; slow-detected failures are catastrophic. The rack event and the cluster hang cost roughly the same at 3 hours of undetected failure. But the rack event with good detection costs less than $30K. The asymmetry is not the failure mode — it is the detection window. Every architectural choice that shrinks detection latency (ring-health monitor, continuous loss alerting, parity monitoring) is actually a cost-reduction investment, not an operational overhead.
🚨

On-call Runbook

GPU fleet segment failure at step 42k

MTTR p50 / p99: 20–40 min with warm standby; 3–6h without

Blast radius: 64–512 GPUs lost; all-reduce hangs for remaining 15,872 ranks; training progress stops

  1. Detect: Ring-health monitor fires within 2 min; step-completion watchdog triggers after 5 min if ring-health is degraded
  2. Escalate: On-call infra engineer paged; warm standby pool activated; NCCL collective torn down and rebuilt
  3. Rollback: Restore from last async checkpoint (~10 min lag); replay 1–3 steps from the data loader position log (cursor sketch after this runbook)
  4. Post: Root-cause faulty NIC or power rail; file hardware ticket; add correlated-failure signature to ring-health monitor
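A sketch of the data-loader position log referenced in step 3; the field names are illustrative, and in practice each rank serializes its own cursor inside the sharded checkpoint.

```python
import random
from dataclasses import dataclass

@dataclass
class LoaderCursor:
    """Data-loader state saved inside every checkpoint so a resumed run sees
    no repeated and no skipped tokens. Field names here are illustrative."""
    shard_id: int          # which dataset shard this rank was reading
    sequence_offset: int   # sequence index within that shard
    rng_state: object      # shuffle RNG state captured at checkpoint time

def capture_cursor(shard_id: int, sequence_offset: int, rng: random.Random) -> LoaderCursor:
    return LoaderCursor(shard_id, sequence_offset, rng.getstate())

def resume_from(cursor: LoaderCursor, rng: random.Random) -> tuple:
    rng.setstate(cursor.rng_state)   # deterministic shuffle continues mid-stream
    return cursor.shard_id, cursor.sequence_offset
```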

Checkpoint corruption on S3 multipart upload

MTTR p50 / p99: 30–90 min depending on checkpoint interval

Blast radius: Latest checkpoint unreadable; rollback falls to previous checkpoint, losing N steps of compute

  1. Detect: Integrity check (md5/etag verification) on upload completion; detected immediately at write time
  2. Escalate: Training job paused; checkpoint service alerts; previous valid checkpoint identified automatically
  3. Rollback: Restore from penultimate checkpoint; replay missing steps from data pipeline position
  4. Post: Add write-verify step to checkpoint service; consider dual-write with cross-region replication for final checkpoints

Loss divergence after LR schedule miscalibration

MTTR p50 / p99: 2–8h depending on how far past divergence point the run continued

Blast radius: Training loss spikes and fails to recover; potentially hundreds of GPU-hours wasted before detection

  1. Detect: Loss-spike detector compares smoothed loss delta against rolling baseline; alerts within 50–200 steps
  2. Escalate: Research-infra oncall + training scientist paged; run paused; LR schedule and optimizer state audited
  3. Rollback: Restore checkpoint from before divergence; correct LR schedule config; resume with conservative warmup
  4. Post: Add LR schedule validation to pre-launch checklist; gate checkpoint promotion on loss monotonicity check for first 500 steps

Quick check

Derivation

A rack power event drops 64 GPUs from a 16K cluster. With ring-health detection the cost is ~$28K. Without it, the cost rises to $170K+. What mechanism causes the 6× gap?

🏢

Company Lens

Meta's push

Meta runs training at the scale this module describes (Dubey et al., arXiv:2407.21783, §3.3). Their interview bar emphasizes fault-tolerance primitives and infra scalability. Expect drill-downs on: “How does your ring-health monitor handle a correlated failure event?” “What is the exact recovery sequence from checkpoint to first accepted training step?” “How do you ensure the data loader position is consistent across 16K ranks after a partial failure?” Meta values engineers who can reason about the full stack: hardware failure rates, NCCL topology, checkpoint semantics, and data-pipeline throughput — not just the PyTorch-level abstractions.

OpenAI's push

OpenAI runs training clusters at comparable scale and has historically pushed on capacity planning and training-science collaboration. Expect questions like: “How do you balance researcher velocity against infra reliability constraints?” (i.e., when does a researcher's experimental run get pre-empted for a production training run?) “How does the eval fleet gate checkpoint promotion without slowing down the research cycle?” “If the research team wants to increase batch size mid-run, what does that change in your architecture?” The OpenAI bar rewards understanding the organizational interface between infra and research, not just the technical stack.

🧠

Key Takeaways

What to remember for interviews

  1. At 16K GPUs, failure is the normal operating mode. Design the entire stack around minimizing undetected failure windows, not around preventing failures.
  2. Goodput (accepted tokens / total GPU-hours) is the north-star metric. GPU utilization, step time, and throughput are leading indicators; goodput is the bottom line.
  3. Async checkpointing is load-bearing, not optional. Synchronous saves force teams to checkpoint rarely, turning every failure into a multi-hour recovery event.
  4. The ring-health monitor saves orders of magnitude more compute than it costs. 0.5% step overhead vs. 30-min NCCL timeouts at 16K GPU scale is not a real trade-off.
  5. Data-pipeline parity monitoring is the only guard against silent distribution drift. Loss curves look fine when one domain is over-represented — only n-gram rate and language distribution catch it early.
  6. Fast detection is worth more than failure prevention. The same rack power event costs $30K with 2-minute detection and $300K with 3-hour detection. Every architectural choice should be evaluated on its contribution to detection latency, not just on its nominal function.
🎯

Interview Questions


You're running a 16K GPU training job. A rack-level power event drops 64 GPUs simultaneously. Walk through exactly what happens and how you recover.

★★★
MetaOpenAI

Your team is debating checkpoint frequency: every 100 steps vs every 500 steps on a 16K H100 cluster. How do you decide?

★★☆
MetaGoogle

Your loss curve shows a sharp spike at step 48,000, then returns to trend. The checkpoint at step 47,900 looks clean. What do you investigate and in what order?

★★★
AnthropicOpenAI

An interviewer asks: 'Why do we use 3D parallelism instead of just scaling data parallelism?' Give the constraints-first answer.

★★☆
GoogleMeta

A research engineer says: 'Our goodput is 72%. That means we're wasting 28% of our compute.' The infra lead replies: 'No, that's fine — we target 85%.' Who is right, and what should the real target be?

★★☆
MetaAnthropic
🧠

Recap quiz

🧠

Training Infrastructure recap

Trade-off

A GPU is running at 100% utilization but the data pipeline stalled, so all steps are identical micro-batch replays. What is the goodput for those steps?

Derivation

At 16K H100s ($3.50/h spot) with async checkpointing (3 min per save), checkpointing every 100 steps vs every 500 steps changes the expected recovery cost how?

Derivation

The ring-health monitor adds ~0.5% to step time. A flapping rank without it causes a 30-min NCCL timeout at 16K GPUs. Which argument correctly captures the cost asymmetry?

Derivation

Why can't a 70B model be trained with data parallelism alone on 16K H100s, even though each H100 has 80 GB VRAM?

Trade-off

The eval harness gates checkpoint promotion. A checkpoint writes successfully to the object store but fails the held-out eval. What state should the system assign it, and why?

Derivation

A rack power event drops 64 GPUs. With 2-min ring-health detection the cost is ~$28K. Without monitoring the NCCL timeout (30–60 min) costs ~$28K–$57K in idle compute, plus manual investigation time. What single variable drives this difference?

Trade-off

A tokenizer bug doubles the weight of one language domain. Training loss actually improves. What is the earliest signal that catches this before weeks of compute are wasted?

📚

Further Reading