
Transformer Math

Module 14 · Training

📈 Scaling Laws

Why Llama-2 beats GPT-3 with less than half the parameters


How big should the model be? How much data? Scaling laws give precise, predictable answers. Chinchilla showed that model size and data should scale equally — and that most models before 2022 were severely undertrained.

🎮

Scaling Laws Visualized

What you’re seeing: loss curves plotted against model size, dataset size, and compute budget — each follows a smooth power law, and the Chinchilla frontier shows where model size and tokens should be balanced for a given FLOP budget. What to try: use the compute budget planner below to see how Chinchilla-optimal allocation shifts the parameter-to-token ratio as you scale up.

[Chart] Chinchilla Scaling: Parameters vs Training Tokens. For a given compute budget, scale model size and data equally. Axes: training tokens (log scale, 300B to 15T) vs parameters (log scale, 7B to 540B). The Chinchilla-optimal frontier separates the over-parameterized region (too few tokens) from the under-parameterized region (too few params). Plotted models: GPT-3 (175B / 300B tok, over-parameterized), Chinchilla (70B / 1.4T tok, optimal), Llama-2 (70B / 2T tok, compute-optimal), Llama-3 (70B / 15T tok, over-trained).

Set a compute budget and see the Chinchilla-optimal model size, training tokens, and estimated cost. The planner uses the formula N = √(C/120) and D = 20N.

The budget slider spans 10^18 to 10^25 FLOPs. Example output at C = 10^21 FLOPs:

Metric | Value
Compute (C) | 10^21.0 FLOPs
Optimal Model Size (N) | 2.9B params
Optimal Tokens (D) | 57.7B tokens
Verify: 6ND | 10^21.0 FLOPs
A100-hours (est.) | 2.2K hours
Training Cost (@ $2/A100-hr) | $4.4K
💡

The Intuition

Kaplan et al. (2020) at OpenAI discovered that LLM loss scales as a smooth power law with model size, dataset size, and compute. This was revolutionary: you could predict the performance of a 100B model from experiments on 1M-parameter models.

Chinchilla (2022) from DeepMind corrected a critical mistake: Kaplan suggested scaling model size faster than data. In reality, parameters and tokens should scale in equal proportion for a fixed compute budget. This explained why GPT-3 (175B params, 300B tokens) underperformed relative to its compute cost.

The practical impact was enormous: Llama-2 70B, trained on 2T tokens, can outperform GPT-3 175B on many downstream evaluations while using less than half the parameters. It used more data per parameter, trading model size for cheaper inference.

✨ Insight · Scaling laws turned LLM training from alchemy into engineering. You can now predict the final loss — and therefore the compute budget — before training a single step. This is why labs run small-scale experiments first and extrapolate.

Quick check

Derivation

GPT-3 used 175B params and 300B tokens. Chinchilla says ~20 tokens/param. How many tokens was GPT-3 short by?


Emergent Abilities — Real or Artifact?

Wei et al. (2022) observed that certain capabilities — multi-step arithmetic, chain-of-thought reasoning, word-in-context understanding — appeared to emerge discontinuously: near-zero performance below a threshold compute level, then sharp gains above it. This was widely interpreted as a qualitative phase transition in model capability.

Schaeffer et al. (2023) challenged this view, showing that the apparent discontinuity is largely a measurement artifact. When you switch from a nonlinear metric (exact-match accuracy, which gives 0 for nearly-correct answers) to a linear metric (token-level log probability), the same models show smooth, continuous improvement. The practical implication: emergent abilities are real in the sense that useful task performance crosses a human-meaningful threshold at a certain scale, but the underlying capability growth is smooth and predictable — which means you can extrapolate it from smaller runs using the right metrics.

Inference-time Scaling Laws: OpenAI's research on o1 and o3 shows test-time compute follows its own scaling law — increasing inference compute via sampling, verification, and search can materially improve reasoning performance. The implication is that for reasoning-heavy problems, it can be cheaper to think harder at inference than to train a much bigger model. This opens a new axis of scaling distinct from the Chinchilla compute-optimal frontier.

Post-Chinchilla Overtrained Models: Both Llama-3 8B and 70B were trained on up to 15T tokens (per Meta's April 2024 release). Why deliberately over-train? Because Chinchilla optimizes for training compute efficiency, not deployment cost. Smaller models are cheaper to serve per query, so training a compact model for far longer produces better quality-per-FLOP at inference time. The insight: the Chinchilla optimal point is right only if training cost is your bottleneck. When serving millions of queries, inference cost dominates — making over-trained small models the economically rational choice.
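To make the economics concrete, here is a toy total-compute model. It is a sketch with illustrative numbers; real crossover points depend on traffic, hardware, and batching efficiency, and the 2 FLOPs/param/token inference figure is the usual forward-pass approximation.

python
# Toy comparison: Chinchilla-optimal 70B vs over-trained 8B, including serving compute.
def total_flops(params, train_tokens, queries, tokens_per_query):
    train = 6 * params * train_tokens                    # training: ~6 FLOPs/param/token
    serve = 2 * params * queries * tokens_per_query      # inference: ~2 FLOPs/param/token (forward only)
    return train + serve

# Illustrative workload: 10B queries at ~1k generated tokens each
big   = total_flops(70e9, 1.4e12, 1e10, 1000)   # 70B, Chinchilla-optimal data
small = total_flops(8e9, 15e12, 1e10, 1000)     # 8B, heavily over-trained (Llama-3 style)
print(f"70B total: {big:.2e} FLOPs | 8B total: {small:.2e} FLOPs")
# At this query volume the over-trained 8B wins on lifetime compute despite its higher training cost.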

Quick Check

According to Chinchilla, if you double your compute budget, how should you split it?

📐

Step-by-Step Derivation

Kaplan Scaling Law (2020)

Loss scales as a power law with model size (non-embedding parameters):

L(N) ∝ N^(−α_N), with α_N ≈ 0.076; analogous power laws hold for dataset size D and compute C.
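To see what the small exponent implies in practice, here is the arithmetic as a quick sketch:

python
# Kaplan power law: L(N) ∝ N^(-0.076)
# To halve loss: (N2 / N1)^0.076 = 2  →  N2 / N1 = 2^(1 / 0.076)
alpha_N = 0.076
growth = 2 ** (1 / alpha_N)
print(f"Halving loss needs ~{growth:,.0f}x more parameters")  # roughly 9,000x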

Chinchilla Optimal Scaling (2022)

For a fixed compute budget C (with C ≈ 6ND), optimal model size and data scale equally:

N_opt ∝ C^0.5, D_opt ∝ C^0.5, giving the rule of thumb D ≈ 20N.

💡 Tip · Rule of thumb: train on ~20 tokens per parameter. A 7B model needs ~140B tokens. A 70B model needs ~1.4T tokens. Modern practice pushes beyond this for inference efficiency.

Python: Compute-Optimal Model Size Calculator

python
# Chinchilla scaling law: C = 6ND, D_opt = 20 * N_opt
# Substituting: C = 6 * N * 20N = 120N² → N_opt = sqrt(C / 120)
def chinchilla_optimal(flops_budget):
    """Given a FLOP budget, compute optimal model size and data."""
    N = (flops_budget / 120) ** 0.5  # optimal params
    D = 20 * N                        # optimal tokens (20 tokens per param)
    return {"params": N, "tokens": D, "flops": flops_budget}

# Example: 1e24 FLOPs (roughly Chinchilla's budget)
result = chinchilla_optimal(1e24)
# → ~91B params, ~1.8T tokens

# GPT-3 budget: 3.15e23 FLOPs
gpt3 = chinchilla_optimal(3.15e23)
# → ~51B params, ~1.0T tokens
# GPT-3 used 175B params + 300B tokens — model was ~3.4× too large for the data budget
✨ Insight · Chinchilla changed the industry: GPT-3 was ~3× too large for its data budget. For the same compute, a 70B model trained on 1.4T tokens beats a 175B model trained on 300B tokens.

Training Compute Approximation

Total FLOPs for training a Transformer with N parameters on D tokens:

C ≈ 6ND (roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass).
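A quick sanity check of the formula against GPT-3's published numbers (the 2 + 4 split is the standard forward/backward accounting):

python
# C ≈ 6ND: ~2 FLOPs/param/token for the forward pass, ~4 for the backward pass
N = 175e9    # GPT-3 parameters
D = 300e9    # GPT-3 training tokens
C = 6 * N * D
print(f"GPT-3 training compute ≈ {C:.2e} FLOPs")   # ≈ 3.15e+23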

Quick check

Derivation

In C ≈ 6ND, the factor of 6 comes from forward + backward pass FLOPs per parameter. What is the correct breakdown?


Learning Rate Schedule: Warmup + Cosine Decay

Linear warmup for the first W steps, then cosine decay from η_max to η_min over T total steps:

η(t) = η_max · t / W for t < W, and η(t) = η_min + ½ (η_max − η_min) (1 + cos(π (t − W) / (T − W))) for W ≤ t ≤ T.

PyTorch: Cosine LR Scheduler with Warmup

python
import math

def get_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """Linear warmup + cosine decay schedule."""
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    # Cosine decay
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: GPT-3 style schedule (units are steps, not tokens)
# max_lr=6e-5, min_lr=6e-6
# warmup_steps=375, total_steps=300_000
# (assumes batch_size=1M tokens: 375M-token warmup / 1M = 375 steps)
Python: Chinchilla Loss Function & Compute-Optimal Allocation

python
# Chinchilla scaling loss: L(N, D) = E + A/N^alpha + B/D^beta
# Hoffmann et al. 2022 fitted constants
E = 1.69   # irreducible entropy loss
A = 406.4
B = 410.7
alpha = 0.34
beta = 0.28

def chinchilla_loss(N, D):
    """Predict cross-entropy loss given model params N and data tokens D."""
    return E + A / (N ** alpha) + B / (D ** beta)

def optimal_allocation(C):
    """Chinchilla-optimal N and D for a fixed FLOP budget C = 6ND."""
    N = (A * alpha / (B * beta)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

N_opt, D_opt = optimal_allocation(3.15e23)  # GPT-3 compute budget
print(f"Optimal: {N_opt/1e9:.1f}B params, {D_opt/1e12:.1f}T tokens")
print(f"Predicted loss: {chinchilla_loss(N_opt, D_opt):.3f}")

Quantization: Precision vs. Memory

FP32 (4 bytes): training master weights, full precision. 70B model = 280GB.

FP16/BF16 (2 bytes): standard training and inference. 70B = 140GB. BF16 preferred for its wider exponent range.

INT8 (1 byte): post-training quantization, <1% quality loss. 70B = 70GB. LLM.int8() handles outlier features in FP16.

INT4 (0.5 bytes): aggressive quantization via GPTQ/AWQ. 70B ≈ 35GB, which fits on a single 48GB GPU. Quality loss is larger than INT8 but often acceptable for deployment.
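The memory arithmetic above as a minimal sketch (weights only; the KV cache, activations, and optimizer states add more on top):

python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate weight-only memory footprint in GB."""
    return n_params * bytes_per_param / 1e9

for fmt, size in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{fmt:10s}: {weight_memory_gb(70e9, size):.0f} GB for a 70B model")
# FP32 280 GB, FP16/BF16 140 GB, INT8 70 GB, INT4 35 GB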

Serving: Continuous Batching & PagedAttention

Continuous batching (Orca): insert new requests as sequences complete, instead of waiting for the whole batch. GPU utilization jumps from ~30% to 90%+.

PagedAttention (vLLM): manages the KV cache like virtual-memory pages, allocating small blocks on demand instead of pre-allocating the maximum sequence length, which eliminates most KV-cache fragmentation.

Model routing: use a small model for easy queries and route hard queries to the large model. Reduces average cost by 50-80% with minimal quality loss.
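A minimal routing sketch, assuming hypothetical small_model and large_model callables that return a response plus a confidence score; this is not any specific library's API:

python
def route(query, small_model, large_model, threshold=0.8):
    """Serve easy queries from the small model; escalate low-confidence ones."""
    draft, confidence = small_model(query)    # hypothetical: returns (text, score in [0, 1])
    if confidence >= threshold:
        return draft                           # cheap path: most traffic stops here
    return large_model(query)                  # expensive path: only hard queries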

🔧

Break It — See What Happens

Train too few tokens (GPT-3 style)
Train far beyond Chinchilla-optimal
Undertrain (0.1× Chinchilla Tokens) — The Pre-Chinchilla Default
📊

Real-World Numbers

Publicly reported or commonly cited figures, with estimates labeled explicitly. These are useful for intuition, not as exact accounting records.

Model | Params | Tokens | Tokens/Param | Est. Cost
GPT-3 | 175B | 300B | 1.7× | ~$4.6M
Chinchilla | 70B | 1.4T | 20× | Same as Gopher
Llama-2 | 70B | 2T | 29× | ~$2M
Llama-3 | 70B | 15T | 214× | ~$10M+
GPT-4 * | ~1.8T (est.) | ~13T (est.) | ~7× | ~$100M

* GPT-4 architecture details not officially confirmed by OpenAI — figures are community estimates from leaks and benchmarking.

✨ Insight · Notice the trend: tokens/param ratio keeps increasing. Labs are intentionally over-training relative to Chinchilla because inference cost dominates. A smaller, longer-trained model is cheaper to serve at scale.

Quick check

Trade-off

Llama-3 8B trained on 15T tokens. Chinchilla-optimal for 8B is ~160B tokens. What does the 94× over-training ratio reveal about the real optimization target?

⏱️

Test-Time Compute as the Third Axis (2024–2025)

Kaplan and Chinchilla defined two axes: model parameters (N) and training tokens (D). A third axis emerged in 2024 — compute spent at inference time. The insight: a smaller, cheaper model given more thinking steps can match or beat a larger model on reasoning tasks.

o3 vs GPT-4o: same model family, vastly more inference compute

OpenAI's o3 is reported to build on the same base model family as GPT-4o but applies extended chain-of-thought reasoning at inference. On ARC-AGI (a test explicitly designed to resist LLM shortcuts):

  • GPT-4o: 5% accuracy
  • o3 (high-compute setting): 87.5% accuracy, with the high-compute configuration spending 172× the inference compute of o3's low-compute setting

Source: ARC Prize blog, Dec 2024

Deep dive: Snell et al. 2024 — scaling inference compute

Snell et al. (2024, arxiv:2408.03314) systematically studied how inference-time compute scales for language models. Key finding: a model 14× smaller than a frontier model, given extended thinking budget (beam search, self-refinement, verifier-guided search), matches the larger model on medium-difficulty math tasks.

The gain is task-difficulty-dependent. Easy tasks are already solved in one pass; hard tasks require domain knowledge no amount of test-time compute can substitute for. The sweet spot is medium-difficulty: problems the model “almost knows” — where search and self-correction surface the right answer.

Practical implication: serving cost is not fixed at training time. A 7B model with a verifier loop can replace a 70B model for the right task distribution, at 10× lower per-token cost.

Source: Snell et al. 2024, arxiv:2408.03314
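As a toy illustration of spending extra inference compute, here is a best-of-N sketch with a verifier. The model.sample and verifier.score calls are hypothetical stand-ins, not the paper's implementation (which also studies beam search and sequential self-refinement):

python
def best_of_n(prompt, model, verifier, n=16):
    """Sample N candidate answers and keep the one the verifier scores highest."""
    candidates = [model.sample(prompt, temperature=0.8) for _ in range(n)]
    return max(candidates, key=verifier.score)   # more samples = more inference compute (up to a point, better answers)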

Deep dive: Sparse MoE and Chinchilla-optimality

The Chinchilla rule applies to active parameters, not total. DeepSeek-V3 has 671B total parameters but only 37B active per token. The Chinchilla-optimal question is: “How many tokens should we train this 37B-active-param model on?” — not “how many for a 671B dense model?”

At 37B active params, Chinchilla-optimal training is roughly 37B × 20 ≈ 740B tokens. DeepSeek-V3 was trained on 14.8T tokens — ~20× Chinchilla-optimal for the active parameter count. This is deliberate over-training (the same pattern as Llama-3) because inference cost dominates at scale.
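The arithmetic as a quick check:

python
active_params = 37e9
chinchilla_tokens = 20 * active_params      # ≈ 740B tokens for the active parameter count
actual_tokens = 14.8e12                     # DeepSeek-V3's reported training tokens
print(f"Over-training ratio: {actual_tokens / chinchilla_tokens:.0f}x")   # ≈ 20x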

The 671B total parameters store specialised knowledge across 256 fine-grained experts. The scaling law governs compute efficiency; the large expert pool governs knowledge capacity. These are independent dimensions.

Updated frontier models (2024–2025)

Model | Release | Params (active) | Tokens | Significance
DeepSeek-V3 | Dec 2024 | 37B / 671B total | 14.8T | MoE frontier; Chinchilla-optimal at 37B active
(undisclosed) | Feb 2025 | undisclosed (est. dense) | undisclosed | Last major dense pre-training scale-up before the o3 reasoning paradigm (community estimate)
o3 | Apr 2025 | undisclosed | undisclosed | 87.5% ARC-AGI via inference-time compute scaling
✨ Insight · The paradigm shift: pre-2024 scaling meant “train a bigger model.” Post-2024, labs optimise across three axes simultaneously — training compute (N×D), inference compute (reasoning steps), and MoE capacity (total vs active params). A 37B-active MoE with extended inference can outperform a dense 70B model trained to Chinchilla-optimal on many benchmarks.
🧠

Key Takeaways

What to remember for interviews

  1. Kaplan et al. (2020) showed LLM loss follows a smooth power law with compute, data, and parameters — enabling you to predict a 100B model's performance from 1M-parameter experiments.
  2. Chinchilla (2022) corrected Kaplan: for a fixed compute budget, model size and token count should scale equally (both ∝ C^0.5). Rule of thumb: ~20 tokens per parameter.
  3. Training compute is approximated as C ≈ 6ND (6 FLOPs per parameter per token: 2 forward + 4 backward). GPT-3: 6 × 175B × 300B ≈ 3.15×10²³ FLOPs.
  4. Post-Chinchilla, labs deliberately over-train small models (Llama-3 8B: 15T tokens, ~94× Chinchilla-optimal for 8B) because inference cost is paid per query while training is a one-time expense.
  5. Test-time compute is a third scaling axis: on ARC-AGI, GPT-4o scores 5% while o3 reaches 87.5%, with o3's high-compute setting spending 172× the inference compute of its low-compute setting; the gain comes from inference-time reasoning, not a larger pre-training run.
  6. For sparse MoE, Chinchilla-optimality is computed over active parameters, not total. DeepSeek-V3's 14.8T training tokens are ~20× Chinchilla-optimal for its 37B active parameter count.
🧠

Recap quiz

Trade-off

Kaplan (2020) and Chinchilla (2022) both studied LLM scaling, but disagreed on one key ratio. What was the core disagreement?

Derivation

GPT-3 has 175B non-embedding parameters and was trained on 300B tokens. What is its approximate training FLOP count using C ≈ 6ND?

Trade-off

Llama-3 8B was trained on 15T tokens — roughly 94× the Chinchilla-optimal amount for 8B params. Why would Meta deliberately over-train beyond the compute-optimal point?

Derivation

Kaplan et al. found L(N) ∝ N^{−0.076}. What does the small exponent (0.076) imply about the effort needed to halve model loss?

Trade-off

OpenAI o1 demonstrated inference-time scaling: spending more compute at test time improves reasoning. How does this interact with the Chinchilla compute-optimal frontier?

Trade-off

A team wants to deploy a 70B model on a single 48GB GPU. Which quantization format is the minimum needed, and what quality tradeoff must they accept?

📚

Further Reading

🎯

Interview Questions


Explain the Chinchilla scaling law. How does it differ from the original Kaplan scaling law?

★★☆
Google · Meta

Why does Llama-2 70B outperform GPT-3 175B despite being smaller?

★★☆
Meta · OpenAI

Derive the approximate training compute formula C = 6ND. Where does the 6 come from?

★★★
Google · OpenAI

What is quantization and how do FP16, INT8, and INT4 compare for inference?

★★☆
Meta · Databricks

Explain continuous batching and PagedAttention. Why are they critical for LLM serving?

★★★
Google · Databricks

What is the learning rate schedule for training large language models? Why warmup + cosine decay?

★★☆
OpenAI · Google