
Transformer Math

Module 14 · Training

📈 Scaling Laws

Why Llama-2 beats GPT-3 with less than half the parameters


How big should the model be? How much data? Scaling laws give precise, predictable answers. Chinchilla showed that model size and data should scale equally — and that most models before 2022 were severely undertrained.

🎮

Scaling Laws Visualized

What you’re seeing: loss curves plotted against model size, dataset size, and compute budget — each follows a smooth power law, and the Chinchilla frontier shows where model size and tokens should be balanced for a given FLOP budget. What to try: use the compute budget planner below to see how Chinchilla-optimal allocation shifts the parameter-to-token ratio as you scale up.

[Chart] Chinchilla Scaling: Parameters vs Training Tokens. For a given compute budget, scale model size and data equally. Axes: training tokens (log scale, 300B to 15T) vs parameters (log scale, 7B to 540B). The Chinchilla-optimal frontier separates the over-parameterized region (too few tokens) from the under-parameterized region (too few params). Plotted models: GPT-3 (175B / 300B tok, over-parameterized), Chinchilla (70B / 1.4T tok, optimal), Llama-2 (70B / 2T tok, compute-optimal), Llama-3 (70B / 15T tok, over-trained).

Set a compute budget and see the Chinchilla-optimal model size, training tokens, and estimated cost. The planner uses the formula N = √(C/120) and D = 20N.

The budget slider spans 10^18 to 10^25 FLOPs. Example output at C = 10^21 FLOPs:

Metric | Value
Compute (C) | 10^21.0 FLOPs
Optimal Model Size (N) | 2.9B params
Optimal Tokens (D) | 57.7B tokens
Verify: 6ND | 10^21.0 FLOPs
A100-hours (est.) | 2.2K hours
Training Cost (@ $2/A100-hr) | $4.4K
💡

The Intuition

Kaplan et al. (2020) at OpenAI discovered that LLM loss scales as a smooth power law with model size, dataset size, and compute. This was revolutionary: you could predict the performance of a 100B model from experiments on 1M-parameter models.

Chinchilla (2022) from DeepMind corrected a critical mistake: Kaplan suggested scaling model size faster than data. In reality, parameters and tokens should scale in equal proportion for a fixed compute budget. This explained why GPT-3 (175B params, 300B tokens) underperformed relative to its compute cost.

The practical impact was enormous: Llama-2 70B, trained on 2T tokens, can outperform GPT-3 175B on many downstream evaluations while using less than half the parameters. It used more data per parameter, trading model size for cheaper inference.

✨ Insight · Scaling laws turned LLM training from alchemy into engineering. You can now predict the final loss — and therefore the compute budget — before training a single step. This is why labs run small-scale experiments first and extrapolate.

Quick check

Derivation

GPT-3 used 175B params and 300B tokens. Chinchilla says ~20 tokens/param. How many tokens was GPT-3 short by?


Emergent Abilities — Real or Artifact?

Wei et al. (2022) observed that certain capabilities — multi-step arithmetic, chain-of-thought reasoning, word-in-context understanding — appeared to emerge discontinuously: near-zero performance below a threshold compute level, then sharp gains above it. This was widely interpreted as a qualitative phase transition in model capability.

Schaeffer et al. (2023) challenged this view, showing that the apparent discontinuity is largely a measurement artifact. When you switch from a nonlinear metric (exact-match accuracy, which gives 0 for nearly-correct answers) to a linear metric (token-level log probability), the same models show smooth, continuous improvement. The practical implication: emergent abilities are real in the sense that useful task performance crosses a human-meaningful threshold at a certain scale, but the underlying capability growth is smooth and predictable — which means you can extrapolate it from smaller runs using the right metrics.

Inference-time Scaling Laws: OpenAI's research on o1 and o3 shows test-time compute follows its own scaling law — increasing inference compute via sampling, verification, and search can materially improve reasoning performance. The implication is that for reasoning-heavy problems, it can be cheaper to think harder at inference than to train a much bigger model. This opens a new axis of scaling distinct from the Chinchilla compute-optimal frontier.

Post-Chinchilla Overtrained Models: Both Llama-3 8B and 70B were trained on up to 15T tokens (per Meta's April 2024 release). Why deliberately over-train? Because Chinchilla optimizes for training compute efficiency, not deployment cost. Smaller models are cheaper to serve per query, so training a compact model for far longer produces better quality-per-FLOP at inference time. The insight: the Chinchilla optimal point is right only if training cost is your bottleneck. When serving millions of queries, inference cost dominates — making over-trained small models the economically rational choice.
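To make the economics concrete, here is a toy total-compute model. It is a sketch with illustrative numbers; real crossover points depend on traffic, hardware, and batching efficiency, and the 2 FLOPs/param/token inference figure is the usual forward-pass approximation.

python
# Toy comparison: Chinchilla-optimal 70B vs over-trained 8B, including serving compute.
def total_flops(params, train_tokens, queries, tokens_per_query):
    train = 6 * params * train_tokens                    # training: ~6 FLOPs/param/token
    serve = 2 * params * queries * tokens_per_query      # inference: ~2 FLOPs/param/token (forward only)
    return train + serve

# Illustrative workload: 10B queries at ~1k generated tokens each
big   = total_flops(70e9, 1.4e12, 1e10, 1000)   # 70B, Chinchilla-optimal data
small = total_flops(8e9, 15e12, 1e10, 1000)     # 8B, heavily over-trained (Llama-3 style)
print(f"70B total: {big:.2e} FLOPs | 8B total: {small:.2e} FLOPs")
# At this query volume the over-trained 8B wins on lifetime compute despite its higher training cost.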

Quick Check

According to Chinchilla, if you double your compute budget, how should you split it?

📐

Step-by-Step Derivation

Kaplan Scaling Law (2020)

Loss scales as a power law with model size (non-embedding parameters):

L(N) ∝ N^(−α_N), with α_N ≈ 0.076; analogous power laws hold for dataset size D and compute C.
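To see what the small exponent implies in practice, here is the arithmetic as a quick sketch:

python
# Kaplan power law: L(N) ∝ N^(-0.076)
# To halve loss: (N2 / N1)^0.076 = 2  →  N2 / N1 = 2^(1 / 0.076)
alpha_N = 0.076
growth = 2 ** (1 / alpha_N)
print(f"Halving loss needs ~{growth:,.0f}x more parameters")  # roughly 9,000x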

Chinchilla Optimal Scaling (2022)

For a fixed compute budget C (with C ≈ 6ND), optimal model size and data scale equally:

N_opt ∝ C^0.5, D_opt ∝ C^0.5, giving the rule of thumb D ≈ 20N.

💡 Tip · Rule of thumb: train on ~20 tokens per parameter. A 7B model needs ~140B tokens. A 70B model needs ~1.4T tokens. Modern practice pushes beyond this for inference efficiency.

Python: Compute-Optimal Model Size Calculator

python
# Chinchilla scaling law: C = 6ND, D_opt = 20 * N_opt
# Substituting: C = 6 * N * 20N = 120N² → N_opt = sqrt(C / 120)
def chinchilla_optimal(flops_budget):
    """Given a FLOP budget, compute optimal model size and data."""
    N = (flops_budget / 120) ** 0.5  # optimal params
    D = 20 * N                        # optimal tokens (20 tokens per param)
    return {"params": N, "tokens": D, "flops": flops_budget}

# Example: 1e24 FLOPs (roughly Chinchilla's budget)
result = chinchilla_optimal(1e24)
# → ~91B params, ~1.8T tokens

# GPT-3 budget: 3.15e23 FLOPs
gpt3 = chinchilla_optimal(3.15e23)
# → ~51B params, ~1.0T tokens
# GPT-3 used 175B params + 300B tokens — model was ~3.4× too large for the data budget
✨ Insight · Chinchilla changed the industry: GPT-3 was ~3× too large for its data budget. For the same compute, a 70B model trained on 1.4T tokens beats a 175B model trained on 300B tokens.

Training Compute Approximation

Total FLOPs for training a Transformer with N parameters on D tokens:

C ≈ 6ND (roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass).
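A quick sanity check of the formula against GPT-3's published numbers (the 2 + 4 split is the standard forward/backward accounting):

python
# C ≈ 6ND: ~2 FLOPs/param/token for the forward pass, ~4 for the backward pass
N = 175e9    # GPT-3 parameters
D = 300e9    # GPT-3 training tokens
C = 6 * N * D
print(f"GPT-3 training compute ≈ {C:.2e} FLOPs")   # ≈ 3.15e+23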

Quick check

Derivation

In C ≈ 6ND, the factor of 6 comes from forward + backward pass FLOPs per parameter. What is the correct breakdown?


Learning Rate Schedule: Warmup + Cosine Decay

Linear warmup for the first W steps, then cosine decay from η_max to η_min over T total steps:

η(t) = η_max · t / W for t < W, and η(t) = η_min + ½ (η_max − η_min) (1 + cos(π (t − W) / (T − W))) for W ≤ t ≤ T.

PyTorch: Cosine LR Scheduler with Warmup

python
import math

def get_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """Linear warmup + cosine decay schedule."""
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    # Cosine decay
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: GPT-3 style schedule (units are steps, not tokens)
# max_lr=6e-5, min_lr=6e-6
# warmup_steps=375, total_steps=300_000
# (assumes batch_size=1M tokens: 375M-token warmup / 1M = 375 steps)
Python: Chinchilla Loss Function & Compute-Optimal Allocation

python
# Chinchilla scaling loss: L(N, D) = E + A/N^alpha + B/D^beta
# Hoffmann et al. 2022 fitted constants
E = 1.69   # irreducible entropy loss
A = 406.4
B = 410.7
alpha = 0.34
beta = 0.28

def chinchilla_loss(N, D):
    """Predict cross-entropy loss given model params N and data tokens D."""
    return E + A / (N ** alpha) + B / (D ** beta)

def optimal_allocation(C):
    """Chinchilla-optimal N and D for a fixed FLOP budget C = 6ND."""
    N = (A * alpha / (B * beta)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

N_opt, D_opt = optimal_allocation(3.15e23)  # GPT-3 compute budget
print(f"Optimal: {N_opt/1e9:.1f}B params, {D_opt/1e12:.1f}T tokens")
print(f"Predicted loss: {chinchilla_loss(N_opt, D_opt):.3f}")

Quantization: Precision vs. Memory

FP32 (4 bytes): training master weights, full precision. 70B model = 280GB.

FP16/BF16 (2 bytes): standard training and inference. 70B = 140GB. BF16 preferred for its wider exponent range.

INT8 (1 byte): post-training quantization, <1% quality loss. 70B = 70GB. LLM.int8() handles outlier features in FP16.

INT4 (0.5 bytes): aggressive quantization via GPTQ/AWQ. 70B ≈ 35GB, which fits on a single 48GB GPU. Quality loss is larger than INT8 but often acceptable for deployment.
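The memory arithmetic above as a minimal sketch (weights only; the KV cache, activations, and optimizer states add more on top):

python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate weight-only memory footprint in GB."""
    return n_params * bytes_per_param / 1e9

for fmt, size in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{fmt:10s}: {weight_memory_gb(70e9, size):.0f} GB for a 70B model")
# FP32 280 GB, FP16/BF16 140 GB, INT8 70 GB, INT4 35 GB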

Serving: Continuous Batching & PagedAttention

Continuous batching (Orca): insert new requests as sequences complete, instead of waiting for the whole batch. GPU utilization jumps from ~30% to 90%+.

PagedAttention (vLLM): manages the KV cache like virtual-memory pages, allocating small blocks on demand instead of pre-allocating the maximum sequence length, which eliminates most KV-cache fragmentation.

Model routing: use a small model for easy queries and route hard queries to the large model. Reduces average cost by 50-80% with minimal quality loss.
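A minimal routing sketch, assuming hypothetical small_model and large_model callables that return a response plus a confidence score; this is not any specific library's API:

python
def route(query, small_model, large_model, threshold=0.8):
    """Serve easy queries from the small model; escalate low-confidence ones."""
    draft, confidence = small_model(query)    # hypothetical: returns (text, score in [0, 1])
    if confidence >= threshold:
        return draft                           # cheap path: most traffic stops here
    return large_model(query)                  # expensive path: only hard queries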

🔧

Break It — See What Happens

Train too few tokens (GPT-3 style)
Train far beyond Chinchilla-optimal
Undertrain (0.1× Chinchilla Tokens) — The Pre-Chinchilla Default
📊

Real-World Numbers

Publicly reported or commonly cited figures, with estimates labeled explicitly. These are useful for intuition, not as exact accounting records.

Model | Params | Tokens | Tokens/Param | Est. Cost
GPT-3 | 175B | 300B | 1.7× | ~$4.6M
Chinchilla | 70B | 1.4T | 20× | Same as Gopher
Llama-2 | 70B | 2T | 29× | ~$2M
Llama-3 | 70B | 15T | 214× | ~$10M+
GPT-4 * | ~1.8T (est.) | ~13T (est.) | ~7× | ~$100M

* GPT-4 architecture details not officially confirmed by OpenAI — figures are community estimates from leaks and benchmarking.

✨ Insight · Notice the trend: tokens/param ratio keeps increasing. Labs are intentionally over-training relative to Chinchilla because inference cost dominates. A smaller, longer-trained model is cheaper to serve at scale.

Quick check

Trade-off

Llama-3 8B trained on 15T tokens. Chinchilla-optimal for 8B is ~160B tokens. What does the 94× over-training ratio reveal about the real optimization target?

⏱️

Test-Time Compute as the Third Axis (2024–2025)

Kaplan and Chinchilla defined two axes: model parameters (N) and training tokens (D). A third axis emerged in 2024 — compute spent at inference time. The insight: a smaller, cheaper model given more thinking steps can match or beat a larger model on reasoning tasks.

o3 vs GPT-4o: same model family, vastly more inference compute

OpenAI's o3 is reported to build on the same base model family as GPT-4o but applies extended chain-of-thought reasoning at inference. On ARC-AGI (a test explicitly designed to resist LLM shortcuts):

  • GPT-4o: 5% accuracy
  • o3 (high-compute setting): 87.5% accuracy, with the high-compute configuration spending 172× the inference compute of o3's low-compute setting

Source: ARC Prize blog, Dec 2024

Deep dive: Snell et al. 2024 — scaling inference compute

Snell et al. (2024, arxiv:2408.03314) systematically studied how inference-time compute scales for language models. Key finding: a model 14× smaller than a frontier model, given extended thinking budget (beam search, self-refinement, verifier-guided search), matches the larger model on medium-difficulty math tasks.

The gain is task-difficulty-dependent. Easy tasks are already solved in one pass; hard tasks require domain knowledge no amount of test-time compute can substitute for. The sweet spot is medium-difficulty: problems the model “almost knows” — where search and self-correction surface the right answer.

Practical implication: serving cost is not fixed at training time. A 7B model with a verifier loop can replace a 70B model for the right task distribution, at 10× lower per-token cost.

Source: Snell et al. 2024, arxiv:2408.03314
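As a toy illustration of spending extra inference compute, here is a best-of-N sketch with a verifier. The model.sample and verifier.score calls are hypothetical stand-ins, not the paper's implementation (which also studies beam search and sequential self-refinement):

python
def best_of_n(prompt, model, verifier, n=16):
    """Sample N candidate answers and keep the one the verifier scores highest."""
    candidates = [model.sample(prompt, temperature=0.8) for _ in range(n)]
    return max(candidates, key=verifier.score)   # more samples = more inference compute (up to a point, better answers)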

Deep dive: Sparse MoE and Chinchilla-optimality

The Chinchilla rule applies to active parameters, not total. DeepSeek-V3 has 671B total parameters but only 37B active per token. The Chinchilla-optimal question is: “How many tokens should we train this 37B-active-param model on?” — not “how many for a 671B dense model?”

At 37B active params, Chinchilla-optimal training is roughly 37B × 20 ≈ 740B tokens. DeepSeek-V3 was trained on 14.8T tokens — ~20× Chinchilla-optimal for the active parameter count. This is deliberate over-training (the same pattern as Llama-3) because inference cost dominates at scale.
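The arithmetic as a quick check:

python
active_params = 37e9
chinchilla_tokens = 20 * active_params      # ≈ 740B tokens for the active parameter count
actual_tokens = 14.8e12                     # DeepSeek-V3's reported training tokens
print(f"Over-training ratio: {actual_tokens / chinchilla_tokens:.0f}x")   # ≈ 20x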

The 671B total parameters store specialised knowledge across 256 fine-grained experts. The scaling law governs compute efficiency; the large expert pool governs knowledge capacity. These are independent dimensions.

Updated frontier models (2024–2025)

Model | Release | Params (active) | Tokens | Significance
DeepSeek-V3 | Dec 2024 | 37B / 671B total | 14.8T | MoE frontier; Chinchilla-optimal at 37B active
(undisclosed) | Feb 2025 | undisclosed (est. dense) | undisclosed | Last major dense pre-training scale-up before the o3 reasoning paradigm (community estimate)
o3 | Apr 2025 | undisclosed | undisclosed | 87.5% ARC-AGI via inference-time compute scaling
✨ Insight · The paradigm shift: pre-2024 scaling meant “train a bigger model.” Post-2024, labs optimise across three axes simultaneously — training compute (N×D), inference compute (reasoning steps), and MoE capacity (total vs active params). A 37B-active MoE with extended inference can outperform a dense 70B model trained to Chinchilla-optimal on many benchmarks.
🧠

Key Takeaways

What to remember for interviews

  1. Kaplan et al. (2020) showed LLM loss follows a smooth power law with compute, data, and parameters — enabling you to predict a 100B model's performance from 1M-parameter experiments.
  2. Chinchilla (2022) corrected Kaplan: for a fixed compute budget, model size and token count should scale equally (both ∝ C^0.5). Rule of thumb: ~20 tokens per parameter.
  3. Training compute is approximated as C ≈ 6ND (6 FLOPs per parameter per token: 2 forward + 4 backward). GPT-3: 6 × 175B × 300B ≈ 3.15×10²³ FLOPs.
  4. Post-Chinchilla, labs deliberately over-train small models (Llama-3 8B: 15T tokens, ~94× Chinchilla-optimal for 8B) because inference cost is paid per query while training is a one-time expense.
  5. Test-time compute is a third scaling axis: on ARC-AGI, GPT-4o scores 5% while o3 reaches 87.5%, with o3's high-compute setting spending 172× the inference compute of its low-compute setting; the gain comes from inference-time reasoning, not a larger pre-training run.
  6. For sparse MoE, Chinchilla-optimality is computed over active parameters, not total. DeepSeek-V3's 14.8T training tokens are ~20× Chinchilla-optimal for its 37B active parameter count.
🧠

Recap quiz

Trade-off

Kaplan (2020) and Chinchilla (2022) both studied LLM scaling, but disagreed on one key ratio. What was the core disagreement?

Derivation

GPT-3 has 175B non-embedding parameters and was trained on 300B tokens. What is its approximate training FLOP count using C ≈ 6ND?

Trade-off

Llama-3 8B was trained on 15T tokens — roughly 94× the Chinchilla-optimal amount for 8B params. Why would Meta deliberately over-train beyond the compute-optimal point?

Derivation

Kaplan et al. found L(N) ∝ N^{−0.076}. What does the small exponent (0.076) imply about the effort needed to halve model loss?

Trade-off

OpenAI o1 demonstrated inference-time scaling: spending more compute at test time improves reasoning. How does this interact with the Chinchilla compute-optimal frontier?

Trade-off

A team wants to deploy a 70B model on a single 48GB GPU. Which quantization format is the minimum needed, and what quality tradeoff must they accept?

📚

Further Reading

🎯

Interview Questions


Explain the Chinchilla scaling law. How does it differ from the original Kaplan scaling law?

★★☆
Google · Meta

Why does Llama-2 70B outperform GPT-3 175B despite being smaller?

★★☆
Meta · OpenAI

Derive the approximate training compute formula C = 6ND. Where does the 6 come from?

★★★
Google · OpenAI

What is quantization and how do FP16, INT8, and INT4 compare for inference?

★★☆
Meta · Databricks

Explain continuous batching and PagedAttention. Why are they critical for LLM serving?

★★★
Google · Databricks

What is the learning rate schedule for training large language models? Why warmup + cosine decay?

★★☆
OpenAI · Google