📈 Scaling Laws
Why Llama-2 beats GPT-3 with half the parameters
How big should the model be? How much data? Scaling laws give precise, predictable answers. Chinchilla showed that model size and data should scale equally — and that most models before 2022 were severely undertrained.
Scaling Laws Visualized
What you’re seeing: loss curves plotted against model size, dataset size, and compute budget — each follows a smooth power law, and the Chinchilla frontier shows where model size and tokens should be balanced for a given FLOP budget. What to try: use the compute budget planner below to see how Chinchilla-optimal allocation shifts the parameter-to-token ratio as you scale up.
Set a compute budget and see the Chinchilla-optimal model size, training tokens, and estimated cost. The planner uses the formula N = √(C/120) and D = 20N.
| Metric | Value |
|---|---|
| Compute (C) | 10^21.0 FLOPs |
| Optimal Model Size (N) | 2.9B params |
| Optimal Tokens (D) | 57.7B tokens |
| Verify: 6ND | 10^21.0 FLOPs |
| A100-hours (est.) | 2.2K hours |
| Training Cost (@$2/A100-hr) | $4.4K |
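A minimal sketch of the planner's arithmetic. The FLOP-to-hours conversion assumes an A100 delivers ~312 TFLOPS in BF16 at roughly 40% utilization; both are illustrative assumptions, not part of the planner's stated formula.

```python
# Sketch of the compute-budget planner. The A100 peak throughput (312 TFLOPS BF16)
# and ~40% utilization (MFU) are illustrative assumptions.
def plan(flops_budget, a100_peak=312e12, mfu=0.40, dollars_per_hour=2.0):
    n_params = (flops_budget / 120) ** 0.5            # N = sqrt(C / 120)
    n_tokens = 20 * n_params                          # D = 20N
    seconds = flops_budget / (a100_peak * mfu)        # wall-clock GPU-seconds
    a100_hours = seconds / 3600
    return n_params, n_tokens, a100_hours, a100_hours * dollars_per_hour

N, D, hours, cost = plan(1e21)
print(f"{N/1e9:.1f}B params, {D/1e9:.1f}B tokens, {hours:,.0f} A100-hours, ${cost:,.0f}")
# -> ~2.9B params, ~57.7B tokens, ~2,200 A100-hours, ~$4.5K (close to the table above)
```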
The Intuition
Kaplan et al. (2020) at OpenAI discovered that LLM loss scales as a smooth power law with model size, dataset size, and compute. This was revolutionary: you could predict the performance of a 100B model from experiments on 1M-parameter models.
Chinchilla (2022) from DeepMind corrected a critical mistake: Kaplan suggested scaling model size faster than data. In reality, parameters and tokens should scale in equal proportion for a fixed compute budget. This explained why GPT-3 (175B params, 300B tokens) underperformed relative to its compute cost.
The practical impact was enormous: Llama-2 70B, trained on 2T tokens, can outperform GPT-3 175B on many downstream evaluations while using less than half the parameters. It used more data per parameter, trading model size for cheaper inference.
Quick check
GPT-3 used 175B params and 300B tokens. Chinchilla says ~20 tokens/param. How many tokens was GPT-3 short by?
Emergent Abilities — Real or Artifact?
Wei et al. (2022) observed that certain capabilities — multi-step arithmetic, chain-of-thought reasoning, word-in-context understanding — appeared to emerge discontinuously: near-zero performance below a threshold compute level, then sharp gains above it. This was widely interpreted as a qualitative phase transition in model capability. Schaeffer et al. (2023) challenged this view, showing that the apparent discontinuity is largely a measurement artifact. When you switch from a nonlinear metric (exact-match accuracy, which gives 0 for nearly-correct answers) to a linear metric (token-level log probability), the same models show smooth, continuous improvement. The practical implication: emergent abilities are real in the sense that useful task performance crosses a human-meaningful threshold at a certain scale, but the underlying capability growth is smooth and predictable — which means you can extrapolate it from smaller runs using the right metrics.
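A toy illustration of Schaeffer et al.'s argument. The numbers below are made up purely for illustration: if per-token accuracy improves smoothly with scale, an exact-match metric over a multi-token answer can still look like a sharp jump.

```python
# Toy model of "emergence as a metric artifact" (illustrative numbers only).
# Assume per-token accuracy p grows smoothly with log-compute; exact-match on an
# L-token answer is p**L, which stays near zero and then shoots up.
import math

answer_len = 10                      # tokens in the target answer
for log_compute in range(18, 25):    # hypothetical scale, 1e18 .. 1e24 FLOPs
    p = 1 - 0.9 * math.exp(-0.35 * (log_compute - 18))   # smooth per-token accuracy
    exact_match = p ** answer_len                         # nonlinear metric
    print(f"1e{log_compute} FLOPs: per-token={p:.2f}  exact-match={exact_match:.3f}")
# Per-token accuracy (the "linear" view) rises gradually from 0.10 to ~0.89,
# while exact-match hugs zero and then climbs steeply: apparent "emergence".
```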
Inference-time Scaling Laws
OpenAI's research on o1 and o3 shows test-time compute follows its own scaling law — increasing inference compute via sampling, verification, and search can materially improve reasoning performance. The implication is that for reasoning-heavy problems, it can be cheaper to think harder at inference than to train a much bigger model. This opens a new axis of scaling distinct from the Chinchilla compute-optimal frontier.
Post-Chinchilla Overtrained Models
Both Llama-3 8B and 70B were trained on up to 15T tokens (per Meta's April 2024 release). Why deliberately over-train? Because Chinchilla optimizes for training compute efficiency, not deployment cost. Smaller models are cheaper to serve per query, so training a compact model for far longer produces better quality-per-FLOP at inference time. The insight: the Chinchilla optimal point is right only if training cost is your bottleneck. When serving millions of queries, inference cost dominates — making over-trained small models the economically rational choice.
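A back-of-the-envelope sketch of why inference cost flips the decision. All figures here (query volume, tokens per query, the forward-pass cost of ~2 FLOPs per parameter per token) are illustrative assumptions chosen only to show the shape of the tradeoff.

```python
# Hypothetical comparison: Chinchilla-optimal 70B model vs over-trained 8B model.
# All numbers are illustrative assumptions, not reported figures.
def lifetime_flops(n_params, train_tokens, queries, tokens_per_query=1_000):
    train = 6 * n_params * train_tokens                 # training: C = 6ND
    serve = 2 * n_params * tokens_per_query * queries   # inference: ~2 FLOPs/param/token
    return train, serve

for name, n, d in [("70B Chinchilla-optimal", 70e9, 1.4e12),
                   ("8B over-trained",        8e9,  15e12)]:
    train, serve = lifetime_flops(n, d, queries=1e9)
    print(f"{name}: train {train:.1e} FLOPs, serve 1B queries {serve:.1e} FLOPs")
# The small model costs more training FLOPs per parameter, but serving it is
# roughly 9x cheaper; at high query volume that dominates the total bill.
```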
According to Chinchilla, if you double your compute budget, how should you split it?
Step-by-Step Derivation
Kaplan Scaling Law (2020)
Loss scales as a power law with model size (non-embedding parameters): L(N) = (N_c / N)^{α_N} with α_N ≈ 0.076, i.e. L(N) ∝ N^{−0.076}.
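To get a feel for how weak that exponent is, a quick calculation (using the α_N ≈ 0.076 fit) of how much larger N must be to halve the loss:

```python
# How much bigger must the model be to halve loss, if L(N) ∝ N^(-0.076)?
alpha_N = 0.076
factor = 2 ** (1 / alpha_N)
print(f"Halving loss requires ~{factor:,.0f}x more parameters")  # ~9,000x
```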
Chinchilla Optimal Scaling (2022)
For a fixed compute budget C, optimal model size and data scale equally: N_opt ∝ C^0.5 and D_opt ∝ C^0.5, which works out to roughly 20 training tokens per parameter (D ≈ 20N).
Python: Compute-Optimal Model Size Calculator
# Chinchilla scaling law: C = 6ND, D_opt = 20 * N_opt
# Substituting: C = 6 * N * 20N = 120N² → N_opt = sqrt(C / 120)
def chinchilla_optimal(flops_budget):
"""Given a FLOP budget, compute optimal model size and data."""
N = (flops_budget / 120) ** 0.5 # optimal params
D = 20 * N # optimal tokens (20 tokens per param)
return {"params": N, "tokens": D, "flops": flops_budget}
# Example: 1e24 FLOPs (roughly Chinchilla's budget)
result = chinchilla_optimal(1e24)
# → ~91B params, ~1.8T tokens
# GPT-3 budget: 3.15e23 FLOPs
gpt3 = chinchilla_optimal(3.15e23)
# → ~51B params, ~1.0T tokens
# GPT-3 used 175B params + 300B tokens — model was ~3.4× too large for the data budget
Training Compute Approximation
Total FLOPs for training a Transformer with N parameters on D tokens: C ≈ 6ND.
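Plugging in GPT-3's reported numbers as a sanity check:

```python
# C ≈ 6ND for GPT-3 (175B params, 300B tokens)
C = 6 * 175e9 * 300e9
print(f"{C:.2e} FLOPs")  # ~3.15e23, the figure used throughout this section
```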
Quick check
In C ≈ 6ND, the factor of 6 comes from forward + backward pass FLOPs per parameter. What is the correct breakdown?
Learning Rate Schedule: Warmup + Cosine Decay
Linear warmup for the first warmup_steps, then cosine decay from max_lr to min_lr: lr = min_lr + 0.5 · (max_lr − min_lr) · (1 + cos(π · progress)), where progress goes from 0 to 1 over the remaining steps.
PyTorch: Cosine LR Scheduler with Warmup
import math
def get_lr(step, warmup_steps, total_steps, max_lr, min_lr):
"""Linear warmup + cosine decay schedule."""
if step < warmup_steps:
# Linear warmup
return max_lr * step / warmup_steps
# Cosine decay
progress = (step - warmup_steps) / (total_steps - warmup_steps)
return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
# Example: GPT-3 style schedule (units are steps, not tokens)
# max_lr=6e-5, min_lr=6e-6
# warmup_steps=375, total_steps=300_000
# (assumes batch_size=1M tokens: 375M-token warmup / 1M = 375 steps)
Python: Chinchilla Loss Predictor
# Chinchilla scaling loss: L(N, D) = E + A/N^alpha + B/D^beta
# Hoffmann et al. 2022 fitted constants
E = 1.69 # irreducible entropy loss
A = 406.4
B = 410.7
alpha = 0.34
beta = 0.28
def chinchilla_loss(N, D):
"""Predict cross-entropy loss given model params N and data tokens D."""
return E + A / (N ** alpha) + B / (D ** beta)
def optimal_allocation(C):
"""Chinchilla-optimal N and D for a fixed FLOP budget C = 6ND."""
N = (A * alpha / (B * beta)) ** (1 / (alpha + beta)) * (C / 6) ** (beta / (alpha + beta))
D = C / (6 * N)
return N, D
N_opt, D_opt = optimal_allocation(3.15e23) # GPT-3 compute budget
print(f"Optimal: {N_opt/1e9:.1f}B params, {D_opt/1e12:.1f}T tokens")
print(f"Predicted loss: {chinchilla_loss(N_opt, D_opt):.3f}")Quantization: Precision vs. Memory
FP32 (4 bytes): training master weights, full precision. 70B model = 280GB.
FP16/BF16 (2 bytes): standard training and inference. 70B = 140GB. BF16 preferred for its wider exponent range.
INT8 (1 byte): post-training quantization, <1% quality loss. 70B = 70GB. LLM.int8() handles outlier features in FP16.
INT4 (0.5 bytes): aggressive quantization via GPTQ/AWQ. 70B = 35GB, which fits on a single 48GB GPU. Quality loss is larger than INT8 but often acceptable for serving.
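A quick sketch of the weight-memory arithmetic behind those numbers (weights only; KV cache and activation memory are extra):

```python
# Weight memory for a 70B-parameter model at different precisions (weights only;
# KV cache and activation memory are not included).
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}
n_params = 70e9
for fmt, b in bytes_per_param.items():
    print(f"{fmt:>10}: {n_params * b / 1e9:.0f} GB")
# FP32 280 GB, FP16/BF16 140 GB, INT8 70 GB, INT4 35 GB (fits on one 48 GB GPU)
```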
Serving: Continuous Batching & PagedAttention
Continuous batching (Orca): insert new requests as sequences complete, instead of waiting for the whole batch. GPU utilization jumps from ~30% to 90%+.
PagedAttention (vLLM): manages the KV cache like virtual memory pages, allocating small blocks on demand instead of pre-allocating the maximum sequence length. This largely eliminates KV-cache fragmentation and waste.
Model routing: use a small model for easy queries and route hard queries to the large model. Reduces average cost by 50-80% with minimal quality loss.
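A minimal sketch of the routing idea. The difficulty heuristic and threshold below are hypothetical placeholders; production routers typically use a trained classifier or a small-model confidence score instead.

```python
# Hypothetical model-routing sketch: send easy queries to a small model, hard ones
# to a large model. The "difficulty" heuristic is a placeholder, not a real system.
def estimate_difficulty(query: str) -> float:
    # Placeholder heuristic: longer queries and math/code keywords look "harder".
    hard_keywords = ("prove", "derive", "debug", "optimize", "step by step")
    score = min(len(query) / 500, 1.0)
    score += 0.5 * any(k in query.lower() for k in hard_keywords)
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    return "large-model" if estimate_difficulty(query) >= threshold else "small-model"

print(route("What is the capital of France?"))           # -> small-model
print(route("Derive C = 6ND and explain each factor."))  # -> large-model
```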
Real-World Numbers
Publicly reported or commonly cited figures, with estimates labeled explicitly. These are useful for intuition, not as exact accounting records.
| Model | Params | Tokens | Tokens/Param | Est. Cost |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7x | ~$4.6M |
| Chinchilla | 70B | 1.4T | 20x | Same as Gopher |
| Llama-2 | 70B | 2T | 29x | ~$2M |
| Llama-3 | 70B | 15T | 214x | ~$10M+ |
| GPT-4 * | ~1.8T (est.) | ~13T (est.) | ~7x | ~$100M |
* GPT-4 architecture details not officially confirmed by OpenAI — figures are community estimates from leaks and benchmarking.
Quick check
Llama-3 8B trained on 15T tokens. Chinchilla-optimal for 8B is ~160B tokens. What does the 94× over-training ratio reveal about the real optimization target?
Test-Time Compute as the Third Axis (2024–2025)
Kaplan and Chinchilla defined two axes: model parameters (N) and training tokens (D). A third axis emerged in 2024 — compute spent at inference time. The insight: a smaller, cheaper model given more thinking steps can match or beat a larger model on reasoning tasks.
o3 vs GPT-4o: same training, 172× inference compute
OpenAI's o3 uses the same base training as GPT-4o but applies extended chain-of-thought reasoning at inference. On ARC-AGI (a test explicitly designed to resist LLM shortcuts):
- GPT-4o: 5% accuracy
- o3 (high compute): 87.5% accuracy — 172× more inference compute
Source: ARC Prize blog, Dec 2024
Deep dive: Snell et al. 2024 — scaling inference compute
Snell et al. (2024, arxiv:2408.03314) systematically studied how inference-time compute scales for language models. Key finding: a model 14× smaller than a frontier model, given extended thinking budget (beam search, self-refinement, verifier-guided search), matches the larger model on medium-difficulty math tasks.
The gain is task-difficulty-dependent. Easy tasks are already solved in one pass; hard tasks require domain knowledge no amount of test-time compute can substitute for. The sweet spot is medium-difficulty: problems the model “almost knows” — where search and self-correction surface the right answer.
Practical implication: serving cost is not fixed at training time. A 7B model with a verifier loop can replace a 70B model for the right task distribution, at 10× lower per-token cost.
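A skeletal sketch of one test-time-compute recipe (best-of-N sampling with a verifier). The `generate` and `verify` functions are hypothetical stand-ins for a model call and a learned or heuristic verifier; this is not a specific paper's implementation.

```python
import random

# Skeletal best-of-N sampling with a verifier: one simple way to spend more
# compute at inference time. `generate` and `verify` are hypothetical stand-ins.
def generate(prompt: str, temperature: float = 0.8) -> str:
    return f"candidate-answer-{random.randint(0, 9)}"    # stand-in for a model call

def verify(prompt: str, answer: str) -> float:
    return random.random()                               # stand-in for a verifier score

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: verify(prompt, a))

# Inference cost scales linearly with n; on medium-difficulty problems, a larger n
# can substitute for a larger model (Snell et al. 2024).
print(best_of_n("Solve: 37 * 43 = ?", n=16))
```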
Deep dive: Sparse MoE and Chinchilla-optimality
The Chinchilla rule applies to active parameters, not total. DeepSeek-V3 has 671B total parameters but only 37B active per token. The Chinchilla-optimal question is: “How many tokens should we train this 37B-active-param model on?” — not “how many for a 671B dense model?”
At 37B active params, Chinchilla-optimal training is roughly 37B × 20 ≈ 740B tokens. DeepSeek-V3 was trained on 14.8T tokens — ~20× Chinchilla-optimal for the active parameter count. This is deliberate over-training (same pattern as Llama-3) because inference cost dominates at scale.
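The same arithmetic in a few lines, applied to the active-parameter count:

```python
# Chinchilla-optimal tokens for DeepSeek-V3's *active* parameter count
active_params = 37e9
chinchilla_tokens = 20 * active_params      # ~740B tokens
actual_tokens = 14.8e12
print(f"Optimal ~{chinchilla_tokens/1e9:.0f}B tokens; "
      f"trained on {actual_tokens/1e12:.1f}T "
      f"(~{actual_tokens/chinchilla_tokens:.0f}x over-trained)")
```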
The 671B total parameters store specialised knowledge across 256 fine-grained experts. The scaling law governs compute efficiency; the large expert pool governs knowledge capacity. These are independent dimensions.
Updated frontier models (2024–2025)
| Model | Release | Params (active) | Tokens | Significance |
|---|---|---|---|---|
| DeepSeek-V3 | Dec 2024 | 37B / 671B total | 14.8T | MoE frontier; Chinchilla-optimal at 37B active |
| GPT-4.5 | Feb 2025 | undisclosed (dense, community est.) | undisclosed | Last major dense pre-training scale-up before the o3 reasoning paradigm (community estimate) |
| o3 | Apr 2025 | undisclosed | — | 87.5% ARC-AGI via inference-time compute scaling |
Key Takeaways
What to remember for interviews
1. Kaplan et al. (2020) showed LLM loss follows a smooth power law with compute, data, and parameters — enabling you to predict a 100B model's performance from 1M-parameter experiments.
2. Chinchilla (2022) corrected Kaplan: for a fixed compute budget, model size and token count should scale equally (both ∝ C^0.5). Rule of thumb: ~20 tokens per parameter.
3. Training compute is approximated as C ≈ 6ND (6 FLOPs per parameter per token: 2 forward + 4 backward). GPT-3: 6 × 175B × 300B ≈ 3.15×10²³ FLOPs.
4. Post-Chinchilla, labs deliberately over-train small models (Llama-3 8B: 15T tokens, ~94× Chinchilla-optimal for 8B) because inference cost is paid per query while training is a one-time expense.
5. Test-time compute is a third scaling axis: o3 uses 172× more inference compute than GPT-4o, lifting ARC-AGI from 5% to 87.5% — no additional training required.
6. For sparse MoE, Chinchilla-optimality is computed over active parameters, not total. DeepSeek-V3's 14.8T training tokens are ~20× Chinchilla-optimal for its 37B active parameter count.
Recap quiz
Kaplan (2020) and Chinchilla (2022) both studied LLM scaling, but disagreed on one key ratio. What was the core disagreement?
GPT-3 has 175B non-embedding parameters and was trained on 300B tokens. What is its approximate training FLOP count using C ≈ 6ND?
Llama-3 8B was trained on 15T tokens — roughly 94× the Chinchilla-optimal amount for 8B params. Why would Meta deliberately over-train beyond the compute-optimal point?
Kaplan et al. found L(N) ∝ N^{−0.076}. What does the small exponent (0.076) imply about the effort needed to halve model loss?
OpenAI o1 demonstrated inference-time scaling: spending more compute at test time improves reasoning. How does this interact with the Chinchilla compute-optimal frontier?
A team wants to deploy a 70B model on a single 48GB GPU. Which quantization format is the minimum needed, and what quality tradeoff must they accept?
Further Reading
- Chinchilla: Training Compute-Optimal Large Language Models — DeepMind's revised scaling laws showing models should be trained on ~20x more tokens than parameters.
- Scaling Laws for Neural Language Models (Kaplan et al.) — The original OpenAI scaling laws paper establishing power-law relationships between compute, data, parameters, and loss.
- GPT-4 Technical Report — OpenAI's report on GPT-4, including predictable scaling of loss from smaller models.
- Lilian Weng's Blog — Technical posts on scaling behavior, emergent abilities, and LLM training dynamics.
- Emergent Abilities of Large Language Models — Wei et al. 2022 — documents capabilities that appear unpredictably at scale, raising questions about whether scaling produces continuous or discontinuous improvements.
- Are Emergent Abilities of Large Language Models a Mirage? — Schaeffer et al. 2023 — argues apparent emergence is an artifact of discontinuous evaluation metrics, not a fundamental property of scale.
- Andrej Karpathy — The State of GPT (Microsoft Build 2023) — Covers scaling laws, training recipes, and how compute budgets inform modern LLM development decisions in practice.
Interview Questions
- ★★☆ Explain the Chinchilla scaling law. How does it differ from the original Kaplan scaling law?
- ★★☆ Why does Llama-2 70B outperform GPT-3 175B despite being smaller?
- ★★★ Derive the approximate training compute formula C = 6ND. Where does the 6 come from?
- ★★☆ What is quantization and how do FP16, INT8, and INT4 compare for inference?
- ★★★ Explain continuous batching and PagedAttention. Why are they critical for LLM serving?
- ★★☆ What is the learning rate schedule for training large language models? Why warmup + cosine decay?