
Transformer Math

Module 13 · Training

🗃️ Data Curation

LIMA trained on 1,000 curated examples and rivaled far larger instruction-tuning sets


The most important ingredient in training an LLM is not the model architecture or the optimizer — it is the data. Public examples like FineWeb show how a huge raw crawl can be compressed into a much smaller high-quality corpus, and LIMA shows that 1,000 curated examples can rival far larger sloppy SFT sets. This module covers the pipeline that turns the raw internet into a training dataset.

🧹

The Data Pipeline

What you are seeing

Each stage of the pipeline narrows the data volume. The bar width represents the approximate token count remaining after filtering (illustrative — exact stage ratios vary by dataset and thresholds). FineWeb's confirmed final size is 15T tokens.

Raw Web Crawl: ~200T+ tokens (illustrative)
→ Language Filter: ~50T tokens
→ Quality Filter (perplexity): ~20T tokens
→ Deduplication (MinHash): ~15T tokens
→ Safety Filter: ~15T tokens
→ tokenize + pack → Training Data: ~15T tokens (~7.5% retained)
🎮

Data Curation Pipeline

What you are seeing

A data curation pipeline similar to FineWeb. Raw web crawl enters at the top; each stage filters or transforms the data. Click any stage to see details. In this illustrative setup, roughly 92% of the raw data is discarded before the final dataset.

What to try: Click each stage to see what it filters and why. Pay attention to how much data is removed at each step.

Data retention: ~15T final / ~200T+ raw (illustrative)

In this example, a little over 92% of raw web data is filtered out before pre-training

💡

The Intuition

Data quality beats data quantity. Microsoft's Phi series showed that a 1.3B model trained on "textbook quality" synthetic data can be surprisingly competitive with much larger models trained on noisier web data for some tasks. The data distribution is part of the model recipe.

The FineWeb pipeline (HuggingFace, 2024) takes Common Crawl through language detection, quality scoring, and deduplication. Each stage removes a large fraction of the data. In the FineWeb paper, this curated pipeline performs strongly relative to earlier open corpora on downstream evaluations.

Deduplication is critical. The web is full of near-identical pages (templates, scraped content, mirrors). MinHash with LSH (Locality-Sensitive Hashing) efficiently finds near-duplicate documents by comparing their n-gram fingerprints. Without dedup, the model memorizes repeated content instead of learning general patterns.

Mixing ratios matter. Exact ratios are proprietary, but Meta reports that Llama-3 significantly increased the share of code and math data compared to Llama 2. Changing domain ratios significantly shifts model capabilities — for example, more code often helps code generation and some reasoning tasks, while books and reference text often help factual coverage.

Synthetic data is increasingly important. Self-Instruct generates instruction data from an LLM. Phi-1.5 uses synthetic textbook-style data. But model collapse limits how far you can go — training on model outputs narrows the distribution over generations.

✨ Insight · A small, carefully curated SFT set (LIMA's 1,000 examples) can be competitive with much larger low-quality instruction sets. Pre-training already teaches most of the knowledge; SFT mainly teaches format and style. Quality of alignment data can matter more than raw example count.

Model Collapse from Synthetic Data: Shumailov et al. (2023) demonstrated that training iteratively on model-generated data causes model collapse — a pathological narrowing of the output distribution over successive generations. The intuition: each generative model is a lossy compressor of its training distribution. When that compressed output becomes the next generation's training data, low-probability tails are discarded, and the model concentrates probability mass on an ever-shrinking region. After just 5–9 self-training iterations the distribution can degenerate into near-constant outputs. The fix is to keep mixing real data into every generation of training. Follow-up work (Gerstgrasser et al., 2024) suggests that even a small fraction of real data can anchor the distribution and prevent collapse, though the exact threshold depends on the task and generation process.
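
A minimal sketch of the tail-loss mechanism (a toy simulation, not the paper's LLM setup): a long-tailed token distribution is repeatedly re-estimated from its own finite samples, and its support shrinks generation after generation. The vocabulary size, sample size, and mixing fraction below are illustrative assumptions.

python
import numpy as np

rng = np.random.default_rng(0)

# "Real" distribution: Zipfian over a 1,000-token vocabulary (long tail)
VOCAB = 1_000
real_probs = 1.0 / np.arange(1, VOCAB + 1)
real_probs /= real_probs.sum()

def surviving_tokens(real_fraction: float, sample_size: int = 5_000, generations: int = 9) -> list[int]:
    """Each generation: sample from the current model, re-estimate by counting.
    Rare tokens that draw zero counts disappear; mixed-in real data re-seeds the tail."""
    probs = real_probs.copy()
    support = []
    for _ in range(generations):
        n_real = int(real_fraction * sample_size)
        sample = np.concatenate([
            rng.choice(VOCAB, size=sample_size - n_real, p=probs),  # model output
            rng.choice(VOCAB, size=n_real, p=real_probs),           # anchored real data
        ])
        counts = np.bincount(sample, minlength=VOCAB)
        probs = counts / counts.sum()                               # next "model"
        support.append(int((probs > 0).sum()))
    return support

print("pure self-training  :", surviving_tokens(0.0))   # support shrinks every generation
print("10% real data mixed :", surviving_tokens(0.1))   # tail keeps getting re-seeded

Under pure self-training a token that draws zero counts can never return, which is exactly the tail collapse described above; even a small slice of real data keeps re-introducing the tail each generation.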

Automatic Domain Weight Optimization (DoReMi): Xie et al. (2023) introduced DoReMi, which removes the guesswork from choosing domain mixing ratios. A small proxy model (280M parameters) is first trained on a uniform mix. Its per-domain excess loss — how much worse it does on each domain compared to a reference model — becomes the domain weight signal. Domains where the proxy struggles are upweighted; easy domains are downweighted. These weights then govern full-scale training. DoReMi showed consistent improvements over hand-tuned ratios across model sizes, and crucially, the weights transfer: ratios found with the 280M proxy improved 8B training runs.
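
A sketch of the weight update at the heart of DoReMi, using made-up per-domain losses for the proxy and reference models. The real method recomputes excess loss continuously inside proxy training (a Group-DRO loop); this shows a single exponentiated-gradient step on fixed numbers.

python
import numpy as np

# Hypothetical per-domain per-token losses (assumed numbers, not from the paper)
DOMAINS = ["web", "code", "books", "wiki", "other"]
proxy_loss = np.array([2.9, 3.6, 2.7, 2.5, 3.1])
reference_loss = np.array([2.8, 3.1, 2.6, 2.5, 3.0])

def doremi_step(weights, proxy, reference, lr=1.0, smoothing=1e-2):
    """One exponentiated-gradient update of the domain weights.
    In DoReMi this runs inside proxy training, with excess loss recomputed
    at every step as the proxy improves."""
    excess = np.clip(proxy - reference, 0.0, None)   # how much the proxy struggles per domain
    weights = weights * np.exp(lr * excess)          # upweight hard domains
    weights = weights / weights.sum()
    return (1 - smoothing) * weights + smoothing / len(weights)  # keep every domain sampled

w = np.ones(len(DOMAINS)) / len(DOMAINS)             # start from a uniform mix
w = doremi_step(w, proxy_loss, reference_loss)
for name, weight in zip(DOMAINS, w):
    print(f"{name:6s} {weight:.3f}")                 # "code" gets the largest upweight here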

Quick check

Trade-off

If SFT data quality beats quantity (LIMA result), why can't you use just 100 examples instead of 1,000?

Quick Check

Why does LIMA achieve good alignment with only 1,000 examples?

📐

Key Formulas

TF-IDF Quality Scoring

Quality filters often score documents by similarity to a reference corpus (e.g., Wikipedia). TF-IDF weights terms by frequency and rarity:

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}

tf(t, d) = frequency of term t in document d. df(t) = number of documents containing t. N = total number of documents in the corpus. Documents with high cosine similarity to reference articles are kept.
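
A small sketch of reference-similarity quality scoring with scikit-learn (which uses a smoothed IDF variant of the formula above). The reference documents, candidates, and threshold are illustrative assumptions, not FineWeb's actual filter.

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy reference corpus standing in for Wikipedia-style text (assumed examples)
reference_docs = [
    "The mitochondrion is an organelle that produces energy in eukaryotic cells.",
    "The French Revolution was a period of political and social upheaval in France.",
]
candidate_docs = [
    "Photosynthesis converts light energy into chemical energy in plant cells.",
    "CLICK HERE best deals cheap deals buy now limited offer deals deals",
]

# Fit TF-IDF on the reference corpus, then score candidates by cosine similarity
vectorizer = TfidfVectorizer()
reference_matrix = vectorizer.fit_transform(reference_docs)
candidate_matrix = vectorizer.transform(candidate_docs)

# Max similarity to any reference document is a crude quality score
scores = cosine_similarity(candidate_matrix, reference_matrix).max(axis=1)
for doc, score in zip(candidate_docs, scores):
    keep = score > 0.05   # assumed threshold; real pipelines tune this on held-out labels
    print(f"score={score:.3f} keep={keep} :: {doc[:50]}")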

MinHash Similarity (Deduplication)

MinHash estimates Jaccard similarity between document n-gram sets without comparing all pairs:

J(A, B) = \frac{|A \cap B|}{|A \cup B|} \;\approx\; \hat{J} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\big[\, h_i^{\min}(A) = h_i^{\min}(B) \,\big]

A and B are the n-gram sets of two documents. k independent hash functions produce a compact signature (the per-hash minima h_i^min). Documents with \hat{J} \ge 0.8 are flagged as near-duplicates. LSH (Locality-Sensitive Hashing) groups similar signatures into buckets for efficient lookup — O(n) instead of O(n²).

💡 Tip · This catches template pages, mirrors, and scraped content while preserving legitimate similar-but-different documents.
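
A quick numeric check of the estimator using the datasketch library (also used in the Python pipeline below): compare the exact Jaccard similarity of two near-duplicate sentences against the k = 128 MinHash estimate. The example texts are made up.

python
from datasketch import MinHash

def shingles(text: str, n: int = 5) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Two near-duplicate documents (assumed toy examples)
doc_a = ("large language models are trained on web text that is filtered "
         "deduplicated and mixed across domains before training begins")
doc_b = doc_a.replace("web text", "internet text")

set_a, set_b = shingles(doc_a), shingles(doc_b)
true_jaccard = len(set_a & set_b) / len(set_a | set_b)

def minhash_of(s: set[str], num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for gram in s:
        m.update(gram.encode("utf-8"))
    return m

estimate = minhash_of(set_a).jaccard(minhash_of(set_b))
print(f"true Jaccard = {true_jaccard:.3f}, MinHash estimate = {estimate:.3f}")
# Estimator std is sqrt(J(1-J)/k), so doubling k from 128 to 256 halves the variance.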

Data Mixing Ratios

The training data is a weighted mixture of domains. The effective distribution at each training step:

P(x) = \sum_i w_i \, P_i(x), \qquad \sum_i w_i = 1

w_i is the sampling weight for domain i (web, code, books, etc.) and P_i is the document distribution within that domain. DoReMi (2023) optimizes these weights automatically using a small proxy model, upweighting domains where the model struggles.
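
A minimal sketch of sampling from the mixture P(x): pick a domain with probability w_i, then draw a document from that domain. The weights and toy document pools are assumptions; production data loaders work on tokenized sequences and enforce per-domain epoch caps.

python
import random

# Assumed domain weights (w_i) and toy per-domain document pools
DOMAIN_WEIGHTS = {"web": 0.5, "code": 0.15, "books": 0.15, "wiki": 0.1, "other": 0.1}
pools = {d: [f"{d}-doc-{i}" for i in range(1_000)] for d in DOMAIN_WEIGHTS}

def sample_batch(batch_size: int = 8) -> list[str]:
    """Draw each example from domain i with probability w_i, then uniformly
    from that domain's pool: P(x) = sum_i w_i * P_i(x)."""
    domains = random.choices(
        population=list(DOMAIN_WEIGHTS), weights=list(DOMAIN_WEIGHTS.values()), k=batch_size
    )
    return [random.choice(pools[d]) for d in domains]

random.seed(0)
print(sample_batch())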

Pseudo-code: Data Filtering Pipeline

typescript
type RawDocument = {
  id: string;
  text: string;
  langConfidence: number;
  symbolRatio: number;
  repetitionRatio: number;
  domain: "web" | "code" | "books" | "wiki" | "other";
};

function curateDataset(rawDocuments: RawDocument[]) {
  // Stage 1: language detection
  let docs = rawDocuments.filter(
    (doc) =>
      fastTextLangDetect(doc.text) === "en" && doc.langConfidence > 0.65,
  );

  // Stage 2: quality filtering
  const qualityModel = loadClassifier("quality-scorer");
  docs = docs.filter(
    (doc) =>
      qualityModel.score(doc.text) > 0.5 &&
      doc.text.split(/\s+/).length > 50 &&
      doc.symbolRatio < 0.1 &&
      doc.repetitionRatio < 0.3,
  );

  // Stage 3: deduplication (MinHash + LSH)
  const minhashIndex = new MinHashLSH({ threshold: 0.8, numPerm: 128 });
  const uniqueDocs: RawDocument[] = [];
  for (const doc of docs) {
    const signature = computeMinhash(doc.text, { ngram: 5 });
    if (!minhashIndex.query(signature).length) {
      minhashIndex.insert(doc.id, signature);
      uniqueDocs.push(doc);
    }
  }

  // Stage 4: benchmark decontamination
  docs = uniqueDocs.filter(
    (doc) => !has13GramOverlap(doc.text, BENCHMARK_SET),
  );

  // Stage 5: domain mixing
  return sampleByDomain(docs, {
    web: 0.5,
    code: 0.15,
    books: 0.15,
    wiki: 0.1,
    other: 0.1,
  });
}
Python: HuggingFace datasets dedup + filter pipeline

Data curation pipeline — dedup + quality filter (Python)

python
from datasets import load_dataset
from datasketch import MinHash, MinHashLSH
from langid.langid import LanguageIdentifier, model as langid_model

# langid's plain classify() returns an unnormalized log-probability, so build an
# identifier with norm_probs=True to get a 0-1 confidence we can threshold
lang_identifier = LanguageIdentifier.from_modelstring(langid_model, norm_probs=True)

# ── 1. Load a slice of Common Crawl via HuggingFace ──────────────────────────
dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# ── 2. Quality filter: length + language ─────────────────────────────────────
def quality_filter(example: dict) -> bool:
    text: str = example["text"]
    # Minimum length heuristic (FineWeb uses ~200-word threshold)
    if len(text.split()) < 100:
        return False
    # Language detection via langid (fasttext lid.176 preferred in production)
    lang, confidence = lang_identifier.classify(text)
    if lang != "en" or confidence < 0.9:
        return False
    # Symbol ratio heuristic
    alpha_chars = sum(c.isalpha() for c in text)
    if len(text) > 0 and alpha_chars / len(text) < 0.6:
        return False
    return True

filtered = dataset.filter(quality_filter)

# ── 3. MinHash LSH deduplication ──────────────────────────────────────────────
# Settings: 5-gram shingles, 128 hash functions, Jaccard threshold 0.8
LSH_THRESHOLD = 0.8
NUM_PERM = 128  # hash functions (k)
NGRAM_SIZE = 5

lsh = MinHashLSH(threshold=LSH_THRESHOLD, num_perm=NUM_PERM)
unique_ids: set[str] = set()


def make_minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    tokens = text.lower().split()
    for i in range(len(tokens) - NGRAM_SIZE + 1):
        ngram = " ".join(tokens[i : i + NGRAM_SIZE])
        m.update(ngram.encode("utf-8"))
    return m


def dedup_filter(example: dict) -> bool:
    doc_id: str = example.get("id", example["url"])
    m = make_minhash(example["text"])
    candidates = lsh.query(m)
    if candidates:
        return False  # near-duplicate found — discard
    lsh.insert(doc_id, m)
    unique_ids.add(doc_id)
    return True


deduped = filtered.filter(dedup_filter)

# ── 4. Save ───────────────────────────────────────────────────────────────────
# Convert streaming → materialized dataset and save to disk
import os
from datasets import Dataset

rows = list(deduped.take(100_000))  # sample 100K for demo
output = Dataset.from_list(rows)
output.save_to_disk(os.path.join("/tmp", "curated_fineweb_sample"))
print(f"Saved {len(output):,} documents after quality filter + dedup")

Quick check

Derivation

MinHash uses k=128 hash functions to estimate Jaccard similarity. Doubling to k=256 hash functions cuts estimation variance by how much?

🔧

Break It — See What Happens

No deduplication
All web data, no quality filtering

Quick check

Trade-off

Without deduplication, a large web crawl has ~50% near-duplicate pages. Which downstream effect is most harmful for a deployed model?

📊

Real-World Numbers

Selected public datasets and recipes, with estimates and comparative language called out explicitly.

Dataset | Size | Source | Notable
FineWeb | ~15T tokens | 96 Common Crawl dumps | HuggingFace 2024, open corpus with a detailed filtering pipeline
The Pile | ~825 GB | 22 diverse sources | EleutherAI, pioneered diverse domain mixing
RedPajama v2 | 30T tokens | Common Crawl + curated | Together AI, open reproduction of Llama data
Llama-3 data | 15T+ tokens (per Meta blog) | Proprietary pipeline | 4× more code than Llama 2 (per Meta blog); math upsampling reported in later Herd paper
LIMA (SFT) | 1,000 examples | Manually curated | Matched 52K Alpaca with ~50× less data
Phi-1.5 (synthetic) | ~30B tokens | GPT-3.5 generated | 1.3B model matched Llama-7B on some benchmarks
✨ Insight · The trend is clear: datasets are growing (300B to 15T tokens in 3 years), but filtering is also getting stricter. More raw data in does not automatically mean better pretraining data out. The pipeline — not just the crawl — is a major competitive advantage.

Quick check

Trade-off

The table shows LIMA (1K examples) matched Alpaca (52K examples). If you scale LIMA to 10K high-quality examples, what should you expect?

🧠

Key Takeaways

What to remember for interviews

  1. Data quality can beat raw data quantity: public pipelines like FineWeb show how a huge crawl can be compressed into a much smaller curated corpus, with roughly 92% of raw tokens filtered out in this illustrative example.
  2. Deduplication is critical: without it, models memorize repeated content, benchmark contamination rises, and generalization suffers. MinHash+LSH finds near-duplicates in O(n) instead of O(n²).
  3. Domain mixing ratios significantly shift model capabilities — more code often helps code generation and reasoning, more books help factual knowledge. DoReMi automates ratio selection using a small proxy model.
  4. Synthetic data can supplement but not replace real data: iterative self-training causes model collapse (Shumailov et al. 2023), narrowing the output distribution unless real data is mixed in.
  5. LIMA showed that 1,000 carefully curated SFT examples can rival much larger low-quality instruction sets — pre-training already teaches knowledge, and SFT mostly teaches response format and style.
📚

Further Reading

🎯

Interview Questions


Design a data pipeline for pre-training a 70B parameter LLM. What stages would you include and why?

★★★
Google · Meta

Why is deduplication critical for LLM training? What happens without it?

★★☆
Google · OpenAI

Can synthetic data replace real data for pre-training? What are the limits?

★★☆
Meta · OpenAI

How do data mixing ratios affect model performance? How would you determine the optimal mix?

★★★
Google · Meta

What is benchmark contamination and how do you detect and prevent it?

★★☆
OpenAI · Anthropic

Explain LIMA's finding. Why can 1,000 carefully curated SFT examples match 100K low-quality ones?

★★☆
Meta · Google
🧠

Recap quiz

Derivation

FineWeb starts from Common Crawl and ends up with ~15T tokens. If the raw crawl is ~200T tokens, what fraction is retained?

Trade-off

LIMA fine-tunes a 65B model on 1,000 examples and roughly matches Alpaca (52K examples). What is the primary reason this works?

Trade-off

MinHash deduplication sets Jaccard threshold J=0.8. What happens if you lower the threshold to J=0.5?

Derivation

Why does iterative self-training (training on model-generated data repeatedly) cause model collapse even if each generation looks plausible?

Derivation

DoReMi uses a 280M proxy model to set domain mixing weights for an 8B run. What is the core signal it exploits?

Trade-off

A 70B model is trained on a mix including 15% code. An ablation shows adding more code improves math benchmarks but hurts open-ended conversation. What explains the tradeoff?

Trade-off

Benchmark contamination can inflate evaluation scores. Which detection method is most reliable at scale?
