
Transformer Math

Module 13 · Training

🗃️ Data Curation

LIMA trained on 1,000 curated examples and rivaled far larger instruction-tuning sets


The most important ingredient in training an LLM is not the model architecture or the optimizer — it is the data. Public examples like FineWeb show how a huge raw crawl can be compressed into a much smaller high-quality corpus, and LIMA shows that 1,000 curated examples can rival far larger sloppy SFT sets. This module covers the pipeline that turns the raw internet into a training dataset.

🧹

The Data Pipeline

What you are seeing

Each stage of the pipeline narrows the data volume. The bar width represents the approximate token count remaining after filtering (illustrative — exact stage ratios vary by dataset and thresholds). FineWeb's confirmed final size is 15T tokens.

Raw Web Crawl: ~200T+ tokens (illustrative)
→ Language Filter: ~50T tokens
→ Quality Filter (perplexity): ~20T tokens
→ Deduplication (MinHash): ~15T tokens
→ Safety Filter: ~15T tokens
→ tokenize + pack → Training Data: ~15T tokens (~7.5% retained)
🎮

Data Curation Pipeline

What you are seeing

A data curation pipeline similar to FineWeb. Raw web crawl enters at the top; each stage filters or transforms the data. Click any stage to see details. In this illustrative setup, roughly 92% of the raw data is discarded before the final dataset.

What to try: Click each stage to see what it filters and why. Pay attention to how much data is removed at each step.

Data retention: ~15T final / ~200T+ raw (illustrative)

In this example, a little over 92% of raw web data is filtered out before pre-training

💡

The Intuition

Data quality beats data quantity. Microsoft's Phi series showed that a 1.3B model trained on "textbook quality" synthetic data can be surprisingly competitive with much larger models trained on noisier web data for some tasks. The data distribution is part of the model recipe.

The FineWeb pipeline (HuggingFace, 2024) takes Common Crawl through language detection, quality scoring, and deduplication. Each stage removes a large fraction of the data. In the FineWeb paper, this curated pipeline performs strongly relative to earlier open corpora on downstream evaluations.

Deduplication is critical. The web is full of near-identical pages (templates, scraped content, mirrors). MinHash with LSH (Locality-Sensitive Hashing) efficiently finds near-duplicate documents by comparing their n-gram fingerprints. Without dedup, the model memorizes repeated content instead of learning general patterns.

Mixing ratios matter. Exact ratios are proprietary, but Meta reports that Llama-3 significantly increased the share of code and math data compared to Llama 2. Changing domain ratios significantly shifts model capabilities — for example, more code often helps code generation and some reasoning tasks, while books and reference text often help factual coverage.

Synthetic data is increasingly important. Self-Instruct generates instruction data from an LLM. Phi-1.5 uses synthetic textbook-style data. But model collapse limits how far you can go — training on model outputs narrows the distribution over generations.

✨ Insight · A small, carefully curated SFT set (LIMA's 1,000 examples) can be competitive with much larger low-quality instruction sets. Pre-training already teaches most of the knowledge; SFT mainly teaches format and style. Quality of alignment data can matter more than raw example count.

Model Collapse from Synthetic Data: Shumailov et al. (2023) demonstrated that training iteratively on model-generated data causes model collapse — a pathological narrowing of the output distribution over successive generations. The intuition: each generative model is a lossy compressor of its training distribution. When that compressed output becomes the next generation's training data, low-probability tails are discarded, and the model concentrates probability mass on an ever-shrinking region. After just 5–9 self-training iterations the distribution can degenerate into near-constant outputs. The fix is to keep mixing real data into every generation of training. Follow-up work (Gerstgrasser et al., 2024) suggests that even a small fraction of real data can anchor the distribution and prevent collapse, though the exact threshold depends on the task and generation process.
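
A minimal sketch of the tail-loss mechanism (a toy simulation, not the paper's LLM setup): a long-tailed token distribution is repeatedly re-estimated from its own finite samples, and its support shrinks generation after generation. The vocabulary size, sample size, and mixing fraction below are illustrative assumptions.

python
import numpy as np

rng = np.random.default_rng(0)

# "Real" distribution: Zipfian over a 1,000-token vocabulary (long tail)
VOCAB = 1_000
real_probs = 1.0 / np.arange(1, VOCAB + 1)
real_probs /= real_probs.sum()

def surviving_tokens(real_fraction: float, sample_size: int = 5_000, generations: int = 9) -> list[int]:
    """Each generation: sample from the current model, re-estimate by counting.
    Rare tokens that draw zero counts disappear; mixed-in real data re-seeds the tail."""
    probs = real_probs.copy()
    support = []
    for _ in range(generations):
        n_real = int(real_fraction * sample_size)
        sample = np.concatenate([
            rng.choice(VOCAB, size=sample_size - n_real, p=probs),  # model output
            rng.choice(VOCAB, size=n_real, p=real_probs),           # anchored real data
        ])
        counts = np.bincount(sample, minlength=VOCAB)
        probs = counts / counts.sum()                               # next "model"
        support.append(int((probs > 0).sum()))
    return support

print("pure self-training  :", surviving_tokens(0.0))   # support shrinks every generation
print("10% real data mixed :", surviving_tokens(0.1))   # tail keeps getting re-seeded

Under pure self-training a token that draws zero counts can never return, which is exactly the tail collapse described above; even a small slice of real data keeps re-introducing the tail each generation.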

Automatic Domain Weight Optimization (DoReMi): Xie et al. (2023) introduced DoReMi, which removes the guesswork from choosing domain mixing ratios. A small proxy model (280M parameters) is first trained on a uniform mix. Its per-domain excess loss — how much worse it does on each domain compared to a reference model — becomes the domain weight signal. Domains where the proxy struggles are upweighted; easy domains are downweighted. These weights then govern full-scale training. DoReMi showed consistent improvements over hand-tuned ratios across model sizes, and crucially, the weights transfer: ratios found with the 280M proxy improved 8B training runs.
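
A sketch of the weight update at the heart of DoReMi, using made-up per-domain losses for the proxy and reference models. The real method recomputes excess loss continuously inside proxy training (a Group-DRO loop); this shows a single exponentiated-gradient step on fixed numbers.

python
import numpy as np

# Hypothetical per-domain per-token losses (assumed numbers, not from the paper)
DOMAINS = ["web", "code", "books", "wiki", "other"]
proxy_loss = np.array([2.9, 3.6, 2.7, 2.5, 3.1])
reference_loss = np.array([2.8, 3.1, 2.6, 2.5, 3.0])

def doremi_step(weights, proxy, reference, lr=1.0, smoothing=1e-2):
    """One exponentiated-gradient update of the domain weights.
    In DoReMi this runs inside proxy training, with excess loss recomputed
    at every step as the proxy improves."""
    excess = np.clip(proxy - reference, 0.0, None)   # how much the proxy struggles per domain
    weights = weights * np.exp(lr * excess)          # upweight hard domains
    weights = weights / weights.sum()
    return (1 - smoothing) * weights + smoothing / len(weights)  # keep every domain sampled

w = np.ones(len(DOMAINS)) / len(DOMAINS)             # start from a uniform mix
w = doremi_step(w, proxy_loss, reference_loss)
for name, weight in zip(DOMAINS, w):
    print(f"{name:6s} {weight:.3f}")                 # "code" gets the largest upweight here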

Quick check

Trade-off

If SFT data quality beats quantity (LIMA result), why can't you use just 100 examples instead of 1,000?

Quick Check

Why does LIMA achieve good alignment with only 1,000 examples?

📐

Key Formulas

TF-IDF Quality Scoring

Quality filters often score documents by similarity to a reference corpus (e.g., Wikipedia). TF-IDF weights terms by frequency and rarity:

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}

tf(t, d) = frequency of term t in document d. df(t) = number of documents containing t. N = total number of documents in the corpus. Documents with high cosine similarity to reference articles are kept.
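
A small sketch of reference-similarity quality scoring with scikit-learn (which uses a smoothed IDF variant of the formula above). The reference documents, candidates, and threshold are illustrative assumptions, not FineWeb's actual filter.

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy reference corpus standing in for Wikipedia-style text (assumed examples)
reference_docs = [
    "The mitochondrion is an organelle that produces energy in eukaryotic cells.",
    "The French Revolution was a period of political and social upheaval in France.",
]
candidate_docs = [
    "Photosynthesis converts light energy into chemical energy in plant cells.",
    "CLICK HERE best deals cheap deals buy now limited offer deals deals",
]

# Fit TF-IDF on the reference corpus, then score candidates by cosine similarity
vectorizer = TfidfVectorizer()
reference_matrix = vectorizer.fit_transform(reference_docs)
candidate_matrix = vectorizer.transform(candidate_docs)

# Max similarity to any reference document is a crude quality score
scores = cosine_similarity(candidate_matrix, reference_matrix).max(axis=1)
for doc, score in zip(candidate_docs, scores):
    keep = score > 0.05   # assumed threshold; real pipelines tune this on held-out labels
    print(f"score={score:.3f} keep={keep} :: {doc[:50]}")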

MinHash Similarity (Deduplication)

MinHash estimates Jaccard similarity between document n-gram sets without comparing all pairs:

J(A, B) = \frac{|A \cap B|}{|A \cup B|} \;\approx\; \hat{J} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\big[\, h_i^{\min}(A) = h_i^{\min}(B) \,\big]

A and B are the n-gram sets of two documents. k independent hash functions produce a compact signature (the per-hash minima h_i^min). Documents with \hat{J} \ge 0.8 are flagged as near-duplicates. LSH (Locality-Sensitive Hashing) groups similar signatures into buckets for efficient lookup — O(n) instead of O(n²).

💡 Tip · This catches template pages, mirrors, and scraped content while preserving legitimate similar-but-different documents.
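
A quick numeric check of the estimator using the datasketch library (also used in the Python pipeline below): compare the exact Jaccard similarity of two near-duplicate sentences against the k = 128 MinHash estimate. The example texts are made up.

python
from datasketch import MinHash

def shingles(text: str, n: int = 5) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Two near-duplicate documents (assumed toy examples)
doc_a = ("large language models are trained on web text that is filtered "
         "deduplicated and mixed across domains before training begins")
doc_b = doc_a.replace("web text", "internet text")

set_a, set_b = shingles(doc_a), shingles(doc_b)
true_jaccard = len(set_a & set_b) / len(set_a | set_b)

def minhash_of(s: set[str], num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for gram in s:
        m.update(gram.encode("utf-8"))
    return m

estimate = minhash_of(set_a).jaccard(minhash_of(set_b))
print(f"true Jaccard = {true_jaccard:.3f}, MinHash estimate = {estimate:.3f}")
# Estimator std is sqrt(J(1-J)/k), so doubling k from 128 to 256 halves the variance.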

Data Mixing Ratios

The training data is a weighted mixture of domains. The effective distribution at each training step:

P(x) = \sum_i w_i \, P_i(x), \qquad \sum_i w_i = 1

w_i is the sampling weight for domain i (web, code, books, etc.) and P_i is the document distribution within that domain. DoReMi (2023) optimizes these weights automatically using a small proxy model, upweighting domains where the model struggles.
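
A minimal sketch of sampling from the mixture P(x): pick a domain with probability w_i, then draw a document from that domain. The weights and toy document pools are assumptions; production data loaders work on tokenized sequences and enforce per-domain epoch caps.

python
import random

# Assumed domain weights (w_i) and toy per-domain document pools
DOMAIN_WEIGHTS = {"web": 0.5, "code": 0.15, "books": 0.15, "wiki": 0.1, "other": 0.1}
pools = {d: [f"{d}-doc-{i}" for i in range(1_000)] for d in DOMAIN_WEIGHTS}

def sample_batch(batch_size: int = 8) -> list[str]:
    """Draw each example from domain i with probability w_i, then uniformly
    from that domain's pool: P(x) = sum_i w_i * P_i(x)."""
    domains = random.choices(
        population=list(DOMAIN_WEIGHTS), weights=list(DOMAIN_WEIGHTS.values()), k=batch_size
    )
    return [random.choice(pools[d]) for d in domains]

random.seed(0)
print(sample_batch())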

Pseudo-code: Data Filtering Pipeline

typescript
type RawDocument = {
  id: string;
  text: string;
  langConfidence: number;
  symbolRatio: number;
  repetitionRatio: number;
  domain: "web" | "code" | "books" | "wiki" | "other";
};

function curateDataset(rawDocuments: RawDocument[]) {
  // Stage 1: language detection
  let docs = rawDocuments.filter(
    (doc) =>
      fastTextLangDetect(doc.text) === "en" && doc.langConfidence > 0.65,
  );

  // Stage 2: quality filtering
  const qualityModel = loadClassifier("quality-scorer");
  docs = docs.filter(
    (doc) =>
      qualityModel.score(doc.text) > 0.5 &&
      doc.text.split(/\s+/).length > 50 &&
      doc.symbolRatio < 0.1 &&
      doc.repetitionRatio < 0.3,
  );

  // Stage 3: deduplication (MinHash + LSH)
  const minhashIndex = new MinHashLSH({ threshold: 0.8, numPerm: 128 });
  const uniqueDocs: RawDocument[] = [];
  for (const doc of docs) {
    const signature = computeMinhash(doc.text, { ngram: 5 });
    if (!minhashIndex.query(signature).length) {
      minhashIndex.insert(doc.id, signature);
      uniqueDocs.push(doc);
    }
  }

  // Stage 4: benchmark decontamination
  docs = uniqueDocs.filter(
    (doc) => !has13GramOverlap(doc.text, BENCHMARK_SET),
  );

  // Stage 5: domain mixing
  return sampleByDomain(docs, {
    web: 0.5,
    code: 0.15,
    books: 0.15,
    wiki: 0.1,
    other: 0.1,
  });
}
Python: HuggingFace datasets dedup + filter pipeline

Data curation pipeline — dedup + quality filter (Python)

python
from datasets import load_dataset
from datasketch import MinHash, MinHashLSH
from langid.langid import LanguageIdentifier, model as langid_model

# langid's plain classify() returns an unnormalized log-probability, so build an
# identifier with norm_probs=True to get a 0-1 confidence we can threshold
lang_identifier = LanguageIdentifier.from_modelstring(langid_model, norm_probs=True)

# ── 1. Load a slice of Common Crawl via HuggingFace ──────────────────────────
dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# ── 2. Quality filter: length + language ─────────────────────────────────────
def quality_filter(example: dict) -> bool:
    text: str = example["text"]
    # Minimum length heuristic (FineWeb uses ~200-word threshold)
    if len(text.split()) < 100:
        return False
    # Language detection via langid (fasttext lid.176 preferred in production)
    lang, confidence = lang_identifier.classify(text)
    if lang != "en" or confidence < 0.9:
        return False
    # Symbol ratio heuristic
    alpha_chars = sum(c.isalpha() for c in text)
    if len(text) > 0 and alpha_chars / len(text) < 0.6:
        return False
    return True

filtered = dataset.filter(quality_filter)

# ── 3. MinHash LSH deduplication ──────────────────────────────────────────────
# Settings: 5-gram shingles, 128 hash functions, Jaccard threshold 0.8
LSH_THRESHOLD = 0.8
NUM_PERM = 128  # hash functions (k)
NGRAM_SIZE = 5

lsh = MinHashLSH(threshold=LSH_THRESHOLD, num_perm=NUM_PERM)
unique_ids: set[str] = set()


def make_minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    tokens = text.lower().split()
    for i in range(len(tokens) - NGRAM_SIZE + 1):
        ngram = " ".join(tokens[i : i + NGRAM_SIZE])
        m.update(ngram.encode("utf-8"))
    return m


def dedup_filter(example: dict) -> bool:
    doc_id: str = example.get("id", example["url"])
    m = make_minhash(example["text"])
    candidates = lsh.query(m)
    if candidates:
        return False  # near-duplicate found — discard
    lsh.insert(doc_id, m)
    unique_ids.add(doc_id)
    return True


deduped = filtered.filter(dedup_filter)

# ── 4. Save ───────────────────────────────────────────────────────────────────
# Convert streaming → materialized dataset and save to disk
import os
from datasets import Dataset

rows = list(deduped.take(100_000))  # sample 100K for demo
output = Dataset.from_list(rows)
output.save_to_disk(os.path.join("/tmp", "curated_fineweb_sample"))
print(f"Saved {len(output):,} documents after quality filter + dedup")

Quick check

Derivation

MinHash uses k=128 hash functions to estimate Jaccard similarity. Doubling to k=256 hash functions cuts estimation variance by how much?

🔧

Break It — See What Happens

No deduplication
All web data, no quality filtering

Quick check

Trade-off

Without deduplication, a large web crawl has ~50% near-duplicate pages. Which downstream effect is most harmful for a deployed model?

📊

Real-World Numbers

Selected public datasets and recipes, with estimates and comparative language called out explicitly.

Dataset | Size | Source | Notable
FineWeb | ~15T tokens | 96 Common Crawl dumps | HuggingFace 2024, open corpus with a detailed filtering pipeline
The Pile | ~825 GB | 22 diverse sources | EleutherAI, pioneered diverse domain mixing
RedPajama v2 | 30T tokens | Common Crawl + curated | Together AI, open reproduction of Llama data
Llama-3 data | 15T+ tokens (per Meta blog) | Proprietary pipeline | 4× more code than Llama 2 (per Meta blog); math upsampling reported in later Herd paper
LIMA (SFT) | 1,000 examples | Manually curated | Matched 52K Alpaca with ~50× less data
Phi-1.5 (synthetic) | ~30B tokens | GPT-3.5 generated | 1.3B model matched Llama-7B on some benchmarks
✨ Insight · The trend is clear: datasets are growing (300B to 15T tokens in 3 years), but filtering is also getting stricter. More raw data in does not automatically mean better pretraining data out. The pipeline — not just the crawl — is a major competitive advantage.

Quick check

Trade-off

The table shows LIMA (1K examples) matched Alpaca (52K examples). If you scale LIMA to 10K high-quality examples, what should you expect?

🧠

Key Takeaways

What to remember for interviews

  1. Data quality can beat raw data quantity: public pipelines like FineWeb show how a huge crawl can be compressed into a much smaller curated corpus, with roughly 92% of raw tokens filtered out in this illustrative example.
  2. Deduplication is critical: without it, models memorize repeated content, benchmark contamination rises, and generalization suffers. MinHash+LSH finds near-duplicates in O(n) instead of O(n²).
  3. Domain mixing ratios significantly shift model capabilities — more code often helps code generation and reasoning, more books help factual knowledge. DoReMi automates ratio selection using a small proxy model.
  4. Synthetic data can supplement but not replace real data: iterative self-training causes model collapse (Shumailov et al. 2023), narrowing the output distribution unless real data is mixed in.
  5. LIMA showed that 1,000 carefully curated SFT examples can rival much larger low-quality instruction sets — pre-training already teaches knowledge, and SFT mostly teaches response format and style.
📚

Further Reading

🎯

Interview Questions


Design a data pipeline for pre-training a 70B parameter LLM. What stages would you include and why?

★★★
Google · Meta

Why is deduplication critical for LLM training? What happens without it?

★★☆
Google · OpenAI

Can synthetic data replace real data for pre-training? What are the limits?

★★☆
Meta · OpenAI

How do data mixing ratios affect model performance? How would you determine the optimal mix?

★★★
Google · Meta

What is benchmark contamination and how do you detect and prevent it?

★★☆
OpenAI · Anthropic

Explain LIMA's finding. Why can 1,000 carefully curated SFT examples match 100K low-quality ones?

★★☆
Meta · Google
🧠

Recap quiz

Derivation

FineWeb starts from Common Crawl and ends up with ~15T tokens. If the raw crawl is ~200T tokens, what fraction is retained?

Trade-off

LIMA fine-tunes a 65B model on 1,000 examples and roughly matches Alpaca (52K examples). What is the primary reason this works?

Trade-off

MinHash deduplication sets Jaccard threshold J=0.8. What happens if you lower the threshold to J=0.5?

Derivation

Why does iterative self-training (training on model-generated data repeatedly) cause model collapse even if each generation looks plausible?

Derivation

DoReMi uses a 280M proxy model to set domain mixing weights for an 8B run. What is the core signal it exploits?

Trade-off

A 70B model is trained on a mix including 15% code. An ablation shows adding more code improves math benchmarks but hurts open-ended conversation. What explains the tradeoff?

Trade-off

Benchmark contamination can inflate evaluation scores. Which detection method is most reliable at scale?
