
Transformer Math

Module 38 · Applications

🔍 RAG & Retrieval

RAG reduced hallucination from 27% to 4% in one production system


RAG (Retrieval-Augmented Generation) grounds LLM outputs in real data. Instead of relying on memorized knowledge, the model retrieves relevant documents before answering.

The diagram below shows both the indexing pipeline (how documents get into the vector database) and the query pipeline (how a question finds its answer).

RAG pipeline (diagram)
Indexing: Documents → Chunk → Embed → Vector DB
Query time: Query → Embed → Vector Search → Top-k Docs → Rerank → Context Stuff → LLM → Answer
💡

The Intuition

Walk one query through the full pipeline. User asks: “What is our refund policy?”

  1. Embed the query — convert the question into a dense embedding vector (e.g., 1,536 dims for ada-002)
  2. ANN search against 10K document chunks → top 20 candidates in <50 ms
  3. Cross-encoder reranks top 20 → top 3 most relevant chunks
  4. Stuff top 3 chunks into the LLM prompt as context
  5. LLM generates an answer grounded in the retrieved documents

Without RAG: LLM hallucinates a plausible-sounding but incorrect refund policy from training data. With RAG: LLM quotes the actual policy document, with a citation to the source chunk.

Think of it like an open-book exam: instead of memorizing everything, the model looks up the relevant pages before answering.

ColBERT v2 / Late Interaction: Standard dense retrieval compresses an entire document into a single vector, discarding per-token detail. ColBERT keeps per-token embeddings for both query and document, then scores each query token against every document token and takes the maximum — a MaxSim operation. This late-interaction pattern achieves significantly better retrieval quality than single-vector bi-encoders, at the cost of storing one vector per token rather than one per document; ColBERT v2 compresses those token embeddings aggressively (residual compression), making it practical at scale without sacrificing the quality gain.
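A minimal sketch of the MaxSim idea under stated assumptions: queryTokens and docTokens are hypothetical arrays of per-token embedding vectors (assumed L2-normalized), and the score is the sum over query tokens of each token's best match in the document. This is illustrative, not the actual ColBERT implementation.

typescript
// MaxSim late-interaction scoring (illustrative sketch, not the real ColBERT code).
function maxSimScore(queryTokens: number[][], docTokens: number[][]): number {
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, ai, i) => sum + ai * b[i], 0);

  let score = 0;
  for (const q of queryTokens) {
    // Each query token is matched to its single best document token...
    let best = -Infinity;
    for (const d of docTokens) best = Math.max(best, dot(q, d));
    score += best; // ...and the per-token maxima are summed.
  }
  return score;
}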

RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval): Flat retrieval struggles with multi-hop questions that require combining information from multiple documents. RAPTOR (Sarthi et al., 2024) builds a tree of summaries: raw chunks are clustered by embedding similarity, each cluster is summarized by an LLM, those summaries are clustered and summarized again, and the process repeats until a single root summary covers the entire corpus. At query time, retrieval traverses this tree — matching against both high-level summaries (for broad context) and leaf chunks (for specific details). On multi-hop QA benchmarks like QuALITY and NarrativeQA, RAPTOR consistently outperforms flat retrieval, with the largest gains on questions that require synthesizing information across multiple source passages.
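A rough sketch of that build loop, assuming two hypothetical helpers (clusterByEmbedding and summarize) that are not from the paper's code; the real RAPTOR implementation differs in details such as the clustering method and stopping criteria.

typescript
// Hypothetical helpers: an embedding-based clusterer and an LLM summarization call.
declare function clusterByEmbedding(texts: string[]): Promise<string[][]>;
declare function summarize(texts: string[]): Promise<string>;

// Illustrative RAPTOR-style tree build: cluster, summarize, repeat until a single root.
async function buildSummaryTree(chunks: string[]): Promise<string[][]> {
  const levels: string[][] = [chunks];       // level 0 = raw leaf chunks
  let current = chunks;
  while (current.length > 1) {
    const clusters = await clusterByEmbedding(current);            // group similar texts
    current = await Promise.all(clusters.map(c => summarize(c)));  // one summary per cluster
    levels.push(current);                                          // next level of the tree
  }
  return levels; // retrieval can match against every level — summaries and leaf chunks alike
}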

Self-RAG — retrieving only when needed. Standard RAG always retrieves, even for questions the model already knows (“What is 2 + 2?”). Self-RAG (Asai et al., 2023) adds special reflection tokens that the model generates inline — token families for retrieval decisions, relevance judgments, and support verification — allowing it to critique its own outputs at each step. On open-domain QA benchmarks (PopQA, TriviaQA), Self-RAG outperforms standard RAG and retrieval-augmented Llama2-chat on factuality and citation accuracy — while skipping retrieval for lookups the model already handles correctly, cutting unnecessary retrieval calls by roughly 30%. The tradeoff: the model must be fine-tuned on datasets annotated with these reflection tokens, adding training cost.

Quick check

Trade-off

ColBERT keeps per-token embeddings and scores via MaxSim instead of a single document vector. What is the primary cost of this late-interaction approach?

Quick Check

Why does RAG reduce hallucination compared to a vanilla LLM?

Quick Check

Your team has 1,000 internal product manuals that change monthly. Which approach should you use first?

📐

Step-by-Step Derivation

Embedding Similarity — Cosine Distance

The core retrieval operation: find documents whose embeddings are closest to the query embedding. Cosine similarity measures the angle between two vectors: cos(θ) = (q · d) / (‖q‖ ‖d‖), where q is the query embedding and d a document embedding.

✨ Insight · Why cosine over dot product? Cosine normalizes by magnitude, so it measures direction (semantic meaning) regardless of vector length. In practice, most embedding models already L2-normalize their outputs, making cosine = dot product.
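A small sketch of the computation; with pre-normalized embeddings the denominator is 1, so cosine similarity reduces to a plain dot product, as the note above says.

typescript
// Cosine similarity between two embedding vectors (sketch).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// If embeddings are already L2-normalized, this equals the plain dot product.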

Chunking Strategies

How you split documents into chunks directly affects retrieval quality:

Strategy | Size | Best For
Fixed-size | 512 tokens | General purpose, simple
Semantic (paragraph) | Variable | Well-structured documents
Recursive | 256–1,024 tokens | Mixed content types
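As an illustration, a minimal fixed-size chunker with overlap — a sketch only: it splits on whitespace-separated words as a stand-in for tokens, whereas a real pipeline would count tokens with the embedding model's tokenizer.

typescript
// Fixed-size chunking with overlap (sketch — words stand in for tokens).
function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // final chunk reached
  }
  return chunks; // overlap keeps sentences that straddle a boundary retrievable
}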

Hybrid Search — Dense + Sparse

Combine embedding-based (dense) retrieval with keyword-based (sparse) retrieval for better coverage — for example, with a weighted score such as score(q, d) = α · dense_sim(q, d) + (1 − α) · bm25(q, d).

Alternatively, use reciprocal rank fusion (RRF): rank documents from each retriever, combine ranks.

💡 Tip · BM25 (Best Match 25) is the classic sparse retrieval algorithm — it scores documents by term frequency and inverse document frequency (TF-IDF variant). It excels at exact keyword matching where dense retrieval struggles (e.g., product codes, proper nouns).
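A sketch of reciprocal rank fusion over any number of ranked ID lists (e.g., one from dense search, one from BM25). The input lists are assumed to be ordered best-first, and k = 60 is the constant commonly used in the RRF literature.

typescript
// Reciprocal rank fusion: merge ranked lists of document IDs (sketch).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      // Contribution of this list: 1 / (k + rank), with rank starting at 1.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([docId]) => docId);
}
// Usage sketch: reciprocalRankFusion([denseResults, bm25Results]).slice(0, 20)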

Reranking — Cross-Encoder Precision

Bi-encoders (used for retrieval) encode query and document separately — fast, and document embeddings can be precomputed offline — but less accurate. Cross-encoders score the (query, document) pair jointly in a single forward pass — slower but much more precise.

⚠ Warning · Two-stage pattern: Retrieve top-20 with a fast bi-encoder, then rerank to top-5 with a cross-encoder. The cross-encoder sees both query and document together (via cross-attention), catching nuances the bi-encoder misses.
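The contrast as a sketch, with assumed interfaces (embed, dot, and crossEncoder are hypothetical): the bi-encoder's document side can be indexed ahead of time, while the cross-encoder must run once per candidate at query time.

typescript
// Assumed interfaces, for illustration only.
declare function embed(text: string): Promise<number[]>;
declare function dot(a: number[], b: number[]): number;
declare const crossEncoder: { score(query: string, doc: string): Promise<number> };

// Bi-encoder: query and document embedded independently; document embeddings are
// precomputed offline, so query-time scoring is just a dot product.
const biEncoderScore = async (query: string, doc: string) =>
  dot(await embed(query), await embed(doc));

// Cross-encoder: one forward pass over the joint pair — cross-attention sees both
// texts, so it is far more precise but must run per candidate at query time.
const crossEncoderScore = async (query: string, doc: string) =>
  crossEncoder.score(query, doc);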

End-to-End RAG Pipeline — Illustrative TypeScript

The four stages in code: embed → retrieve → rerank → generate. Each stage is independently replaceable — swap the vector DB, change the reranker model, or adjust topK without touching the others.

rag.ts

typescript
async function rag(query: string) {
  // 1. Embed query — same model used at index time
  const queryEmb = await embed(query);

  // 2. Retrieve top-k candidate chunks via ANN search (~5-50ms)
  const chunks = await vectorDB.search(queryEmb, { topK: 20 });

  // 3. Rerank — cross-encoder sees each full (query, doc) pair
  const reranked = await reranker.rank(query, chunks);

  // 4. Generate with the best-ranked chunks as context
  const context = reranked.slice(0, 3).map(c => c.text).join('\n\n');
  return llm.chat(`Context:\n${context}\n\nQuestion: ${query}`);
}

RAG Evaluation — Three Dimensions

Evaluate retrieval and generation separately — a good retriever with a bad generator needs a different fix:

Retrieval Metrics

Recall@k — did a relevant chunk appear in the top-k results? MRR (mean reciprocal rank) — how high did the first relevant chunk rank? These isolate the retriever's performance from the generator's.

Generation Metrics

Faithfulness — is every claim traceable to a retrieved chunk? Use NLI (natural language inference) models to check entailment. Relevance — does the answer actually address the query? Completeness — does it cover all aspects?
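A sketch of the two retrieval-side metrics above, computed per query from the IDs of retrieved chunks and a set of ground-truth relevant chunk IDs (hypothetical inputs); averaging over a query set gives the reported numbers.

typescript
// Recall@k: fraction of relevant chunks that appear in the top-k retrieved results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = new Set(retrieved.slice(0, k));
  const hits = [...relevant].filter(id => topK.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Reciprocal rank: 1 / (position of the first relevant chunk), 0 if none retrieved.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex(id => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}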

Hallucination Detection Pipeline

python
# Claim-level hallucination check (sketch; extract_claims wraps an LLM call,
# check_entailment wraps an NLI model)
claims = extract_claims(response)                          # LLM extracts atomic claims
verdicts = [check_entailment(c, chunks) for c in claims]   # entailed / contradicted / neutral
hallucination_rate = verdicts.count("contradicted") / len(claims)
💡 Tip · Tools like RAGAS automate faithfulness + relevance scoring using LLM-as-judge. Key insight: always evaluate retrieval and generation independently — fixing the wrong component wastes time.
⚠ Warning · Lost in the middle: LLMs attend poorly to information buried in the middle of a long context. Place your highest-ranked retrieved chunk first or last, not in the middle of the stuffed context.
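One simple mitigation, as a sketch: interleave the reranked chunks so the strongest candidates sit at the beginning and the end of the context rather than in the middle.

typescript
// Reorder ranked chunks so the best ones land at the edges of the context window
// (sketch of one way to counter the "lost in the middle" effect).
function orderForContext<T>(rankedChunks: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedChunks.forEach((chunk, i) => {
    // Alternate: rank 1 → front, rank 2 → back, rank 3 → front, ...
    (i % 2 === 0 ? front : back).push(chunk);
  });
  return [...front, ...back.reverse()]; // e.g. ranks [1, 3, 5, 4, 2] for five chunks
}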

Quick check

Trade-off

Why does a RAG pipeline use a bi-encoder for retrieval and a cross-encoder for reranking, rather than using the cross-encoder for both steps?

🔧

Break It — See What Happens

Remove retrieval (vanilla LLM, no RAG)
Stale vector index (embeddings not updated)
Bad chunking (10K-token chunks)
No reranking (embedding similarity only)

Quick check

Trade-off

You remove the reranking step from a RAG pipeline. Which failure mode is most likely?

📊

Real-World Numbers

Metric | Value
Embedding dimension (OpenAI ada-002) | 1,536
Embedding dimension (typical range) | 384–3,072
Recommended chunk size | 256–1,024 tokens
Top-k retrieval | 3–10 documents
Vector DB latency (10M vectors) | ~5–50 ms

Quick check

Derivation

You store ada-002 embeddings (1,536 dims, float32) for 1M documents. What is the approximate raw vector size?

🧠

Key Takeaways

What to remember for interviews

  1. RAG grounds LLM outputs in retrieved documents, dramatically reducing hallucination: the model quotes actual source text rather than generating plausible-sounding facts from memorized training data.
  2. The two-stage pipeline separates indexing (chunking, embedding, storing) from querying (embed query, ANN search, rerank, stuff context) — evaluate and optimize each stage independently.
  3. Hybrid retrieval (dense embeddings + BM25 sparse) outperforms either alone: dense search captures semantic similarity while BM25 handles exact keyword and rare-term matching.
  4. Cross-encoder reranking rescores the top-20 bi-encoder candidates jointly (query + document together), achieving far higher precision at the cost of extra latency — the standard two-stage pattern.
  5. The 'lost in the middle' effect means LLMs attend poorly to information placed in the middle of long contexts — place the most relevant retrieved chunks at the beginning or end, not buried in the middle.
🧠

Recap quiz

🧠

RAG recap

Trade-off

ColBERT v2 stores per-token embeddings instead of a single document vector. What is the storage cost trade-off versus a standard bi-encoder?

Trade-off

You retrieve five chunks for a RAG answer. The most relevant chunk is chunk #3 (middle of five). What should you do before stuffing them into the prompt?

Derivation

Your RAG pipeline uses 10,000-token chunks. Retrieval recall looks good but answer quality is poor. What is the most likely root cause?

Trade-off

A production RAG system retrieves top-20 candidates in <50ms, then applies a cross-encoder reranker. Why retrieve 20 before reranking to 5 instead of just retrieving 5 directly?

Derivation

Self-RAG reduces unnecessary retrieval calls by ~30% compared to always-retrieve RAG. What mechanism enables this?

Derivation

You store OpenAI ada-002 embeddings (1,536 dims, float32) for 10M documents. What is the raw vector storage requirement, ignoring index overhead?

Trade-off

An interviewer asks: “When should you use RAG vs fine-tuning?” Which answer is most defensible?

📚

Further Reading

🎯

Interview Questions


Design a RAG system for 10M documents with sub-second latency.

★★★
Google · Databricks

What's the bottleneck — retrieval or generation?

How do you choose chunk size? What are the tradeoffs?

★★☆
Databricks

Compare dense retrieval vs sparse retrieval (BM25). When to use each?

★★☆
Google · Databricks

How would you evaluate RAG quality? What metrics?

★★★
Anthropic · OpenAI

What is the 'lost in the middle' problem for long-context RAG?

★★☆
Anthropic

When should you use RAG vs fine-tuning vs larger context window?

★★★
Google · OpenAI

If you have 1000 product manuals, which approach would you try first?