
Transformer Math

Module 38 · Applications

🔍 RAG & Retrieval

RAG reduced hallucination from 27% to 4% in one production system


RAG (Retrieval-Augmented Generation) grounds LLM outputs in real data. Instead of relying on memorized knowledge, the model retrieves relevant documents before answering.

The diagram below shows both the indexing pipeline (how documents get into the vector database) and the query pipeline (how a question finds its answer).

RAG pipeline (diagram)
Indexing: Documents → Chunk → Embed → Vector DB
Query time: Query → Embed → Vector Search → Top-k Docs → Rerank → Context Stuff → LLM → Answer
💡

The Intuition

Walk one query through the full pipeline. User asks: “What is our refund policy?”

  1. Embed the query — convert the question into a dense embedding vector (e.g., 1,536 dims for ada-002)
  2. ANN search against 10K document chunks → top 20 candidates in <50 ms
  3. Cross-encoder reranks top 20 → top 3 most relevant chunks
  4. Stuff top 3 chunks into the LLM prompt as context
  5. LLM generates an answer grounded in the retrieved documents

Without RAG: LLM hallucinates a plausible-sounding but incorrect refund policy from training data. With RAG: LLM quotes the actual policy document, with a citation to the source chunk.

Think of it like an open-book exam: instead of memorizing everything, the model looks up the relevant pages before answering.

ColBERT v2 / Late Interaction: Standard dense retrieval compresses an entire document into a single vector, discarding per-token detail. ColBERT keeps per-token embeddings for both query and document, then scores each query token against every document token and takes the maximum — a MaxSim operation. This late-interaction pattern achieves significantly better retrieval quality than single-vector bi-encoders, at the cost of storing one vector per token rather than one per document; ColBERT v2 compresses those token embeddings aggressively (residual compression), making it practical at scale without sacrificing the quality gain.
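A minimal sketch of the MaxSim idea under stated assumptions: queryTokens and docTokens are hypothetical arrays of per-token embedding vectors (assumed L2-normalized), and the score is the sum over query tokens of each token's best match in the document. This is illustrative, not the actual ColBERT implementation.

typescript
// MaxSim late-interaction scoring (illustrative sketch, not the real ColBERT code).
function maxSimScore(queryTokens: number[][], docTokens: number[][]): number {
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, ai, i) => sum + ai * b[i], 0);

  let score = 0;
  for (const q of queryTokens) {
    // Each query token is matched to its single best document token...
    let best = -Infinity;
    for (const d of docTokens) best = Math.max(best, dot(q, d));
    score += best; // ...and the per-token maxima are summed.
  }
  return score;
}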

RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval): Flat retrieval struggles with multi-hop questions that require combining information from multiple documents. RAPTOR (Sarthi et al., 2024) builds a tree of summaries: raw chunks are clustered by embedding similarity, each cluster is summarized by an LLM, those summaries are clustered and summarized again, and the process repeats until a single root summary covers the entire corpus. At query time, retrieval traverses this tree — matching against both high-level summaries (for broad context) and leaf chunks (for specific details). On multi-hop QA benchmarks like QuALITY and NarrativeQA, RAPTOR consistently outperforms flat retrieval, with the largest gains on questions that require synthesizing information across multiple source passages.
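A rough sketch of that build loop, assuming two hypothetical helpers (clusterByEmbedding and summarize) that are not from the paper's code; the real RAPTOR implementation differs in details such as the clustering method and stopping criteria.

typescript
// Hypothetical helpers: an embedding-based clusterer and an LLM summarization call.
declare function clusterByEmbedding(texts: string[]): Promise<string[][]>;
declare function summarize(texts: string[]): Promise<string>;

// Illustrative RAPTOR-style tree build: cluster, summarize, repeat until a single root.
async function buildSummaryTree(chunks: string[]): Promise<string[][]> {
  const levels: string[][] = [chunks];       // level 0 = raw leaf chunks
  let current = chunks;
  while (current.length > 1) {
    const clusters = await clusterByEmbedding(current);            // group similar texts
    current = await Promise.all(clusters.map(c => summarize(c)));  // one summary per cluster
    levels.push(current);                                          // next level of the tree
  }
  return levels; // retrieval can match against every level — summaries and leaf chunks alike
}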

Self-RAG — retrieving only when needed. Standard RAG always retrieves, even for questions the model already knows (“What is 2 + 2?”). Self-RAG (Asai et al., 2023) adds special reflection tokens that the model generates inline — token families for retrieval decisions, relevance judgments, and support verification — allowing it to critique its own outputs at each step. On open-domain QA benchmarks (PopQA, TriviaQA), Self-RAG outperforms standard RAG and retrieval-augmented Llama2-chat on factuality and citation accuracy — while skipping retrieval for lookups the model already handles correctly, cutting unnecessary retrieval calls by roughly 30%. The tradeoff: the model must be fine-tuned on datasets annotated with these reflection tokens, adding training cost.

Quick check

Trade-off

ColBERT keeps per-token embeddings and scores via MaxSim instead of a single document vector. What is the primary cost of this late-interaction approach?

Quick Check

Why does RAG reduce hallucination compared to a vanilla LLM?

Quick Check

Your team has 1,000 internal product manuals that change monthly. Which approach should you use first?

📐

Step-by-Step Derivation

Embedding Similarity — Cosine Distance

The core retrieval operation: find documents whose embeddings are closest to the query embedding. Cosine similarity measures the angle between two vectors: cos(θ) = (q · d) / (‖q‖ ‖d‖), where q is the query embedding and d a document embedding.

✨ Insight · Why cosine over dot product? Cosine normalizes by magnitude, so it measures direction (semantic meaning) regardless of vector length. In practice, most embedding models already L2-normalize their outputs, making cosine = dot product.
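A small sketch of the computation; with pre-normalized embeddings the denominator is 1, so cosine similarity reduces to a plain dot product, as the note above says.

typescript
// Cosine similarity between two embedding vectors (sketch).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// If embeddings are already L2-normalized, this equals the plain dot product.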

Chunking Strategies

How you split documents into chunks directly affects retrieval quality:

Strategy | Size | Best For
Fixed-size | 512 tokens | General purpose, simple
Semantic (paragraph) | Variable | Well-structured documents
Recursive | 256–1,024 tokens | Mixed content types
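As an illustration, a minimal fixed-size chunker with overlap — a sketch only: it splits on whitespace-separated words as a stand-in for tokens, whereas a real pipeline would count tokens with the embedding model's tokenizer.

typescript
// Fixed-size chunking with overlap (sketch — words stand in for tokens).
function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // final chunk reached
  }
  return chunks; // overlap keeps sentences that straddle a boundary retrievable
}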

Hybrid Search — Dense + Sparse

Combine embedding-based (dense) retrieval with keyword-based (sparse) retrieval for better coverage — for example, with a weighted score such as score(q, d) = α · dense_sim(q, d) + (1 − α) · bm25(q, d).

Alternatively, use reciprocal rank fusion (RRF): rank documents from each retriever, combine ranks.

💡 Tip · BM25 (Best Match 25) is the classic sparse retrieval algorithm — it scores documents by term frequency and inverse document frequency (TF-IDF variant). It excels at exact keyword matching where dense retrieval struggles (e.g., product codes, proper nouns).
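A sketch of reciprocal rank fusion over any number of ranked ID lists (e.g., one from dense search, one from BM25). The input lists are assumed to be ordered best-first, and k = 60 is the constant commonly used in the RRF literature.

typescript
// Reciprocal rank fusion: merge ranked lists of document IDs (sketch).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      // Contribution of this list: 1 / (k + rank), with rank starting at 1.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([docId]) => docId);
}
// Usage sketch: reciprocalRankFusion([denseResults, bm25Results]).slice(0, 20)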

Reranking — Cross-Encoder Precision

Bi-encoders (used for retrieval) encode query and document separately — fast, and document embeddings can be precomputed offline — but less accurate. Cross-encoders score the (query, document) pair jointly in a single forward pass — slower but much more precise.

⚠ Warning · Two-stage pattern: Retrieve top-20 with a fast bi-encoder, then rerank to top-5 with a cross-encoder. The cross-encoder sees both query and document together (via cross-attention), catching nuances the bi-encoder misses.
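The contrast as a sketch, with assumed interfaces (embed, dot, and crossEncoder are hypothetical): the bi-encoder's document side can be indexed ahead of time, while the cross-encoder must run once per candidate at query time.

typescript
// Assumed interfaces, for illustration only.
declare function embed(text: string): Promise<number[]>;
declare function dot(a: number[], b: number[]): number;
declare const crossEncoder: { score(query: string, doc: string): Promise<number> };

// Bi-encoder: query and document embedded independently; document embeddings are
// precomputed offline, so query-time scoring is just a dot product.
const biEncoderScore = async (query: string, doc: string) =>
  dot(await embed(query), await embed(doc));

// Cross-encoder: one forward pass over the joint pair — cross-attention sees both
// texts, so it is far more precise but must run per candidate at query time.
const crossEncoderScore = async (query: string, doc: string) =>
  crossEncoder.score(query, doc);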

End-to-End RAG Pipeline — Illustrative TypeScript

The four stages in code: embed → retrieve → rerank → generate. Each stage is independently replaceable — swap the vector DB, change the reranker model, or adjust topK without touching the others.

rag.ts

typescript
async function rag(query: string) {
  // 1. Embed query — same model used at index time
  const queryEmb = await embed(query);

  // 2. Retrieve top-k candidate chunks via ANN search (~5-50ms)
  const chunks = await vectorDB.search(queryEmb, { topK: 20 });

  // 3. Rerank — cross-encoder sees each full (query, doc) pair
  const reranked = await reranker.rank(query, chunks);

  // 4. Generate with the best-ranked chunks as context
  const context = reranked.slice(0, 3).map(c => c.text).join('\n\n');
  return llm.chat(`Context:\n${context}\n\nQuestion: ${query}`);
}

RAG Evaluation — Three Dimensions

Evaluate retrieval and generation separately — a good retriever with a bad generator needs a different fix:

Retrieval Metrics

Recall@k — did a relevant chunk appear in the top-k results? MRR (mean reciprocal rank) — how high did the first relevant chunk rank? These isolate the retriever's performance from the generator's.

Generation Metrics

Faithfulness — is every claim traceable to a retrieved chunk? Use NLI (natural language inference) models to check entailment. Relevance — does the answer actually address the query? Completeness — does it cover all aspects?
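A sketch of the two retrieval-side metrics above, computed per query from the IDs of retrieved chunks and a set of ground-truth relevant chunk IDs (hypothetical inputs); averaging over a query set gives the reported numbers.

typescript
// Recall@k: fraction of relevant chunks that appear in the top-k retrieved results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = new Set(retrieved.slice(0, k));
  const hits = [...relevant].filter(id => topK.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// Reciprocal rank: 1 / (position of the first relevant chunk), 0 if none retrieved.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex(id => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}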

Hallucination Detection Pipeline

python
# Claim-level hallucination check (sketch; extract_claims wraps an LLM call,
# check_entailment wraps an NLI model)
claims = extract_claims(response)                          # LLM extracts atomic claims
verdicts = [check_entailment(c, chunks) for c in claims]   # entailed / contradicted / neutral
hallucination_rate = verdicts.count("contradicted") / len(claims)
💡 Tip · Tools like RAGAS automate faithfulness + relevance scoring using LLM-as-judge. Key insight: always evaluate retrieval and generation independently — fixing the wrong component wastes time.
⚠ Warning · Lost in the middle: LLMs attend poorly to information buried in the middle of a long context. Place your highest-ranked retrieved chunk first or last, not in the middle of the stuffed context.
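One simple mitigation, as a sketch: interleave the reranked chunks so the strongest candidates sit at the beginning and the end of the context rather than in the middle.

typescript
// Reorder ranked chunks so the best ones land at the edges of the context window
// (sketch of one way to counter the "lost in the middle" effect).
function orderForContext<T>(rankedChunks: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedChunks.forEach((chunk, i) => {
    // Alternate: rank 1 → front, rank 2 → back, rank 3 → front, ...
    (i % 2 === 0 ? front : back).push(chunk);
  });
  return [...front, ...back.reverse()]; // e.g. ranks [1, 3, 5, 4, 2] for five chunks
}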

Quick check

Trade-off

Why does a RAG pipeline use a bi-encoder for retrieval and a cross-encoder for reranking, rather than using the cross-encoder for both steps?

🔧

Break It — See What Happens

Remove retrieval (vanilla LLM, no RAG)
Stale vector index (embeddings not updated)
Bad chunking (10K-token chunks)
No reranking (embedding similarity only)

Quick check

Trade-off

You remove the reranking step from a RAG pipeline. Which failure mode is most likely?

📊

Real-World Numbers

Metric | Value
Embedding dimension (OpenAI ada-002) | 1,536
Embedding dimension (typical range) | 384–3,072
Recommended chunk size | 256–1,024 tokens
Top-k retrieval | 3–10 documents
Vector DB latency (10M vectors) | ~5–50 ms

Quick check

Derivation

You store ada-002 embeddings (1,536 dims, float32) for 1M documents. What is the approximate raw vector size?

🧠

Key Takeaways

What to remember for interviews

  1. RAG grounds LLM outputs in retrieved documents, dramatically reducing hallucination: the model quotes actual source text rather than generating plausible-sounding facts from memorized training data.
  2. The two-stage pipeline separates indexing (chunking, embedding, storing) from querying (embed query, ANN search, rerank, stuff context) — evaluate and optimize each stage independently.
  3. Hybrid retrieval (dense embeddings + BM25 sparse) outperforms either alone: dense search captures semantic similarity while BM25 handles exact keyword and rare-term matching.
  4. Cross-encoder reranking rescores the top-20 bi-encoder candidates jointly (query + document together), achieving far higher precision at the cost of extra latency — the standard two-stage pattern.
  5. The 'lost in the middle' effect means LLMs attend poorly to information placed in the middle of long contexts — place the most relevant retrieved chunks at the beginning or end, not buried in the middle.
🧠

Recap quiz

🧠

RAG recap

Trade-off

ColBERT v2 stores per-token embeddings instead of a single document vector. What is the storage cost trade-off versus a standard bi-encoder?

Trade-off

You retrieve five chunks for a RAG answer. The most relevant chunk is chunk #3 (middle of five). What should you do before stuffing them into the prompt?

Derivation

Your RAG pipeline uses 10,000-token chunks. Retrieval recall looks good but answer quality is poor. What is the most likely root cause?

Trade-off

A production RAG system retrieves top-20 candidates in <50ms, then applies a cross-encoder reranker. Why retrieve 20 before reranking to 5 instead of just retrieving 5 directly?

Derivation

Self-RAG reduces unnecessary retrieval calls by ~30% compared to always-retrieve RAG. What mechanism enables this?

Derivation

You store OpenAI ada-002 embeddings (1,536 dims, float32) for 10M documents. What is the raw vector storage requirement, ignoring index overhead?

Trade-off

An interviewer asks: “When should you use RAG vs fine-tuning?” Which answer is most defensible?

📚

Further Reading

🎯

Interview Questions


Design a RAG system for 10M documents with sub-second latency.

★★★
Google · Databricks

What's the bottleneck — retrieval or generation?

How do you choose chunk size? What are the tradeoffs?

★★☆
Databricks

Compare dense retrieval vs sparse retrieval (BM25). When to use each?

★★☆
Google · Databricks

How would you evaluate RAG quality? What metrics?

★★★
Anthropic · OpenAI

What is the 'lost in the middle' problem for long-context RAG?

★★☆
Anthropic

When should you use RAG vs fine-tuning vs larger context window?

★★★
Google · OpenAI

If you have 1000 product manuals, which approach would you try first?