
Transformer Math

Module 39 · Applications

📏 Long Context & Context Engineering

A model with 200K context still forgets things in the middle — accuracy drops 20% on facts placed at position 100K. Why can’t attention just… attend?


A context window sounds infinite — until your agent fills it with tool schemas, retrieval chunks, and chat history. Context engineering is the discipline of managing this scarce resource: what goes in, where it goes, and how much it costs. Master this and you save money, reduce latency, and get better outputs.

📏

Context Length: Then vs Now

Context Length Growth

GPT-3: 2K tokens (2020)
GPT-3.5: 4K → 16K (2022)
GPT-4: 8K → 32K → 128K (2023)
Claude 3: 200K tokens (2024)
Gemini 1.5: 1M → 2M (2024)

Scaling Techniques

1. RoPE Scaling: extend positional encodings by interpolating between trained positions. Variants: NTK-aware, YaRN (LLaMA, Mistral).
2. Sliding Window Attention: each token attends only to a local window, not the full sequence, so cost is O(n·w) rather than O(n²). Mistral: 4K window, 32K effective range.
3. Ring Attention: distribute long sequences across GPUs in a ring; each device holds a KV shard.
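To make the sliding-window cost concrete, here is a minimal numpy sketch of the attention mask it implies. The sequence length and window size below are illustrative, not taken from any particular model.

python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where token i may attend only to tokens in [i - window + 1, i]."""
    idx = np.arange(seq_len)
    # allowed[i, j] is True when j <= i (causal) and i - j < window (local)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    return allowed

mask = sliding_window_mask(seq_len=8, window=3)
# Each row has at most `window` True entries, so scoring costs O(n * w)
# instead of the O(n^2) of a full causal mask.
print(mask.sum(axis=1))  # [1 2 3 3 3 3 3 3]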
🎮

Context Window Budget Simulator

What you are seeing

A context window divided into sections. Each color represents a different part of what fills the context: system prompt, tool schemas, retrieved documents, conversation history, response reserve, and safety margin. The bar shows relative allocation, and the table shows exact token counts and cost impact.

What to try: Drag the dividers between sections to resize allocations. Switch context window sizes to see how budgets scale. Watch how prompt caching savings change as you adjust the system prompt and tool schema sections.

Section | % | Tokens | Note
System Prompt | 3% | 6,000 | Instructions, persona, constraints
Tool Schemas | 8% | 16,000 | Function definitions, parameters, descriptions
Retrieved Chunks | 15% | 30,000 | RAG results, documents, knowledge
Chat History | 50% | 100,000 | Conversation turns (recent + summarized)
Response Reserve | 4% | 8,000 | Max output tokens for the model's reply
Safety Margin | 20% | 40,000 | Buffer for token count estimation errors

Cost per Request (Claude Sonnet pricing)

Without caching: $0.6000

With prompt caching: $0.5406

≈10% total savings from caching the 22,000-token prefix (each cached token is billed at a 90% discount)

💡

The Intuition

Context window as scarce resource. Every token you put in the context has a cost (literal dollars) and an opportunity cost (it displaces something else). A full 200K-token window at $3/MTok input pricing costs $0.60 per request — multiply by millions of API calls and you are burning serious money.

Prompt caching. Anthropic and OpenAI cache the KV computations for static prompt prefixes. If your system prompt + tool schemas are 15K tokens and identical across requests, caching reduces the cost of those tokens by 90% (from $3/MTok to $0.30/MTok on Claude Sonnet). Anthropic supports 5-minute and 1-hour cache tiers; high-traffic endpoints benefit most from caching stable prefixes.

Lost in the middle. Liu et al. (2023) demonstrated that LLMs show a U-shaped attention curve — strong attention to the beginning and end of context, but accuracy drops for information placed in the middle. This is not a bug in one model; it appears across GPT-4, Claude, and Llama. Place critical information at the start or end of your context.
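One practical consequence for RAG pipelines: rather than inserting chunks in rank order, interleave them so the highest-ranked chunks land at the edges of the context. A minimal sketch (the ranked chunk list is assumed to come from your retriever or reranker):

python
def reorder_for_edges(chunks_by_rank: list[str]) -> list[str]:
    """Place the best-ranked chunks at the start and end of the context.

    Rank 1 goes first, rank 2 goes last, rank 3 second, rank 4 second-to-last,
    and so on, so the least relevant chunks end up in the middle where
    lost-in-the-middle degradation hurts least.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ranks 1..6 end up at positions 1, 3, 5, 6, 4, 2
print(reorder_for_edges(["r1", "r2", "r3", "r4", "r5", "r6"]))
# ['r1', 'r3', 'r5', 'r6', 'r4', 'r2']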

Memory layering. No context window is big enough for an agent that runs for hours. The solution is tiered memory: recent turns stay verbatim (full fidelity), older turns get summarized (compressed but lossy), and the oldest turns go into a vector store (retrieved on demand). This mirrors how human memory works — vivid recent memories, compressed older ones, and associative recall for the rest.

Sliding window vs. summarization. A sliding window simply drops tokens beyond a limit — simple but you lose everything. Summarization compresses older context into a shorter representation — preserves key facts but costs an extra LLM call. The best systems combine both: slide the window AND summarize what falls off.
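A minimal sketch of that combination, assuming a hypothetical summarize() helper (the extra LLM call) and a rough four-characters-per-token estimate; both are placeholders, not a specific library API.

python
def manage_history(turns: list[str], budget_tokens: int,
                   keep_recent: int = 6) -> list[str]:
    """Keep recent turns verbatim; compress older turns that fall off the window."""
    est = lambda text: len(text) // 4           # rough tokens ~ chars / 4
    recent = turns[-keep_recent:]
    older = turns[:-keep_recent]

    history = list(recent)
    if older:
        # summarize() is a placeholder for an extra LLM call that compresses
        # the dropped turns into a short running summary
        history = [f"[Summary of earlier conversation] {summarize(older)}"] + history

    # If still over budget, slide the window: drop the oldest remaining turn
    # (index 1 keeps the summary, if any, pinned at the front)
    drop_at = 1 if older else 0
    while len(history) > drop_at + 1 and sum(est(t) for t in history) > budget_tokens:
        history.pop(drop_at)
    return history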

Infini-attention (Google, 2024)— adds a compressive memory to standard attention so context length becomes theoretically unbounded without growing memory proportionally. Each transformer block maintains two attention mechanisms: a local dot-product attention window over recent tokens, and a compressed “infinite” memory that accumulates older tokens via an associative matrix update (similar to linear attention). New tokens attend to both simultaneously, with a learned gate balancing local vs. memory attention. The memory footprint stays constant regardless of how many tokens have been processed — enabling million-token contexts on hardware that would otherwise OOM with standard quadratic attention.
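To make the mechanism concrete, here is a minimal single-head numpy sketch of one segment: read the compressive memory with the current queries, update it with the new keys and values, then gate the readout against local attention. Names are illustrative, projections and multi-head structure are omitted, and local_out stands in for the ordinary dot-product attention output; this is a sketch of the idea, not the full architecture.

python
import numpy as np

def elu1(x):
    """ELU(x) + 1: keeps the feature map strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_step(Q, K, V, M, z, beta, local_out):
    """One segment of Infini-attention-style memory (single head, no projections).

    Q, K: (n, d_k)   V: (n, d_v)
    M:    (d_k, d_v) compressive memory      z: (d_k,) normalizer
    beta: scalar gate parameter              local_out: (n, d_v) local attention output
    For the first segment, initialize M to zeros and z to ones (strictly positive).
    """
    sQ, sK = elu1(Q), elu1(K)
    # Read the old memory with the current queries (linear-attention retrieval)
    mem_out = (sQ @ M) / (sQ @ z)[:, None]
    # Update memory and normalizer with the new keys/values (associative update)
    M = M + sK.T @ V
    z = z + sK.sum(axis=0)
    # Learned gate mixes the memory readout with local dot-product attention
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * mem_out + (1.0 - g) * local_out
    return out, M, z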

YaRN (2023) — extends context windows without full fine-tuning. The key insight is that RoPE encodes position through rotation angles at different frequencies, and different frequency bands have different sensitivity to interpolation. YaRN applies different interpolation ratios per frequency band: high-frequency components (encoding fine-grained local position) need less stretching than low-frequency components (encoding long-range position). This “non-uniform interpolation” preserves local attention quality while extending the effective context range, with only a small continued pre-training run on roughly 400M tokens — far cheaper than training a new model from scratch.
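A simplified sketch of that per-band idea: count how many full rotations each RoPE frequency completes within the original context, interpolate only the slow (few-rotation) bands, and leave the fast bands untouched. The rotation thresholds and 4K→128K scale factor are illustrative, and the real YaRN recipe also adds an attention-temperature correction not shown here.

python
import numpy as np

def yarn_like_inv_freq(dim: int, base: float = 10_000.0,
                       train_len: int = 4_096, scale: float = 32.0,
                       low_rot: float = 1.0, high_rot: float = 32.0) -> np.ndarray:
    """Per-frequency-band interpolation of RoPE inverse frequencies."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)   # fast -> slow bands
    rotations = train_len * inv_freq / (2 * np.pi)          # full turns within train_len
    # ramp is 0 for slow bands (few rotations -> fully interpolate)
    # and 1 for fast bands (many rotations -> leave untouched)
    ramp = np.clip((rotations - low_rot) / (high_rot - low_rot), 0.0, 1.0)
    return ramp * inv_freq + (1.0 - ramp) * inv_freq / scale

# Slow (long-range) bands are compressed by `scale`, so positions beyond 4K map
# back into the trained range, while fast (local) bands keep their resolution.
new_inv_freq = yarn_like_inv_freq(dim=128)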

Context Engineering Patterns (2024–2025) — the emerging discipline of optimizing what goes into the context window, not just how big it is. Four key patterns: (1) Prompt caching — place static content (system prompt, tool schemas, few-shot examples) first so the KV cache prefix is stable across requests, cutting costs 90%. (2) Progressive disclosure — start with summaries of documents, expand to full text only when the model signals it needs more detail, using a multi-turn approach to stay within budget. (3) Token budgeting — explicitly allocate token quotas across sections (system, tools, history, retrieval, response) and enforce hard caps with graceful degradation rather than silent overflow. (4) Memory layering — recent turns in full, older turns summarized, oldest in a vector store for retrieval-on-demand. Each layer trades fidelity for compression, matching how human working memory works.
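As a sketch of pattern (3), token budgeting: each section gets a hard cap, and anything over the cap is truncated and flagged rather than silently overflowing. The section names, caps, and characters-per-token estimate below are illustrative assumptions, not a prescribed allocation.

python
def fit_to_budget(sections: dict[str, str], caps: dict[str, int]) -> dict[str, str]:
    """Enforce per-section token caps with graceful degradation."""
    est = lambda text: len(text) // 4               # rough tokens ~ chars / 4
    fitted = {}
    for name, text in sections.items():
        cap = caps[name]
        if est(text) <= cap:
            fitted[name] = text
        else:
            # Graceful degradation: cut to the cap and make the cut visible
            fitted[name] = text[: cap * 4] + "\n[truncated to fit token budget]"
    return fitted

# Example caps for a 200K window, mirroring the budget table above
caps = {"system": 4_000, "tools": 12_000, "retrieval": 30_000, "history": 100_000}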

✨ Insight · Context engineering is becoming a core skill for AI engineers. It is not just about fitting content into a window — it is about prioritizing what the model needs to see, where it sees it, and minimizing cost. Think of it like memory management in systems programming: stack (recent context), heap (vector store), disk (full logs).

Quick check

Trade-off

Your system prompt changes on every request (user-specific data injected). Should you enable prompt caching for it?

Quick Check

Why does prompt caching save 90% on system prompts?

📐

The Math

Cost per Request (Cached vs. Uncached)

Split input tokens into a cached prefix $n_{\text{cached}}$ and an uncached suffix $n_{\text{uncached}}$. With output tokens $n_{\text{out}}$:

$$C = \frac{n_{\text{cached}} \cdot p_{\text{cached}} + n_{\text{uncached}} \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}}{10^6}$$

For Claude Sonnet: $p_{\text{in}} = \$3/\text{MTok}$, $p_{\text{cached}} = \$0.30/\text{MTok}$, $p_{\text{out}} = \$15/\text{MTok}$. Caching a 15K system prompt saves $15{,}000 \times (3.0 - 0.3)/10^6 \approx \$0.04$ per request.

Attention Complexity vs. Context Length

Self-attention over $n$ tokens computes an $n \times n$ attention matrix. Both compute and memory scale quadratically:

$$\text{FLOPs}_{\text{attn}} = O(n^2 d), \qquad \text{Memory}_{\text{scores}} = O(n^2)$$

Doubling context length from 128K to 256K quadruples the attention computation. FlashAttention avoids materializing the full $n \times n$ matrix, so memory drops to $O(n)$, but compute remains $O(n^2)$. Ring Attention distributes the sequence across devices.

Lost-in-the-Middle: Accuracy vs. Position

Empirical finding from Liu et al. (2023). Accuracy as a function of document position $i$ in a context of $N$ documents follows a U-shape:

$$\text{Acc}(i) \approx A_{\text{edge}} - \Delta \cdot 4x(1-x), \qquad x = \frac{i-1}{N-1}$$

Where $\Delta \approx 0.2$ (roughly a 20-point drop at the middle). This is not a formal model but captures the empirical pattern. Place the most relevant retrieval chunk at position 1 (start) or position $N$ (end) for best results.

Python: Context Budget Calculator

python
def compute_context_budget(
    context_window: int = 200_000,
    system_tokens: int = 4_000,
    tool_tokens: int = 12_000,
    max_retrieval_tokens: int = 30_000,
    max_response_tokens: int = 8_000,
    safety_margin: int = 10_000,
) -> dict:
    """Calculate available tokens for conversation history."""
    reserved = (system_tokens + tool_tokens + max_retrieval_tokens
                + max_response_tokens + safety_margin)
    history_budget = context_window - reserved

    # Cost estimation (Claude Sonnet input pricing); assumes the window is completely full
    cached = system_tokens + tool_tokens  # static prefix
    uncached = context_window - cached
    cost_no_cache = context_window * 3.0 / 1e6
    cost_cached = cached * 0.3 / 1e6 + uncached * 3.0 / 1e6

    return {
        "history_budget": history_budget,        # 136,000
        "cost_per_request_no_cache": cost_no_cache,  # $0.60
        "cost_per_request_cached": cost_cached,      # $0.5568
        "savings_pct": (1 - cost_cached / cost_no_cache) * 100,
    }

Python: Anthropic Prompt Caching Setup

python
import anthropic

client = anthropic.Anthropic()

# The system prompt is cached — 90% cost reduction on repeat calls
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,       # 4K+ tokens
            "cache_control": {"type": "ephemeral"}  # <-- enables caching
        },
        {
            "type": "text",
            "text": TOOL_SCHEMAS_TEXT,         # 12K+ tokens
            "cache_control": {"type": "ephemeral"}
        },
    ],
    messages=[{"role": "user", "content": user_query}],
)

# Check cache performance in response usage fields
# cache_creation_input_tokens: tokens cached for the first time
# cache_read_input_tokens: tokens served from cache (90% cheaper)

Quick check

Derivation

A model runs self-attention on a 64K-token context. You then extend to 256K tokens (4× longer). By what factor does the attention compute grow?

🔧

Break It — See What Happens

Stuff everything in context (no budgeting)
Put critical info in the middle of context

Quick check

Trade-off

A RAG pipeline reranks 20 chunks by relevance and inserts them in rank order (rank 1 first). Where does the #1-ranked chunk land in the model's context?

📊

Real-World Numbers

Model | Context Window | Notes
GPT-4 Turbo | 128K tokens | ~300 pages of text. Lost-in-the-middle confirmed.
Claude Sonnet 4 | 200K tokens (1M in beta) | Prompt caching: 90% reduction on cached prefix.
Claude Opus 4 | 200K tokens | Extended thinking enabled.
Gemini 2.0 Pro | 2M tokens | Largest context window at time of release (experimental model). ~5,000 pages.
Llama 3.1 | 128K tokens | Open-source. Extended from 8K via RoPE scaling.
Lost-in-the-middle | — | Liu et al. 2023: consistent across GPT-4, Claude, Llama.

Prompt Caching Savings (Claude Sonnet)

Cached Prefix Size | Savings/Request | At 10K req/day
5K tokens | $0.0135 | $135/day
15K tokens | $0.0405 | $405/day
50K tokens | $0.135 | $1,350/day
✨ Insight · Context windows are growing fast (8K to 2M in two years) but attention is still O(n²). A bigger window does not mean you should use all of it. The best agent systems use 30–60% of the available window and keep the rest as margin. Prompt caching is the single highest-ROI optimization for production LLM apps.

Quick check

Derivation

You cache a 50K-token prefix (system prompt + tools + few-shot examples) at $0.30/MTok instead of $3/MTok. At 10,000 requests/day, what is the daily saving?

🔭

Long-Context Frontier (2024–2025)

The 2024–2025 research wave produced five distinct answers to the O(n²) attention bottleneck — from hardware-aligned sparsity to entirely non-Transformer architectures:

System | Approach | Context / Scale | Source
NSA (DeepSeek) | Hardware-aligned sparse attention | Faster than dense FlashAttention-2; production recipe in DeepSeek V3.1+ | arXiv:2502.11089, Feb 2025
Jamba 1.5 (AI21) | SSM + MoE hybrid | MoE + Mamba layers interleaved with Transformer blocks; memory-efficient vs. a pure Transformer at equal context | arXiv:2408.12570, Aug 2024
TTT Layers (Stanford/UCSD) | Test-time training | Replaces the KV cache with a mini neural net updated by gradient descent per token; O(1) memory, strong on sequences requiring long-range dependency | arXiv:2407.04620, Jul 2024
xLSTM (Hochreiter) | Extended LSTM | Exponential gating + matrix memory cells; competitive with Llama 2 on language benchmarks (NeurIPS 2024) | arXiv:2405.04517, May 2024
LFM (Liquid AI) | Dynamic-systems architecture | Non-Transformer foundation models based on liquid neural networks / ODEs; competitive on standard benchmarks at 1B–40B scale | liquid.ai, Oct 2024
✨ Insight · Architectural diversity is the trend. 2022–2023 was “scale the Transformer.” 2024–2025 is “question the Transformer.” SSM hybrids (Jamba, Zamba), recurrent-memory architectures (xLSTM, Mamba), and meta-learning context compression (TTT) all reduce O(n²) to O(n) or O(1) — but none has yet matched pure Transformer quality at the largest scales. NSA is the most production-proven, landing in DeepSeek V3.1+. See Flash Attention for the hardware-level detail on NSA's CUDA kernel design.
🧠

Key Takeaways

What to remember for interviews

1. Context windows are a scarce, priced resource — a full 200K-token request at $3/MTok costs $0.60; at scale this demands deliberate token budgeting across system prompt, tools, history, and retrieval.
2. Prompt caching stores the KV cache for static prompt prefixes server-side, reducing cost by 90% on cached tokens — place stable content (system prompt, tool schemas, few-shot examples) first in every request.
3. The 'lost in the middle' effect: LLMs show U-shaped attention across context, attending strongly to the start and end but losing 20%+ accuracy on information placed in the middle.
4. Long conversations require tiered memory: recent turns verbatim (full fidelity), older turns summarized (compressed), oldest in a vector store (retrieved on demand) — matching how human working memory works.
5. Standard self-attention is O(n²) in sequence length; techniques like Ring Attention distribute the KV cache across devices in a ring topology, enabling million-token contexts without per-device OOM.
🧠

Recap Quiz

🧠

Long Context recap

Derivation

A 200K-token Claude Sonnet request with no prompt caching costs $0.60 in input tokens ($3/MTok). A 15K-token static prefix is cached. What is the approximate cost of the cached request?

Trade-off

You have 20 retrieved documents. The most relevant one ranks #1. Where should you place it in the context for best accuracy?

Derivation

Doubling context length from 128K to 256K affects attention compute and memory. What changes and what does not?

Derivation

An agent has a 200K-token window. Static overhead (system prompt + tools) is 20K tokens. You want a 4K response reserve and a 10K safety margin. How many tokens remain for conversation history?

Derivation

YaRN extends a 4K RoPE-trained model to 128K context. What is the key insight that makes it work better than naive linear interpolation?

Trade-off

A sliding window drops tokens beyond 50K. Summarization compresses old turns to 10% of size but costs an extra LLM call. When is summarization worth the cost?

Trade-off

Gemini 2.0 Pro has a 2M-token context window. What is the dominant reason NOT to stuff it completely on every request, even if cost is not a concern?

📚

Further Reading

🎯

Interview Questions


Explain the lost-in-the-middle phenomenon. How does it affect RAG system design?

★★☆
Google · Anthropic

How does prompt caching work and when should you use it? What are the cost tradeoffs?

★★☆
Anthropic · OpenAI

Design a token budget for an agent with 200K context. How do you allocate across system prompt, tools, retrieval, history, and response?

★★☆
Anthropic · OpenAI

Compare sliding window attention, summarization, and vector store retrieval for managing long conversations. When would you use each?

★★★
Google · Meta

What is the computational complexity of attention with respect to context length, and how do approaches like Ring Attention address this?

★★★
Google · Meta

You're building a customer support agent. The system prompt is 8K tokens, tool schemas are 12K, and you have a 128K context window. A user sends a conversation with 150K tokens of history. What do you do?

★★★
Anthropic · OpenAI