
Transformer Math

Module 39 · Applications

📏 Long Context & Context Engineering

A model with 200K context still forgets things in the middle — accuracy drops 20% on facts placed at position 100K. Why can’t attention just… attend?


A context window sounds infinite — until your agent fills it with tool schemas, retrieval chunks, and chat history. Context engineering is the discipline of managing this scarce resource: what goes in, where it goes, and how much it costs. Master this and you save money, reduce latency, and get better outputs.

📏

Context Length: Then vs Now

Context Length Growth

GPT-3: 2K tokens (2020)
GPT-3.5: 4K → 16K (2022)
GPT-4: 8K → 32K → 128K (2023)
Claude 3: 200K tokens (2024)
Gemini 1.5: 1M → 2M (2024)

Scaling Techniques

1. RoPE Scaling: extend positional encodings by interpolating between trained positions. Variants: NTK-aware, YaRN (LLaMA, Mistral).
2. Sliding Window Attention: each token attends only to a local window, not the full sequence, so cost is O(n·w) rather than O(n²). Mistral: 4K window, 32K effective range.
3. Ring Attention: distribute long sequences across GPUs in a ring; each device holds a KV shard.
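To make the sliding-window cost concrete, here is a minimal numpy sketch of the attention mask it implies. The sequence length and window size below are illustrative, not taken from any particular model.

python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where token i may attend only to tokens in [i - window + 1, i]."""
    idx = np.arange(seq_len)
    # allowed[i, j] is True when j <= i (causal) and i - j < window (local)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    return allowed

mask = sliding_window_mask(seq_len=8, window=3)
# Each row has at most `window` True entries, so scoring costs O(n * w)
# instead of the O(n^2) of a full causal mask.
print(mask.sum(axis=1))  # [1 2 3 3 3 3 3 3]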
🎮

Context Window Budget Simulator

What you are seeing

A context window divided into sections. Each color represents a different part of what fills the context: system prompt, tool schemas, retrieved documents, conversation history, response reserve, and safety margin. The bar shows relative allocation, and the table shows exact token counts and cost impact.

What to try: Drag the dividers between sections to resize allocations. Switch context window sizes to see how budgets scale. Watch how prompt caching savings change as you adjust the system prompt and tool schema sections.

Section | % | Tokens | Note
System Prompt | 3% | 6,000 | Instructions, persona, constraints
Tool Schemas | 8% | 16,000 | Function definitions, parameters, descriptions
Retrieved Chunks | 15% | 30,000 | RAG results, documents, knowledge
Chat History | 50% | 100,000 | Conversation turns (recent + summarized)
Response Reserve | 4% | 8,000 | Max output tokens for the model's reply
Safety Margin | 20% | 40,000 | Buffer for token count estimation errors

Cost per Request (Claude Sonnet pricing)

Without caching: $0.6000

With prompt caching: $0.5406

≈10% total savings from caching the 22,000-token prefix (each cached token is billed at a 90% discount)

💡

The Intuition

Context window as scarce resource. Every token you put in the context has a cost (literal dollars) and an opportunity cost (it displaces something else). A full 200K-token window at $3/MTok input pricing costs $0.60 per request — multiply by millions of API calls and you are burning serious money.

Prompt caching. Anthropic and OpenAI cache the KV computations for static prompt prefixes. If your system prompt + tool schemas are 15K tokens and identical across requests, caching reduces the cost of those tokens by 90% (from $3/MTok to $0.30/MTok on Claude Sonnet). Anthropic supports 5-minute and 1-hour cache tiers; high-traffic endpoints benefit most from caching stable prefixes.

Lost in the middle. Liu et al. (2023) demonstrated that LLMs show a U-shaped attention curve — strong attention to the beginning and end of context, but accuracy drops for information placed in the middle. This is not a bug in one model; it appears across GPT-4, Claude, and Llama. Place critical information at the start or end of your context.
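One practical consequence for RAG pipelines: rather than inserting chunks in rank order, interleave them so the highest-ranked chunks land at the edges of the context. A minimal sketch (the ranked chunk list is assumed to come from your retriever or reranker):

python
def reorder_for_edges(chunks_by_rank: list[str]) -> list[str]:
    """Place the best-ranked chunks at the start and end of the context.

    Rank 1 goes first, rank 2 goes last, rank 3 second, rank 4 second-to-last,
    and so on, so the least relevant chunks end up in the middle where
    lost-in-the-middle degradation hurts least.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ranks 1..6 end up at positions 1, 3, 5, 6, 4, 2
print(reorder_for_edges(["r1", "r2", "r3", "r4", "r5", "r6"]))
# ['r1', 'r3', 'r5', 'r6', 'r4', 'r2']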

Memory layering. No context window is big enough for an agent that runs for hours. The solution is tiered memory: recent turns stay verbatim (full fidelity), older turns get summarized (compressed but lossy), and the oldest turns go into a vector store (retrieved on demand). This mirrors how human memory works — vivid recent memories, compressed older ones, and associative recall for the rest.

Sliding window vs. summarization. A sliding window simply drops tokens beyond a limit — simple but you lose everything. Summarization compresses older context into a shorter representation — preserves key facts but costs an extra LLM call. The best systems combine both: slide the window AND summarize what falls off.
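A minimal sketch of that combination, assuming a hypothetical summarize() helper (the extra LLM call) and a rough four-characters-per-token estimate; both are placeholders, not a specific library API.

python
def manage_history(turns: list[str], budget_tokens: int,
                   keep_recent: int = 6) -> list[str]:
    """Keep recent turns verbatim; compress older turns that fall off the window."""
    est = lambda text: len(text) // 4           # rough tokens ~ chars / 4
    recent = turns[-keep_recent:]
    older = turns[:-keep_recent]

    history = list(recent)
    if older:
        # summarize() is a placeholder for an extra LLM call that compresses
        # the dropped turns into a short running summary
        history = [f"[Summary of earlier conversation] {summarize(older)}"] + history

    # If still over budget, slide the window: drop the oldest remaining turn
    # (index 1 keeps the summary, if any, pinned at the front)
    drop_at = 1 if older else 0
    while len(history) > drop_at + 1 and sum(est(t) for t in history) > budget_tokens:
        history.pop(drop_at)
    return history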

Infini-attention (Google, 2024)— adds a compressive memory to standard attention so context length becomes theoretically unbounded without growing memory proportionally. Each transformer block maintains two attention mechanisms: a local dot-product attention window over recent tokens, and a compressed “infinite” memory that accumulates older tokens via an associative matrix update (similar to linear attention). New tokens attend to both simultaneously, with a learned gate balancing local vs. memory attention. The memory footprint stays constant regardless of how many tokens have been processed — enabling million-token contexts on hardware that would otherwise OOM with standard quadratic attention.
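To make the mechanism concrete, here is a minimal single-head numpy sketch of one segment: read the compressive memory with the current queries, update it with the new keys and values, then gate the readout against local attention. Names are illustrative, projections and multi-head structure are omitted, and local_out stands in for the ordinary dot-product attention output; this is a sketch of the idea, not the full architecture.

python
import numpy as np

def elu1(x):
    """ELU(x) + 1: keeps the feature map strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_step(Q, K, V, M, z, beta, local_out):
    """One segment of Infini-attention-style memory (single head, no projections).

    Q, K: (n, d_k)   V: (n, d_v)
    M:    (d_k, d_v) compressive memory      z: (d_k,) normalizer
    beta: scalar gate parameter              local_out: (n, d_v) local attention output
    For the first segment, initialize M to zeros and z to ones (strictly positive).
    """
    sQ, sK = elu1(Q), elu1(K)
    # Read the old memory with the current queries (linear-attention retrieval)
    mem_out = (sQ @ M) / (sQ @ z)[:, None]
    # Update memory and normalizer with the new keys/values (associative update)
    M = M + sK.T @ V
    z = z + sK.sum(axis=0)
    # Learned gate mixes the memory readout with local dot-product attention
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * mem_out + (1.0 - g) * local_out
    return out, M, z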

YaRN (2023) — extends context windows without full fine-tuning. The key insight is that RoPE encodes position through rotation angles at different frequencies, and different frequency bands have different sensitivity to interpolation. YaRN applies different interpolation ratios per frequency band: high-frequency components (encoding fine-grained local position) need less stretching than low-frequency components (encoding long-range position). This “non-uniform interpolation” preserves local attention quality while extending the effective context range, with only a small continued pre-training run on roughly 400M tokens — far cheaper than training a new model from scratch.
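A simplified sketch of that per-band idea: count how many full rotations each RoPE frequency completes within the original context, interpolate only the slow (few-rotation) bands, and leave the fast bands untouched. The rotation thresholds and 4K→128K scale factor are illustrative, and the real YaRN recipe also adds an attention-temperature correction not shown here.

python
import numpy as np

def yarn_like_inv_freq(dim: int, base: float = 10_000.0,
                       train_len: int = 4_096, scale: float = 32.0,
                       low_rot: float = 1.0, high_rot: float = 32.0) -> np.ndarray:
    """Per-frequency-band interpolation of RoPE inverse frequencies."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)   # fast -> slow bands
    rotations = train_len * inv_freq / (2 * np.pi)          # full turns within train_len
    # ramp is 0 for slow bands (few rotations -> fully interpolate)
    # and 1 for fast bands (many rotations -> leave untouched)
    ramp = np.clip((rotations - low_rot) / (high_rot - low_rot), 0.0, 1.0)
    return ramp * inv_freq + (1.0 - ramp) * inv_freq / scale

# Slow (long-range) bands are compressed by `scale`, so positions beyond 4K map
# back into the trained range, while fast (local) bands keep their resolution.
new_inv_freq = yarn_like_inv_freq(dim=128)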

Context Engineering Patterns (2024–2025) — the emerging discipline of optimizing what goes into the context window, not just how big it is. Four key patterns: (1) Prompt caching — place static content (system prompt, tool schemas, few-shot examples) first so the KV cache prefix is stable across requests, cutting costs 90%. (2) Progressive disclosure — start with summaries of documents, expand to full text only when the model signals it needs more detail, using a multi-turn approach to stay within budget. (3) Token budgeting — explicitly allocate token quotas across sections (system, tools, history, retrieval, response) and enforce hard caps with graceful degradation rather than silent overflow. (4) Memory layering — recent turns in full, older turns summarized, oldest in a vector store for retrieval-on-demand. Each layer trades fidelity for compression, matching how human working memory works.
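As a sketch of pattern (3), token budgeting: each section gets a hard cap, and anything over the cap is truncated and flagged rather than silently overflowing. The section names, caps, and characters-per-token estimate below are illustrative assumptions, not a prescribed allocation.

python
def fit_to_budget(sections: dict[str, str], caps: dict[str, int]) -> dict[str, str]:
    """Enforce per-section token caps with graceful degradation."""
    est = lambda text: len(text) // 4               # rough tokens ~ chars / 4
    fitted = {}
    for name, text in sections.items():
        cap = caps[name]
        if est(text) <= cap:
            fitted[name] = text
        else:
            # Graceful degradation: cut to the cap and make the cut visible
            fitted[name] = text[: cap * 4] + "\n[truncated to fit token budget]"
    return fitted

# Example caps for a 200K window, mirroring the budget table above
caps = {"system": 4_000, "tools": 12_000, "retrieval": 30_000, "history": 100_000}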

✨ Insight · Context engineering is becoming a core skill for AI engineers. It is not just about fitting content into a window — it is about prioritizing what the model needs to see, where it sees it, and minimizing cost. Think of it like memory management in systems programming: stack (recent context), heap (vector store), disk (full logs).

Quick check

Trade-off

Your system prompt changes on every request (user-specific data injected). Should you enable prompt caching for it?

Quick Check

Why does prompt caching save 90% on system prompts?

📐

The Math

Cost per Request (Cached vs. Uncached)

Split input tokens into a cached prefix $n_{\text{cached}}$ and an uncached suffix $n_{\text{uncached}}$. With output tokens $n_{\text{out}}$:

$$C = \frac{n_{\text{cached}} \cdot p_{\text{cached}} + n_{\text{uncached}} \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}}{10^6}$$

For Claude Sonnet: $p_{\text{in}} = \$3/\text{MTok}$, $p_{\text{cached}} = \$0.30/\text{MTok}$, $p_{\text{out}} = \$15/\text{MTok}$. Caching a 15K system prompt saves $15{,}000 \times (3.0 - 0.3)/10^6 \approx \$0.04$ per request.

Attention Complexity vs. Context Length

Self-attention over $n$ tokens computes an $n \times n$ attention matrix. Both compute and memory scale quadratically:

$$\text{FLOPs}_{\text{attn}} = O(n^2 d), \qquad \text{Memory}_{\text{scores}} = O(n^2)$$

Doubling context length from 128K to 256K quadruples the attention computation. FlashAttention avoids materializing the full $n \times n$ matrix, so memory drops to $O(n)$, but compute remains $O(n^2)$. Ring Attention distributes the sequence across devices.

Lost-in-the-Middle: Accuracy vs. Position

Empirical finding from Liu et al. (2023). Accuracy as a function of document position $i$ in a context of $N$ documents follows a U-shape:

$$\text{Acc}(i) \approx A_{\text{edge}} - \Delta \cdot 4x(1-x), \qquad x = \frac{i-1}{N-1}$$

Where $\Delta \approx 0.2$ (roughly a 20-point drop at the middle). This is not a formal model but captures the empirical pattern. Place the most relevant retrieval chunk at position 1 (start) or position $N$ (end) for best results.

Python: Context Budget Calculator

python
def compute_context_budget(
    context_window: int = 200_000,
    system_tokens: int = 4_000,
    tool_tokens: int = 12_000,
    max_retrieval_tokens: int = 30_000,
    max_response_tokens: int = 8_000,
    safety_margin: int = 10_000,
) -> dict:
    """Calculate available tokens for conversation history."""
    reserved = (system_tokens + tool_tokens + max_retrieval_tokens
                + max_response_tokens + safety_margin)
    history_budget = context_window - reserved

    # Cost estimation (Claude Sonnet input pricing); assumes the window is completely full
    cached = system_tokens + tool_tokens  # static prefix
    uncached = context_window - cached
    cost_no_cache = context_window * 3.0 / 1e6
    cost_cached = cached * 0.3 / 1e6 + uncached * 3.0 / 1e6

    return {
        "history_budget": history_budget,        # 136,000
        "cost_per_request_no_cache": cost_no_cache,  # $0.60
        "cost_per_request_cached": cost_cached,      # $0.5568
        "savings_pct": (1 - cost_cached / cost_no_cache) * 100,
    }

Python: Anthropic Prompt Caching Setup

python
import anthropic

client = anthropic.Anthropic()

# The system prompt is cached — 90% cost reduction on repeat calls
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,       # 4K+ tokens
            "cache_control": {"type": "ephemeral"}  # <-- enables caching
        },
        {
            "type": "text",
            "text": TOOL_SCHEMAS_TEXT,         # 12K+ tokens
            "cache_control": {"type": "ephemeral"}
        },
    ],
    messages=[{"role": "user", "content": user_query}],
)

# Check cache performance in response usage fields
# cache_creation_input_tokens: tokens cached for the first time
# cache_read_input_tokens: tokens served from cache (90% cheaper)

Quick check

Derivation

A model runs self-attention on a 64K-token context. You then extend to 256K tokens (4× longer). By what factor does the attention compute grow?

🔧

Break It — See What Happens

Stuff everything in context (no budgeting)
Put critical info in the middle of context

Quick check

Trade-off

A RAG pipeline reranks 20 chunks by relevance and inserts them in rank order (rank 1 first). Where does the #1-ranked chunk land in the model's context?

📊

Real-World Numbers

Model | Context Window | Notes
GPT-4 Turbo | 128K tokens | ~300 pages of text. Lost-in-the-middle confirmed.
Claude Sonnet 4 | 200K tokens (1M in beta) | Prompt caching: 90% reduction on cached prefix.
Claude Opus 4 | 200K tokens | Extended thinking enabled.
Gemini 2.0 Pro | 2M tokens | Largest context window at time of release (experimental model). ~5,000 pages.
Llama 3.1 | 128K tokens | Open-source. Extended from 8K via RoPE scaling.
Lost-in-the-middle | — | Liu et al. 2023: consistent across GPT-4, Claude, Llama.

Prompt Caching Savings (Claude Sonnet)

Cached Prefix Size | Savings/Request | At 10K req/day
5K tokens | $0.0135 | $135/day
15K tokens | $0.0405 | $405/day
50K tokens | $0.135 | $1,350/day
✨ Insight · Context windows are growing fast (8K to 2M in two years) but attention is still O(n²). A bigger window does not mean you should use all of it. The best agent systems use 30–60% of the available window and keep the rest as margin. Prompt caching is the single highest-ROI optimization for production LLM apps.

Quick check

Derivation

You cache a 50K-token prefix (system prompt + tools + few-shot examples) at $0.30/MTok instead of $3/MTok. At 10,000 requests/day, what is the daily saving?

🔭

Long-Context Frontier (2024–2025)

The 2024–2025 research wave produced five distinct answers to the O(n²) attention bottleneck — from hardware-aligned sparsity to entirely non-Transformer architectures:

System | Approach | Context / Scale | Source
NSA (DeepSeek) | Hardware-aligned sparse attention | Faster than dense FlashAttention-2; production recipe in DeepSeek V3.1+ | arXiv:2502.11089, Feb 2025
Jamba 1.5 (AI21) | SSM + MoE hybrid | MoE + Mamba layers interleaved with Transformer blocks; memory-efficient vs. a pure Transformer at equal context | arXiv:2408.12570, Aug 2024
TTT Layers (Stanford/UCSD) | Test-time training | Replaces the KV cache with a mini neural net updated by gradient descent per token; O(1) memory, strong on sequences requiring long-range dependency | arXiv:2407.04620, Jul 2024
xLSTM (Hochreiter) | Extended LSTM | Exponential gating + matrix memory cells; competitive with Llama 2 on language benchmarks (NeurIPS 2024) | arXiv:2405.04517, May 2024
LFM (Liquid AI) | Dynamic-systems architecture | Non-Transformer foundation models based on liquid neural networks / ODEs; competitive on standard benchmarks at 1B–40B scale | liquid.ai, Oct 2024
✨ Insight · Architectural diversity is the trend. 2022–2023 was “scale the Transformer.” 2024–2025 is “question the Transformer.” SSM hybrids (Jamba, Zamba), recurrent-memory architectures (xLSTM, Mamba), and meta-learning context compression (TTT) all reduce O(n²) to O(n) or O(1) — but none has yet matched pure Transformer quality at the largest scales. NSA is the most production-proven, landing in DeepSeek V3.1+. See Flash Attention for the hardware-level detail on NSA's CUDA kernel design.
🧠

Key Takeaways

What to remember for interviews

1. Context windows are a scarce, priced resource — a full 200K-token request at $3/MTok costs $0.60; at scale this demands deliberate token budgeting across system prompt, tools, history, and retrieval.
2. Prompt caching stores the KV cache for static prompt prefixes server-side, reducing cost by 90% on cached tokens — place stable content (system prompt, tool schemas, few-shot examples) first in every request.
3. The 'lost in the middle' effect: LLMs show U-shaped attention across context, attending strongly to the start and end but losing 20%+ accuracy on information placed in the middle.
4. Long conversations require tiered memory: recent turns verbatim (full fidelity), older turns summarized (compressed), oldest in a vector store (retrieved on demand) — matching how human working memory works.
5. Standard self-attention is O(n²) in sequence length; techniques like Ring Attention distribute the KV cache across devices in a ring topology, enabling million-token contexts without per-device OOM.
🧠

Recap Quiz

🧠

Long Context recap

Derivation

A 200K-token Claude Sonnet request with no prompt caching costs $0.60 in input tokens ($3/MTok). A 15K-token static prefix is cached. What is the approximate cost of the cached request?

Trade-off

You have 20 retrieved documents. The most relevant one ranks #1. Where should you place it in the context for best accuracy?

Derivation

Doubling context length from 128K to 256K affects attention compute and memory. What changes and what does not?

Derivation

An agent has a 200K-token window. Static overhead (system prompt + tools) is 20K tokens. You want a 4K response reserve and a 10K safety margin. How many tokens remain for conversation history?

Derivation

YaRN extends a 4K RoPE-trained model to 128K context. What is the key insight that makes it work better than naive linear interpolation?

Trade-off

A sliding window drops tokens beyond 50K. Summarization compresses old turns to 10% of size but costs an extra LLM call. When is summarization worth the cost?

Trade-off

Gemini 2.0 Pro has a 2M-token context window. What is the dominant reason NOT to stuff it completely on every request, even if cost is not a concern?

📚

Further Reading

🎯

Interview Questions


Explain the lost-in-the-middle phenomenon. How does it affect RAG system design?

★★☆
Google · Anthropic

How does prompt caching work and when should you use it? What are the cost tradeoffs?

★★☆
Anthropic · OpenAI

Design a token budget for an agent with 200K context. How do you allocate across system prompt, tools, retrieval, history, and response?

★★☆
Anthropic · OpenAI

Compare sliding window attention, summarization, and vector store retrieval for managing long conversations. When would you use each?

★★★
Google · Meta

What is the computational complexity of attention with respect to context length, and how do approaches like Ring Attention address this?

★★★
Google · Meta

You're building a customer support agent. The system prompt is 8K tokens, tool schemas are 12K, and you have a 128K context window. A user sends a conversation with 150K tokens of history. What do you do?

★★★
Anthropic · OpenAI