📏 Long Context & Context Engineering
A model with 200K context still forgets things in the middle — accuracy drops 20% on facts placed at position 100K. Why can’t attention just… attend?
A context window sounds infinite — until your agent fills it with tool schemas, retrieval chunks, and chat history. Context engineering is the discipline of managing this scarce resource: what goes in, where it goes, and how much it costs. Master this and you save money, reduce latency, and get better outputs.
Context Length: Then vs Now
Context Window Budget Simulator
What you are seeing
A context window divided into sections. Each color represents a different part of what fills the context: system prompt, tool schemas, retrieved documents, conversation history, response reserve, and safety margin. The bar shows relative allocation, and the table shows exact token counts and cost impact.
What to try: Drag the dividers between sections to resize allocations. Switch context window sizes to see how budgets scale. Watch how prompt caching savings change as you adjust the system prompt and tool schema sections.
| Section | % | Tokens | Note |
|---|---|---|---|
| System Prompt | 3% | 6,000 | Instructions, persona, constraints |
| Tool Schemas | 8% | 16,000 | Function definitions, parameters, descriptions |
| Retrieved Chunks | 15% | 30,000 | RAG results, documents, knowledge |
| Chat History | 50% | 100,000 | Conversation turns (recent + summarized) |
| Response Reserve | 4% | 8,000 | Max output tokens for the model's reply |
| Safety Margin | 20% | 40,000 | Buffer for token count estimation errors |
Cost per Request (Claude Sonnet pricing)
Without caching
$0.6000
With prompt caching
$0.5406
90% savings on the 22,000-token cached prefix (≈10% of total request cost)
The Intuition
Context window as scarce resource. Every token you put in the context has a cost (literal dollars) and an opportunity cost (it displaces something else). A full 200K-token window at $3/MTok costs $0.60 per request — multiply by millions of API calls and you are burning serious money.
Prompt caching. Anthropic and OpenAI cache the KV computations for static prompt prefixes. If your system prompt + tool schemas are 15K tokens and identical across requests, caching reduces the cost of those tokens by 90% (from $3/MTok to $0.30/MTok on Claude Sonnet). Anthropic supports 5-minute and 1-hour cache tiers; high-traffic endpoints benefit most from caching stable prefixes.
Lost in the middle. Liu et al. (2023) demonstrated that LLMs show a U-shaped attention curve — strong attention to the beginning and end of context, but accuracy drops for information placed in the middle. This is not a bug in one model; it appears across GPT-4, Claude, and Llama. Place critical information at the start or end of your context.
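One practical consequence: when inserting reranked RAG chunks, do not place them in straight rank order, which buries strong mid-rank chunks in the low-attention middle. A minimal reordering helper (the function name and interleaving scheme are illustrative, not from Liu et al.):

```python
def order_for_edges(chunks_by_rank):
    """Interleave ranked chunks so the best land at the start and end.

    chunks_by_rank: list ordered best-first (rank 1 at index 0).
    Returns a new ordering: rank 1 first, rank 2 last, rank 3 second,
    and so on, so the weakest chunks fall in the middle of the context,
    where attention is weakest.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked 1..5, this yields the order 1, 3, 5, 4, 2: the two best chunks occupy the two positions with the strongest attention.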
Memory layering. No context window is big enough for an agent that runs for hours. The solution is tiered memory: recent turns stay verbatim (full fidelity), older turns get summarized (compressed but lossy), and the oldest turns go into a vector store (retrieved on demand). This mirrors how human memory works — vivid recent memories, compressed older ones, and associative recall for the rest.
Sliding window vs. summarization. A sliding window simply drops tokens beyond a limit — simple but you lose everything. Summarization compresses older context into a shorter representation — preserves key facts but costs an extra LLM call. The best systems combine both: slide the window AND summarize what falls off.
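The combined approach can be sketched as follows, assuming turns arrive as (role, text, token_count) tuples and `summarize` is any compressor (in practice an extra LLM call); the names are illustrative:

```python
def manage_history(turns, window_limit, summarize):
    """Hybrid context management: keep recent turns verbatim,
    summarize whatever falls off the sliding window.

    turns: list of (role, text, token_count) tuples, oldest first.
    window_limit: token budget for verbatim history.
    summarize: callable(list_of_turns) -> summary string.
    """
    kept, used = [], 0
    # Walk backwards so the most recent turns are kept verbatim.
    for turn in reversed(turns):
        if used + turn[2] > window_limit:
            break
        kept.append(turn)
        used += turn[2]
    kept.reverse()
    # Everything older than the window gets compressed, not dropped.
    overflow = turns[:len(turns) - len(kept)]
    summary = summarize(overflow) if overflow else None
    return summary, kept
```

The summary then occupies a small, fixed slot at the front of the history budget, while the verbatim window keeps full fidelity for recent turns.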
Infini-attention (Google, 2024) — adds a compressive memory to standard attention so context length becomes theoretically unbounded without growing memory proportionally. Each transformer block maintains two attention mechanisms: a local dot-product attention window over recent tokens, and a compressed “infinite” memory that accumulates older tokens via an associative matrix update (similar to linear attention). New tokens attend to both simultaneously, with a learned gate balancing local vs. memory attention. The memory footprint stays constant regardless of how many tokens have been processed — enabling million-token contexts on hardware that would otherwise OOM with standard quadratic attention.
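A toy single-head sketch of the idea in NumPy — the linear-attention-style memory update and the gated blend are simplified from the paper, and the fixed scalar gate stands in for a learned one:

```python
import numpy as np

def infini_attention_step(Q, K, V, M, z, gate):
    """One simplified Infini-attention read/update (single head).

    Q, K, V: (n, d) projections for the current local window.
    M: (d, d) compressive memory matrix; z: (d,) normalizer.
    gate: blend weight in [0, 1] (learned in the paper; fixed here).
    """
    def elu1(x):  # sigma(x) = ELU(x) + 1, as in linear attention
        return np.where(x > 0, x + 1.0, np.exp(x))

    # Retrieve from compressive memory (summarizes all older tokens).
    sig_q = elu1(Q)
    mem_out = (sig_q @ M) / (sig_q @ z)[:, None]

    # Standard softmax attention over the local window.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    local_out = weights @ V

    # Gate blends memory and local attention, then this window's
    # keys/values are folded into memory for future segments.
    out = gate * mem_out + (1 - gate) * local_out
    sig_k = elu1(K)
    M = M + sig_k.T @ V
    z = z + sig_k.sum(axis=0)
    return out, M, z
```

Note that M and z have fixed shapes (d, d) and (d,): memory cost stays constant no matter how many windows have been folded in.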
YaRN (2023) — extends context windows without full fine-tuning. The key insight is that RoPE encodes position through rotation angles at different frequencies, and different frequency bands have different sensitivity to interpolation. YaRN applies different interpolation ratios per frequency band: high-frequency components (encoding fine-grained local position) need less stretching than low-frequency components (encoding long-range position). This “non-uniform interpolation” preserves local attention quality while extending the effective context range, with only a small continued pre-training run on 400M tokens — far cheaper than training a new model from scratch.
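The per-band idea can be sketched as below. The ramp thresholds (`low`, `high`, measured in rotations over the original 4K context) are illustrative, not the paper's tuned constants, and the attention-temperature correction YaRN also applies is omitted:

```python
import numpy as np

def yarn_scaled_freqs(dim=128, base=10000.0, scale=32.0,
                      low=1.0, high=32.0, orig_ctx=4096):
    """Sketch of YaRN-style non-uniform RoPE interpolation.

    dim: head dimension; scale: extension factor (e.g. 4K -> 128K is 32x).
    High-frequency components (many rotations over the trained length)
    are left untouched; low-frequency ones are fully interpolated
    (frequency divided by `scale`); a linear ramp blends in between.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Rotations each frequency completes over the original context.
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    # ramp = 0: keep original frequency; ramp = 1: fully interpolate.
    ramp = np.clip((high - rotations) / (high - low), 0.0, 1.0)
    return inv_freq * ((1 - ramp) + ramp / scale)
```

The fastest band (which completes hundreds of rotations over 4K tokens) passes through unchanged, preserving local position resolution, while the slowest band is stretched by the full 32×.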
Context Engineering Patterns (2024–2025) — the emerging discipline of optimizing what goes into the context window, not just how big it is. Four key patterns: (1) Prompt caching — place static content (system prompt, tool schemas, few-shot examples) first so the KV cache prefix is stable across requests, cutting costs 90%. (2) Progressive disclosure — start with summaries of documents, expand to full text only when the model signals it needs more detail, using a multi-turn approach to stay within budget. (3) Token budgeting — explicitly allocate token quotas across sections (system, tools, history, retrieval, response) and enforce hard caps with graceful degradation rather than silent overflow. (4) Memory layering — recent turns in full, older turns summarized, oldest in a vector store for retrieval-on-demand. Each layer trades fidelity for compression, matching how human working memory works.
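Pattern (2), progressive disclosure, can be sketched as a small loop. Here `ask_model` stands in for the LLM call that decides which documents need full text, and `expand_budget` caps the number of expansion rounds; all names are illustrative:

```python
def progressive_disclosure(docs, ask_model, expand_budget=2):
    """Offer summaries first; expand to full text only on request.

    docs: dict name -> {"summary": str, "full": str}.
    ask_model: callable(context) -> list of doc names to expand
    (an LLM call in practice; any selector works for this sketch).
    expand_budget: max number of expansion rounds, to bound cost.
    """
    # Round zero: the model sees only cheap summaries.
    context = {name: d["summary"] for name, d in docs.items()}
    for _ in range(expand_budget):
        wanted = [n for n in ask_model(context) if n in docs]
        if not wanted:
            break
        for name in wanted:
            context[name] = docs[name]["full"]
    return context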
Quick check
Your system prompt changes on every request (user-specific data injected). Should you enable prompt caching for it?
Why does prompt caching save 90% on system prompts?
The Math
Cost per Request (Cached vs. Uncached)
Split input tokens $n_{\text{in}}$ into a cached prefix $n_c$ and an uncached suffix $n_u = n_{\text{in}} - n_c$. With output tokens $n_{\text{out}}$:

$$\text{Cost} = n_c \cdot p_{\text{cached}} + n_u \cdot p_{\text{in}} + n_{\text{out}} \cdot p_{\text{out}}$$

For Claude Sonnet: $p_{\text{in}} = \$3$/MTok, $p_{\text{cached}} = \$0.30$/MTok, $p_{\text{out}} = \$15$/MTok. Caching a 15K system prompt saves $15{,}000 \times (3.00 - 0.30) / 10^6 = \$0.0405$ per request.
Attention Complexity vs. Context Length
Self-attention computes an $n \times n$ attention matrix. Both compute and memory scale quadratically:

$$\text{Cost}(n) = O(n^2)$$

Doubling context length from 128K to 256K quadruples the attention computation: $(256/128)^2 = 4$. Sharding the sequence across $D$ devices reduces per-device memory to $O(n/D)$, but total compute remains $O(n^2)$. Ring Attention distributes the sequence across devices this way, passing KV blocks around a ring so no single device holds the full sequence.
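The scaling argument is a one-liner in code:

```python
def attention_cost_ratio(n_old: int, n_new: int) -> float:
    """Growth factor for attention compute and score-matrix memory,
    both of which scale as n^2 in sequence length n."""
    return (n_new / n_old) ** 2
```

So extending 128K to 256K is a 4× cost increase, and 64K to 256K is 16×.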
Lost-in-the-Middle: Accuracy vs. Position
Empirical finding from Liu et al. (2023). Accuracy as a function of document position $i$ in a context of $N$ documents follows a U-shape:

$$\text{Acc}(i) \approx a - \Delta \cdot \sin\!\left(\pi \cdot \frac{i-1}{N-1}\right)$$

where $a$ is the accuracy at the edges and $\Delta \approx 0.2$ (the roughly 20-point drop at the middle). This is not a formal model but captures the empirical pattern. Place the most relevant retrieval chunk at position 1 (start) or position $N$ (end) for best results.
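The heuristic above translates directly to code (the edge accuracy and dip are illustrative constants, not values fitted to Liu et al.'s data):

```python
import math

def u_shape_accuracy(position: int, n_docs: int,
                     edge_acc: float = 0.75, dip: float = 0.20) -> float:
    """Heuristic U-shape: accuracy of using a fact placed at 1-indexed
    `position` among `n_docs` documents in the context."""
    x = (position - 1) / (n_docs - 1)             # 0 at start, 1 at end
    return edge_acc - dip * math.sin(math.pi * x)  # dips at the middle
```

With 21 documents, positions 1 and 21 score 0.75 while position 11 (dead center) scores 0.55 — the 20-point middle penalty.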
Python: Context Budget Calculator
def compute_context_budget(
    context_window: int = 200_000,
    system_tokens: int = 4_000,
    tool_tokens: int = 12_000,
    max_retrieval_tokens: int = 30_000,
    max_response_tokens: int = 8_000,
    safety_margin: int = 10_000,
) -> dict:
    """Calculate available tokens for conversation history."""
    reserved = (system_tokens + tool_tokens + max_retrieval_tokens
                + max_response_tokens + safety_margin)
    history_budget = context_window - reserved
    # Cost estimation (Claude Sonnet pricing)
    cached = system_tokens + tool_tokens  # static prefix
    uncached = context_window - cached
    cost_no_cache = context_window * 3.0 / 1e6
    cost_cached = cached * 0.3 / 1e6 + uncached * 3.0 / 1e6
    return {
        "history_budget": history_budget,            # 136,000
        "cost_per_request_no_cache": cost_no_cache,  # $0.60
        "cost_per_request_cached": cost_cached,      # $0.5568
        "savings_pct": (1 - cost_cached / cost_no_cache) * 100,
    }
Python: Anthropic Prompt Caching Setup
import anthropic

client = anthropic.Anthropic()

# The system prompt is cached — 90% cost reduction on repeat calls
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 4K+ tokens
            "cache_control": {"type": "ephemeral"},  # <-- enables caching
        },
        {
            "type": "text",
            "text": TOOL_SCHEMAS_TEXT,  # 12K+ tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_query}],
)
# Check cache performance in response usage fields:
# cache_creation_input_tokens: tokens cached for the first time
# cache_read_input_tokens: tokens served from cache (90% cheaper)
Quick check
A model runs self-attention on a 64K-token context. You then extend to 256K tokens (4× longer). By what factor does the attention compute grow?
Break It — See What Happens
Quick check
A RAG pipeline reranks 20 chunks by relevance and inserts them in rank order (rank 1 first). Where does the #1-ranked chunk land in the model's context?
Real-World Numbers
| Model | Context Window | Notes |
|---|---|---|
| GPT-4 Turbo | 128K | ~300 pages of text. Lost-in-the-middle confirmed. |
| Claude Sonnet 4 | 200K | Prompt caching: 90% reduction on cached prefix. |
| Claude Opus 4 | 200K | Extended thinking enabled. |
| Gemini 2.0 Pro | 2M | Largest context window at time of release (experimental model). ~5,000 pages. |
| Llama 3.1 | 128K | Open-source. Extended from 8K via RoPE scaling. |
| Lost-in-middle | — | Liu 2023: consistent across GPT-4, Claude, Llama |
Prompt Caching Savings (Claude Sonnet)
| Cached Prefix Size | Savings/Request | At 10K req/day |
|---|---|---|
| 5K tokens | $0.0135 | $135/day |
| 15K tokens | $0.0405 | $405/day |
| 50K tokens | $0.135 | $1,350/day |
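The table rows follow from a one-line calculation; a minimal helper, using this page's Claude Sonnet input rates:

```python
def daily_cache_savings(prefix_tokens: int, requests_per_day: int,
                        full_rate: float = 3.00,
                        cached_rate: float = 0.30) -> float:
    """Daily dollars saved by caching a static prompt prefix.
    Rates are in $/MTok (Claude Sonnet input pricing)."""
    per_request = prefix_tokens * (full_rate - cached_rate) / 1e6
    return per_request * requests_per_day
```

For example, a 15K-token prefix at 10,000 requests/day saves $405/day, matching the middle row of the table.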
Quick check
You cache a 50K-token prefix (system prompt + tools + few-shot examples) at $0.30/MTok instead of $3/MTok. At 10,000 requests/day, what is the daily saving?
Long-Context Frontier (2024–2025)
The 2024–2025 research wave produced five distinct answers to the O(n²) attention bottleneck — from hardware-aligned sparsity to entirely non-Transformer architectures:
| System | Approach | Context / Scale | Source |
|---|---|---|---|
| NSA (DeepSeek) | Sparse attention | Hardware-aligned sparse attention, faster than dense FlashAttention-2 at long context. Production recipe in DeepSeek V3.1+ | arxiv:2502.11089, Feb 2025 |
| Jamba 1.5 (AI21) | SSM + MoE hybrid | 256K context; MoE + Mamba layers interleaved with Transformer blocks. Memory-efficient vs pure Transformer at equal context | arxiv:2408.12570, Aug 2024 |
| TTT Layers (Stanford/UCSD) | Test-time training | Replaces the KV cache with a mini-neural-net updated via gradient descent per token — O(1) memory. Strong on sequences requiring long-range dependency | arxiv:2407.04620, Jul 2024 |
| xLSTM (Hochreiter) | Extended LSTM | Exponential gating + matrix memory cells; competitive with Llama 2 on language benchmarks (NeurIPS 2024) | arxiv:2405.04517, May 2024 |
| LFM (Liquid AI) | Dynamic-system arch | Non-Transformer foundation models based on liquid neural networks / ODEs. Competitive on standard benchmarks at 1B–40B scale (Oct 2024) | liquid.ai, Oct 2024 |
Key Takeaways
What to remember for interviews
1. Context windows are a scarce, priced resource — a full 200K-token request at $3/MTok costs $0.60; at scale this demands deliberate token budgeting across system prompt, tools, history, and retrieval.
2. Prompt caching stores the KV cache for static prompt prefixes server-side, reducing cost by 90% on cached tokens — place stable content (system prompt, tool schemas, few-shot examples) first in every request.
3. The 'lost in the middle' effect: LLMs show U-shaped attention across context, attending strongly to the start and end but losing 20%+ accuracy on information placed in the middle.
4. Long conversations require tiered memory: recent turns verbatim (full fidelity), older turns summarized (compressed), oldest in a vector store (retrieved on demand) — matching how human working memory works.
5. Standard self-attention is O(n²) in sequence length; techniques like Ring Attention distribute the KV cache across devices in a ring topology, enabling million-token contexts without per-device OOM.
Recap Quiz
Long Context recap
A 200K-token Claude Sonnet request with no prompt caching costs $0.60 in input tokens ($3/MTok). A 15K-token static prefix is cached. What is the approximate cost of the cached request?
You have 20 retrieved documents. The most relevant one ranks #1. Where should you place it in the context for best accuracy?
Doubling context length from 128K to 256K affects attention compute and memory. What changes and what does not?
An agent has a 200K-token window. Static overhead (system prompt + tools) is 20K tokens. You want a 4K response reserve and a 10K safety margin. How many tokens remain for conversation history?
YaRN extends a 4K RoPE-trained model to 128K context. What is the key insight that makes it work better than naive linear interpolation?
A sliding window drops tokens beyond 50K. Summarization compresses old turns to 10% of size but costs an extra LLM call. When is summarization worth the cost?
Gemini 2.0 Pro has a 2M-token context window. What is the dominant reason NOT to stuff it completely on every request, even if cost is not a concern?
Further Reading
- RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al. 2021 — RoPE encodes relative position via rotation, enabling length extrapolation used by LLaMA and Mistral.
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models — Chen et al. 2023 — extends context window from 4K to 100K using shifted sparse attention during fine-tuning.
- Lilian Weng's Blog — Posts on long-context modeling, positional encoding extensions, and retrieval vs. context tradeoffs.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al. 2023 — positional bias in retrieval: models struggle with middle-placed evidence
- Ring Attention with Blockwise Transformers for Near-Infinite Context — Liu et al. 2023 — distributing attention across devices for arbitrarily long sequences
- YaRN: Efficient Context Window Extension of Large Language Models — Peng et al. 2023 — extends context 4-16x by interpolating RoPE frequencies; Llama used this to go from 4K to 128K context
- Anthropic — Prompt Caching Documentation — Practical guide to prefix caching: when to use it, cost savings, and how to structure prompts to maximize cache hit rate
Interview Questions
★★☆ Explain the lost-in-the-middle phenomenon. How does it affect RAG system design?
★★☆ How does prompt caching work and when should you use it? What are the cost tradeoffs?
★★☆ Design a token budget for an agent with 200K context. How do you allocate across system prompt, tools, retrieval, history, and response?
★★★ Compare sliding window attention, summarization, and vector store retrieval for managing long conversations. When would you use each?
★★★ What is the computational complexity of attention with respect to context length, and how do approaches like Ring Attention address this?
★★★ You're building a customer support agent. The system prompt is 8K tokens, tool schemas are 12K, and you have a 128K context window. A user sends a conversation with 150K tokens of history. What do you do?