📋 Prompt Engineering (System)
The system prompt has a secret boundary — everything before it is cached, everything after is fresh
The system prompt is not a static string — it is assembled dynamically from ~15 section functions. A SYSTEM_PROMPT_DYNAMIC_BOUNDARY splits the prompt into a cacheable static prefix and a dynamic suffix that changes per session.
- Static prefix — identity, rules, tool descriptions, tone (same across requests)
- Dynamic suffix — git status, date, memory, MCP instructions, skills
- Tool descriptions sorted deterministically for cache stability (~3K tokens)
- CLAUDE.md injected as user context, not system prompt, to preserve cache
Prompt Assembly Pipeline
What you are seeing
The system prompt is assembled from sections in a fixed order. Everything above the cache boundary is identical across requests — the API reuses cached KV computations for this prefix. Everything below changes per session.
What to try
Notice how the static sections are ordered by stability — identity and rules never change, tool descriptions change only when tools are added/removed. Dynamic sections change every session or even every request.
The Intuition
The Cost Problem
The system prompt is ~10K tokens. It is sent on every API call. Without caching, you pay $15/M × 10K = $0.15 per call just for the prompt. With caching (same prefix reused): $1.50/M × 10K = $0.015. That is a 10x savings on every single API call.
The cache boundary separates content that never changes (tool docs, behavior rules) from content that does (git status, available skills). Put stable content first — it is the only part the API can cache.
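A quick sketch of that arithmetic in TypeScript; the per-token rates are the illustrative figures from above, not authoritative pricing:

```typescript
// Illustrative per-call cost of the system prompt, with and without caching.
// Rates are the example figures from the text, not current API pricing.
const promptTokens = 10_000;
const inputPricePerMTok = 15.0;  // $ per million uncached input tokens
const cachedPricePerMTok = 1.5;  // $ per million cache-read tokens (~10%)

const uncachedCost = (promptTokens / 1_000_000) * inputPricePerMTok; // $0.15
const cachedCost = (promptTokens / 1_000_000) * cachedPricePerMTok;  // $0.015

console.log(`Per call: $${uncachedCost.toFixed(3)} uncached vs $${cachedCost.toFixed(3)} cached`);
```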
Why the Cache Boundary Matters
Every API request computes attention over the full prompt. For a 5K-token static prefix, that is 5K tokens of KV computation on every request. Prompt caching reuses the KV cache from previous requests if the prompt prefix matches. The cache boundary tells the API that everything before this line is identical across requests, so the cached computation can be reused. This saves ~90% on cached prefix tokens.
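As a rough sketch (not Claude Code's actual code), the assembled prompt can be split at the boundary marker so the static prefix is sent as its own cacheable block; splitAtBoundary is a hypothetical helper:

```typescript
// Hypothetical helper: split the assembled prompt at the boundary marker so
// the static prefix can be submitted as a separately cacheable block.
function splitAtBoundary(prompt: string): { staticPrefix: string; dynamicSuffix: string } {
  const [staticPrefix, ...rest] = prompt.split("SYSTEM_PROMPT_DYNAMIC_BOUNDARY");
  return {
    staticPrefix: staticPrefix.trim(),
    dynamicSuffix: rest.join("").trim(),
  };
}
```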
Tool Ordering Is Load-Bearing
Tool descriptions total ~3K tokens — the largest section in the static prefix. If tools appear in different orders across requests, the prefix diverges and the cache misses. The assembleToolPool() function sorts tools alphabetically and deduplicates them. This ensures the tool section is byte-identical across all requests within a session.
Why CLAUDE.md Is User Context
CLAUDE.md contains per-project instructions — it changes every time the user switches projects. If it were in the system prompt, it would invalidate the entire cache on every project switch. By injecting it as user context (a separate message), the system prompt cache stays intact. The model sees the same information either way — the distinction is purely for cache optimization.
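A minimal sketch of that pattern, assuming a hypothetical buildUserTurn helper (the real injection format may differ):

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Sketch: CLAUDE.md rides along with the user turn, so switching projects
// changes a message, not the system prompt, and the prefix cache stays warm.
function buildUserTurn(claudeMd: string, userRequest: string): Message {
  return {
    role: "user",
    content: `<project-instructions>\n${claudeMd}\n</project-instructions>\n\n${userRequest}`,
  };
}
```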
Prompt Variants
The prompt builder supports multiple modes: simple mode (reduced instructions for quick tasks), proactive mode (suggests next steps), coordinator mode (orchestrates sub-agents), and sub-agent mode (receives delegated work). Each variant shares the same static prefix and diverges only in specific sections, maximizing cache reuse across modes.
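One way to picture this (a sketch; the mode texts below are placeholders, not the real instructions): every mode reuses the same static prefix and only appends a mode-specific section plus the dynamic suffix.

```typescript
type PromptMode = "simple" | "proactive" | "coordinator" | "subagent";

// Placeholder mode instructions, for illustration only.
const modeSection: Record<PromptMode, string> = {
  simple: "Keep instructions minimal; answer directly.",
  proactive: "After finishing a task, suggest a sensible next step.",
  coordinator: "Delegate sub-tasks to sub-agents and merge their results.",
  subagent: "You are handling a delegated sub-task; report results concisely.",
};

// Modes diverge only after the shared (cacheable) static prefix.
function buildPromptForMode(staticPrefix: string, mode: PromptMode, dynamicSuffix: string): string {
  return [staticPrefix, modeSection[mode], dynamicSuffix].join("\n\n");
}
```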
What Actually Happens on a Cache Hit
To understand why caching matters, it helps to know what the API skips on a hit. Each transformer layer computes attention over all tokens in the prompt — specifically, it produces key (K) and value (V) matrices for every token. For a 5K-token system prompt across 96 attention layers, that is 5,000 × 96 = 480,000 token-layer positions — each holding both a K and a V vector (960,000 total vectors) — computed on every request. Prompt caching stores these KV matrices server-side after the first request. On a cache hit, the server skips recomputing them entirely — it retrieves the stored KV pairs and continues attention from the dynamic suffix. This is why cached tokens cost ~10% of regular input tokens: the GPU work for the prefix is amortized across all requests that share it. The cache TTL on Anthropic's API is 5 minutes of inactivity — a long-running agent session with frequent turns keeps the cache warm indefinitely, while a cold start after an idle period pays full price for the first request.
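A back-of-the-envelope sketch of what gets stored. The 5K-token and 96-layer figures come from the text; the KV head count, head dimension, and dtype below are assumed purely for illustration, not published model specs:

```typescript
// Rough KV-cache footprint of the cached prefix.
// ASSUMED for illustration: 8 KV heads, head dim 128, 2 bytes per value (bf16).
const prefixTokens = 5_000;
const layers = 96;
const kvHeads = 8;
const headDim = 128;
const bytesPerValue = 2;

const tokenLayerPositions = prefixTokens * layers; // 480,000
const kvVectors = tokenLayerPositions * 2;         // 960,000 (one K + one V each)
const bytes = kvVectors * kvHeads * headDim * bytesPerValue;

console.log(`${tokenLayerPositions} positions, ${kvVectors} vectors, ~${(bytes / 1e9).toFixed(1)} GB`);
```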
Cache Breakpoints vs Automatic Prefix Caching
Anthropic's API supports explicit cache_control breakpoints — the client marks specific message boundaries as cacheable. OpenAI uses automatic prefix caching: the longest common prefix across requests is cached automatically without client-side hints. The explicit model (Anthropic) gives the application more control — Claude Code uses it to mark the static prefix boundary precisely, ensuring even partially-overlapping prompts hit the cache. The automatic model (OpenAI) is simpler to use but offers less predictability: a single inserted token at the wrong position invalidates the entire cache. Claude Code's explicit boundary marker is the correct solution for the explicit model — it is not an optimization hack, it is the intended usage pattern documented in Anthropic's prompt caching guide.
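A hedged sketch of the explicit model using the Anthropic TypeScript SDK: the system prompt is passed as an array of text blocks, and a cache_control marker on the last static block declares where the reusable prefix ends. The model id and prompt strings here are placeholders.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Placeholder strings; in practice these come from the prompt builder.
const staticPrefix = "...identity, rules, sorted tool descriptions, tone...";
const dynamicSuffix = "...git status, date, memory, MCP instructions, skills...";

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5", // placeholder model id
  max_tokens: 1024,
  system: [
    // Everything up to and including this block is eligible for prefix caching.
    { type: "text", text: staticPrefix, cache_control: { type: "ephemeral" } },
    // Per-session content goes after the breakpoint.
    { type: "text", text: dynamicSuffix },
  ],
  messages: [{ role: "user", content: "Summarize this repo's build setup." }],
});
```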
Key Code Patterns
System Prompt Builder (TypeScript pseudocode)
function buildSystemPrompt(tools: Tool[], model: string, mcpClients: McpClient[]): string {
const sections: string[] = [];
// STATIC sections (cacheable — same across requests)
sections.push(introSection()); // "You are Claude Code..."
sections.push(systemRules()); // Core behavior rules
sections.push(doingTasksSection()); // Task execution patterns
sections.push(actionsSection()); // Safety/reversibility
sections.push(toolsSection(tools)); // Tool descriptions (sorted!)
sections.push(toneStyleSection()); // Output formatting
sections.push(efficiencySection()); // "Be concise"
// === CACHE BOUNDARY ===
sections.push("SYSTEM_PROMPT_DYNAMIC_BOUNDARY");
// DYNAMIC sections (change per session)
sections.push(sessionGuidance()); // Runtime context
sections.push(memoryPrompt()); // Memory system instructions
sections.push(envInfo(model)); // OS, shell, model name
sections.push(mcpInstructions()); // MCP server docs
sections.push(skillsGuidance()); // Available skills
return sections.join("\n\n");
}
Deterministic Tool Ordering
function assembleToolPool(builtinTools: Tool[], mcpTools: Tool[]): Tool[] {
// Deterministic order = cache-stable tool descriptions
const allTools = [...builtinTools, ...mcpTools];
// Alphabetical sort — same order every time
allTools.sort((a, b) => a.name.localeCompare(b.name));
// Deduplicate (MCP tool might shadow a built-in)
const seen = new Set<string>();
const unique: Tool[] = [];
for (const tool of allTools) {
if (!seen.has(tool.name)) {
seen.add(tool.name);
unique.push(tool);
}
}
return unique;
}
// Why this matters:
// Request 1 tools: [Bash, Edit, Glob, Grep, Read, Write]
// Request 2 tools: [Bash, Edit, Glob, Grep, Read, Write]
// Same order → same tokens → cache HIT
//
// Without sorting:
// Request 1: [Read, Bash, Write, Edit, Grep, Glob]
// Request 2: [Bash, Read, Edit, Write, Glob, Grep]
// Different order → different tokens → cache MISS
Cache Economics
// Cache hit savings calculation
const staticPrefixTokens = 5000;   // identity + rules + tools + tone
const dynamicSuffixTokens = 500;   // git status + date + memory
const apiCallsPerTask = 50;        // typical complex task

// WITHOUT cache optimization
const totalInputTokens = (staticPrefixTokens + dynamicSuffixTokens) * apiCallsPerTask;
// = 5500 * 50 = 275,000 tokens computed

// WITH cache optimization (~90% cache hit on static prefix)
const cachedTokens = staticPrefixTokens * apiCallsPerTask * 0.9;
// = 225,000 tokens reused from cache (not re-computed)
const computedTokens = totalInputTokens - cachedTokens;
// = 50,000 tokens actually computed

// Cache hit tokens cost ~10% of regular input tokens
// Effective savings: ~80% on the static prefix portion
Real-World Numbers
| Metric | Value |
|---|---|
| Section functions | ~15 functions |
| Static prefix size | ~5K tokens |
| Tool description budget | ~3K tokens (largest section) |
| Dynamic suffix size | ~500 tokens |
| Cache hit rate | ~90% on static prefix |
| Cache token cost | ~10% of regular input tokens |
Key Takeaways
What to remember for interviews
1. The system prompt has a cache boundary: everything before it is cached across requests, everything after is recomputed fresh.
2. Deterministic tool sorting (by name) maximizes the cache hit rate; changing tool order invalidates the entire KV cache prefix.
3. System prompt assembly composes ~15 section functions (identity, rules, tool schemas, dynamic context) into one string; CLAUDE.md stays out of it and is injected as user context to keep the prefix cacheable.
4. Cache savings are ~10x: a 10K-token system prompt costs full price once, then ~1/10th on subsequent requests with the same prefix.
Further Reading
- Anthropic Prompt Caching Documentation — Official documentation on how prompt prefix caching works — the mechanism this module's optimizations target.
- Claude Code (source) — The open-source implementation of the dynamic prompt builder with cache boundary optimization.
- Efficient Transformers: A Survey — Tay et al., 2020 — covers KV cache and attention computation optimizations that make prompt caching possible.
- System Prompt Design Best Practices — Anthropic's guide to writing effective system prompts — complements the caching optimization strategy.
- PagedAttention: Efficient Memory Management for LLM Serving (vLLM) — Kwon et al., 2023 — the KV cache memory management technique that makes server-side prompt caching economically feasible at scale.
- Prefix Caching in vLLM — How open-source serving frameworks implement automatic prefix caching — the same mechanism Anthropic exposes via cache_control breakpoints.
- OpenAI Prompt Caching Guide — OpenAI's equivalent caching documentation — useful for comparing API designs and understanding the industry-standard interface.
Interview Questions
- How would you optimize an AI system's prompt for API caching? ★★☆
- Design a system prompt that's assembled dynamically but maximizes cache hits. ★★★
- What are the tradeoffs of a static vs dynamic system prompt? ★★☆
- When should you use exact-match vs. prefix-match caching for prompts, and what are the cost tradeoffs? ★★★