
Transformer Math

Module 56 · AI Engineering

📋 Prompt Engineering (System)

The system prompt has a secret boundary — everything before it is cached, everything after is fresh


The system prompt is not a static string — it is assembled dynamically from ~15 section functions. A SYSTEM_PROMPT_DYNAMIC_BOUNDARY splits the prompt into a cacheable static prefix and a dynamic suffix that changes per session.

  • Static prefix — identity, rules, tool descriptions, tone (same across requests)
  • Dynamic suffix — git status, date, memory, MCP instructions, skills
  • Tool descriptions sorted deterministically for cache stability (~3K tokens)
  • CLAUDE.md injected as user context, not system prompt, to preserve cache
🎮

Prompt Assembly Pipeline

What you are seeing

The system prompt is assembled from sections in a fixed order. Everything above the cache boundary is identical across requests — the API reuses cached KV computations for this prefix. Everything below changes per session.

What to try

Notice how the static sections are ordered by stability — identity and rules never change, tool descriptions change only when tools are added/removed. Dynamic sections change every session or even every request.

[Interactive diagram: System Prompt Assembly (click any section to learn more).]

  Static, cached by the API (sections 1–7): Intro ("You are Claude Code..."), System Rules (core behavior constraints), Doing Tasks (task execution guidance), Actions (safety & reversibility rules), Tool Descriptions (sorted A–Z for stable tokens), Tone & Style (formatting rules), Output Efficiency ("Be concise")

  ══ CACHE BOUNDARY ══

  Dynamic, fresh per session (sections 8–13): Session Guidance (runtime info), Memory Prompt (auto-memory instructions), Environment Info (OS, shell, model, CWD), MCP Instructions (varies by active servers), Skills Guidance (available skills), Language (user's language preference)

  CLAUDE.md: injected as USER context, not system.
💡

The Intuition

The Cost Problem

The system prompt is ~10K tokens. It is sent on every API call. Without caching, you pay $15/M × 10K = $0.15 per call just for the prompt. With caching (same prefix reused): $1.50/M × 10K = $0.015. That is a 10x savings on every single API call.

The cache boundary separates content that never changes (tool docs, behavior rules) from content that does (git status, available skills). Put stable content first — it is the only part the API can cache.
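The arithmetic above can be sketched directly. The rates below are the illustrative $15/M input and $1.50/M cached-read figures used in this section, not authoritative pricing:

```typescript
// Illustrative rates (USD per million tokens); real pricing varies by model.
const PRICE_PER_M_INPUT = 15.0;   // uncached input tokens
const PRICE_PER_M_CACHED = 1.5;   // cached-prefix read tokens (~10x cheaper)

function promptCostUSD(tokens: number, cachedPrefix: boolean): number {
  const rate = cachedPrefix ? PRICE_PER_M_CACHED : PRICE_PER_M_INPUT;
  return (tokens / 1_000_000) * rate;
}

// A ~10K-token system prompt on every call:
const uncachedCost = promptCostUSD(10_000, false); // ≈ $0.15 per call
const cachedCost = promptCostUSD(10_000, true);    // ≈ $0.015 per call
console.log(uncachedCost, cachedCost);
```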

Why the Cache Boundary Matters

Every API request computes attention over the full prompt. For a 5K token system prompt, that is 5K tokens of KV cache computation on every request. Prompt caching reuses the KV cache from previous requests if the prompt prefix matches. The cache boundary tells the API: everything before this line is identical across requests — reuse the cached computation. This saves ~90% on cached prefix tokens.

Tool Ordering Is Load-Bearing

Tool descriptions total ~3K tokens — the largest section in the static prefix. If tools appear in different orders across requests, the prefix diverges and the cache misses. The assembleToolPool() function sorts tools alphabetically and deduplicates them. This ensures the tool section is byte-identical across all requests within a session.

💡 Tip · Over 50+ API calls per task, cache hits on the static prefix save significant cost. With ~5K cacheable tokens at ~90% cache hit rate, you avoid re-computing ~225K tokens of KV cache per task.

Why CLAUDE.md Is User Context

CLAUDE.md contains per-project instructions — it changes every time the user switches projects. If it were in the system prompt, it would invalidate the entire cache on every project switch. By injecting it as user context (a separate message), the system prompt cache stays intact. The model sees the same information either way — the distinction is purely for cache optimization.
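A minimal sketch of that injection pattern. The function name, message shape, and wrapper tag are assumptions for illustration, not the actual implementation:

```typescript
interface ChatMessage { role: "user" | "assistant"; content: string; }

// Hypothetical sketch: project instructions ride along as user context,
// so the system prompt (and its cached prefix) never changes across projects.
function buildMessages(claudeMd: string, userRequest: string): ChatMessage[] {
  return [
    {
      role: "user",
      content: `<project-instructions>\n${claudeMd}\n</project-instructions>`,
    },
    { role: "user", content: userRequest },
  ];
}
```

Switching projects changes only these user messages; the system prompt bytes stay identical, so the prefix cache survives.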

Prompt Variants

The prompt builder supports multiple modes: simple mode (reduced instructions for quick tasks), proactive mode (suggests next steps), coordinator mode (orchestrates sub-agents), and sub-agent mode (receives delegated work). Each variant shares the same static prefix and diverges only in specific sections, maximizing cache reuse across modes.
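One way the shared-prefix property could look in code. The mode names come from the text; the section strings and helper are placeholders:

```typescript
type Mode = "simple" | "proactive" | "coordinator" | "subagent";

// Stand-in for the real static sections (identity, rules, tools, tone).
const STATIC_PREFIX = "identity + rules + tools + tone";

function modeSection(mode: Mode): string {
  switch (mode) {
    case "simple": return "Reduced instructions for quick tasks.";
    case "proactive": return "Suggest next steps after each task.";
    case "coordinator": return "Orchestrate sub-agents.";
    case "subagent": return "You receive delegated work.";
  }
}

function buildPrompt(mode: Mode): string {
  // Divergence happens only after the shared prefix, so all four
  // variants hit the same cached prefix.
  return `${STATIC_PREFIX}\n\n${modeSection(mode)}`;
}
```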

✨ Insight · The prompt is not just text — it is a carefully engineered artifact where section order, content stability, and injection point all affect performance and cost. Moving a single section from static to dynamic can cost hundreds of dollars per month at scale.

What Actually Happens on a Cache Hit

To understand why caching matters, it helps to know what the API skips on a hit. Each transformer layer computes attention over all tokens in the prompt — specifically, it produces key (K) and value (V) matrices for every token. For a 5K-token system prompt across 96 attention layers, that is 5,000 × 96 = 480,000 token-layer positions — each holding both a K and a V vector (960,000 total vectors) — computed on every request. Prompt caching stores these KV matrices server-side after the first request. On a cache hit, the server skips recomputing them entirely — it retrieves the stored KV pairs and continues attention from the dynamic suffix. This is why cached tokens cost ~10% of regular input tokens: the GPU work for the prefix is amortized across all requests that share it. The cache TTL on Anthropic's API is 5 minutes of inactivity — a long-running agent session with frequent turns keeps the cache warm indefinitely, while a cold start after an idle period pays full price for the first request.
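The vector-count arithmetic from that paragraph, made explicit (the prompt size and layer count are the same illustrative figures used above):

```typescript
const promptTokens = 5_000;
const attentionLayers = 96;

// Each layer stores one K and one V vector per token of the prompt.
const tokenLayerPositions = promptTokens * attentionLayers; // 480,000
const kvVectors = tokenLayerPositions * 2;                  // 960,000

// On a cache hit, all of these are retrieved from the server-side cache
// instead of being recomputed.
console.log(tokenLayerPositions, kvVectors);
```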

Cache Breakpoints vs Automatic Prefix Caching

Anthropic's API supports explicit cache_control breakpoints — the client marks specific message boundaries as cacheable. OpenAI uses automatic prefix caching: the longest common prefix across requests is cached automatically without client-side hints. The explicit model (Anthropic) gives the application more control — Claude Code uses it to mark the static prefix boundary precisely, ensuring even partially-overlapping prompts hit the cache. The automatic model (OpenAI) is simpler to use but offers less predictability: a single inserted token at the wrong position invalidates the entire cache. Claude Code's explicit boundary marker is the correct solution for the explicit model — it is not an optimization hack, it is the intended usage pattern documented in Anthropic's prompt caching guide.
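With the explicit model, the breakpoint is a `cache_control` annotation on a content block. A sketch of the request shape, following Anthropic's public prompt-caching docs; the model id and all text content are placeholders:

```typescript
// Messages API request body with an explicit cache breakpoint: the system
// field is an array of blocks, and the static prefix block is marked cacheable.
const requestBody = {
  model: "claude-sonnet-4-5", // placeholder model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are Claude Code... (identity, rules, tool docs, tone)",
      // Everything up to and including this block is cached.
      cache_control: { type: "ephemeral" },
    },
    {
      type: "text",
      // No cache_control: this block changes per session and is never cached.
      text: "Session guidance, env info, skills (changes per session)",
    },
  ],
  messages: [{ role: "user", content: "Fix the failing test." }],
};
console.log(JSON.stringify(requestBody.system[0]));
```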

Quick Check

Why is CLAUDE.md content injected as user context instead of in the system prompt?

📐

Key Code Patterns

System Prompt Builder (TypeScript pseudocode)

typescript
function buildSystemPrompt(tools: Tool[], model: string, mcpClients: McpClient[]): string {
  const sections: string[] = [];

  // STATIC sections (cacheable — same across requests)
  sections.push(introSection());           // "You are Claude Code..."
  sections.push(systemRules());            // Core behavior rules
  sections.push(doingTasksSection());      // Task execution patterns
  sections.push(actionsSection());         // Safety/reversibility
  sections.push(toolsSection(tools));      // Tool descriptions (sorted!)
  sections.push(toneStyleSection());       // Output formatting
  sections.push(efficiencySection());      // "Be concise"

  // === CACHE BOUNDARY ===
  sections.push("SYSTEM_PROMPT_DYNAMIC_BOUNDARY");

  // DYNAMIC sections (change per session)
  sections.push(sessionGuidance());        // Runtime context
  sections.push(memoryPrompt());           // Memory system instructions
  sections.push(envInfo(model));           // OS, shell, model name
  sections.push(mcpInstructions());        // MCP server docs
  sections.push(skillsGuidance());         // Available skills

  return sections.join("\n\n");
}

Deterministic Tool Ordering

typescript
function assembleToolPool(builtinTools: Tool[], mcpTools: Tool[]): Tool[] {
  // Deterministic order = cache-stable tool descriptions
  const allTools = [...builtinTools, ...mcpTools];

  // Alphabetical sort — same order every time
  allTools.sort((a, b) => a.name.localeCompare(b.name));

  // Deduplicate (MCP tool might shadow a built-in)
  const seen = new Set<string>();
  const unique: Tool[] = [];
  for (const tool of allTools) {
    if (!seen.has(tool.name)) {
      seen.add(tool.name);
      unique.push(tool);
    }
  }
  return unique;
}

// Why this matters:
// Request 1 tools: [Bash, Edit, Glob, Grep, Read, Write]
// Request 2 tools: [Bash, Edit, Glob, Grep, Read, Write]
// Same order → same tokens → cache HIT
//
// Without sorting:
// Request 1: [Read, Bash, Write, Edit, Grep, Glob]
// Request 2: [Bash, Read, Edit, Write, Glob, Grep]
// Different order → different tokens → cache MISS

Cache Economics

typescript
// Cache hit savings calculation
const staticPrefixTokens = 5000;   // identity + rules + tools + tone
const dynamicSuffixTokens = 500;   // git status + date + memory
const apiCallsPerTask = 50;        // typical complex task

// WITHOUT cache optimization
const totalInputTokens = (staticPrefixTokens + dynamicSuffixTokens) * apiCallsPerTask;
// = 5500 * 50 = 275,000 tokens computed

// WITH cache optimization (~90% cache hit on static prefix)
const cachedTokens = staticPrefixTokens * apiCallsPerTask * 0.9;
// = 225,000 tokens reused from cache (not re-computed)
const computedTokens = 275000 - 225000;
// = 50,000 tokens actually computed

// Cache hit tokens cost ~10% of regular input tokens
// Effective savings: ~80% on the static prefix portion
🔧

Break It — See What Happens

  • Random tool ordering
  • No cache boundary (all sections dynamic)
📊

Real-World Numbers

Metric · Value
Section functions · ~15 functions
Static prefix size · ~5K tokens
Tool description budget · ~3K tokens (largest section)
Dynamic suffix size · ~500 tokens
Cache hit rate · ~90% on static prefix
Cache token cost · ~10% of regular input tokens
✨ Insight · The static/dynamic split is a design constraint that propagates through the entire system. Adding a new feature to the system prompt requires deciding: is this static (same across requests) or dynamic (changes per session)? Getting it wrong costs real money at scale.
🧠

Key Takeaways

What to remember for interviews

  1. The system prompt has a cache boundary — everything before it is cached across requests, everything after is recomputed fresh.
  2. Deterministic tool sorting (by name) maximizes cache hit rate — changing tool order invalidates the entire KV cache prefix.
  3. CLAUDE.md and other per-project context are injected as user messages, not into the system prompt, so switching projects never invalidates the cached prefix.
  4. Cache savings are ~10x: a 10K-token system prompt costs full price once, then ~1/10th on subsequent requests with the same prefix.
🎯

Interview Questions


How would you optimize an AI system's prompt for API caching?

★★☆
Anthropic · Google

Design a system prompt that's assembled dynamically but maximizes cache hits.

★★★
Anthropic

What are the tradeoffs of a static vs dynamic system prompt?

★★☆
OpenAI

When should you use exact-match vs. prefix-match caching for prompts, and what are the cost tradeoffs?

★★★
Anthropic