
Module 46 · AI Engineering

⚙️ Agent Harness Architecture

Claude Code runs a while(true) loop — here's what's inside

Every time you type a message in Claude Code, a while(true) loop decides whether to respond or call another tool. That loop — and the systems around it — is the agent harness.

  • An agent is just a loop: prompt → LLM → tool_use → execute → repeat
  • The harness is everything around that loop: permissions, context management, tool orchestration, compaction
  • This is not theory — this is how production agents (Claude Code, Cursor, Devin) actually work
🎮

The Agent Loop — Interactive Diagram

What you are seeing

The complete lifecycle of a single agent turn: user message enters the loop, the system prompt is assembled, the LLM generates a response, and tool calls are executed before looping back. The permission system gates every tool call, and context compaction kicks in when the conversation grows too long.

What to try

Follow the flow from User Message through each stage. Notice how the loop branches: if the LLM returns tool_use blocks, tools execute and results feed back into the next API call. If the response is text only, the loop exits and renders to the user.

[Interactive diagram: Claude Code Agentic Loop — click any step to learn more. Main loop: (1) User Input → (2) Input Router (/help, /clear handled locally; !cmd → shell; text → API) → (3) QueryEngine → (4) System Prompt Assembly (static/cacheable + dynamic/per-turn) → (5) API Call (POST /v1/messages, streamed) → (6) Response Parse (text → render to user; tool_use → orchestrator) → (7) Tool Orchestration (read-only → parallel batch; write → serial queue) → (8) Permission Check (hooks → rules → mode → prompt; deny/allow) → (9) Tool Execution (built-in: Read, Edit, Bash; MCP: external servers; Skills: plugin bundles) → (10) Loop Back (tool_result → message list). Side systems: Context Compaction (auto | reactive | micro), Sub-agents (fresh QueryEngine each), Extensions (skills | plugins | MCP).]
💡

The Intuition

The Problem: LLMs Are Stateless Autocomplete

Without the loop, an LLM is just autocomplete — it can't act on its answers. You ask "fix the bug" and it tells you what to change, but it can't actually open the file, read the code, make the edit, and verify the fix. The agentic harness adds that missing action layer.

Worked Example: "Fix the bug"

Step | What happens | Output
1 | User: "fix the bug" | System prompt assembled + API call
2 | Claude responds with tool_use | "I'll read auth.ts" → Read("auth.ts")
3 | Harness executes Read tool | File contents appended to conversation
4 | API called again with tool result | Claude: "Found it, editing line 42" → Edit(...)
5 | Harness executes Edit tool | File written, result appended
6 | API called again | "Done!" — no tool_use → loop ends

The Agentic Loop

The core cycle: user message → system prompt assembly → API call → parse response. If the response contains tool_use blocks, the harness executes each tool, appends the results to the conversation, and calls the API again. If the response is text only, the turn is done — render to the user. This is what makes it “agentic”: it does not just respond, it acts and observes.

System Prompt Assembly

The system prompt is not a static string — it is assembled dynamically from ~15 functions at each API call. It is split by a SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker for API cache optimization:

  • Static sections (behavior rules, tool docs) — cached across requests, saving latency and cost
  • Dynamic sections (git status, available skills, current date) — change per session, placed after the boundary
  • CLAUDE.md files — injected as user context, not system prompt, to preserve the cache hit rate
💡 Tip · This cache split is a significant optimization. The static portion (~8K tokens) gets a cache hit on every request within a session, while only the dynamic portion (~2K tokens) needs reprocessing. For agents that make 50+ API calls per task, this saves substantial time-to-first-token latency.
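
To make the split concrete, here is a minimal sketch; the section functions are hypothetical stand-ins, and only the cache_control block shape follows the real Anthropic Messages API:

typescript
// Illustrative sketch: the section functions are stand-ins, not Claude Code's.
const staticSections = [
  () => "Behavior rules: ...", // stable for the whole session
  () => "Tool docs: ...",
];
const dynamicSections = [
  () => `Current date: ${new Date().toISOString()}`, // changes per turn
  () => "git status: ...",
];

// Builds the `system` parameter for the Messages API: two text blocks, with
// cache_control on the static one so its prefix is cached across calls.
function buildSystemParam() {
  return [
    {
      type: "text" as const,
      text: staticSections.map((s) => s()).join("\n\n"),
      cache_control: { type: "ephemeral" as const },
    },
    {
      type: "text" as const,
      text: dynamicSections.map((s) => s()).join("\n\n"),
    },
  ];
}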

Tool Orchestration

When the LLM returns multiple tool calls, the harness partitions them into batches. Read-only tools (Glob, Grep, Read) run in parallel; write tools (Bash, Edit) run serially:

// Input: [Grep, Glob, Read, Bash, Grep]
Batch 1: [Grep, Glob, Read] → concurrent (3 parallel)
Batch 2: [Bash] → serial (write tool)
Batch 3: [Grep] → concurrent (1 tool, but could be more)

Up to 10 concurrent tool calls per batch. This maximizes throughput for exploration-heavy tasks while preventing race conditions on writes.
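
A hedged sketch of how a batch might then execute: concurrent (and capped) for read-only tools, strictly serial for writes. The ToolCall shape here is an assumption:

typescript
// Illustrative executor, not Claude Code's actual code.
interface ToolCall {
  isReadOnly: boolean;
  execute(): Promise<string>;
}

const MAX_CONCURRENT = 10; // cap on parallel read-only calls per batch

async function runBatch(batch: ToolCall[]): Promise<string[]> {
  const results: string[] = [];
  if (batch.every((c) => c.isReadOnly)) {
    // Read-only: run in chunks so at most MAX_CONCURRENT are in flight
    for (let i = 0; i < batch.length; i += MAX_CONCURRENT) {
      const chunk = batch.slice(i, i + MAX_CONCURRENT);
      results.push(...(await Promise.all(chunk.map((c) => c.execute()))));
    }
  } else {
    // Writes: strictly serial, so edits and shell commands cannot race
    for (const call of batch) results.push(await call.execute());
  }
  return results;
}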

Permission System (Layered Hierarchy)

Every tool call passes through a layered permission check, evaluated top to bottom; within the rule layers, deny rules are checked before ask rules, which are checked before allow rules:

  1. PreToolUse hooks — shell scripts: exit 0 = pass to the next layer (subsequent deny/allow rules and prompts still run), exit 2 = blocking deny. Organizations can enforce custom policies
  2. Deny rules — absolute blocks configured in settings.json (e.g., never allow rm -rf)
  3. Ask rules — require user confirmation for specific matched patterns (first match wins)
  4. Allow rules — pattern matching for pre-approved operations (e.g., Bash(npm test) always allowed)
  5. Permission mode — user-chosen level: default, acceptEdits, plan, dontAsk, or bypassPermissions
  6. Interactive prompt — ask the user (last resort)

Sub-agents

When a task is complex, the agent can spawn sub-agents — fresh QueryEngine instances with clean context:

  • Context isolation — not polluted by the parent's 100K+ token conversation history
  • Background execution — parent continues working while sub-agent runs
  • Worktree isolation — git worktree gives each sub-agent a separate checkout, preventing file conflicts
  • Tool subset — no recursive Agent tool by default (prevents fork bombs)
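
A minimal sketch of the spawn path under those constraints; the QueryEngine constructor and run signature are assumptions in the same pseudocode register as the Key Code Patterns below, not actual internals:

typescript
// Hypothetical sketch: fresh engine, clean history, restricted tool set.
async function spawnSubAgent(task: string, parentTools: Tool[]): Promise<string> {
  // Tool subset: strip the Agent tool so sub-agents cannot spawn recursively
  const tools = parentTools.filter((t) => t.name !== "Agent");

  // Fresh QueryEngine: starts from an empty message list, so the parent's
  // 100K+ token history never enters the sub-agent's context
  const engine = new QueryEngine({ tools, messages: [] });

  // Runs to completion independently; only the final text is aggregated
  // back into the parent's conversation as a single tool_result
  const result = await engine.run(task);
  return result.finalText;
}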

Context Compaction

As conversations grow, the harness uses four strategies to stay within the context window:

  • Auto-compact — proactive at ~80% context usage. Summarizes the conversation while preserving current task state
  • Reactive compact — emergency fallback when the API returns "prompt too long"
  • Microcompact — summarize individual large tool results inline (e.g., a 500-line file read becomes a 10-line summary)
  • Context collapse — structurally remove old system reminders that are no longer relevant
✨ Insight · Compaction is not optional — without it, agents hit the context limit after ~20 tool calls in a complex task. The key challenge is preserving enough context that the agent does not lose track of what it was doing, while freeing enough space to continue working.
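
A hedged sketch of the auto-compact decision; the window size, keep-count, and helper functions are illustrative assumptions:

typescript
// Illustrative policy, not the real implementation.
const CONTEXT_LIMIT = 200_000; // model context window in tokens (assumed)
const COMPACT_BUFFER = 13_000; // trigger early enough that the summary fits
const KEEP_RECENT = 10;        // most recent messages kept verbatim

async function maybeCompact(messages: Message[]): Promise<Message[]> {
  if (tokenCount(messages) < CONTEXT_LIMIT - COMPACT_BUFFER) return messages;

  // Auto-compact: summarize older turns, preserve current task state verbatim
  const older = messages.slice(0, -KEEP_RECENT);
  const summary = await summarizeWithLLM(older); // one extra API call
  return [summaryMessage(summary), ...messages.slice(-KEEP_RECENT)];
}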

Error Handling and Retry Logic

The harness distinguishes between two failure modes. A tool error (e.g., file not found, command exits non-zero) is not a crash — the result is appended to the conversation as a tool_result with is_error: true and the loop continues. The LLM reads the error and self-corrects — it can try a different path, fix its input, or ask the user. An API error (rate limit, network timeout, context overflow) requires different handling: rate limits trigger exponential backoff with jitter, context overflow triggers reactive compaction, and unrecoverable errors surface to the user with a clear message. This separation is critical: tool failures are learning signals for the LLM, while API failures require harness-level intervention.
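
A sketch of the two paths; the retry parameters and helper predicates are assumptions:

typescript
// Tool errors become conversation content; API errors get harness handling.
async function safeRunTool(call: ToolCall): Promise<ToolResultBlock> {
  try {
    return toolResult(call.id, await call.execute());
  } catch (err) {
    // Not a crash: the LLM reads this next turn and self-corrects
    return toolResult(call.id, String(err), { is_error: true });
  }
}

async function callApiWithRetry(req: ApiRequest, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callApi(req);
    } catch (err) {
      if (isRateLimit(err) && attempt < maxRetries) {
        // Exponential backoff with jitter
        const base = Math.min(30_000, 1_000 * 2 ** attempt);
        await sleep(base * (0.5 + Math.random()));
      } else if (isContextOverflow(err)) {
        req.messages = await compact(req.messages); // reactive compaction, then retry
      } else {
        throw err; // unrecoverable: surface to the user
      }
    }
  }
}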

Streaming and Turn Boundaries

The harness streams text to the user while simultaneously accumulating tool call blocks. The Anthropic Streaming API returns events in order: content_block_start, content_block_delta, message_stop. Text deltas are forwarded to the terminal in real time so the user sees partial output immediately. Tool call blocks are buffered until message_stop — only then does the harness know the complete set of tool calls for this turn and can apply the read/write partitioning. This means the user sees Claude reasoning ("I'll read auth.ts first") before the tool executes, giving a sense of intent even before any side effects occur.
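
A sketch of the accumulation pattern: the event names match the Anthropic streaming API, while the handler structure is an assumption (tool input arrives via input_json_delta events, omitted here for brevity):

typescript
// Render text deltas live; buffer tool_use blocks until message_stop.
const toolBlocks: ToolUseBlock[] = [];

for await (const event of stream) {
  switch (event.type) {
    case "content_block_start":
      if (event.content_block.type === "tool_use") {
        toolBlocks.push(event.content_block); // buffer, do not execute yet
      }
      break;
    case "content_block_delta":
      if (event.delta.type === "text_delta") {
        process.stdout.write(event.delta.text); // user sees partial text now
      }
      break;
    case "message_stop":
      // Turn boundary: only now is the full set of tool calls known,
      // so read/write partitioning happens after this event
      break;
  }
}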

Quick Check

Why does Claude Code partition tool calls into read-only (parallel) and write (serial) batches?

📐

Key Code Patterns

The Core Loop (TypeScript pseudocode)

typescript
async function* queryLoop(
  messages: Message[],
  tools: Tool[],
  systemPrompt: string
): AsyncGenerator<TextBlock[]> {
  while (true) {
    // 1. Auto-compact if context too long
    if (tokenCount(messages) > threshold) {
      messages = compact(messages);
    }

    // 2. Call LLM
    const response = await callApi(messages, systemPrompt, tools);

    // 3. Parse response
    const { textBlocks, toolBlocks } = parse(response);
    yield textBlocks; // stream to user

    // 4. Transition decision
    if (toolBlocks.length === 0) {
      return; // done!
    }

    // 5. Execute tools
    const results = await runTools(toolBlocks);
    messages.push(assistantMsg(response));
    messages.push(toolResults(results));
    // loop back to step 1
  }
}

Tool Partitioning

typescript
function partitionToolCalls(toolCalls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let currentBatch: ToolCall[] = [];
  let currentIsReadonly: boolean | null = null;

  for (const call of toolCalls) {
    const isReadonly = call.tool.isReadOnly();
    if (currentIsReadonly === null) {
      currentIsReadonly = isReadonly;
    }

    if (isReadonly === currentIsReadonly && isReadonly) {
      currentBatch.push(call); // group read-only together
    } else {
      if (currentBatch.length > 0) batches.push(currentBatch);
      currentBatch = [call];
      currentIsReadonly = isReadonly;
    }
  }

  if (currentBatch.length > 0) batches.push(currentBatch);
  return batches;
}

Permission Check (6-layer hierarchy)

typescript
function checkPermission(
  tool: Tool,
  input: unknown,
  context: Context
): "allow" | "deny" | "ask" {
  // Layer 1: PreToolUse hooks (exit 2 = blocking deny; exit 0 = pass
  // through to the rule layers below, not an automatic allow)
  if (runPreHooks(tool, input) === EXIT_DENY) return "deny";

  // Layer 2: Deny rules (absolute blocks — evaluated first)
  if (matchesDenyRules(tool, input)) return "deny";

  // Layer 3: Ask rules (require user confirmation for matched patterns)
  if (matchesAskRules(tool, input)) return "ask";

  // Layer 4: Allow rules (pre-approved patterns)
  if (matchesAllowRules(tool, input)) return "allow";

  // Layer 5: Permission mode (the user-chosen level, carried in context)
  if (context.mode === "bypassPermissions") return "allow";
  if (context.mode === "dontAsk") return "allow";
  if (context.mode === "plan" && tool.isWriteTool()) return "deny";

  // Layer 6: Ask user (interactive fallback)
  return promptUser(tool, input);
}
🔧

Break It — See What Happens

  • No tool partitioning (everything serial)
  • No context compaction
  • No permission system
📊

Real-World Numbers

Metric | Value
System prompt sections | ~15 sections (~8K tokens static + ~2K dynamic)
Tool registry | ~30 built-in tools + MCP tools
Auto-compact threshold | 13,000 token buffer before context limit
Max concurrent tools | 10 parallel read-only per batch
Sub-agent tool subset | No Agent tool (prevents recursion)
Startup time | claude --version ~5ms (lazy loading), full REPL ~800ms
Session persistence | ~/.claude/projects/<sanitized-cwd>/<id>.jsonl
✨ Insight · The 13,000 token buffer for auto-compaction is carefully chosen: it leaves enough room for one more API call (system prompt + response) while triggering early enough that the compaction summary itself fits within the remaining space. Too small a buffer and the compaction itself can fail.
🧠

Key Takeaways

What to remember for interviews

  1. An agent is a while(true) loop: prompt → LLM → tool_use → execute → repeat. The harness is everything around that loop: permissions, context management, tool orchestration, and compaction.
  2. The system prompt is assembled dynamically from ~15 functions per call, split at SYSTEM_PROMPT_DYNAMIC_BOUNDARY so the static portion (~8K tokens) gets a cache hit on every request within a session.
  3. Read-only tools (Glob, Grep, Read) run in parallel batches of up to 10; write tools (Bash, Edit) run serially — giving 3–5x throughput improvement for exploration-heavy tasks without race conditions.
  4. Context compaction is non-optional: auto-compact at ~80% usage, reactive compact on overflow, microcompact for large tool results. Without it, agents fail after ~20 tool calls in complex tasks.
  5. Tool errors (file not found, non-zero exit) feed back into the conversation as tool_result with is_error: true so the LLM self-corrects; API errors (rate limits, context overflow) require harness-level intervention.

🎯

Interview Questions


Design an agentic loop for a coding assistant. What are the key components?

★★☆
Google · Anthropic

How would you implement a permission system for AI tool use that balances safety with usability?

★★★
Anthropic · OpenAI

An agent is running out of context window mid-task. What strategies can you use?

★★☆
Google · Meta

Why partition tool calls into read-only parallel and write serial batches? What could go wrong without this?

★★☆
Databricks · Anthropic

Design a sub-agent system. How do you handle context isolation, resource limits, and result aggregation?

★★★
OpenAI · Anthropic