
Module 46 · AI Engineering

⚙️ Agent Harness Architecture

Claude Code runs a while(true) loop — here's what's inside

Every time you type a message in Claude Code, a while(true) loop decides whether to respond or call another tool. That loop — and the systems around it — is the agent harness.

  • An agent is just a loop: prompt → LLM → tool_use → execute → repeat
  • The harness is everything around that loop: permissions, context management, tool orchestration, compaction
  • This is not theory — this is how production agents (Claude Code, Cursor, Devin) actually work
🎮

The Agent Loop — Interactive Diagram

What you are seeing

The complete lifecycle of a single agent turn: user message enters the loop, the system prompt is assembled, the LLM generates a response, and tool calls are executed before looping back. The permission system gates every tool call, and context compaction kicks in when the conversation grows too long.

What to try

Follow the flow from User Message through each stage. Notice how the loop branches: if the LLM returns tool_use blocks, tools execute and results feed back into the next API call. If the response is text only, the loop exits and renders to the user.

[Interactive diagram: Claude Code Agentic Loop — click any step to learn more. Main loop: (1) User Input → (2) Input Router (/help, /clear handled locally; !cmd → shell; text → API) → (3) QueryEngine → (4) System Prompt Assembly (static/cacheable + dynamic/per-turn) → (5) API Call (POST /v1/messages, streamed) → (6) Response Parse (text → render to user; tool_use → orchestrator) → (7) Tool Orchestration (read-only → parallel batch; write → serial queue) → (8) Permission Check (hooks → rules → mode → prompt; deny/allow) → (9) Tool Execution (built-in: Read, Edit, Bash; MCP: external servers; Skills: plugin bundles) → (10) Loop Back (tool_result → message list). Side systems: Context Compaction (auto | reactive | micro), Sub-agents (fresh QueryEngine each), Extensions (skills | plugins | MCP).]
💡

The Intuition

The Problem: LLMs Are Stateless Autocomplete

Without the loop, an LLM is just autocomplete — it can't act on its answers. You ask "fix the bug" and it tells you what to change, but it can't actually open the file, read the code, make the edit, and verify the fix. The agentic harness adds that missing action layer.

Worked Example: "Fix the bug"

Step | What happens | Output
1 | User: "fix the bug" | System prompt assembled + API call
2 | Claude responds with tool_use | "I'll read auth.ts" → Read("auth.ts")
3 | Harness executes Read tool | File contents appended to conversation
4 | API called again with tool result | Claude: "Found it, editing line 42" → Edit(...)
5 | Harness executes Edit tool | File written, result appended
6 | API called again | "Done!" — no tool_use → loop ends

The Agentic Loop

The core cycle: user message → system prompt assembly → API call → parse response. If the response contains tool_use blocks, the harness executes each tool, appends the results to the conversation, and calls the API again. If the response is text only, the turn is done — render to the user. This is what makes it “agentic”: it does not just respond, it acts and observes.

System Prompt Assembly

The system prompt is not a static string — it is assembled dynamically from ~15 functions at each API call. It is split by a SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker for API cache optimization:

  • Static sections (behavior rules, tool docs) — cached across requests, saving latency and cost
  • Dynamic sections (git status, available skills, current date) — change per session, placed after the boundary
  • CLAUDE.md files — injected as user context, not system prompt, to preserve the cache hit rate
💡 Tip · This cache split is a significant optimization. The static portion (~8K tokens) gets a cache hit on every request within a session, while only the dynamic portion (~2K tokens) needs reprocessing. For agents that make 50+ API calls per task, this saves substantial time-to-first-token latency.
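
To make the split concrete, here is a minimal sketch; the section functions are hypothetical stand-ins, and only the cache_control block shape follows the real Anthropic Messages API:

typescript
// Illustrative sketch: the section functions are stand-ins, not Claude Code's.
const staticSections = [
  () => "Behavior rules: ...", // stable for the whole session
  () => "Tool docs: ...",
];
const dynamicSections = [
  () => `Current date: ${new Date().toISOString()}`, // changes per turn
  () => "git status: ...",
];

// Builds the `system` parameter for the Messages API: two text blocks, with
// cache_control on the static one so its prefix is cached across calls.
function buildSystemParam() {
  return [
    {
      type: "text" as const,
      text: staticSections.map((s) => s()).join("\n\n"),
      cache_control: { type: "ephemeral" as const },
    },
    {
      type: "text" as const,
      text: dynamicSections.map((s) => s()).join("\n\n"),
    },
  ];
}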

Tool Orchestration

When the LLM returns multiple tool calls, the harness partitions them into batches. Read-only tools (Glob, Grep, Read) run in parallel; write tools (Bash, Edit) run serially:

// Input: [Grep, Glob, Read, Bash, Grep]
Batch 1: [Grep, Glob, Read] → concurrent (3 parallel)
Batch 2: [Bash] → serial (write tool)
Batch 3: [Grep] → concurrent (1 tool, but could be more)

Up to 10 concurrent tool calls per batch. This maximizes throughput for exploration-heavy tasks while preventing race conditions on writes.
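
A hedged sketch of how a batch might then execute: concurrent (and capped) for read-only tools, strictly serial for writes. The ToolCall shape here is an assumption:

typescript
// Illustrative executor, not Claude Code's actual code.
interface ToolCall {
  isReadOnly: boolean;
  execute(): Promise<string>;
}

const MAX_CONCURRENT = 10; // cap on parallel read-only calls per batch

async function runBatch(batch: ToolCall[]): Promise<string[]> {
  const results: string[] = [];
  if (batch.every((c) => c.isReadOnly)) {
    // Read-only: run in chunks so at most MAX_CONCURRENT are in flight
    for (let i = 0; i < batch.length; i += MAX_CONCURRENT) {
      const chunk = batch.slice(i, i + MAX_CONCURRENT);
      results.push(...(await Promise.all(chunk.map((c) => c.execute()))));
    }
  } else {
    // Writes: strictly serial, so edits and shell commands cannot race
    for (const call of batch) results.push(await call.execute());
  }
  return results;
}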

Permission System (Layered Hierarchy)

Every tool call passes through a layered permission check, evaluated top to bottom; within the rule layers, deny rules are checked before ask rules, which are checked before allow rules:

  1. PreToolUse hooks — shell scripts: exit 0 = pass to the next layer (subsequent deny/allow rules and prompts still run), exit 2 = blocking deny. Organizations can enforce custom policies
  2. Deny rules — absolute blocks configured in settings.json (e.g., never allow rm -rf)
  3. Ask rules — require user confirmation for specific matched patterns (first match wins)
  4. Allow rules — pattern matching for pre-approved operations (e.g., Bash(npm test) always allowed)
  5. Permission mode — user-chosen level: default, acceptEdits, plan, dontAsk, or bypassPermissions
  6. Interactive prompt — ask the user (last resort)

Sub-agents

When a task is complex, the agent can spawn sub-agents — fresh QueryEngine instances with clean context:

  • Context isolation — not polluted by the parent's 100K+ token conversation history
  • Background execution — parent continues working while sub-agent runs
  • Worktree isolation — git worktree gives each sub-agent a separate checkout, preventing file conflicts
  • Tool subset — no recursive Agent tool by default (prevents fork bombs)
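
A minimal sketch of the spawn path under those constraints; the QueryEngine constructor and run signature are assumptions in the same pseudocode register as the Key Code Patterns below, not actual internals:

typescript
// Hypothetical sketch: fresh engine, clean history, restricted tool set.
async function spawnSubAgent(task: string, parentTools: Tool[]): Promise<string> {
  // Tool subset: strip the Agent tool so sub-agents cannot spawn recursively
  const tools = parentTools.filter((t) => t.name !== "Agent");

  // Fresh QueryEngine: starts from an empty message list, so the parent's
  // 100K+ token history never enters the sub-agent's context
  const engine = new QueryEngine({ tools, messages: [] });

  // Runs to completion independently; only the final text is aggregated
  // back into the parent's conversation as a single tool_result
  const result = await engine.run(task);
  return result.finalText;
}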

Context Compaction

As conversations grow, the harness uses four strategies to stay within the context window:

  • Auto-compact — proactive at ~80% context usage. Summarizes the conversation while preserving current task state
  • Reactive compact — emergency fallback when the API returns "prompt too long"
  • Microcompact — summarize individual large tool results inline (e.g., a 500-line file read becomes a 10-line summary)
  • Context collapse — structurally remove old system reminders that are no longer relevant
✨ Insight · Compaction is not optional — without it, agents hit the context limit after ~20 tool calls in a complex task. The key challenge is preserving enough context that the agent does not lose track of what it was doing, while freeing enough space to continue working.
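
A hedged sketch of the auto-compact decision; the window size, keep-count, and helper functions are illustrative assumptions:

typescript
// Illustrative policy, not the real implementation.
const CONTEXT_LIMIT = 200_000; // model context window in tokens (assumed)
const COMPACT_BUFFER = 13_000; // trigger early enough that the summary fits
const KEEP_RECENT = 10;        // most recent messages kept verbatim

async function maybeCompact(messages: Message[]): Promise<Message[]> {
  if (tokenCount(messages) < CONTEXT_LIMIT - COMPACT_BUFFER) return messages;

  // Auto-compact: summarize older turns, preserve current task state verbatim
  const older = messages.slice(0, -KEEP_RECENT);
  const summary = await summarizeWithLLM(older); // one extra API call
  return [summaryMessage(summary), ...messages.slice(-KEEP_RECENT)];
}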

Error Handling and Retry Logic

The harness distinguishes between two failure modes. A tool error (e.g., file not found, command exits non-zero) is not a crash — the result is appended to the conversation as a tool_result with is_error: true and the loop continues. The LLM reads the error and self-corrects — it can try a different path, fix its input, or ask the user. An API error (rate limit, network timeout, context overflow) requires different handling: rate limits trigger exponential backoff with jitter, context overflow triggers reactive compaction, and unrecoverable errors surface to the user with a clear message. This separation is critical: tool failures are learning signals for the LLM, while API failures require harness-level intervention.
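
A sketch of the two paths; the retry parameters and helper predicates are assumptions:

typescript
// Tool errors become conversation content; API errors get harness handling.
async function safeRunTool(call: ToolCall): Promise<ToolResultBlock> {
  try {
    return toolResult(call.id, await call.execute());
  } catch (err) {
    // Not a crash: the LLM reads this next turn and self-corrects
    return toolResult(call.id, String(err), { is_error: true });
  }
}

async function callApiWithRetry(req: ApiRequest, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callApi(req);
    } catch (err) {
      if (isRateLimit(err) && attempt < maxRetries) {
        // Exponential backoff with jitter
        const base = Math.min(30_000, 1_000 * 2 ** attempt);
        await sleep(base * (0.5 + Math.random()));
      } else if (isContextOverflow(err)) {
        req.messages = await compact(req.messages); // reactive compaction, then retry
      } else {
        throw err; // unrecoverable: surface to the user
      }
    }
  }
}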

Streaming and Turn Boundaries

The harness streams text to the user while simultaneously accumulating tool call blocks. The Anthropic Streaming API returns events in order: content_block_start, content_block_delta, message_stop. Text deltas are forwarded to the terminal in real time so the user sees partial output immediately. Tool call blocks are buffered until message_stop — only then does the harness know the complete set of tool calls for this turn and can apply the read/write partitioning. This means the user sees Claude reasoning ("I'll read auth.ts first") before the tool executes, giving a sense of intent even before any side effects occur.
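
A sketch of the accumulation pattern: the event names match the Anthropic streaming API, while the handler structure is an assumption (tool input arrives via input_json_delta events, omitted here for brevity):

typescript
// Render text deltas live; buffer tool_use blocks until message_stop.
const toolBlocks: ToolUseBlock[] = [];

for await (const event of stream) {
  switch (event.type) {
    case "content_block_start":
      if (event.content_block.type === "tool_use") {
        toolBlocks.push(event.content_block); // buffer, do not execute yet
      }
      break;
    case "content_block_delta":
      if (event.delta.type === "text_delta") {
        process.stdout.write(event.delta.text); // user sees partial text now
      }
      break;
    case "message_stop":
      // Turn boundary: only now is the full set of tool calls known,
      // so read/write partitioning happens after this event
      break;
  }
}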

Quick Check

Why does Claude Code partition tool calls into read-only (parallel) and write (serial) batches?

📐

Key Code Patterns

The Core Loop (TypeScript pseudocode)

typescript
async function* queryLoop(
  messages: Message[],
  tools: Tool[],
  systemPrompt: string
): AsyncGenerator<TextBlock[]> {
  while (true) {
    // 1. Auto-compact if context too long
    if (tokenCount(messages) > threshold) {
      messages = compact(messages);
    }

    // 2. Call LLM
    const response = await callApi(messages, systemPrompt, tools);

    // 3. Parse response
    const { textBlocks, toolBlocks } = parse(response);
    yield textBlocks; // stream to user

    // 4. Transition decision
    if (toolBlocks.length === 0) {
      return; // done!
    }

    // 5. Execute tools
    const results = await runTools(toolBlocks);
    messages.push(assistantMsg(response));
    messages.push(toolResults(results));
    // loop back to step 1
  }
}

Tool Partitioning

typescript
function partitionToolCalls(toolCalls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let currentBatch: ToolCall[] = [];
  let currentIsReadonly: boolean | null = null;

  for (const call of toolCalls) {
    const isReadonly = call.tool.isReadOnly();
    if (currentIsReadonly === null) {
      currentIsReadonly = isReadonly;
    }

    if (isReadonly === currentIsReadonly && isReadonly) {
      currentBatch.push(call); // group read-only together
    } else {
      if (currentBatch.length > 0) batches.push(currentBatch);
      currentBatch = [call];
      currentIsReadonly = isReadonly;
    }
  }

  if (currentBatch.length > 0) batches.push(currentBatch);
  return batches;
}

Permission Check (6-layer hierarchy)

typescript
function checkPermission(
  tool: Tool,
  input: unknown,
  context: Context
): "allow" | "deny" | "ask" {
  // Layer 1: PreToolUse hooks (exit 2 = blocking deny; exit 0 = pass
  // through to the rule layers below, not an automatic allow)
  if (runPreHooks(tool, input) === EXIT_DENY) return "deny";

  // Layer 2: Deny rules (absolute blocks — evaluated first)
  if (matchesDenyRules(tool, input)) return "deny";

  // Layer 3: Ask rules (require user confirmation for matched patterns)
  if (matchesAskRules(tool, input)) return "ask";

  // Layer 4: Allow rules (pre-approved patterns)
  if (matchesAllowRules(tool, input)) return "allow";

  // Layer 5: Permission mode (the user-chosen level, carried in context)
  if (context.mode === "bypassPermissions") return "allow";
  if (context.mode === "dontAsk") return "allow";
  if (context.mode === "plan" && tool.isWriteTool()) return "deny";

  // Layer 6: Ask user (interactive fallback)
  return promptUser(tool, input);
}
🔧

Break It — See What Happens

  • No tool partitioning (everything serial)
  • No context compaction
  • No permission system
📊

Real-World Numbers

Metric | Value
System prompt sections | ~15 sections (~8K tokens static + ~2K dynamic)
Tool registry | ~30 built-in tools + MCP tools
Auto-compact threshold | 13,000 token buffer before context limit
Max concurrent tools | 10 parallel read-only per batch
Sub-agent tool subset | No Agent tool (prevents recursion)
Startup time | claude --version ~5ms (lazy loading), full REPL ~800ms
Session persistence | ~/.claude/projects/<sanitized-cwd>/<id>.jsonl
✨ Insight · The 13,000 token buffer for auto-compaction is carefully chosen: it leaves enough room for one more API call (system prompt + response) while triggering early enough that the compaction summary itself fits within the remaining space. Too small a buffer and the compaction itself can fail.
🧠

Key Takeaways

What to remember for interviews

  1. An agent is a while(true) loop: prompt → LLM → tool_use → execute → repeat. The harness is everything around that loop: permissions, context management, tool orchestration, and compaction.
  2. The system prompt is assembled dynamically from ~15 functions per call, split at SYSTEM_PROMPT_DYNAMIC_BOUNDARY so the static portion (~8K tokens) gets a cache hit on every request within a session.
  3. Read-only tools (Glob, Grep, Read) run in parallel batches of up to 10; write tools (Bash, Edit) run serially — giving 3–5x throughput improvement for exploration-heavy tasks without race conditions.
  4. Context compaction is non-optional: auto-compact at ~80% usage, reactive compact on overflow, microcompact for large tool results. Without it, agents fail after ~20 tool calls in complex tasks.
  5. Tool errors (file not found, non-zero exit) feed back into the conversation as tool_result with is_error: true so the LLM self-corrects; API errors (rate limits, context overflow) require harness-level intervention.

🎯

Interview Questions


Design an agentic loop for a coding assistant. What are the key components?

★★☆
Google · Anthropic

How would you implement a permission system for AI tool use that balances safety with usability?

★★★
Anthropic · OpenAI

An agent is running out of context window mid-task. What strategies can you use?

★★☆
Google · Meta

Why partition tool calls into read-only parallel and write serial batches? What could go wrong without this?

★★☆
Databricks · Anthropic

Design a sub-agent system. How do you handle context isolation, resource limits, and result aggregation?

★★★
OpenAI · Anthropic