⚙️ Agent Harness Architecture
Claude Code runs a while(true) loop — here's what's inside
Every time you type a message in Claude Code, a while(true) loop decides whether to respond or call another tool. That loop — and the systems around it — is the agent harness.
- An agent is just a loop: prompt → LLM → tool_use → execute → repeat
- The harness is everything around that loop: permissions, context management, tool orchestration, compaction
- This is not theory — this is how production agents (Claude Code, Cursor, Devin) actually work
The Agent Loop — Interactive Diagram
What you are seeing
The complete lifecycle of a single agent turn: user message enters the loop, the system prompt is assembled, the LLM generates a response, and tool calls are executed before looping back. The permission system gates every tool call, and context compaction kicks in when the conversation grows too long.
What to try
Follow the flow from User Message through each stage. Notice how the loop branches: if the LLM returns tool_use blocks, tools execute and results feed back into the next API call. If the response is text only, the loop exits and renders to the user.
The Intuition
The Problem: LLMs Are Stateless Autocomplete
Without the loop, an LLM is just autocomplete — it can't act on its answers. You ask "fix the bug" and it tells you what to change, but it can't actually open the file, read the code, make the edit, and verify the fix. The agentic harness adds that missing action layer.
Worked Example: "Fix the bug"
| Step | What happens | Output |
|---|---|---|
| 1 | User: "fix the bug" | System prompt assembled + API call |
| 2 | Claude responds with tool_use | "I'll read auth.ts" → Read("auth.ts") |
| 3 | Harness executes Read tool | File contents appended to conversation |
| 4 | API called again with tool result | Claude: "Found it, editing line 42" → Edit(...) |
| 5 | Harness executes Edit tool | File written, result appended |
| 6 | API called again | "Done!" — no tool_use → loop ends |
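The message array the harness accumulates through step 3 can be sketched as follows. The content-block shapes mirror the public Anthropic Messages API; the `toolu_01` id and the exact field values are illustrative:

```typescript
// Minimal message/content-block types, mirroring the Messages API shape.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Conversation state after step 3 of the worked example:
const messages: Message[] = [
  { role: "user", content: [{ type: "text", text: "fix the bug" }] },
  {
    role: "assistant",
    content: [
      { type: "text", text: "I'll read auth.ts" },
      { type: "tool_use", id: "toolu_01", name: "Read", input: { file_path: "auth.ts" } },
    ],
  },
  {
    // Tool results go back as a *user* message, paired by tool_use_id.
    role: "user",
    content: [
      { type: "tool_result", tool_use_id: "toolu_01", content: "<file contents>" },
    ],
  },
];
```

Note that the tool result is delivered in a user-role message: from the model's perspective, tool outputs are just more input.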
The Agentic Loop
The core cycle: user message → system prompt assembly → API call → parse response. If the response contains tool_use blocks, the harness executes each tool, appends the results to the conversation, and calls the API again. If the response is text only, the turn is done — render to the user. This is what makes it “agentic”: it does not just respond, it acts and observes.
System Prompt Assembly
The system prompt is not a static string — it is assembled dynamically from ~15 functions at each API call. It is split by a SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker for API cache optimization:
- Static sections (behavior rules, tool docs) — cached across requests, saving latency and cost
- Dynamic sections (git status, available skills, current date) — change per session, placed after the boundary
- CLAUDE.md files — injected as user context, not system prompt, to preserve the cache hit rate
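A minimal sketch of boundary-split assembly. The section builders and their contents are invented for illustration; only the SYSTEM_PROMPT_DYNAMIC_BOUNDARY idea comes from the text above:

```typescript
// Hypothetical section builders; the real harness has ~15 of these.
type SectionFn = () => string;

const STATIC_SECTIONS: SectionFn[] = [
  () => "# Behavior rules\nBe concise. Prefer tools over guessing.",
  () => "# Tool docs\nRead(file_path), Edit(file_path, old, new), Bash(cmd)...",
];

const DYNAMIC_SECTIONS: SectionFn[] = [
  () => `# Current date\n${new Date().toISOString().slice(0, 10)}`,
  () => "# Git status\nbranch: main, clean",
];

const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<!-- dynamic-boundary -->";

// The static half is byte-identical across calls, so the API prompt cache
// can reuse it; everything after the boundary is recomputed per call.
function assembleSystemPrompt(): string {
  const staticPart = STATIC_SECTIONS.map((f) => f()).join("\n\n");
  const dynamicPart = DYNAMIC_SECTIONS.map((f) => f()).join("\n\n");
  return [staticPart, SYSTEM_PROMPT_DYNAMIC_BOUNDARY, dynamicPart].join("\n\n");
}
```

The ordering matters: anything that changes per call must come after the boundary, or it invalidates the cached prefix.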
Tool Orchestration
When the LLM returns multiple tool calls, the harness partitions them into batches. Read-only tools (Glob, Grep, Read) run in parallel; write tools (Bash, Edit) run serially:
// Input: [Grep, Glob, Read, Bash, Grep]
Batch 1: [Grep, Glob, Read] → concurrent (3 parallel)
Batch 2: [Bash] → serial (write tool)
Batch 3: [Grep] → concurrent (1 tool, but could be more)
Up to 10 concurrent tool calls per batch. This maximizes throughput for exploration-heavy tasks while preventing race conditions on writes.
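The execution side of this rule can be sketched as follows, assuming the calls have already been partitioned into read-only and write batches (the ToolCall shape and runBatch helper are hypothetical):

```typescript
const MAX_CONCURRENT = 10; // concurrency cap per batch

interface ToolCall {
  name: string;
  run: () => Promise<string>; // executes the tool, returns its output
}

// Run one batch: read-only batches fan out with Promise.all, chunked at
// MAX_CONCURRENT; write batches arrive as single-element batches from the
// partitioner and therefore run serially by construction.
async function runBatch(batch: ToolCall[]): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < batch.length; i += MAX_CONCURRENT) {
    const chunk = batch.slice(i, i + MAX_CONCURRENT);
    results.push(...(await Promise.all(chunk.map((c) => c.run()))));
  }
  return results;
}
```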
Permission System (Layered Hierarchy)
Every tool call passes through a layered permission check, evaluated top to bottom in deny → ask → allow order:
- PreToolUse hooks — shell scripts: exit 0 = pass to the next layer (subsequent deny/allow rules and prompts still run), exit 2 = blocking deny. Organizations can enforce custom policies
- Deny rules — absolute blocks configured in settings.json (e.g., never allow rm -rf)
- Ask rules — require user confirmation for specific matched patterns (first match wins)
- Allow rules — pattern matching for pre-approved operations (e.g., Bash(npm test) always allowed)
- Permission mode — user-chosen level: default, acceptEdits, plan, dontAsk, or bypassPermissions
- Interactive prompt — ask the user (last resort)
Sub-agents
When a task is complex, the agent can spawn sub-agents — fresh QueryEngine instances with clean context:
- Context isolation — not polluted by the parent's 100K+ token conversation history
- Background execution — parent continues working while sub-agent runs
- Worktree isolation — git worktree gives each sub-agent a separate checkout, preventing file conflicts
- Tool subset — no recursive Agent tool by default (prevents fork bombs)
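The spawning step can be sketched roughly as below. AgentConfig, spawnSubAgent, and the "Agent" tool-name filter are illustrative, not the actual implementation:

```typescript
interface Message {
  role: "user" | "assistant";
  content: { type: "text"; text: string }[];
}

interface AgentConfig {
  messages: Message[]; // the agent's private conversation
  tools: string[];     // names of tools it may call
}

// Hypothetical sketch: the child starts with a fresh conversation
// (context isolation) and a tool list with the recursive "Agent" tool
// removed (prevents fork bombs). The task prompt is the only state the
// child inherits from the parent.
function spawnSubAgent(parent: AgentConfig, taskPrompt: string): AgentConfig {
  return {
    messages: [{ role: "user", content: [{ type: "text", text: taskPrompt }] }],
    tools: parent.tools.filter((name) => name !== "Agent"),
  };
}
```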
Context Compaction
As conversations grow, the harness uses four strategies to stay within the context window:
- Auto-compact — proactive at ~80% context usage. Summarizes the conversation while preserving current task state
- Reactive compact — emergency fallback when the API returns "prompt too long"
- Microcompact — summarize individual large tool results inline (e.g., a 500-line file read becomes a 10-line summary)
- Context collapse — structurally remove old system reminders that are no longer relevant
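The auto-compact trigger and microcompact strategies can be sketched as below. The 200K window, 80% ratio, chars/4 token estimate, and head+tail truncation are stand-in assumptions, not the real values or algorithm:

```typescript
// Assumed values, not the real thresholds.
const CONTEXT_LIMIT = 200_000;
const AUTO_COMPACT_RATIO = 0.8;

// Crude chars/4 estimate standing in for a real tokenizer.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Auto-compact trigger: proactive, before the window overflows.
function shouldAutoCompact(conversationTokens: number): boolean {
  return conversationTokens > CONTEXT_LIMIT * AUTO_COMPACT_RATIO;
}

// Microcompact: shrink one oversized tool result in place. A real
// implementation would summarize with the LLM; head+tail truncation
// is a cheap stand-in.
function maybeMicrocompact(toolResult: string, maxTokens = 500): string {
  if (countTokens(toolResult) <= maxTokens) return toolResult;
  const keepChars = (maxTokens * 4) / 2;
  return (
    toolResult.slice(0, keepChars) +
    "\n...[truncated]...\n" +
    toolResult.slice(-keepChars)
  );
}
```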
Error Handling and Retry Logic
The harness distinguishes between two failure modes. A tool error (e.g., file not found, command exits non-zero) is not a crash — the result is appended to the conversation as a tool_result with is_error: true and the loop continues. The LLM reads the error and self-corrects — it can try a different path, fix its input, or ask the user. An API error (rate limit, network timeout, context overflow) requires different handling: rate limits trigger exponential backoff with jitter, context overflow triggers reactive compaction, and unrecoverable errors surface to the user with a clear message. This separation is critical: tool failures are learning signals for the LLM, while API failures require harness-level intervention.
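One way to sketch the two failure paths, with an invented ApiError type and assumed retry parameters:

```typescript
// Illustrative error type; not from any real SDK.
class ApiError extends Error {
  constructor(message: string, public readonly retryable: boolean) {
    super(message);
  }
}

// API-error path: exponential backoff with full jitter for retryable
// errors (rate limits, timeouts); unrecoverable errors propagate up.
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof ApiError) || !err.retryable || attempt + 1 >= maxAttempts) {
        throw err; // surface to the user
      }
      const delay = Math.random() * baseMs * 2 ** attempt; // full jitter
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}

// Tool-error path: never retried by the harness; the failure becomes a
// tool_result block the LLM reads and reacts to on the next turn.
function toolResultFromError(toolUseId: string, err: Error) {
  return { type: "tool_result", tool_use_id: toolUseId, content: err.message, is_error: true };
}
```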
Streaming and Turn Boundaries
The harness streams text to the user while simultaneously accumulating tool call blocks. The Anthropic Streaming API returns events in order: content_block_start, content_block_delta, message_stop. Text deltas are forwarded to the terminal in real time so the user sees partial output immediately. Tool call blocks are buffered until message_stop — only then does the harness know the complete set of tool calls for this turn and can apply the read/write partitioning. This means the user sees Claude reasoning ("I'll read auth.ts first") before the tool executes, giving a sense of intent even before any side effects occur.
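A simplified sketch of this buffering logic. The event names follow the streaming API, but the payload shapes are flattened here (real deltas distinguish text_delta from input_json_delta, among other details):

```typescript
type StreamEvent =
  | { type: "content_block_start"; block: { type: "text" } | { type: "tool_use"; name: string } }
  | { type: "content_block_delta"; text: string }
  | { type: "message_stop" };

interface BufferedToolCall {
  name: string;
  inputJson: string; // accumulated JSON arguments
}

// Forward text deltas immediately; buffer tool_use blocks until
// message_stop, when the turn's full tool set is known.
function consumeStream(
  events: StreamEvent[],
  print: (s: string) => void,
): BufferedToolCall[] {
  const toolCalls: BufferedToolCall[] = [];
  let inTool = false;
  for (const ev of events) {
    if (ev.type === "content_block_start") {
      inTool = ev.block.type === "tool_use";
      if (ev.block.type === "tool_use") toolCalls.push({ name: ev.block.name, inputJson: "" });
    } else if (ev.type === "content_block_delta") {
      if (inTool) toolCalls[toolCalls.length - 1].inputJson += ev.text; // buffered
      else print(ev.text); // user sees this live
    } else if (ev.type === "message_stop") {
      break; // only now is the tool set complete
    }
  }
  return toolCalls;
}
```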
Key Code Patterns
The Core Loop (TypeScript pseudocode)
async function* queryLoop(
  messages: Message[],
  tools: Tool[],
  systemPrompt: string
): AsyncGenerator<TextBlock[]> {
  while (true) {
    // 1. Auto-compact if context too long
    if (tokenCount(messages) > threshold) {
      messages = compact(messages);
    }

    // 2. Call LLM
    const response = await callApi(messages, systemPrompt, tools);

    // 3. Parse response
    const { textBlocks, toolBlocks } = parse(response);
    yield textBlocks; // stream to user

    // 4. Transition decision
    if (toolBlocks.length === 0) {
      return; // done!
    }

    // 5. Execute tools
    const results = await runTools(toolBlocks);
    messages.push(assistantMsg(response));
    messages.push(toolResults(results));
    // loop back to step 1
  }
}

Tool Partitioning
function partitionToolCalls(toolCalls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let currentBatch: ToolCall[] = [];
  let currentIsReadonly: boolean | null = null;

  for (const call of toolCalls) {
    const isReadonly = call.tool.isReadOnly();
    if (currentIsReadonly === null) {
      currentIsReadonly = isReadonly;
    }
    if (isReadonly === currentIsReadonly && isReadonly) {
      currentBatch.push(call); // group read-only together
    } else {
      // write tool, or readonly/write transition: close the current batch
      if (currentBatch.length > 0) batches.push(currentBatch);
      currentBatch = [call];
      currentIsReadonly = isReadonly;
    }
  }
  if (currentBatch.length > 0) batches.push(currentBatch);
  return batches;
}

Permission Check (layered hierarchy)
function checkPermission(
  tool: Tool,
  input: unknown,
  context: Context
): "allow" | "deny" | "ask" {
  const { mode } = context;

  // Layer 1: PreToolUse hooks — exit 2 is a blocking deny; exit 0 just
  // passes through (subsequent layers still run)
  const hookResult = runPreHooks(tool, input);
  if (hookResult === EXIT_DENY) return "deny";

  // Layer 2: Deny rules (absolute blocks — evaluated first)
  if (matchesDenyRules(tool, input)) return "deny";

  // Layer 3: Ask rules (require user confirmation for matched patterns)
  if (matchesAskRules(tool, input)) return "ask";

  // Layer 4: Allow rules (pre-approved patterns)
  if (matchesAllowRules(tool, input)) return "allow";

  // Layer 5: Permission mode
  if (mode === "bypassPermissions") return "allow";
  if (mode === "dontAsk") return "allow";
  if (mode === "plan" && tool.isWriteTool()) return "deny";

  // Layer 6: no rule matched — fall back to the interactive prompt
  return "ask";
}
Real-World Numbers
| Metric | Value |
|---|---|
| System prompt sections | ~15 sections (~8K tokens static + ~2K dynamic) |
| Tool registry | ~30 built-in tools + MCP tools |
| Auto-compact threshold | 13,000 token buffer before context limit |
| Max concurrent tools | 10 parallel read-only per batch |
| Sub-agent tool subset | No Agent tool (prevents recursion) |
| Startup time | claude --version ~5ms (lazy loading), full REPL ~800ms |
| Session persistence | ~/.claude/projects/<sanitized-cwd>/<id>.jsonl |
Key Takeaways
What to remember for interviews
1. An agent is a while(true) loop: prompt → LLM → tool_use → execute → repeat. The harness is everything around that loop: permissions, context management, tool orchestration, and compaction.
2. The system prompt is assembled dynamically from ~15 functions per call, split at SYSTEM_PROMPT_DYNAMIC_BOUNDARY so the static portion (~8K tokens) gets a cache hit on every request within a session.
3. Read-only tools (Glob, Grep, Read) run in parallel batches of up to 10; write tools (Bash, Edit) run serially — giving a 3–5x throughput improvement for exploration-heavy tasks without race conditions.
4. Context compaction is non-optional: auto-compact at ~80% usage, reactive compact on overflow, microcompact for large tool results. Without it, agents fail after ~20 tool calls in complex tasks.
5. Tool errors (file not found, non-zero exit) feed back into the conversation as tool_result with is_error: true so the LLM self-corrects; API errors (rate limits, context overflow) require harness-level intervention.
Further Reading
- Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al., 2023 — training LLMs to decide when and how to call external tools.
- ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al., 2022 — interleaving reasoning traces and actions for grounded decision-making.
- Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al., 2023 — agents that reflect on failures and improve across episodes.
- Claude Code (source) — Open-source reference for a production agent harness — the architecture this module describes.
- Model Context Protocol Specification — The open standard for connecting AI agents to external tools and data sources.
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Yang et al., 2024 — production agent harness design lessons from solving real GitHub issues; ACI (agent-computer interface) design principles.
- LangChain AgentExecutor Architecture — The widely-adopted reference implementation of the tool-call loop — useful comparison to Claude Code's harness design.
- Karpathy: Software 2.0 — Andrej Karpathy's essay on neural networks as programmable software — the conceptual foundation for agent harnesses replacing hand-coded pipelines.
Interview Questions
- ★★☆ Design an agentic loop for a coding assistant. What are the key components?
- ★★★ How would you implement a permission system for AI tool use that balances safety with usability?
- ★★☆ An agent is running out of context window mid-task. What strategies can you use?
- ★★☆ Why partition tool calls into read-only parallel and write serial batches? What could go wrong without this?
- ★★★ Design a sub-agent system. How do you handle context isolation, resource limits, and result aggregation?