

Module 60 · AI Engineering

🛟 Error Recovery

The API says 'prompt too long' — the agent silently compacts and retries before you notice


An AI agent running a 30-turn task will hit errors: the prompt grows too long, responses get truncated, rate limits kick in, the user presses Ctrl+C. The agent must recover from all of these without user intervention. Behind this is a transition system — each loop iteration classifies the outcome as either a Continue (retry) or Terminal (stop) transition.

  • Continue: tool_use, reactive_compact_retry, max_output_tokens_recovery
  • Terminal: completed, model_error, max_turns, aborted_streaming
  • Each error type maps to a specific recovery strategy — or a clean exit
🎮

Error Recovery State Machine

What you are seeing

The query loop's transition system. Each iteration produces one of seven transitions — three that continue the loop and four that terminate it. The recovery strategies are encoded directly in the loop logic.

What to try

Trace each error scenario: what triggers it, how the agent recovers, and what prevents infinite retry loops.

# Query loop transition system

API call → success → tool_use? → YES → execute tools → LOOP
                                → NO → COMPLETED (terminal)

API call → PromptTooLong → already compacted? → YES → MODEL_ERROR (terminal)
                                              → NO → compact → LOOP

API call → MaxOutputTokens → retries >= 3? → YES → MODEL_ERROR (terminal)
                                           → NO → double limit → LOOP

API call → RateLimit → sleep(retry_after) → LOOP

API call → AbortError → ABORTED (terminal)

💡

The Intuition

What you are seeing

The error decision tree: transient errors fork to retry-with-backoff, permanent errors fork to surface-or-abort.

What to try

Follow what happens when an HTTP 429 arrives mid-stream.

Error Recovery Decision Tree

Error detected → transient?
  → YES (HTTP 5xx, network timeout) → retry with backoff (max 1 retry, then surface) → recovery loop / resume
  → NO (auth, parse, out-of-scope) → recoverable?
      → YES → surface to LLM (tool_result with error message)
      → NO → abort (user alert)

The Stakes

The agent is fixing a complex bug — 15 tool calls deep, 150K tokens of context. The API returns "prompt too long."

Without recovery

Task fails. User restarts from scratch. Loses 15 minutes of work and all accumulated context.

With reactive compaction

Agent silently summarizes old turns, frees ~80K tokens, retries. User never notices.

Reactive Compaction

When the API returns "prompt too long," the agent doesn't fail — it compacts old messages by summarizing them, then retries. A one-shot flag (hasAttemptedReactiveCompact) prevents infinite loops: if compaction doesn't free enough space, the error becomes terminal.

💡 Tip · Compaction preserves recent tool results (expensive to regenerate) but summarizes older assistant text. This keeps the model's working memory intact while freeing space from conversational history.
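The policy in the tip above can be sketched as a small compaction helper. This is a minimal sketch, not the real implementation: the `Message` shape, the `summarize` stand-in (a real version would call the model for a summary), and the keep-recent window size are all illustrative assumptions.

```typescript
// Hypothetical compaction sketch: keep the most recent messages
// (fresh tool results) verbatim and collapse older turns into one
// summary message. `summarize` stands in for an LLM summarization call.
type Message = { role: "user" | "assistant"; content: string };

function summarize(older: Message[]): string {
  // Stand-in: a real implementation would ask the model for a summary.
  return `[summary of ${older.length} earlier messages]`;
}

function compact(messages: Message[], keepRecent = 4): Message[] {
  if (messages.length <= keepRecent) return messages; // nothing to free
  const older = messages.slice(0, -keepRecent);
  const recent = messages.slice(-keepRecent);
  // Older history becomes a single summary; recent turns stay intact.
  return [{ role: "user" as const, content: summarize(older) }, ...recent];
}
```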

Max Output Tokens Recovery

When the model's response is truncated (hit the output token limit), the agent retries with a doubled token limit — up to 3 attempts with escalating limits. This handles the common case where the model generates a long code block that gets cut off mid-function.
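The doubling retry described above can be sketched as a wrapper around the API call. `generate`, `MaxOutputTokensError`, and the retry count here are hypothetical stand-ins, not the SDK's actual names.

```typescript
// Hypothetical truncation-recovery sketch: retry with a doubled output
// limit, up to maxRetries escalations, then rethrow.
class MaxOutputTokensError extends Error {}

async function generateWithRecovery(
  generate: (maxTokens: number) => Promise<string>, // stand-in for an API call
  maxTokens: number,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await generate(maxTokens);
    } catch (err) {
      if (!(err instanceof MaxOutputTokensError) || attempt >= maxRetries) {
        throw err; // not a truncation, or out of retries
      }
      maxTokens *= 2; // escalate the limit and try again
    }
  }
}
```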

Abort Propagation

When the user presses Ctrl+C, an AbortController.abort() propagates through the entire pipeline — cancelling the API request, stopping tool execution, and terminating the stream. This happens in under 100ms, ensuring no wasted tokens or dangling operations.

Error Classification

Every error is classified as transient (retry: rate limits, network timeouts, prompt too long) or permanent (surface: invalid key, content policy violation, budget exhaustion). Getting this wrong is costly: retrying permanent errors wastes tokens, while surfacing transient errors unnecessarily interrupts the user.
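A minimal classification sketch, assuming hypothetical error classes for each failure mode (a real SDK's error types will differ):

```typescript
// Hypothetical error classes standing in for SDK-specific types.
class RateLimitError extends Error {}
class NetworkTimeoutError extends Error {}
class PromptTooLongError extends Error {}
class AuthError extends Error {}
class ContentPolicyError extends Error {}

type ErrorClass = "transient" | "permanent";

function classifyError(err: Error): ErrorClass {
  if (
    err instanceof RateLimitError ||      // backoff and retry
    err instanceof NetworkTimeoutError || // retry
    err instanceof PromptTooLongError     // compact and retry
  ) {
    return "transient";
  }
  return "permanent"; // e.g. AuthError, ContentPolicyError: surface to the user
}
```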

✨ Insight · Rate limit handling is proactive, not just reactive. The agent parses rate-limit headers from every response and slows down before hitting the limit, rather than waiting for a 429 error.

AbortSignal Propagation Through Async Generators

When Ctrl+C fires, AbortController.abort() is called once — but the cancellation must propagate through the entire generator chain. Each async generator must pass the signal down to the next level and check signal.throwIfAborted() at yield points. The SDK's messages.stream() accepts an AbortSignal and propagates it to the underlying fetch() call, which aborts the in-flight HTTP request — no more tokens stream, the connection closes, and generation stops immediately (per MDN AbortController). Tools executing mid-stream are a harder problem: read-only tools (Read, Grep, Glob) are safe to cancel at any point, but Bash commands may have already written files. The recovery strategy is to let in-flight tool executions complete, then surface the abort transition rather than killing them mid-write.
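The yield-point check can be shown with a runnable sketch using the standard `AbortSignal.throwIfAborted()` (available in Node 17.3+); the token stream itself is a toy stand-in for the SDK stream.

```typescript
// Each async generator receives the signal and checks it at yield
// points, so one abort() call unwinds the whole chain as an AbortError.
async function* tokenStream(
  tokens: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const token of tokens) {
    signal.throwIfAborted(); // throws the signal's AbortError if cancelled
    yield token;
  }
}

async function consume(signal: AbortSignal): Promise<string[]> {
  const seen: string[] = [];
  try {
    for await (const t of tokenStream(["a", "b", "c"], signal)) {
      seen.push(t);
    }
  } catch (err) {
    if ((err as Error).name !== "AbortError") throw err; // only swallow aborts
  }
  return seen; // whatever streamed before the abort
}
```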

Circuit Breaker for Multi-Agent Failures

In a coordinator/worker system, a failing downstream service (e.g., a test runner that always times out) can cascade: every worker retries, saturating the API rate limit and burning budget. The circuit breaker pattern (as described by Martin Fowler) adds a third state beyond transient/permanent: a tripped state where the agent stops attempting the failing operation entirely and surfaces the circuit-open error. After a cooldown period, it moves to half-open (try once) and resets to closed on success. For agent systems, the circuit breaker sits at the tool boundary: if Bash fails 3 times in a row with the same error pattern, trip the breaker and report rather than retrying indefinitely.
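A minimal sketch of that tool-boundary breaker; the class shape, threshold, and cooldown values are illustrative rather than taken from any specific library.

```typescript
// closed → (threshold failures) → open → (cooldown) → half-open,
// then back to closed on success or open on a failed probe.
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 3,       // consecutive failures before tripping
    private cooldownMs = 30_000, // wait before a half-open probe
  ) {}

  canAttempt(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow exactly one probe attempt
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure(now: number): void {
    this.failures++;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open"; // trip (or re-trip after a failed probe)
      this.openedAt = now;
    }
  }
}
```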

Quick Check

What happens when the API returns 'prompt too long' during an agent task?

📐

Key Code Patterns

Query Loop with Error Recovery (TypeScript pseudocode)

```typescript
const Transition = {
  COMPLETED:        "completed",              // success — no more tool calls
  TOOL_USE:         "tool_use",               // normal — execute tools, loop back
  REACTIVE_COMPACT: "reactive_compact_retry", // prompt too long
  MAX_TOKENS:       "max_output_tokens_recovery", // output truncated
  MODEL_ERROR:      "model_error",            // permanent API error
  MAX_TURNS:        "max_turns",              // safety limit hit
  ABORTED:          "aborted_streaming",      // user cancelled
} as const;

async function queryLoop(messages: Message[], tools: Tool[], config: Config): Promise<string> {
  let hasAttemptedCompact = false;
  let maxTokensRetries = 0;
  let turns = 0;

  while (true) {
    if (++turns > config.maxTurns) return Transition.MAX_TURNS; // safety limit

    try {
      const response = await callApi(messages, tools);
      const toolBlocks = extractToolUse(response);
      if (toolBlocks.length === 0) return Transition.COMPLETED;
      const results = await runTools(toolBlocks);
      messages.push(...results);
      // loop back
    } catch (err) {
      if (err instanceof PromptTooLongError) {
        if (hasAttemptedCompact) return Transition.MODEL_ERROR; // already tried, give up
        messages = await compact(messages);
        hasAttemptedCompact = true;
        continue; // retry with compacted messages
      } else if (err instanceof MaxOutputTokensError) {
        if (maxTokensRetries >= 3) return Transition.MODEL_ERROR;
        maxTokensRetries++;
        config.maxTokens *= 2; // escalate limit
        continue;
      } else if (err instanceof RateLimitError) {
        await sleep(err.retryAfter); // backoff
        continue;
      } else if (err instanceof AbortError) {
        return Transition.ABORTED;
      }
      throw err;
    }
  }
}
```

Rate Limit Handling with Early Warning

```typescript
// `warn`, `addDelay`, and `calculateBackoff` are assumed helpers.
function handleRateLimit(responseHeaders: Headers): void {
  const remaining = parseInt(responseHeaders.get("x-ratelimit-remaining") ?? "0", 10);
  const limit = parseInt(responseHeaders.get("x-ratelimit-limit") ?? "1", 10);
  const resetAt = Date.parse(responseHeaders.get("x-ratelimit-reset") ?? "") || Date.now();
  const windowMs = 60_000; // length of the provider's rate-limit window

  // Quota fraction already consumed vs. fraction of the window remaining.
  const utilization = 1.0 - remaining / limit;
  const timeRemainingPct = (resetAt - Date.now()) / windowMs;

  // Burning quota faster than the window is elapsing: slow down early.
  if (utilization > timeRemainingPct) {
    warn("Approaching rate limit — slowing down");
    addDelay(calculateBackoff(utilization));
  }
}
```
🔧

Break It — See What Happens

No reactive compaction
No retry on truncation
No abort handling
📊

Real-World Numbers

Metric | Value
Max truncation retries | 3 attempts with escalating limits
Reactive compact attempts | 1 per loop iteration (one-shot)
Rate limit source | Parsed from response headers
Abort propagation | <100ms via AbortController
Error classification | Transient (retry) vs permanent (surface)
✨ Insight · The one-shot flag for reactive compaction is a deliberate design choice. Allowing multiple compaction attempts risks an infinite loop where the agent keeps compacting and retrying but never makes progress — burning tokens and time on a fundamentally impossible request.
🧠

Key Takeaways

What to remember for interviews

  1. Every loop iteration is classified as a Continue (retry) or Terminal (stop) transition — the agent never crashes; it classifies and routes every error.
  2. Reactive compaction is one-shot: the agent summarizes old messages to free space, but a hasAttemptedReactiveCompact flag prevents infinite compact-retry loops.
  3. Output truncation triggers up to 3 retries with escalating token limits; read-only tools are safe to cancel mid-stream, but Bash commands are allowed to finish to avoid partial writes.
  4. Rate limits are handled proactively: the agent parses x-ratelimit-remaining headers and slows down before hitting the limit, not just after a 429 response.
  5. Circuit breakers extend the transient/permanent classification: after 3 identical Bash failures, the breaker trips and the agent reports rather than retrying indefinitely.
📚

Further Reading

Trade-off

An agent receives HTTP 429 (rate limit) from the API. Which transition should the loop take?

Recall

A Bash tool exits with code 1 (command not found). Should the agent treat this as a tool error or an API error?

Derivation

Why does reactive compaction use a one-shot flag (`hasAttemptedReactiveCompact`) rather than retrying compaction until context fits?

Recall

A user presses Ctrl+C while the agent is mid-streaming a response. What should happen to in-flight tool calls?
🎯

Interview Questions


Design error recovery for an AI agent that handles context overflow mid-task.

★★★
Anthropic · Google

How would you implement graceful degradation when an LLM API rate-limits you?

★★☆
Databricks · Meta

What's the difference between transient and permanent errors in an agentic system?

★★☆
OpenAI

Design an error recovery system that distinguishes between transient failures (retry) and permanent failures (escalate) for LLM tool calls.

★★★
OpenAI