🗜️ Context Compaction
At 80% context usage, the agent silently summarizes its own history to keep going
Every tool call adds tokens to the conversation. A 500-line file read is ~2K tokens. After 20 tool calls, you're at 50K+ tokens. Hit the context limit and the agent is dead — it cannot make another API call. Context compaction is how agents survive long tasks: four strategies that progressively compress the conversation while preserving what matters.
- Auto-compact: proactive at ~80% context — summarizes before hitting the limit
- Reactive compact: emergency recovery when the API returns "prompt too long"
- Microcompact: shrinks individual large tool results (e.g., a 10K-line grep output)
- Context collapse: structurally removes old system-reminder sections
Compaction Strategies
What you are seeing
The four compaction strategies arranged by aggressiveness. Each targets a different problem: auto-compact prevents hitting the limit, reactive recovers after hitting it, microcompact shrinks individual results, and context collapse removes structural bloat.
What to try
Follow a conversation as it grows from 0 to 200K tokens. Notice which strategy activates at each stage and what information is preserved vs discarded.
The Intuition
Token Growth Without Compaction
Every tool call appends both the request and the result to the conversation. Without compaction, context grows linearly until the model hits its limit and the agent dies mid-task:
| Turn | Tokens | Event |
|---|---|---|
| 1 | 8K | System prompt |
| 5 | 25K | After 4 tool calls |
| 10 | 55K | — |
| 15 | 90K | — |
| 18 | 160K → 45K | AUTO-COMPACT fires — summarizes turns 1–13
| 25 | 155K → 45K | AUTO-COMPACT again |
Without compaction: the conversation keeps growing past turn 18's 160K, hits the 200K limit a few turns later, and the agent dies mid-task.
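To make the trigger arithmetic concrete, here is a minimal sketch that replays the table above. The per-turn growth rate and the 45K post-compaction floor are illustrative assumptions; the 200K limit and 13K buffer match the numbers used throughout this module.

```ts
const CONTEXT_LIMIT = 200_000;
const AUTOCOMPACT_BUFFER = 13_000;

// Replay the growth curve: each turn appends a tool call and its result
function simulate(turns: number, tokensPerTurn: number): void {
  let used = 8_000; // system prompt
  for (let turn = 1; turn <= turns; turn++) {
    used += tokensPerTurn;
    if (used > CONTEXT_LIMIT - AUTOCOMPACT_BUFFER) {
      console.log(`turn ${turn}: ${used} tokens -> auto-compact fires`);
      used = 45_000; // summary + last 5 messages survive
    }
  }
}

simulate(40, 9_000); // fires around turn 20, then again ~16 turns later
```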
Auto-Compact (Proactive)
Triggers when token usage exceeds context_limit - AUTOCOMPACT_BUFFER (a 13,000-token safety margin). The system sends old messages to the LLM with "summarize this conversation concisely" and replaces them with the summary. Recent messages (the last 5) are kept intact — you don't want to summarize what just happened.
Reactive Compact (Emergency)
When the API returns "prompt too long" despite auto-compact (edge case: a single tool result pushed the request past the limit), the system does emergency compaction and retries. A hasAttemptedReactiveCompact flag prevents infinite compaction loops — if the retried call still fails, the system escalates to the user instead of compacting again.
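The caller side of this recovery is worth seeing too. A sketch of the retry wrapper, reusing the compact and callApi helpers shown under Key Code Patterns below; the error-message matching is an assumption about how the API surfaces the failure:

```ts
let hasAttemptedReactiveCompact = false;

async function callWithRecovery(messages: Message[]) {
  try {
    return await callApi({ messages });
  } catch (err) {
    const tooLong =
      err instanceof Error && /prompt too long/i.test(err.message);
    if (!tooLong || hasAttemptedReactiveCompact) throw err; // escalate
    hasAttemptedReactiveCompact = true;
    const compacted = await compact(messages); // emergency compaction
    return callApi({ messages: compacted });   // single retry
  }
}
```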
Microcompact (Per-Result)
Instead of compacting the whole conversation, microcompact targets a single large tool result. A 10K-line grep output becomes a 50-line summary of relevant matches. A 500-line file read becomes a paragraph describing the file's structure. This preserves the conversation flow — the tool call and its result are still there, just shorter.
Context Collapse (Structural)
System-reminder messages are injected throughout the conversation (git status, available skills, date). Old ones become stale — the git status from 30 minutes ago is irrelevant. Context collapse removes these outdated structural messages entirely, without summarization. This is the cheapest compaction: no API call needed, just message filtering.
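A sketch of what the filtering predicates could look like, assuming each injected reminder is tagged with its kind and an injection timestamp (the message shape and the 30-minute staleness window are assumptions):

```ts
// Assumed shape: injected reminders carry a kind tag and a timestamp
interface ReminderMessage {
  role: "system";
  kind?: "system-reminder";
  injectedAt?: number; // epoch ms
  content: string;
}

const STALE_AFTER_MS = 30 * 60 * 1000; // e.g., a 30-minute-old git status

function isSystemReminder(msg: ReminderMessage): boolean {
  return msg.kind === "system-reminder";
}

function isStale(msg: ReminderMessage): boolean {
  return msg.injectedAt !== undefined &&
    Date.now() - msg.injectedAt > STALE_AFTER_MS;
}
```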
Information Preservation: What to Keep vs. Drop
Compaction is lossy by design — the goal is maximum task continuity with minimum token cost. The preservation heuristic applied during full compaction follows a priority order:
- Always keep: the current task description, file paths that were edited, decisions made by the user, and any errors encountered — these are structural anchors that the agent needs to continue working
- Summarize: the reasoning behind decisions, file contents that were read (compress to "file X contained Y"), and tool output that informed a completed step
- Drop: verbatim grep results that were already acted on, stale file contents superseded by later edits, and exploratory tool calls that led nowhere
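One way to mechanize this priority order is to tag each old message with a disposition before building the summarization request. The matching rules and tool-output prefixes below are illustrative assumptions, not a production heuristic:

```ts
type Disposition = "keep" | "summarize" | "drop";

// Illustrative classifier: real systems use richer message metadata
function classify(msg: { role: string; content: string }): Disposition {
  const text = msg.content;
  if (/error|exception|failed/i.test(text)) return "keep"; // errors anchor recovery
  if (msg.role === "user") return "keep";                  // user decisions
  if (/^\[grep\]|^\[glob\]/.test(text)) return "drop";     // already-acted-on searches
  if (/^\[read\]/.test(text)) return "summarize";          // compress file contents
  return "summarize";                                      // default: compress reasoning
}
```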
The compaction prompt instructs the LLM to produce a structured summary with explicit sections — "current task," "files modified," "decisions made," "next steps" — rather than free-form prose. Structured summaries are easier for the agent to parse in subsequent turns and less likely to drop critical details.
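A compaction prompt in that style might look like the following; the exact wording is an assumption, with only the section names taken from the description above:

```ts
const COMPACTION_PROMPT = `
Summarize the conversation so far. Output exactly these sections:

## Current task
What the agent is trying to accomplish right now.

## Files modified
Every file path that was created or edited, one per line.

## Decisions made
Choices the user made or confirmed, verbatim where possible.

## Next steps
What remains to be done, in order.

Be concise. Preserve file paths, error messages, and user decisions exactly.
`;
```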
Compaction and Prompt Cache Invalidation
Every time full compaction runs, the messages array is replaced with a summary — a completely different sequence of tokens. API providers cache prompt prefixes: if the first N tokens of a request match a previous request, you get a cache hit. After compaction, the cached prefix is the old (pre-compaction) messages, which no longer match the new (post-compaction) conversation. The result: the next API call after compaction is a cache miss on the messages portion, costing full input token price instead of the discounted cached rate. This is an unavoidable tradeoff — compaction saves tokens overall by reducing message length, even though it momentarily breaks the cache. The static system prompt portion (tool definitions, behavior rules) still hits the cache because it does not change. Only the messages portion misses. After 2–3 turns post-compaction, the new message sequence becomes the cached prefix and cache hits resume.
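Rough numbers make the tradeoff clear. Assuming illustrative prices where cached input costs a tenth of uncached input (rates vary by provider), the one-time miss is repaid within a few turns:

```ts
// Illustrative rates, not any provider's actual prices
const UNCACHED = 3.0 / 1e6; // $ per uncached input token
const CACHED = 0.3 / 1e6;   // $ per cached input token (10% of uncached)

const before = 160_000; // messages-portion tokens pre-compaction
const after = 45_000;   // messages-portion tokens post-compaction

// One-time penalty: the first post-compaction call pays uncached rates
// on tokens that would otherwise have been cache reads
const missPenalty = after * (UNCACHED - CACHED); // ≈ $0.12

// Ongoing savings: every later turn sends 115K fewer tokens,
// even once the new, shorter prefix is warm in the cache
const perTurnSavings = (before - after) * CACHED; // ≈ $0.035

console.log(`miss repaid after ~${Math.ceil(missPenalty / perTurnSavings)} turns`); // ~4
```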
Why does microcompact exist separately from full compaction? Because the two solve different problems: full compaction rewrites the entire messages array, which is lossy and invalidates the prompt cache, while microcompact shrinks the single oversized result that caused the spike, leaving the rest of the conversation (and its cached prefix) untouched. Most token spikes come from one giant tool result, so the surgical fix is usually enough.
Key Code Patterns
Auto-Compact Trigger
```ts
const AUTOCOMPACT_BUFFER = 13_000; // safety margin in tokens
const WARNING_BUFFER = 20_000;

function shouldAutoCompact(messages: Message[], model: string): boolean {
  const used = countTokens(messages);
  const limit = getContextLimit(model);
  return used > limit - AUTOCOMPACT_BUFFER;
}

function shouldWarn(messages: Message[], model: string): boolean {
  const used = countTokens(messages);
  const limit = getContextLimit(model);
  return used > limit - WARNING_BUFFER;
}
```
Full Compaction
```ts
// One-shot guard: reactive compaction may only be attempted once
let hasAttemptedReactiveCompact = false;

async function compact(messages: Message[]): Promise<Message[]> {
  // Summarize old messages, keep recent ones
  const old = messages.slice(0, -5);  // everything except last 5
  const recent = messages.slice(-5);  // keep recent context intact
  const summary = await callApi({
    system:
      "Summarize this conversation concisely. Preserve: " +
      "current task, file paths, decisions made, errors.",
    messages: old,
  });
  return [{ role: "system", content: summary }, ...recent];
}

async function reactiveCompact(
  messages: Message[],
  error: Error
): Promise<Message[]> {
  // Emergency compaction after API 'prompt too long' error
  if (hasAttemptedReactiveCompact) {
    throw new Error("Compaction failed twice — escalate to user");
  }
  hasAttemptedReactiveCompact = true;
  return compact(messages); // caller retries the API call
}
```
Microcompact
```ts
async function microcompact(toolResult: string): Promise<string> {
  // Shrink a single large tool result
  if (toolResult.length > 10_000) {
    const summary = await callApi({
      system:
        "Summarize this tool output briefly. " +
        "Keep: key findings, paths, line numbers, errors.",
      messages: [{ role: "user", content: toolResult }],
    });
    return `[microcompacted] ${summary}`;
  }
  return toolResult; // small enough, keep as-is
}

function contextCollapse(messages: Message[]): Message[] {
  // Remove stale system-reminder sections (no API call)
  return messages.filter(
    (msg) => !(isSystemReminder(msg) && isStale(msg))
  );
}
```
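Putting the pieces together: a sketch of how an agent loop might compose the four strategies above, cheapest first. The loop structure and notifyUser are assumptions; the helpers are the ones defined in this section:

```ts
async function nextTurn(
  messages: Message[],
  model: string
): Promise<Message[]> {
  // 1. Free: drop stale system-reminder messages (no API call)
  let msgs = contextCollapse(messages);

  // 2. Proactive: warn the user, then summarize before nearing the limit
  if (shouldWarn(msgs, model)) notifyUser("context is getting full");
  if (shouldAutoCompact(msgs, model)) msgs = await compact(msgs);

  try {
    const reply = await callApi({ messages: msgs });
    return [...msgs, { role: "assistant", content: reply }];
  } catch (err) {
    // 3. Emergency: one-shot reactive compaction, then a single retry
    msgs = await reactiveCompact(msgs, err as Error);
    const reply = await callApi({ messages: msgs });
    return [...msgs, { role: "assistant", content: reply }];
  }
}

// Microcompact applies at append time rather than here: when a tool
// result comes back, store await microcompact(result), not the raw text.
```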
Real-World Numbers
| Metric | Value |
|---|---|
| Auto-compact buffer | 13,000 tokens before context limit |
| Warning buffer | 20,000 tokens before context limit |
| Recent messages kept | Last 5 messages (not summarized) |
| Microcompact threshold | ~10,000 characters per tool result |
| Reactive compact guard | hasAttemptedReactiveCompact (one-shot flag) |
| Compaction strategies | 4 (auto, reactive, micro, collapse) |
Key Takeaways
What to remember for interviews
1. Four compaction strategies form defense-in-depth: context collapse (free, removes stale system-reminders), microcompact (cheap, shrinks one oversized result), auto-compact (proactive at ~80% context), and reactive compact (emergency fallback after API error).
2. Auto-compact fires 13,000 tokens before the context limit, leaving room for the summary itself to fit — the 13K buffer is deliberately sized for one more full API call.
3. Microcompact is surgical: it shrinks a single tool result (e.g., a 10K-line grep) in place without disturbing conversation structure, preventing single-result token spikes.
4. The hasAttemptedReactiveCompact one-shot flag stops infinite compaction loops — if emergency compaction fails twice, the system escalates to the user instead of retrying.
5. Every full compaction is a prompt-cache miss on the messages portion; the static system prompt still hits cache, and hits resume after 2–3 turns as the new sequence becomes the cached prefix.
Further Reading
- Claude Code (source) — Production implementation of all four compaction strategies described in this module.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023 — LLMs struggle with information in the middle of long contexts, motivating compaction.
- Ring Attention with Blockwise Transformers for Near-Infinite Context — Liu et al., 2023 — extending context windows, reducing but not eliminating the need for compaction.
- MemGPT: Towards LLMs as Operating Systems — Packer et al., 2023 — virtual context management with paging, a complementary approach to compaction.
- In-Context Retrieval-Augmented Language Models — Ram et al., 2023 — dynamically retrieving only the relevant chunks rather than keeping the full context, the retrieval complement to compaction.
- Anthropic: Long Context Tips — Anthropic's guidance on structuring prompts to maximize long-context performance — directly informs what to preserve vs. discard during compaction.
- Lilian Weng: The Transformer Family v2 — Deep dive on context window extensions and memory-augmented transformers — the research landscape that motivates runtime compaction.
Interview Questions
- ★★☆ An agent hits the context limit mid-task. Design a recovery strategy.
- ★★★ Compare proactive vs reactive compaction. When does each fail?
- ★★★ How would you decide what to keep vs what to summarize?
- ★★☆ Design a microcompact system that shrinks individual tool results without losing critical information.