🗜️ Context Compaction
At 80% context usage, the agent silently summarizes its own history to keep going
Every tool call adds tokens to the conversation. A 500-line file read is ~2K tokens. After 20 tool calls, you're at 50K+ tokens. Hit the context limit and the agent is dead — it cannot make another API call. Context compaction is how agents survive long tasks: four strategies that progressively compress the conversation while preserving what matters.
- Auto-compact: proactive at ~80% context — summarizes before hitting the limit
- Reactive compact: emergency recovery when the API returns "prompt too long"
- Microcompact: shrinks individual large tool results (e.g., a 10K-line grep output)
- Context collapse: structurally removes old system-reminder sections
Compaction Strategies
What you are seeing
The four compaction strategies arranged by aggressiveness. Each targets a different problem: auto-compact prevents hitting the limit, reactive recovers after hitting it, microcompact shrinks individual results, and context collapse removes structural bloat.
What to try
Follow a conversation as it grows from 0 to 200K tokens. Notice which strategy activates at each stage and what information is preserved vs discarded.
The Intuition
Token Growth Without Compaction
Every tool call appends both the request and the result to the conversation. Without compaction, context grows linearly until the model hits its limit and the agent dies mid-task:
| Turn | Tokens | Event |
|---|---|---|
| 1 | 8K | System prompt |
| 5 | 25K | After 4 tool calls |
| 10 | 55K | — |
| 15 | 90K | — |
| 18 | 160K → 45K | AUTO-COMPACT fires — summarizes turns 1–13
| 25 | 155K → 45K | AUTO-COMPACT again |
Without compaction: the conversation keeps growing past turn 18's 160K, hits the 200K limit a few turns later, and the agent dies mid-task.
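To make the trigger arithmetic concrete, here is a minimal sketch that replays the table above. The per-turn growth rate and the 45K post-compaction floor are illustrative assumptions; the 200K limit and 13K buffer match the numbers used throughout this module.

```ts
const CONTEXT_LIMIT = 200_000;
const AUTOCOMPACT_BUFFER = 13_000;

// Replay the growth curve: each turn appends a tool call and its result
function simulate(turns: number, tokensPerTurn: number): void {
  let used = 8_000; // system prompt
  for (let turn = 1; turn <= turns; turn++) {
    used += tokensPerTurn;
    if (used > CONTEXT_LIMIT - AUTOCOMPACT_BUFFER) {
      console.log(`turn ${turn}: ${used} tokens -> auto-compact fires`);
      used = 45_000; // summary + last 5 messages survive
    }
  }
}

simulate(40, 9_000); // fires around turn 20, then again ~16 turns later
```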
Auto-Compact (Proactive)
Triggers when token usage exceeds context_limit - AUTOCOMPACT_BUFFER (a 13,000-token safety margin). The system sends old messages to the LLM with "summarize this conversation concisely" and replaces them with the summary. Recent messages (the last 5) are kept intact — you don't want to summarize what just happened.
Reactive Compact (Emergency)
When the API returns "prompt too long" despite auto-compact (edge case: a single tool result pushed the request past the limit), the system does emergency compaction and retries. A hasAttemptedReactiveCompact flag prevents infinite compaction loops — if the retried call still fails, the system escalates to the user instead of compacting again.
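The caller side of this recovery is worth seeing too. A sketch of the retry wrapper, reusing the compact and callApi helpers shown under Key Code Patterns below; the error-message matching is an assumption about how the API surfaces the failure:

```ts
let hasAttemptedReactiveCompact = false;

async function callWithRecovery(messages: Message[]) {
  try {
    return await callApi({ messages });
  } catch (err) {
    const tooLong =
      err instanceof Error && /prompt too long/i.test(err.message);
    if (!tooLong || hasAttemptedReactiveCompact) throw err; // escalate
    hasAttemptedReactiveCompact = true;
    const compacted = await compact(messages); // emergency compaction
    return callApi({ messages: compacted });   // single retry
  }
}
```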
Microcompact (Per-Result)
Instead of compacting the whole conversation, microcompact targets a single large tool result. A 10K-line grep output becomes a 50-line summary of relevant matches. A 500-line file read becomes a paragraph describing the file's structure. This preserves the conversation flow — the tool call and its result are still there, just shorter.
Context Collapse (Structural)
System-reminder messages are injected throughout the conversation (git status, available skills, date). Old ones become stale — the git status from 30 minutes ago is irrelevant. Context collapse removes these outdated structural messages entirely, without summarization. This is the cheapest compaction: no API call needed, just message filtering.
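A sketch of what the filtering predicates could look like, assuming each injected reminder is tagged with its kind and an injection timestamp (the message shape and the 30-minute staleness window are assumptions):

```ts
// Assumed shape: injected reminders carry a kind tag and a timestamp
interface ReminderMessage {
  role: "system";
  kind?: "system-reminder";
  injectedAt?: number; // epoch ms
  content: string;
}

const STALE_AFTER_MS = 30 * 60 * 1000; // e.g., a 30-minute-old git status

function isSystemReminder(msg: ReminderMessage): boolean {
  return msg.kind === "system-reminder";
}

function isStale(msg: ReminderMessage): boolean {
  return msg.injectedAt !== undefined &&
    Date.now() - msg.injectedAt > STALE_AFTER_MS;
}
```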
Information Preservation: What to Keep vs. Drop
Compaction is lossy by design — the goal is maximum task continuity with minimum token cost. The preservation heuristic applied during full compaction follows a priority order:
- Always keep: the current task description, file paths that were edited, decisions made by the user, and any errors encountered — these are structural anchors that the agent needs to continue working
- Summarize: the reasoning behind decisions, file contents that were read (compress to "file X contained Y"), and tool output that informed a completed step
- Drop: verbatim grep results that were already acted on, stale file contents superseded by later edits, and exploratory tool calls that led nowhere
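One way to mechanize this priority order is to tag each old message with a disposition before building the summarization request. The matching rules and tool-output prefixes below are illustrative assumptions, not a production heuristic:

```ts
type Disposition = "keep" | "summarize" | "drop";

// Illustrative classifier: real systems use richer message metadata
function classify(msg: { role: string; content: string }): Disposition {
  const text = msg.content;
  if (/error|exception|failed/i.test(text)) return "keep"; // errors anchor recovery
  if (msg.role === "user") return "keep";                  // user decisions
  if (/^\[grep\]|^\[glob\]/.test(text)) return "drop";     // already-acted-on searches
  if (/^\[read\]/.test(text)) return "summarize";          // compress file contents
  return "summarize";                                      // default: compress reasoning
}
```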
The compaction prompt instructs the LLM to produce a structured summary with explicit sections — "current task," "files modified," "decisions made," "next steps" — rather than free-form prose. Structured summaries are easier for the agent to parse in subsequent turns and less likely to drop critical details.
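A compaction prompt in that style might look like the following; the exact wording is an assumption, with only the section names taken from the description above:

```ts
const COMPACTION_PROMPT = `
Summarize the conversation so far. Output exactly these sections:

## Current task
What the agent is trying to accomplish right now.

## Files modified
Every file path that was created or edited, one per line.

## Decisions made
Choices the user made or confirmed, verbatim where possible.

## Next steps
What remains to be done, in order.

Be concise. Preserve file paths, error messages, and user decisions exactly.
`;
```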
Compaction and Prompt Cache Invalidation
Every time full compaction runs, the messages array is replaced with a summary — a completely different sequence of tokens. API providers cache prompt prefixes: if the first N tokens of a request match a previous request, you get a cache hit. After compaction, the cached prefix is the old (pre-compaction) messages, which no longer match the new (post-compaction) conversation. The result: the next API call after compaction is a cache miss on the messages portion, costing full input token price instead of the discounted cached rate. This is an unavoidable tradeoff — compaction saves tokens overall by reducing message length, even though it momentarily breaks the cache. The static system prompt portion (tool definitions, behavior rules) still hits the cache because it does not change. Only the messages portion misses. After 2–3 turns post-compaction, the new message sequence becomes the cached prefix and cache hits resume.
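Rough numbers make the tradeoff clear. Assuming illustrative prices where cached input costs a tenth of uncached input (rates vary by provider), the one-time miss is repaid within a few turns:

```ts
// Illustrative rates, not any provider's actual prices
const UNCACHED = 3.0 / 1e6; // $ per uncached input token
const CACHED = 0.3 / 1e6;   // $ per cached input token (10% of uncached)

const before = 160_000; // messages-portion tokens pre-compaction
const after = 45_000;   // messages-portion tokens post-compaction

// One-time penalty: the first post-compaction call pays uncached rates
// on tokens that would otherwise have been cache reads
const missPenalty = after * (UNCACHED - CACHED); // ≈ $0.12

// Ongoing savings: every later turn sends 115K fewer tokens,
// even once the new, shorter prefix is warm in the cache
const perTurnSavings = (before - after) * CACHED; // ≈ $0.035

console.log(`miss repaid after ~${Math.ceil(missPenalty / perTurnSavings)} turns`); // ~4
```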
Why does microcompact exist separately from full compaction? Because the two solve different problems: full compaction rewrites the entire messages array, which is lossy and invalidates the prompt cache, while microcompact shrinks the single oversized result that caused the spike, leaving the rest of the conversation (and its cached prefix) untouched. Most token spikes come from one giant tool result, so the surgical fix is usually enough.
Key Code Patterns
Auto-Compact Trigger
```ts
const AUTOCOMPACT_BUFFER = 13_000; // safety margin in tokens
const WARNING_BUFFER = 20_000;

function shouldAutoCompact(messages: Message[], model: string): boolean {
  const used = countTokens(messages);
  const limit = getContextLimit(model);
  return used > limit - AUTOCOMPACT_BUFFER;
}

function shouldWarn(messages: Message[], model: string): boolean {
  const used = countTokens(messages);
  const limit = getContextLimit(model);
  return used > limit - WARNING_BUFFER;
}
```
Full Compaction
```ts
// One-shot guard: reactive compaction may only be attempted once
let hasAttemptedReactiveCompact = false;

async function compact(messages: Message[]): Promise<Message[]> {
  // Summarize old messages, keep recent ones
  const old = messages.slice(0, -5);  // everything except last 5
  const recent = messages.slice(-5);  // keep recent context intact
  const summary = await callApi({
    system:
      "Summarize this conversation concisely. Preserve: " +
      "current task, file paths, decisions made, errors.",
    messages: old,
  });
  return [{ role: "system", content: summary }, ...recent];
}

async function reactiveCompact(
  messages: Message[],
  error: Error
): Promise<Message[]> {
  // Emergency compaction after API 'prompt too long' error
  if (hasAttemptedReactiveCompact) {
    throw new Error("Compaction failed twice — escalate to user");
  }
  hasAttemptedReactiveCompact = true;
  return compact(messages); // caller retries the API call
}
```
Microcompact
```ts
async function microcompact(toolResult: string): Promise<string> {
  // Shrink a single large tool result
  if (toolResult.length > 10_000) {
    const summary = await callApi({
      system:
        "Summarize this tool output briefly. " +
        "Keep: key findings, paths, line numbers, errors.",
      messages: [{ role: "user", content: toolResult }],
    });
    return `[microcompacted] ${summary}`;
  }
  return toolResult; // small enough, keep as-is
}

function contextCollapse(messages: Message[]): Message[] {
  // Remove stale system-reminder sections (no API call)
  return messages.filter(
    (msg) => !(isSystemReminder(msg) && isStale(msg))
  );
}
```
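Putting the pieces together: a sketch of how an agent loop might compose the four strategies above, cheapest first. The loop structure and notifyUser are assumptions; the helpers are the ones defined in this section:

```ts
async function nextTurn(
  messages: Message[],
  model: string
): Promise<Message[]> {
  // 1. Free: drop stale system-reminder messages (no API call)
  let msgs = contextCollapse(messages);

  // 2. Proactive: warn the user, then summarize before nearing the limit
  if (shouldWarn(msgs, model)) notifyUser("context is getting full");
  if (shouldAutoCompact(msgs, model)) msgs = await compact(msgs);

  try {
    const reply = await callApi({ messages: msgs });
    return [...msgs, { role: "assistant", content: reply }];
  } catch (err) {
    // 3. Emergency: one-shot reactive compaction, then a single retry
    msgs = await reactiveCompact(msgs, err as Error);
    const reply = await callApi({ messages: msgs });
    return [...msgs, { role: "assistant", content: reply }];
  }
}

// Microcompact applies at append time rather than here: when a tool
// result comes back, store await microcompact(result), not the raw text.
```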
Real-World Numbers
| Metric | Value |
|---|---|
| Auto-compact buffer | 13,000 tokens before context limit |
| Warning buffer | 20,000 tokens before context limit |
| Recent messages kept | Last 5 messages (not summarized) |
| Microcompact threshold | ~10,000 characters per tool result |
| Reactive compact guard | hasAttemptedReactiveCompact (one-shot flag) |
| Compaction strategies | 4 (auto, reactive, micro, collapse) |
Key Takeaways
What to remember for interviews
1. Four compaction strategies form defense-in-depth: context collapse (free, removes stale system-reminders), microcompact (cheap, shrinks one oversized result), auto-compact (proactive at ~80% context), and reactive compact (emergency fallback after API error).
2. Auto-compact fires 13,000 tokens before the context limit, leaving room for the summary itself to fit — the 13K buffer is deliberately sized for one more full API call.
3. Microcompact is surgical: it shrinks a single tool result (e.g., a 10K-line grep) in place without disturbing conversation structure, preventing single-result token spikes.
4. The hasAttemptedReactiveCompact one-shot flag stops infinite compaction loops — if emergency compaction fails twice, the system escalates to the user instead of retrying.
5. Every full compaction is a prompt-cache miss on the messages portion; the static system prompt still hits cache, and hits resume after 2–3 turns as the new sequence becomes the cached prefix.
Further Reading
- Claude Code (source) — Production implementation of all four compaction strategies described in this module.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023 — LLMs struggle with information in the middle of long contexts, motivating compaction.
- Ring Attention with Blockwise Transformers for Near-Infinite Context — Liu et al., 2023 — extending context windows, reducing but not eliminating the need for compaction.
- MemGPT: Towards LLMs as Operating Systems — Packer et al., 2023 — virtual context management with paging, a complementary approach to compaction.
- In-Context Retrieval-Augmented Language Models — Ram et al., 2023 — dynamically retrieving only the relevant chunks rather than keeping the full context, the retrieval complement to compaction.
- Anthropic: Long Context Tips — Anthropic's guidance on structuring prompts to maximize long-context performance — directly informs what to preserve vs. discard during compaction.
- Lilian Weng: The Transformer Family v2 — Deep dive on context window extensions and memory-augmented transformers — the research landscape that motivates runtime compaction.
Interview Questions
- ★★☆ An agent hits the context limit mid-task. Design a recovery strategy.
- ★★★ Compare proactive vs reactive compaction. When does each fail?
- ★★★ How would you decide what to keep vs what to summarize?
- ★★☆ Design a microcompact system that shrinks individual tool results without losing critical information.