💰 Cost Tracking & Budgets
Claude Code emits cost events on every API response. Miss one and a runaway agent burns $200 before the budget gate fires.
Every API call has a price. Opus input costs $15 per million tokens, output costs $75/M — but cached input tokens cost just $1.50/M (10x cheaper). A cost tracking system monitors spend in real-time, enforces budget limits, and helps users make informed model choices.
- Every API response includes input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens
- Per-model pricing: multiply each token type by its rate, sum for total cost
- Budget limits (maxBudgetUsd) stop the agent before overspending
Cost Calculation Walkthrough
What you are seeing
A single API call broken down by token type: how many tokens of each kind, the per-token rate, and the resulting cost. Notice how cached tokens dramatically reduce the total.
What to try
Compare the cached vs uncached cost. A 10K-token system prompt would cost $0.15 uncached but only $0.015 cached at Opus rates, a 10x savings repeated on every API call.
LLM API Cost Calculator
A worked example: one request with 50K input tokens and 2K output tokens, at a 70% cache hit rate on Claude Sonnet 4.

Pricing details: Claude Sonnet 4
- input: $3/M tokens
- output: $15/M tokens
- cache_read: $0.30/M tokens
- cache_write: $3.75/M tokens

```
input_cost = (uncached × $3 + cached × $0.30) / 1M
uncached   = 50,000 × 0.30 = 15,000 tokens
cached     = 50,000 × 0.70 = 35,000 tokens
```

| Line item | Amount |
|---|---|
| Input cost | $0.05550 per request |
| Output cost | $0.03000 per request |
| Total with cache | $0.08550 per request |
| Cache savings | $0.09450 per request (52% saved) |

Cache impact at 10 requests/day: $1.80/day without cache (0% hit rate) vs $0.8550/day with 70% cache, roughly $25.65/mo. (A per-model comparison of daily cost for 10 requests follows in the table under The Intuition below.)
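To make the arithmetic concrete, here is a minimal sketch (illustrative, not from any real codebase) that reproduces the calculator's numbers using the Sonnet 4 rates above:

```typescript
// Reproduce the worked example: 50K input, 2K output, 70% cache hit on Sonnet 4.
const INPUT = 3.0 / 1e6;       // $/token, uncached input
const OUTPUT = 15.0 / 1e6;     // $/token, output
const CACHE_READ = 0.3 / 1e6;  // $/token, cached input

const inputTokens = 50_000;
const outputTokens = 2_000;
const hitRate = 0.7;

const uncached = inputTokens * (1 - hitRate);              // 15,000 tokens
const cached = inputTokens * hitRate;                      // 35,000 tokens
const inputCost = uncached * INPUT + cached * CACHE_READ;  // $0.05550
const outputCost = outputTokens * OUTPUT;                  // $0.03000
const withCache = inputCost + outputCost;                  // $0.08550
const withoutCache = inputTokens * INPUT + outputCost;     // $0.18000

console.log(`savings: $${(withoutCache - withCache).toFixed(5)}`); // $0.09450 (52%)
console.log(`10 req/day: $${(10 * withCache).toFixed(4)}/day`);    // $0.8550/day
console.log(`monthly:    $${(300 * withCache).toFixed(2)}/mo`);    // $25.65/mo
```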
The Intuition
Model Cost Comparison — 1K requests × 50K input + 2K output
| Model | Input/M | Output/M | Cache Read/M | Cost for 1K requests |
|---|---|---|---|---|
| Opus 4 | $15.00 | $75.00 | $1.50 | $900 → $293 (90% cache) |
| Sonnet 4 | $3.00 | $15.00 | $0.30 | $180 → $59 (90% cache) |
| Haiku 4.5 | $1.00 | $5.00 | $0.10 | $60 → $20 (90% cache) |
Cache savings assume 90% hit rate on the 50K input prefix. Cached tokens billed at Cache Read rate.
Token-Level Pricing
Every API response includes a usage object with four token counts: input_tokens (new context), cache_read_input_tokens (reused prefix), cache_creation_input_tokens (newly cached), and output_tokens (model response). Each has a different per-million rate that varies by model.
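For concreteness, a minimal sketch of that usage block as it appears on an Anthropic Messages API response, plus an illustrative mapping into the camelCase TokenUsage shape used by the cost tracker below (the mapping function is an assumption, not a library API):

```typescript
// Field names as returned by the Anthropic Messages API.
interface ApiUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens: number;
  cache_creation_input_tokens: number;
}

// Illustrative adapter to the TokenUsage shape defined under Key Code Patterns.
function toTokenUsage(usage: ApiUsage) {
  return {
    inputTokens: usage.input_tokens,
    outputTokens: usage.output_tokens,
    cacheReadTokens: usage.cache_read_input_tokens,
    cacheCreationTokens: usage.cache_creation_input_tokens,
  };
}
```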
Cache Economics
Cached input tokens cost 10x less than uncached ones. The API caches prompt prefixes — if the first N tokens of your request match a previous request, those tokens get a cache hit. The system prompt and tool definitions (~10K tokens) are identical across turns, making them ideal cache candidates. Over a 50-turn session, caching these tokens saves ~$7 on Opus.
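The arithmetic behind that ~$7 figure, as a quick sketch (Opus 4 rates from the pricing above; prefixCacheSavings is a hypothetical helper):

```typescript
// 10K-token stable prefix re-sent on each of 50 turns, priced at Opus 4 rates.
function prefixCacheSavings(prefixTokens: number, turns: number): number {
  const INPUT = 15.0 / 1e6;      // $/token, uncached input
  const CACHE_READ = 1.5 / 1e6;  // $/token, cache hit
  const total = prefixTokens * turns;         // 500,000 tokens over the session
  return total * INPUT - total * CACHE_READ;  // $7.50 - $0.75
}

console.log(prefixCacheSavings(10_000, 50).toFixed(2)); // "6.75", the ~$7 above
```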
Budget Enforcement
The maxBudgetUsd setting defines a spending cap for the session. After every API call, the cost tracker checks cumulative spend against the budget. If exceeded, it throws BudgetExceededError and the agent stops gracefully, saving session state so the user can resume after adjusting the limit.
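A sketch of that graceful-stop path, assuming the CostTracker and BudgetExceededError defined under Key Code Patterns below and a hypothetical saveSessionState() helper:

```typescript
// Hypothetical persistence hook: write transcript + pending todos to disk.
declare function saveSessionState(): Promise<void>;

// Record usage after each response; on a budget breach, save state and stop
// gracefully instead of issuing another API call.
async function recordTurn(tracker: CostTracker, usage: TokenUsage): Promise<void> {
  try {
    tracker.recordUsage(usage);
  } catch (err) {
    if (err instanceof BudgetExceededError) {
      await saveSessionState();
      console.error(err.message); // e.g. "Budget $10.00 exceeded (spent $10.42)"
      return;
    }
    throw err;
  }
}
```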
Rate Limit Awareness
Anthropic API responses include anthropic-ratelimit-* headers (other providers expose a similar x-ratelimit-* family): tokens remaining, the total limit, and the reset time. The tracker computes utilization (tokens used / total limit) and compares it to the fraction of the rate-limit window already elapsed. If utilization runs ahead of the time curve, it triggers an early warning; the agent can slow down or batch operations to avoid hitting the hard limit.
Cache Prefix Ordering for Maximum Hit Rate
The Anthropic API caches prompt prefixes — only the leading tokens of a request are eligible for caching. This means the order of content in your request directly determines cache efficiency. The optimal layout, per the Anthropic prompt caching docs, is: stable system prompt first, then tool definitions (sorted deterministically by name), then conversation history (variable). Tool definitions must be sorted the same way on every call — if the tool list is built from an unordered object, two calls with identical tools but different insertion order produce different prefixes and get zero cache hits. A cache miss on a 10K-token system prompt costs $0.15 per call on Opus; over 50 turns that is $7.50 wasted. The fix is a single tools.sort((a, b) => a.name.localeCompare(b.name)) before building the request.
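A sketch of deterministic request assembly under those rules (buildRequest and ToolDef are illustrative names, not the real Claude Code internals):

```typescript
interface ToolDef { name: string; description: string; input_schema: object; }

// Stable prefix first: system prompt, then tools in deterministic order,
// then the variable conversation history last.
function buildRequest(system: string, tools: ToolDef[], messages: object[]) {
  const sorted = [...tools].sort((a, b) => a.name.localeCompare(b.name));
  return { system, tools: sorted, messages };
}
```

Copying before sorting keeps the caller's array untouched; the key point is that the same tool set always serializes to the same byte prefix, so every call after the first gets the cache-read rate on that prefix.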
Cost Attribution to Tool Calls
Not all API calls cost the same. A call that triggers a 5-tool chain (Read + Grep + Bash + Edit + Bash) costs more in output tokens than one that answers directly — the model must generate tool_use JSON for each tool invocation, and tool results feed back as input tokens in the next call. Attributing cost at the tool level (rather than just per-session) reveals which tools are most expensive to invoke. In practice, Bash is typically the costliest: it produces verbose stdout that becomes large tool_result blocks in the next turn's input. Truncating Bash output (e.g., keeping only the last 2K lines of a long test run) is one of the highest ROI cost optimizations in an agent system.
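A sketch of that truncation, assuming a hypothetical helper applied to Bash stdout before it is wrapped in a tool_result block:

```typescript
// Keep only the tail of verbose command output; test failures and error
// summaries usually sit at the end of a long run.
function truncateToolOutput(stdout: string, maxLines = 2_000): string {
  const lines = stdout.split("\n");
  if (lines.length <= maxLines) return stdout;
  const dropped = lines.length - maxLines;
  return `[... ${dropped} lines truncated ...]\n${lines.slice(-maxLines).join("\n")}`;
}
```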
Why are cached input tokens so much cheaper than uncached ones?
Key Code Patterns
Per-Model Pricing Table
```typescript
// Per-model pricing (approximate, illustrative)
interface ModelPricing {
  input: number;      // $ per uncached input token
  output: number;     // $ per output token
  cacheRead: number;  // $ per cached input token read
  cacheWrite: number; // $ per token written to the cache
}

const PRICING: Record<string, ModelPricing> = {
  "claude-opus-4": {
    input: 15.00 / 1_000_000,      // $15/M tokens
    output: 75.00 / 1_000_000,     // $75/M tokens
    cacheRead: 1.50 / 1_000_000,   // $1.50/M (10x cheaper!)
    cacheWrite: 18.75 / 1_000_000, // $18.75/M (25% premium over input)
  },
  "claude-sonnet-4": {
    input: 3.00 / 1_000_000,
    output: 15.00 / 1_000_000,
    cacheRead: 0.30 / 1_000_000,
    cacheWrite: 3.75 / 1_000_000,
  },
};
```

Cost Tracker
```typescript
// Thrown when cumulative spend crosses the session budget.
class BudgetExceededError extends Error {}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

class CostTracker {
  private totalCost = 0;
  private turnCosts: number[] = [];

  constructor(
    private model: string,
    private maxBudget?: number,
  ) {}

  // Called after every API response
  recordUsage(usage: TokenUsage): number {
    const pricing = PRICING[this.model];
    const cost =
      usage.inputTokens * pricing.input +
      usage.outputTokens * pricing.output +
      usage.cacheReadTokens * pricing.cacheRead +
      usage.cacheCreationTokens * pricing.cacheWrite;
    this.totalCost += cost;
    this.turnCosts.push(cost);
    if (this.maxBudget && this.totalCost >= this.maxBudget) {
      throw new BudgetExceededError(
        `Budget $${this.maxBudget.toFixed(2)} exceeded (spent $${this.totalCost.toFixed(2)})`,
      );
    }
    return cost;
  }

  // Predict how many more turns the budget allows
  estimateRemainingTurns(): number {
    if (!this.turnCosts.length || !this.maxBudget) return Infinity;
    const avgCost = this.turnCosts.reduce((a, b) => a + b, 0) / this.turnCosts.length;
    const remaining = this.maxBudget - this.totalCost;
    return Math.floor(remaining / avgCost);
  }
}
```

Rate Limit Tracking
```typescript
// Parse an RFC 3339 reset timestamp into a Date.
function parseTime(value: string): Date {
  return new Date(value);
}

// Fraction of the rate-limit window still remaining (the 60s window length is
// an assumption; Anthropic token limits are per-minute).
function timeRemainingPct(resetAt: Date, windowMs = 60_000): number {
  return Math.max(0, Math.min(1, (resetAt.getTime() - Date.now()) / windowMs));
}

class RateLimitTracker {
  private remaining = 0;
  private limit = 0;
  private resetAt: Date = new Date();

  // Parse rate limit headers from an API response
  update(headers: Record<string, string>): string {
    this.remaining = parseInt(headers["anthropic-ratelimit-tokens-remaining"], 10);
    this.limit = parseInt(headers["anthropic-ratelimit-tokens-limit"], 10);
    this.resetAt = parseTime(headers["anthropic-ratelimit-tokens-reset"]);

    // Are we using tokens faster than they replenish? Warn when the fraction
    // of tokens consumed runs ahead of the fraction of the window elapsed.
    const utilization = 1.0 - this.remaining / this.limit;
    const elapsedPct = 1.0 - timeRemainingPct(this.resetAt);
    if (utilization > elapsedPct + 0.1) {
      return "WARNING: approaching rate limit";
    }
    return "OK";
  }

  // True if we should slow down to avoid the hard limit
  shouldThrottle(): boolean {
    return this.remaining < this.limit * 0.1; // less than 10% remaining
  }
}
```
Real-World Numbers
| Metric | Value |
|---|---|
| Opus input cost | $15.00 / M tokens |
| Opus output cost | $75.00 / M tokens |
| Cache read cost | $1.50 / M tokens (10x cheaper) |
| Typical session cost | $0.50 - $5.00 |
| Output-to-input price ratio | 5x |
| Cache savings per 50-turn session | ~$7 on Opus (10K token prefix) |
Key Takeaways
What to remember for interviews
1. Every API response returns four token counts (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens), each billed at a different rate that must be multiplied separately to get true cost.
2. Cached input tokens cost ~10x less than uncached ($1.50/M vs $15/M for Opus) because the API reuses the KV cache from a previous request, skipping GPU computation.
3. Tool definitions must be sorted deterministically before each request; a different insertion order produces a different prefix hash and destroys cache hits worth $7+ per 50-turn session.
4. Output tokens cost 5x more than input on Claude models, making agent verbosity the dominant cost driver; "be concise" in a system prompt is a direct cost optimization.
5. Budget enforcement via maxBudgetUsd throws BudgetExceededError after each API call; proactive rate-limit tracking via the anthropic-ratelimit-* headers slows requests before hitting the hard limit.
Which token type appears in an API usage object and is billed at the cheapest rate?
An Opus session processes 100K input tokens and 10K output tokens with 0% cache hit rate. A Sonnet session does the same. What is the Opus-to-Sonnet cost ratio for this session?
An agent has a 10K-token system prompt repeated on every call. With a 90% cache hit rate over 50 turns on Opus, what is the approximate cache saving vs no caching?
The budget gate (maxBudgetUsd) is checked after every API response. An agent makes 3 calls before the check fires. Each call costs $4 against a $10 limit. What is the maximum overage?
Further Reading
- Anthropic API Pricing — Current pricing for all Claude models — input, output, and cached token rates.
- Prompt Caching with Claude — How prompt caching works, cache breakpoints, and cost implications for agent systems.
- Token Economics of LLM Applications — a16z analysis of cost structures in production LLM applications.
- Cloud Cost Optimization Patterns — Google Cloud cost optimization — the same principles (metering, budgets, alerts) apply to LLM spend.
- LLM API Pricing Comparison (Artificial Analysis) — Live benchmark tracking price, throughput, and latency across all major LLM providers — the reference for model selection decisions in cost-aware agents.
- OpenTelemetry for LLM Observability — The open standard for emitting cost, latency, and token metrics — the instrumentation layer beneath production LLM cost dashboards.
- Costs and Pricing for LLM APIs (Simon Willison) — a year-in-review analysis of LLM cost trends, with practical commentary on how pricing has evolved and what it means for agent budget design.
Interview Questions
- Design a cost tracking system for an AI agent that handles multiple pricing tiers. ★★★
- How does prompt caching affect the economics of AI agent systems? ★☆☆
- What's the right granularity for budget limits — per-turn, per-session, or per-project? ★★☆
- Design a cost tracking system that predicts when a conversation will exceed a budget threshold and suggests cheaper alternatives. ★★★