
Transformer Math

Module 64 · AI Engineering

💰 Cost Tracking & Budgets

Claude Code emits cost events on every API response. Miss one and a runaway agent burns $200 before the budget gate fires.


Every API call has a price. Opus input costs $15 per million tokens and output $75/M — but cached input tokens cost just $1.50/M (10x cheaper). A cost tracking system monitors spend in real time, enforces budget limits, and helps users make informed model choices.

  • Every API response includes input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens
  • Per-model pricing: multiply each token type by its rate, sum for total cost
  • Budget limits (maxBudgetUsd) stop the agent before overspending
🎮 Cost Calculation Walkthrough

What you are seeing

A single API call broken down by token type: how many tokens of each kind, the per-token rate, and the resulting cost. Notice how cached tokens dramatically reduce the total.

What to try

Compare the cached vs uncached cost. The system prompt (10K tokens) would cost $0.15 uncached but only $0.015 cached — a 10x savings repeated on every API call.

LLM API Cost Calculator

Example inputs: Claude Sonnet 4, 50,000 input tokens, 70% cache hit rate, 2,000 output tokens, 10 requests/day.

Input Cost: $0.05550 per request

Output Cost: $0.03000 per request

Cache Savings: $0.09450 (52% saved)

Per Request: $0.08550 with cache

Cache Impact — 10 requests/day

Without cache (0%): $1.80/day

With 70% cache: $0.8550/day

Monthly estimate: $25.65/mo

Cache saves $0.9450/day (52% reduction) — or $28.35/month.

Model Comparison — daily cost for 10 requests (with 70% cache)

Claude Opus 4: $4.28 · Claude Sonnet 4: $0.8550 · Claude Haiku 4.5: $0.2850
Pricing details: Claude Sonnet 4

input: $3/M tokens

output: $15/M tokens

cache_read: $0.3/M tokens

cache_write: $3.75/M tokens

uncached = 50,000 × 0.30 = 15,000 tokens

cached = 50,000 × 0.70 = 35,000 tokens

input_cost = (uncached × $3 + cached × $0.3) / 1M = $0.0555
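The arithmetic above can be checked with a small sketch — the pricing constants are copied from the table, and the function name is illustrative:

```typescript
// Claude Sonnet 4 rates from the pricing details above, in $/M tokens.
const SONNET = { input: 3.0, cacheRead: 0.3 };

// Input cost for one request, given total input tokens and cache hit rate.
function inputCostUsd(totalInput: number, cacheHitRate: number): number {
  const cached = totalInput * cacheHitRate;   // e.g. 35,000 tokens at 70%
  const uncached = totalInput - cached;       // e.g. 15,000 tokens
  return (uncached * SONNET.input + cached * SONNET.cacheRead) / 1_000_000;
}

inputCostUsd(50_000, 0.7); // ≈ $0.0555
inputCostUsd(50_000, 0);   // $0.15 fully uncached
```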

💡 The Intuition

Model Cost Comparison — 1K requests × 50K input + 2K output

| Model | Input/M | Output/M | Cache Read/M | 1K req cost | With 90% cache |
|---|---|---|---|---|---|
| Opus 4 | $15.00 | $75.00 | $1.50 | $900 | $293 |
| Sonnet 4 | $3.00 | $15.00 | $0.30 | $180 | $59 |
| Haiku 4.5 | $1.00 | $5.00 | $0.10 | $60 | $20 |

Cache savings assume 90% hit rate on the 50K input prefix. Cached tokens billed at Cache Read rate.
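The table's totals can be reproduced with a short sketch, assuming the per-request shape in the caption (50K input, 2K output); the `Rates` interface and function name are illustrative:

```typescript
// Per-model rates in $/M tokens, matching the table above.
interface Rates { input: number; output: number; cacheRead: number }

// Total cost for a batch of requests: 50K input + 2K output each,
// with a given cache hit rate on the input prefix.
function batchCostUsd(r: Rates, hitRate: number, requests = 1_000): number {
  const cached = 50_000 * hitRate;
  const uncached = 50_000 - cached;
  const perRequest =
    (uncached * r.input + cached * r.cacheRead + 2_000 * r.output) / 1_000_000;
  return perRequest * requests;
}

const opus: Rates = { input: 15, output: 75, cacheRead: 1.5 };
batchCostUsd(opus, 0);   // $900 with no caching
batchCostUsd(opus, 0.9); // $292.50 ≈ $293 with a 90% hit rate
```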

Token-Level Pricing

Every API response includes a usage object with four token counts: input_tokens (new context), cache_read_input_tokens (reused prefix), cache_creation_input_tokens (newly cached), and output_tokens (model response). Each has a different per-million rate that varies by model.

💡 Tip · Output tokens cost 5x more than input tokens on Claude models ($75/M vs $15/M for Opus). A verbose agent that generates long explanations costs far more than one that gives concise answers. This is why agent prompts often say "be concise."

Cache Economics

Cached input tokens cost 10x less than uncached ones. The API caches prompt prefixes — if the first N tokens of your request match a previous request, those tokens get a cache hit. The system prompt and tool definitions (~10K tokens) are identical across turns, making them ideal cache candidates. Over a 50-turn session, caching these tokens saves ~$7 on Opus.
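The ~$7 figure is a back-of-envelope calculation using the Opus rates quoted earlier (variable names here are illustrative):

```typescript
// Opus rates in $/M tokens: $15 uncached input, $1.50 cache read.
const OPUS = { input: 15.0, cacheRead: 1.5 };

const prefixTokens = 10_000; // stable system prompt + tool definitions
const turns = 50;

// Re-sending the prefix uncached on every turn vs reading it from cache.
const uncachedCost = (prefixTokens * OPUS.input / 1_000_000) * turns;     // $7.50
const cachedCost   = (prefixTokens * OPUS.cacheRead / 1_000_000) * turns; // $0.75
const savings = uncachedCost - cachedCost;                                // $6.75 ≈ $7
```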

Budget Enforcement

The maxBudgetUsd setting defines a spending cap for the session. After every API call, the cost tracker checks cumulative spend against the budget. If exceeded, it raises BudgetExceededError and the agent stops gracefully — saving session state so the user can resume after adjusting the limit.

✨ Insight · The cost tracker can also estimate remaining turns: divide remaining budget by average cost per turn. This lets the agent prioritize — if only 3 turns remain, skip exploration and go straight to implementation.

Rate Limit Awareness

API responses include x-ratelimit-* headers: remaining tokens, total limit, and reset time. The tracker computes utilization (tokens used / total limit) relative to time remaining in the rate limit window. If utilization is ahead of the time curve, it triggers an early warning — the agent can slow down or batch operations to avoid hitting the hard limit.

Cache Prefix Ordering for Maximum Hit Rate

The Anthropic API caches prompt prefixes — only the leading tokens of a request are eligible for caching. This means the order of content in your request directly determines cache efficiency. The optimal layout, per the Anthropic prompt caching docs, is: stable system prompt first, then tool definitions (sorted deterministically by name), then conversation history (variable). Tool definitions must be sorted the same way on every call — if the tool list is built from an unordered object, two calls with identical tools but different insertion order produce different prefixes and get zero cache hits. A cache miss on a 10K-token system prompt costs $0.15 per call on Opus; over 50 turns that is $7.50 wasted. The fix is a single tools.sort((a, b) => a.name.localeCompare(b.name)) before building the request.
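The fix can be sketched as follows — the `ToolDef` shape and `buildToolList` name are hypothetical; the essential part is the deterministic sort:

```typescript
// Minimal tool shape for illustration.
interface ToolDef { name: string; description: string }

// Build the tool list in a deterministic order. Object key order depends on
// insertion order, so sorting by name guarantees two calls with identical
// tools produce a byte-identical (cacheable) prefix.
function buildToolList(tools: Record<string, ToolDef>): ToolDef[] {
  return Object.values(tools).sort((a, b) => a.name.localeCompare(b.name));
}
```

Two requests built from the same tools in different insertion order now serialize identically, so the prefix stays cache-eligible.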

Cost Attribution to Tool Calls

Not all API calls cost the same. A call that triggers a 5-tool chain (Read + Grep + Bash + Edit + Bash) costs more in output tokens than one that answers directly — the model must generate tool_use JSON for each tool invocation, and tool results feed back as input tokens in the next call. Attributing cost at the tool level (rather than just per-session) reveals which tools are most expensive to invoke. In practice, Bash is typically the costliest: it produces verbose stdout that becomes large tool_result blocks in the next turn's input. Truncating Bash output (e.g., keeping only the last 2K lines of a long test run) is one of the highest ROI cost optimizations in an agent system.
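The truncation idea can be sketched like this — the 2K-line cap matches the example above, but the marker format and function name are assumptions, not Claude Code's actual behavior:

```typescript
// Keep only the tail of verbose tool output before it becomes
// next-turn input tokens. Most diagnostic value (test failures,
// error messages) tends to be at the end of stdout anyway.
function truncateToolOutput(stdout: string, maxLines = 2_000): string {
  const lines = stdout.split("\n");
  if (lines.length <= maxLines) return stdout;
  const dropped = lines.length - maxLines;
  return `[... ${dropped} lines truncated ...]\n` + lines.slice(-maxLines).join("\n");
}
```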

Quick Check

Why are cached input tokens so much cheaper than uncached ones?

📐 Key Code Patterns

Per-Model Pricing Table

typescript
// Per-model pricing (approximate, illustrative)
interface ModelPricing {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

const PRICING: Record<string, ModelPricing> = {
  "claude-opus-4": {
    input:      15.00 / 1_000_000,  // $15/M tokens
    output:     75.00 / 1_000_000,  // $75/M tokens
    cacheRead:   1.50 / 1_000_000,  // $1.50/M (10x cheaper!)
    cacheWrite: 18.75 / 1_000_000,
  },
  "claude-sonnet-4": {
    input:      3.00 / 1_000_000,
    output:    15.00 / 1_000_000,
    cacheRead:  0.30 / 1_000_000,
    cacheWrite: 3.75 / 1_000_000,
  },
};

Cost Tracker

typescript
class BudgetExceededError extends Error {}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

class CostTracker {
  private totalCost: number = 0;
  private turnCosts: number[] = [];

  constructor(
    private model: string,
    private maxBudget?: number,
  ) {}

  // Called after every API response
  recordUsage(usage: TokenUsage): number {
    const pricing = PRICING[this.model];
    const cost =
      usage.inputTokens * pricing.input +
      usage.outputTokens * pricing.output +
      usage.cacheReadTokens * pricing.cacheRead +
      usage.cacheCreationTokens * pricing.cacheWrite;

    this.totalCost += cost;
    this.turnCosts.push(cost);

    if (this.maxBudget && this.totalCost >= this.maxBudget) {
      throw new BudgetExceededError(
        `Budget $${this.maxBudget.toFixed(2)} exceeded (spent $${this.totalCost.toFixed(2)})`
      );
    }
    return cost;
  }

  // Predict how many more turns the budget allows
  estimateRemainingTurns(): number {
    if (!this.turnCosts.length || !this.maxBudget) return Infinity;
    const avgCost = this.turnCosts.reduce((a, b) => a + b, 0) / this.turnCosts.length;
    const remaining = this.maxBudget - this.totalCost;
    return Math.floor(remaining / avgCost);
  }
}

Rate Limit Tracking

typescript
// Fraction of the current rate-limit window already elapsed
// (0 right after a reset, 1 at the next reset). The 60s window
// length is an assumption; adjust to your API tier.
function timeElapsedPct(resetAt: Date, windowMs = 60_000): number {
  const remainingMs = Math.max(0, resetAt.getTime() - Date.now());
  return 1 - Math.min(1, remainingMs / windowMs);
}

class RateLimitTracker {
  private remaining: number = 0;
  private limit: number = 0;
  private resetAt: Date = new Date();

  // Parse rate limit headers from API response
  update(headers: Record<string, string>): string {
    this.remaining = parseInt(headers["x-ratelimit-remaining-tokens"], 10);
    this.limit = parseInt(headers["x-ratelimit-limit-tokens"], 10);
    // Assumes the reset header is a timestamp Date can parse
    this.resetAt = new Date(headers["x-ratelimit-reset"]);

    // Are we consuming tokens faster than the window is elapsing?
    const utilization = 1.0 - this.remaining / this.limit;
    const timePct = timeElapsedPct(this.resetAt);

    if (utilization > timePct + 0.1) {
      return "WARNING: approaching rate limit";
    }
    return "OK";
  }

  // True if we should slow down to avoid the hard limit
  shouldThrottle(): boolean {
    return this.remaining < this.limit * 0.1; // less than 10% remaining
  }
}
🔧 Break It — See What Happens

  • No budget limits
  • No cost visibility
📊 Real-World Numbers

| Metric | Value |
|---|---|
| Opus input cost | $15.00 / M tokens |
| Opus output cost | $75.00 / M tokens |
| Cache read cost | $1.50 / M tokens (10x cheaper) |
| Typical session cost | $0.50 – $5.00 |
| Output vs input ratio | 5x more expensive |
| Cache savings per 50-turn session | ~$7 on Opus (10K token prefix) |
✨ Insight · At $75/M for Opus output, verbosity is the dominant cost lever: an agent generating 500 output tokens per turn costs 5x less in output than one generating 2,500. Over 50 turns, that's $1.88 vs $9.38 in output costs alone — conciseness is a cost optimization.
🧠 Key Takeaways

What to remember for interviews

  1. Every API response returns four token counts (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens) — each billed at a different rate that must be multiplied separately to get true cost.
  2. Cached input tokens cost ~10x less than uncached ($1.50/M vs $15/M for Opus) because the API reuses the KV cache from a previous request, skipping GPU computation.
  3. Tool definitions must be sorted deterministically before each request — different insertion order produces a different prefix hash and destroys cache hits worth $7+ per 50-turn session.
  4. Output tokens cost 5x more than input on Claude models, making agent verbosity the dominant cost driver — "be concise" in a system prompt is a direct cost optimization.
  5. Budget enforcement via maxBudgetUsd raises BudgetExceededError after each API call; proactive rate-limit tracking via x-ratelimit-remaining headers slows requests before hitting the hard limit.
Derivation

Which token type appears in an API usage object and is billed at the cheapest rate?

Derivation

An Opus session processes 100K input tokens and 10K output tokens with 0% cache hit rate. A Sonnet session does the same. What is the Opus-to-Sonnet cost ratio for this session?

Derivation

An agent has a 10K-token system prompt repeated on every call. With a 90% cache hit rate over 50 turns on Opus, what is the approximate cache saving vs no caching?

Derivation

The budget gate (maxBudgetUsd) is checked after every API response. An agent makes 3 calls before the check fires. Each call costs $4 against a $10 limit. What is the maximum overage?
📚 Further Reading

🎯 Interview Questions


Design a cost tracking system for an AI agent that handles multiple pricing tiers.
★★★ · Anthropic, Databricks

How does prompt caching affect the economics of AI agent systems?
★☆☆ · Google, OpenAI

What's the right granularity for budget limits — per-turn, per-session, or per-project?
★★☆ · Anthropic

Design a cost tracking system that predicts when a conversation will exceed a budget threshold and suggests cheaper alternatives.
★★★ · OpenAI