
Transformer Math

Module 76 · Design Reviews

🏗️ Case: Design an Agent Platform

An agent that spawns agents — where does the budget live?

Most systems in this series have a clear cost unit. Agent platforms do not. One user task can fan out into child agents, tools, and model calls. The core question is: where do you put the spend cap when one API call can fan out into hundreds of child calls?

The answer is the trajectory boundary, not the step or token. Budget the user-facing task, roll child spend into the parent, and stop gracefully when the envelope is exhausted.

This case study covers sandboxing, capability-scoped tools, trajectory eval, and the spend-control architecture around that unit. For a concrete real-world agent system, see Coding Agent Case Study; for cross-system failure taxonomy, see Failure Taxonomy Comparison.

📋

Requirements & SLOs

Working backwards from the tenant

“A developer registers their agent with our platform, attaches a $50 monthly spend cap, and grants it access to the web-search and code-execution tools. Their end-users trigger the agent by clicking ‘research this topic’. The agent runs for a minute or two, calls several tools, optionally spawns a sub-agent to summarize a document, and returns a result. The developer sees spend in real time on their dashboard. If the agent hits the cap mid-trajectory, it receives a graceful budget-exceeded signal and returns a partial result rather than erroring silently. Tool-call logs are retained for 90 days for audit. Any incident can be replayed from the trajectory store within one hour.”

SLO table

| Metric | Target | Priority | Why this value |
| --- | --- | --- | --- |
| Sandbox escape rate | 0 | P0 (existential) | One tenant accessing another tenant's data or compute ends the product. Non-negotiable ceiling. |
| Per-tenant spend cap enforcement | Within 1 step of cap | P0 | Overspend by more than one tool-call equivalent destroys tenant trust and creates direct financial liability. |
| Tool-call audit retention | ≥90 days | P1 | Enterprise compliance and SOC 2 requirements. 90 days covers most incident SLA windows. |
| Trajectory replay availability | ≤1h for P1 incidents | P1 | On-call engineers need full step replay to diagnose runaway agents; without it, MTTR doubles. |
| Agent p95 start latency | <2s to first tool call | P2 | User perception of “instant” response for interactive tasks. Accommodates microVM boot + LLM TTFT. |
| Platform availability | 99.9% / month | P2 | Standard SaaS enterprise SLA. 43 min downtime / month. |
✨ Insight · Why sandbox escape is existential but slow start is P2. An agent platform is a multi-tenant system. A sandbox escape lets one tenant read another's trajectory store, tool credentials, or model outputs — this is a data breach. The company does not survive this as a hosted platform. Slow start (2.5s instead of 2s) is annoying; a sandbox escape is company-ending. Prioritization by blast radius, not by technical difficulty.
🧪

Eval Harness

Standard LLM evals measure final-answer quality on a single turn. Agent evals must measure the trajectory — the full sequence of decisions the agent made, not just what it returned at the end. Hamel Husain's practitioner framework, applied to multi-turn agents, produces four eval dimensions:

| Eval dimension | What it measures | Signal source |
| --- | --- | --- |
| End-to-end task success | Did the agent return a result that satisfies the user intent? LLM-judge on golden (task, ideal-output) pairs. | Offline on sampled trajectories + online shadow judge (5%) |
| Tool-call correctness | Precision and recall of tool selections vs. golden trajectory. Did the agent call the right tools in the right order without spurious calls? | Offline — compare tool-call sequence to annotated golden path |
| Spend efficiency | Useful-work-$ / total-$ per completed task. Useful work = tasks that pass the end-to-end judge. A bloated agent that loops through extra tool calls will score low here even with high task success. | Online — computed per trajectory from the spend ledger |
| Safety adversarial suite | Does the agent refuse prompt-injection attacks, cross-tenant data access attempts, and tool-scope escalation requests? Binary pass/fail on a curated red-team set. | Offline CI gate on every deploy + quarterly manual red-team |
💡 Tip · Why trajectory eval, not final-answer eval. An agent that succeeds by burning 10× more tokens than necessary, or that selected the correct answer after four wrong tool calls, looks perfect on final-answer eval. Trajectory eval catches it: tool-call P/R is low, spend efficiency is low. These are the agents that blow past budgets in production. Hamel Husain's core argument: measuring only the output is measuring only the last inch of a mile-long run.
🧮

Back-of-Envelope

Scenario: 1,000 concurrent agents, average 20 tool calls each, 800 input / 200 output tokens per model call, model mix 80% Haiku (fast, cheap) + 20% Sonnet (reasoning steps). Seed the calculator below with the user-visible QPS, then read the gotcha — the number the calculator gives is wrong by the amplification factor.
| Calculator field | Value |
| --- | --- |
| Model size | 7B |
| GPU type | A100-80GB |
| QPS target | 1,000 |
| Input tokens | 800 |
| Output tokens | 200 |
| Cache hit rate | 10% |
| Model weights (FP16) | 14 GB |
| KV cache / request | 134.2 MB (1,000 tokens) |
| Tokens/sec per GPU | 2,400 |
| Effective QPS (after cache) | 900 |
| GPUs needed | 75 |
| Est. p95 latency | 1.15 s |
| GPU memory usage | 19% |
| Compute utilization | 100% |
| Bottleneck | Compute/bandwidth |
| Monthly cost | $109,500 |

⚠ Warning · Gotcha: The calculator counts 1,000 QPS. But each user-visible task issues 20 model calls. True LLM QPS = 20,000. Agents amplify LLM calls 10–50× per user-visible task — the calculator undercounts raw LLM demand by that same factor. Design for the LLM call rate, not the user task rate.
⚠ Warning · The amplification trap. Every capacity plan for an agent platform that starts from “user tasks per second” is wrong by the average tool-call depth. For 1,000 concurrent agents with 20 tool calls each, the real LLM QPS is 20,000 — before accounting for sub-agents. A naive design that provisions for 1,000 QPS at the LLM gateway will brown out immediately. Always derive LLM gateway capacity from user_tasks × avg_llm_calls_per_task × (1 + avg_child_agent_depth).
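The capacity formula in the warning above can be sketched directly. This is a minimal illustration (the function name and the 0.5-child-depth figure are assumptions, not numbers from the scenario):

```python
def llm_gateway_qps(user_tasks_per_sec: float,
                    avg_llm_calls_per_task: float,
                    avg_child_agent_depth: float) -> float:
    """True LLM call rate at the gateway -- NOT the user-visible task rate."""
    return user_tasks_per_sec * avg_llm_calls_per_task * (1 + avg_child_agent_depth)

# 1,000 user tasks/s, 20 LLM calls per task, no sub-agents yet:
print(llm_gateway_qps(1000, 20, 0))    # 20000.0 -- the 20x amplification above
# Same load if each task additionally spawns 0.5 child agents on average:
print(llm_gateway_qps(1000, 20, 0.5))  # 30000.0
```

Provision the LLM gateway (and its token-rate quotas) from this number, not from the user-task QPS the load balancer sees.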

Baseline: multi-agent platform task runner — 300 GPUs @ $3.50/hr, p99 45,000 ms, 500 QPS, 55% cache hit.

Per-task latency (P99) and cost per agent turn. Cache hit rate captures prompt-prefix reuse across repeated tool schemas and system prompts. Adjust QPS to see how the 20× LLM amplification factor changes GPU demand.

| Calculator field | Value |
| --- | --- |
| p99 latency target | 45,000 ms |
| Peak QPS | 500 |
| Cache hit rate | 55% |
| Effective QPS (after cache) | 225 |
| Latency-batch factor | 1.00× |
| GPUs needed | 300 (+0% vs. baseline) |
| Hourly burn | $1,050 (+0% vs. baseline) |
| Cost / request | $0.00058 |
| Monthly burn (24×7) | $766,500 |
| Bottleneck | Balanced |
⚠ Warning · Gotcha: The 500 QPS here is user-visible task rate. True LLM call rate is 500 × 20 tool calls = 10,000 QPS at the model gateway. A 55% cache hit rate on shared system prompts meaningfully reduces that load — but only if the cache key is stable across agent turns.
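One way to keep the prompt-prefix cache key stable across agent turns, as the gotcha requires, is to hash only the static prefix (system prompt + serialized tool schemas) and exclude anything that changes per turn. This is an illustrative sketch, not a specific vendor's caching API:

```python
import hashlib
import json

def prefix_cache_key(system_prompt: str, tool_schemas: list[dict]) -> str:
    """Cache key over the static prompt prefix only.

    Canonical JSON (sorted keys, fixed separators) so dict ordering or
    whitespace can't silently change the key between turns.
    """
    canonical = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((system_prompt + canonical).encode()).hexdigest()

tools = [{"name": "web_search", "params": {"query": "string"}}]
k1 = prefix_cache_key("You are a research agent.", tools)
k2 = prefix_cache_key("You are a research agent.", tools)
assert k1 == k2  # stable across turns -> the shared prefix actually hits cache
```

If the key instead included a timestamp, trajectory ID, or the growing conversation history, the 55% hit rate above would collapse toward zero.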
🏛️

Architecture

Eight components, one reason for each to exist. The three distinguishing components — microVM runner, capability-scoped tool registry, and trajectory store — are what separate a hosted agent platform from a thin wrapper around an LLM API.

Hosted Agent Platform Architecture

The Trajectory Orchestrator is the spend-control point. The eight components: Tenant Control Plane → API Gateway → Trajectory Orchestrator → Agent Runner (microVM) → Tool Registry (capability-scoped) → LLM Gateway → Trajectory Store → Spend + Eval Dashboard.

Justification table

| Component | Why it must exist | What breaks without it |
| --- | --- | --- |
| Tenant Control Plane | Per-tenant config isolation: spend caps, tool scopes, retention policy. Centralizes the source of truth for what each tenant is allowed to do. | Caps are enforced inconsistently; a config change has to touch every downstream component. |
| API Gateway | Auth, rate limit, trajectory budget injection at session start. The choke point before any agent compute is allocated. | A runaway tenant can exhaust the shared LLM gateway before any per-trajectory cap fires. |
| Trajectory Orchestrator | Owns step sequencing and the trajectory budget ledger. Every spend event — model call, tool call, child-agent spawn — decrements the budget here. Sends graceful stop signal at 100%. | Spend caps have no single enforcement point; child-agent spend escapes the parent budget. |
| Agent Runner (microVM) | Firecracker microVM per session: tenant isolation via separate kernel, seccomp, egress network policy. The isolation boundary that makes multi-tenant safe. | Shared container space — a prompt-injected agent reads another tenant's environment variables, credentials, or filesystem state. |
| Tool Registry (capability-scoped) | Presents tools as capability tokens. An agent can only call tools within the scope granted at session start — it never sees raw credentials. Prevents scope creep and prompt-injection escalation. | An agent granted “web search” calls “send email” via prompt injection. |
| LLM Gateway | Fan-out across model tiers (Haiku / Sonnet / Opus), token metering, spend deduction to the trajectory budget envelope. | Model costs go unaccounted until the billing cycle ends; no real-time cap enforcement. |
| Trajectory Store | Immutable append-only log of every step, tool call, model call, and spend event. Enables replay, incident debugging, post-hoc eval, and spend attribution. The single component most teams skip and most regret. | Incidents can't be replayed, spend can't be attributed, eval can't be run post-hoc. MTTR doubles. |
| Spend + Eval Dashboard | Real-time per-tenant spend ledger, trajectory success rate, tool-call P/R, adversarial pass rate. The SLO truth surface for operators and tenants alike. | Operators discover cost overruns from billing, not from monitoring — hours or days late. |
🔬

Deep Dives


Two components matter most here: sandbox isolation and trajectory-boundary spend control.

Deep Dive A — Sandboxing with Firecracker microVMs

What it is. Each agent session runs inside a Firecracker microVM: a lightweight KVM-based virtual machine that boots a minimal Linux kernel in ~125ms (per Agache et al., NSDI 2020). Unlike Docker containers, microVMs have a full separate kernel — a kernel exploit inside the VM does not escape to the host. AWS deployed Firecracker in production for Lambda and Fargate, demonstrating that the boot overhead is acceptable for short-lived, high-frequency workloads at cloud scale.

Approach. The Firecracker VMM (Virtual Machine Monitor) uses KVM as the hypervisor and exposes a minimal device model: virtio-net, virtio-block, and a serial console — nothing else. The stripped-down device model is not just for speed; it reduces the kernel attack surface exposed to the guest. The guest kernel is a hardened minimal Linux image. Combined with seccomp BPF filters applied inside the guest (blocking dangerous syscalls like ptrace, mount, and raw socket creation), the effective attack path to a host escape requires defeating both the VMM isolation and the seccomp layer simultaneously.

Trade-off (non-obvious). The textbook trade-off is boot latency (microVM ~125ms vs. container ~30ms). The non-obvious one is overlay filesystem overhead. Each VM needs its own root filesystem. Naively, that is a full Linux image per session. In practice, the base image is a read-only snapshot shared across all VMs (copy-on-write overlay via OverlayFS or a custom block device). The non-obvious failure: if your agent workload generates large writes to the overlay (e.g., compiling code, downloading models), the CoW layer grows unboundedly during the session. You need an explicit disk quota on the guest overlay volume — otherwise a single agent with a large write workload can exhaust the host's disk and cause neighbor-noisy eviction of other sessions on the same host. This is orthogonal to the security boundary; it is a resource-accounting gap.

  • Isolation guarantees. Separate kernel namespace, no shared memory, egress network policy enforced at the hypervisor layer. Each VM's network interface routes through a dedicated NAT that only allows calls to the tool registry endpoint and the LLM gateway — not arbitrary internet access.
  • Capability token scheme. At session start, the Trajectory Orchestrator mints short-lived HMAC-signed tokens for each tool in the tenant's granted scope. The agent runner presents these tokens on every tool invocation; the tool registry validates signature + expiry + trajectory ID before executing. A compromised agent can invoke permitted tools within the session window — it cannot escalate to unpermitted tools or steal raw credentials (assuming the orchestrator itself is not compromised; TOCTOU races on the trajectory-ID check are a known gap).
  • Boot latency trade-off. Cold boot ~125ms; snapshot restore ~5ms (Firecracker supports memory snapshots). Use pre-warmed VM pools for interactive-tier tenants; cold boot is acceptable for batch-tier. This keeps p95 start latency well within the 2s SLO.
  • Failure mode: sandbox escape. Discovered through bug bounty or red-team, not through monitoring. Defense-in-depth has three independent layers: (1) seccomp filter blocks dangerous syscalls inside the VM, (2) egress network policy blocks exfiltration channels, (3) capability tokens prevent tool-scope escalation. All three must be defeated simultaneously for a meaningful escape — the attacker cost is high. In practice, historical hypervisor CVEs (e.g., CVE-2019-14835 in the virtio driver) have been the most common escape vector; the minimal Firecracker device model substantially shrinks this attack surface relative to QEMU.
  • Detection metric. Sandbox escape has no in-band detection signal because the attack reads, not writes. The leading indicator is anomalous egress traffic: if an agent session makes outbound connections to IPs not in the allowed list (tool registry + LLM gateway), that is an immediate P0 alert. Threshold: any non-allowlisted egress event. Secondary metric: rate of seccomp SIGSYS signals per session — a spike indicates a syscall probing attempt. Target: 0 SIGSYS per session in steady state; alert on any occurrence. Both metrics require hypervisor-layer instrumentation, not guest-layer logging (a compromised guest cannot be trusted to self-report).
  • Mitigation. If anomalous egress is detected: (1) immediately freeze the VM (SIGSTOP the Firecracker process), preserving memory for forensics; (2) rotate all capability tokens issued to that session; (3) quarantine the host from the pool pending audit. Secondary effect: freezing the VM terminates the tenant's in-flight trajectory — they receive an error. This is the acceptable trade-off: the blast radius of a confirmed escape outweighs the cost of one aborted task.
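The capability-token scheme above can be sketched in a few lines. This is an illustrative sketch under stated assumptions — the claim field names, TTL, and single signing key are hypothetical (a production orchestrator would use per-tenant keys from a KMS):

```python
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"orchestrator-secret"  # assumption: per-tenant key from a KMS in practice

def mint_token(tenant: str, trajectory_id: str, tool: str, ttl_s: int = 300) -> dict:
    """Orchestrator side: short-lived HMAC-signed token scoped to one tool."""
    claims = {"tenant": tenant, "trajectory": trajectory_id,
              "tool": tool, "exp": int(time.time()) + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "sig": sig}

def validate(token: dict, tool: str, trajectory_id: str) -> bool:
    """Tool-registry side: check signature, expiry, scope, and trajectory ID."""
    payload = json.dumps(token["claims"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, token["sig"])       # signature
            and token["claims"]["exp"] > time.time()          # expiry
            and token["claims"]["tool"] == tool               # scope
            and token["claims"]["trajectory"] == trajectory_id)

tok = mint_token("tenant-a", "traj-123", "web_search")
assert validate(tok, "web_search", "traj-123")
assert not validate(tok, "send_email", "traj-123")  # scope escalation blocked
```

The agent never holds raw tool credentials — only these scoped, expiring tokens, so a prompt-injected agent cannot escalate beyond the granted scope.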

Real-world example. E2B's engineering blog on building their cloud sandbox for AI agents covers exactly this design in production. E2B uses Firecracker microVMs as the execution primitive for each sandbox session, with a custom overlay filesystem, network policy enforced at the hypervisor layer, and a file-descriptor-based protocol (not HTTP) for tool invocations inside the VM. Their post-launch findings: the dominant operational cost was not boot latency but overlay disk management — agents that download libraries mid-session required explicit quota enforcement to prevent host-disk exhaustion. They also found that pre-warming a pool of 50–100 VMs per region reduced p99 cold-start latency from ~200ms to ~12ms (snapshot restore), which is essential for interactive-tier tenants.

Deep Dive B — Spend Control at Trajectory Boundary

Approach. The trajectory boundary is the correct spend-control unit because it matches the user's mental model of cost. A user initiates one task — “research this topic,” “fix this bug,” “process this document” — and expects to be charged for that task, not for each individual model call or tool invocation the agent makes internally. Per-model-call caps kill tasks arbitrarily at step N with no graceful partial-result path; per-tool-call caps have the same problem at even finer granularity.

The orchestrator implements this with a budget envelope: at dispatch, the Trajectory Orchestrator reads the tenant's remaining monthly budget and assigns a per-trajectory cap: min(tenant_remaining_budget, task_type_default_cap). For interactive tasks the default cap might be on the order of $0.10; for deep research tasks, $1.00. Tenants configure these defaults in the Tenant Control Plane. Every model call deducts estimated cost (input_tokens × input_price + output_tokens × output_price) from the envelope atomically — before the call is dispatched, not after. Every tool call deducts its estimated cost (qualitative estimates — actual costs vary by tool provider). Decrementing before dispatch means the orchestrator cannot overshoot by more than one step even if the call fails or the response arrives out of order.
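The pre-dispatch deduction and credit-back mechanics above can be sketched as follows. A minimal sketch, assuming illustrative prices and caps (the class and method names are hypothetical, not the platform's actual API):

```python
class BudgetExceeded(Exception):
    """Raised to trigger the graceful-stop path, not a hard kill."""

class TrajectoryEnvelope:
    def __init__(self, cap_usd: float):
        self.remaining = cap_usd

    def reserve(self, pessimistic_cost: float) -> None:
        """Deduct BEFORE dispatch, so overshoot is bounded by one step."""
        if pessimistic_cost > self.remaining:
            raise BudgetExceeded("budget exhausted: return partial result")
        self.remaining -= pessimistic_cost

    def reconcile(self, pessimistic_cost: float, actual_cost: float) -> None:
        """Credit back the difference once real token counts arrive."""
        self.remaining += pessimistic_cost - actual_cost

env = TrajectoryEnvelope(cap_usd=0.10)         # interactive-task default cap
worst_case = 0.03                              # max_output_tokens x output price
env.reserve(worst_case)                        # before the model call dispatches
env.reconcile(worst_case, actual_cost=0.011)   # after the response arrives
print(round(env.remaining, 3))                 # 0.089
```

Because `reserve` runs before dispatch with the pessimistic upper bound, the trajectory can never overshoot its cap by more than one step's worst-case estimate — exactly the P0 SLO.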

Trade-off (non-obvious). The textbook trade-off is granularity: finer budgets give tighter cost control but more accounting writes. The non-obvious one is fair-share scheduler starvation. When a platform hosts hundreds of tenants sharing an LLM gateway, the trajectory budget only controls the tenant's own spend — it does not control how much queue time the tenant's trajectories consume in the shared LLM gateway. A tenant with a large budget and many parallel trajectories can saturate the gateway's token-per-second capacity, starving tenants with small budgets even though neither has exceeded their spend cap. The fix is a second layer of control: a per-tenant token-rate quota at the LLM gateway, distinct from the per-trajectory dollar budget. The dollar budget answers “how much can this task cost?”; the token-rate quota answers “how fast can this tenant consume tokens?” Both are necessary; most first-version platforms only implement one.

  • Child-agent rollup. When the parent agent spawns a child agent, the parent's remaining budget is split: the child receives a sub-budget carved from the parent's remaining envelope, and all child spend rolls up to the parent's ledger in real time. If the child exhausts its sub-budget, it receives a graceful stop signal and returns a partial result to the parent. The parent can choose to continue (deducting from its remaining budget) or return to the user. The critical implementation detail: child spend must be reserved against the parent budget at child-dispatch time, not at child-return time. If you reserve at return, a parent that spawns 50 parallel children can commit 50 × sub-budget before any child has returned — exceeding the parent's envelope by an unbounded factor.
  • Graceful stop signal. At 100% budget consumption, the Trajectory Orchestrator injects a synthetic tool result into the agent's context: {"type":"budget_exceeded","message":"Task budget exhausted. Return partial result."}. A well-designed agent handles this and returns what it has. An agent that ignores it is killed by the orchestrator after one additional model call (a grace turn). The grace turn exists because killing the agent mid-generation produces a truncated response with no context for the user; the grace turn allows the agent to emit a coherent terminal message.
  • Failure mode: budget reconciliation lag. Budget deductions are optimistic (estimated cost before the call) and reconciled against actual costs asynchronously. During the reconciliation window — typically 100–500ms — a trajectory can overshoot by up to one step's worth of spend. For a model call that returns a large context window, one step's overshoot can be non-trivial. The mitigation: use the pessimistic upper bound of the call tier (e.g., max output tokens × price) as the pre-deduction estimate, then credit back the difference at reconciliation. This guarantees the trajectory never overshoots by more than one step's pessimistic estimate — which is the SLO spec.
  • Detection metric. The primary signal for spend-control drift is trajectory overshoot rate: the fraction of completed trajectories whose actual cost exceeded their assigned budget by more than one step's pessimistic estimate. Target: <0.1% of trajectories. A spike in overshoot rate indicates reconciliation lag has grown beyond the one-step window — typically caused by orchestrator backpressure at high QPS. Secondary metric: per-tenant monthly spend vs. cap ratio, computed daily. A tenant consistently reaching 95%+ of their monthly cap before the billing cycle ends is a leading indicator that their default per-trajectory cap is too high for their task mix, and they will hit the monthly cap mid-month without warning. Alert threshold: any tenant whose projected end-of-month spend exceeds their cap by more than 20% based on the current burn rate, triggering a cap-adjustment recommendation.
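The reserve-at-dispatch rule for child agents described above can be sketched directly. An illustrative sketch (class and method names are assumptions): the sub-budget leaves the parent envelope when the child is spawned, so parallel children cannot over-commit it:

```python
class Envelope:
    def __init__(self, cap: float):
        self.remaining = cap

    def spawn_child(self, sub_budget: float) -> "Envelope":
        """Reserve the child's sub-budget at DISPATCH time, not at return."""
        if sub_budget > self.remaining:
            raise RuntimeError("budget_exceeded: cannot spawn child")
        self.remaining -= sub_budget           # reserved before the child runs
        return Envelope(sub_budget)

    def child_returned(self, child: "Envelope") -> None:
        """Refund whatever the child did not spend."""
        self.remaining += child.remaining

parent = Envelope(cap=1.00)
children = [parent.spawn_child(0.10) for _ in range(9)]  # 9 parallel children
print(round(parent.remaining, 2))  # 0.1 -- a 10th child at $0.15 would fail fast
```

Under reserve-at-return semantics, the same parent could dispatch 50 children and commit $5.00 against a $1.00 envelope before any of them finished — the recursive-agent trap described below.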

Real-world example. LangSmith (docs.smith.langchain.com) is the closest public reference for trajectory-boundary spend tracking. LangSmith models execution as a run tree: each root run corresponds to a user-visible task (equivalent to a trajectory), and child runs (LLM calls, tool calls, chain invocations) roll their token and latency costs up into the parent. The run tree structure makes it possible to answer “how much did this user task cost in total, including all nested calls?” — the exact question the trajectory budget envelope must answer at enforcement time. LangSmith's production architecture also surfaces a practical scaling constraint: at high run volume, the write path for cost roll-up becomes a bottleneck. Their mitigation (documented in their tracing architecture) is to buffer child-run cost events in a local aggregator and flush to the parent counter in batches — the same batched, asynchronous reconciliation pattern described in the deduction mechanics above. The lesson: spend-control architecture and observability architecture converge at scale; designing them separately produces redundant write paths and inconsistent cost numbers between the spend enforcer and the monitoring dashboard.

✨ Insight · The recursive-agent trap. The most common spend-control bug on agent platforms: the parent agent spawns 50 child agents to parallelize a research task. Each child is below the per-trajectory cap. The parent has not been charged for child spend because child budgets were tracked independently. Total cost: 50 × per-child budget, which far exceeds the parent's cap. Fix: always roll child spend into the parent's envelope before the child is dispatched, not after it returns.
Quick Check

An agent platform tenant sets a $10 monthly spend cap. Their agent runs a task that spawns 5 child agents, each of which calls 10 tools. The per-trajectory cap is $0.50. Which enforcement design correctly contains total spend to ≤$10/month?

🔧

Break It

Three components, each removed independently. Every failure mode maps back to a metric in the eval harness or a P0/P1 SLO in the requirements table.

Remove microVM isolation

Agents run in shared containers on the same host. A prompt-injection attack against Tenant A's agent causes it to read /proc/1/environ and exfiltrate environment variables — including the LLM API keys and trajectory store credentials of every other agent running on that host. Blast radius: all tenants on the affected host, potentially the entire platform if credentials are shared. Detection: bug bounty or red-team report — not monitoring. No in-band signal because the attack reads, not writes. SLO violated: sandbox escape rate = 0 (P0 — existential).

Spend cap at per-call, not per-task

Each model call is capped at $0.005. An agent that runs a 40-step tool loop never exceeds $0.005 per step, so the cap never fires. Total task cost: $0.20. If that task recurses into 10 child agents, total: $2.00. A tenant with a $10 monthly cap hits it in 5 parent tasks — which may correspond to 5 user clicks. Or the cap is set high enough that individual calls never hit it, making it useless. Detection: spend efficiency metric on the dashboard — useful work per dollar will be low, but the raw spend overrun may only appear in the monthly billing reconciliation. SLO violated: per-tenant spend cap enforced within 1 step (P0).

Skip the trajectory store

A tenant's agent ran for 3 hours, spawned 20 child agents, spent $47 (above their $50 cap by luck), and returned a partially wrong answer. There is now a P1 incident. Without the trajectory store: (1) on-call cannot replay the trajectory to find where the agent diverged, (2) spend cannot be attributed to specific steps or child agents, (3) the eval team cannot run the golden-path comparison to confirm the bug is fixed before the next deploy. Detection: only through user complaint and log grep, not structured replay. SLO violated: trajectory replay within 1h for P1 incidents (P1); post-hoc eval blocked indefinitely.

💸

What Does a Bad Day Cost?

Agent platforms have a larger blast radius per incident than traditional LLM APIs because a single API call can spawn arbitrary downstream compute. Three incident modes, each with a dollar cost and a time cost:

  • Sandbox escape discovered via bug bounty (existential) — Direct financial cost: bug bounty payout ($10K–$100K for critical). Indirect: mandatory security audit, customer notifications under GDPR Article 33 (72-hour window), potential churn of enterprise tenants. In the worst case: regulatory action, class-action liability. This is the incident mode that ends hosted agent platform companies. There is no “detect and recover” — the damage is the data access that already happened.
  • Runaway recursive agent spawns 1,000 child agents before cap fires — Direct financial cost: 1,000 children × 20 steps × $0.001/step = $20 in LLM spend per runaway, but the real cost is GPU capacity consumption. At 1,000 simultaneous runaways, that is 1,000 × 1,000 children × 20 steps ≈ 20M extra model calls saturating the LLM gateway and degrading latency for all other tenants. Time to detection without the spend dashboard: next billing cycle. With real-time trajectory budget enforcement: 1 step (the spec). The difference is hours of degraded service vs. seconds.
  • Trajectory store outage during a P1 incident (can't debug) — Direct financial cost: engineers' time × MTTR extension. A 1h incident with trajectory replay becomes a 3–4h incident on log grep alone. At an all-in engineering cost of $500/hour and a 5-person incident response: $2,500 in engineering cost per extra hour. Indirect cost: SLA credits for affected tenants, trust damage if the incident recurs because the root cause was misdiagnosed.
⚠ Warning · Agent platforms amplify blast radius. A traditional LLM API: one bad request → one bad response. A hosted agent platform: one bad task dispatch → 50 child agents → 1,000 model calls → $200 in unaccounted spend, all before the user sees an error. The spend-control and sandboxing SLOs in this module are P0 specifically because the amplification factor makes every latent failure catastrophically larger than it would be in a stateless API.
🚨

On-call Runbook

Runaway tool loop across nested sub-agents

MTTR p50 / p99: Seconds with trajectory budget enforcement; hours without

Blast radius: 1,000+ child agents spawned before spend cap fires; LLM gateway saturated; all-tenant latency degrades 5–10×

  1. Detect — Real-time trajectory budget enforcement detects overage within 1 step; spend dashboard alerts in <5s; without enforcement: next billing cycle
  2. Escalate — Oncall paged; parent trajectory forcibly terminated; child microVMs evicted; tenant spend cap temporarily lowered pending investigation
  3. Rollback — Kill switch halts all in-flight trajectories for affected tenant; restore from last trajectory checkpoint if task is retriable
  4. Post — Add per-trajectory child-agent depth limit (e.g., max depth 3); add circuit breaker on LLM gateway per-tenant QPS; audit tool schemas for accidental recursion surface

Sandbox escape via unsanitized tool argument

MTTR p50 / p99: Containment in minutes; full remediation (audit + notification) 24–72h

Blast radius: Tenant data cross-contamination; potential access to host filesystem or other tenant microVMs; regulatory notification required under GDPR Art. 33

  1. Detect — seccomp violation logged immediately; egress network policy blocks unexpected outbound; detected within 1 request
  2. Escalate — Security oncall paged immediately; affected microVM quarantined; all sessions for affected tenant suspended pending audit
  3. Rollback — No rollback for data-access events — focus on containment; rotate all tenant API keys; initiate forensic audit of microVM logs
  4. Post — Harden tool argument validation (schema + allowlist); add integration test for sandbox escape attempts; review seccomp policy coverage

Trajectory eval regression hides quality drop

MTTR p50 / p99: Traffic shift in <5 min; root-cause diagnosis 2–6h

Blast radius: Silent model quality degradation ships to production; tenants notice via task failure rate increase; trust damage before detection

  1. Detect — Automated eval suite on new model version should catch; missed if eval coverage gaps exist or eval prompts are stale
  2. Escalate — Product oncall alerted by tenant complaint spike; eval team reviews trajectory store samples; A/B comparison against previous model version
  3. Rollback — Traffic shift back to previous model version; feature flag on model routing; no data loss — trajectories are immutable
  4. Post — Add golden-set regression tests covering top-10 tenant task types; gate model promotion on eval pass rate ≥99%; add eval freshness check (prompts updated within 30 days)
🏢

Company Lens

Anthropic's push

Expect deep drill on capability-limited tool design and trajectory oversight. Anthropic's published guidance on building effective agents emphasizes: tools should do exactly what they say, nothing more; agents should be able to pause and check in with humans before irreversible actions; and every deployed agent harness should have a kill switch that halts all in-flight trajectories. Interview questions will probe whether your tool registry prevents scope creep, how you handle a tool that silently expands its behavior, and what your human-in-the-loop checkpoint looks like for high-stakes actions (send email, execute payment, delete data). The safety adversarial eval suite is not optional at Anthropic — expect to justify its coverage and cadence.

OpenAI's push

Expect drill on the Assistants API-style platform design: thread management, run lifecycle, tool-call streaming contracts, and file attachment handling. OpenAI's Assistants API is the closest public reference for a hosted agent platform and has well-documented design decisions. Expect questions on: how you version tool schemas without breaking existing agents, how you handle a run that exceeds the max_prompt_tokens limit mid-trajectory, and how you expose observability to developers (OpenAI ships run steps as a first-class API resource, per platform.openai.com/docs/assistants — your trajectory store is the equivalent). Cost management questions will focus on how you surface token usage per-run to developers in real time, not just at billing.

🧠

Key Takeaways

What to remember for interviews

  1. The right spend-control unit is the trajectory boundary — the cost of one user-facing task. Per-step and per-token are too fine; per-trajectory is the unit the user and the budget both understand.
  2. Child-agent spend must roll up into the parent's budget envelope before dispatch, not after return — otherwise a recursive agent defeats the cap by spawning under-cap children in parallel.
  3. Sandbox isolation on a multi-tenant agent platform is existential, not operational. A sandbox escape is a data breach affecting all tenants on the host; defense-in-depth (seccomp + network policy + capability tokens) must defeat the attack at three independent layers.
  4. The trajectory store is the platform's flywheel: it powers incident replay, spend attribution, post-hoc eval, and the spend dashboard simultaneously. It is the most commonly skipped component and the one most teams regret skipping first.
  5. Capacity planning for agent platforms must derive LLM gateway QPS from user tasks × average LLM calls per task × (1 + average child-agent depth). Starting from user-visible QPS understates true LLM demand by 10–50×.
  6. Trajectory eval — measuring tool-call precision/recall and spend efficiency across the full episode — catches the bloated agents that look fine on final-answer benchmarks but blow past budgets in production.
🎯

Interview Questions


An interviewer asks: 'Where should a hosted agent platform enforce a tenant's spend cap — per model call, per tool call, or somewhere else?' Walk through the reasoning.

★★★ · Anthropic, OpenAI

Design the capability-token scheme for a tool registry on a multi-tenant agent platform. What does a token contain and how does the runner validate it?

★★★ · Anthropic, Google

Your trajectory store goes down during a P1 incident. What are the three compounding effects, and how does each one extend MTTR?

★★☆ · Anthropic, OpenAI

A Google interviewer asks: 'Firecracker microVMs add ~125ms boot latency. Your agent p95 start SLO is <2s to first tool call. Is microVM isolation worth the cost?' How do you reason through this?

★★★ · Google, Anthropic

Meta's interviewer asks: 'How do you design trajectory eval for an agent platform where agents can spawn other agents? Standard pass/fail on the leaf output doesn't work.'

★★★ · Meta, Anthropic
📚

Further Reading