🏗️ Case: Design an Agent Platform
An agent that spawns agents — where does the budget live?
Most systems in this series have a clear cost unit. Agent platforms do not: one user task can fan out into child agents, tools, and model calls. The core question is where to put the spend cap when a single API call can multiply into hundreds of child calls.
The answer is the trajectory boundary, not the step or token. Budget the user-facing task, roll child spend into the parent, and stop gracefully when the envelope is exhausted.
This case study covers sandboxing, capability-scoped tools, trajectory eval, and the spend-control architecture around that unit. For a concrete real-world agent system, see Coding Agent Case Study; for cross-system failure taxonomy, see Failure Taxonomy Comparison.
Requirements & SLOs
Working backwards from the tenant
SLO table
| Metric | Target | Priority | Why this value |
|---|---|---|---|
| Sandbox escape rate | 0 (P0 — existential) | P0 | One tenant accessing another tenant's data or compute ends the product. Non-negotiable ceiling. |
| Per-tenant spend cap enforcement | Within 1 step of cap | P0 | Overspend by more than one tool-call equivalent destroys tenant trust and creates direct financial liability. |
| Tool-call audit retention | ≥90 days | P1 | Enterprise compliance and SOC 2 requirements. 90 days covers most incident SLA windows. |
| Trajectory replay availability | ≤1h for P1 incidents | P1 | On-call engineers need full step replay to diagnose runaway agents; without it, MTTR doubles. |
| Agent p95 start latency | <2s to first tool call | P2 | User perception of “instant” response for interactive tasks. Accommodates microVM boot + LLM TTFT. |
| Platform availability | 99.9% / month | P2 | Standard SaaS enterprise SLA. 43 min downtime / month. |
Eval Harness
Standard LLM evals measure final-answer quality on a single turn. Agent evals must measure the trajectory — the full sequence of decisions the agent made, not just what it returned at the end. Hamel Husain's practitioner framework, applied to multi-turn agents, produces four eval dimensions:
| Eval dimension | What it measures | Signal source |
|---|---|---|
| End-to-end task success | Did the agent return a result that satisfies the user intent? LLM-judge on golden (task, ideal-output) pairs. | Offline on sampled trajectories + online shadow judge (5%) |
| Tool-call correctness | Precision and recall of tool selections vs. golden trajectory. Did the agent call the right tools in the right order without spurious calls? | Offline — compare tool call sequence to annotated golden path |
| Spend efficiency | Useful-work-$ / total-$ per completed task. Useful work = tasks that pass the end-to-end judge. A bloated agent that loops through extra tool calls will score low here even with high task success. | Online — computed per trajectory from the spend ledger |
| Safety adversarial suite | Does the agent refuse prompt-injection attacks, cross-tenant data access attempts, and tool-scope escalation requests? Binary pass/fail on a curated red-team set. | Offline CI gate on every deploy + quarterly manual red-team |
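As a concrete illustration of the tool-call correctness dimension, here is a minimal sketch of multiset precision/recall against a golden path. It is order-insensitive, which is a simplification of the ordered comparison described above, and all names are illustrative:

```python
# Sketch: tool-call precision/recall vs. an annotated golden trajectory.
# Order-insensitive multiset matching (a simplification); names are illustrative.
from collections import Counter

def tool_call_pr(golden: list[str], actual: list[str]) -> tuple[float, float]:
    """Precision/recall of the agent's tool calls against the golden path."""
    g, a = Counter(golden), Counter(actual)
    overlap = sum((g & a).values())  # tool calls matched against the golden path
    precision = overlap / len(actual) if actual else 1.0
    recall = overlap / len(golden) if golden else 1.0
    return precision, recall

# Example: the agent made one spurious extra search call.
p, r = tool_call_pr(
    golden=["web_search", "read_page", "summarize"],
    actual=["web_search", "web_search", "read_page", "summarize"],
)
# precision = 3/4 = 0.75, recall = 3/3 = 1.0
```

A production version would also score call ordering and argument correctness; this sketch only captures spurious and missing calls.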
Back-of-Envelope
Scenario: 1,000 concurrent agents, average 20 tool calls each, model mix 80% Haiku (fast, cheap) + 20% Sonnet (reasoning steps). The sizing below starts from the user-visible QPS; the gotcha is that this number understates true demand by the amplification factor.
| Quantity | Value |
|---|---|
| Model weights (FP16) | 14 GB |
| KV cache / request | 134.2 MB (1,000 tokens) |
| Tokens/sec per GPU | 2,400 |
| Effective QPS (after cache) | 900 |
| GPUs needed | 75 |
| Cost / month | $109,500 |
| Est. p95 latency | 1.15 s |
| GPU memory usage | 19% |
| Compute utilization | 100% |
| Bottleneck | Compute/bandwidth |
The true LLM-gateway load is user_tasks × avg_llm_calls_per_task × (1 + avg_child_agent_depth). Baseline: multi-agent platform task-runner — 300 GPUs @ $3.50/hr at p99 45,000 ms, 500 QPS, 55% cache hit. The headline figures are per-task latency (p99) and cost per agent turn; the cache hit rate captures prompt-prefix reuse across repeated tool schemas and system prompts. Scaling QPS shows how the 20× LLM amplification factor changes GPU demand.
| Quantity | Value |
|---|---|
| Effective QPS (after cache) | 225 |
| Latency-batch factor | 1.00× |
| GPUs needed | 300 (+0% latency vs. baseline) |
| Hourly burn | $1,050 (+0% vs. baseline) |
| Cost / request | $0.00058 |
| Monthly burn (24×7) | $766,500 |
| Bottleneck | Balanced |
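The amplification formula can be checked in a few lines; the task rate and child depth below are illustrative, not measured:

```python
# Back-of-envelope: derive LLM gateway load from the user-visible task rate.
# Formula from the text: user_tasks × avg_llm_calls_per_task × (1 + avg_child_agent_depth).
# The inputs (25 tasks/s, depth 1) are illustrative assumptions.

def llm_gateway_qps(user_task_qps: float,
                    avg_llm_calls_per_task: float,
                    avg_child_agent_depth: float) -> float:
    return user_task_qps * avg_llm_calls_per_task * (1 + avg_child_agent_depth)

# 25 user tasks/s, 20 LLM calls per task, average child-agent depth of 1:
qps = llm_gateway_qps(25, 20, 1)   # 25 × 20 × 2 = 1,000 LLM calls/s
# Sizing the gateway from the user-visible 25 QPS would understate demand 40×.
```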
Architecture
Eight components, one reason for each to exist. The three distinguishing components — microVM runner, capability-scoped tool registry, and trajectory store — are what separates a hosted agent platform from a thin wrapper around an LLM API.
Hosted Agent Platform Architecture
The Trajectory Orchestrator is the spend-control point.
Justification table
| Component | Why it must exist | What breaks without it |
|---|---|---|
| Tenant Control Plane | Per-tenant config isolation: spend caps, tool scopes, retention policy. Centralizes the source of truth for what each tenant is allowed to do. | Caps are enforced inconsistently; a config change has to touch every downstream component. |
| API Gateway | Auth, rate limit, trajectory budget injection at session start. The choke point before any agent compute is allocated. | A runaway tenant can exhaust the shared LLM gateway before any per-trajectory cap fires. |
| Trajectory Orchestrator | Owns step sequencing and the trajectory budget ledger. Every spend event — model call, tool call, child-agent spawn — decrements the budget here. Sends graceful stop signal at 100%. | Spend caps have no single enforcement point; child-agent spend escapes the parent budget. |
| Agent Runner (microVM) | Firecracker microVM per session: tenant isolation via separate kernel, seccomp, egress network policy. The isolation boundary that makes multi-tenant safe. | Shared container space — a prompt-injected agent reads another tenant's environment variables, credentials, or filesystem state. |
| Tool Registry (capability-scoped) | Presents tools as capability tokens. An agent can only call tools within the scope granted at session start — it never sees raw credentials. Prevents scope creep and prompt-injection escalation. | An agent granted “web search” calls “send email” via prompt injection. |
| LLM Gateway | Fan-out across model tiers (Haiku / Sonnet / Opus), token metering, spend deduction to the trajectory budget envelope. | Model costs go unaccounted until the billing cycle ends; no real-time cap enforcement. |
| Trajectory Store | Immutable append-only log of every step, tool call, model call, and spend event. Enables replay, incident debugging, post-hoc eval, and spend attribution. The single component most teams skip and most regret. | Incidents can't be replayed, spend can't be attributed, eval can't be run post-hoc. MTTR doubles. |
| Spend + Eval Dashboard | Real-time per-tenant spend ledger, trajectory success rate, tool-call P/R, adversarial pass rate. The SLO truth surface for operators and tenants alike. | Operators discover cost overruns from billing, not from monitoring — hours or days late. |
Deep Dives
Two components matter most here: sandbox isolation and trajectory-boundary spend control.
Deep Dive A — Sandboxing with Firecracker microVMs
What it is. Each agent session runs inside a Firecracker microVM: a lightweight KVM-based virtual machine that boots a minimal Linux kernel in ~125ms (per Agache et al., NSDI 2020). Unlike Docker containers, microVMs have a full separate kernel — a kernel exploit inside the VM does not escape to the host. AWS deployed Firecracker in production for Lambda and Fargate, demonstrating that the boot overhead is acceptable for short-lived, high-frequency workloads at cloud scale.
Approach. The Firecracker VMM (Virtual Machine Monitor) uses KVM as the hypervisor and exposes a minimal device model: virtio-net, virtio-block, and a serial console — nothing else. The stripped-down device model is not just for speed; it reduces the kernel attack surface exposed to the guest. The guest kernel is a hardened minimal Linux image. Combined with seccomp BPF filters applied inside the guest (blocking dangerous syscalls like ptrace, mount, and raw socket creation), the effective attack path to a host escape requires defeating both the VMM isolation and the seccomp layer simultaneously.
Trade-off (non-obvious). The textbook trade-off is boot latency (microVM ~125ms vs. container ~30ms). The non-obvious one is overlay filesystem overhead. Each VM needs its own root filesystem. Naively, that is a full Linux image per session. In practice, the base image is a read-only snapshot shared across all VMs (copy-on-write overlay via OverlayFS or a custom block device). The non-obvious failure: if your agent workload generates large writes to the overlay (e.g., compiling code, downloading models), the CoW layer grows unboundedly during the session. You need an explicit disk quota on the guest overlay volume — otherwise a single agent with a large write workload can exhaust the host's disk and cause noisy-neighbor eviction of other sessions on the same host. This is orthogonal to the security boundary; it is a resource-accounting gap.
- Isolation guarantees. Separate kernel, no shared memory, egress network policy enforced at the hypervisor layer. Each VM's network interface routes through a dedicated NAT that only allows calls to the tool registry endpoint and the LLM gateway — not arbitrary internet access.
- Capability token scheme. At session start, the Trajectory Orchestrator mints short-lived HMAC-signed tokens for each tool in the tenant's granted scope. The agent runner presents these tokens on every tool invocation; the tool registry validates signature + expiry + trajectory ID before executing. A compromised agent can invoke permitted tools within the session window — it cannot escalate to unpermitted tools or steal raw credentials (assuming the orchestrator itself is not compromised; TOCTOU races on the trajectory-ID check are a known gap).
- Boot latency trade-off. Cold boot ~125ms; snapshot restore ~5ms (Firecracker supports memory snapshots). Use pre-warmed VM pools for interactive-tier tenants; cold boot is acceptable for batch-tier. This keeps p95 start latency well within the 2s SLO.
- Failure mode: sandbox escape. Discovered through bug bounty or red-team, not through monitoring. Defense-in-depth has three independent layers: (1) seccomp filter blocks dangerous syscalls inside the VM, (2) egress network policy blocks exfiltration channels, (3) capability tokens prevent tool-scope escalation. All three must be defeated simultaneously for a meaningful escape — the attacker cost is high. In practice, historical hypervisor CVEs (e.g., CVE-2019-14835 in the virtio driver) have been the most common escape vector; the minimal Firecracker device model substantially shrinks this attack surface relative to QEMU.
- Detection metric. Sandbox escape has no in-band detection signal because the attack reads, not writes. The leading indicator is anomalous egress traffic: if an agent session makes outbound connections to IPs not in the allowed list (tool registry + LLM gateway), that is an immediate P0 alert. Threshold: any non-allowlisted egress event. Secondary metric: rate of seccomp SIGSYS signals per session — a spike indicates a syscall probing attempt. Target: 0 SIGSYS per session in steady state; alert on any occurrence. Both metrics require hypervisor-layer instrumentation, not guest-layer logging (a compromised guest cannot be trusted to self-report).
- Mitigation. If anomalous egress is detected: (1) immediately freeze the VM (SIGSTOP the Firecracker process), preserving memory for forensics; (2) rotate all capability tokens issued to that session; (3) quarantine the host from the pool pending audit. Secondary effect: freezing the VM terminates the tenant's in-flight trajectory — they receive an error. This is the acceptable trade-off: the blast radius of a confirmed escape outweighs the cost of one aborted task.
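The egress-allowlist detection metric can be sketched as follows, assuming hypervisor-layer instrumentation invokes a hook per outbound flow; the hook, allowlist entries, and alert sink are invented for illustration:

```python
# Sketch: hypervisor-layer egress check. Any outbound connection to a
# non-allowlisted destination is an immediate P0 alert.
# ALLOWED_EGRESS entries, on_egress hook, and the alerts sink are illustrative.
ALLOWED_EGRESS = {"tool-registry.internal:443", "llm-gateway.internal:443"}

alerts: list[str] = []

def on_egress(session_id: str, dest: str) -> None:
    """Called by hypervisor-layer instrumentation for every outbound flow."""
    if dest not in ALLOWED_EGRESS:
        alerts.append(f"P0 sandbox-escape signal: {session_id} -> {dest}")

on_egress("sess-1", "llm-gateway.internal:443")   # permitted: no alert
on_egress("sess-1", "203.0.113.7:443")            # not allowlisted: P0 alert
```

Note the check lives outside the guest, consistent with the point above that a compromised guest cannot be trusted to self-report.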
Real-world example. E2B's engineering blog on building their cloud sandbox for AI agents covers exactly this design in production. E2B uses Firecracker microVMs as the execution primitive for each sandbox session, with a custom overlay filesystem, network policy enforced at the hypervisor layer, and a file-descriptor-based protocol (not HTTP) for tool invocations inside the VM. Their post-launch findings: the dominant operational cost was not boot latency but overlay disk management — agents that download libraries mid-session required explicit quota enforcement to prevent host-disk exhaustion. They also found that pre-warming a pool of 50–100 VMs per region reduced p99 cold-start latency from ~200ms to ~12ms (snapshot restore), which is essential for interactive-tier tenants.
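A minimal sketch of the capability-token scheme from the bullets above, assuming HMAC-SHA256 over the token claims; the field layout, TTL, and helper names are assumptions, not a published spec:

```python
# Sketch: HMAC-signed capability tokens minted per tool at session start,
# validated by the tool registry on every invocation. Layout is illustrative.
import hashlib
import hmac
import json
import time

SECRET = b"orchestrator-signing-key"  # per-deployment secret, illustrative

def mint_token(tenant: str, trajectory_id: str, tool: str, ttl_s: int = 300) -> dict:
    claims = {"tenant": tenant, "trajectory": trajectory_id,
              "tool": tool, "exp": int(time.time()) + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    return {"claims": claims,
            "sig": hmac.new(SECRET, payload, hashlib.sha256).hexdigest()}

def validate(token: dict, tool: str, trajectory_id: str) -> bool:
    """Registry-side check: signature + tool scope + trajectory ID + expiry."""
    payload = json.dumps(token["claims"], sort_keys=True).encode()
    ok_sig = hmac.compare_digest(
        token["sig"], hmac.new(SECRET, payload, hashlib.sha256).hexdigest())
    c = token["claims"]
    return (ok_sig and c["tool"] == tool and
            c["trajectory"] == trajectory_id and c["exp"] > time.time())

tok = mint_token("tenant-a", "traj-42", "web_search")
assert validate(tok, "web_search", "traj-42")      # permitted tool
assert not validate(tok, "send_email", "traj-42")  # scope escalation blocked
```

Because the signature covers the claims, a compromised agent cannot widen its own scope by editing the token; it would need the orchestrator's signing key.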
Deep Dive B — Spend Control at Trajectory Boundary
Approach. The trajectory boundary is the correct spend-control unit because it matches the user's mental model of cost. A user initiates one task — "research this topic," "fix this bug," "process this document" — and expects to be charged for that task, not for each individual model call or tool invocation the agent makes internally. Per-model-call caps kill tasks arbitrarily at step N with no graceful partial-result path; per-tool-call caps have the same problem at even finer granularity.
The orchestrator implements this with a budget envelope: at dispatch, the Trajectory Orchestrator reads the tenant's remaining monthly budget and assigns a per-trajectory cap: min(tenant_remaining_budget, task_type_default_cap). For interactive tasks the default cap might be on the order of $0.10; for deep research tasks, $1.00. Tenants configure these defaults in the Tenant Control Plane. Every model call deducts estimated cost (input_tokens × input_price + output_tokens × output_price) from the envelope atomically — before the call is dispatched, not after. Every tool call deducts its estimated cost (qualitative estimates — actual costs vary by tool provider). Decrementing before dispatch means the orchestrator cannot overshoot by more than one step even if the call fails or the response arrives out of order.
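A sketch of the deduct-before-dispatch rule, using a pessimistic pre-deduction that is credited back at reconciliation; the class, lock discipline, and price figures are illustrative assumptions:

```python
# Sketch: budget envelope with atomic pessimistic pre-deduction and async
# credit-back at reconciliation. Names and prices are illustrative.
import threading

class BudgetEnvelope:
    def __init__(self, cap_usd: float):
        self.remaining = cap_usd
        self._lock = threading.Lock()

    def reserve(self, pessimistic_cost: float) -> bool:
        """Deduct the worst-case cost atomically BEFORE dispatching the call."""
        with self._lock:
            if self.remaining < pessimistic_cost:
                return False            # budget exhausted: trigger graceful stop
            self.remaining -= pessimistic_cost
            return True

    def reconcile(self, pessimistic_cost: float, actual_cost: float) -> None:
        """Credit back the difference once real token counts arrive."""
        with self._lock:
            self.remaining += pessimistic_cost - actual_cost

env = BudgetEnvelope(cap_usd=0.10)       # interactive-task default cap
worst = 4096 * 15e-6                      # max output tokens × price (assumed)
if env.reserve(worst):
    ...  # dispatch the model call
    env.reconcile(worst, actual_cost=0.012)
```

Reserving the pessimistic bound is what keeps overshoot within one step even when reconciliation lags, as discussed in the failure-mode bullet below.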
Trade-off (non-obvious). The textbook trade-off is granularity: finer budgets give tighter cost control but more accounting writes. The non-obvious one is fair-share scheduler starvation. When a platform hosts hundreds of tenants sharing an LLM gateway, the trajectory budget only controls the tenant's own spend — it does not control how much queue time the tenant's trajectories consume in the shared LLM gateway. A tenant with a large budget and many parallel trajectories can saturate the gateway's token-per-second capacity, starving tenants with small budgets even though neither has exceeded their spend cap. The fix is a second layer of control: a per-tenant token-rate quota at the LLM gateway, distinct from the per-trajectory dollar budget. The dollar budget answers “how much can this task cost?”; the token-rate quota answers “how fast can this tenant consume tokens?” Both are necessary; most first-version platforms only implement one.
- Child-agent rollup. When the parent agent spawns a child agent, the parent's remaining budget is split: the child receives a sub-budget carved from the parent's remaining envelope, and all child spend rolls up to the parent's ledger in real time. If the child exhausts its sub-budget, it receives a graceful stop signal and returns a partial result to the parent. The parent can choose to continue (deducting from its remaining budget) or return to the user. The critical implementation detail: child spend must be reserved against the parent budget at child-dispatch time, not at child-return time. If you reserve at return, a parent that spawns 50 parallel children can commit 50 × sub-budget before any child has returned — exceeding the parent's envelope by an unbounded factor.
- Graceful stop signal. At 100% budget consumption, the Trajectory Orchestrator injects a synthetic tool result into the agent's context: `{"type":"budget_exceeded","message":"Task budget exhausted. Return partial result."}`. A well-designed agent handles this and returns what it has. An agent that ignores it is killed by the orchestrator after one additional model call (a grace turn). The grace turn exists because killing the agent mid-generation produces a truncated response with no context for the user; the grace turn allows the agent to emit a coherent terminal message.
- Failure mode: budget reconciliation lag. Budget deductions are optimistic (estimated cost before the call) and reconciled against actual costs asynchronously. During the reconciliation window — typically 100–500ms — a trajectory can overshoot by up to one step's worth of spend. For a model call that returns a large context window, one step's overshoot can be non-trivial. The mitigation: use the pessimistic upper bound of the call tier (e.g., max output tokens × price) as the pre-deduction estimate, then credit back the difference at reconciliation. This guarantees the trajectory never overshoots by more than one step's pessimistic estimate — which is the SLO spec.
- Detection metric. The primary signal for spend-control drift is trajectory overshoot rate: the fraction of completed trajectories whose actual cost exceeded their assigned budget by more than one step's pessimistic estimate. Target: <0.1% of trajectories. A spike in overshoot rate indicates reconciliation lag has grown beyond the one-step window — typically caused by orchestrator backpressure at high QPS. Secondary metric: per-tenant monthly spend vs. cap ratio, computed daily. A tenant consistently reaching 95%+ of their monthly cap before the billing cycle ends is a leading indicator that their default per-trajectory cap is too high for their task mix, and they will hit the monthly cap mid-month without warning. Alert threshold: any tenant whose projected end-of-month spend exceeds their cap by more than 20% based on the current burn rate, triggering a cap-adjustment recommendation.
Real-world example. LangSmith (docs.smith.langchain.com) is the closest public reference for trajectory-boundary spend tracking. LangSmith models execution as a run tree: each root run corresponds to a user-visible task (equivalent to a trajectory), and child runs (LLM calls, tool calls, chain invocations) roll their token and latency costs up into the parent. The run tree structure makes it possible to answer “how much did this user task cost in total, including all nested calls?” — the exact question the trajectory budget envelope must answer at enforcement time. LangSmith's production architecture also surfaces a practical scaling constraint: at high run volume, the write path for cost roll-up becomes a bottleneck. Their mitigation (documented in their tracing architecture) is to buffer child-run cost events in a local aggregator and flush to the parent counter in batches — the same 100ms-window batching pattern described in the deduction mechanics above. The lesson: spend-control architecture and observability architecture converge at scale; designing them separately produces redundant write paths and inconsistent cost numbers between the spend enforcer and the monitoring dashboard.
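The reserve-at-dispatch rule for child agents can be sketched in a few lines; names and dollar amounts are illustrative:

```python
# Sketch: child sub-budgets are debited from the parent envelope when the
# child is SPAWNED, so parallel children cannot collectively exceed it.
# Class, amounts, and the spawn API are illustrative.
from typing import Optional

class ParentBudget:
    def __init__(self, remaining: float):
        self.remaining = remaining

    def spawn_child(self, sub_budget: float) -> Optional[float]:
        if self.remaining < sub_budget:
            return None                  # refuse spawn: envelope exhausted
        self.remaining -= sub_budget     # reserve at dispatch, not at return
        return sub_budget

parent = ParentBudget(remaining=0.50)
children = [parent.spawn_child(0.15) for _ in range(5)]
# Only the first 3 children get a sub-budget; spawns 4 and 5 are refused.
assert children == [0.15, 0.15, 0.15, None, None]
```

Reserving at return time instead would let all 5 spawns succeed and commit $0.75 against a $0.50 envelope — the exact failure mode described in the rollup bullet above.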
Break It
Three components, each removed independently. Every failure mode maps back to a metric in the eval harness or a P0/P1 SLO in the requirements table.
Remove microVM isolation
Agents run in shared containers on the same host. A prompt-injection attack against Tenant A's agent causes it to read /proc/1/environ and exfiltrate environment variables — including the LLM API keys and trajectory store credentials of every other agent running on that host. Blast radius: all tenants on the affected host, potentially the entire platform if credentials are shared. Detection: bug bounty or red-team report — not monitoring. No in-band signal because the attack reads, not writes. SLO violated: sandbox escape rate = 0 (P0 — existential).
Spend cap at per-call, not per-task
Each model call is capped at $0.005. An agent that runs a 40-step tool loop never exceeds $0.005 per step, so the cap never fires. Total task cost: $0.20. If that task recurses into 10 child agents, total: $2.00. A tenant with a $10 monthly cap hits it in 5 parent tasks — which may correspond to 5 user clicks. Or the cap is set high enough that individual calls never hit it, making it useless. Detection: spend efficiency metric on the dashboard — useful work per dollar will be low, but the raw spend overrun may only appear in the monthly billing reconciliation. SLO violated: per-tenant spend cap enforced within 1 step (P0).
Skip the trajectory store
A tenant's agent ran for 3 hours, spawned 20 child agents, spent $47 (under their $50 cap only by luck), and returned a partially wrong answer. There is now a P1 incident. Without the trajectory store: (1) on-call cannot replay the trajectory to find where the agent diverged, (2) spend cannot be attributed to specific steps or child agents, (3) the eval team cannot run the golden-path comparison to confirm the bug is fixed before the next deploy. Detection: only through user complaint and log grep, not structured replay. SLO violated: trajectory replay within 1h for P1 incidents (P1); post-hoc eval blocked indefinitely.
What Does a Bad Day Cost?
Agent platforms have a larger blast radius per incident than traditional LLM APIs because a single API call can spawn arbitrary downstream compute. Three incident modes, each with a dollar cost and a time cost:
- Sandbox escape discovered via bug bounty (existential) — Direct financial cost: bug bounty payout ($10K–$100K for critical). Indirect: mandatory security audit, customer notifications under GDPR Article 33 (72-hour window), potential churn of enterprise tenants. In the worst case: regulatory action, class-action liability. This is the incident mode that ends hosted agent platform companies. There is no “detect and recover” — the damage is the data access that already happened.
- Runaway recursive agent spawns 1,000 child agents before cap fires — Direct financial cost: 1,000 children × 20 steps × $0.001/step = $20 in LLM spend per runaway, but the real cost is GPU capacity consumption. At 1,000 simultaneous runaways, that is 20M extra model calls (1,000 runaways × 1,000 children × 20 steps) saturating the LLM gateway and degrading latency for all other tenants. Time to detection without the spend dashboard: next billing cycle. With real-time trajectory budget enforcement: 1 step (the spec). The difference is hours of degraded service vs. seconds.
- Trajectory store outage during a P1 incident (can't debug) — Direct financial cost: engineers' time × MTTR extension. A 1h incident with trajectory replay becomes a 3–4h incident on log grep alone. At an all-in engineering cost of $500/hour and a 5-person incident response: $2,500 in engineering cost per extra hour. Indirect cost: SLA credits for affected tenants, trust damage if the incident recurs because the root cause was misdiagnosed.
On-call Runbook
Runaway tool loop across nested sub-agents
MTTR p50 / p99: Seconds with trajectory budget enforcement; hours without.
Blast radius: 1,000+ child agents spawned before spend cap fires; LLM gateway saturated; all-tenant latency degrades 5–10×.
1. Detect: Real-time trajectory budget enforcement detects overage within 1 step; spend dashboard alerts in <5s; without enforcement: next billing cycle
2. Escalate: Oncall paged; parent trajectory forcibly terminated; child microVMs evicted; tenant spend cap temporarily lowered pending investigation
3. Rollback: Kill switch halts all in-flight trajectories for affected tenant; restore from last trajectory checkpoint if task is retriable
4. Post: Add per-trajectory child-agent depth limit (e.g., max depth 3); add circuit breaker on LLM gateway per-tenant QPS; audit tool schemas for accidental recursion surface
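The depth-limit fix in the last step can be enforced at spawn time in the orchestrator; a minimal sketch with invented names:

```python
# Sketch: per-trajectory child-agent depth limit checked at spawn time.
# MAX_DEPTH and the spawn hook are assumptions about the orchestrator API.
MAX_DEPTH = 3

class SpawnRefused(Exception):
    pass

def spawn_child_agent(parent_depth: int, task: str) -> dict:
    if parent_depth + 1 > MAX_DEPTH:
        raise SpawnRefused(f"child would exceed max depth {MAX_DEPTH}")
    return {"task": task, "depth": parent_depth + 1}

child = spawn_child_agent(parent_depth=0, task="research subtopic")
grandchild = spawn_child_agent(parent_depth=child["depth"], task="fetch source")
# A spawn at depth 3 raises SpawnRefused instead of recursing further.
```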
Sandbox escape via unsanitized tool argument
MTTR p50 / p99: Containment in minutes; full remediation (audit + notification) 24–72h.
Blast radius: Tenant data cross-contamination; potential access to host filesystem or other tenant microVMs; regulatory notification required under GDPR Art. 33.
1. Detect: seccomp violation logged immediately; egress network policy blocks unexpected outbound; detected within 1 request
2. Escalate: Security oncall paged immediately; affected microVM quarantined; all sessions for affected tenant suspended pending audit
3. Rollback: No rollback for data-access events — focus on containment; rotate all tenant API keys; initiate forensic audit of microVM logs
4. Post: Harden tool argument validation (schema + allowlist); add integration test for sandbox escape attempts; review seccomp policy coverage
Trajectory eval regression hides quality drop
MTTR p50 / p99: Traffic shift in <5 min; root-cause diagnosis 2–6h.
Blast radius: Silent model quality degradation ships to production; tenants notice via task failure rate increase; trust damage before detection.
1. Detect: The automated eval suite on the new model version should catch the regression; it is missed if eval coverage gaps exist or eval prompts are stale
2. Escalate: Product oncall alerted by tenant complaint spike; eval team reviews trajectory store samples; A/B comparison against previous model version
3. Rollback: Traffic shift back to previous model version; feature flag on model routing; no data loss — trajectories are immutable
4. Post: Add golden-set regression tests covering top-10 tenant task types; gate model promotion on eval pass rate ≥99%; add eval freshness check (prompts updated within 30 days)
Company Lens
Anthropic's push
Expect deep drill on capability-limited tool design and trajectory oversight. Anthropic's published guidance on building effective agents emphasizes: tools should do exactly what they say, nothing more; agents should be able to pause and check in with humans before irreversible actions; and every deployed agent harness should have a kill switch that halts all in-flight trajectories. Interview questions will probe whether your tool registry prevents scope creep, how you handle a tool that silently expands its behavior, and what your human-in-the-loop checkpoint looks like for high-stakes actions (send email, execute payment, delete data). The safety adversarial eval suite is not optional at Anthropic — expect to justify its coverage and cadence.
OpenAI's push
Expect drill on the Assistants API-style platform design: thread management, run lifecycle, tool-call streaming contracts, and file attachment handling. OpenAI's Assistants API is the closest public reference for a hosted agent platform and has well-documented design decisions. Expect questions on: how you version tool schemas without breaking existing agents, how you handle a run that exceeds the max_prompt_tokens limit mid-trajectory, and how you expose observability to developers (OpenAI ships run steps as a first-class API resource, per platform.openai.com/docs/assistants — your trajectory store is the equivalent). Cost management questions will focus on how you surface token usage per-run to developers in real time, not just at billing.
Key Takeaways
What to remember for interviews
1. The right spend-control unit is the trajectory boundary — the cost of one user-facing task. Per-step is too coarse; per-token is too fine; per-trajectory is the unit the user and the budget both understand.
2. Child-agent spend must roll up into the parent's budget envelope before dispatch, not after return — otherwise a recursive agent defeats the cap by spawning under-cap children in parallel.
3. Sandbox isolation on a multi-tenant agent platform is existential, not operational. A sandbox escape is a data breach affecting all tenants on the host; defense-in-depth (seccomp + network policy + capability tokens) must defeat the attack at three independent layers.
4. The trajectory store is the platform's flywheel: it powers incident replay, spend attribution, post-hoc eval, and the spend dashboard simultaneously. It is the most commonly skipped component and the one teams most regret skipping.
5. Capacity planning for agent platforms must derive LLM gateway QPS from user tasks × average LLM calls per task × (1 + average child-agent depth). Starting from user-visible QPS understates true LLM demand by 10–50×.
6. Trajectory eval — measuring tool-call precision/recall and spend efficiency across the full episode — catches the bloated agents that look fine on final-answer benchmarks but blow past budgets in production.
Interview Questions
An interviewer asks: 'Where should a hosted agent platform enforce a tenant's spend cap — per model call, per tool call, or somewhere else?' Walk through the reasoning.
★★★Design the capability-token scheme for a tool registry on a multi-tenant agent platform. What does a token contain and how does the runner validate it?
★★★Your trajectory store goes down during a P1 incident. What are the three compounding effects, and how does each one extend MTTR?
★★☆A Google interviewer asks: 'Firecracker microVMs add ~125ms boot latency. Your agent p95 start SLO is <2s to first tool call. Is microVM isolation worth the cost?' How do you reason through this?
★★★Meta's interviewer asks: 'How do you design trajectory eval for an agent platform where agents can spawn other agents? Standard pass/fail on the leaf output doesn't work.'
Further Reading
- Anthropic — Building Effective Agents — Anthropic's practitioner guide on agentic system design: when to use multi-agent patterns, how to design tool interfaces, and the safety properties a hosted platform must enforce.
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) — The paper that formalized the observe-think-act loop underpinning every agent on a hosted platform. The trajectory concept in this module maps directly to a ReAct episode.
- Firecracker: Lightweight Virtualization for Serverless Applications (Agache et al., 2020) — AWS's NSDI paper on Firecracker microVMs — the isolation primitive used by Lambda, Fargate, and any serious multi-tenant agent runner. Essential for the sandboxing deep dive.
- E2B — Secure Open-Source Cloud Runtime for AI Agents — E2B's engineering blog on building a cloud sandbox for AI agents using microVMs. Covers the practical tradeoffs: boot latency, overlay FS, egress policies, and capability scoping.
- Hamel Husain — Your AI Product Needs Evals — The practitioner post that reframed eval-first design for the AI engineering generation. The trajectory eval section of this module follows Hamel's framework applied to multi-turn agent episodes.
- LangSmith — Tracing and Evaluation for LLM Applications — LangSmith's production architecture is the closest public reference for a trajectory store + eval dashboard. The tracing schema and run-tree model inform the trajectory store design here.