',
};
function* renderInkTree(inkTree: InkNode): Generator<string> {
  // Walk the Ink component tree depth-first, emitting HTML for each node
  for (const node of inkTree.children) {
    const adapter = componentMap[node.type as InkNodeType];
    yield adapter(node.props);   // opening tag + mapped styles
    yield* renderInkTree(node);  // recurse into this node's children
    yield closingTag(node.type);
  }
}
function flexboxCss(props: InkProps): string {
const direction = props.flexDirection ?? "column";
const justify = props.justifyContent ?? "flex-start";
const padding = props.padding ?? 0;
  return `flex-direction:${direction}; justify-content:${justify}; padding:${padding}ch;`;
}
```
```typescript
type SessionMode = "terminal" | "ide" | "web" | "viewer";
type PermissionCallback = (tool: string, input: unknown) => Promise<boolean>;
// Factory: same engine, different permission UIs
function createSession(mode: SessionMode): QueryEngine {
let callback: PermissionCallback;
if (mode === "terminal") {
callback = terminalPermission; // stdin prompt
} else if (mode === "ide") {
callback = idePermission; // WebSocket dialog
} else if (mode === "web") {
callback = webPermission; // Zustand modal
} else {
    callback = async () => false; // viewer - read-only
}
return new QueryEngine({ permissionCallback: callback });
}
async function terminalPermission(tool: string, _input: unknown): Promise<boolean> {
  const answer = await readLine(`Allow ${tool}? [y/n] `);
return answer.toLowerCase() === "y";
}
async function idePermission(tool: string, input: unknown): Promise<boolean> {
ws.send(JSON.stringify({ type: "permission_request", tool, input }));
  // ws.receive() is pseudocode - real impl: wrap ws.onmessage in a Promise
const response = await ws.receive();
return response.allowed;
}
async function webPermission(tool: string, _input: unknown): Promise<boolean> {
  // store.dispatch/waitFor is pseudocode - real Zustand impl uses
  // setState + a subscribe-based Promise wrapper (Zustand has no built-in waitFor)
store.dispatch({ type: "SHOW_PERMISSION_MODAL", tool });
return store.waitFor("PERMISSION_RESPONSE").then((r) => r.allowed);
}
```
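Both bridge callbacks above lean on a request/response helper that the snippets only gesture at (`ws.receive()`, `store.waitFor()`). A minimal sketch of the subscribe-based Promise wrapper they stand in for - the `MessageBus` interface and `waitForMessage` name are illustrative, not part of any real API:
```typescript
type Listener<T> = (event: T) => void;

// Anything with subscribe/unsubscribe semantics: a WebSocket wrapper, a Zustand store, etc.
interface MessageBus<T> {
  subscribe(listener: Listener<T>): () => void; // returns an unsubscribe function
}

function waitForMessage<T>(
  bus: MessageBus<T>,
  match: (event: T) => boolean,
  timeoutMs = 30_000
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let unsubscribe = () => {};
    const timer = setTimeout(() => {
      unsubscribe();
      reject(new Error("timed out waiting for permission response"));
    }, timeoutMs);
    unsubscribe = bus.subscribe((event) => {
      if (!match(event)) return; // ignore unrelated traffic on the same channel
      clearTimeout(timer);
      unsubscribe();
      resolve(event);
    });
  });
}
```
With a helper like this, idePermission becomes: send the permission_request, then await a message matching `type === "permission_response"` and return its allowed flag.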
## Interview Questions
### ⭐⭐⭐ _(Anthropic, Meta)_
**Q:** Design a system where the same AI engine serves terminal, IDE, and web interfaces.
Answer
Use the bridge pattern: one shared QueryEngine handles all AI logic (tool execution, context management, streaming), and each frontend connects through an adapter layer. Terminal: Ink renders React components to ANSI via stdout. IDE: a WebSocket bridge connects the extension to a headless CLI process - the extension sends user messages, receives streaming events, and renders them in IDE panels. Web: a Next.js app with Zustand state management and an ink-compat adapter that maps Ink components (Box, Text) to HTML equivalents so tool renderers are shared across all three surfaces. The key architectural decision: permission callbacks are injected into QueryEngine at session creation, so terminal prompts stdin, IDE shows a dialog via WebSocket, and web shows a modal - same engine, different UI.
### ⭐⭐⭐ _(Google)_
**Q:** How would you handle permission dialogs across different UI frontends?
Answer
Inject a permission_callback into the QueryEngine at session creation time. In terminal mode, the callback reads from stdin (blocking prompt). In IDE bridge mode, the callback sends a permission_request message over WebSocket, then awaits the response - the IDE extension renders a native dialog (VS Code showInformationMessage, JetBrains DialogWrapper) and sends back {allowed: true/false}. In web mode, the callback dispatches to a Zustand store that triggers a React modal. The pattern is dependency injection: the engine never knows which UI is rendering the dialog. This also enables viewer-only mode - inject a callback that always returns false, so read-only observers can watch but never approve tool use.
### ⭐⭐⭐ _(Anthropic)_
**Q:** What is the ink-compat adapter, and why does it matter?
Answer
Ink components (Box, Text, Spinner, etc.) are designed for terminal rendering via ANSI escape codes. The ink-compat adapter maps these to browser-equivalent HTML elements: Box becomes a div with flexbox, Text becomes a span with CSS styles, Spinner becomes a CSS animation. This means tool output renderers - the components that display file diffs, search results, command output - are written once using Ink primitives and work in all three frontends. Without ink-compat, you would have to write and maintain a separate renderer per surface, and the terminal, IDE, and web views would inevitably drift apart.
### ⭐⭐⭐ _(Anthropic)_
**Q:** What are the tradeoffs between IDE-native and stdio-based bridges for editor integration?
Answer
stdio bridges (spawning a CLI subprocess and communicating over stdin/stdout) are simpler to implement and work across any editor that can spawn a process, but they lack access to IDE-native APIs - they can't open diff views, read the active selection, or show native dialogs without extra plumbing. IDE-native bridges (an extension talking to the engine, e.g. over WebSocket) get the full editor API surface and a richer UX, but cost an extension to build and maintain per editor plus a connection lifecycle to manage. A common compromise is a thin native extension that proxies to a headless engine process, keeping the engine identical across editors.
## Further Reading
- [VS Code Extension API](https://code.visualstudio.com/api)
Official docs for building VS Code extensions - the primary IDE integration surface for Claude Code.
- [WebSocket RFC 6455](https://datatracker.ietf.org/doc/html/rfc6455)
The protocol spec underlying IDE-to-engine communication in bridge mode.
- [Ink: React for interactive command-line apps](https://github.com/vadimdemedes/ink)
The terminal React renderer whose components are adapted by ink-compat for cross-platform rendering.
- [VS Code Extension Host Architecture](https://code.visualstudio.com/api/advanced-topics/extension-host)
How VS Code isolates extensions in a separate process - the same isolation model used by the Claude Code IDE bridge to sandbox the engine from the editor.
- [Adapter Pattern (Refactoring Guru)](https://refactoring.guru/design-patterns/adapter)
The structural design pattern at the heart of the bridge - converting one interface (QueryEngine) into multiple frontend-specific interfaces.
- [React Native Architecture Overview](https://reactnative.dev/docs/the-new-architecture/landing-page)
The gold standard for a single React tree rendering to multiple native targets - the same multi-renderer problem Claude Code solves across terminal, IDE, and web.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Streaming & API Layer"
part: "AI Engineering"
number: 58
emoji: "๐"
subtitle: "Async generators, queryModelWithStreaming, SSE parsing, and backpressure"
tags: ["engineering", "ml", "ai-engineering", "interview-prep", "agent-sdk"]
---
# ๐ Streaming & API Layer
> Async generators, queryModelWithStreaming, SSE parsing, and backpressure
> [!question] Key Question
> Tokens appear one by one because five async generators pipe data like Unix pipes
← Bridges & IDE Integration | → Error Recovery
## Key Insights
> [!tip] Insight
> The key property of async generators is lazy evaluation. The producer only runs when the consumer asks for the next value. This gives you backpressure for free - if the terminal renderer is slow, the entire pipeline naturally slows down.
> [!tip] Insight
> Streaming is not just about UX - it fundamentally changes the agent architecture. Without streaming, you wait for the entire response before knowing if there are tool calls. With streaming, you can start rendering text immediately while still accumulating tool_use blocks.
> [!tip] Insight
> At ~50 tokens/second, a 2K token response takes ~40 seconds to generate. With streaming, the user sees the first token in under a second. Without streaming, they see nothing for 40 seconds then everything at once.
## Code Examples
```typescript
// The streaming pipeline - a chain of async generators
async function* queryModelWithStreaming(
messages: Message[],
systemPrompt: string,
tools: Tool[]
): AsyncGenerator<StreamEvent> {
// Calls the API and yields streaming events
const response = sdk.messages.stream({
model: "claude-opus-4-6",
messages,
system: systemPrompt,
tools,
max_tokens: 8096, // required by the Messages API
});
for await (const event of response) {
yield event; // text_delta, tool_use, message_stop, etc.
}
}
async function* queryLoop(
messages: Message[],
tools: Tool[],
systemPrompt: string
): AsyncGenerator<StreamEvent> {
  // The agentic loop - consumes streaming events
while (true) {
const toolBlocks: ToolUseEvent[] = [];
for await (const event of queryModelWithStreaming(messages, systemPrompt, tools)) {
if (event.type === "text_delta") {
yield event; // pass through to REPL for rendering
} else if (event.type === "tool_use") {
toolBlocks.push(event);
}
}
if (toolBlocks.length === 0) return; // no tools = done
const results = await runTools(toolBlocks);
messages.push(...results);
    // loop continues - call API again with tool results
}
}
// The REPL consumes the outermost generator:
for await (const event of queryLoop(messages, tools, prompt)) {
renderToTerminal(event); // each token appears immediately
}
```
```typescript
// SSE wire format from the API
//
// event: message_start
// data: {"type":"message_start","message":{"id":"msg_01..."}}
//
// event: content_block_delta
// data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}
//
// event: message_stop
// data: {"type":"message_stop"}
async function* parseSseStream(response: ReadableStream): AsyncGenerator<SseEvent> {
  // Parse Server-Sent Events into typed objects
  let buffer = "";
  const decoder = new TextDecoder();
  for await (const chunk of response) {
    buffer += decoder.decode(chunk, { stream: true }); // handles multi-byte chars split across chunks
    while (buffer.includes("\n\n")) {
      const idx = buffer.indexOf("\n\n");
      const eventStr = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 2);
const event = parseEvent(eventStr);
yield event;
}
}
}
```
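`parseEvent` is left undefined in the sketch above. A minimal version - including the `SseEvent` shape the generator yields; both names are placeholders rather than SDK types - just splits a frame into its `event:` and `data:` lines, per the SSE format shown in the comments:
```typescript
interface SseEvent {
  event: string; // e.g. "content_block_delta"
  data: unknown; // parsed JSON payload, or null for data-less frames
}

function parseEvent(eventStr: string): SseEvent {
  let eventName = "message"; // SSE default when no event: field is present
  const dataLines: string[] = [];
  for (const line of eventStr.split("\n")) {
    if (line.startsWith("event:")) {
      eventName = line.slice("event:".length).trim();
    } else if (line.startsWith("data:")) {
      dataLines.push(line.slice("data:".length).trim());
    }
    // Comment lines (starting with ":") and unknown fields are ignored
  }
  // Multi-line data fields are joined with newlines per the SSE spec
  const payload = dataLines.length > 0 ? JSON.parse(dataLines.join("\n")) : null;
  return { event: eventName, data: payload };
}
```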
## Interview Questions
### ⭐⭐⭐ _(Anthropic, Google)_
**Q:** Design a streaming pipeline for an AI agent that handles tool calls mid-stream.
Answer
The key insight is that tool_use events arrive interleaved with text_delta events in the same stream. Your pipeline must: (1) buffer text deltas for immediate rendering, (2) accumulate tool_use blocks until complete (they arrive as start + delta + stop events), (3) when message_stop arrives, check for pending tool blocks, (4) execute tools and append results to messages, (5) loop back to the API with updated messages. The streaming pipeline is a chain of async generators: the inner generator yields raw SSE events, the middle layer parses them into typed objects, and the outer generator (query_loop) handles the tool-use-then-retry logic. Each layer yields progressively, so the user sees text tokens immediately even when tool calls are pending.
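A sketch of step (2) - accumulating a tool_use block from its start/delta/stop events. The event and field names (content_block_start, input_json_delta, partial_json) follow the Anthropic streaming format; the accumulator class and its narrow ContentBlockEvent type are illustrative:
```typescript
type ContentBlockEvent =
  | { type: "content_block_start"; index: number; content_block: { type: "text" | "tool_use"; id: string; name: string } }
  | { type: "content_block_delta"; index: number; delta: { type: "text_delta"; text: string } | { type: "input_json_delta"; partial_json: string } }
  | { type: "content_block_stop"; index: number };

interface CompletedToolUse {
  id: string;
  name: string;
  input: unknown;
}

class ToolUseAccumulator {
  // Keyed by content block index so interleaved blocks don't collide
  private pending = new Map<number, { id: string; name: string; partialJson: string }>();
  private completed: CompletedToolUse[] = [];

  handle(event: ContentBlockEvent): void {
    if (event.type === "content_block_start" && event.content_block.type === "tool_use") {
      this.pending.set(event.index, { id: event.content_block.id, name: event.content_block.name, partialJson: "" });
    } else if (event.type === "content_block_delta" && event.delta.type === "input_json_delta") {
      const block = this.pending.get(event.index);
      if (block) block.partialJson += event.delta.partial_json; // arguments arrive as JSON fragments
    } else if (event.type === "content_block_stop") {
      const block = this.pending.get(event.index);
      if (!block) return; // a text block, nothing to finalize
      // Only at stop is the accumulated string guaranteed to be complete, parseable JSON
      this.completed.push({ id: block.id, name: block.name, input: JSON.parse(block.partialJson || "{}") });
      this.pending.delete(event.index);
    }
  }

  drain(): CompletedToolUse[] {
    const done = this.completed;
    this.completed = [];
    return done;
  }
}
```
On message_stop, the loop drains the accumulator: an empty result means the turn is done, a non-empty one means execute the tools and call the API again.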
### ⭐⭐⭐ _(Meta)_
**Q:** Explain backpressure in async generators. Why does it matter for LLM streaming?
Answer
Backpressure is the mechanism where a slow consumer naturally slows down the producer. In async generators, the producer's body only runs when the consumer calls next() (which for await does implicitly), so a slow consumer leaves the producer suspended at its last yield. For LLM streaming this matters because the terminal renderer, the SSE parser, and the API reader form a chain: if rendering stalls, the harness simply stops pulling, nothing buffers without bound inside the pipeline, and memory stays flat even for very long responses. You get flow control for free, without explicit queues or pause/resume signaling.
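A tiny self-contained demo of that property - the producer's log line only fires when the slow consumer comes back for the next value, so the producer spends most of its time suspended at yield:
```typescript
async function* producer(): AsyncGenerator<number> {
  for (let i = 0; i < 3; i++) {
    console.log(`producing ${i}`); // runs only when the consumer calls next()
    yield i;
  }
}

async function slowConsumer(): Promise<void> {
  for await (const value of producer()) {
    console.log(`consumed ${value}`);
    // While we sleep, the producer stays suspended at its yield:
    // nothing is buffered ahead, so memory stays flat.
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}

void slowConsumer();
```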
### ⭐⭐⭐ _(Google, Databricks)_
**Q:** How would you handle network disconnection during a streaming API call?
Answer
Layer the solution: (1) Stream interruption recovery is handled at the application level - the SDK provides event callbacks and error handlers, but the agent harness decides whether to retry the full request or attempt to continue from the last received event. Recovery of partial tool-use blocks requires careful state management. (2) At the application layer, detect stalled streams with a heartbeat timeout (e.g., no event for 30s = stale connection). (3) Implement idempotent retry: if the stream dies mid-response, you have partial text - include it in the retry request context so the model can continue rather than restart. (4) For tool calls interrupted mid-execution, check tool idempotency: Read/Grep are safe to retry, but Bash may need rollback. The key tradeoff: aggressive reconnection wastes tokens (re-generating seen content), while conservative reconnection loses progress.
### ⭐⭐⭐ _(Anthropic)_
**Q:** How do you handle partial tool-call JSON in a streaming response where the connection drops mid-chunk?
Answer
A tool call's arguments don't arrive as one blob: a content_block_start announces the tool_use block (its id and name), a series of input_json_delta events each carry a fragment of the arguments JSON, and only at content_block_stop is the accumulated string guaranteed to parse. So the invariant is: never parse or execute until the stop event for that block has arrived. If the connection drops mid-chunk, the buffered fragment is invalid JSON - discard the partial block rather than attempting to repair or execute it, and retry the request. Text already rendered can stay on screen, but the tool call itself must be regenerated by the retried request; treating half-received arguments as executable is how you end up running a command the model never finished specifying.
## Further Reading
- [MDN: Async iteration and generators](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of)
Reference for async function*, for await...of, and the async iteration protocol.
- [Anthropic Streaming API](https://docs.anthropic.com/en/api/streaming)
Official docs for streaming message responses via Server-Sent Events.
- [SSE Specification (WHATWG)](https://html.spec.whatwg.org/multipage/server-sent-events.html)
The standard behind text/event-stream - event types, data fields, reconnection.
- [Claude Code (source)](https://github.com/anthropics/claude-code)
Open-source reference for the streaming pipeline architecture described in this module.
- [WHATWG Streams API](https://streams.spec.whatwg.org/)
The browser standard for backpressure-aware streaming - ReadableStream, WritableStream, and the pipe chain that async generators implement natively.
- [Anthropic SDK Streaming (TypeScript)](https://github.com/anthropics/anthropic-sdk-typescript/blob/main/helpers.md)
The official SDK's streaming helpers for TypeScript - the event and accumulation API layered on top of SSE.
- [Node.js Stream Backpressure Guide](https://nodejs.org/en/docs/guides/backpressuring-in-streams)
Official Node.js guide on backpressure - the mechanism that prevents unbounded memory growth when the consumer is slower than the producer.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Error Recovery"
part: "AI Engineering"
number: 59
emoji: "๐"
subtitle: "Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation"
tags: ["engineering", "ml", "ai-engineering", "interview-prep", "agent-sdk"]
---
# ๐ Error Recovery
> Reactive compact retry, max output tokens escalation, abort handling, and graceful degradation
> [!question] Key Question
> The API says 'prompt too long' - the agent silently compacts and retries before you notice
← Streaming & API Layer | → Speculative Execution
## Key Insights
> [!tip] Insight
> Compaction preserves recent tool results (expensive to regenerate) but summarizes older assistant text. This keeps the model's working memory intact while freeing space from conversational history.
> [!tip] Insight
> Rate limit handling is proactive, not just reactive. The agent parses rate-limit headers from every response and slows down before hitting the limit, rather than waiting for a 429 error.
> [!tip] Insight
> The one-shot flag for reactive compaction is a deliberate design choice. Allowing multiple compaction attempts risks an infinite loop where the agent keeps compacting and retrying but never makes progress - burning tokens and time on a fundamentally impossible request.
## Code Examples
```typescript
const Transition = {
  COMPLETED: "completed", // success - no more tool calls
  TOOL_USE: "tool_use", // normal - execute tools, loop back
REACTIVE_COMPACT: "reactive_compact_retry", // prompt too long
MAX_TOKENS: "max_output_tokens_recovery", // output truncated
MODEL_ERROR: "model_error", // permanent API error
MAX_TURNS: "max_turns", // safety limit hit
ABORTED: "aborted_streaming", // user cancelled
} as const;
type TransitionValue = typeof Transition[keyof typeof Transition];
async function queryLoop(messages: Message[], tools: Tool[], config: Config): Promise<TransitionValue> {
let hasAttemptedCompact = false;
let maxTokensRetries = 0;
while (true) {
try {
const response = await callApi(messages, tools);
const toolBlocks = extractToolUse(response);
if (toolBlocks.length === 0) return Transition.COMPLETED;
const results = await runTools(toolBlocks);
messages.push(...results);
// loop back
} catch (err) {
if (err instanceof PromptTooLongError) {
if (hasAttemptedCompact) return Transition.MODEL_ERROR; // already tried, give up
messages = await compact(messages);
hasAttemptedCompact = true;
continue; // retry with compacted messages
} else if (err instanceof MaxOutputTokensError) {
if (maxTokensRetries >= 3) return Transition.MODEL_ERROR;
maxTokensRetries++;
config.maxTokens *= 2; // escalate limit
continue;
} else if (err instanceof RateLimitError) {
await sleep(err.retryAfter); // backoff
continue;
} else if (err instanceof AbortError) {
return Transition.ABORTED;
}
throw err;
}
}
}
```
```typescript
function handleRateLimit(responseHeaders: Headers): void {
  const limit = parseInt(responseHeaders.get("x-ratelimit-limit") ?? "1", 10);
  const remaining = parseInt(responseHeaders.get("x-ratelimit-remaining") ?? "0", 10);
  const resetAt = parseTime(responseHeaders.get("x-ratelimit-reset") ?? "");
  // Quota already used vs. fraction of the rate-limit window still ahead
  const utilization = 1.0 - remaining / limit;
  const timeRemainingPct = (resetAt - now()) / windowSize;
if (utilization > timeRemainingPct) {
    warn("Approaching rate limit - slowing down");
addDelay(calculateBackoff(utilization));
}
}
```
## Interview Questions
### ⭐⭐⭐ _(Anthropic, Google)_
**Q:** Design error recovery for an AI agent that handles context overflow mid-task.
Answer
The agent maintains a transition system where each loop iteration classifies the outcome. When the API returns a 'prompt too long' error, the loop switches to a reactive-compact transition: it summarizes older conversation history while preserving recent tool results (the expensive-to-regenerate working memory), then retries the same request with the compacted messages. A one-shot flag guards the retry - if compaction has already been attempted and the prompt is still too long, the loop gives up and surfaces a model error instead of burning tokens in an infinite compact-and-retry cycle. Other outcomes map onto their own transitions (max-output-tokens escalation, rate-limit backoff, user abort), so the loop always terminates in a well-defined state.
### ⭐⭐⭐ _(Databricks, Meta)_
**Q:** How would you implement graceful degradation when an LLM API rate-limits you?
Answer
Layer the solution: (1) Parse rate limit headers (x-ratelimit-remaining, x-ratelimit-reset) from every response - not just error responses. (2) Calculate the utilization ratio: if you have burned through more of the quota than the elapsed fraction of the rate-limit window, start adding delay between requests before the hard limit is hit. (3) When a 429 does arrive, honor retry-after and back off exponentially with jitter so concurrent sessions don't retry in lockstep. (4) Degrade rather than fail: queue non-urgent work, shrink context (fewer retrieved files, truncated tool output), or fall back to a cheaper model with separate quota. The goal is to slow down proactively so the user sees slightly higher latency instead of hard failures.
### ⭐⭐⭐ _(OpenAI)_
**Q:** What kinds of API errors are transient versus permanent, and how should an agent handle each?
Answer
Transient errors are recoverable by retrying the same request: rate limits (429), network timeouts, server errors (500/503), and overloaded responses. Permanent errors will fail identically on retry: invalid or malformed requests, authentication and permission failures, and a context that is still too long after compaction has already been attempted. The loop retries transients with backoff and a retry cap, while permanents end the turn as a model_error transition so the user can intervene - retrying them only burns tokens.
### ⭐⭐⭐ _(OpenAI)_
**Q:** Design an error recovery system that distinguishes between transient failures (retry) and permanent failures (escalate) for LLM tool calls.
Answer
Build a typed error taxonomy at the tool boundary. Transient failures: network timeouts, rate limit 429s, Bash exit codes from flaky external services (curl timeout, test runner OOM). Permanent failures: missing file paths (ENOENT on a Read), schema validation errors (the tool was called with malformed input), content policy blocks, and budget exhaustion. The classification is encoded in the tool executor's result: each failure carries a typed code that the loop maps to an action - retry with backoff for transients, an error tool_result the model can correct for fixable input problems, and escalation to the user for permanent failures like policy blocks or budget exhaustion.
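A minimal sketch of such a taxonomy at the tool boundary - the status codes and error-code strings are illustrative examples, not a complete list from any particular SDK:
```typescript
type FailureKind = "transient" | "permanent";

interface ToolFailure {
  kind: FailureKind;
  reason: string;
  retryAfterMs?: number; // only meaningful for transient failures
}

function classifyToolFailure(httpStatus: number | null, code: string): ToolFailure {
  if (httpStatus === 429) return { kind: "transient", reason: "rate_limited", retryAfterMs: 5_000 };
  if (httpStatus !== null && httpStatus >= 500) return { kind: "transient", reason: "server_error", retryAfterMs: 1_000 };
  if (code === "ETIMEDOUT" || code === "ECONNRESET") {
    return { kind: "transient", reason: "network", retryAfterMs: 1_000 };
  }
  if (code === "ENOENT") return { kind: "permanent", reason: "missing_file" };              // e.g. Read on a bad path
  if (code === "SCHEMA_VALIDATION") return { kind: "permanent", reason: "bad_tool_input" }; // malformed tool arguments
  return { kind: "permanent", reason: "unknown" }; // default to escalation, never to silent retry
}
```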
## Further Reading
- [Anthropic API Error Codes](https://docs.anthropic.com/en/api/errors)
Official reference for API error types, status codes, and recommended handling strategies.
- [Exponential Backoff and Jitter (AWS)](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
AWS architecture blog on backoff strategies - full jitter outperforms equal jitter and decorrelated jitter.
- [Circuit Breaker Pattern (Martin Fowler)](https://martinfowler.com/bliki/CircuitBreaker.html)
The pattern for preventing cascading failures when a downstream service is unhealthy.
- [Claude Code (source)](https://github.com/anthropics/claude-code)
Open-source reference for the error recovery and transition system described in this module.
- [Release It! โ Production-Ready Software (Nygard)](https://pragprog.com/titles/mnee2/release-it-second-edition/)
The book that codified circuit breakers, bulkheads, and timeouts - the stability patterns directly applied in agent error recovery.
- [AbortController and AbortSignal (MDN)](https://developer.mozilla.org/en-US/docs/Web/API/AbortController)
The browser/Node.js API for cooperative cancellation - the mechanism behind Ctrl+C propagation through the streaming pipeline.
- [Google SRE Book: Handling Overload](https://sre.google/sre-book/handling-overload/)
Google SRE's chapter on handling overload - load shedding, client throttling, and graceful degradation, the server-side counterpart to client backoff.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Speculative Execution"
part: "AI Engineering"
number: 60
emoji: "🔮"
subtitle: "Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria"
tags: ["engineering", "ml", "ai-engineering", "interview-prep", "agent-sdk"]
---
# 🔮 Speculative Execution
> Parallel speculation, overlay filesystems, safe tool subsets, and acceptance criteria
> [!question] Key Question
> While you're still typing, a speculative agent already searched the codebase for you
← Error Recovery | → Coordinator/Worker Pattern
## Key Insights
> [!tip] Insight
> The overlay filesystem is the key safety mechanism. Like Docker's layered filesystem, reads fall through to the real FS while writes go to a temporary layer. This is an application-managed copy-on-write abstraction: merge copies changed files back to the real FS, discard deletes the temp layer. Both operations are lightweight (proportional to files changed, not total codebase size).
> [!tip] Insight
> The accept/reject decision compares the user's actual message against the speculation's predicted intent - not exact string matching, but semantic alignment. "Fix the bug" and "Can you fix that error?" both align with a speculation that investigated the error.
> [!tip] Insight
> Speculation is suppressed ~60% of the time because most turns are either cheap (not worth speculating) or read-only (nothing actionable to predict). The 40% of turns where speculation runs tend to be high-value: after file edits, after bug fixes, after complex tool chains - exactly the moments when pre-computation saves the most time.
## Code Examples
```typescript
const SpeculationState = {
IDLE: "idle",
RUNNING: "running",
ACCEPTED: "accepted",
REJECTED: "rejected",
} as const;
type SpeculationStateValue = typeof SpeculationState[keyof typeof SpeculationState];
class SpeculativeExecutor {
private state: SpeculationStateValue = SpeculationState.IDLE;
private overlayFs: OverlayFileSystem = new OverlayFileSystem();
private safeTools: string[] = ["Read", "Glob", "Grep", "TaskGet", "TaskList"];
private result: unknown = null;
// Run speculative work in background
  async speculate(conversation: Conversation): Promise<void> {
if (this.shouldSuppress(conversation)) return;
this.state = SpeculationState.RUNNING;
// Predict next steps
const prediction = await predictNextAction(conversation);
// Run with overlay FS and restricted tools
const engine = new QueryEngine({
tools: filterTools(this.safeTools),
filesystem: this.overlayFs, // writes go to overlay
});
this.result = await engine.submit(prediction);
}
// Check if speculation matches user intent
onUserMessage(message: string): void {
if (this.state !== SpeculationState.RUNNING) return;
if (alignsWithSpeculation(message, this.result)) {
this.state = SpeculationState.ACCEPTED;
this.overlayFs.mergeToReal(); // apply cached work
} else {
this.state = SpeculationState.REJECTED;
this.overlayFs.discard(); // throw away, no harm
}
}
// Don't speculate if it's not worth it
private shouldSuppress(conversation: Conversation): boolean {
if (conversation.lastTurnCost < threshold) return true; // cheap turn
if (conversation.lastToolWasReadOnly) return true; // nothing to speculate
return false;
}
}
```
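`alignsWithSpeculation` is left abstract above. One hedged way to implement the semantic-alignment check the insights describe is a cheap yes/no model call comparing the user's actual message with the speculation's predicted intent - the prompt wording, the model choice, and the `predictedIntent` field are assumptions for illustration, and unlike the synchronous call in the sketch above, this version is async and would need awaiting:
```typescript
async function alignsWithSpeculation(
  userMessage: string,
  speculation: { predictedIntent: string }
): Promise<boolean> {
  // Reuse the same QueryEngine abstraction as above, but with no tools and a cheap model
  const judge = new QueryEngine({ model: "claude-haiku", tools: [] });
  const verdict = await judge.submit(
    `Background work was prepared for this predicted request:\n` +
      `"${speculation.predictedIntent}"\n\n` +
      `The user actually said:\n"${userMessage}"\n\n` +
      `Does the prepared work still apply? Answer only "yes" or "no".`
  );
  return String(verdict).trim().toLowerCase().startsWith("yes");
}
```
Embedding similarity between the two strings is a cheaper alternative when a model call per turn is too expensive.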
```typescript
// Copy-on-write filesystem - reads from real, writes to temp
class OverlayFileSystem {
  private overlay: Map<string, string> = new Map(); // path -> content
read(path: string): string {
if (this.overlay.has(path)) {
return this.overlay.get(path)!;
}
return realFs.read(path);
}
write(path: string, content: string): void {
this.overlay.set(path, content); // never touches real FS
}
mergeToReal(): void {
for (const [path, content] of this.overlay) {
realFs.write(path, content);
}
}
discard(): void {
this.overlay.clear();
}
}
```
## Interview Questions
### ⭐⭐⭐ _(Anthropic, Google)_
**Q:** Design a speculative execution system for an AI agent. How do you ensure safety?
Answer
Three isolation layers: (1) Overlay filesystem - reads from the real FS but writes go to a temporary copy-on-write layer. If speculation is wrong, discard the overlay; if right, merge it. This is the same principle as OverlayFS in Docker containers. (2) Tool filtering - only allow read-only tools (Read, Glob, Grep, TaskGet, TaskList). No Bash, no Edit, no network calls. Even if the speculative agent hallucinates, it can't run destructive commands or exfiltrate anything. (3) Commit gating - nothing from the overlay is merged into the real filesystem until the user's next message confirms the speculation matched their intent; misaligned speculation is discarded wholesale, so the worst case is wasted compute, never corrupted state.
### ⭐⭐⭐ _(Meta)_
**Q:** What is an overlay filesystem, and why use one instead of copying the repo or leaning on git?
Answer
Overlay FS is a copy-on-write layer: reads fall through to the real FS, writes go to a temp directory. It never mutates the real working tree while speculation runs; merge copies back only the files that changed, and discard simply drops the temp layer, so both operations cost time proportional to the files touched, not the repository size. Compared with copying the whole repo it is far cheaper, and compared with git stash or a throwaway branch it also covers untracked and generated files and never disturbs the user's working tree or index mid-session.
### ⭐⭐⭐ _(OpenAI, Databricks)_
**Q:** How would you decide when speculation is worth the compute cost?
Answer
Build a suppression heuristic with three signals: (1) Last turn cost - if the previous turn was cheap (simple question, no tool calls), speculation is unlikely to save time. Only speculate after expensive turns (multiple tool calls, code generation). (2) Task type - after a file edit, the user likely wants to test or review; after a bug fix, they likely want to verify. Read-only exploration turns don't leave anything actionable to pre-compute, so skip them. (3) Historical accept rate - if past speculations for this kind of turn were mostly rejected, raise the bar. In practice these checks suppress speculation on roughly 60% of turns, concentrating spend on the post-edit and post-fix moments where pre-computed work is most likely to be accepted.
### ⭐⭐⭐ _(Anthropic)_
**Q:** What verification strategy prevents speculative execution from committing side effects that the user hasn't approved?
Answer
Two complementary layers: tool filtering and overlay commit gating. Tool filtering is the first line - the speculative agent only gets a read-only tool subset, so it cannot run Bash, write files directly, or touch the network at all. Overlay commit gating is the second - any writes it does make land in the copy-on-write layer, and that layer is merged into the real filesystem only after the user's next message is judged to align with the speculation's predicted intent. If the alignment check fails, the overlay is discarded, so an unapproved speculation can never leave visible side effects.
## Further Reading
- [OverlayFS Documentation (Linux Kernel)](https://docs.kernel.org/filesystems/overlayfs.html)
The kernel filesystem that inspired the copy-on-write pattern used in speculative execution.
- [Speculative Execution in CPUs (Hennessy & Patterson)](https://en.wikipedia.org/wiki/Speculative_execution)
The CPU architecture concept - predict the branch, execute speculatively, commit or rollback.
- [Branch Prediction (Wikipedia)](https://en.wikipedia.org/wiki/Branch_predictor)
How CPUs predict which branch to take - the same predict-execute-verify pattern applies to agent speculation.
- [Claude Code (source)](https://github.com/anthropics/claude-code)
Open-source agentic coding tool that may contain related implementation ideas for speculative execution and overlay-based isolation.
- [Spectre and Meltdown: Lessons for Software Design](https://meltdownattack.com/)
The real-world consequences of CPU speculative execution gone wrong - illustrates why side-effect isolation (overlay FS) is non-negotiable before committing speculative work.
- [Copy-on-Write Semantics (Linux man page: fork)](https://man7.org/linux/man-pages/man2/fork.2.html)
The OS-level COW primitive that makes fork cheap - the same copy-on-write principle applied to filesystem overlays in agent speculative execution.
- [Git Stash and Worktrees](https://git-scm.com/docs/git-stash)
The git primitives for saving and restoring working state - the lightweight alternative to overlay FS for speculative edits confined to git-tracked files.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Coordinator/Worker Pattern"
part: "AI Engineering"
number: 61
emoji: "๐"
subtitle: "Multi-agent coordination, restricted tool sets, environment gating, and task distribution"
tags: ["engineering", "ml", "ai-engineering", "interview-prep", "agent-sdk"]
---
# ๐ Coordinator/Worker Pattern
> Multi-agent coordination, restricted tool sets, environment gating, and task distribution
> [!question] Key Question
> The coordinator writes prompts, not code - it manages a team of worker agents
← Speculative Execution | → Session Persistence
## Key Insights
> [!tip] Insight
> The coordinator's context window stays clean for high-level decisions. It never fills up with implementation details, diffs, or compiler output - that stays in worker contexts.
> [!tip] Insight
> This is the pattern behind complex multi-file refactors: the coordinator reads the codebase, plans the decomposition, spawns 2-5 workers for independent subtasks, then verifies integration. It's MapReduce for code changes.
> [!tip] Insight
> The coordination overhead (10-15% of tokens spent on planning instead of executing) pays for itself when tasks have dependencies. For a 4-worker refactor, the alternative is 4 independent agents that each spend 30%+ of their context re-discovering the plan.
## Code Examples
```typescript
// Coordinator manages workers - never writes code directly
class CoordinatorMode {
private coordinatorTools: string[] = [
"Agent", // spawn workers
"SendMessage", // communicate with running workers
"TeamCreate", // create worker teams
"Read", "Glob", "Grep", // can READ code to plan
    // NO Bash, Edit, Write - coordinator doesn't code
];
private workerTools: string[] = [
"Bash", "Read", "Write", "Edit", "Glob", "Grep",
    // NO Agent - workers can't spawn sub-workers
];
  async coordinate(task: Task): Promise<unknown> {
// 1. Analyze the task
const plan = await this.planDecomposition(task);
// 2. Spawn workers for each subtask
const workers: Agent[] = [];
for (const subtask of plan.subtasks) {
const worker = await spawnAgent({
prompt: subtask.description,
tools: this.workerTools,
background: true, // parallel execution
});
workers.push(worker);
}
// 3. Monitor and aggregate results
const results = await gatherResults(workers);
// 4. Verify and integrate
return this.verifyIntegration(results);
}
}
```
```typescript
const COORDINATOR_TOOLS: string[] = [
"Agent", "SendMessage", "TeamCreate", "TeamDelete",
"Read", "Glob", "Grep", // read-only access
];
const WORKER_TOOLS: string[] = [
"Bash", "Read", "Write", "Edit", "Glob", "Grep",
  // No Agent - strict two-level hierarchy
];
const ALL_TOOLS: string[] = [...COORDINATOR_TOOLS, ...WORKER_TOOLS];
// Tool set determined at startup, not by the agent
function getAvailableTools(): string[] {
if (process.env.COORDINATOR_MODE) {
return COORDINATOR_TOOLS; // management only
}
return ALL_TOOLS; // full access for single-agent mode
}
// Workers are spawned with explicit tool lists
async function spawnWorker(taskDescription: string): Promise<Agent> {
return await Agent({
prompt: taskDescription,
tools: WORKER_TOOLS, // enforced at registry level
    // Agent tool NOT in WORKER_TOOLS - no recursion possible
});
}
```
```typescript
// Spawn workers for independent subtasks concurrently
async function runParallelWorkers(
subtasks: Subtask[],
workerTools: string[]
): Promise<WorkerResult[]> {
  async function runOne(subtask: Subtask): Promise<WorkerResult> {
const result = await spawnAgent({
prompt: subtask.description,
tools: workerTools,
background: true,
});
return { subtask: subtask.id, result };
}
// All workers run simultaneously
const results = await Promise.all(subtasks.map(runOne));
// Check for conflicts
  const modifiedFiles = new Map<string, string>();
for (const r of results) {
for (const f of r.result.filesModified) {
if (modifiedFiles.has(f)) {
throw new ConflictError(
          `${f} modified by workers ${modifiedFiles.get(f)} and ${r.subtask}`
);
}
modifiedFiles.set(f, r.subtask);
}
}
return results;
}
```
## Interview Questions
### ⭐⭐⭐ _(Anthropic, OpenAI)_
**Q:** Design a multi-agent system where a coordinator delegates to specialized workers.
Answer
The coordinator is a special agent that only has management tools (Agent, SendMessage, Read, Grep) - it can plan and observe but never write code. It decomposes tasks, spawns worker agents with restricted tool sets (Bash, Edit, Write - but no Agent tool, preventing uncontrolled recursion). Workers execute in parallel on independent subtasks and report results via tool_result. The coordinator aggregates results, detects conflicts (e.g., two workers editing the same file), and decides next steps. Key design choices: (1) tool-level isolation prevents recursion, (2) parallel workers maximize throughput, (3) the coordinator's context stays free of diffs and compiler output, so its planning quality doesn't degrade as the task grows.
### ⭐⭐⭐ _(Google)_
**Q:** How do you prevent uncontrolled recursion in a system where agents can spawn agents?
Answer
Remove the Agent tool from worker tool sets. Workers can execute code (Bash, Edit, Write) but cannot spawn sub-workers. This is enforced at the tool registry level - when a worker is created, its available tools are filtered to exclude Agent. This creates a strict two-level hierarchy: coordinator spawns workers, workers execute and return. No deeper nesting. The alternative, depth limits, is fragile because a depth-3 agent tree consumes 3x context and 3x API cost with no coordination. The flat coordinator/worker pattern is simpler, cheaper, and easier to debug.
### ⭐⭐⭐ _(Meta, Anthropic)_
**Q:** What are the tradeoffs between a flat pool of peer agents and a hierarchical coordinator/worker structure?
Answer
Flat pool: all agents are equal, no coordinator. Simple to implement, but no one owns the plan - agents may duplicate work, conflict on shared files, or miss integration issues. Works for embarrassingly parallel tasks (lint 10 files). Hierarchical: coordinator owns the plan, workers own execution. Higher overhead (the coordinator consumes tokens just to manage), but critical for tasks with dependencies - multi-file refactors, cross-module changes, anything requiring integration. The coordinator pattern pays for itself when subtasks interact: it detects conflicts early and re-plans, whereas a flat pool discovers conflicts at merge time.
### ⭐⭐⭐ _(Google)_
**Q:** How would you implement work-stealing between coordinator-worker agents when one worker finishes early?
Answer
Model the task queue as a shared priority queue that the coordinator owns and workers pull from, rather than pre-assigning all subtasks at spawn time. Workers request the next task when they complete their current one: the coordinator re-ranks the remaining queue as results arrive, can split a large remaining subtask so an idle worker isn't wasted, and withholds any task whose files overlap with work still in flight. The key invariant is that the coordinator remains the single owner of assignment - workers never grab tasks directly from each other - so conflict detection and re-planning stay centralized.
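A minimal sketch of that pull-based queue - the types, priority scheme, and file-locking rule are illustrative:
```typescript
interface QueuedTask {
  id: string;
  description: string;
  touches: string[]; // files this task is expected to modify
  priority: number;  // higher runs first
}

class CoordinatorTaskQueue {
  private tasks: QueuedTask[] = [];
  private lockedFiles = new Set<string>();

  add(task: QueuedTask): void {
    this.tasks.push(task);
    this.tasks.sort((a, b) => b.priority - a.priority);
  }

  // Called (via the coordinator) whenever a worker finishes its current task.
  claimNext(): QueuedTask | null {
    const idx = this.tasks.findIndex(
      (t) => !t.touches.some((f) => this.lockedFiles.has(f)) // skip tasks whose files are in flight
    );
    if (idx === -1) return null; // nothing safe to hand out right now
    const [task] = this.tasks.splice(idx, 1);
    for (const f of task.touches) this.lockedFiles.add(f);
    return task;
  }

  complete(task: QueuedTask): void {
    for (const f of task.touches) this.lockedFiles.delete(f); // release the file locks
  }
}
```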
## Further Reading
- [MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/)
Dean & Ghemawat, 2004 - the original coordinator/worker pattern for distributed computation.
- [A Universal Modular ACTOR Formalism for Artificial Intelligence](https://dl.acm.org/doi/10.5555/1624775.1624804)
Hewitt et al., 1973 - the Actor Model that underpins modern multi-agent message passing.
- [Large Language Model based Multi-Agents: A Survey of Progress and Challenges](https://arxiv.org/abs/2402.01680)
Recent survey of LLM-based multi-agent architectures and coordination patterns.
- [Claude Code (source)](https://github.com/anthropics/claude-code)
Open-source reference implementing the coordinator/worker pattern for real-world coding tasks.
- [LLM-Compiler: Parallel Function Calling](https://arxiv.org/abs/2312.04511)
Kim et al., 2023 - DAG-based parallel function call planning, the same dependency-aware parallelism coordinators use to maximize worker throughput.
- [Anthropic: Building Effective Agents โ Orchestrator Subagent](https://www.anthropic.com/research/building-effective-agents)
Anthropic's guide to agent design patterns, including the orchestrator-workers workflow that the coordinator/worker mode implements.
- [Celery: Distributed Task Queue](https://docs.celeryq.dev/)
The production distributed task queue - the software engineering analog of the coordinator/worker pattern, with retries, priorities, and result backends.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Session Persistence"
part: "AI Engineering"
number: 62
emoji: "💾"
subtitle: "Session JSON, /resume reconstruction, message history, file snapshots, and attribution"
tags: ["engineering", "ml", "ai-engineering", "interview-prep", "agent-sdk"]
---
# 💾 Session Persistence
> Session JSON, /resume reconstruction, message history, file snapshots, and attribution
> [!question] Key Question
> Close the terminal, reopen it, type --resume - the conversation continues exactly where you left off
← Coordinator/Worker Pattern | → Cost Tracking & Budgets
## Key Insights
> [!tip] Insight
> Sessions are typically 50KB-5MB depending on conversation length. Long coding sessions with many tool results can grow larger because tool_result content (file contents, grep output) is stored verbatim in the message array.
> [!tip] Insight
> Input history (arrow-key recall) is stored separately from sessions in ~/.claude/history.jsonl. It persists across all sessions - your previous inputs are always available regardless of which session you resume.
> [!tip] Insight
> Session files accumulate over time and are never automatically deleted. Heavy users can have hundreds of session files totaling 100MB+. A pruning strategy (delete sessions older than 30 days, or keep only the last 50) would help, but risks deleting sessions users want to resume.
## Code Examples
```typescript
import fs from "fs";
import path from "path";
import os from "os";
// Sessions live under ~/.claude/projects/<projectDir>/<sessionId>.jsonl
const PROJECTS_DIR = path.join(os.homedir(), ".claude", "projects");
function getSessionPath(projectDir: string, sessionId: string): string {
  return path.join(PROJECTS_DIR, projectDir, `${sessionId}.jsonl`);
}
class SessionStorage {
// Persist session: append one JSON event per line (JSONL)
save(projectDir: string, sessionId: string, state: SessionState): void {
const filePath = getSessionPath(projectDir, sessionId);
const event = {
id: sessionId,
messages: serializeMessages(state.messages),
model: state.model,
cost_usd: state.totalCost,
file_history: state.filesTouched,
created_at: state.createdAt,
updated_at: new Date().toISOString(),
cwd: state.workingDirectory,
};
    fs.appendFileSync(filePath, JSON.stringify(event) + "\n");
}
// Reconstruct session state: read all lines, use last event
restore(projectDir: string, sessionId: string): QueryEngine {
const filePath = getSessionPath(projectDir, sessionId);
    const lines = fs.readFileSync(filePath, "utf-8").trim().split("\n");
const data = JSON.parse(lines[lines.length - 1]!);
// Reconstruct multi-domain state
const engine = new QueryEngine({
initialMessages: deserializeMessages(data.messages),
model: data.model,
cwd: data.cwd,
});
// Restore auxiliary state
restoreFileHistory(data.file_history);
restoreAttribution(data);
extractPendingTodos(data.messages);
return engine;
}
// List recent sessions for selection
listRecent(projectDir: string, limit: number = 20): SessionMetadata[] {
const dir = path.join(PROJECTS_DIR, projectDir);
const files = fs.readdirSync(dir)
.filter((f) => f.endsWith(".jsonl"))
.map((f) => ({ f, mtime: fs.statSync(path.join(dir, f)).mtime }))
.sort((a, b) => b.mtime.getTime() - a.mtime.getTime())
.slice(0, limit);
return files.map((f) => this.readMetadata(f.f));
}
}
```
```typescript
// Arrow-key recall of previous inputs - persists across all sessions
class InputHistory {
  // ~/.claude/history.jsonl - shared across all projects/sessions
private historyPath: string = path.join(os.homedir(), ".claude", "history.jsonl");
private entries: string[] = this.load();
private cursor: number = this.entries.length;
add(inputText: string): void {
this.entries.push(inputText);
this.save(); // append to file immediately
}
// Up/down arrow through history
navigate(direction: "up" | "down"): string {
if (direction === "up") {
this.cursor = Math.max(0, this.cursor - 1);
} else {
this.cursor = Math.min(this.entries.length - 1, this.cursor + 1);
}
return this.entries[this.cursor];
}
private load(): string[] {
if (fs.existsSync(this.historyPath)) {
// Each line is a JSON object; extract the display field for arrow-key recall
return fs.readFileSync(this.historyPath, "utf-8")
        .split("\n")
.filter(Boolean)
.map((line) => { try { return JSON.parse(line).display ?? line; } catch { return line; } });
}
return [];
}
private save(): void {
// Keep last 100 entries (MAX_HISTORY_ITEMS in source)
const recent = this.entries.slice(-100);
    fs.writeFileSync(this.historyPath, recent.map((e) => JSON.stringify({ display: e })).join("\n") + "\n");
}
}
```
```typescript
// Resume creates a NEW session that inherits old context
function resumeSession(oldSessionId: string): Session {
const storage = new SessionStorage();
const oldData = storage.restore(oldSessionId);
  // New session ID - the old session is read-only now
const newSessionId = generateUuid();
// The new session starts with old messages as context
const newSession = new Session({
id: newSessionId,
parentId: oldSessionId, // adoption link
initialMessages: oldData.messages,
model: oldData.model,
cwd: oldData.cwd,
});
// Cost tracking starts fresh for the new session,
// but we can show cumulative cost across the chain
newSession.inheritedCost = oldData.cost_usd;
return newSession;
}
```
## Interview Questions
### ⭐⭐⭐ _(Anthropic)_
**Q:** Design a session persistence system for an AI agent that handles multi-domain state.
Answer
The session isn't just a chat transcript - it spans several domains that all need to survive a restart: the message array (including tool_use/tool_result pairs), file history for attribution and undo, pending todos, cost accounting, the model choice, and the working directory. Persist them together as one JSON event appended per turn (JSONL), keyed by session id under the project directory. On /resume, read the file, take the latest event, rebuild the engine with the deserialized messages, then restore each auxiliary domain (file history, attribution, todos) from its own section. An append-only file per session keeps writes cheap and crash recovery simple: the last complete line is always a consistent snapshot.
### ⭐⭐⭐ _(Google)_
**Q:** How would you handle session corruption or migration when the schema changes?
Answer
Version the schema: every session JSON includes a schema_version field. On load, check the version and run migration functions if needed (v1 -> v2 adds cost tracking, v2 -> v3 renames fields). For corruption: validate JSON structure before deserializing - if parsing fails, try to recover the message array (the most valuable part) and discard corrupted auxiliary data. Never silently drop sessions - surface a warning. For large-scale migrations: lazy migration on load (don't rewrite every old session file up front) - migrate a session only when it is actually resumed, then write it back under the new schema_version so the cost is paid once, and only for sessions that still matter.
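A sketch of that versioned migration chain - the field names and version numbers are illustrative:
```typescript
interface SessionData {
  schema_version: number;
  [key: string]: unknown;
}

const CURRENT_SCHEMA_VERSION = 3;

// Each migration upgrades exactly one version step.
const MIGRATIONS: Record<number, (s: SessionData) => SessionData> = {
  1: (s) => ({ ...s, schema_version: 2, cost_usd: s.cost_usd ?? 0 }),    // v1 -> v2: add cost tracking
  2: (s) => ({ ...s, schema_version: 3, file_history: s.files ?? [] }),  // v2 -> v3: rename a field
};

function migrateSession(raw: SessionData): SessionData {
  let session = raw;
  while (session.schema_version < CURRENT_SCHEMA_VERSION) {
    const step = MIGRATIONS[session.schema_version];
    if (!step) throw new Error(`No migration path from schema v${session.schema_version}`);
    session = step(session);
  }
  return session;
}
```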
### ⭐⭐⭐ _(OpenAI)_
**Q:** What are the tradeoffs between saving session state on every turn versus only on exit?
Answer
Save every turn: durable against crashes (no lost work), but high I/O overhead - writing 50KB-5MB JSON on every API response. Save on exit: minimal I/O, but a crash loses the entire session. The pragmatic middle ground: save on exit + periodic checkpoints (every N turns or every M seconds). Use write-ahead logging if durability matters: append each turn to a log file (fast, sequential writes), and periodically compact the log into a full snapshot. This gives crash recovery (replay the log) with low per-turn overhead.
### ⭐⭐⭐ _(Anthropic)_
**Q:** Design a session persistence format that allows resuming a conversation after a crash, including in-flight tool calls.
Answer
The session format must capture not just completed turns but the agent's in-flight state: tool_use blocks that were issued but have no tool_result yet, plus the transition the loop was in when the process died. Appending one event per turn (JSONL) gives this almost for free - the last complete line is the newest consistent snapshot, and a torn final line can simply be ignored. On resume, scan for tool_use ids with no matching tool_result: re-run the idempotent ones (Read, Grep), and for anything with side effects insert a synthetic error tool_result marking the call as interrupted, so the message history stays valid and the model can decide whether to redo it.
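A sketch of the repair pass that makes a crashed history valid again - the block shapes mirror the tool_use/tool_result pairing described above, but the helper and its synthetic result are illustrative:
```typescript
interface ContentBlock {
  type: "text" | "tool_use" | "tool_result";
  id?: string;          // set on tool_use blocks
  tool_use_id?: string; // set on tool_result blocks
}

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

function repairInFlightToolCalls(messages: Message[]): Message[] {
  const answered = new Set(
    messages.flatMap((m) => m.content.filter((b) => b.type === "tool_result").map((b) => b.tool_use_id))
  );
  const orphaned = messages.flatMap((m) =>
    m.content.filter((b) => b.type === "tool_use" && !answered.has(b.id))
  );
  if (orphaned.length === 0) return messages;
  // Close each dangling tool_use with a synthetic "interrupted" result so the
  // resumed conversation is well-formed and the model can decide whether to redo the call.
  return [
    ...messages,
    {
      role: "user",
      content: orphaned.map((b) => ({ type: "tool_result" as const, tool_use_id: b.id })),
    },
  ];
}
```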
## Further Reading
- [Event Sourcing Pattern](https://martinfowler.com/eaaDev/EventSourcing.html)
Martin Fowler - storing state as a sequence of events, the pattern behind session replay.
- [SQLite Write-Ahead Logging](https://www.sqlite.org/wal.html)
The WAL mechanism that enables concurrent reads during writes - relevant to session checkpoint design.
- [Redis Persistence: RDB vs AOF](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/)
Two persistence strategies (snapshot vs append-only) that mirror the session save tradeoffs.
- [Claude Code (source)](https://github.com/anthropics/claude-code)
Open-source reference for session persistence, /resume, and multi-domain state reconstruction.
- [CQRS and Event Sourcing (Microsoft Azure Docs)](https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs)
Command/Query Responsibility Segregation with event sourcing - the pattern behind session replay: store events, not snapshots, then replay to reconstruct state.
- [SQLite: The Appropriate Uses for SQLite](https://www.sqlite.org/whentouse.html)
SQLite's own guidance on when an embedded, single-file database is the right tool - relevant once hundreds of flat session files become unwieldy.
- [tmux Session Management](https://github.com/tmux/tmux/wiki)
The gold standard for terminal session persistence - background processes, detach/attach, and named sessions; the UX model Claude Code's /resume emulates.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Cost Tracking & Budgets"
part: "AI Engineering"
number: 63
emoji: "💰"
subtitle: "Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts"
tags: ["engineering", "ml", "ai-engineering", "interview-prep", "agent-sdk"]
---
# 💰 Cost Tracking & Budgets
> Token counting, budget limits, per-model pricing, rate limit handling, and spend alerts
> [!question] Key Question
> Every tool call has a price - the agent tracks spend in real-time and stops before you go broke
← Session Persistence
## Key Insights
> [!tip] Insight
> Output tokens cost 5x more than input tokens on Claude models ($75/M vs $15/M for Opus). A verbose agent that generates long explanations costs far more than one that gives concise answers. This is why agent prompts often say "be concise."
> [!tip] Insight
> The cost tracker can also estimate remaining turns: divide remaining budget by average cost per turn. This lets the agent prioritize - if only 3 turns remain, skip exploration and go straight to implementation.
> [!tip] Insight
> The 5x output-to-input cost ratio means that a concise agent (generating 500 output tokens per turn) costs 5x less in output than a verbose one (2,500 tokens). Over 50 turns, that's roughly $1.88 vs $9.38 in output costs alone - conciseness is a cost optimization.
## Code Examples
```typescript
// Per-model pricing (approximate, illustrative)
interface ModelPricing {
input: number;
output: number;
cacheRead: number;
cacheWrite: number;
}
const PRICING: Record<string, ModelPricing> = {
"claude-opus-4": {
input: 15.00 / 1_000_000, // $15/M tokens
output: 75.00 / 1_000_000, // $75/M tokens
cacheRead: 1.50 / 1_000_000, // $1.50/M (10x cheaper!)
cacheWrite: 18.75 / 1_000_000,
},
"claude-sonnet-4": {
input: 3.00 / 1_000_000,
output: 15.00 / 1_000_000,
cacheRead: 0.30 / 1_000_000,
cacheWrite: 3.75 / 1_000_000,
},
};
```
```typescript
interface TokenUsage {
inputTokens: number;
outputTokens: number;
cacheReadTokens: number;
cacheCreationTokens: number;
}
class CostTracker {
private totalCost: number = 0;
private turnCosts: number[] = [];
constructor(
private model: string,
private maxBudget?: number,
) {}
// Called after every API response
recordUsage(usage: TokenUsage): number {
const pricing = PRICING[this.model];
const cost =
usage.inputTokens * pricing.input +
usage.outputTokens * pricing.output +
usage.cacheReadTokens * pricing.cacheRead +
usage.cacheCreationTokens * pricing.cacheWrite;
this.totalCost += cost;
this.turnCosts.push(cost);
if (this.maxBudget && this.totalCost >= this.maxBudget) {
throw new BudgetExceededError(
        `Budget $${this.maxBudget.toFixed(2)} exceeded (spent $${this.totalCost.toFixed(2)})`
);
}
return cost;
}
// Predict how many more turns the budget allows
estimateRemainingTurns(): number {
if (!this.turnCosts.length || !this.maxBudget) return Infinity;
const avgCost = this.turnCosts.reduce((a, b) => a + b, 0) / this.turnCosts.length;
const remaining = this.maxBudget - this.totalCost;
return Math.floor(remaining / avgCost);
}
}
```
```typescript
class RateLimitTracker {
private remaining: number = 0;
private limit: number = 0;
private resetAt: Date = new Date();
// Parse rate limit headers from API response
  update(headers: Record<string, string>): string {
this.remaining = parseInt(headers["x-ratelimit-remaining-tokens"]);
this.limit = parseInt(headers["x-ratelimit-limit-tokens"]);
this.resetAt = parseTime(headers["x-ratelimit-reset"]);
// Are we using tokens faster than they replenish?
const utilization = 1.0 - this.remaining / this.limit;
const timePct = timeRemainingPct(this.resetAt);
if (utilization > timePct + 0.1) {
return "WARNING: approaching rate limit";
}
return "OK";
}
// True if we should slow down to avoid hard limit
shouldThrottle(): boolean {
return this.remaining < this.limit * 0.1; // less than 10% remaining
}
}
```
## Interview Questions
### ⭐⭐⭐ _(Anthropic, Databricks)_
**Q:** Design a cost tracking system for an AI agent that handles multiple pricing tiers.
Answer
Each API response includes token counts: input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens. Multiply each by the per-model rate (different for input vs output vs cached). Track at three granularities: (1) per-turn - how much did this API call cost, (2) per-session - cumulative cost for budget enforcement, (3) per-tool-call - attribute cost to specific operations. Store per-model pricing as a lookup table, updated when pricing changes. Key: cache_read tokens cost ~10% of uncached input - so the tracker must distinguish them. Add budget limits (maxBudgetUsd) that raise BudgetExceededError when the session total exceeds the cap. Display real-time cost in the TUI status bar.
### ⭐⭐⭐ _(Google, OpenAI)_
**Q:** How does prompt caching affect the economics of AI agent systems?
Answer
Cached input tokens cost ~10% of uncached ones ($1.50/M vs $15/M for Opus). For an agent making 50+ API calls per task with a ~10K token system prompt, caching saves ~$7 per task on Opus. The system prompt (instructions + tool definitions) is the same across calls - it's the natural cache breakpoint. Cache writes carry a premium (~$18.75/M on Opus), so the first call costs slightly more and every subsequent read is dramatically cheaper; the break-even comes after the second call. The practical requirement is that the cached prefix stays byte-identical across calls - deterministic tool ordering, no timestamps or per-request noise in the system prompt - otherwise every call becomes a cache write instead of a read.
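The arithmetic behind that estimate, as a quick sketch - the call count and prompt size are the assumptions stated in the answer, and the rates come from the PRICING table above:
```typescript
const SYSTEM_PROMPT_TOKENS = 10_000;
const CALLS_PER_TASK = 50;

const opusInput = 15.0 / 1_000_000;       // $/token, uncached input
const opusCacheRead = 1.5 / 1_000_000;    // $/token, cached read
const opusCacheWrite = 18.75 / 1_000_000; // $/token, cache write

const uncached = SYSTEM_PROMPT_TOKENS * CALLS_PER_TASK * opusInput; // $7.50
const cached =
  SYSTEM_PROMPT_TOKENS * opusCacheWrite +                           // first call writes the cache
  SYSTEM_PROMPT_TOKENS * (CALLS_PER_TASK - 1) * opusCacheRead;      // later calls read it
console.log(`uncached: $${uncached.toFixed(2)}, cached: $${cached.toFixed(2)}`);
// uncached: $7.50, cached: $0.92 - roughly the ~$7/task saving cited above
```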
### ⭐⭐⭐ _(Anthropic)_
**Q:** What budget limits should an agent enforce: per-turn, per-session, or per-project?
Answer
All three, layered. Per-turn limits catch runaway single calls (an agent generating a 100K token response). Per-session limits cap total spend for a task ($10 default). Per-project limits enforce organizational budgets across all sessions. Implementation: check per-turn first (cheapest check), then per-session, then per-project. When any limit is hit, stop gracefully - save session state so the user can resume after increasing the limit. The tricky part is estimation: before executing an expensive operation, estimate its cost and warn if it would exceed the budget. This requires tracking average cost per turn type.
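A sketch of that layered check order - the limit names and defaults are illustrative:
```typescript
interface BudgetLimits {
  perTurnUsd: number;    // e.g. 1.00
  perSessionUsd: number; // e.g. 10.00
  perProjectUsd: number; // e.g. 200.00
}

type BudgetVerdict =
  | { ok: true }
  | { ok: false; scope: "turn" | "session" | "project"; message: string };

function checkBudgets(
  turnCost: number,
  sessionTotal: number,
  projectTotal: number,
  limits: BudgetLimits
): BudgetVerdict {
  // Cheapest, narrowest check first; broadest last.
  if (turnCost > limits.perTurnUsd) {
    return { ok: false, scope: "turn", message: `Turn cost $${turnCost.toFixed(2)} exceeds the per-turn cap` };
  }
  if (sessionTotal > limits.perSessionUsd) {
    return { ok: false, scope: "session", message: "Session budget exhausted; save state and stop gracefully" };
  }
  if (projectTotal > limits.perProjectUsd) {
    return { ok: false, scope: "project", message: "Project budget exhausted" };
  }
  return { ok: true };
}
```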
### ⭐⭐⭐ _(OpenAI)_
**Q:** Design a cost tracking system that predicts when a conversation will exceed a budget threshold and suggests cheaper alternatives.
Answer
Layer the system into three components: a retrospective tracker, a prospective estimator, and an advice engine. The retrospective tracker records cost per turn with token-type breakdown (input, output, cache_read) and computes a rolling average cost per turn type (tool-heavy turns vs. pure text turns). The prospective estimator runs before each API call: given the current session cost, the remaining budget, and the rolling average, it projects turns_remaining = (budget - cumulative_cost) / avg_cost_per_turn. When turns_remaining drops below a threshold (e.g., 5), the estimator emits a warning. The advice engine activates when the budget is tight and suggests: (1) switch from Opus to Sonnet (5x cheaper input, 5x cheaper output) if task complexity allows; (2) enable prompt caching by sorting tool definitions deterministically (recovers 90% of system prompt cost); (3) truncate verbose Bash output to 2K lines (cuts tool_result input tokens). The key design choice: advice is ranked by ROI (savings per unit of quality loss), not just absolute savings, so the agent suggests the cheapest optimization that preserves task quality.
## Further Reading
- [Anthropic API Pricing](https://docs.anthropic.com/en/docs/about-claude/models)
Current pricing for all Claude models - input, output, and cached token rates.
- [Prompt Caching with Claude](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
How prompt caching works, cache breakpoints, and cost implications for agent systems.
- [Token Economics of LLM Applications](https://a16z.com/generative-ai-enterprise-2024/)
a16z analysis of cost structures in production LLM applications.
- [Cloud Cost Optimization Patterns](https://cloud.google.com/architecture/cost-optimization)
Google Cloud cost optimization - the same principles (metering, budgets, alerts) apply to LLM spend.
- [LLM API Pricing Comparison (Artificial Analysis)](https://artificialanalysis.ai/)
Live benchmark tracking price, throughput, and latency across all major LLM providers - the reference for model selection decisions in cost-aware agents.
- [OpenTelemetry for LLM Observability](https://opentelemetry.io/docs/concepts/signals/metrics/)
The open standard for emitting cost, latency, and token metrics - the instrumentation layer beneath production LLM cost dashboards.
- [Simon Willison: Costs and Pricing for LLM APIs](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
Simon Willison's year-in-review look at the LLM landscape - including how sharply token prices fell and what that does to application economics.
## Related
Agent Harness Architecture · Tool System · Sub-agents · Commands & Skills · Plugins & MCP
---
---
title: "Mechanistic Interpretability"
part: "Trust & Evaluation"
number: 64
emoji: "🔬"
subtitle: "SAE training, activation patching, attribution graphs, circuit tracing, and feature steering"
tags: ["trust", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🔬 Mechanistic Interpretability
> SAE training, activation patching, attribution graphs, circuit tracing, and feature steering
> [!question] Key Question
> Anthropic traced a complete reasoning chain inside Claude - from question to multi-step feature activation to answer
← Safety & Alignment | → Induction Heads & ICL
## Key Insights
> [!tip] Insight
> The key insight: after training, each column of W_dec is a feature direction in the model's activation space. If column 4,721 consistently activates for “Golden Gate Bridge” text, that direction is the Golden Gate Bridge feature. You can read features by checking what inputs maximize each column's activation.
> [!tip] Insight
> This is circuit tracing - the core method. It combines sparse autoencoders (to name the features) with attribution patching (to measure which features caused which). The result is a computational graph you can read, not a black box.
> [!tip] Insight
> Hallucination mechanism (Biology paper, 2025): Circuit analysis found “known answer” features that suppress the model's default refusal circuit. When a question is asked about something the model genuinely knows, these features fire and allow the answer to proceed. Hallucination occurs when these “known answer” features activate despite the model not actually having sufficient knowledge - confidence gates open when they shouldn't.
> [!tip] Insight
> CLT limitations to know for interviews: Attribution graphs succeed on only ~25% of attempted prompts (the rest are too complex or ambiguous to yield clean sparse graphs). The replacement model, which substitutes CLT features for the original MLP computations, explains ~61% of end-to-end computation (replacement score 0.61) and matches the model's top-1 next-token prediction on ~50% of a filtered evaluation set (prompts where the base model predicts correctly with confidence below 80%), with ~11.5% normalized mean reconstruction error.
> [!tip] Insight
> Feature steering is the interpretability equivalent of unit testing: if the feature truly represents a concept, amplifying it should reliably inject that concept into outputs. It does. This is how we know SAE features are real computational objects, not just post-hoc labels.
> [!tip] Insight
> The “biology” framing is intentional but cautious. Anthropic draws analogies to neuroscience (circuits, features, neurons) but emphasizes these are mechanistic descriptions, not claims about consciousness or intent. The features are real computational objects; what they “mean” is inferred by humans looking at activation patterns.
> [!tip] Insight
> Start with Neuronpedia to build intuition for what SAE features look like, then move to TransformerLens when you want to run your own experiments. ARENA exercises bridge the gap between “I understand the theory” and “I can find circuits myself.”
## Code Examples
```python
import torch
import torch.nn as nn
from torch.optim import Adam
class SparseAutoencoder(nn.Module):
def __init__(self, d_model: int, expansion: int = 64):
super().__init__()
d_sae = d_model * expansion
self.W_enc = nn.Linear(d_model, d_sae, bias=True)
self.W_dec = nn.Linear(d_sae, d_model, bias=True)
self.relu = nn.ReLU()
# Normalize decoder columns to unit norm
self._normalize_decoder()
def _normalize_decoder(self):
with torch.no_grad():
norms = self.W_dec.weight.norm(dim=0, keepdim=True).clamp(min=1e-8)
self.W_dec.weight.div_(norms)
def forward(self, x: torch.Tensor):
# Center around decoder bias before encoding
x_cent = x - self.W_dec.bias
f = self.relu(self.W_enc(x_cent)) # sparse features
x_hat = self.W_dec(f) # reconstruction
return x_hat, f
def sae_loss(x, x_hat, f, lam: float = 5e-3):
recon = (x - x_hat).pow(2).mean() # MSE reconstruction
sparsity = f.abs().mean() # L1 on features
return recon + lam * sparsity, recon, sparsity
# Training loop sketch
sae = SparseAutoencoder(d_model=4096, expansion=64)
opt = Adam(sae.parameters(), lr=2e-4)
for activations in dataloader: # activations: [B, d_model]
opt.zero_grad()
x_hat, f = sae(activations)
loss, recon, sparse = sae_loss(activations, x_hat, f, lam=5e-3)
loss.backward()
opt.step()
sae._normalize_decoder() # keep decoder cols unit norm
```
```python
import torch
from contextlib import contextmanager
@contextmanager
def patch_activation(model, layer_name: str, patch_value: torch.Tensor):
"""Context manager to swap one layer's output mid-forward-pass."""
hooks = []
def hook_fn(module, input, output):
return patch_value # replace with clean-run activation
handle = dict(model.named_modules())[layer_name].register_forward_hook(hook_fn)
hooks.append(handle)
try:
yield
finally:
for h in hooks:
h.remove()
def activation_patching_score(model, clean_tokens, corrupt_tokens, layer_name,
clean_cache, metric_fn):
"""
Measure how much layer_name causally matters for the metric.
metric_fn(logits) -> scalar (e.g., logit diff between two tokens)
"""
# Baseline: corrupted run
with torch.no_grad():
corrupt_logits = model(corrupt_tokens)
baseline = metric_fn(corrupt_logits)
# Patched: corrupted run but swap in the clean activation
clean_act = clean_cache[layer_name]
with torch.no_grad():
with patch_activation(model, layer_name, clean_act):
patched_logits = model(corrupt_tokens)
patched = metric_fn(patched_logits)
return (patched - baseline).item() # positive = component helps
```
```python
# Logit lens: project intermediate residual stream to vocab space
import torch
def logit_lens(model, tokens: torch.Tensor):
"""
At each layer, unembed the residual stream directly to get
    a probability distribution over vocabulary - no more processing.
Shows what the model 'thinks' the next token is at each depth.
"""
    unembed = model.lm_head  # weight shape: (vocab, d_model)
ln_f = model.transformer.ln_f # final layer norm
residual_stream = []
def save_residual(module, inp, out):
# out[0] is the hidden state after this transformer block
h = out[0] if isinstance(out, tuple) else out
residual_stream.append(h.detach().clone())
hooks = [block.register_forward_hook(save_residual)
for block in model.transformer.h]
with torch.no_grad():
model(tokens)
for h in hooks:
h.remove()
# Project each layer's residual stream through the unembedding
layer_logits = []
for h in residual_stream:
normed = ln_f(h) # apply final norm
logits = normed @ unembed.weight.T # (B, seq, vocab)
layer_logits.append(logits[:, -1, :].softmax(-1)) # last position
return layer_logits # list of (B, vocab) โ one per layer
```
## Interview Questions
### ★★★ _(Anthropic)_
**Q:** Walk through training a sparse autoencoder on transformer activations. What are the key hyperparameters?
Answer
Training an SAE: (1) Collect activations from a target layer (e.g., MLP output or residual stream) across a large corpus — typically 1B+ tokens. (2) Center inputs by subtracting the decoder bias before encoding (prevents the bias from absorbing signal). (3) Train with loss = MSE(x, x_hat) + λ * ||f||_1. Key hyperparameters: expansion factor (d_sae / d_model) — Anthropic uses 32x–256x; sparsity coefficient λ — typically 1e-3 to 1e-1, tune so average L0 (features active per token) is in the range 20–100; learning rate — 1e-4 to 5e-5 with Adam; normalize decoder columns to unit norm after each gradient step to prevent feature collapse. Monitor: reconstruction loss (should be >95% variance explained), L0 sparsity, and fraction of dead features (features that never activate — indicates λ too high).
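A minimal monitoring sketch for the three health metrics named above (explained variance, mean L0, dead-feature fraction), assuming the hypothetical `SparseAutoencoder` from the code block earlier; the targets in the comments are the ones quoted in this answer.
```python
import torch

@torch.no_grad()
def sae_health_metrics(sae, activations: torch.Tensor, eps: float = 1e-8) -> dict:
    """Compute SAE training-health metrics on one batch of activations [B, d_model]."""
    x_hat, f = sae(activations)
    # Explained variance of the reconstruction (target: > 0.95)
    resid_var = (activations - x_hat).pow(2).sum()
    total_var = (activations - activations.mean(0)).pow(2).sum()
    explained_variance = 1.0 - (resid_var / (total_var + eps)).item()
    # Mean L0: average number of features active per token (target: roughly 20-100)
    l0 = (f > 0).float().sum(dim=-1).mean().item()
    # Dead-feature fraction for this batch (track over many batches in practice)
    dead_frac = ((f > 0).sum(dim=0) == 0).float().mean().item()
    return {"explained_variance": explained_variance, "l0": l0, "dead_frac": dead_frac}
```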
### ★★★ _(Anthropic, Google)_
**Q:** How does attribution patching differ from activation patching? When would you use each?
Answer
Activation patching (causal tracing): run two passes — clean and corrupted (e.g., replace subject token). For each component, swap its activation from the clean run into the corrupted run and measure the effect on the output metric. This gives an exact causal estimate but requires O(N) forward passes for N components — expensive at scale. Attribution patching: first-order Taylor approximation. Compute the gradient of the output metric with respect to each activation, then multiply by the difference between clean and corrupted activations: attr ≈ (∂output/∂f_i) * (f_i^clean - f_i^corrupt). This runs in O(1) passes (one forward + one backward) while closely approximating full patching. Use activation patching when you have a small targeted circuit and need exact results. Use attribution patching when sweeping across all features/components, building full attribution graphs, or when compute is constrained. Attribution patching can miss nonlinear effects; activation patching catches them but doesn't scale to exhaustive sweeps.
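A sketch of the first-order approximation described above: one forward pass on the corrupted input, one backward pass, then attr ≈ gradient × (clean − corrupt) per hooked layer. The hook-and-cache plumbing is illustrative, not any specific library's API.
```python
import torch

def attribution_patching_scores(model, corrupt_tokens, clean_cache,
                                layer_names, metric_fn):
    """Approximate activation-patching effects for many layers at once.
    clean_cache: dict layer_name -> clean-run activation (precomputed).
    metric_fn(logits) -> scalar metric (e.g., logit difference).
    """
    corrupt_acts = {}
    modules = dict(model.named_modules())

    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            out.retain_grad()               # keep the gradient on this activation
            corrupt_acts[name] = out
            return output
        return hook

    handles = [modules[n].register_forward_hook(make_hook(n)) for n in layer_names]
    try:
        metric = metric_fn(model(corrupt_tokens))
        metric.backward()                   # one backward pass covers every layer
    finally:
        for h in handles:
            h.remove()

    scores = {}
    for name in layer_names:
        act = corrupt_acts[name]
        delta = clean_cache[name] - act.detach()
        # First-order Taylor estimate of swapping in the clean activation
        scores[name] = (act.grad * delta).sum().item()
    return scores
```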
### ★★★ _(Anthropic)_
**Q:** What is the L1/reconstruction trade-off in SAE training and how do you pick the right sparsity coefficient?
Answer
The SAE loss is MSE + λ * L1. Too high λ: the model sacrifices reconstruction accuracy to achieve extreme sparsity — features become coarse and miss fine-grained concepts; many features die (never activate). Too low λ: reconstruction is near-perfect but features are dense and polysemantic — the SAE fails to decompose superposition, defeating the purpose. Picking λ: sweep over values and monitor three metrics: (1) Explained variance of reconstruction (target: >95%), (2) Mean L0 — average number of features active per token (target: 20–100 for interpretability), (3) Dead feature fraction (target: <5% dead after 100M tokens). The right λ is the largest value that still clears all three targets, i.e. the knee of the sparsity/reconstruction Pareto frontier.
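A sketch of the λ sweep this answer describes, reusing the hypothetical `sae_health_metrics` helper from the earlier sketch; `train_fn` stands in for the training loop shown above and is an assumption, not a fixed API.
```python
def sweep_sparsity_coefficient(train_fn, eval_activations,
                               lambdas=(1e-3, 3e-3, 1e-2, 3e-2, 1e-1)):
    """Train one SAE per lambda and tabulate the three selection metrics.
    train_fn(lam) -> trained SparseAutoencoder (the loop sketched earlier).
    """
    table = []
    for lam in lambdas:
        sae = train_fn(lam)
        table.append({"lambda": lam, **sae_health_metrics(sae, eval_activations)})
    # Keep the largest lambda that still clears all three targets:
    # >95% explained variance, <5% dead features, mean L0 in the 20-100 band.
    viable = [row for row in table
              if row["explained_variance"] > 0.95
              and row["dead_frac"] < 0.05
              and 20 <= row["l0"] <= 100]
    best = max(viable, key=lambda row: row["lambda"]) if viable else None
    return best, table
```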
### ★★★ _(Anthropic, OpenAI)_
**Q:** Describe how you would find the circuit responsible for a specific model behavior (e.g., gendered pronoun resolution).
Answer
Circuit discovery workflow: (1) Define a contrastive pair: a clean input where the behavior appears and a corrupted input that flips it (for pronoun resolution, swap the gendered name so the correct pronoun changes). (2) Define a metric, e.g. the logit difference between the two candidate pronouns. (3) Run both inputs and cache activations from the clean run. (4) Localize: patch clean activations into the corrupted run component by component (layers, heads, positions), using attribution patching to narrow the sweep and activation patching to confirm the top candidates. (5) Interpret: decompose the implicated activations with an SAE or inspect attention patterns to name what each component does. (6) Validate: ablate or steer the identified components on held-out prompts and check that the behavior degrades or flips as predicted.
### ★★★ _(Anthropic, Google, OpenAI)_
**Q:** What are the limitations of current mechanistic interpretability methods? What can they not yet explain?
Answer
Current limitations: (1) Scale: full circuit tracing works on individual inputs, not on aggregate model behavior across all possible inputs — we trace one computation, not the general algorithm. (2) Completeness: SAEs capture a subset of model computation; some features are uninterpretable or semantically ambiguous even to human annotators. (3) Superposition in SAEs: SAEs can themselves develop superposition if λ is too low or d_sae is too small — partial solution, not total fix. (4) Attention vs. MLP asymmetry: SAEs work well on MLP outputs; attention head decomposition is harder because attention mixes token positions non-linearly. (5) Causal vs. correlational: a feature that activates doesn't necessarily cause the behavior — establishing causality requires ablation or steering experiments, which are too expensive to run for every feature.
## Further Reading
- [Circuit Tracing: Revealing Computational Graphs in Language Models](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)
Ameisen et al. 2025 — combining SAEs with attribution patching to trace full computational circuits in Claude 3.5 Haiku
- [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
Lindsey et al. 2025 — probing Claude 3.5 Haiku
- [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
Templeton et al. 2024 — dictionary learning at scale finds ~34M features in a production frontier model
- [Towards Monosemanticity: Decomposing Language Models with Dictionary Learning](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
Bricken et al. 2023 — the first successful SAE decomposition of a one-layer transformer; established the field
- [Toy Models of Superposition](https://transformer-circuits.pub/2022/toy_model/index.html)
Elhage et al. 2022 — controlled experiments showing how and why neural networks encode more features than dimensions
- [When Models Manipulate Manifolds](https://transformer-circuits.pub/2025/linebreaks/index.html)
Gurnee et al. 2025 — studying how models use linebreaks and whitespace as geometric pivots in activation space
- [Chris Olah — Neural Networks, Manifolds, and Topology](https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)
Olah 2014 — the foundational visual intuition for how neural networks transform data through manifold operations
- [3Blue1Brown — How might LLMs store facts (Chapter 7)](https://www.youtube.com/watch?v=9-Jl0dxWQs8)
Grant Sanderson 2024 — visual walkthrough of how MLP layers in transformers store and retrieve facts, with connections to superposition and sparse autoencoders.
- [Neel Nanda — How to Become a Mechanistic Interpretability Researcher](https://www.neelnanda.io/mechanistic-interpretability/getting-started)
Nanda 2023 — comprehensive guide to getting started in mech interp research, with recommended papers, exercises, and learning path.
- [Neuronpedia — Interactive SAE Feature Explorer](https://www.neuronpedia.org/)
Open-source platform for exploring 50M+ SAE features across GPT-2, Gemma, Llama, and more — search, visualize activations, and steer model behavior interactively.
- [ARENA — Mechanistic Interpretability Exercises](https://arena3-chapter1-transformer-interpretability.streamlit.app/)
Hands-on coding tutorials for transformer interpretability — TransformerLens, induction heads, superposition, SAEs, and circuit analysis.
## Related
LLM Evaluation · Eval-Driven Development · Interpretability · Safety & Alignment · Induction Heads & ICL
---
---
title: "Induction Heads & ICL"
part: "Trust & Evaluation"
number: 65
emoji: "🧠"
subtitle: "The two-head circuit that powers in-context learning — and why it emerges as a phase transition"
tags: ["trust", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🧠 Induction Heads & ICL
> The two-head circuit that powers in-context learning — and why it emerges as a phase transition
> [!question] Key Question
> GPT learns to copy patterns mid-training — and that single circuit explains in-context learning
← Mechanistic Interpretability
## Contents
- Circuit Diagram
- The Intuition
- The QK/OV Circuit
- Break It — See What Happens
- Real-World Numbers
## Key Insights
> [!tip] Insight
> The previous-token head is the key: it makes every token carry its predecessor's identity. This lets the induction head do indirect lookup — "find where the current token appeared before by searching for positions that say they were preceded by the current token."
> [!tip] Insight
> The QK circuit of the induction head "reads" from the previous-token head's output. This means the K matrix of the induction head must have been learned to be compatible with the output directions of the previous-token head — a beautiful example of emergent inter-layer coordination.
> [!tip] Insight
> The phase transition is visible across all 16 models Olsson et al. studied — from 2-layer attention-only models to full GPT-style architectures. Bigger models show the same transition, just at different training-token counts and with more sophisticated generalizations of the basic circuit.
## Code Examples
```python
import torch
import torch.nn.functional as F

def induction_score(attn_pattern: torch.Tensor) -> float:
    """
    Measure how strongly a head shows induction behavior.
    attn_pattern: (seq_len, seq_len) attention weight matrix on a
    repeated random sequence of length seq_len//2.
    An induction head attends at the [seq_len//2 - 1] diagonal:
    position i attends to position i - (seq_len//2 - 1), the spot
    where the current token appeared in the first copy.
    """
    seq_len = attn_pattern.shape[0]
    half = seq_len // 2
    # Extract the diagonal offset = -(half - 1)
    # i.e., for position i in second copy, attend to position i - half + 1
    offset = -(half - 1)
    diag = torch.diagonal(attn_pattern, offset=offset)
    return diag.mean().item()

def find_induction_heads(
    model,
    seq_len: int = 50,
    threshold: float = 0.4,
    device: str = "cpu"
) -> list[tuple[int, int]]:
    """
    Run a repeated random sequence through the model and return all
    (layer, head) pairs with induction score above threshold.
    """
    vocab_size = model.config.vocab_size
    n_heads = model.config.num_attention_heads
    # Build a repeated random sequence: [A B C ... A B C ...]
    rand_tokens = torch.randint(1, vocab_size, (1, seq_len), device=device)
    tokens = torch.cat([rand_tokens, rand_tokens], dim=1)  # (1, 2*seq_len)
    with torch.no_grad():
        outputs = model(tokens, output_attentions=True)
    induction_heads = []
    for layer_idx, layer_attn in enumerate(outputs.attentions):
        # layer_attn: (batch, n_heads, seq, seq)
        for head_idx in range(n_heads):
            pattern = layer_attn[0, head_idx]  # (2*seq_len, 2*seq_len)
            score = induction_score(pattern)
            if score > threshold:
                induction_heads.append((layer_idx, head_idx))
                print(f"Layer {layer_idx}, Head {head_idx}: score={score:.3f}")
    return induction_heads
```
```python
# Induction head detection: repeated-sequence attention score
import torch

def induction_score_for_head(
    model, layer: int, head: int, seq_len: int = 50
) -> float:
    """
    Feed a repeated random sequence [A...A] of length 2*seq_len.
    An induction head at (layer, head) will strongly attend at
    diagonal offset -(seq_len - 1): position i attends to i-(seq_len-1),
    the spot right after the previous occurrence of token[i].
    Returns mean attention weight on that diagonal (0=no induction, 1=perfect).
    """
    vocab = model.config.vocab_size
    rand_seq = torch.randint(1, vocab, (1, seq_len))
    tokens = torch.cat([rand_seq, rand_seq], dim=1)  # (1, 2*seq_len)
    with torch.no_grad():
        out = model(tokens, output_attentions=True)
    # out.attentions[layer]: (batch, n_heads, 2*seq_len, 2*seq_len)
    attn = out.attentions[layer][0, head]  # (2*seq_len, 2*seq_len)
    offset = -(seq_len - 1)
    diag = torch.diagonal(attn, offset=offset)  # values on the induction diagonal
    return diag.mean().item()
```
## Interview Questions
### ★★★ _(Anthropic, Google)_
**Q:** What is an induction head and how does it implement pattern completion?
Answer
An induction head is a two-layer attention circuit that implements the rule: if [A][B] occurred earlier in the context and the current token is [A], predict [B]. The first component is a previous-token head in an earlier layer that writes each token's predecessor into that token's residual stream. The second is the induction head itself: its QK circuit matches the current token against those "preceded by" annotations, so it attends to the position immediately after the earlier occurrence of the current token, and its OV circuit copies the token found there into the output logits, completing the pattern.
### ★★★ _(Anthropic)_
**Q:** Why do induction heads emerge as a phase change during training rather than gradually?
Answer
Induction heads require coordination between two separate attention heads — a previous-token head and a matching head. Neither is useful alone: the previous-token head only becomes beneficial when the induction head exists to use its output, and vice versa. This creates a coordination problem where the two circuits must develop together. Olsson et al. (2022) observed a sharp phase transition around 2B tokens in small models: loss on repeated random sequences drops suddenly, all attention heads in the model change simultaneously, and a visible bump appears in the training loss curve. The sharpness follows from the circuit being all-or-nothing: once both halves are partially in place, the gradient signal for completing the pair becomes strong, so the capability snaps into place rather than accumulating gradually.
### ★★★ _(Anthropic, Google)_
**Q:** How would you detect induction heads in a trained transformer? Describe the experimental setup.
Answer
The canonical detection method uses a repeated random sequence: generate a random token sequence [A, B, C, D, ...] and concatenate it with itself to get [..., A, B, C, D, A, B, C, D]. Then inspect the attention patterns of each head on the second copy. An induction head will show a characteristic pattern: each token attends to the position where it appeared in the first copy, offset by +1 (attending one step ahead of where it last appeared). Quantitatively, you compute an induction score: the mean attention weight on that offset diagonal, exactly as in the code examples above. Heads whose score exceeds a threshold (0.4 is a common choice) are labeled induction heads; using random tokens rules out the head relying on semantic structure rather than positional pattern matching.
### ★★★ _(Anthropic, OpenAI)_
**Q:** Can induction heads explain generalization beyond exact copying? Give an example of fuzzy induction.
Answer
Yes. In small models, induction heads do literal copying — they match on exact token identity. But in larger models (GPT-2 and beyond), analogous circuits operate in embedding space, enabling fuzzy induction: the match is on similarity rather than identity, so the circuit completes patterns of the form [A*][B*] ... [A] → [B], where A* and B* are semantically related to A and B rather than identical. Translation is the standard example: given an English phrase and its French rendering earlier in the context, the circuit attends to the analogous position and copies the corresponding French continuation, even though no exact token repeats.
### ★★★ _(Anthropic)_
**Q:** What is the relationship between induction heads and the in-context learning loss bump?
Answer
The in-context learning loss bump is a sudden drop in loss on sequences where context helps prediction — it appears mid-training as a sharp discontinuity rather than a smooth improvement. Olsson et al. (2022) showed this bump is causally linked to induction head formation: (1) the bump timing matches exactly when induction heads form across 16 different models, (2) ablating induction heads removes most of the ICL benefit, restoring pre-bump loss levels, and (3) the bump correlates with performance on held-out tasks that require using context. The bump accounts for roughly 50% of the total in-context learning performance. The mechanism is direct: before induction heads form, the model can only use the current token and learned priors; after, it can scan the context for pattern matches and use them for prediction.
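A sketch of how the loss-bump measurement is usually operationalized: the in-context-learning score is the per-token loss at a late context position minus the loss at an early one, tracked across training checkpoints. The position indices are illustrative defaults.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def icl_score(model, tokens: torch.Tensor, early: int = 50, late: int = 500) -> float:
    """In-context-learning score: per-token loss at a late position minus loss
    at an early one (more negative = the model benefits more from extra context).
    Assumes the model returns logits of shape (batch, seq, vocab) and seq > late.
    """
    out = model(tokens)
    logits = out.logits if hasattr(out, "logits") else out
    # Shift so logits at position i predict the token at position i+1
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = tokens[:, 1:]
    token_loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, seq-1)
    return (token_loss[:, late] - token_loss[:, early]).mean().item()
```
Plotting this score over training checkpoints makes the phase change visible as a sudden drop at the same step where induction heads appear.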
## Further Reading
- [In-Context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
Olsson et al. 2022 — the definitive paper showing induction heads are the mechanistic basis of in-context learning, with phase-change evidence across 16 models
- [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html)
Elhage et al. 2021 — introduces the QK/OV decomposition and residual stream view used throughout induction head analysis
- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
Olah 2015 — the gold-standard visual explainer of recurrent memory; useful context for understanding why in-context learning is surprising in attention-only models
- [Tracing Attention Computation Through Feature Interactions](https://transformer-circuits.pub/2025/attention-qk/index.html)
Kamath et al. 2025 — traces how attention QK circuits interact with features, extending induction head analysis to larger and more complex models
## Related
LLM Evaluation · Eval-Driven Development · Interpretability · Safety & Alignment · Mechanistic Interpretability
---
---
title: "The Design Doc"
part: "Design Reviews"
number: 66
emoji: "📝"
subtitle: "Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 📝 The Design Doc
> Working backwards from the SLO — an annotated, worked design doc for a real ML endpoint
> [!question] Key Question
> Every senior engineer writes design docs — nobody teaches how
→ Cost Accounting & Eval-Driven Design
## Key Insights
> [!tip] Insight
> Margin note. Notice what's NOT on the list: model choice, framework, cloud provider, even GPU type. Requirements come from the customer and the P&L, not from the tech menu. If you skip to architecture without this table, every subsequent decision is ungrounded.
> [!tip] Insight
> Golden-set sizing — Wilson interval derivation. For a binary quality judgment, the Wilson score interval gives the 95% CI half-width as approximately w ≈ 1.96 × √(p̂(1−p̂) / n), where p̂ is the expected pass rate and n is the sample size. At p̂ = 0.80 and n = 200, w ≈ 1.96 × √(0.16 / 200) ≈ 0.055, so the CI is roughly ±5.5 pp — adequate for a top-level gate. At p̂ = 0.90 the same n gives ±4.2 pp; at p̂ = 0.95 it narrows to ±3.0 pp because the binomial variance peaks at 0.50. Note: for multi-cohort drill-downs (per tier, per prompt-length bucket), do not multiply a single pool size by the number of cells. Each cell has its own base rate and therefore its own required n. A cell where the easy-prompt tier passes at 95% needs far fewer examples to pin a ±3 pp CI than a cell where the adversarial tier passes at 60%. Size each cell independently, then sum — the aggregate is usually 2–4× higher than the naive “200 × cells” estimate would predict.
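A sketch of the per-cell sizing rule in the note above: size each cohort from its own expected pass rate, then sum. The cohort names and pass rates are illustrative.
```python
import math

def cell_size(pass_rate: float, half_width: float, z: float = 1.96) -> int:
    """Examples needed to pin a binary pass rate to +/- half_width at 95% confidence."""
    return math.ceil(z**2 * pass_rate * (1 - pass_rate) / half_width**2)

# Each cohort sized from its own base rate, then summed.
cohorts = {"easy-prompt": 0.95, "standard": 0.85, "long-context": 0.75, "adversarial": 0.60}
sizes = {name: cell_size(p, 0.03) for name, p in cohorts.items()}
print(sizes, "total:", sum(sizes.values()))
# The adversarial cell (p ≈ 0.60) needs roughly 5x the examples of the easy cell (p ≈ 0.95).
```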
> [!tip] Insight
> Margin note. The calculator gives a number. The number is wrong — all back-of-envelope numbers are. The question is whether it's wrong by 1.5× or by 10×. 1.5× means the capacity plan survives; 10× means the entire architecture needs to change (routing, quantization, disaggregation). This is what calibrates how much detail the architecture deserves.
> [!tip] Insight
> Margin note. Two deep dives — not four. An interviewer will push into the places you didn't dive, and the right answer there is “I'd follow the same structure — here's the risk I'd watch.” Deep diving everything equally is a junior signal.
> [!tip] Insight
> The hardest SLO to write is the quality SLO. Latency and availability are percentages anyone can check. Quality regressions need the eval harness you wrote in Step 2 — which is why Step 2 comes before architecture.
## Interview Questions
### ★★★ _(Google, Anthropic)_
**Q:** You're asked to design a serving endpoint for a new LLM feature. Which numbers do you ask for before sketching any architecture, and why does the order matter?
Answer
(1) QPS target — derived from customer count × requests/user/day ÷ 86,400. (2) p95 time-to-first-token SLO — usually 500–800 ms for interactive use. (3) Average output token length — drives total GPU-seconds per request. (4) Acceptable cost per 1K output tokens — this is the constraint that kills most naive designs. Order matters: QPS without a latency SLO leads to an overspecified fleet, and both without the cost ceiling lead to an architecture the business will not pay for.
### ★★★ _(OpenAI, Google)_
**Q:** An interviewer says “assume reasonable SLOs” and asks you to go straight to the architecture. What do you do?
Answer
Push back, politely but immediately — this is an SLO-dependent design. The architecture is a direct function of the latency and cost SLOs, and different SLO classes force structurally different choices. Concrete example: a 50 ms p95 TTFT SLO requires a dedicated decode-only GPU pool, speculative decoding (3–5 draft tokens ahead), and likely disaggregated prefill so no long-context request can stall decode slots. A 5-second batch-completion SLO, by contrast, allows fully asynchronous queuing, large batch accumulation windows (1–2 s), and no speculative decoding overhead — the architecture is simpler and 2–3× cheaper per token. Those are not the same system, and you can't pick between them without the SLOs.
### ★★★ _(Anthropic, Meta)_
**Q:** Your capacity math says you need 200 GPUs. Your budget is 60. What do you cut first?
Answer
Quality knobs before capacity knobs. In order: (1) Shorter max_tokens ceiling — often the biggest single lever. (2) Model routing — send the easy 80% to a small model, keep the large model for the hard 20% (~70% cost cut). (3) Prompt caching for repeated system prompts (30–50% prefill savings). (4) Tighter rate limits per tier. Only after those do you look at quantization (INT8 KV cache, INT4 weights), because quantization can hurt quality in subtle ways that only eval catches.
### ★★★ _(OpenAI, Google)_
**Q:** You ship the design doc. Two weeks in, p95 TTFT regresses from 420 ms to 900 ms. Your doc said the architecture would hold the latency SLO. Was the design wrong?
Answer
The design doc is fine — the regression is an ops event, not a design bug. Look at (1) admission control: is the queue depth higher, and why? (2) batch composition: are long-context requests poisoning the batch by blocking short decodes? This is the classic prefill-decode interference problem — mitigation is disaggregated prefill or chunked prefill. (3) KV cache pressure: is a new feature pinning context for longer? This is where the eval harness pays off — a trajectory replay of the regressed requests tells you which cohort broke.
### ★★★ _(Anthropic, Google)_
**Q:** An interviewer asks you to justify a decision your design doc made. You realize you can't. What do you do?
Answer
Say so immediately and with precision. The formula is: name the assumption the decision rested on, state what evidence would confirm or refute it, and propose the cheapest experiment that would settle it. Owning the gap precisely reads as senior; improvising a justification reads as unreliable.
## Further Reading
- [Jeff Dean — Building Software Systems at Google and Lessons Learned](https://research.google/pubs/pub40672/)
The original back-of-envelope discipline: the talk behind the “numbers every engineer should know” habit that grounds the capacity math in this module.
- [Amazon Working Backwards — PR/FAQ + 6-Pager](https://www.workingbackwards.com/)
Not an ML piece, but the discipline of writing the customer-facing press release before the design doc is the methodological backbone of this module.
- [Chip Huyen — Designing Machine Learning Systems (O'Reilly)](https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/)
The canonical ML system-design textbook. Chapter 1 on business objectives is the framework chapter candidates keep ignoring at their own cost.
- [Shreya Shankar — Operationalizing ML](https://www.shreya-shankar.com/phd-productionizing-ml/)
The thesis-length argument that the gap between ML design and ML-in-production is owned by the eval harness, not the model.
- [Hamel Husain — Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/)
The practitioner post that converted a generation of AI engineers to eval-first design. Required reading before writing any LLM design doc.
## Related
Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor · Case: Design Midjourney
---
---
title: "Cost Accounting & Eval-Driven Design"
part: "Design Reviews"
number: 67
emoji: "💰"
subtitle: "Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 💰 Cost Accounting & Eval-Driven Design
> Cost-per-bad-day, LLM-judge rubrics, golden-set sizing — design flows from the eval, not the architecture
> [!question] Key Question
> You can't design what you can't measure — so write the eval first
← The Design Doc | → Case: Design ChatGPT
## Key Insights
> [!tip] Insight
> The judge is a system with its own eval. An LLM-as-judge is a model whose output drives your go/no-go decisions. It deserves the same scrutiny as the production model — calibration against human labels, drift monitoring, and a refresh schedule. LLM judges exhibit position bias in 10–30% of pairwise comparisons — another reason to calibrate against human labels rather than trust the judge out-of-the-box. Shankar's “Who Validates the Validators” (arxiv 2404.12272) documents what happens when you skip this.
> [!tip] Insight
> Real-world example. Shankar et al. (2024) “Who Validates the Validators?” is the canonical empirical study of LLM-judge reliability in production. The paper instruments Spearman ρ between four judge-model configurations (GPT-4, Llama-70B, and two rubric variants) and human raters across 2,200 labeled examples, finding that judge agreement with humans ranges from ρ = 0.47 to ρ = 0.84 depending on judge model and rubric design — a nearly 2× spread. The paper's central finding for practitioners: no judge works well out-of-the-box; all require domain-specific calibration sets and regular refresh. The cost-recall tradeoff between embedding and LLM judges is documented in Figure 3 of the paper.
> [!tip] Insight
> Real-world example. RouteLLM (Ong et al., 2024) is the most rigorous public evaluation of classifier-gated model routing. The paper benchmarks four router architectures on MT-Bench, MMLU, and GSM8K, measuring the cost-quality frontier for each. Key result: on MT-Bench, the matrix factorization router achieves a 2× cost reduction with <5% quality degradation vs. always routing to GPT-4. The paper also demonstrates that all router architectures degrade under distribution shift between training and test domains — the routers trained on chatbot-arena data underperform by 8–12 pp quality on coding-heavy benchmarks — which is the same drift failure mode described above. Martian (a commercial routing-as-a-service product) extends the RouteLLM approach with online retraining but does not publish accuracy numbers for its production router.
> [!tip] Insight
> The reliability-is-a-dollar-number reframe. A 99.9% availability SLO allows only ~43 minutes of downtime per month — yet that budget hides the asymmetry between a 10-second blip and a 4-hour partial regression. Pricing both in dollars surfaces what the SLO obscures: partial regressions are often the expensive incidents, not the full outages.
> [!tip] Insight
> The eval that's too green. When the offline eval consistently shows bigger wins than the online test, you almost certainly have a selection bias in the golden set — it over-represents cases where your model is already strong. Fix by re-sampling from recent production traffic. Evaluation standards often emerge through the grading process itself — criteria drift is not a bug but a feature of real-world eval pipelines.
## Code Examples
```python
import math

def golden_set_size(
    expected_pass_rate: float,    # e.g. 0.80
    target_half_width: float,     # e.g. 0.02 for ±2 pp
    n_cohorts: int = 1,
    confidence_z: float = 1.96,   # 95% CI
) -> int:
    """Size a golden set for a binary pass/fail metric.
    Multiplies by n_cohorts when you want independent power
    within each stratum (per-tier, per-language, etc.).
    """
    p = expected_pass_rate
    per_cohort = (confidence_z ** 2) * p * (1 - p) / (target_half_width ** 2)
    return int(math.ceil(per_cohort * n_cohorts))

# Example: 80% pass rate, ±2pp, 4 cohorts
print(golden_set_size(0.80, 0.02, n_cohorts=4))
# -> 6147
```
## Interview Questions
### ★★★ _(Anthropic, OpenAI)_
**Q:** Your offline LLM-judge eval says a new model is 5% better. After launch, user satisfaction is flat. What do you check, and in what order?
Answer
Offline/online divergence is the default state, not the exception. Diagnose in order: (1) Distribution mismatch — is the golden set representative of real traffic, or a curated slice? (2) Judge calibration — does the LLM judge's notion of “better” correlate with anything users perceive? Re-check judge agreement on a human-labeled subsample. (3) Metric-target mismatch — a 5% win on the judge's rubric (style, completeness) can be invisible in the behaviors that actually drive satisfaction. (4) Concentration — the win may sit in cohorts users rarely hit. Offline gains only count when the golden set mirrors production traffic and the judge has been validated against humans.
### ★★★ _(OpenAI, Meta)_
**Q:** Calculate cost-per-bad-day for a product at 1K QPS, $3/1M output tokens, 256 avg output tokens, if a regression routes 20% of traffic to the flagship instead of the cheap model (flagship costs 10x).
Answer
Baseline cost: 1K QPS × 86,400 s × 256 tok × $3/1M = $66K/day. The regression means 20% of traffic costs 10× more; the other 80% is unchanged. Overrun on the regressed slice = 0.2 × $66K × (10−1) = $119K/day, where the 9× factor is the excess above baseline cost on the regressed slice (not the full 10×, because the baseline $66K already accounts for routing all traffic at $3/1M; the incremental delta per regressed request is 9× the cheap-tier cost). Detected in 5 min → $415; detected in 4 h → $19,900 (48× delta). That is why the eval harness that catches router drift in minutes, not hours, pays for itself in a single incident.
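The same arithmetic as a runnable sketch, so the sensitivity to the detection window is explicit; every input is a number given in the question.
```python
QPS = 1_000
TOKENS_PER_REQ = 256
CHEAP_PRICE_PER_M = 3.00        # $ per 1M output tokens
FLAGSHIP_MULTIPLIER = 10        # flagship costs 10x the cheap tier
REGRESSED_SHARE = 0.20          # 20% of traffic mis-routed

tokens_per_day = QPS * 86_400 * TOKENS_PER_REQ
baseline_cost = tokens_per_day / 1e6 * CHEAP_PRICE_PER_M                        # ~$66K/day
overrun_per_day = REGRESSED_SHARE * baseline_cost * (FLAGSHIP_MULTIPLIER - 1)   # ~$119K/day

for minutes in (5, 240):
    print(f"detected in {minutes:>3} min -> ${overrun_per_day * minutes / 1440:,.0f} excess spend")
# 5 min is roughly $415, 4 hours roughly $19,900; the 48x gap is purely the detection window.
```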
### ★★★ _(Anthropic, Google)_
**Q:** You have a budget for 200 human-labeled eval examples. How do you allocate them across cohorts?
Answer
Don't split them evenly. Size each cohort for the decision it gates: cohorts near a 50% pass rate need far more examples for the same confidence interval than cohorts near 95%, and the revenue-critical and adversarial cohorts deserve the tightest intervals. Allocate most of the 200 there, accept wide intervals (or a simple smoke test) on low-stakes cohorts, and if 200 cannot buy a usable interval everywhere, cut the number of cohorts rather than diluting all of them. Then backfill the rest with LLM-judge labels calibrated against the human subsample.
### ★★★ _(Anthropic, OpenAI)_
**Q:** Why is average cost per request not sufficient as the primary cost metric? What do you track instead?
Answer
Cost per request is an average; incidents live in the tail. A product with $0.003/request average cost can absorb a 10x tail without breaking the P&L until the tail gets wide. The useful decomposition: (1) steady-state unit cost, (2) cost-per-bad-day (the integral of the tail during an incident window), (3) cost-per-user-retained (which only makes sense over cohorts, not requests). Instrument all three; the first is for capacity planning, the second is for incident severity, the third is for product decisions.
### ★★★ _(OpenAI, Google)_
**Q:** You need a launch gate for hallucination rate. What is the obvious eval design, and what would you do instead?
Answer
Obvious: run an LLM judge over a held-out set, count hallucinations, set a threshold. Better: (1) define hallucination operationally — a specific claim not supported by the provided context or a trusted source — so humans and the judge grade the same thing; (2) calibrate the judge against a human-labeled subsample before trusting its counts; (3) report rates per cohort (task type, prompt difficulty, retrieval quality) rather than one aggregate, and gate on the cohorts where a hallucination does the most damage.
## Further Reading
- [Shreya Shankar — Who Validates the Validators?](https://arxiv.org/abs/2404.12272)
The canonical paper on LLM-judge calibration. If you take one idea: the judge needs its own eval, and that eval is a human-labeled subsample you refresh on a schedule.
- [RouteLLM — Learning to Route in LLMs (Ong et al., 2024)](https://arxiv.org/abs/2406.18665)
The paper that formally defines the cost-quality trade-off in LLM routing. Introduces the APGR/CGPT metrics and shows that a trained classifier-router can match 95% of GPT-4 quality at 40% of the cost.
- [Hamel Husain — Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/)
The blog post that reframed eval-driven development for a generation of AI engineers. Pair with Hamel's follow-up writing on the same theme.
- [Eugene Yan — LLM Evals](https://eugeneyan.com/writing/llm-evaluators/)
The most thorough practitioner guide to LLM-as-judge design — rubric construction, bias mitigation, calibration.
- [Chip Huyen — AI Engineering (O'Reilly)](https://www.oreilly.com/library/view/ai-engineering/9781098166298/)
Chapter 4 on evaluation is the textbook reference. The cost-accounting chapter reframes LLM unit economics around request shape, not just token count.
- [Anthropic — Building Effective Agents (cost patterns)](https://www.anthropic.com/research/building-effective-agents)
Not a cost paper per se, but the routing and orchestration patterns here are exactly where cost lives in agent systems.
## Related
The Design Doc · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor · Case: Design Midjourney
---
---
title: "Case: Design ChatGPT"
part: "Design Reviews"
number: 68
emoji: "💬"
subtitle: "Multi-tenant chat — SLOs, model routing, conversation state"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 💬 Case: Design ChatGPT
> Multi-tenant chat — SLOs, model routing, conversation state
> [!question] Key Question
> Two billion messages a day — where does the money actually go?
← Cost Accounting & Eval-Driven Design | → Case: Design Perplexity
## Key Insights
> [!tip] Insight
> Non-obvious SLO choice: separate p95 targets by tier. Most teams set a single p95 TTFT across all users. ChatGPT cannot — because the model router sends free-tier traffic to a smaller model on a separate fleet segment, their tail-latency distribution is structurally different from Plus. Collapsing them into one number masks regressions on the paid tier (revenue-critical) under the larger volume of free-tier requests. Separate SLO tracking per tier is not a reporting preference; it is a detection prerequisite. This follows directly from the Google SRE Book recommendation (Chapter 4) to define SLOs for distinct user populations rather than aggregate service behavior.
> [!tip] Insight
> Golden-set sizing math. For a binary refusal judgment at 95% expected pass rate, a set of 500 examples gives a 95% confidence interval of roughly ±2 percentage points — tight enough to detect a 5-point regression before it ships. A 100-example set gives ±6 points: too wide to detect gradual drift. The formula is CI = 1.96 × sqrt(p(1−p)/n) — plug in p=0.95, n=500 to verify. Size for the signal you need, not the round number that fits in a sprint.
> [!tip] Insight
> The deliberate mistake in the defaults above. The preset uses 1,024 input tokens — reasonable for turn 1. But multi-turn sessions accumulate history. A 10-turn session averaging 512 tokens per turn arrives with 5,000+ tokens of prefill context. The GPU memory required per active session grows proportionally. For the fleet to handle the p95 session without admission-control rejection, the KV cache must be sized for the distribution tail, not the mean — and that changes the GPU count estimate substantially.
> [!tip] Insight
> Why two, not four. Deep diving everything equally is a junior signal. The conversation store failure cascades into every active session simultaneously and triggers a KV-cache miss storm on the GPU fleet — both user-visible quality loss and a cost spike in the same event. The router failure is the fastest path to a six-figure cost incident. For everything else: “I would apply the same risk analysis — here is the failure mode I would watch.”
> [!tip] Insight
> Gate 7 lesson: the 10x detection window. The router regression row shows that detection at 2 minutes vs. 4 hours is a 120x cost difference for the same underlying bug. Reliability is not a percentage — it is a detection-window investment. The per-tier cost alarm costs nothing to build and is worth six figures per incident it catches. A team that monitors uptime but not per-tier cost is flying one-eyed.
## Interview Questions
### ★★★ _(OpenAI, Anthropic)_
**Q:** ChatGPT free tier routes to a cheaper model; Plus routes to the flagship. A naive implementation hard-codes this in the gateway. What is the single worst failure mode of that design, and how do you detect and fix it?
Answer
The hardest failure mode is a silent routing regression — a config push, feature-flag flip, or canary-weight bug that routes free-tier traffic to the flagship for 30–60 minutes before anyone notices. Hard-coded gateway logic has no quality-check layer: the gateway cannot tell whether a request reached the correct model. Detection has two lines of defense: first, a per-tier GPU-spend alarm that fires within 2 minutes when cost deviates from baseline (20K free-tier QPS × 512 tokens × a ~$5/M delta between models is roughly $92K for a 30-minute window — the cost spike is not subtle); second, a shadow-judge that continuously scores 5% of mini-routed responses against flagship scores and alarms on divergence. The fix is to decouple tier-routing from the gateway and run it as a separate router service with a canary rollout (1% of traffic before 100%) and an automated rollback that triggers on cost-SLO breach. The gateway only enforces which tiers are eligible for which routing class; the routing decision and its quality gate live downstream.
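A sketch of the per-tier spend alarm described in this answer; the baselines, tolerance, and evaluation window are illustrative and would come from the fleet's own spend telemetry.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TierBaseline:
    dollars_per_min: float      # trailing baseline spend for this tier
    tolerance: float = 0.25     # alarm if spend deviates by more than 25%

BASELINES = {"free": TierBaseline(45.0), "plus": TierBaseline(140.0)}

def check_tier_spend(tier: str, observed_dollars_per_min: float) -> Optional[str]:
    """Return an alarm message when a tier's spend departs from its baseline.
    Intended to run on a 1-2 minute window so a routing regression is caught
    in minutes, not hours."""
    base = BASELINES[tier]
    deviation = (observed_dollars_per_min - base.dollars_per_min) / base.dollars_per_min
    if abs(deviation) > base.tolerance:
        return (f"[cost-slo] {tier}: ${observed_dollars_per_min:.0f}/min vs "
                f"baseline ${base.dollars_per_min:.0f}/min ({deviation:+.0%})")
    return None

print(check_tier_spend("free", 140.0))   # free tier suddenly costing like Plus -> alarm
```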
### ★★★ _(OpenAI, Google)_
**Q:** Design the conversation state store for ChatGPT. What are the three failure modes it must survive, and why is its blast radius higher than a model-server node failure?
Answer
Three failure modes: (1) Cache miss — Redis unavailable or a session evicted under memory pressure. Every turn re-sends full conversation history; the model server sees no prefix-cache hit; TTFT regresses by the prefill time for the full accumulated context, roughly 200 ms per 1K tokens on H100 (per vLLM benchmarks). At a 10-turn session averaging 5K tokens of history, that is ~1 second of extra prefill per request — a full SLO breach on the Plus tier. (2) State corruption — partial write during a network partition; next request reads a truncated prefix; the model produces incoherent output. Mitigation: write-ahead log on the durable tier; Redis write completes only after durable confirm; prefix is versioned so the model server detects length mismatches and falls back to a cold prefill. (3) Full store outage — every active multi-turn session simultaneously loses context coherence. A model-server node failure loses one in-flight request and traffic re-routes with no user-visible effect. A conversation store outage degrades every concurrent session at once and triggers a KV-cache miss storm on the GPU fleet — cascading into both user-visible quality loss and a cost spike. The blast radius is the product of active sessions, not one request.
### ★★★ _(OpenAI, Google)_
**Q:** The CFO asks why prefix caching on the system prompt matters to the bottom line. Give a number-backed answer.
Answer
Every ChatGPT request carries a system prompt — on the order of 500–1,500 tokens of instruction, policy, and tool definitions. Without prefix caching, every request pays full prefill cost for those tokens. The vLLM paper (Kwon et al., 2023, arXiv:2309.06180) reports 2–4x throughput improvement from prefix caching on repeated prefixes versus naive serving. Translating to cost: a 30% prefix-cache hit rate reduces prefill GPU-seconds proportionally — if prefill accounts for P% of fleet compute, caching delivers a 0.3P% effective fleet saving. At 2B-messages-per-day scale, that is directionally hundreds of GPU-days per month. The cache hit rate is therefore tracked as a direct business metric, not a latency metric.
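The same translation from hit rate to fleet savings as a short sketch; the 40% prefill share is an assumed input, the 30% hit rate is the one quoted above.
```python
def prefix_cache_fleet_saving(prefill_share: float, cache_hit_rate: float) -> float:
    """Fraction of total fleet compute saved when cached prefixes skip prefill."""
    return prefill_share * cache_hit_rate

# If prefill is 40% of fleet GPU-seconds and 30% of prefix tokens hit the cache:
print(f"{prefix_cache_fleet_saving(0.40, 0.30):.0%} of fleet compute saved")  # 12%
```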
### ★★★ _(OpenAI, Google)_
**Q:** p95 TTFT regresses from 400 ms to 1,100 ms after a traffic spike. No model changes shipped. Where do you look, in what order?
Answer
Start from the gateway and work downstream. First, check queue depth by tier: is the Plus or flagship queue deeper than baseline? A queue-depth spike without a request-rate spike points to a batch composition problem, not a capacity problem. Second, check batch composition: are long-context requests — specifically, multi-turn sessions with 8K+ tokens of history — dominating the prefill phase? This is the classic prefill-decode interference problem: one 8K-token prefill monopolizes decode slots for roughly 300–500 ms (inferred from vLLM chunked-prefill benchmarks showing ~200 ms per 1K-token atomic prefill on H100). Without chunked prefill, every short request queued behind it sees that penalty in their TTFT. Third, check prefix-cache hit rate: a sudden drop suggests a system-prompt change or serialization format drift that invalidated cached prefixes. Fourth, check KV-cache memory utilization on the model-server fleet: above 90%, admission control should kick in and the queue grows. The mitigation hierarchy is chunked prefill to cap per-request prefill interference, then disaggregated prefill/decode if the prefill share of total GPU-seconds crosses roughly 30%.
### ★★★ _(Anthropic)_
**Q:** Anthropic layers Constitutional AI's critique-and-revision loop on top of generation. From a serving-architecture perspective, how does integrating it differ from integrating a standard safety classifier?
Answer
A standard safety classifier is a single-pass binary gate: request in, allow or block out, sub-10 ms. Constitutional AI (Bai et al., 2022, arXiv:2212.08073) adds a self-critique-and-revision loop: the model generates a response, scores it against a set of constitutional principles, and revises before the output is returned. From a serving architecture perspective, this adds at least one extra generation pass — meaning latency roughly doubles for any request that enters the revision path. The critical integration decision is therefore the trigger condition: you cannot afford to run the full CAI loop on every request. The practical approach is to gate it on the output of the cheap post-model classifier: only requests scoring above a harmfulness threshold enter the CAI revision path. This preserves latency for the 95%+ of benign requests while applying the principled revision where it matters. A second integration decision is capping revision rounds — typically one to two — to bound worst-case latency. Third, log the revision diffs to the eval harness: a revision that introduces hallucinations while removing a safety issue is not a win, and only the eval harness can detect that pattern systematically. The shadow-run (5% of production through the full CAI path even when below threshold) surfaces classifier-calibration drift before it becomes a production incident.
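A sketch of the gating described above: only responses the cheap classifier flags enter the revision loop, and revision rounds are capped. `generate`, `harm_score`, and `critique_and_revise` are placeholders, not Anthropic's API.
```python
def respond_with_gated_cai(prompt: str,
                           generate,              # model call: prompt -> text
                           harm_score,            # cheap classifier: text -> score in [0, 1]
                           critique_and_revise,   # one CAI revision pass: text -> text
                           threshold: float = 0.5,
                           max_revisions: int = 2) -> dict:
    """Run the critique-and-revision loop only for flagged responses, with a round cap."""
    response = generate(prompt)
    revisions = 0
    while harm_score(response) > threshold and revisions < max_revisions:
        response = critique_and_revise(response)   # adds roughly one generation of latency
        revisions += 1
    # Log the revision count (and diff) to the eval harness regardless of outcome,
    # so revisions that trade safety for hallucination are caught offline.
    return {"response": response, "revisions": revisions}
```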
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design Perplexity · Case: Design Claude Code / Cursor · Case: Design Midjourney
---
---
title: "Case: Design Perplexity"
part: "Design Reviews"
number: 69
emoji: "🔭"
subtitle: "RAG + live web search — freshness, citations, retrieval fusion"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🔭 Case: Design Perplexity
> RAG + live web search — freshness, citations, retrieval fusion
> [!question] Key Question
> Half retrieval system, half LLM — which half should you optimize first?
← Case: Design ChatGPT | → Case: Design Claude Code / Cursor
## Key Insights
> [!tip] Insight
> Margin note. Notice that the latency SLO is looser than a pure chat product (1.5 s vs ~800 ms). This is intentional: retrieval adds budget. Trying to hit 800 ms would force you to cut the reranker — and the reranker is where citation precision lives. Do not let a naive latency target kill the quality component that defines your product.
> [!tip] Insight
> Shreya Shankar's “Who Validates the Validators” problem. Your LLM judge for groundedness and citation precision is itself an LLM. It needs calibration: run the judge on a 50-example human-labeled subsample and measure judge accuracy before trusting its scores at scale. An uncalibrated judge that overestimates groundedness by 8 pp gives you a false sense of security — and in a system where citations are the trust signal, false security has direct user-trust consequences.
> [!tip] Insight
> The LLM is cheap; retrieval is the bill. A common mistake in Perplexity-style system design is treating the LLM as the dominant cost center and optimizing there first. In practice, at steady-state scale, the embedding model for query encoding, the vector index serving layer, and the cross-encoder reranker together account for a substantial fraction of per-request spend (community estimates: 30–50%). Before cutting the LLM size to save money, check whether the reranker can be made leaner or the cache hit rate can be improved.
> [!tip] Insight
> Margin note. Most candidates deep-dive the LLM selection. The LLM is the least differentiating component — any sufficiently capable model can summarize five retrieved chunks. The reranker and citation binder are where Perplexity's product quality actually lives. Deep dive those.
> [!tip] Insight
> The hardest SLO to operationalize is citation precision. Latency and availability fire binary alerts. Citation precision requires continuous sampling, an NLI inference pipeline on production traffic, and a calibrated judge — all running at non-trivial cost. The reranker regression row above shows why this is worth building: a 4-hour detection window vs. a 2-minute detection window is a 120× cost multiplier, and the cost compounds non-linearly if the regression persists for days. The organizations that skip citation-precision monitoring discover the regression from a viral tweet, not a dashboard.
## Interview Questions
### ★★★ _(Anthropic, Google)_
**Q:** Your eval shows LLM generation quality increased 3% after swapping to a larger model, but user satisfaction is flat. Where do you look first?
Answer
Retrieval and citation quality. Users experience Perplexity through the surface of citations — a well-cited but merely-OK answer is trusted; a well-written answer with a wrong or missing citation is distrusted. A 3% generation improvement is invisible if the retrieval recall is unchanged or degraded. Check citation precision (does source X actually support claim Y?), check retrieval recall@5 on the golden query set, and check groundedness score (NLI entailment between claims and cited source). Only when those are stable does generation quality become the marginal lever.
### ★★★ _(Google, OpenAI)_
**Q:** Design the freshness subsystem for Perplexity. How do you decide which queries trigger a live web fetch versus serving from the vector index?
Answer
Two-signal routing: (1) Query classifier — a lightweight model that identifies freshness-critical intent from keywords and semantic patterns (breaking news, prices, scores, “latest”-style phrasing) and sends those queries to a live web fetch regardless of what the index holds. (2) Index staleness — for everything else, compare the crawl timestamps of the top matching documents against a per-topic freshness budget (minutes for news, days for evergreen content); if the matches are older than the budget, trigger a live fetch and merge the fresh results with the indexed ones before reranking. The default path stays on the vector index, because live fetches are the latency and cost outlier, not the norm.
### ★★★ _(Anthropic, Google)_
**Q:** The cross-encoder reranker was upgraded. Citation precision silently dropped from 95% to 85%. How was this not caught before it reached production?
Answer
The likely gap: the upgrade was evaluated on standard relevance benchmarks (NDCG, MRR) where the new model improved, but citation precision is a downstream metric — it depends on what the LLM does with the reranked documents, not just which documents score highest. This is the component-versus-system eval gap: the reranker's offline score went up while the end-to-end metric was never re-run. The fix is procedural: any retrieval-stack change must pass the end-to-end golden-set eval (citation precision, groundedness) in a shadow deployment before rollout, with component benchmarks treated as a necessary but not sufficient gate.
### ★★★ _(Google)_
**Q:** Perplexity's vector index loses one shard but the service stays up. What is the user-visible impact, and how do you detect it?
Answer
Partial query degradation: queries whose relevant documents lived on the offline shard will silently receive worse answers — the system won't error, it will retrieve from the surviving shards and generate a confident answer from weaker evidence. Detection cannot rely on availability alerts alone: run canary queries with known gold documents pinned to each shard, and monitor retrieval recall@k and groundedness scores per shard; those drop immediately while the user-facing error rate stays at zero.
### ★★★ _(Anthropic)_
**Q:** On low-evidence queries (niche topics where retrieval returns little or no reliable support), what should the system do?
Answer
The correct behavior is calibrated uncertainty, not confident generation from weak context. A groundedness gate checks whether the top retrieved sources actually contain relevant evidence (using NLI entailment or LLM scoring). If groundedness falls below a threshold, the system should: (1) Disclose the low confidence explicitly (“sources on this topic are limited”); (2) constrain the answer to what the retrieved sources do support, with citations, rather than padding from parametric memory; (3) offer a reformulated or broader search instead of guessing. A hedged, well-cited partial answer preserves user trust; a fluent unsupported one destroys it.
## Further Reading
- [Perplexity Engineering Blog](https://www.perplexity.ai/hub/blog)
Primary source for Perplexity's own engineering and product write-ups.
- [Shreya Shankar — Who Validates the Validators? Towards LLM-Assisted Evaluation](https://arxiv.org/abs/2405.03600)
The foundational paper for understanding why the LLM judge evaluating your RAG system needs its own calibration. Essential reading before designing any groundedness or citation-precision eval.
- [Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020)](https://arxiv.org/abs/2004.04906)
The DPR paper that established dual-encoder dense retrieval as the production baseline. The retrieval recall numbers here are the standard against which all Perplexity-style systems are measured.
- [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction](https://arxiv.org/abs/2004.12832)
Khattab & Zaharia 2020 — the architecture that keeps per-token embeddings and uses MaxSim scoring. Relevant to understanding why single-vector bi-encoders are the retrieval floor, not the ceiling.
- [Eugene Yan — Patterns for Building LLM-Based Systems & Products](https://eugeneyan.com/writing/llm-patterns/)
Practitioner-level survey of RAG, evals, guardrails, and citation patterns. The sections on retrieval, memory, and guardrails map directly to the Perplexity design problem.
- [Chip Huyen — Building LLM Applications for Production](https://huyenchip.com/2023/04/11/llm-engineering.html)
The canonical post on production LLM engineering. The hallucination and evaluation sections ground the citation-precision and groundedness design choices in this module.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Claude Code / Cursor · Case: Design Midjourney
---
---
title: "Case: Design Claude Code / Cursor"
part: "Design Reviews"
number: 70
emoji: "🤖"
subtitle: "Coding agent at scale — context builder, tools, sandboxing"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🤖 Case: Design Claude Code / Cursor
> Coding agent at scale — context builder, tools, sandboxing
> [!question] Key Question
> The model is cheap. The context is what costs you.
← Case: Design Perplexity | → Case: Design Midjourney
## Key Insights
> [!tip] Insight
> Margin note. Sandbox escape has a special status: it's not a degraded SLO, it's a program-stopper. The asymmetry between cost incidents (detectable in minutes, recoverable by scaling) and trust incidents (discovered via bug bounty, covered in press) dictates that the sandbox architecture get disproportionate engineering time relative to its steady-state contribution.
> [!tip] Insight
> Hamel's north star. Hamel Husain's framing: your evals are only as good as the failure modes they surface. For coding agents, the failure modes that matter are silent ones — an edit that compiles but regresses a test the agent never ran, or a context builder that retrieves the wrong file and the model confidently uses it anyway. Write evals that catch these specifically; generic “does it succeed” benchmarks miss them entirely.
> [!tip] Insight
> The multi-turn amplification trap. The calculator above assumes every QPS unit is an independent request. Coding agents violate this assumption badly. A single user task generates a cascade: turn 1 (plan) → tool calls → turn 2 (decide) → more tool calls → turn 3 (write the patch). Each turn's input context includes all prior tool results, so context length grows each turn. A 3-turn task with tool results accumulating costs roughly 5× more than the single-inference number suggests. Plan your GPU fleet around task completions, not individual model calls — then multiply back; the sketch below makes the arithmetic concrete.
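A sketch of that amplification arithmetic: every turn re-sends the accumulated context and tool results, so input tokens grow with each turn. The per-turn token counts are illustrative.
```python
def task_token_cost(turns: int, prompt_tokens: int = 2_000,
                    tool_result_tokens: int = 1_500, output_tokens: int = 500) -> dict:
    """Total input/output tokens for one agent task vs. a single-call estimate."""
    total_in, total_out, context = 0, 0, prompt_tokens
    for _ in range(turns):
        total_in += context                               # every turn re-sends the context so far
        total_out += output_tokens
        context += output_tokens + tool_result_tokens     # context grows each turn
    single_call = prompt_tokens + output_tokens
    return {"input": total_in, "output": total_out,
            "vs_single_call": round((total_in + total_out) / single_call, 1)}

print(task_token_cost(turns=3))   # roughly 5x the naive single-inference estimate
```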
> [!tip] Insight
> You've now seen all three RAG shapes. The ChatGPT case study showed conversation-retrieval: dense vector search over a knowledge base, ranked by semantic similarity to the query. The Perplexity case study showed web-retrieval: live crawl + recency-weighted ranking against a query that has a freshness requirement. This module showed context-retrieval: multi-layer (file index + repo graph + embeddings) retrieval under a hard latency budget where the “document” is live, mutable code. The shared shape across all three: retrieve → compose → generate → cite/use. The differences are (1) latency budget (200 ms for web, 100 ms for coding, flexible for chat), (2) freshness requirement (seconds for web, milliseconds for code, hours for static KB), and (3) what “context” means (web pages, code chunks, conversation history). This is the pattern. You didn't need a dedicated “RAG chapter” because three concrete case studies embedded it better than an abstraction ever could.
> [!tip] Insight
> The asymmetry that shapes the entire architecture. Cost incidents and latency incidents are recoverable. Trust incidents — corrupted files, silent failures, sandbox escapes — are not. The overlay filesystem, the PreToolUse hooks, and the microVM sandbox are expensive engineering investments that exist entirely to prevent the unrecoverable class of failure. Price them accordingly when writing the resource allocation for the platform team.
## Interview Questions
### ★★★ _(Anthropic, Google)_
**Q:** Your engineering manager says the new coding agent has “300 ms latency,” so it is fast enough to ship. What is wrong with that number?
Answer
300 ms is the LLM first-token time for a single inference call. A coding agent issues many model calls per user-visible task — one call to plan, one per tool result to decide what to do next, sometimes a final synthesis call. The user-visible latency is the sum of all those turns plus the I/O time for each tool execution. Reporting single-inference latency for a multi-turn agent is like reporting car engine cycle time as the answer to “how long is the drive.” Report the distribution of end-to-end task-completion time, along with the model-call and tool-call counts behind it.
### ★★★ _(Anthropic, OpenAI)_
**Q:** Your LLM compute cost is tripling month-over-month but user count is flat and average task count per user is flat. What is the amplifier, and where do you look?
Answer
The amplifier is inside the agent loop, not the user-facing metrics. Flat users × flat tasks means the same number of tasks are being started — but each task is doing more model calls. The most common causes, in order: (1) Tool-loop bloat — the agent is calling more tools per task (possibly because context changed and it re-explores more). (2) Context window expansion — longer conversations mean longer input context for every subsequent call, driving up prefill cost even with the same number of turns. (3) A routing regression that silently sends a growing share of calls to the most expensive model. Instrument tokens-per-completed-task, model-calls-per-task, and the per-model call mix; whichever curve bends up with the cost curve is the amplifier.
### ★★★ _(OpenAI, Google)_
**Q:** Describe the overlay filesystem used by a coding agent's sandbox. What does it protect against, and where does its protection stop?
Answer
An overlay filesystem layers a writable scratch layer on top of the user's read-only working tree: every write the agent makes lands in the overlay, reads fall through to the real files, and the underlying repo stays untouched until the user reviews and applies the resulting diff. A bad or runaway edit is discarded by dropping the overlay; an accepted one is applied atomically. What it protects against: corrupted or half-written files, destructive commands scoped to the workspace, and the silent-edit class of trust incident. Where it stops: it only guards the filesystem — network access, package installation, and process side effects still need the separate sandbox layer (container or microVM) plus PreToolUse-style policy hooks.
### ★★★ _(Anthropic, Meta)_
**Q:** You're designing the context builder for a coding agent on a large repository. Which retrieval layers do you combine, and what does each one cost you?
Answer
Layer 1 — Static file index (ripgrep-class): full-text and symbol search over the repo, built once and maintained by fs-watch. Latency: <10 ms for most queries. Trade-off: must be kept fresh after rebases and bulk renames; a stale index leads to the agent confidently editing files that have moved or no longer exist. Layer 2 — Repository structure graph (imports, call sites, symbol definitions): answers “what depends on this?” so cross-file edits stay coherent; costs a heavier build step and an incremental-update pipeline. Layer 3 — Embedding search over code chunks: catches semantically related code the exact-match layers miss; it is the slowest, the most expensive to keep fresh, and the most prone to retrieving plausible-but-wrong files, so its results rank last and fill the context budget only after the first two layers have contributed.
### ★★★ _(Anthropic, OpenAI)_
**Q:** An interview question at a company building coding agents: “Our agent leads SWE-bench, yet users say it feels slow and expensive. How can both be true, and what do you optimize?”
Answer
SWE-bench measures task success rate — did the agent produce a diff that passes the test suite? It does not penalize for token cost or wall-clock time. An agent that uses 10× more tokens and 3× more turns can score higher on SWE-bench while being materially worse on the product dimensions users feel. Resolution: weight SWE-bench success against token efficiency (useful tokens / total tokens per completed task) and task-completion latency as a multi-objective eval. A Pareto-dominant agent improves success rate without degrading efficiency — that is the correct optimization target. Concretely: add a tokens-per-resolved-task and a wall-clock-per-resolved-task column next to the benchmark score, and treat a success-rate gain that regresses either as a trade to be justified, not a free win.
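A sketch of the multi-objective comparison this answer proposes; the `AgentEval` fields and the dominance rule are illustrative.
```python
from dataclasses import dataclass

@dataclass
class AgentEval:
    name: str
    success_rate: float      # SWE-bench-style resolved fraction (higher is better)
    tokens_per_task: float   # lower is better
    minutes_per_task: float  # lower is better

def pareto_dominates(a: AgentEval, b: AgentEval) -> bool:
    """a dominates b if it is at least as good on every axis and strictly better on one."""
    at_least = (a.success_rate >= b.success_rate and
                a.tokens_per_task <= b.tokens_per_task and
                a.minutes_per_task <= b.minutes_per_task)
    strictly = (a.success_rate > b.success_rate or
                a.tokens_per_task < b.tokens_per_task or
                a.minutes_per_task < b.minutes_per_task)
    return at_least and strictly

candidate = AgentEval("v2", success_rate=0.38, tokens_per_task=310_000, minutes_per_task=11)
baseline  = AgentEval("v1", success_rate=0.36, tokens_per_task=95_000,  minutes_per_task=4)
# Higher benchmark score, but not Pareto-dominant: the efficiency regressions must be justified.
print(pareto_dominates(candidate, baseline))   # False
```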
## Further Reading
- [Anthropic — Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
Anthropic's guide to agent design patterns; the tool-use and orchestration sections map directly onto this module.
- [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629)
Yao et al., 2022 — the paper that formalized the observe-think-act loop underpinning all modern coding agents. Read before designing any tool-call architecture.
- [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://arxiv.org/abs/2310.06770)
Princeton / Chicago, 2023 — the benchmark that made coding-agent trajectory eval rigorous. Essential reading for any team designing agent eval harnesses.
- [Hamel Husain — Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/)
The practitioner post that converted a generation of AI engineers to eval-first design. The section on error analysis is the one to read before building a coding-agent eval harness.
- [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761)
Schick et al., 2023 — the foundational paper on training LLMs to use tools self-supervised. Provides theoretical grounding for tool-call precision/recall eval design.
- [Simon Willison — Things I](https://simonwillison.net/2023/Nov/18/complex-tool-use/)
Hard-won practitioner lessons on tool-use reliability, prompt design for tool selection, and the gap between benchmark performance and real-world correctness.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Midjourney
---
---
title: "Case: Design Midjourney"
part: "Design Reviews"
number: 71
emoji: "🎨"
subtitle: "Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🎨 Case: Design Midjourney
> Multi-tenant diffusion — queueing, step budgets, content safety, GPU economics
> [!question] Key Question
> A 50-step generation that fails at step 48 still costs you 48 steps
← Case: Design Claude Code / Cursor | → Case: Design TikTok For-You Ranking
## Key Insights
> [!tip] Insight
> Non-obvious SLO: track queue wait and denoising latency separately. A p95 end-to-end SLO of 45 s can be missed two entirely different ways: the queue is backed up (capacity problem) or individual denoising runs are slow (GPU health problem). Collapsing them into one number hides the root cause and leads to the wrong remediation. Separate dashboards, separate alert thresholds. For a video counterpart with the same queue-starvation dynamics, see the Sora video generation case study. For a cross-system SLO and cost comparison, see the SLO & Cost Compare module.
> [!tip] Insight
> Why image evals need larger calibration sets. Human raters agree on text quality ~85% of the time. On images, inter-rater agreement drops to ~70% for aesthetic quality — judges disagree on style, composition, and “good enough.” A smaller set that would give ±3 percentage points on a text eval gives ±6+ on images. Budget for 2× the calibration set size compared to an equivalent text eval, and run monthly human-anchor refreshes on a 100-image subsample to detect judge drift.
> [!tip] Insight
> The 50-step multiplier. An LLM uses one forward pass for prefill and one per output token. A diffusion model uses one forward pass per denoising step — 50 passes for a standard generation. Each pass processes the full spatial latent (e.g., 128×128 at 4 channels for 1024×1024 output), which is compute-intensive in a way that has no LLM analogue. Rule of thumb: a single H100 can serve roughly 1–3 images per second at 50 steps (per SwiftDiffusion, arxiv.org/abs/2402.10781 §4), compared to hundreds of LLM decode tokens per second. Design your fleet sizing from this measured baseline, not from LLM throughput numbers — the sketch below turns it into a GPU count.
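A sketch of fleet sizing from that measured images-per-second baseline rather than LLM token throughput; the utilization target and images-per-request are assumed inputs.
```python
import math

def diffusion_gpus_needed(qps: float, images_per_request: int = 4,
                          images_per_sec_per_gpu: float = 2.0,   # measured at 50 steps
                          peak_utilization: float = 0.6) -> int:
    """GPUs required to serve an image-generation workload at peak."""
    images_per_sec = qps * images_per_request
    raw = images_per_sec / images_per_sec_per_gpu
    return math.ceil(raw / peak_utilization)   # headroom for bursts and stragglers

print(diffusion_gpus_needed(qps=50))   # 50 QPS x 4 images -> ~167 GPUs at 60% utilization
```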
> [!tip] Insight
> Two deep dives, not four. The priority queue and checkpointer are the components with the highest blast radius if wrong — the queue affects every user's wait time and every paying customer's SLO, and the checkpointer determines what every GPU failure costs. Every other component (CDN, post-filter, API gateway) has a clear off-the-shelf design with well-understood failure modes. Deep-dive the novel parts; reference-design the commodity parts.
> [!tip] Insight
> Asymmetry: image-gen incidents skew toward reputational, not financial. An LLM service that goes down costs SLA credits. An image-gen service that generates one viral bad image costs the trust of an entire user base and potentially triggers regulatory action. The engineering implication: invest in safety infrastructure at a level that looks disproportionate relative to the financial downside — because the reputational downside is existential.
## Interview Questions
### ★★★ _(OpenAI, Google)_
**Q:** A generation fails at step 48 of 50. How do you design the system so you don't pay for 48 wasted steps every time this happens?
Answer
Three interlocking mitigations: (1) Safety pre-filter on the text prompt - a cheap classifier rejects policy violations before any GPU cycles are allocated. This is the highest-ROI mitigation because adversarial prompts fail text screening at a much higher rate than benign ones, and they are the dominant source of late-stage failures. (2) Step-level checkpointing of the intermediate latent, so a failure at step 48 resumes from the last saved step rather than from step 0. (3) Early-abort health checks inside the denoising loop (NaN detection, per-step latency alarms) so a run that is already doomed is killed in the first few steps rather than at step 48.
### ★★★ _(Google, Meta)_
**Q:** How do you enforce per-tier step budgets at the scheduler level without modifying the diffusion model itself?
Answer
The scheduler wraps the denoising loop: it maintains a counter per job and halts the loop once it reaches the tier's step cap. Because the cap lives in the scheduler rather than in the model, free and paid tiers can run different step counts - and therefore different cost and quality levels - against the same deployed checkpoint.
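A minimal sketch of that wrapper, assuming a hypothetical `denoiseStep` call and an illustrative tier table (neither is a real API):

```typescript
type Tier = "free" | "standard" | "pro";

// Illustrative caps; the real values are a product decision, not a constant.
const STEP_CAP: Record<Tier, number> = { free: 30, standard: 40, pro: 50 };

interface Latent { data: Float32Array; step: number; }

// Placeholder for the real denoiser call (an RPC to the GPU worker in practice).
async function denoiseStep(latent: Latent): Promise<Latent> {
  return { data: latent.data, step: latent.step + 1 };
}

// The scheduler owns the loop, so the diffusion model never knows about tiers.
async function runGeneration(initial: Latent, tier: Tier): Promise<Latent> {
  const cap = STEP_CAP[tier];
  let latent = initial;
  for (let step = 0; step < cap; step++) {
    latent = await denoiseStep(latent);
  }
  return latent; // decoded downstream; fewer steps means lower cost and fidelity
}
```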
### ★★★ _(OpenAI, Anthropic)_
**Q:** Your safety post-filter has a 1% false-positive rate (blocks one in 100 legitimate images). At 50 QPS with 4 images per generation, what does that cost in GPU-seconds per day? How do you detect this regression?
Answer
50 QPS × 4 images × 86,400 seconds/day = ~17.3 million images/day. At a 1% FP rate that is ~173,000 images wasted per day. At roughly 5 GPU-seconds per image (50 steps × ~0.1 s/step on H100), that is ~865,000 GPU-seconds (~240 GPU-hours) of wasted compute daily. Detection: maintain a golden set of known-legitimate images, replay it through the post-filter on every deploy and daily in production, and alert when the block rate on that set rises above the ~1% baseline.
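The same arithmetic as a short script, with every constant taken from the assumptions stated in the answer above:

```typescript
// Assumptions from the answer: 50 QPS, 4 images per generation,
// ~5 GPU-seconds per image (50 steps at ~0.1 s/step on an H100), 1% FP rate.
const qps = 50;
const imagesPerGeneration = 4;
const gpuSecondsPerImage = 5;
const falsePositiveRate = 0.01;

const imagesPerDay = qps * imagesPerGeneration * 86_400;      // ~17.3M
const wastedImagesPerDay = imagesPerDay * falsePositiveRate;  // ~173K
const wastedGpuHoursPerDay =
  (wastedImagesPerDay * gpuSecondsPerImage) / 3_600;          // ~240 GPU-hours

console.log({ imagesPerDay, wastedImagesPerDay, wastedGpuHoursPerDay });
```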
### ★★★ _(Google, Meta)_
**Q:** How would you explain to a new engineer why the CapacityCalculator built for LLM serving gives the wrong answer for a diffusion service?
Answer
An LLM processes a prompt in roughly one forward pass (prefill) plus one pass per output token (decode). Total compute is proportional to input + output tokens - typically a few hundred forward passes at most. A diffusion model runs the denoising network 30–100 times per image with a full U-Net or DiT pass each time. A single 1024×1024 image generation on SDXL costs roughly 30–50 U-Net forward passes - each much heavier than an LLM decode step because the spatial resolution is large. The LLM calculator treats compute as proportional to token counts, so applied to a diffusion service it misses the per-step multiplier entirely and undersizes the fleet by an order of magnitude or more.
### ★★★ _(OpenAI, Anthropic)_
**Q:** Midjourney surfaces a high-profile content violation that bypassed both text-level pre-filter and image-level post-filter. Walk through the immediate response and the three-week follow-up.
Answer
Immediate (hours): (1) Identify and delete the offending content; (2) Temporarily lower the detection threshold on the post-filter to cast a wider net, accepting a higher false-positive rate as a safety-first tradeoff during investigation; (3) Pull the prompt and full generation parameters for forensic analysis. Week one: characterize the bypass - was it a novel adversarial prompt, a gap in the pre-filter's coverage, or a post-filter threshold miss? Weeks two and three: add the bypass class to the red-team golden set, retrain or re-threshold the affected filter, re-run the full safety eval, and only then return the post-filter threshold to its normal operating point.
## Further Reading
- [Denoising Diffusion Probabilistic Models (Ho et al., 2020)](https://arxiv.org/abs/2006.11239)
The foundational DDPM paper. Understanding the denoising loop is prerequisite knowledge for reasoning about step budgets, checkpointing, and why failures at step 48 are expensive.
- [High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022)](https://arxiv.org/abs/2112.10752)
Introduced latent diffusion - the architecture behind Stable Diffusion. Shows why denoising in latent space (not pixel space) is tractable at scale, and how the VAE bottleneck interacts with generation quality.
- [DALL-E 3 Technical Report (OpenAI, 2023)](https://cdn.openai.com/papers/dall-e-3.pdf)
OpenAI's report on DALL-E 3, covering the synthetic-caption training approach and the prompt-handling pipeline that sits in front of the image model.
- [Stability AI Research](https://stability.ai/research)
Primary source for Stable Diffusion architecture notes, SDXL improvements, and the open-weight model family that forms the technical baseline for most independent diffusion services.
- [Efficient Diffusion Serving - Ying Sheng et al., SwiftDiffusion (2024)](https://arxiv.org/abs/2402.10781)
A practitioner paper on batching strategy, LoRA switching, and GPU utilization for diffusion serving at scale. The most directly relevant systems paper for this case study.
- [C2PA Content Credentials Specification](https://c2pa.org/specifications/specifications/2.0/specs/C2PA_Specification.html)
The open standard for embedding AI-generation provenance metadata in images. Relevant to the OpenAI company-lens discussion on watermarking and traceability.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design TikTok For-You Ranking"
part: "Design Reviews"
number: 72
emoji: "📱"
subtitle: "Two-tower retrieval + ranker + feature store - classical ML@scale canon"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 📱 Case: Design TikTok For-You Ranking
> Two-tower retrieval + ranker + feature store - classical ML@scale canon
> [!question] Key Question
> Why the Explore/Exploit slider matters more than the model
← Case: Design Midjourney | → Case: Design an Embeddings Platform
## Key Insights
> [!tip] Insight
> The setpoint as a product SLO. The explore/exploit ratio does not appear in the table above as a fixed number - and that is intentional. It is a variable controlled by Product, tuned via A/B experiments on retention and creator health metrics. ML teams that treat it as a model hyperparameter and tune it on offline NDCG will optimize it in the wrong direction. Offline NDCG rewards exploit (known-good items); long-run retention often rewards explore (novel content that prevents filter-bubble fatigue). The SLO table is where Product declares the goal; the setpoint is one dial they turn to achieve it. The candidate generation stage relies on the same ANN index infrastructure covered in the Embeddings Platform case study. For a cross-system failure taxonomy comparison, see Failure Taxonomy Compare.
> [!tip] Insight
> The bitter experience of recsys eval. The field has decades of evidence that offline NDCG improvements do not reliably translate to online retention gains. The YouTube two-tower paper noted this explicitly: the most important signal was whether the model improved live A/B metrics, not offline numbers. Design the eval harness to treat offline metrics as regression detectors (did something break?) and online A/B as the source of truth for improvements.
> [!tip] Insight
> The real bottleneck is not the ranker. At 100K QPS, the GPU budget for the heavy ranker is manageable because the model is small and the computation per request (scoring ~500 candidates) is highly parallelizable. The harder engineering problem is the feature-store read latency: every request needs to assemble real-time user features (last-N interactions) plus item features (freshness score, engagement rate) for all candidates within the p99 200 ms budget. Optimizing feature-store read latency - batching reads, pre-computing hot user embeddings, sharding by user ID - is where the real capacity work lives.
> [!tip] Insight
> Why the diversity re-ranker is separate. It is tempting to add diversity constraints directly into the heavy ranker's loss function (e.g., a diversity regularization term). Resist this. Entangling relevance and diversity in a single model means every policy change - a new safety rule, a new creator-fairness target - requires re-training and re-deploying the ranker. A separate re-ranker is deterministic, fast, and policy-configurable without ML involvement. The division of labor: ML maximizes relevance; the re-ranker applies constraints. This mirrors the Product/ML ownership boundary in the explore/exploit setpoint.
> [!tip] Insight
> The diversity re-ranker is your safety circuit breaker. Because it sits between the relevance ranker and the user, it is the correct place to enforce policy constraints, creator-fairness floors, and content caps. Putting safety logic in the relevance ranker couples two concerns that should evolve independently - a policy change should not require a model re-train.
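As a sketch of that division of labor: relevance scores arrive from the ranker already sorted, and the constraints are plain data a policy owner can edit without touching the model. The interfaces and constraint fields below are illustrative, not a production schema.

```typescript
interface Candidate {
  id: string;
  creatorId: string;
  category: string;
  relevance: number; // from the heavy ranker; input list is sorted by this, descending
}

// Policy-owned knobs: editable without retraining or redeploying the ranker.
interface Constraints {
  maxPerCreator: number;   // creator-fairness cap per feed page
  maxPerCategory: number;  // content cap per feed page
}

function rerank(rankedByRelevance: Candidate[], pageSize: number, c: Constraints): Candidate[] {
  const perCreator = new Map<string, number>();
  const perCategory = new Map<string, number>();
  const page: Candidate[] = [];
  for (const cand of rankedByRelevance) {
    const creatorCount = perCreator.get(cand.creatorId) ?? 0;
    const categoryCount = perCategory.get(cand.category) ?? 0;
    if (creatorCount >= c.maxPerCreator || categoryCount >= c.maxPerCategory) continue;
    page.push(cand);
    perCreator.set(cand.creatorId, creatorCount + 1);
    perCategory.set(cand.category, categoryCount + 1);
    if (page.length === pageSize) break;
  }
  return page;
}
```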
## Interview Questions
### ★★★ _(Meta, Google)_
**Q:** The PM asks to shift the feed toward showing more fresh content from new creators. How do you respond?
Answer
This is a product decision wearing an ML costume. Before touching anything: (1) Define the current setpoint - what fraction of each user's feed is exploratory content today, and which product metric it was last tuned against. Then frame the change as moving that setpoint, and insist it ships as an A/B test judged on retention and creator-health metrics, not on offline ranking metrics.
### ★★★ _(Meta, Google)_
**Q:** Offline NDCG@10 improved by 1.5 points in your candidate generator experiment. The online A/B shows flat retention and a small drop in creator fairness. Explain why, and what you do next.
Answer
The classic offline/online gap in recsys. Three likely causes: (1) Training data bias - the offline set reflects past impressions, which were already filtered by the old ranker. Your new generator retrieves different candidates that the user has never been shown, so engagement labels are missing for them (counterfactual gap). NDCG improves on seen items but the model is blind on unseen ones. (2) Distribution shift - offline eval uses a static snapshot; online users respond to position, context, and session state that offline eval doesn't capture. (3) Popularity concentration - a generator tuned on historical engagement tends to retrieve more head content, which explains the creator-fairness dip even while relevance looks flat. What to do next: treat the online A/B as the source of truth and do not ship on the NDCG win alone; add counterfactual logging or a small exploration slice so future offline evals have unbiased labels for the candidates the new generator surfaces.
### ★★★ _(Meta, Databricks)_
**Q:** Design the feature store for a TikTok-scale feed ranker. What features live in which tier, and what is the failure mode if online/offline feature parity breaks?
Answer
Three tiers: (1) Real-time (sub-second latency): the user's last-N interactions and in-session signals, served from an in-memory online store. (2) Near-real-time (minutes): item engagement velocity and freshness scores, maintained by streaming jobs. (3) Batch (daily): long-term user and creator aggregates and precomputed embeddings. If online/offline feature parity breaks, the model is trained on feature values the serving path never produces (train/serve skew): offline metrics stay healthy while online ranking quality silently degrades, and nothing errors - which is what makes it the most dangerous failure mode for a feature store.
### ★★★ _(Meta, Google)_
**Q:** A new video goes viral within 10 minutes of upload. Your ranker gives it near-zero relevance scores. What architectural components are failing, and what is the fix?
Answer
This is the cold-start / fresh-item problem. The ranker relies on engagement history (watch rate, like rate, share rate) to score items. A video uploaded 10 minutes ago has no engagement history - it falls to the bottom of the ranked list regardless of quality. The failure is in two places: (1) The candidate generator has no dedicated fresh-content source, so a zero-history item may never even be retrieved; (2) the ranker has no features that work without engagement history. The fix: a fresh/trending candidate source driven by short-window engagement velocity, plus content-based features (creator track record, content embeddings) so the ranker can score items with no history at all.
### ★★★ _(Databricks, Meta)_
**Q:** Databricks asks: how do you structure the ML training pipeline so that a new ranker version can be shadow-tested, compared to the champion, and promoted - without taking the feature store offline or requiring a full data re-backfill?
Answer
The pattern is a champion/challenger shadow pipeline. (1) Feature-store versioning: features are versioned by name plus a version tag, so the challenger can read new feature versions while the champion keeps serving from the existing ones - no in-place mutation of live tables and no full re-backfill. (2) Shadow scoring: the challenger scores a sample of live traffic asynchronously; its outputs are logged for comparison but never served. (3) Comparison: offline regression metrics plus replay against the champion's logged impressions. (4) Promotion: a feature-flag cutover once the challenger clears the agreed gates, with instant rollback to the champion if online metrics regress.
## Further Reading
- [Eugene Yan - Patterns for Personalization in Recommendations](https://eugeneyan.com/writing/patterns-for-personalization/)
Practitioner-grade breakdown of retrieval, ranking, and re-ranking patterns at scale. The canonical starting point for recsys system design.
- [Covington et al. - Deep Neural Networks for YouTube Recommendations (RecSys 2016)](https://research.google/pubs/pub45530/)
The paper that introduced the two-tower architecture for candidate generation at scale. Still the reference implementation for user-tower + item-tower + ANN retrieval.
- [Chip Huyen - Designing Machine Learning Systems (O'Reilly, 2022)](https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/)
Chapter 6 on feature engineering and chapter 9 on the feedback loop are directly relevant to the online/offline feature-store parity problem and counterfactual logging.
- [Tecton - The Feature Store Explained](https://www.tecton.ai/blog/what-is-a-feature-store/)
The clearest public explanation of the online/offline feature store architecture, backfill strategies, and train/serve skew. Written by practitioners who built the Uber Michelangelo feature store.
- [Pinterest Engineering - Pinnability: Machine Learning in the Pinterest Home Feed](https://medium.com/pinterest-engineering/pinnability-machine-learning-in-the-home-feed-64be2074bf60)
A real-world case study of the explore/exploit tradeoff, diversity re-ranking, and the product/ML boundary in a large-scale feed system.
- [Instagram Engineering - Powered by AI: Instagram's Explore Recommender System](https://ai.meta.com/blog/powered-by-ai-instagrams-explore-recommender-system/)
Meta's engineering write-up on the Explore recommender - candidate generation, ranking, and the production constraints of a feed system at Instagram scale.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design an Embeddings Platform"
part: "Design Reviews"
number: 73
emoji: "🧭"
subtitle: "Pinterest-style - backfill, drift, model upgrades, serving with HNSW"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🧭 Case: Design an Embeddings Platform
> Pinterest-style - backfill, drift, model upgrades, serving with HNSW
> [!question] Key Question
> The day you change your embedding model, every index goes stale
← Case: Design TikTok For-You Ranking | → Case: Design Llama Training Infra
## Key Insights
> [!tip] Insight
> The migration SLO is the least obvious. Without a 7-day budget cap, teams underestimate dual-index storage costs. 10M items × 768 dims × 4 bytes × 2 indexes ≈ 60GB - manageable. At 1B items, that is 6TB of extra storage that must be provisioned, warmed, and then decommissioned in a bounded window. The feed ranking pipeline that consumes these embeddings is covered in the Feed Ranking case study. For the retrieval-augmented generation pattern that uses embedding lookup at inference time, see the RAG comparison module.
> [!tip] Insight
> Eval-first discipline pays off most during rollback. If the migration goes wrong mid-way, the eval harness determines exactly which consumer crossed the recall regression threshold, enabling selective rollback (roll back ads but keep recsys on the new index) rather than a full revert.
> [!tip] Insight
> Storage math matters more than compute here. 10M items/day × 768 dims × 4 bytes = 30GB of new vectors per day. After 1 year that is ~10TB. During a 7-day migration window, dual-write adds another ~210GB of temporary storage. Budget for this in your capacity plan - it is the infra cost that constrains the migration window, not GPU time.
> [!tip] Insight
> Interview framing. Every interviewer will ask “how do you upgrade the model?” The wrong answer is “retrain and redeploy.” The right answer starts with: “model upgrade is a migration event - here's the dual-write protocol, here's the backfill SLA, and here's the eval gate that triggers cutover.”
> [!tip] Insight
> Silent degradation is the hardest incident type. The platform API returns 200 OK. The embedder is running. The HNSW index is healthy. But recall@k has dropped 15pp because a shard is serving stale embeddings from before the last rebuild. Only the per-consumer recall@k monitor catches this - which is why building that monitor is not optional.
## Interview Questions
### ★★★ _(Meta, Google)_
**Q:** You upgrade your embedding model. All existing HNSW indexes are now stale. How do you plan the migration without regressing search quality overnight?
Answer
The migration has four phases: (1) Dual-write - the Embedder Service begins writing to both the old index and the new index for every incoming item. This prevents the new index from falling behind on fresh content. (2) Backfill - an offline pipeline re-embeds the full corpus with the new model and inserts into the new index; priority-queue by item recency so high-traffic items land first. (3) Blended read - the retrieval layer blends results from both indexes with a sliding weight (100% old → 0% old over ~3 days), controlled by a feature flag per consumer. (4) Cutover - once the new index matches or exceeds the old index on recall@k golden queries for all consumers, the old index is taken offline and the dual-write layer is removed. Failure mode: index divergence during writes (a network partition writes to only one index). Mitigation: a consistency-check job that samples 1% of items per hour and alerts if the two indexes differ by more than 5%.
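A sketch of the dual-write and blended-read plumbing, assuming a generic `IndexClient` interface; this is not any specific vector database's API, and a production version would merge by score rather than interleave naively.

```typescript
interface IndexClient {
  upsert(id: string, vector: number[]): Promise<void>;
  search(vector: number[], k: number): Promise<string[]>; // item IDs, best first
}

// Phase 1: dual-write so the new index never falls behind on fresh items.
async function dualWrite(id: string, vec: number[], oldIdx: IndexClient, newIdx: IndexClient) {
  await Promise.all([oldIdx.upsert(id, vec), newIdx.upsert(id, vec)]);
}

// Phase 3: blended read; oldWeight slides from 1.0 to 0.0 over ~3 days,
// controlled by a per-consumer feature flag.
async function blendedSearch(
  query: number[],
  k: number,
  oldWeight: number,
  oldIdx: IndexClient,
  newIdx: IndexClient,
): Promise<string[]> {
  const [fromOld, fromNew] = await Promise.all([
    oldIdx.search(query, k),
    newIdx.search(query, k),
  ]);
  const takeOld = Math.round(k * oldWeight);
  const merged = [...fromOld.slice(0, takeOld), ...fromNew]; // naive interleave
  return [...new Set(merged)].slice(0, k);
}
```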
### ★★★ _(Meta, Anthropic)_
**Q:** You have four internal consumers (search, recsys, ads, dedup) sharing the same embedding platform. How do you design SLOs that satisfy all four without over-provisioning for the most demanding one?
Answer
Segment SLOs by consumer tier and access pattern. Ads requires the tightest recall@k (0.90) and lowest latency because a missed embedding directly costs revenue; it gets dedicated online capacity with p95 <30ms. Search (recall@k 0.85) and recsys (0.75) share an online serving pool with p95 <50ms - their tolerance for occasional cache misses is higher. Dedup is a batch consumer with no latency SLO; it uses the async endpoint and shares GPU time with the backfill pipeline during off-peak hours. The key design principle: each consumer owns its own HNSW shard replica with the right recall tuning (ef_search parameter), so one consumer's tuning or traffic spike never degrades another consumer's recall or latency.
### ★★★ _(Google, OpenAI)_
**Q:** The semantic drift monitor fires an alert - the cosine similarity distribution of new embeddings has shifted relative to last month's baseline. Walk through your triage.
Answer
First question: is the shift in the data or in the pipeline? Rule out a deploy first - a new embedder version, a preprocessing or tokenizer change, an index rebuild - before treating the alert as genuine content drift. If nothing shipped, segment the drift by content category and consumer: a shift confined to one category usually means the input distribution really changed (a new content trend) and may be benign, while a uniform shift across categories points at the embedding pipeline itself. Only then decide whether the response is retraining, re-backfilling, or simply updating the drift baseline.
### ★★★ _(Google, Meta)_
**Q:** An interviewer asks why you chose HNSW over an exact k-NN index or a flat FAISS index. Give a number-backed answer.
Answer
For a corpus of 10M+ items, exact k-NN requires O(N) distance computations per query - at 10M items and a 768-dim embedding, that is 10M dot products per query, roughly 10ms on a modern CPU. At 10,000 QPS, you need ~100 CPU cores just for retrieval with zero overhead. HNSW (Hierarchical Navigable Small World) achieves sub-linear query time by building a multi-layer graph; at M=16, ef_construction=200, recall@10 of ~0.95, query latency is ~1ms on a single core (per Malkov & Yashunin 2018, Table 2 - https://arxiv.org/abs/1603.09320). The tradeoff is memory: HNSW stores the graph structure at ~100 bytes/item overhead beyond the raw vectors. At 10M items × (768 dims × 4 bytes + 100 bytes overhead) = ~32GB - fits on one 40GB GPU or a couple of CPU nodes. FAISS flat is appropriate for corpora under 1M items or for offline eval; HNSW is the standard choice for online serving at Pinterest/Meta scale.
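The core arithmetic from that answer as a script, using the same assumptions (10 ms per exact query per core, ~100 bytes/item of HNSW graph overhead):

```typescript
// Exact k-NN: ~10 ms of dot products per query per core at 10M items x 768 dims.
const qps = 10_000;
const exactQueryMs = 10;
const coresForExactKnn = (qps * exactQueryMs) / 1_000; // 100 cores, retrieval only

// HNSW memory: raw vectors plus roughly 100 bytes/item of graph overhead.
const items = 10_000_000;
const dims = 768;
const bytesPerFloat = 4;
const graphOverheadBytesPerItem = 100; // grows with M; 100 is the answer's assumption
const totalGB =
  (items * (dims * bytesPerFloat + graphOverheadBytesPerItem)) / 1e9; // ~32 GB

console.log({ coresForExactKnn, totalGB: totalGB.toFixed(1) });
```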
### ★★★ _(Meta, OpenAI)_
**Q:** Describe the hot-shard problem for an ANN index during a viral traffic spike, and how you mitigate it.
Answer
When a viral item category spikes (e.g., a breaking news event), queries cluster around a narrow region of the embedding space. If the HNSW index is partitioned by item type or topic cluster, one shard receives a disproportionate fraction of QPS while others sit idle. The hot shard saturates - its queue depth and p99 latency climb while the rest of the fleet idles. Mitigations: partition by random hash rather than by topic so query load spreads evenly, add replicas for any shard that crosses a QPS threshold, and cache results for repeated near-identical viral queries.
## Further Reading
- [Pinterest Engineering - Unifying Visual Embeddings for Visual Search at Pinterest](https://medium.com/pinterest-engineering/unifying-visual-embeddings-for-visual-search-at-pinterest-74ea7ea103f0)
Primary source for Pinterest's unified visual embedding work - the multi-task embedding model and the serving and backfill story this case study is modeled on.
- [Malkov & Yashunin - Efficient and Robust Approximate Nearest Neighbor Search Using HNSW (2018)](https://arxiv.org/abs/1603.09320)
The foundational HNSW paper. Read Section 4 on layered graph construction and Section 5 on query complexity - essential for justifying M, ef_construction, and ef_search tradeoffs in an interview.
- [Eugene Yan - Patterns for Building LLM-Based Systems & Products](https://eugeneyan.com/writing/llm-patterns/)
Eugene Yan's survey of LLM system patterns - the evals, RAG, and caching sections intersect directly with how an embeddings platform is consumed downstream.
- [Chip Huyen - Designing Machine Learning Systems (O'Reilly, 2022)](https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/)
Chapter 7 on feature pipelines and Chapter 10 on infrastructure cover the embedding lifecycle - freshness, serving, versioning - at the right abstraction level for a senior design interview.
- [Weaviate Engineering Blog - HNSW vs. Flat Index Performance](https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw)
Benchmark-grounded comparison of ANN algorithms with real recall/latency/memory numbers. Use this to back up the HNSW justification in the architecture deep dive.
- [Shreya Shankar - Who Validates the Validators? Verifying Parity in ML Pipelines](https://www.shreya-shankar.com/rethinking-ml-monitoring/)
The argument that online/offline parity is the hardest SLO to enforce in an embedding platform. Directly relevant to the eval and canary sections of this module.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design Llama Training Infra"
part: "Design Reviews"
number: 74
emoji: "🔥"
subtitle: "Data pipeline + checkpoint management + failure-tolerant orchestration"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🔥 Case: Design Llama Training Infra
> Data pipeline + checkpoint management + failure-tolerant orchestration
> [!question] Key Question
> At 16K GPUs, a GPU fails every 3 hours - design for it
← Case: Design an Embeddings Platform | → Case: Design an Agent Platform
## Key Insights
> [!tip] Insight
> Goodput is not GPU utilization. A GPU running at 100% utilization on repeated identical micro-batches because the data pipeline stalled has 0% goodput for those steps. Goodput counts only steps whose output advances the accepted training trajectory - it penalizes failures, stalls, and corrupted batches equally.
> [!tip] Insight
> Eval-before-commit is load-bearing. The eval fleet gates checkpoint promotion - it is not a reporting dashboard. A checkpoint that writes successfully to the object store but has not passed the eval harness is stored as pending, not committed. Recovery rolls back only to the last committed checkpoint, avoiding the silent-corruption failure mode.
> [!tip] Insight
> Goodput is the real axis. A 16K H100 cluster at $3.50/GPU-hour (spot-market estimate, 2024; on-demand rates higher) costs ~$57,500/hour. The difference between 70% and 90% goodput on a 90-day run is roughly 432 wasted hours × 16,384 GPUs × $3.50 - on the order of tens of millions of dollars. This is why the SLO table lists goodput first, before any latency metric.
> [!tip] Insight
> Both components are operationally invisible when working. The training researcher sees a smooth loss curve and doesn't know that the ring-health monitor replaced three nodes overnight and the async checkpoint offloader ran 840 saves without pausing training. This is the correct outcome - failure should be handled below the researcher's attention layer. The cost of getting it wrong is that the researcher notices, which means days of investigation and millions of dollars of wasted compute.
> [!tip] Insight
> Fast-detected failures are cheap; slow-detected failures are catastrophic. The rack event and the cluster hang cost roughly the same at 3 hours of undetected failure. But the rack event with good detection costs less than $30K. The asymmetry is not the failure mode - it is the detection window. Every architectural choice that shrinks detection latency (ring-health monitor, continuous loss alerting, parity monitoring) is actually a cost-reduction investment, not an operational overhead.
## Interview Questions
### ★★★ _(Meta, OpenAI)_
**Q:** You lose a full rack (64 ranks) mid-run on a synchronous 16K-GPU training job. What happens to the job, and how does the system recover?
Answer
With synchronous all-reduce, the 64 lost ranks cause every other rank to hang waiting for the collective to complete. The ring-health monitor must detect the missing ranks within 30–60 seconds (not 3 hours) and signal the orchestrator. Recovery: (1) roll back to the last committed checkpoint in the object store - the most recent successfully eval'd checkpoint, not merely the most recent write; (2) swap the failed rack for healthy spare nodes; (3) rebuild the collective-communication topology and resume. The goodput cost is the re-computed steps since that checkpoint plus the detection-and-restart window.
### ★★★ _(Meta, Google)_
**Q:** Your team is debating checkpoint frequency: every 100 steps vs every 500 steps on a 16K H100 cluster. How do you decide?
Answer
The decision is a recovery-cost calculation. Recovery cost = (tokens between checkpoints) × (GPU-hours per token) × (GPU cost per hour). At 16K H100s running ~$3.50/GPU-hour, a 500-step gap with a 4M-token global batch per step means 2B tokens of re-computation, at roughly $57K per wasted hour of cluster time. The checkpoint write time is a fixed overhead per save - with async CPU offload + streaming to object store, this is typically 2–5 minutes per checkpoint for a 70B model. So: if failures happen every 6 hours and checkpoints take 3 minutes to write, a 100-step cadence adds ~1% overhead for async offload but halves expected re-computation. The asymmetric cost (small overhead vs catastrophic re-compute) almost always favors more frequent checkpoints. The right answer is: set cadence such that expected recovery cost ≤ 2× the checkpoint overhead cost.
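The cadence decision as arithmetic, under the assumptions in this answer (failure every ~6 hours, ~1% async-offload overhead at a 100-step cadence, an assumed step rate); all constants below are illustrative:

```typescript
const gpus = 16_384;
const dollarsPerGpuHour = 3.5;
const clusterDollarsPerHour = gpus * dollarsPerGpuHour; // ~$57K/hour
const stepsPerHour = 1_800;                             // assumed training step rate
const meanHoursBetweenFailures = 6;
const overheadFractionAt100Steps = 0.01;                // async checkpoint offload

function expectedDailyCostUsd(cadenceSteps: number): number {
  // Overhead scales with how often you checkpoint.
  const overheadHours = 24 * overheadFractionAt100Steps * (100 / cadenceSteps);
  // Each failure throws away, on average, half a checkpoint interval of progress.
  const failuresPerDay = 24 / meanHoursBetweenFailures;
  const recomputeHours = failuresPerDay * (cadenceSteps / stepsPerHour / 2);
  return (overheadHours + recomputeHours) * clusterDollarsPerHour;
}

// Under these assumptions the more frequent cadence wins:
console.log({
  every100Steps: Math.round(expectedDailyCostUsd(100)), // ~$20K/day
  every500Steps: Math.round(expectedDailyCostUsd(500)), // ~$35K/day
});
```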
### ★★★ _(Anthropic, OpenAI)_
**Q:** Your loss curve shows a sharp spike at step 48,000, then returns to trend. The checkpoint at step 47,900 looks clean. What do you investigate and in what order?
Answer
A transient spike that resolves suggests a bad batch, not a corrupted model. Investigation order: (1) Data pipeline: inspect the batch at step 48,000 - high loss often comes from a tokenizer bug that introduced garbled sequences, repeated content, or wrong language distribution. Grep for outlier token IDs, unusually long sequences, or domain-distribution jumps in that batch. (2) Numeric stability: check for NaN/Inf in loss, gradient norms, and activations at that step. A NaN that resolves suggests a single bad sequence was responsible. (3) Learning-rate schedule: was there a warm-up/cool-down boundary, or a scheduled LR spike at that step? (4) Hardware: did any rank show elevated error-correction counts (GPU ECC) at that step? A single bit-flip in activations produces exactly this signature. The checkpoint at 47,900 being clean is your recovery anchor - if you can replay step 48,000 deterministically with the same seed and reproduce the spike, it is a data problem; if it does not reproduce, suspect transient hardware on a specific rank.
### ★★★ _(Google, Meta)_
**Q:** An interviewer asks: why can't you train a 70B model with plain data parallelism across the cluster? Why do you need tensor and pipeline parallelism (or full sharding) at all?
Answer
Data parallelism alone fails at two limits: (1) Memory - a 70B model in bf16 needs ~140 GB for parameters + ~560 GB for Adam optimizer states. That doesn't fit on any single GPU, so parameters and optimizer state must be sharded across devices (tensor/pipeline parallelism, or ZeRO/FSDP-style sharding). (2) Scaling - even with sharded state, pure data parallelism at thousands of ranks makes the gradient all-reduce volume and the global batch size the binding constraints, which is why production stacks compose data, tensor, and pipeline parallelism (3D parallelism).
### ★★★ _(Meta, Anthropic)_
**Q:** A research engineer says: “72% goodput is fine, the GPUs are busy.” An infra lead counters that anything meaningfully below 100% is unacceptable. Who is right?
Answer
Both are wrong. Goodput (effective training flops / theoretical peak flops × time) is not binary - it has a cost-optimal point that depends on the economics of the cluster. Goodput < 85% is typically a red flag because the re-computation cost from failures + checkpoint overhead + pipeline bubbles together usually stays under 15% on a well-tuned cluster. At 72%, there are double-digit percentage points of recoverable cluster time on the table - worth millions of dollars over a long run - so it warrants investigation; but chasing the last few points toward 100% usually costs more in engineering effort and checkpoint overhead than it returns.
## Further Reading
- [Meta - Llama 3 Herd of Models (Dubey et al., 2024)](https://arxiv.org/abs/2407.21783)
The primary source for Llama-scale training infrastructure at Meta. Section 3 on pre-training covers the 3D-parallel strategy, checkpoint policies, and failure-recovery design that this case study is grounded in.
- [Megatron-LM: Training Multi-Billion Parameter Language Models (Narayanan et al., 2021)](https://arxiv.org/abs/2104.04473)
The paper that systematized 3D parallelism (DP × TP × PP) for large-scale training. Essential reading for the orchestration and tensor-parallelism sections of this module.
- [PyTorch FSDP: Fully Sharded Data Parallel (Zhao et al., 2023)](https://arxiv.org/abs/2304.11277)
The engineering paper behind PyTorch FSDP. Covers the ZeRO-3 sharding strategy, memory savings, and communication overlap that complement 3D parallelism.
- [PyTorch Distributed - Official Docs](https://pytorch.org/docs/stable/distributed.html)
Reference for torch.distributed, NCCL backend, process groups, and the DDP/FSDP/RPC APIs that underpin every production training stack.
- [Chip Huyen - Large Language Model Training at Scale](https://huyenchip.com/2023/05/02/rlhf.html)
Practitioner overview of the economic and operational realities of large-scale training - goodput, failure modes, and the org structure implications of running a cluster at this scale.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design an Agent Platform"
part: "Design Reviews"
number: 75
emoji: "🏗️"
subtitle: "Multi-agent infra - sandboxing, tool registries, trajectory eval, spend control"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🏗️ Case: Design an Agent Platform
> Multi-agent infra - sandboxing, tool registries, trajectory eval, spend control
> [!question] Key Question
> An agent that spawns agents - where does the budget live?
← Case: Design Llama Training Infra | → Case: Design Gemini
## Key Insights
> [!tip] Insight
> Why sandbox escape is existential but slow start is P2. An agent platform is a multi-tenant system. A sandbox escape lets one tenant read another's trajectory store, tool credentials, or model outputs - this is a data breach. The company does not survive this as a hosted platform. Slow start (2.5s instead of 2s) is annoying; a sandbox escape is company-ending. Prioritization by blast radius, not by technical difficulty.
> [!tip] Insight
> Why trajectory eval, not final-answer eval. An agent that succeeds by burning 10× more tokens than necessary, or that selected the correct answer after four wrong tool calls, looks perfect on final-answer eval. Trajectory eval catches it: tool-call P/R is low, spend efficiency is low. These are the agents that blow past budgets in production. Hamel Husain's core argument: measuring only the output is measuring only the last inch of a mile-long run.
> [!tip] Insight
> The amplification trap. Every capacity plan for an agent platform that starts from “user tasks per second” is wrong by the average tool-call depth. For 1,000 concurrent agents with 20 tool calls each, the real LLM QPS is 20,000 - before accounting for sub-agents. A naive design that provisions for 1,000 QPS at the LLM gateway will brown out immediately. Always derive LLM gateway capacity from user_tasks × avg_llm_calls_per_task × (1 + avg_child_agent_depth); the arithmetic is sketched just after these insights.
> [!tip] Insight
> The recursive-agent trap. The most common spend-control bug on agent platforms: the parent agent spawns 50 child agents to parallelize a research task. Each child is below the per-trajectory cap. The parent has not been charged for child spend because child budgets were tracked independently. Total cost: 50 × per-child budget, which far exceeds the parent's cap. Fix: always roll child spend into the parent's envelope before the child is dispatched, not after it returns.
> [!tip] Insight
> Agent platforms amplify blast radius. A traditional LLM API: one bad request → one bad response. A hosted agent platform: one bad task dispatch → 50 child agents → 1,000 model calls → $200 in unaccounted spend, all before the user sees an error. The spend-control and sandboxing SLOs in this module are P0 specifically because the amplification factor makes every latent failure catastrophically larger than it would be in a stateless API.
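A short sketch of the amplification arithmetic from the insights above; the traffic numbers plugged in are illustrative, not measured.

```typescript
// LLM-gateway capacity must be derived from amplification, not from user task rate.
function llmGatewayQps(
  userTasksPerSecond: number,
  avgLlmCallsPerTask: number,
  avgChildAgentDepth: number, // average sub-agent fan-out per task
): number {
  return userTasksPerSecond * avgLlmCallsPerTask * (1 + avgChildAgentDepth);
}

// e.g., 1,000 task starts/second, 20 LLM calls per task, 1.5 child agents on average:
console.log(llmGatewayQps(1_000, 20, 1.5)); // 50,000 QPS at the gateway, not 1,000
```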
## Interview Questions
### ★★★ _(Anthropic, OpenAI)_
**Q:** An interviewer asks: at what granularity do you enforce spend limits on an agent platform - per model call, per tool call, or per trajectory? Defend the choice.
Answer
The right unit is the trajectory boundary - the cost of the current user-facing task. Per-model-call enforcement is too fine: a single task issues 20–100 model calls, and a cap that fires per call kills the task prematurely, arbitrarily, and repeatedly. Per-tool-call is too coarse: tools vary from a cheap grep to an expensive sub-agent spawn. The trajectory is the unit the user actually cares about: “this task may cost at most this much.” Enforce the cap in the trajectory orchestrator, and roll every child agent's budget into the parent's envelope at dispatch time (per the recursive-agent trap above) so parallel sub-agents cannot multiply past the cap.
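A sketch of that envelope accounting, with child budgets carved out of the parent before dispatch; the class and method names are hypothetical, not a real SDK.

```typescript
class TrajectoryBudget {
  private spent = 0;
  private reserved = 0;

  constructor(private readonly capUsd: number) {}

  // Called before every model or tool call; throws if the envelope is exhausted.
  charge(costUsd: number): void {
    if (this.spent + this.reserved + costUsd > this.capUsd) {
      throw new Error("trajectory budget exceeded");
    }
    this.spent += costUsd;
  }

  // Child spend is carved out of the parent BEFORE the child is dispatched.
  reserveForChild(childCapUsd: number): TrajectoryBudget {
    if (this.spent + this.reserved + childCapUsd > this.capUsd) {
      throw new Error("cannot dispatch child: parent envelope exhausted");
    }
    this.reserved += childCapUsd;
    return new TrajectoryBudget(childCapUsd);
  }

  // When the child returns, convert its actual spend into parent spend
  // and release the unused part of the reservation.
  settleChild(childCapUsd: number, childActualUsd: number): void {
    this.reserved -= childCapUsd;
    this.spent += childActualUsd;
  }
}
```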
### ★★★ _(Anthropic, Google)_
**Q:** Design the capability-token scheme for a tool registry on a multi-tenant agent platform. What does a token contain and how does the runner validate it?
Answer
A capability token is a short-lived signed credential (HMAC-SHA256 or similar) that contains: (1) tenant ID, (2) tool ID and allowed parameter schema, (3) expiry (e.g., 5 minutes), (4) trajectory ID it was issued for. The agent runner presents the token when invoking a tool; the tool registry validates the signature, checks expiry, and confirms the trajectory ID matches the current session. Tokens are issued by the Trajectory Orchestrator at session start - the agent never sees raw credentials for the underlying tool APIs. Failure mode without this: a prompt-injected agent extracts the raw AWS credentials embedded in a tool and exfiltrates them. With capability tokens, the worst a compromised agent can do is invoke the permitted tools within the current session window.
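A minimal sketch of issuing and validating such a token with Node's built-in crypto module; the field names, TTL, and encoding here are assumptions for illustration, not a documented scheme.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface Capability {
  tenantId: string;
  toolId: string;
  trajectoryId: string;
  expiresAt: number; // epoch ms, e.g. issue time + 5 minutes
}

function sign(cap: Capability, secret: string): string {
  const payload = Buffer.from(JSON.stringify(cap)).toString("base64url");
  const mac = createHmac("sha256", secret).update(payload).digest("base64url");
  return `${payload}.${mac}`;
}

function verify(
  token: string,
  secret: string,
  expected: { trajectoryId: string; toolId: string },
): Capability | null {
  const [payload, mac] = token.split(".");
  if (!payload || !mac) return null;
  const want = createHmac("sha256", secret).update(payload).digest("base64url");
  if (mac.length !== want.length) return null;
  if (!timingSafeEqual(Buffer.from(mac), Buffer.from(want))) return null; // bad signature
  const cap = JSON.parse(Buffer.from(payload, "base64url").toString()) as Capability;
  if (Date.now() > cap.expiresAt) return null;                 // expired
  if (cap.trajectoryId !== expected.trajectoryId) return null; // wrong session
  if (cap.toolId !== expected.toolId) return null;             // wrong tool
  return cap;
}
```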
### ★★★ _(Anthropic, OpenAI)_
**Q:** Your trajectory store goes down during a P1 incident. What are the three compounding effects, and how does each one extend MTTR?
Answer
(1) Incident replay is blocked - the on-call engineer cannot reconstruct the agent's step-by-step tool calls and model outputs, so root-causing shifts from replay to log archaeology, adding hours to MTTR. (2) In-flight trajectories lose their durable state, so interrupted tasks cannot be resumed cleanly and must be re-run, re-burning spend and re-triggering side-effecting tools. (3) Spend accounting goes blind while the store is down, so budget enforcement degrades to conservative static limits or paused dispatch, turning a storage outage into a platform-wide availability problem that stretches the incident further.
### ★★★ _(Google, Anthropic)_
**Q:** A Google interviewer asks: is a ~125 ms microVM boot per agent session an acceptable cost, or do you need to engineer it away?
Answer
Yes, with two caveats. The 125ms boot cost is a one-time cost per session - it hits session-start latency, not per-tool or per-step latency. For a session that runs 20+ tool calls over several minutes, 125ms amortizes to noise. The SLO is <2s to first tool call, which leaves 1.875s after the 125ms boot for the orchestrator → runner → LLM → first tool call sequence; that budget is dominated by the first model round trip, not the boot. The caveats: (1) under bursty arrivals, cold boots queue, so keep a small warm pool of pre-booted sandboxes; (2) if the product later adds a latency-sensitive single-call mode, the boot cost stops amortizing and the warm pool becomes mandatory rather than optional.
### ★★★ _(Meta, Anthropic)_
**Q:** Meta asks: how do you design trajectory eval for a multi-agent system, where a parent agent's apparent success can hide failures in the sub-agents it spawned?
Answer
Trajectory eval for multi-agent systems must be hierarchical. Leaf evals measure end-to-end task success (did the root agent return a useful result?), tool-call correctness (precision/recall on tool selections vs. a golden trajectory), and spend efficiency (useful-work-$ / total-$ where useful-work is measured by a task-success judge). But leaf success can mask intermediate failures: a parent agent that succeeded only because a child agent hit a lucky path. Add intermediate eval: for each sub-agent invocation, record the child's own trajectory metrics - task success, tool-call precision/recall, spend - and attribute them up the tree, so a parent that passed on the back of one lucky child path shows up in the intermediate scores instead of hiding behind the leaf result.
## Further Reading
- [Anthropic - Building Effective Agents](https://www.anthropic.com/research/building-effective-agents)
Anthropic's practitioner guide to agent design - when a simple workflow beats an autonomous agent, and the tool-design guidance that informs this module's registry section.
- [ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)](https://arxiv.org/abs/2210.03629)
The paper that formalized the observe-think-act loop underpinning every agent on a hosted platform. The trajectory concept in this module maps directly to a ReAct episode.
- [Firecracker: Lightweight Virtualization for Serverless Applications (Agache et al., 2020)](https://www.usenix.org/conference/nsdi20/presentation/agache)
AWS's paper on the microVM technology behind Lambda and Fargate - the isolation model and ~125 ms boot times that anchor the sandboxing numbers in this module.
- [E2B - Secure Open-Source Cloud Runtime for AI Agents](https://e2b.dev/blog/how-we-built-e2b)
E2B's write-up on building a sandboxed cloud runtime for AI agents - a concrete reference architecture for the agent runner and sandbox lifecycle.
- [Hamel Husain - Your AI Product Needs Evals](https://hamel.dev/blog/posts/evals/)
The practitioner post that reframed eval-first design for the AI engineering generation. The trajectory eval section of this module follows Hamel's argument that you must measure the full trajectory of work, not just the final output.
- [LangSmith - Tracing and Evaluation for LLM Applications](https://docs.smith.langchain.com/)
LangSmith's docs on tracing and evaluation - a reference implementation of the trajectory capture, replay, and eval tooling this module treats as a first-class platform component.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design Gemini"
part: "Design Reviews"
number: 76
emoji: "💎"
subtitle: "Multi-modal frontier serving - TPU stack, 1M-token attention, safety classifier chain"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 💎 Case: Design Gemini
> Multi-modal frontier serving - TPU stack, 1M-token attention, safety classifier chain
> [!question] Key Question
> 1M-token context is cheap to promise, expensive to serve - here's the bill
← Case: Design an Agent Platform | → Case: Design NotebookLM
## Key Insights
> [!tip] Insight
> Non-obvious SLO choice: separate latency targets by context length. Most serving systems define a single p99 TTFT. Gemini cannot - the difference between a 1K-token and 1M-token query is three orders of magnitude in prefill compute. A single p99 number would be dominated by the long-context tail and would mask regressions on the short-context path that serves 95%+ of queries. The right design is separate SLO buckets: <32K, 32K–128K, 128K–1M. This follows directly from the Google SRE Book (Chapter 4) recommendation to define SLOs for distinct user populations and workload classes, not aggregate service behavior.
> [!tip] Insight
> Assumptions in the above table: all compute-efficiency figures are community estimates derived from public TPU v5p specs and measured API latencies. Google's actual cost structure is proprietary. The implied margin is a floor - it does not include networking, cooling, datacenter amortization, or team costs. The table's value is the relative magnitudes and sensitivities, not the absolute numbers.
> [!tip] Insight
> Cross-study connections. This module connects directly to the NotebookLM case study (long-context retrieval augmentation, same 1M-token window applied to document Q&A) and the Sora case study (multi-modal generation - video tokens as first-class inputs, same patch-grid tokenization math applied to video). If you've studied all three, you can describe Google's multi-modal strategy as a coherent stack: Gemini as the reasoning layer, NotebookLM as the long-context application layer, and the video understanding capability as the sensory input layer.
## Interview Questions
### ★★★ _(Google, Anthropic)_
**Q:** Gemini's 1M-token context window is real but serving it profitably is hard. Derive the minimum prefix-cache hit rate needed so the cost per 1M-token query stays below $10 (use publicly available API pricing as a reference point). What architectural components make or break that number?
Answer
Using current public Gemini 2.5 Pro standard pricing as a reference point (Google AI for Developers pricing page, April 2026), prompts above 200K tokens are priced at $2.50 per 1M input tokens and cached input at $0.25 per 1M. That means a 1M-token cold query costs about $2.50 before output tokens - already below $10. The real problem is repeated turns: if a session resends the same 900K-token prefix five times with no caching, you pay about $12.50 in repeated input cost. With a 90% cache hit on that 900K prefix, the repeated-turn input cost becomes roughly 100K uncached × $2.50/M + 900K cached × $0.25/M = $0.25 + $0.225 = $0.475 per turn. The load-bearing components are therefore: (1) a stable context hash so repeated prefixes actually hit the cache, (2) a serving path that keeps long prefixes warm on the same worker or a recoverable external cache, and (3) admission control so 1M-token sessions do not evict each other. The interview-safe conclusion is that long context is economically viable under current public pricing, but only if the cache hit path is treated as the default path rather than an optimization.
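The same cost arithmetic as a script, parameterized by how much of the input is served from cache; the prices are the public reference prices quoted above and will change.

```typescript
// Public reference prices for >200K-token prompts (subject to change):
const uncachedPerMTok = 2.5;  // $ per 1M input tokens
const cachedPerMTok = 0.25;   // $ per 1M cached input tokens

// Input cost of one turn, given how many of its tokens hit the prefix cache.
function turnInputCostUsd(totalInputTokens: number, cachedTokens: number): number {
  const uncached = totalInputTokens - cachedTokens;
  return (uncached * uncachedPerMTok + cachedTokens * cachedPerMTok) / 1_000_000;
}

console.log(turnInputCostUsd(1_000_000, 0));       // $2.50: fully cold turn
console.log(turnInputCostUsd(1_000_000, 900_000)); // $0.475: the 900K prefix is warm
```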
### ★★★ _(Google, Meta)_
**Q:** A Gemini multi-modal query arrives with a 10-image product catalog (each ~512KB JPEG). Walk through the full serving path, identifying the two highest-latency steps and how you bound them.
Answer
Per the Google blog on Gemini image tokenization, each image is converted to roughly 258 tokens by the multimodal encoder (variable based on resolution, but 258 is the documented canonical value for standard inputs). Ten images = ~2,580 image tokens added to the context. The two highest-latency steps are: (1) Image encoding - the SigLIP/ViT encoder processes each image into patch embeddings before the language model sees them. At batch size 1 on TPU v5p, encoding a 512KB JPEG takes on the order of 5–15 ms per image (inferred from ViT-L benchmarks on comparable accelerators); ten images serial = 50–150 ms. Bound this by parallelizing encoding across the 10 images - independent inputs, embarrassingly parallel. At batch 10, total encoding drops to the single-image time (15 ms) plus scheduling overhead. (2) Prefill for the full prompt - 2,580 image tokens + N text tokens must be prefilled on the generation model. At 1K token/ms prefill throughput on H100/TPU equivalent, 3K tokens = ~3 ms prefill - fast. But if the user has a long conversation history in the 1M-context window, the prefill cost dominates (1M tokens / 1K tokens/ms = 1 second, minus any KV cache hits). Bound this with prefix caching on the conversation history and chunked prefill so the image tokens do not block decode slots for other users. The multimodal encoder path must complete before the language model starts prefill - this is the hard dependency. If the encoder is on a separate TPU slice, ensure the embedding tensor is co-located (or transferred via NVLink-equivalent ICI) to avoid a D2D copy penalty.
### ★★★ _(Google, OpenAI)_
**Q:** Gemini 2.5 Thinking charges separately for thinking tokens. Design the serving-side token budget enforcer: what does it check, when does it fire, and what happens if the model tries to exceed the budget mid-generation?
Answer
The thinking budget is a per-request parameter (e.g., max_thinking_tokens: 8192). The enforcer lives as a generation wrapper around the TPU decoding loop. On each forward pass it maintains a running count of emitted thinking tokens (tokens inside the model's internal reasoning scratchpad, delimited by a special token pair). When the running count reaches the budget cap, the enforcer injects a “stop thinking” control token that signals the model to transition to the output phase. Three checks required: (1) Token classification - thinking tokens use a reserved token range or are wrapped in special delimiters; the enforcer must correctly distinguish thinking tokens from output tokens to avoid counting output against the budget (which would truncate the actual response). (2) Mid-generation preemption - if the model exceeds the budget before completing its reasoning, the enforcer must inject the stop-thinking signal without corrupting the KV cache state; the model must have been trained to handle a budget-exceeded interrupt gracefully. (3) Billing accuracy - thinking tokens consumed must be recorded per-request before the KV cache entry is written, so a node crash after generation but before billing does not silently undercount. The worst failure mode is a classifier bug that mistakes output tokens for thinking tokens and truncates the response when it hits the budget ceiling - this manifests as abruptly cut-off answers that pass safety checks but are incoherent. Detection: monitor response-length distribution; a sudden left-shift (short answers) after a thinking-classifier deploy is the signal.
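A sketch of that generation wrapper; `decodeNext` and the token fields are stand-ins for the real serving-engine interface, which is model-internal.

```typescript
interface Token { id: number; isThinking: boolean; isEos: boolean; }

// Placeholder for one decode step against the serving engine; stubbed here.
// forceStopThinking corresponds to injecting the "stop thinking" control signal.
async function decodeNext(sessionId: string, forceStopThinking: boolean): Promise<Token> {
  return { id: 0, isThinking: false, isEos: true };
}

async function generateWithThinkingBudget(
  sessionId: string,
  maxThinkingTokens: number,
  maxOutputTokens: number,
): Promise<{ thinkingTokens: number; outputTokens: number }> {
  let thinking = 0;
  let output = 0;
  while (output < maxOutputTokens) {
    // Once the budget is hit, every subsequent step carries the stop-thinking signal.
    const tok = await decodeNext(sessionId, thinking >= maxThinkingTokens);
    if (tok.isEos) break;
    if (tok.isThinking) thinking++; // misclassifying output as thinking truncates answers
    else output++;
  }
  // Usage is recorded for billing before the response is acknowledged downstream.
  return { thinkingTokens: thinking, outputTokens: output };
}
```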
### ★★★ _(Google, Anthropic)_
**Q:** Your team's safety post-classifier has a 2% false-positive rate on medical queries. That means 2% of legitimate doctor-patient research questions are refused. At 50K QPS and 5% medical query share, how many users per hour are wrongly blocked? What is the right architectural fix?
Answer
Arithmetic: 50,000 QPS × 5% medical share = 2,500 medical QPS. 2% false-positive rate × 2,500 = 50 wrong refusals per second. Per hour: 50 × 3,600 = 180,000 users per hour wrongly blocked. That is not a rounding error - it is a service-level failure on a user segment that includes healthcare professionals. The right architectural fix has two layers: (1) Calibrated fallback classifier - instead of a single binary classifier, use a three-outcome model: BLOCK, ALLOW, and UNCERTAIN. For UNCERTAIN results (~5% of edge cases), route to a more expensive but more accurate secondary classifier or a human review queue. This reduces the hard false-positive rate at the cost of latency on the uncertain slice, which is acceptable because users who receive UNCERTAIN-routed queries are presumably not in the critical streaming path. (2) Query-type context signal - feed the router's inferred query type (medical, legal, security, code) as a feature to the safety classifier. A query with strong medical intent markers (ICD codes, drug names, clinical terminology) should have a lower false-positive prior, not a higher one. The current failure mode is a context-free classifier that treats “what is the lethal dose of acetaminophen” identically whether it comes from a clinical database API or a user account with 50 prior jailbreak attempts. Personalization of the safety threshold based on trust signals is the correct direction (per Google's SafetySettings API, which already exposes per-category thresholds as a first-class feature).
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design NotebookLM"
part: "Design Reviews"
number: 77
emoji: "📓"
subtitle: "Long-context RAG over user docs - source-pinned citations, audio-overview pipeline"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 📓 Case: Design NotebookLM
> Long-context RAG over user docs - source-pinned citations, audio-overview pipeline
> [!question] Key Question
> Upload 50 PDFs, ask one question - which half of the stack wins?
← Case: Design Gemini | → Case: Design Sora
## Key Insights
> [!tip] Insight
> Why citation precision, not accuracy, is the primary SLO. NotebookLM does not claim to be factually accurate in the world-knowledge sense - it claims to be accurate relative to the sources you uploaded. A user uploading a wrong paper gets wrong citations, and that is correct behavior. The SLO is about fidelity to source, not fidelity to ground truth. This is why the system should never supplement user sources with model training memory, even when sources are sparse - doing so would violate the core contract.
> [!tip] Insight
> The silence failure mode. When a query asks about something not in the user's sources, the correct behavior is an explicit “not found in your sources” response, not a confident answer from training memory. Measure the rate of out-of-scope answers that cite non-existent paragraphs - this is the most trust-destroying failure because the user cannot detect it without reading the original document.
> [!tip] Insight
> The KV-cache is the business model. Without prefix caching on static document content, NotebookLM's long-context serving cost would be prohibitive at free-tier scale. The insight is that user-uploaded documents are "static prefixes" - they do not change between queries. Any inference engine that supports prefix KV-cache reuse (as Gemini 1.5 does, per Google's published context caching docs) turns the 10x input-token cost reduction into a direct margin improvement for every query after the first in a session.
> [!tip] Insight
> The silent-wrong-citation failure is the worst. A system outage is visible - users see an error page. A citation that links to a plausible but incorrect paragraph is invisible. The user clicks it, sees related (but wrong) text, and trusts the answer anyway. This is how source-grounded AI systems erode trust: not through obvious failures, but through calibration failures that look correct on the surface. The citation assignment eval is the only defense.
## Interview Questions
### ★★★ _(Google, Anthropic)_
**Q:** NotebookLM offers a free tier with no clear monetization path. Long-context inference over a 200-page PDF is expensive. How does the system serve the free tier profitably, or at least sustainably?
Answer
Three interlocking mechanisms keep the free tier viable. First, KV-cache reuse on the static document prefix is the primary lever. Because a user's uploaded sources rarely change between queries, the tokenized document representation can be prefix-cached on the Gemini fleet. Using current public Gemini 2.5 Flash paid pricing as a proxy (April 2026), cached input is 10x cheaper than uncached input: $0.03/M tokens cached vs $0.30/M uncached. A 200-page PDF at ~100K tokens therefore costs roughly $0.03 cold and $0.003 on warm turns before output tokens. Second, quota throttling limits worst-case cost per user: Google's current NotebookLM Help documentation allows up to 50 sources per notebook and up to 500,000 words per source, so the real control surface is query volume and feature gating rather than tiny source caps. Third, free-tier usage likely generates training signal and product-discovery value beyond the marginal serving cost. The structural bet is still freemium: most users ask a few questions, while cache reuse compresses the cost of engaged users who ask many questions over the same notebook.
### ★★★ _(Google)_
**Q:** A user queries across 10 uploaded PDFs. Gemini's 1M-token context window can fit them all. When should NotebookLM use full-context (all docs in the prompt) vs. RAG (retrieve top-k chunks first)?
Answer
The tradeoff is cost vs. recall completeness. Full-context gives the model access to every sentence in every document - ideal for queries requiring synthesis across many non-obvious locations (e.g., “find all instances where authors disagree about X”). Using Gemini 2.5 Flash paid pricing as a public proxy, a 500K-token cold context costs about $0.15 (500K × $0.30/M). A RAG path that retrieves top-20 paragraphs (~10K tokens total) costs about $0.003 on the input side - still roughly 50x cheaper. The decision rule should be signal-driven: use a query complexity classifier to route. Narrow factual queries (“what is the author's definition of X?”) route to RAG; synthesis queries (“compare the methodologies across all papers”) route to full-context. Cache state is also a signal: if the user's document set was queried recently and the prefix is likely warm, full-context input cost drops another 10x and the tradeoff swings toward full-context. NotebookLM's architecture (community estimate, per reverse-engineered behavior) appears to lean heavily on full Gemini context for source-grounded synthesis, betting that cache reuse makes this economically viable for engaged users.
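That decision rule as a sketch; the classifier output, cache-state signal, and threshold are stand-ins, not NotebookLM's actual routing logic.

```typescript
type QueryRoute = "rag" | "full_context";

interface RoutingSignals {
  isSynthesisQuery: boolean;  // from a query-complexity classifier (stand-in)
  totalSourceTokens: number;  // all uploaded docs, tokenized
  prefixLikelyWarm: boolean;  // was this notebook queried recently?
}

const SMALL_CORPUS_TOKENS = 50_000; // assumed: below this, full context is cheap anyway

function route(s: RoutingSignals): QueryRoute {
  if (s.totalSourceTokens <= SMALL_CORPUS_TOKENS) return "full_context";
  if (s.isSynthesisQuery) return "full_context"; // needs every sentence in scope
  if (s.prefixLikelyWarm) return "full_context"; // cached input is ~10x cheaper
  return "rag";                                  // narrow factual query, cold cache
}
```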
### ★★★ _(Google, Anthropic)_
**Q:** At Google, you're reviewing the eval spec for NotebookLM's citation correctness. What are the two most important eval dimensions and how do you measure them?
Answer
Citation correctness has two distinct failure modes requiring separate evals. The first is source-paragraph entailment: does the cited paragraph actually support the generated claim? Measure with an NLI model over (claim, cited-paragraph) pairs, sampling 5% of production query-answer pairs daily. Target: ≥92% entailment rate. The failure mode here is the model making a plausible claim from training memory and hallucinating a source paragraph that doesn't say that. The second dimension is citation assignment: when a claim is supported by source material, is it assigned to the correct document and paragraph among the user's uploaded sources? Mis-assignment is distinct from non-entailment - the system might correctly identify that a claim is supported somewhere, but link it to the wrong paragraph, violating the user's trust in the navigation (clicking a citation should take them to the exact sentence). Measure with a golden query set (100+ hand-annotated Q&A pairs where correct source paragraphs are labeled) run offline on every model update. An LLM judge evaluating citation correctness itself needs calibration against the human-labeled set - Shreya Shankar's EvalGen work (arXiv:2404.12272) shows uncalibrated LLM judges systematically over-report entailment by 8–12 pp on grounding benchmarks.
### ★★★ _(Google, Anthropic)_
**Q:** The Audio Overview feature generates a two-speaker podcast from user-uploaded documents. What are the two safety failure modes unique to this feature, and how do you architect the mitigation?
Answer
Two failure modes are unique to Audio Overview and absent from the text-query path. The first is voice-cloning abuse: a user could upload an audio recording of a real person (e.g., an executive's earnings call transcript with speaker audio) and attempt to get the TTS pipeline to synthesize content in that person's voice. Mitigation: the TTS models must use fixed synthetic voices that are not conditioned on user-uploaded audio. Google's publicly announced Audio Overview uses two fixed synthetic host voices (per Google Labs blog, 2024). The pipeline must include a speaker-identity guard that confirms the synthesis request routes only to pre-approved voice IDs, never to a user-supplied voice embedding. The second failure mode is PII amplification: user-uploaded documents may contain sensitive data (medical records, personal emails, internal financial docs). The two-speaker dialogue script generated from those docs could surface PII in a more legible, memorable form - a podcast version of a medical record is a greater privacy risk than the PDF. Mitigation: run the script through a PII detector before TTS synthesis; redact or paraphrase PII-containing spans before audio generation. Both mitigations should be in-pipeline, not advisory - the audio is not generated if the safety checks fail, with a user-facing error that explains why.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design Sora"
part: "Design Reviews"
number: 78
emoji: "🎬"
subtitle: "Text-to-video at scale - diffusion transformer GPU economics, safety on generative video"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🎬 Case: Design Sora
> Text-to-video at scale - diffusion transformer GPU economics, safety on generative video
> [!question] Key Question
> A 10-second clip costs more GPU-hours than your laptop's lifetime
← Case: Design NotebookLM | → Case: Design Character.ai
## Key Insights
> [!tip] Insight
> Non-obvious SLO: track queue wait and denoising latency separately. A p95 end-to-end SLO miss of 120 s can mean either the queue backed up (capacity problem - add GPUs or shed load) or individual denoising runs are slow (GPU health problem - inspect node metrics). Collapsing them into one number sends you to the wrong remediation. Additionally, track first intermediate frame latency as a separate SLO (target: <20 s) - even if the final clip takes 90 s, a low-resolution preview after step 10 dramatically reduces perceived wait time.
> [!tip] Insight
> Video golden sets need 4× more clips than image sets. Inter-rater agreement for temporal coherence (~60%) is lower than for image aesthetic quality (~70%), which is already lower than text quality (~85%). At 60% agreement, a set of 200 clips gives a 95% confidence interval of roughly ±7 percentage points on a binary coherence metric - borderline usable. Target 500+ clips for a production-grade video quality eval. Budget proportionally.
> [!tip] Insight
> Three deep dives, not four - latency budget was the constraint. Priority queue design for video follows the same three-lane weighted-fair-share pattern as image-gen (see Image-Gen Design Review, Deep Dive A) with one addition: jobs have a “generation budget” in GPU-seconds at admit time so the scheduler can estimate when capacity will free up. The cost model comparison (SLO vs Cost tradeoffs) covers the queue scheduling math in more depth.
> [!tip] Insight
> Detection-window sensitivity dominates incident cost for video. The 15× cost delta between a 2-minute and 30-minute detection window (from the NaN explosion scenario above) holds across all three incident types. Invest in alarm sensitivity - a per-tier queue-depth alarm that fires within 2 minutes of a breach, a NaN-rate alarm that fires within 1 minute - before investing in faster incident response. The cheapest hour is the one you catch in the first 2 minutes.
## Interview Questions
### ★★★ _(OpenAI, Google)_
**Q:** A Sora generation fails at step 48 of 50, consuming almost full GPU budget with no deliverable. Walk through two structural mitigations and quantify the expected wasted GPU-seconds saved by each.
Answer
Mitigation 1 - text-level pre-filter: adversarial prompts are the dominant source of late-stage failures because they tend to trigger policy violations discovered only after generation completes. A fast text classifier (sub-100 ms, CPU-only) that rejects known-bad patterns before GPU allocation eliminates the spend entirely for that class. At a 2% adversarial traffic rate and 50 QPS, the pre-filter saves ~1 QPS × 50 steps × ~2.4 GPU-s/step = ~120 GPU-seconds per second of traffic - roughly $7/min in saved H100 time at $3.50/hr. Mitigation 2 - step-level checkpointing: saves the intermediate latent tensor every 10 steps. A failure at step 48 restarts from step 40, costing only 8 steps instead of 48 - an 83% reduction in wasted compute for that job. At a GPU fault rate of 0.1% per generation and 20 QPS, checkpointing saves roughly 0.001 × 20 × (48 - 8) steps × 2.4 GPU-s/step ≈ 1.9 GPU-seconds per second of traffic - a smaller saving than pre-filtering, but critical during hardware instability events when fault rates spike to 1–5%.
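The two savings estimates as a script, using the same assumed rates as the answer (2% adversarial share, 0.1% fault rate, ~2.4 GPU-s per step):

```typescript
const gpuSecondsPerStep = 2.4; // assumed cost of one DiT denoising step on an H100
const stepsPerClip = 50;

// Mitigation 1: text pre-filter rejects adversarial prompts before GPU allocation.
const trafficQps = 50;
const adversarialShare = 0.02;
const preFilterSavings =
  trafficQps * adversarialShare * stepsPerClip * gpuSecondsPerStep; // ~120 GPU-s per second of traffic

// Mitigation 2: checkpoint every 10 steps; a step-48 failure restarts from step 40.
const faultRatePerGeneration = 0.001;
const faultTrafficQps = 20;
const stepsSavedPerFault = 48 - 8;
const checkpointSavings =
  faultRatePerGeneration * faultTrafficQps * stepsSavedPerFault * gpuSecondsPerStep; // ~1.9 GPU-s

console.log({ preFilterSavings, checkpointSavings });
```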
### ★★★ _(OpenAI, Anthropic)_
**Q:** Why is a 120-second p99 generation latency for Sora not directly comparable to a 120-second p99 for a long-document LLM response, and how should you design the UX and SLO differently?
Answer
An LLM streaming a 120-second response is delivering tokens continuously - the user sees output within the first 300–500 ms and gets progressive value throughout. Sora produces nothing until all 50 denoising steps complete: the user waits 120 seconds on a progress bar before seeing any output. This makes Sora psychologically closer to a file download than a chat response, which has two architectural implications. First, SLO design: track queue wait and denoising latency separately. A 120 s total that is 5 s queue + 115 s denoising is very different from 90 s queue + 30 s denoising - the latter signals a capacity crisis. Second, UX design: show a real-time denoising preview (a lower-resolution or coarser-step intermediate frame) every 10 steps so users get feedback that work is progressing. This is similar to how DALL-E shows a blurry preview before the final image. The SLO for the preview stream (e.g., first intermediate frame within 15 s) should be tracked separately from the final-clip SLO, because a broken preview pipeline is a user-experience failure even when the final clip succeeds.
### ★★★ _(OpenAI, Meta)_
**Q:** Design the safety stack for a service that generates realistic human faces in video. What are the three hardest failure modes, and how do you detect each before a public incident?
Answer
The three hardest failure modes for face-in-video generation: (1) Celebrity likeness generation – a prompt that does not mention a celebrity by name but uses sufficiently specific descriptors to produce a recognizable likeness. Text-level pre-filters miss this because the violation is in the output, not the input. Detection: a frame-level celebrity-likeness classifier on every generated frame, with a known-celebrities embedding index (perceptual hash + face embedding) built from opt-out databases and updated weekly. Alert threshold: any frame scoring above 0.85 cosine similarity to an indexed celebrity face triggers hold-and-review before delivery. (2) CSAM generation – even non-explicit prompts can produce frames involving minors in ambiguous contexts when combined with adversarial suffixes. Detection: a dedicated CSAM classifier running on every frame as a mandatory post-filter gate – this is non-negotiable, and its false-negative rate must be tracked on a red-team golden set updated monthly. (3) Non-consensual intimate imagery (NCII) – realistic face-swap or de-clothing artifacts can emerge from benign-looking prompts. Detection: a multi-class intimacy classifier that separately scores (a) nudity presence and (b) face-in-frame, and blocks any clip where both are above threshold. Each classifier runs in parallel on sampled frames (every 5th frame for efficiency) with a final pass on the first and last frame of every clip regardless.
### ★★★ _(OpenAI, Google)_
**Q:** The Sora team proposes shipping a free-tier that allows unlimited generations but enforces a 480p resolution cap and a 5-second duration cap. As the infra lead, what do you push back on, and what do you add?
Answer
Push back on “unlimited generations.” Even at 480p and 5 s, each generation runs the full 50-step DiT denoising loop – the cost reduction from resolution and duration limits is roughly 4× (resolution) × 2× (duration) = 8× cheaper than a full 1080p/10s clip, but still on the order of $0.10–0.20 per generation (community estimate). At meaningful free-tier scale (1M users × 5 generations/day = 5M generations/day), that is $500K–$1M/day in raw GPU cost with zero revenue. The right answer is a daily generation credit, not unlimited. What to add: (1) Per-IP and per-account burst limits enforced at the API gateway to prevent batch abuse. (2) A prompt-complexity classifier that estimates generation cost (high-motion scenes are harder than static landscapes) and charges more credits for complex prompts – this caps the adversarial case of a free-tier user maximizing GPU burn with complex prompts. (3) A queue priority tier below paid users: free-tier generations get best-effort throughput and are the first lane shed under capacity pressure – explicit in the ToS so users do not treat free-tier latency as an SLO.
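The free-tier arithmetic is worth writing down, if only to show how fast “unlimited” compounds. A minimal sketch using the community cost estimate quoted above; the credit value is illustrative.
```python
# Sketch: daily raw GPU cost of the proposed free tier, under the assumptions above.
def free_tier_daily_cost(users: int, gens_per_user: float, cost_per_gen_usd: float) -> float:
    return users * gens_per_user * cost_per_gen_usd

# "Unlimited" in practice: 1M users x ~5 generations/day at $0.10-0.20 per capped clip
low = free_tier_daily_cost(1_000_000, 5, 0.10)    # $500K/day
high = free_tier_daily_cost(1_000_000, 5, 0.20)   # $1M/day
print(f"${low:,.0f} - ${high:,.0f} per day, zero revenue")

# Counter-proposal: a daily credit budget bounds worst-case burn per account.
DAILY_CREDITS = 3  # illustrative value, not a product recommendation
print(f"Credit-capped worst case: ${free_tier_daily_cost(1_000_000, DAILY_CREDITS, 0.20):,.0f}/day")
```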
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Case: Design Character.ai"
part: "Design Reviews"
number: 79
emoji: "🎭"
subtitle: "Consumer LLM at scale – MQA, int8, trained-from-scratch, sub-$1/user/month cost floor"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🎭 Case: Design Character.ai
> Consumer LLM at scale – MQA, int8, trained-from-scratch, sub-$1/user/month cost floor
> [!question] Key Question
> 20B tokens served per day on a consumer-priced subscription – how?
← Case: Design Sora | → Compare: RAG Systems
## Key Insights
> [!tip] Insight
> The looser p99 TTFT is a cost-engineering instrument, not a product shortcut. A 2,500 ms p99 vs. ChatGPT Plus's ~800 ms p95 sounds like a worse product. But it directly enables larger batch sizes in the serving engine. At p99 2,500 ms, the scheduler can accumulate requests for up to 1.5 additional seconds before dispatching a batch – increasing average batch size from ~16 to ~48 at 50K QPS. Throughput scales approximately linearly with batch size in the decode phase (per vLLM continuous batching benchmarks, arXiv:2309.06180). The cost per token drops proportionally. This single SLO choice is worth roughly 3x in effective GPU utilization compared to a ChatGPT-Plus-equivalent SLO. Character.ai's consumer positioning enables a cost structure that a premium assistant product cannot access.
> [!tip] Insight
> Why cache hit rate belongs in the eval harness. At 50K QPS and a 60% cache hit rate, only 20K QPS reaches the GPU for full prefill computation. If a deploy drops hit rate from 60% to 30%, effective prefill QPS jumps from 20K to 35K – a 75% increase in GPU prefill load that is not visible in latency metrics during off-peak but blows the cost SLO by end of month. The eval harness catches it in CI before the deploy lands.
> [!tip] Insight
> Original research caveat. Character.ai has not published per-message cost figures. The table above is a reverse-engineered estimate from the publicly disclosed fleet size (~3,000 GPUs, per the cost-engineering blog), published subscription pricing ($10/mo), and reported DAU metrics. All derived values are labeled accordingly. The exercise is useful for interviews because it demonstrates reasoning from first principles to a defensible cost model – not because the exact numbers are correct.
## Code Examples
```python
import torch
import torch.nn.functional as F

def mha_attention(x, Wq, Wk, Wv, num_heads, head_dim):
    """Standard multi-head attention - separate K, V per head."""
    B, T, D = x.shape
    # Project: each of num_heads heads gets its own K and V
    Q = (x @ Wq).view(B, T, num_heads, head_dim).transpose(1, 2)  # (B, H, T, d)
    K = (x @ Wk).view(B, T, num_heads, head_dim).transpose(1, 2)  # (B, H, T, d)
    V = (x @ Wv).view(B, T, num_heads, head_dim).transpose(1, 2)  # (B, H, T, d)
    # KV cache memory: B * num_heads * T * head_dim * 2 bytes * 2 (K+V)
    scale = head_dim ** -0.5
    attn = F.softmax(Q @ K.transpose(-2, -1) * scale, dim=-1)
    return (attn @ V).transpose(1, 2).reshape(B, T, -1)

def mqa_attention(x, Wq, Wk_shared, Wv_shared, num_heads, head_dim):
    """Multi-query attention - single shared K, V for all query heads."""
    B, T, D = x.shape
    Q = (x @ Wq).view(B, T, num_heads, head_dim).transpose(1, 2)  # (B, H, T, d)
    # K and V are shared: only 1 head's worth of K and V stored
    K = (x @ Wk_shared).view(B, T, 1, head_dim).transpose(1, 2)  # (B, 1, T, d)
    V = (x @ Wv_shared).view(B, T, 1, head_dim).transpose(1, 2)  # (B, 1, T, d)
    # KV cache memory: B * 1 * T * head_dim * 2 bytes * 2 (K+V) -> 32x smaller!
    K = K.expand(-1, num_heads, -1, -1)  # broadcast to all query heads at attention time
    V = V.expand(-1, num_heads, -1, -1)
    scale = head_dim ** -0.5
    attn = F.softmax(Q @ K.transpose(-2, -1) * scale, dim=-1)
    return (attn @ V).transpose(1, 2).reshape(B, T, -1)
```
## Interview Questions
### ★★★ _(Google, Meta)_
**Q:** Character.ai serves millions of users chatting with the same popular character. Describe how you would architect prefix caching to exploit this, what the cache hit rate ceiling is, and what breaks the cache.
Answer
A popular character's personality prompt is 4–16K tokens shared across potentially millions of simultaneous conversations. The key insight is that the shared personality prefix is reusable across users, while the per-user dialogue suffix is not. Architecturally, that means: prefill the shared prefix once, hash it, keep the KV cache resident on the sticky serving shard, and route subsequent turns for that dialogue back to the same shard. The token-level ceiling for savings depends on how much of a typical request is shared prefix versus user-specific suffix: with a 4K shared prefix and a 2K user suffix, the shared fraction is 4/(4+2) = 67%. Character.AI's June 2024 inference post reports a much higher 95% fleet-level cache rate because they also reuse inter-turn dialogue prefixes with longest-prefix matching, not just the static character preamble. What breaks the cache: (1) personality prompt version bumps – even a whitespace change invalidates the prefix hash; treat prompt text as a deployment artifact. (2) Loss of shard affinity – once dialogue turns stop landing on the same server, the warm KV state becomes useless. (3) Checkpoint or quantization changes – a serving image update that changes KV layout requires invalidating old cache entries. The important interview move is distinguishing token-level shared-prefix savings from fleet-level query cache rate; they are related, but not the same metric.
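A minimal sketch of the keying and affinity logic described above: hash the exact personality prompt bytes plus the serving-image KV layout into the cache key, and keep a dialogue pinned to the shard that prefilled its prefix. Names, the shard count, and the hashing scheme are illustrative; Character.AI has not published its routing code.
```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def prefix_cache_key(character_prompt: str, prompt_version: str, kv_layout_version: str) -> str:
    """Key the warm prefix KV on exact prompt bytes + serving-image KV layout.
    A whitespace edit, prompt version bump, or quantization change yields a new key."""
    h = hashlib.sha256()
    for part in (prompt_version, kv_layout_version, character_prompt):
        h.update(part.encode())
    return h.hexdigest()

def route_shard(cache_key: str, dialogue_id: str) -> int:
    """Sticky routing: every turn of a dialogue lands on the shard that holds its warm KV."""
    return int(hashlib.sha256(f"{cache_key}:{dialogue_id}".encode()).hexdigest(), 16) % NUM_SHARDS

key = prefix_cache_key("You are Sherlock Holmes. Speak tersely...", "v12", "int8-kv-2024-06")
print(route_shard(key, "dialogue-8421"))
```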
### ★★★ _(Google, Anthropic)_
**Q:** Multi-query attention (MQA) is cited in the Character.ai cost blog as a key memory-saving technique. Explain the mechanism, quantify the KV cache memory reduction versus multi-head attention, and describe what you give up.
Answer
Standard multi-head attention (MHA) keeps separate K and V tensors for every head. For one transformer layer with 32 heads, 4,096 tokens, 128 dims/head, and fp16 KV, the cache size is 2 (K+V) × 32 × 4,096 × 128 × 2 bytes = 67,108,864 bytes, or 64 MiB per layer. MQA (Shazeer, 2019, arXiv:1911.02150) shares K and V across heads, so the same layer drops to 2 MiB – a 32x reduction versus MHA for this geometry. Character.AI's June 2024 inference post says they use MQA in all attention layers and combine it with hybrid attention horizons plus cross-layer KV sharing to reduce KV-cache size by more than 20x without quality regression; that is the public source you should cite rather than reverse-engineering the whole fleet. What you give up is representational flexibility: GQA ablations (Ainslie et al., 2023, arXiv:2305.13245) show that more aggressive KV sharing can trade away some reasoning quality versus full MHA. The interview-safe framing is: MQA is a training-time architectural choice that buys huge memory savings, but you only take it when your product economics care more about batchable long-dialogue serving than squeezing out every last bit of head specialization.
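The 64 MiB vs. 2 MiB arithmetic is easy to reproduce; a short sketch under the geometry stated above (32 heads, 4,096 tokens, 128 dims/head, fp16):
```python
def kv_cache_bytes(num_kv_heads: int, seq_len: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-layer, per-sequence KV cache: K and V, each (kv_heads, seq_len, head_dim)."""
    return 2 * num_kv_heads * seq_len * head_dim * bytes_per_elem

mha = kv_cache_bytes(num_kv_heads=32, seq_len=4096, head_dim=128)  # 67,108,864 B = 64 MiB
mqa = kv_cache_bytes(num_kv_heads=1, seq_len=4096, head_dim=128)   #  2,097,152 B =  2 MiB
print(f"{mha >> 20} MiB vs {mqa >> 20} MiB -> {mha // mqa}x smaller per layer")
```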
### ★★★ _(Google)_
**Q:** You are a Google DeepMind interviewer. Character.ai was acquihired by Google in 2024. The Character.ai team proposes to migrate the serving infrastructure to Google's TPU v5e fleet. What are the top three integration risks, and how do you mitigate each?
Answer
Risk 1: int8 quantization incompatibility. Character.ai's model uses int8 attention matmul and int8 KV cache calibrated for NVIDIA A100/H100 tensor core layouts. TPU v5e uses bfloat16 as its native compute type with limited int8 support in the matrix multiply units. The migration requires either (a) re-calibrating the model in bfloat16 – which likely recovers the ~1–2% quality gap sacrificed for int8 on GPU but costs more memory and thus requires more TPU chips – or (b) implementing custom int8 kernels in JAX/XLA for the specific attention pattern. Risk: either path takes 3–6 months and carries regression risk on persona consistency. Mitigation: run A/B traffic on GPU vs. TPU with identical prompts and track the persona-judge score daily before cutting over more than 5% of traffic. Risk 2: prefix cache architecture mismatch. vLLM-style prefix caching relies on GPU HBM being addressable as a hash table keyed on token hash. TPU memory management under JAX/XLA is less flexible – tensor shapes must be static at compile time. Replicating the dynamic prefix caching behavior requires engineering a custom TPU serving layer (similar to what Google did for PaLM serving). This is solvable but not trivial; budget 6+ months. Risk 3: character-to-shard affinity routing. Character.ai routes conversations to the GPU shard holding the warm KV states for the target character. Google's TPU Borg scheduler is optimized for batch training, not request-affinity routing at LLM serving latency. A custom Borg job configuration or a sidecar routing layer is required. If the routing layer is not ready at migration time, cache hit rate drops to near zero and GPU-equivalent cost increases 2–3x, blowing the economics of the migration.
### ★★★ _(Meta, Anthropic)_
**Q:** Character.ai must enforce safety for minors at consumer scale. A naive keyword filter fails; a full LLM safety judge per message is too slow. Design a tiered safety architecture that hits p99 <2,500 ms TTFT while protecting under-18 users.
Answer
The architecture has three tiers, each gating the next more expensive tier. Tier 1 – sub-millisecond lexical + embedding gate: a pre-trained embedding classifier (BERT-small equivalent, ~12M params, runs in <2 ms on CPU) scores the user message for obvious harm signals and age-specific risk indicators. Hit rate on clear-positive blocks: ~40% of all policy-violating content. Cost: essentially free per request. Tier 2 – 50 ms risk classifier: a fine-tuned 125M-param model specialized for character.ai's taxonomy (NSFW roleplay, self-harm, CSAM adjacent). Runs on GPU in a dedicated safety cluster on the 60% of messages that pass Tier 1. This classifier was trained specifically on roleplay context – generic classifiers trained on social media text dramatically under-perform on fictional framing (e.g., “my character asks how to...” bypasses most off-the-shelf models). Hit rate on remaining violations: ~85%. Tier 3 – post-generation LLM judge: runs after the character model generates a response, on the 5–10% of outputs that produced a high-risk activation in the post-processing hook. This judge has up to 500 ms budget. Age-gating layer: the gateway attaches an age-tier flag (inferred at account creation) to every request. For accounts flagged as under-18 or age-unverified, the Tier 2 classifier threshold is tightened (lower logit threshold for blocking), and the Tier 3 judge runs on a larger sample (20% vs. 5% for adult accounts). The core engineering insight: the expensive safety compute is not flat across all users – it is concentrated on the highest-risk (under-18, unverified) user segment. By tiering the compute and routing only the high-risk segment to the expensive judge, you achieve comparable safety outcomes at 30–40% of the flat-cost alternative.
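A control-flow sketch of the tiered gate. The classifiers are placeholders and the thresholds are the illustrative numbers from this answer, not Character.ai's production settings.
```python
import random
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    age_tier: str  # "adult" | "under_18" | "unverified"

def tier1_embedding_score(text: str) -> float:
    return 0.0  # placeholder for the ~12M-param CPU embedding classifier (<2 ms)

def tier2_risk_score(text: str) -> float:
    return 0.0  # placeholder for the 125M-param roleplay-aware risk classifier (~50 ms, GPU)

def should_block_input(req: Request) -> bool:
    if tier1_embedding_score(req.text) > 0.95:  # Tier 1: clear-positive lexical/embedding hit
        return True
    # Tier 2 only runs on traffic that passed Tier 1; tighter threshold for minors/unverified.
    threshold = 0.50 if req.age_tier != "adult" else 0.70
    return tier2_risk_score(req.text) > threshold

def should_run_tier3_judge(req: Request, high_risk_activation: bool) -> bool:
    # Tier 3 LLM judge (<=500 ms) runs post-generation, sampled more heavily for minors.
    sample_rate = 0.20 if req.age_tier != "adult" else 0.05
    return high_risk_activation or random.random() < sample_rate
```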
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Compare: RAG Systems"
part: "Design Reviews"
number: 80
emoji: "🧮"
subtitle: "Perplexity vs NotebookLM vs ChatGPT-search vs Phind – retriever, grounding, citation side-by-side"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🧮 Compare: RAG Systems
> Perplexity vs NotebookLM vs ChatGPT-search vs Phind – retriever, grounding, citation side-by-side
> [!question] Key Question
> Same question, four systems, four answers – whose retriever wins?
← Case: Design Character.ai | → Compare: SLO ↔ Cost
## Key Insights
> [!tip] Insight
> The hierarchy interviewers test. Citation precision is the “trust surface” – users experience the system through citations, not raw text. Groundedness is the “silent killer” – it degrades without any UI signal until user satisfaction collapses. Freshness is the “loudest failure” – users notice immediately. Rank your attention in that order.
> [!tip] Insight
> Interview trap. Interviewers at Google frequently ask “which system has the best grounding?” expecting you to say Perplexity. The correct answer is NotebookLM – its controlled corpus and explicit document-fidelity objective yield lower estimated hallucination rates (~2–4%) than Perplexity (~5–8%) on document-answerable questions. But NotebookLM has no freshness, so the question is under-specified. Always ask: “On which query class?”
> [!tip] Insight
> The pattern. All four retrievers reflect their corpus constraints. Perplexity owns the corpus – controls freshness and chunking. NotebookLM's corpus is user-defined – small enough to skip chunking. ChatGPT outsources the corpus – trades control for scale. Phind narrows the corpus – trades breadth for depth. In a design interview, the retriever choice is the first question: “What corpus are you serving? Who controls it? What's the freshness requirement?”
> [!tip] Insight
> ChatGPT Search's single point of failure. Bing API downtime is not a degraded state for ChatGPT Search – it is a total retrieval failure. Perplexity can degrade to its vector index if the live-fetch path fails. NotebookLM can brute-force search if Matching Engine degrades. ChatGPT has no fallback corpus. This is the most important architectural difference in the comparison, and Google interviewers regularly probe it.
## Interview Questions
### ★★★ _(Anthropic, Google)_
**Q:** You're designing the eval harness for a new RAG product that competes with Perplexity and NotebookLM. You have 2,000 human-labeled examples. How do you allocate them across eval axes, and what does your offline-to-online correlation strategy look like?
Answer
Allocate by axis risk, not evenly. Suggested split: 600 examples for citation precision (the trust metric – wrong citations destroy the product immediately), 500 for groundedness (LLM-drifts-to-memory is invisible until measured), 400 for freshness accuracy (freshness-sensitive query cohort only), 300 for recall@K (retrieval coverage on head vs. tail queries), 200 for refusal/disclosure behavior on low-evidence queries. Offline-to-online correlation: instrument a 5% production sample for each axis using the same eval logic – track the online-offline gap monthly. If offline groundedness says 92% but online thumbs-down on factual queries says 15%, the gap is real and the eval is not measuring what users experience. Calibrate LLM judges quarterly against a human-labeled subsample (Shankar et al., 2404.12272).
### ★★★ _(Google, OpenAI)_
**Q:** NotebookLM uses Gemini 1.5 Pro with 1M-token context instead of a traditional chunked RAG pipeline. When does this architectural choice hurt, and how would you fix it?
Answer
It hurts in three scenarios: (1) Cost at scale – a 128K-context Gemini call costs significantly more than passing top-5 chunks to a smaller model. At 10K QPS, the per-query cost difference compounds to millions per month. (2) Latency ceiling – long-context inference latency scales roughly linearly with context length; at 500K tokens, TTFT can exceed 5 s even with KV cache. (3) Needle-in-haystack degradation – Gemini's attention is not uniformly strong across 1M tokens; claims from the middle of a large document are under-attended (per Kamradt's NIAH benchmark). Fix: introduce a two-stage retrieval path – semantic search retrieves the top 20 passages, Gemini synthesizes over those 20, keeping context under 50K tokens while preserving the “no explicit re-ranker” property. This cuts cost ~5x at a small quality cost on cross-document synthesis tasks.
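A hedged sketch of the proposed two-stage path: a cheap retrieval pass picks a small passage set, and the long-context model synthesizes only over that set. `semantic_search` and `gemini_generate` are placeholders for whatever index and model client you run; they are not NotebookLM's API.
```python
from typing import List

def semantic_search(query: str, corpus_id: str, top_k: int = 20) -> List[str]:
    return []  # placeholder: embedding/ANN search over the user's uploaded documents

def gemini_generate(prompt: str) -> str:
    return ""  # placeholder: long-context model call

def two_stage_answer(query: str, corpus_id: str, max_context_tokens: int = 50_000) -> str:
    passages = semantic_search(query, corpus_id, top_k=20)
    kept, used = [], 0
    for p in passages:
        est_tokens = len(p) // 4  # rough chars-to-tokens estimate
        if used + est_tokens > max_context_tokens:
            break
        kept.append(p)
        used += est_tokens
    prompt = ("Answer strictly from the passages below; cite the passage you used.\n\n"
              + "\n---\n".join(kept) + f"\n\nQuestion: {query}")
    return gemini_generate(prompt)
```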
### ★★★ _(OpenAI, Anthropic)_
**Q:** Phind's citation rate is lower than Perplexity's on code queries – instead of citing every sentence, it cites at the function level. A product manager wants Phind-style citations. How do you defend or reject this?
Answer
Defend if the content is primarily code, reject if it is primarily prose. The reason: sentence-level citations for code are semantically wrong – a single function spans many sentences and the citation unit is the function, not the sentence. Phind's function-level citations match developer mental models (I want to see which package/file this pattern came from, not which line). Conversely, for prose claims about APIs or behavior, sentence-level is more precise and catches grounding failures at finer granularity. The architectural choice: add a query-type classifier that routes code-heavy queries to function-level citation mode and prose queries to sentence-level. Eval separately – citation precision on code queries and citation precision on prose queries should have separate thresholds.
### ★★★ _(Google, Meta)_
**Q:** A senior interviewer at Google asks: 'Vertex AI Matching Engine vs. HNSW-backed self-hosted ANN – which would you choose for a 50B-passage production RAG system, and why?'
Answer
Vertex AI Matching Engine for a team without dedicated ANN infrastructure expertise; self-hosted HNSW (via Weaviate, Vespa, or Milvus) for a team with retrieval engineers and a need for custom scoring. Trade-offs: Vertex offers managed scaling, SLA-backed availability, and native Google Cloud IAM integration – reducing operational burden but limiting control over the ANN graph construction, quantization settings, and filtering logic. Self-hosted HNSW gives you control over ef_construction, M (max connections per node), and hybrid sparse-dense scoring – critical for retrieval systems that need query-time filtering (e.g., filter by domain, date range, or language) without post-filter recall collapse. At 50B passages, index sharding becomes the primary design problem regardless of backend – plan for 20–50 shards with a scatter-gather query fan-out. The deciding factor is query-time filter complexity: if you need more than 2–3 filter dimensions at ANN time, self-hosted Vespa or Weaviate with native filter support outperforms Vertex's post-filter approach by 30–60% recall at the same latency budget (per Weaviate benchmark, 2023).
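At 50B passages the fan-out itself is the design problem, so here is a minimal scatter-gather sketch: query every shard in parallel, then merge per-shard top-k into a global top-k. The shard client is a placeholder; shard count and k are illustrative.
```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def query_shard(shard_id: int, query_vec: List[float], k: int) -> List[Tuple[float, str]]:
    return []  # placeholder: one shard's ANN search (e.g. an HNSW index behind an RPC)

def scatter_gather(query_vec: List[float], num_shards: int = 32, k: int = 50) -> List[Tuple[float, str]]:
    """Fan out to every shard, then merge per-shard top-k into a global top-k by score."""
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = pool.map(lambda s: query_shard(s, query_vec, k), range(num_shards))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged, key=lambda hit: hit[0])
```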
## Further Reading
- [Perplexity Engineering Blog – How Perplexity Builds Its Products](https://www.perplexity.ai/hub/blog)
Primary source for Perplexity's retrieval architecture, freshness design, and citation strategy. The most candid engineering disclosure from any answer engine.
- [Google NotebookLM – Product Changelog & Architecture Notes](https://notebooklm.google.com/)
Product-level documentation for NotebookLM's Gemini 1.5 Pro long-context approach. Pair with Google I/O 2024 talks on Vertex AI Matching Engine.
- [Phind Engineering Blog – How We Built a Code Search Engine](https://www.phind.com/blog)
Phind's description of their code-specialized retrieval pipeline, domain-weighted re-ranking, and function-level citation design.
- [RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023)](https://arxiv.org/abs/2309.15217)
The evaluation framework for RAG systems – faithfulness, answer relevance, context precision, context recall. The eval metrics used in the cross-system comparison in this module are grounded in RAGAS.
- [Lilian Weng – Retrieval-Augmented Generation for LLMs](https://lilianweng.github.io/posts/2023-10-02-rag/)
The canonical survey of RAG architectures – covers bi-encoders, cross-encoders, fusion-in-decoder, and long-context approaches. Essential background for defending any retrieval design choice.
- [Dense Passage Retrieval for Open-Domain QA (Karpukhin et al., 2020)](https://arxiv.org/abs/2004.04906)
The DPR paper that defined the dual-encoder retrieval baseline. Understanding why DPR works is prerequisite to understanding why every system here extends or departs from it.
- [Shreya Shankar – Who Validates the Validators? Towards LLM-Assisted Evaluation](https://arxiv.org/abs/2404.12272)
The foundational paper for cross-system eval design – explains why LLM-judge calibration is not optional and how to measure judge-to-human agreement across RAG eval axes.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Compare: SLO ↔ Cost"
part: "Design Reviews"
number: 81
emoji: "⚖️"
subtitle: "Interactive sensitivity – slide p99, watch GPU count, $/req, and cache hit-rate move together"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# ⚖️ Compare: SLO ↔ Cost
> Interactive sensitivity – slide p99, watch GPU count, $/req, and cache hit-rate move together
> [!question] Key Question
> Cut p99 latency in half – how much more expensive does it get?
← Compare: RAG Systems | → Compare: Failure-Mode Taxonomy
## Key Insights
> [!tip] Insight
> Measure the right thing. Gil Tene's “How NOT to Measure Latency” (QCon 2015) makes the point explicitly: coordinated omission in latency benchmarks causes p99 to look like p50. If your load generator doesn't account for back-pressure, every latency histogram you publish is a lie. The fix is HDR histograms with coordinated-omission correction – the standard in production SLO tooling since circa 2016.
> [!tip] Insight
> The cross-system comparison. The three sandboxes reveal the cost-SLO slope difference: consumer chat has a moderate slope (high baseline cache saves most from cache hits), search/RAG has a steeper slope (low baseline cache, bigger cache investment payoff), and image gen has a near-vertical p99 slope (no cache benefit, pure latency-cost tradeoff). Designing across all three in a single interview shows range – most candidates only know the chat model.
> [!tip] Insight
> Interviewer trap. “Our p95 is 500 ms, so p99 should be around 600 ms.” This is only true for near-Gaussian distributions. LLM serving latency is heavy-tailed due to variable output length and prefill interference. In practice, p99 is often 3–8× p95 for serving workloads with long-context requests in the batch. Always ask for the histogram, not the point estimate.
> [!tip] Insight
> The cache-hit cost curve bends at 60%. Below 60% cache hit, each 10 pp increase cuts GPU cost roughly linearly. Above 60%, the marginal gain starts to diminish because you're already deflecting most of the cheaply cacheable traffic – the remaining misses are long-tail queries with inherently low reuse. The investment threshold for semantic caching infrastructure is when your query distribution has identifiable clusters (FAQ, support topics, similar intents). If your query distribution is uniform (open-ended chat, creative writing), semantic cache ROI is poor.
## Code Examples
```python
def compute_burn_rate(
    error_count: int,       # errors in window
    total_requests: int,    # requests in window
    slo_target: float,      # e.g. 0.999 for 99.9%
    window_seconds: int,    # observation window (e.g. 3600 = 1h)
    budget_seconds: int = 2_592_000,  # 30-day month
) -> float:
    """
    Burn rate > 1.0 means budget is draining faster than it refills.
    Burn rate > 14.4 means the full monthly budget is exhausted in 2 days.
    Matches the multi-window alerting scheme from the Google SRE Workbook.
    (window_seconds and budget_seconds document the alerting windows; the
    instantaneous ratio below does not depend on them.)
    """
    error_rate = error_count / max(total_requests, 1)
    allowed_error_rate = 1 - slo_target  # 0.001 for 99.9%
    burn_rate = error_rate / allowed_error_rate
    return burn_rate

# Example: 50 errors in 10k requests over 1h, 99.9% SLO
rate = compute_burn_rate(50, 10_000, slo_target=0.999, window_seconds=3600)
print(f"Burn rate: {rate:.2f}x")  # 5.00x - page immediately
```
```python
import math

def gpu_cost_after_slo_tightening(
    baseline_gpus: int,
    baseline_p99_ms: float,
    target_p99_ms: float,
    gpu_hourly_usd: float,
    hours_per_month: float = 730.0,
) -> dict:
    """
    Estimate GPU fleet delta when tightening p99 latency SLO.
    Uses the sqrt(latency) empirical exponent from the transition regime
    between weight-bound and KV-cache-bound decode.
    Cite: Pope et al. 2022 (PaLM inference) + SloCostSandbox empirical fit.
    """
    latency_ratio = baseline_p99_ms / target_p99_ms
    gpu_scale_factor = math.sqrt(latency_ratio)
    new_gpus = math.ceil(baseline_gpus * gpu_scale_factor)
    baseline_monthly = baseline_gpus * gpu_hourly_usd * hours_per_month
    new_monthly = new_gpus * gpu_hourly_usd * hours_per_month
    return {
        "baseline_gpus": baseline_gpus,
        "new_gpus": new_gpus,
        "gpu_scale_factor": round(gpu_scale_factor, 3),
        "baseline_monthly_usd": round(baseline_monthly, 0),
        "new_monthly_usd": round(new_monthly, 0),
        "delta_monthly_usd": round(new_monthly - baseline_monthly, 0),
    }

# Consumer chat: 5,000 H100s at $3.50/hr, p99: 3000ms -> 1500ms
result = gpu_cost_after_slo_tightening(5000, 3000, 1500, 3.50)
print(result)
# {'baseline_gpus': 5000, 'new_gpus': 7072, 'gpu_scale_factor': 1.414,
#  'baseline_monthly_usd': 12775000.0, 'new_monthly_usd': 18068960.0,
#  'delta_monthly_usd': 5293960.0}
# Cutting p99 in half costs +$5.3M/month on a $12.8M baseline - +41%.
```
```python
def mm1_wait_factor(utilization: float) -> float:
    """
    Mean queue wait time as a multiple of mean service time.
    M/M/1 queue formula: rho / (1 - rho).
    Diverges as utilization -> 1.0.
    """
    assert 0 < utilization < 1.0, "Utilization must be in (0, 1)"
    return utilization / (1 - utilization)

for rho in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
    print(f"  rho={rho:.2f}  wait_factor={mm1_wait_factor(rho):.2f}x")
```
## Interview Questions
### ★★★ _(Anthropic, Google)_
**Q:** An interviewer asks: “If you cut p99 latency in half on a consumer chat deployment, how much more expensive does the GPU fleet get?” Walk through the estimate for a 5,000-H100 fleet.
Answer
The √(latency) batch-size rule: halving p99 latency forces batch size to shrink by roughly √2 ≈ 1.41×, so throughput drops by the same factor. To sustain the same QPS, you need √2× more GPUs – approximately a 41% capacity increase. For a fleet of 5,000 H100s at $3.50/hr: baseline monthly burn = 5,000 × $3.50 × 730 = $12.775M. After SLO tightening: 5,000 × 1.41 × $3.50 × 730 ≈ $18.01M – a $5.24M/month increment, or ~41%. The non-obvious piece: the √ exponent comes from the relationship between GPU decode throughput and batch-level memory bandwidth saturation; it is not a linear relationship. Cite the memory-bandwidth-bound decode argument from Pope et al. 2022 (PaLM inference paper) for credibility.
### ★★★ _(OpenAI, Meta)_
**Q:** Your cache hit rate drops from 55% to 20% overnight. How does that change your GPU fleet sizing, and what caused it?
Answer
Effective QPS hitting the GPU path = total QPS × (1 − cache hit rate). At 55% hit: effective QPS = 0.45 × total. At 20%: effective QPS = 0.80 × total. Ratio = 0.80 / 0.45 ≈ 1.78×, so you need ~78% more GPUs to sustain the same p99 SLO. Root causes: (1) system prompt format changed, busting prefix cache keys; (2) a new feature added personalization tokens at the start of the prompt (prefix keys now per-user, not per-product); (3) a rollout changed the prompt template hash; (4) semantic cache TTL expired or was flushed. The correct first diagnostic step is to plot the cache-key distribution – if cache traffic is spreading across 10× more unique keys, it is a prefix-key churn event, not a traffic spike.
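The fleet-ratio arithmetic in one helper (a sketch of the calculation above, nothing more):
```python
def gpu_fleet_ratio(old_hit_rate: float, new_hit_rate: float) -> float:
    """Fleet growth needed when the cache hit rate drops, at constant total QPS and SLO."""
    return (1 - new_hit_rate) / (1 - old_hit_rate)

print(f"{gpu_fleet_ratio(0.55, 0.20):.2f}x")  # 1.78x -> ~78% more GPUs
print(f"{gpu_fleet_ratio(0.60, 0.30):.2f}x")  # 1.75x -> the Character.ai prefill-load example
```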
### ★★★ _(Google, Anthropic)_
**Q:** At 80% GPU utilization, p99 latency is 2.2× p50. At 50% utilization, it is 1.3×. Why, and what is the threshold you should design around?
Answer
Queuing theory: at utilization ρ, mean wait time in an M/M/1 queue scales as ρ / (1 − ρ). At ρ = 0.8: factor = 0.8 / 0.2 = 4. At ρ = 0.5: factor = 0.5 / 0.5 = 1. The tail (p99) is dominated by queuing wait, not service time. The empirical design threshold is ρ ≤ 0.7 for serving workloads where p99 ≤ 2× p50 is the SLO; above 70%, p99 climbs super-linearly and any burst crosses SLO. The Google SRE book codifies this as “error budget consumption accelerates non-linearly above 70% utilization” – it is not an opinion, it is the M/M/1 formula.
### ★★★ _(OpenAI, Google)_
**Q:** You have a system with p99 = 120 s and high variance (image generation). The product team wants a p99 SLA commitment. How do you price and architect it?
Answer
High-variance workloads like image/video gen are fundamentally different from chat: the distribution is multi-modal (fast 30 s generations vs. slow 180 s for complex scenes). Steps: (1) Instrument the full empirical distribution, not just the mean. (2) Offer the SLA on a percentile the system can actually hold – p95 at 150 s is defensible; p99 at 120 s probably requires a 2× GPU buffer. (3) Price the SLA tier to cover the buffer: if p99 requires 40% more fleet headroom, the guaranteed tier price must cover the cost difference. (4) For Sora-class workloads, the cost-optimal architecture separates fast and slow jobs (latency disaggregation): fast jobs run on a smaller dedicated pool, slow jobs fill capacity gaps. Without job-class routing, slow jobs block the fast pool and SLO breaches are correlated.
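A small sketch of step (1): measure the empirical percentiles from raw samples before committing to anything. The bimodal sample data below is synthetic and illustrative.
```python
def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile from raw latency samples (0 < p < 100)."""
    s = sorted(samples)
    idx = min(int(round(p / 100 * (len(s) - 1))), len(s) - 1)
    return s[idx]

# Synthetic bimodal generation latencies: fast simple scenes vs. slow complex scenes.
samples = [30 + i % 20 for i in range(800)] + [150 + i % 60 for i in range(200)]
p95, p99 = percentile(samples, 95), percentile(samples, 99)
print(f"p95={p95}s  p99={p99}s")
# Commit the SLA on the percentile the fleet can hold; the p95-to-p99 gap is a direct
# read on how much dedicated fast-pool headroom a p99 commitment would require.
```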
## Further Reading
- [Gil Tene – How NOT to Measure Latency (QCon 2015)](https://www.youtube.com/watch?v=lJ8ydIuPFeU)
The canonical talk on why averages and even p95 lie, and why p99/p99.9 are the only metrics that capture the user's experience. The HDR histogram argument is mandatory background for SLO design.
- [Dynamo: Amazon's Highly Available Key-value Store (DeCandia et al., SOSP 2007)](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
The paper that defined SLO-driven design at scale. Section 4 on the latency-at-p99.9 requirement and its architectural implications is the playbook this module derives from.
- [Google SRE Book – Chapter 19: Load Balancing at the Frontend](https://sre.google/sre-book/load-balancing-frontend/)
The M/M/1 queueing argument and the 70% utilization cap are made explicit here. The error-budget math in the SLO chapter pairs with this module's queueing deep dive.
- [Lilian Weng – Large Transformer Model Inference Optimization](https://lilianweng.github.io/posts/2023-01-10-inference-optimization/)
The best single reference for how batch size, memory bandwidth, and latency interact at the hardware level – the physical grounding for the √(latency) derivation.
- [Pope et al. – Efficiently Scaling Transformer Inference (Google, 2022)](https://arxiv.org/abs/2211.05102)
First-principles analysis of memory bandwidth vs. compute bottlenecks in large model serving. The paper that grounds the batch-size/latency tradeoff in hardware arithmetic.
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---
---
title: "Compare: Failure-Mode Taxonomy"
part: "Design Reviews"
number: 82
emoji: "🧯"
subtitle: "One master table of every failure mode across 14 real systems – with detect→escalate→rollback playbooks"
tags: ["designreviews", "ml", "ai-engineering", "interview-prep", "transformer"]
---
# 🧯 Compare: Failure-Mode Taxonomy
> One master table of every failure mode across 14 real systems – with detect→escalate→rollback playbooks
> [!question] Key Question
> The 3am page happens – you have 30 seconds to pick the right lever
← Compare: SLO ↔ Cost
## Key Insights
> [!tip] Insight
> The scoring rubric: interviewers are listening for (1) blast radius quantified, (2) how fast you detected it, (3) whether your rollback was principled or lucky, and (4) whether the post-incident action prevents recurrence. A taxonomy gives you a mental checklist to tick off in real time.
> [!tip] Insight
> The dark pattern: post-mortems that produce action items with no owner and no deadline. Every action item must have a DRI and a due date. The taxonomy table is only useful if the “detection” column is wired to a real alert.
> [!tip] Insight
> Interview move: when asked “how do you handle incidents?”, lead with “we separate detection, escalation, and rollback SLOs.” Then quantify each. This signals L6 thinking immediately – most candidates describe a single MTTR without breaking it down.
> [!tip] Insight
> The key insight for interviewers: prompt injection is not a content-moderation problem – it is a trust-boundary problem. The fix is architectural (separate trust levels), not just a better content filter. Candidates who say “add a content filter” as the only mitigation are missing the structural issue.
> [!tip] Insight
> The L6 answer on model-swap risk: “We never skip the canary phase, even under competitive pressure. The canary phase is cheap – it costs 1% of traffic and 24 h. An incident from a rushed model swap costs weeks of MTTR and potentially months of user trust recovery.”
> [!tip] Insight
> The universal opener: regardless of company, start with the blast radius in one sentence, then the detection speed, then the resolution. This hits the primary scoring axis for every company (scale at Google, speed at OpenAI, revenue at Meta) and buys you time to tailor the rest.
## Interview Questions
### ★★★ _(Anthropic, OpenAI)_
**Q:** Walk me through an incident where a model swap caused a quality regression that wasn't caught before rollout. What failed in the process, and what do you change?
Answer
The shadow-eval gap is the root cause. The fix: (1) run a canary eval on the new model against a golden set BEFORE traffic migration; (2) gate on per-cohort pass rates, not just aggregate – a new model can improve average quality while degrading safety-sensitive or edge-case cohorts; (3) route 1% of live traffic through the new model for 24 h before full rollout, with a kill switch on thumbs-down rate > baseline + 3 pp. The non-obvious lesson: most model-swap regressions appear in the latency tail (p99 TTFT), not average quality, because the new model has a different speculative-decode profile. Instrument p95/p99 TTFT separately from average quality in your shadow period.
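A sketch of the two gates described above: a per-cohort golden-set gate before any traffic migration, and a canary kill switch during the 1%/24 h phase. Cohort names and the 1.3× TTFT threshold are illustrative defaults, not anyone's production config.
```python
from typing import Dict

def golden_set_gate(new_pass: Dict[str, float], baseline_pass: Dict[str, float],
                    max_regression_pp: float = 1.0) -> bool:
    """Block rollout if ANY cohort regresses, even when the aggregate improves."""
    return all(baseline_pass[c] - new_pass[c] <= max_regression_pp / 100 for c in baseline_pass)

def canary_kill_switch(thumbs_down: float, baseline_thumbs_down: float,
                       p99_ttft_ms: float, baseline_p99_ttft_ms: float) -> bool:
    """During the 1% / 24 h canary: kill on feedback regression or a latency-tail regression."""
    feedback_breach = thumbs_down > baseline_thumbs_down + 0.03  # +3 pp thumbs-down
    tail_breach = p99_ttft_ms > 1.3 * baseline_p99_ttft_ms       # decode-profile tail regression
    return feedback_breach or tail_breach

baseline = {"safety_sensitive": 0.97, "long_context": 0.94, "aggregate": 0.96}
print(golden_set_gate({"safety_sensitive": 0.95, "long_context": 0.95, "aggregate": 0.97}, baseline))
# False: the aggregate improved, but the safety-sensitive cohort regressed by 2 pp
```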
### ★★★ _(Anthropic, Google)_
**Q:** Your refusal rate in production suddenly spikes to 18%. What do you check first, what is your rollback path, and what do you add post-incident?
Answer
Immediately: (1) check if a classifier config was pushed in the last 30 min – a threshold change or model swap is the most likely cause; (2) check if the spike is correlated with a specific topic cluster (news event, trending query) vs. uniform across all categories – uniform = classifier issue, topic-specific = distribution shift; (3) measure revenue impact: at 18% refusal, every 10 min = ~X% of daily active users hitting a wall, price it immediately for incident severity. Rollback path: classifier config has a feature flag – revert to previous config in < 5 min. Post-incident: add a canary that runs the classifier on a fixed 200-sample distribution probe every 5 min and pages on deviation > 2 pp.
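A sketch of that post-incident canary: replay a frozen 200-sample probe set through the classifier every 5 minutes and page when the refusal rate drifts from the known-good baseline. Names and thresholds mirror the answer and are illustrative.
```python
from typing import Callable, List

def probe_refusal_rate(classifier: Callable[[str], bool], probe_set: List[str]) -> float:
    """Refusal rate on a frozen probe distribution - any drift here is a classifier/config
    change, not a shift in user traffic."""
    return sum(1 for q in probe_set if classifier(q)) / len(probe_set)

def should_page(current_rate: float, baseline_rate: float, max_drift_pp: float = 2.0) -> bool:
    return abs(current_rate - baseline_rate) * 100 > max_drift_pp

# Run from a scheduler every 5 minutes; re-pin baseline_rate after each vetted config push.
```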
### ★★★ _(Google, Meta, Anthropic)_
**Q:** Describe the difference between how Google and Anthropic interviewers ask about production incidents in the behavioral round. How does your answer change?
Answer
Google (L6 SWE/MLE): wants the STAR format with emphasis on SCOPE (how many users affected), SPEED (how fast did you detect and mitigate), and SYSTEMIC FIX (what monitoring did you add). They prize quantitative blast radius. Anthropic: wants you to surface the reasoning behind your safety trade-offs – specifically, what did you do when the right call was ambiguous? They care about the principle you applied, not just the outcome. Meta: wants the business impact number immediately, then the technical root cause – revenue first, architecture second. Answer template: Lead with blast radius (N users, $X revenue at risk), then detection speed, then root cause, then the durable fix that made the post-mortem unnecessary to repeat.
### ★★★ _(Anthropic, OpenAI)_
**Q:** A prompt-injection attack is discovered in your RAG pipeline: a retrieved document contains instructions that override the system prompt. What are your defense layers?
Answer
Defense in depth with four layers: (1) Input sanitization – strip known injection patterns before retrieval (<system>, IGNORE PREVIOUS, etc.); (2) Retrieval-path trust – treat retrieved documents as untrusted user input, never as system-level context; use a separate system prompt section that is not part of the retrieved context window; (3) Output monitoring – safety classifier on the output looks for instruction leakage signals (e.g., the model repeating back injected instructions verbatim); (4) Rate limiting on semantic similarity to known injections – embed the query against a library of known injection patterns and block above a cosine similarity threshold. Real-world example: Simon Willison documented the Bing Chat indirect injection in Feb 2023 where a malicious web page caused Bing to reveal its system prompt via a retrieved context window.
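A sketch of layers (1) and (4): lexical sanitization of retrieved text plus an embedding-similarity gate against a library of known injections. The regexes, the threshold, and the `embed` placeholder are illustrative assumptions; layer (2) is an architectural boundary rather than code.
```python
import math
import re
from typing import List

INJECTION_PATTERNS = [
    r"(?i)ignore (all )?previous instructions",
    r"(?i)<\s*/?\s*system\s*>",
    r"(?i)you are now .{0,40}",
]

def sanitize_retrieved(doc: str) -> str:
    """Layer 1: strip known injection phrasings from retrieved documents before prompting."""
    for pat in INJECTION_PATTERNS:
        doc = re.sub(pat, "[removed]", doc)
    return doc

def embed(text: str) -> List[float]:
    return [0.0]  # placeholder for a real sentence-embedding model

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def matches_known_injection(text: str, known_vecs: List[List[float]], threshold: float = 0.85) -> bool:
    """Layer 4: score the query (or retrieved content) against known injection embeddings."""
    v = embed(text)
    return any(cosine(v, k) >= threshold for k in known_vecs)
```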
## Related
The Design Doc · Cost Accounting & Eval-Driven Design · Case: Design ChatGPT · Case: Design Perplexity · Case: Design Claude Code / Cursor
---