Transformer Math

Module 61 · AI Engineering

🔮 Speculative Execution

While you're still typing, a speculative agent already searched the codebase for you

While you're typing your next message, the agent is already working. A speculative agent runs in the background, predicting what you'll ask next and pre-computing the answer. If the prediction is right, the result appears instantly. If it is wrong, the work is silently discarded, with no harm done.

  • Writes go to an overlay filesystem and never touch real files until accepted
  • Only safe tools allowed: Read, Glob, Grep, TaskGet, TaskList (no writes, no Bash)
  • State machine: idle → running → accepted/rejected
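
The state machine in the last bullet can be sketched as a transition table. This is a minimal illustration, not the actual implementation: the names `SpecState`, `TRANSITIONS`, and `transition` are hypothetical, and the return to idle after a verdict is an assumption.

```typescript
type SpecState = "idle" | "running" | "accepted" | "rejected";

// Allowed moves. Only idle → running → accepted/rejected comes from the text;
// the return to "idle" after a verdict is an assumption of this sketch.
const TRANSITIONS: Record<SpecState, SpecState[]> = {
  idle: ["running"],
  running: ["accepted", "rejected"],
  accepted: ["idle"],
  rejected: ["idle"],
};

// Returns the next state, or throws if the move is not allowed.
function transition(from: SpecState, to: SpecState): SpecState {
  if (TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}

let state: SpecState = "idle";
state = transition(state, "running");  // speculation starts
state = transition(state, "accepted"); // the user's message aligned
```

Making illegal moves throw (for example, idle straight to accepted) means a result can never be merged unless a speculation was actually running.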
🎮

Speculative Execution Flow

What you are seeing

The lifecycle of a speculative execution: the agent predicts the user's next action, runs it with an overlay filesystem and restricted tools, then either merges the result or discards it based on the user's actual message.

What to try

Follow the two paths: what happens when speculation aligns with user intent (accept + merge) vs when it diverges (reject + discard). Notice how the overlay FS makes both paths safe.

# Speculative execution lifecycle

User finishes turn → agent idle

Suppression check: last turn expensive? → YES → speculate

1. Predict next action from conversation context

2. Create overlay FS (copy-on-write layer)

3. Filter tools → [Read, Glob, Grep, TaskGet, TaskList]

4. Run speculative agent with overlay + safe tools

# User sends next message

Aligns with speculation? → YES → ACCEPT

→ merge overlay to real FS

→ skip redundant work, show cached result

Diverges from speculation? → YES → REJECT

→ discard overlay (no harm done)

→ run user's actual request normally

💡

The Intuition

How It Works in Practice

While you are thinking about what to type next, the agent is already working. It predicts your next request and runs a speculative search using an overlay filesystem — writes go to a temp layer, not real files.

  • Prediction matches what you actually ask → instant results, overlay merges into real FS
  • Prediction diverges → discard the overlay, no harm done, run normally

Like CPU branch prediction, but for coding tasks. The overlay filesystem makes the bet fully reversible.

The CPU Analogy

This is the same pattern CPUs use for branch prediction: predict which branch the code will take, execute speculatively, then commit the result if the prediction was right or flush the pipeline if wrong. The agent predicts the user's "branch" (their next request), executes speculatively with the overlay FS as its pipeline, and either commits (merge) or flushes (discard).

💡 Tip · The overlay filesystem is the key safety mechanism. Like Docker's layered filesystem, reads fall through to the real FS while writes go to a temporary layer. This is an application-managed copy-on-write abstraction: merge copies changed files back to the real FS, discard deletes the temp layer. Both operations are lightweight (proportional to files changed, not total codebase size).

Tool Filtering

The speculative agent only gets 5 tools: Read, Glob, Grep, TaskGet, and TaskList. All read-only, all side-effect-free. Even if the speculative agent hallucinates a dangerous action, it literally cannot execute it — the tool isn't available. This is defense in depth: the overlay FS protects against bad writes, and tool filtering prevents writes from being attempted at all.
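
A minimal sketch of allowlist-based tool filtering follows. The `Tool` shape and the helper names `selectSafeTools` and `callTool` are hypothetical, chosen for illustration; a real agent's tool registry will look different.

```typescript
// Hypothetical tool shape for illustration only.
interface Tool {
  name: string;
  run: (input: string) => string;
}

const SAFE_TOOL_NAMES = new Set(["Read", "Glob", "Grep", "TaskGet", "TaskList"]);

// Allowlist rather than denylist: a tool added to the registry later is
// excluded from speculation unless it is explicitly named here.
function selectSafeTools(allTools: Tool[]): Tool[] {
  return allTools.filter((tool) => SAFE_TOOL_NAMES.has(tool.name));
}

// Dispatch only through the filtered list, so a call to an unsafe tool
// fails at lookup instead of executing.
function callTool(tools: Tool[], name: string, input: string): string {
  const tool = tools.find((t) => t.name === name);
  if (!tool) throw new Error(`tool not available: ${name}`);
  return tool.run(input);
}

const registry: Tool[] = [
  { name: "Read", run: (path) => `contents of ${path}` },
  { name: "Grep", run: (pattern) => `matches for ${pattern}` },
  { name: "Write", run: () => "wrote file" },
  { name: "Bash", run: () => "ran command" },
];

const speculativeTools = selectSafeTools(registry); // keeps Read and Grep only
```

With this setup, `callTool(speculativeTools, "Bash", "rm -rf build")` throws before anything runs, which is the "tool isn't available" behavior described above.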

Suppression Heuristics

Speculation isn't free: each run costs an API call. The executor suppresses speculation when the last turn was cheap (a simple question, no tools), when the last tool use was read-only (nothing to follow up on), or when the estimated cost would exceed a threshold. In practice, speculation is suppressed on ~60% of turns.
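
The three suppression rules can be sketched as a single decision function. Everything named here (`TurnSummary`, `SuppressionConfig`, `decideSpeculation`, the `Edit` tool, and the dollar thresholds) is an assumption made for illustration; the source does not specify units or values.

```typescript
// Hypothetical summary of the previous turn; field names are illustrative.
interface TurnSummary {
  costUsd: number;              // what the last turn cost
  toolsUsed: string[];          // tools called during the last turn
  lastToolWasReadOnly: boolean; // true if the final tool call had no side effects
}

interface SuppressionConfig {
  minLastTurnCostUsd: number;    // below this, the turn counts as cheap
  maxSpeculationCostUsd: number; // budget ceiling for one speculative run
}

type Decision = { speculate: true } | { speculate: false; reason: string };

function decideSpeculation(
  turn: TurnSummary,
  estimatedCostUsd: number,
  cfg: SuppressionConfig,
): Decision {
  if (turn.toolsUsed.length === 0 || turn.costUsd < cfg.minLastTurnCostUsd) {
    return { speculate: false, reason: "last turn was cheap" };
  }
  if (turn.lastToolWasReadOnly) {
    return { speculate: false, reason: "last tool use was read-only" };
  }
  if (estimatedCostUsd > cfg.maxSpeculationCostUsd) {
    return { speculate: false, reason: "estimated cost exceeds threshold" };
  }
  return { speculate: true };
}

// Illustrative values only.
const exampleConfig: SuppressionConfig = { minLastTurnCostUsd: 0.01, maxSpeculationCostUsd: 0.05 };

// A turn that ended with a file edit and a modest cost estimate: worth speculating.
const afterEdit = decideSpeculation(
  { costUsd: 0.03, toolsUsed: ["Read", "Edit"], lastToolWasReadOnly: false },
  0.02,
  exampleConfig,
);
```

Returning a reason alongside the verdict makes it easy to log why speculation was skipped, which helps when tuning the thresholds.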

Prompt Suggestions

Beyond full speculative execution, the system can generate prompt suggestions — predictions of what the user will ask next, shown as clickable options. These are cheaper than full speculation (just a prediction, no execution) and help the user articulate their intent faster.

✨ Insight · The accept/reject decision compares the user's actual message against the speculation's predicted intent — not exact string matching, but semantic alignment. "Fix the bug" and "Can you fix that error?" both align with a speculation that investigated the error.

How Semantic Alignment Works

The accept/reject decision cannot use exact string matching, because users rephrase the same intent in many ways. Instead, the speculative executor records a predicted intent label (e.g., "investigate the TypeError in auth/login.ts") alongside the result. When the user's message arrives, the system asks the model to classify whether the message aligns with that label; this is a cheap, single-turn call with no tool access. The classification prompt is short and templated: "Does '{userMessage}' ask for '{predictedIntent}'? Answer YES or NO." Because the model is asked a binary question with forced-choice output, the latency is minimal (~100ms). A YES triggers the overlay merge; a NO triggers discard and normal execution. This is a verifier pattern, similar in spirit to using a model as a judge: a second, cheaper model call checks the first call's prediction.
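
The alignment check can be sketched as a prompt builder plus a strict verdict parser. The model call is injected as a `Classifier` function so the sketch does not assume any particular model API; `Classifier`, `buildAlignmentPrompt`, `parseVerdict`, and `alignsWithIntent` are hypothetical names.

```typescript
// Any function that sends a prompt to a model and resolves with its reply text.
// Injected so this sketch stays independent of a specific model API.
type Classifier = (prompt: string) => Promise<string>;

// Fills the template quoted in the text above.
function buildAlignmentPrompt(userMessage: string, predictedIntent: string): string {
  return `Does '${userMessage}' ask for '${predictedIntent}'? Answer YES or NO.`;
}

// Only a reply that starts with the word "yes" counts as alignment.
// Everything else, including ambiguous replies, is treated as NO.
function parseVerdict(reply: string): boolean {
  return /^yes\b/i.test(reply.trim());
}

async function alignsWithIntent(
  userMessage: string,
  predictedIntent: string,
  classify: Classifier,
): Promise<boolean> {
  const reply = await classify(buildAlignmentPrompt(userMessage, predictedIntent));
  return parseVerdict(reply);
}

// Stub classifier standing in for a real model call.
const alwaysYes: Classifier = async () => "YES";
const verdict: Promise<boolean> = alignsWithIntent(
  "Can you fix that error?",
  "investigate the TypeError in auth/login.ts",
  alwaysYes,
);
```

Defaulting to NO on anything unclear is a conservative choice assumed here: a wrong rejection only costs the shortcut, while a wrong acceptance would present work the user did not ask for.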

Partial Speculation — Pre-Reading Without Pre-Writing

Even when full speculation is suppressed (the previous turn was cheap, or the task is unpredictable), the system can do partial speculation: pre-read files the user is likely to ask about next. If the user just edited src/auth/login.ts, the system pre-reads src/auth/middleware.ts and the test file. These reads are cheap (no API call, just disk I/O) and populate the tool result cache. When the user does ask about those files, the agent can skip the Read tool call entirely — the content is already in memory. This is a lower-risk form of speculation: no writes, no overlay, no alignment check needed — just prefetching.
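
A minimal sketch of the prefetching idea follows. The file reader is injected so the example runs without a real disk, and the "same directory or matching test file" heuristic in `guessRelatedPaths` is an assumption made for illustration; the source does not say how related files are chosen. `PrefetchCache` and `ReadFile` are hypothetical names.

```typescript
// Injected reader so the sketch does not depend on a real disk.
type ReadFile = (path: string) => string | undefined;

class PrefetchCache {
  private cache = new Map<string, string>();
  constructor(private readFile: ReadFile) {}

  // Warm the cache with files the user is likely to ask about next.
  prefetch(paths: string[]): void {
    for (const path of paths) {
      if (this.cache.has(path)) continue;
      const content = this.readFile(path);
      if (content !== undefined) this.cache.set(path, content);
    }
  }

  // Serve from memory when possible; otherwise fall back to a normal read.
  read(path: string): { content: string | undefined; cacheHit: boolean } {
    const cached = this.cache.get(path);
    if (cached !== undefined) return { content: cached, cacheHit: true };
    return { content: this.readFile(path), cacheHit: false };
  }

  // Drop an entry once the file changes, so stale content is never served.
  invalidate(path: string): void {
    this.cache.delete(path);
  }
}

// Assumed heuristic: files in the same directory, plus "<stem>.test.*" anywhere.
function guessRelatedPaths(editedPath: string, knownPaths: string[]): string[] {
  const slash = editedPath.lastIndexOf("/");
  const dir = editedPath.slice(0, slash + 1);
  const stem = editedPath.slice(slash + 1).replace(/\.[^.]+$/, "");
  return knownPaths.filter((p) => {
    if (p === editedPath) return false;
    const inSameDir = p.indexOf(dir) === 0 && p.indexOf("/", dir.length) === -1;
    const isTestFile = p.indexOf(`${stem}.test.`) !== -1;
    return inSameDir || isTestFile;
  });
}

// In-memory stand-in for the project files.
const disk: Record<string, string> = {
  "src/auth/login.ts": "login code",
  "src/auth/middleware.ts": "middleware code",
  "tests/auth/login.test.ts": "login tests",
  "src/db/pool.ts": "pool code",
};
let diskReads = 0;
const prefetchCache = new PrefetchCache((p) => {
  diskReads += 1;
  return disk[p];
});

const related = guessRelatedPaths("src/auth/login.ts", Object.keys(disk));
prefetchCache.prefetch(related);
const middleware = prefetchCache.read("src/auth/middleware.ts"); // served from memory
```

The `invalidate` method matters in practice: a prefetched file that is edited afterwards must be dropped from the cache, or the agent would answer from stale content.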

Quick Check

Why does the speculative agent use an overlay filesystem?

📐

Key Code Patterns

Speculative Executor (TypeScript pseudocode)

typescript
// Pseudocode: Conversation, QueryEngine, OverlayFileSystem, predictNextAction,
// filterTools, and alignsWithSpeculation are assumed to be defined elsewhere.
const SpeculationState = {
  IDLE: "idle",
  RUNNING: "running",
  ACCEPTED: "accepted",
  REJECTED: "rejected",
} as const;

type SpeculationStateValue = typeof SpeculationState[keyof typeof SpeculationState];

class SpeculativeExecutor {
  private state: SpeculationStateValue = SpeculationState.IDLE;
  private overlayFs: OverlayFileSystem = new OverlayFileSystem();
  private safeTools: string[] = ["Read", "Glob", "Grep", "TaskGet", "TaskList"];
  private result: unknown = null;

  // costThreshold: turns cheaper than this are not worth speculating on
  constructor(private readonly costThreshold: number) {}

  // Run speculative work in background
  async speculate(conversation: Conversation): Promise<void> {
    if (this.shouldSuppress(conversation)) return;

    this.state = SpeculationState.RUNNING;
    this.result = null; // clear any result left over from a previous run

    // Predict next steps
    const prediction = await predictNextAction(conversation);

    // Run with overlay FS and restricted tools
    const engine = new QueryEngine({
      tools: filterTools(this.safeTools),
      filesystem: this.overlayFs, // writes go to overlay
    });
    this.result = await engine.submit(prediction);
  }

  // Check if speculation matches user intent
  onUserMessage(message: string): void {
    if (this.state !== SpeculationState.RUNNING) return;

    // Assumption: a speculation that has not finished yet counts as a miss.
    const finished = this.result !== null;
    if (finished && alignsWithSpeculation(message, this.result)) {
      this.state = SpeculationState.ACCEPTED;
      this.overlayFs.mergeToReal(); // apply cached work
    } else {
      this.state = SpeculationState.REJECTED;
      this.overlayFs.discard(); // throw away, no harm
    }
  }

  // Don't speculate if it's not worth it
  private shouldSuppress(conversation: Conversation): boolean {
    if (conversation.lastTurnCost < this.costThreshold) return true; // cheap turn
    if (conversation.lastToolWasReadOnly) return true; // nothing to speculate
    return false;
  }
}

Overlay Filesystem (Copy-on-Write)

typescript
import * as fs from "node:fs";

// Copy-on-write filesystem: reads from real, writes to temp
class OverlayFileSystem {
  private overlay: Map<string, string> = new Map(); // path -> pending content

  read(path: string): string {
    // Overlay wins; otherwise fall through to the real file
    const pending = this.overlay.get(path);
    if (pending !== undefined) return pending;
    return fs.readFileSync(path, "utf8");
  }

  write(path: string, content: string): void {
    this.overlay.set(path, content); // never touches real FS
  }

  mergeToReal(): void {
    for (const [path, content] of this.overlay) {
      fs.writeFileSync(path, content);
    }
    this.overlay.clear(); // avoid re-applying the same writes on a later merge
  }

  discard(): void {
    this.overlay.clear();
  }
}
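
The class above depends on a real filesystem. To see fall-through reads, discard, and merge end to end, the self-contained variant below swaps the real filesystem for an in-memory `Map`. That swap, the class name `InMemoryOverlay`, and the returned file count are assumptions made so the sketch runs anywhere.

```typescript
// Stand-in for the real filesystem.
const realFiles = new Map<string, string>([
  ["a.ts", "original a"],
  ["b.ts", "original b"],
]);

class InMemoryOverlay {
  private overlay = new Map<string, string>();
  constructor(private base: Map<string, string>) {}

  // Overlay wins; otherwise the read falls through to the base layer.
  read(path: string): string | undefined {
    return this.overlay.has(path) ? this.overlay.get(path) : this.base.get(path);
  }

  // Writes only ever land in the overlay.
  write(path: string, content: string): void {
    this.overlay.set(path, content);
  }

  // Copies only the changed files, so the cost scales with the overlay size,
  // not with the size of the base layer. Returns how many files were copied.
  mergeToBase(): number {
    const changed = this.overlay.size;
    this.overlay.forEach((content, path) => this.base.set(path, content));
    this.overlay.clear();
    return changed;
  }

  discard(): void {
    this.overlay.clear();
  }
}

// Rejected path: the write is visible inside the overlay, then vanishes.
const rejectedRun = new InMemoryOverlay(realFiles);
rejectedRun.write("a.ts", "speculative a");
const seenInsideOverlay = rejectedRun.read("a.ts"); // the speculative content
rejectedRun.discard();
const baseAfterDiscard = realFiles.get("a.ts");     // base layer is unchanged

// Accepted path: the merge copies the single changed file into the base.
const acceptedRun = new InMemoryOverlay(realFiles);
acceptedRun.write("b.ts", "speculative b");
const filesCopied = acceptedRun.mergeToBase();
```

Until `mergeToBase` runs, nothing outside the overlay can observe a speculative write, which is what makes the reject path free of side effects.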
🔧

Break It — See What Happens

No overlay (speculate directly on real FS)
No tool filtering (full tool access for speculation)
📊

Real-World Numbers

Metric | Value
Safe tools allowed | 5 (Read, Glob, Grep, TaskGet, TaskList)
Suppression rate | ~60% of turns (cost/relevance thresholds)
Overlay FS read latency | <1ms overhead per read
Merge/discard cost | Near-instant; proportional to files changed (file copy or dir delete)
Acceptance rate | Varies by task type
✨ Insight · Speculation is suppressed ~60% of the time because most turns are either cheap (not worth speculating) or read-only (nothing actionable to predict). The 40% of turns where speculation runs tend to be high-value: after file edits, after bug fixes, after complex tool chains — exactly the moments when pre-computation saves the most time.
🧠

Key Takeaways

What to remember for interviews

  1. Speculative execution predicts the user's next request and pre-computes the answer while they type. If correct, results appear instantly; if wrong, the work is silently discarded.
  2. An overlay filesystem (copy-on-write) makes speculation safe: reads fall through to the real FS, writes go to a temp layer that is either merged (accept) or deleted (reject).
  3. Tool filtering is defense in depth: the speculative agent only gets 5 read-only tools (Read, Glob, Grep, TaskGet, TaskList), so it cannot execute writes even if it tries.
  4. Speculation is suppressed on ~60% of turns via heuristics: skip if the last turn was cheap, the last tool use was read-only, or the estimated cost exceeds a threshold.
  5. Accept/reject uses semantic alignment, not string matching: a cheap binary classification call checks whether the user's message aligns with the predicted intent label.
🎯

Interview Questions

Design a speculative execution system for an AI agent. How do you ensure safety?

★★★
Anthropic, Google

What's the difference between an overlay filesystem and a git worktree for isolation?

★★☆
Meta

How would you decide when speculation is worth the compute cost?

★★☆
OpenAI, Databricks

What verification strategy prevents speculative execution from committing side effects that the user hasn't approved?

★★★
Anthropic