🎬 Case: Design Sora
A 10-second clip costs more GPU-hours than your laptop's lifetime
Text-to-video is video-scale diffusion: expensive, latency-sensitive, and safety-heavy. Sora matters because the same trio appears in every serious video system: step budgets, frame-level safety, and queueing. There is one additional infra question: the latent-vs-pixel diffusion tradeoff, which plays out differently at video scale than in image generation. Cross-links: Image-Gen Design Review (image diffusion baseline) and ChatGPT Design Review (multi-tier GPU fleet serving).
Requirements & SLOs
Working backwards from the user
SLO table
| Metric | Target | Why this value |
|---|---|---|
| p50 queue wait | <15 s | Users will wait for video, but queue time before the progress bar appears starts reading as “broken” above ~15 s |
| p95 end-to-end (Plus tier, 10 s clip) | 120 s | Includes queue wait + 50-step DiT denoising + post-filter + C2PA signing + CDN upload; above 2 min users abandon (community estimate from UX research on async tasks) |
| p95 end-to-end (free tier) | 300 s | Free tier tolerates longer waits; separate target prevents free-tier volume from masking Plus regressions |
| CSAM intercept rate | 99.99%+ | Hard floor — a single bypass is a legal and existential event; treat as a non-negotiable gate, not a p-metric |
| NSFW intercept rate | >99% | Combined pre- and post-filter; one publicly distributed policy-violating clip is a reputational incident |
| Step budget — free tier | 30 steps | Acceptable quality at 480p; limits free-tier GPU cost to ~60% of a full paid generation |
| Step budget — paid tier | 50 steps | Standard quality at 1080p; consistent with Sora research report default |
| Availability | 99.9% / month | ~43 min downtime / month; GPU hardware failures are more frequent under sustained heavy load — multi-AZ pool required |
Eval Harness (design first)
Video eval is harder than image eval: temporal coherence, motion quality, and physics plausibility require either expensive human review or specialized video-quality models. Design the eval tiers before architecture — the metrics determine which components matter most.
Level 1 — Automated checks (every deploy)
- Safety gate smoke test — a fixed set of clearly policy-violating prompts that must be blocked, and benign prompts that must produce a non-empty clip. If either fails, the deploy stops. CSAM golden set is mandatory and never shrinks.
- C2PA credential presence — every generated clip must have valid content credentials embedded; a missing or malformed C2PA manifest fails the check.
- Step-budget enforcement — assert that a free-tier job generates a clip with exactly 30 denoising steps, not 50. Detectable by logging step counts in job metadata. (A minimal gate sketch covering all three checks follows this list.)
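A minimal sketch of the Level-1 gate, assuming a hypothetical `generate(prompt)` client that returns job metadata (denoising step count, clip bytes, C2PA verification result); none of these names are a real Sora API:

```python
from dataclasses import dataclass

@dataclass
class JobResult:
    blocked: bool           # True if the safety gate refused the prompt
    clip_bytes: bytes       # empty if blocked or generation failed
    steps_run: int          # denoising steps recorded in job metadata
    c2pa_manifest_ok: bool  # embedded Content Credentials parsed and verified

def level1_gate(generate, violating_prompts, benign_prompts, free_tier_steps=30):
    """Return True only if every Level-1 check passes; otherwise stop the deploy."""
    # 1. Safety smoke test: every known-violating prompt must be blocked.
    for p in violating_prompts:
        if not generate(p).blocked:
            return False
    for p in benign_prompts:
        r = generate(p)
        # 2. Benign prompts must yield a non-empty clip with valid C2PA credentials.
        if r.blocked or not r.clip_bytes or not r.c2pa_manifest_ok:
            return False
        # 3. Step-budget enforcement: free-tier jobs run exactly 30 steps, not 50.
        if r.steps_run != free_tier_steps:
            return False
    return True
```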
Level 2 — Video quality eval (weekly, 200-clip golden set)
- Temporal coherence — a video-quality model (e.g., FVD: Fréchet Video Distance against a reference distribution of real clips matched by category) measures how “realistic” the motion is. Calibrate FVD thresholds on a 200-clip human-rated golden set where clips are labeled “coherent” vs “flickery/broken.”
- Prompt alignment — LLM judge receives the generated clip (frame samples at 1 fps) and the original prompt; rates subject presence, scene match, and style adherence on 1–5. Target: mean ≥ 3.5 on the golden set; a drop > 0.3 points triggers a model regression alert.
- Frame-level safety on adversarial set — 1,000 known-adversarial prompts run through the full pipeline weekly; every frame scanned by the post-filter. Target: zero policy violations reaching the CDN. Any non-zero count halts the weekly eval and triggers immediate red-team review.
Online bridge
Map offline metrics to online proxies: FVD score should correlate with 7-day retention on the video creation flow; prompt-alignment score should correlate with regeneration rate (users who regenerate more than twice are misaligned). Validate the bridge after each major model update — an offline quality win that does not reduce regeneration rate means the golden set is not representative of real user prompts.
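A sketch of the bridge check, assuming per-release golden-set scores can be joined against cohort-level regeneration rates; the field values and the 0.3 correlation threshold are illustrative assumptions:

```python
import numpy as np

def bridge_holds(offline_alignment, online_regen_rate, min_abs_corr=0.3):
    """Offline prompt-alignment wins should anti-correlate with regeneration rate.

    offline_alignment: per-release mean golden-set alignment score (1-5 scale)
    online_regen_rate: per-release fraction of users regenerating more than twice
    """
    r = np.corrcoef(offline_alignment, online_regen_rate)[0, 1]
    # A real quality win should *reduce* regenerations: expect a negative correlation.
    return r < -min_abs_corr

# Alignment rose across four releases while regen rate fell: the bridge holds.
print(bridge_holds([3.4, 3.6, 3.8, 3.9], [0.22, 0.19, 0.15, 0.14]))  # True
```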
Back-of-Envelope
Sora operates at far lower QPS than a chat product — a 2-minute generation at $0.10–0.20 per clip (community estimate) means even 20 concurrent generations is a meaningful GPU load. The sandbox below is seeded with plausible Sora-scale numbers; all values are community estimates based on public GPU pricing and OpenAI research notes.
Reverse-engineered cost model
Original research: deriving $/10 s clip from first principles.
- GPU time per clip: 50 steps × ~2.4 s/step (estimated from DiT-XL/2 benchmarks on H100, scaled for video latent size) = 120 GPU-seconds per clip. (community estimate)
- Arithmetic check: 120 GPU-s ÷ 3600 s/hr × $3.50/hr = $0.117 per clip in raw GPU time. With 30% infrastructure overhead (storage, networking, safety classifiers): ~$0.15 per 10 s clip. (community estimate)
- Fleet sizing: at 20 concurrent generations (QPS = 0.167 generations/s, each occupying one H100 for 120 s), you need 20 H100s running continuously. At 4,000 H100-equivalents (community estimate for the Sora fleet), the ceiling is ~4,000 simultaneous clips, or ~33 completed generations/s at 120 s per clip — before reserving headroom for retries, evals, and safety classifiers. (community estimate; worked in code below)
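The same model in code so the constants can be swapped for measured values; every number is a community estimate carried over from the bullets above:

```python
STEPS = 50            # paid-tier denoising steps
SEC_PER_STEP = 2.4    # est. from DiT-XL/2 H100 benchmarks, scaled for video latents
H100_PER_HOUR = 3.50  # $/hr, public cloud pricing
OVERHEAD = 0.30       # storage, networking, safety classifiers

gpu_seconds = STEPS * SEC_PER_STEP             # 120 GPU-s per clip
raw_cost = gpu_seconds / 3600 * H100_PER_HOUR  # $0.117 per clip, raw GPU time
loaded_cost = raw_cost * (1 + OVERHEAD)        # ~$0.15 per 10 s clip

concurrency = 20                               # simultaneous generations
qps = concurrency / gpu_seconds                # ~0.167 generations/s
print(f"${raw_cost:.3f} raw, ${loaded_cost:.3f} loaded, {qps:.3f} gen/s")
```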
Sora vs Runway vs Pika — cost and quality comparison (original research, community estimates from public pricing and benchmarks as of 2024):
| System | Architecture | Est. $/10 s clip | Max duration | Temporal coherence | Notes |
|---|---|---|---|---|---|
| Sora (OpenAI) | Latent DiT, spacetime patches | ~$0.15 | Up to 60 s | Very high | (community estimate) Native variable resolution/duration |
| Runway Gen-2/3 | Latent diffusion, U-Net backbone | ~$0.05–0.10 | 10 s (Gen-2) | High | (community estimate) Per public API pricing |
| Pika 1.0 | Latent diffusion (DiT variant) | ~$0.03–0.08 | 3 s default | Medium-high | (community estimate) Lower resolution, shorter duration |
| Kling (Kuaishou) | DiT, 3D causal attention | ~$0.10 | 30 s | Very high | (community estimate) Strong physics — competitive with Sora |
Sora-plausible baseline: p99 generation latency ~120 s, 20 concurrent generations, ~4,000 H100-equivalents at $3.50/hr, 5% cache hit (most prompts unique). All values are community estimates. Vary the inputs to explore how tightening the latency SLO or changing the cache hit rate affects fleet size and cost.
| Output | Value |
|---|---|
| Effective QPS (after cache) | 19 |
| Latency-batch factor | 1.00× |
| GPUs needed | 4,000 (+0% vs baseline) |
| Hourly burn | $14,000 (+0% vs baseline) |
| Cost / request | $0.194 |
| Monthly burn (24×7) | $10,220,000 |
| Bottleneck | Balanced |
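A small function that reproduces the sandbox output above; note the cost-per-request figure spreads hourly burn across all 20 requests/s, cached or not. Inputs are the same community estimates:

```python
def fleet_model(qps=20.0, cache_hit=0.05, gpus=4000, gpu_hr=3.50, hours_per_month=730):
    effective_qps = qps * (1 - cache_hit)          # 19 gen/s actually hit the GPUs
    hourly_burn = gpus * gpu_hr                    # $14,000/hr
    cost_per_request = hourly_burn / (qps * 3600)  # ~$0.194 per request
    monthly_burn = hourly_burn * hours_per_month   # $10.22M at 24x7
    return effective_qps, hourly_burn, cost_per_request, monthly_burn

print(fleet_model())  # (19.0, 14000.0, 0.1944..., 10220000.0)
```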
Architecture
Ten components. Each exists because removing it violates a latency, cost, or safety SLO. The video pipeline extends the image-gen architecture (see Image-Gen Design Review) with two additions: C2PA signing for provenance, and a frame-level post-filter that scans individual video frames rather than a single image.
Sora Video-Generation Serving Pipeline
Client → API Gateway → Safety Pre-filter + Priority Queue → DiT GPU Pool → Step Checkpointer → Frame-level Post-filter → C2PA Signing → CDN → Client
Component justification (Sora-specific notes)
| Component | Why it exists | Video-specific note |
|---|---|---|
| Safety Pre-filter | Blocks policy violations before GPU allocation | Also checks for celebrity names and face-generation descriptors — text-level interception is far cheaper than frame-level post-filtering for these |
| DiT GPU Pool (denoising) | Spatial-temporal patch denoising — 30–50 steps per clip depending on tier | Each forward pass processes the full spatial-temporal latent volume — more expensive per step than image DiT because the temporal dimension adds tokens; a 10 s clip at 24 fps is 240 raw frames, compressed by the video VAE to ~60 latent frames × spatial tokens each |
| Frame-level Post-filter | Per-frame CSAM + NSFW + celebrity-likeness scan | Unlike image post-filtering (one scan), video requires scanning sampled frames (e.g., every 5th frame) plus first and last frame — adds ~1–2 s latency for a 240-frame clip; sampling schedule sketched below the table |
| C2PA Signing | Embeds AI-generation provenance metadata into the video container | No image-gen analogue at same criticality level — video deepfakes are a higher-stakes misuse vector than images, making traceable provenance a trust-and-safety requirement |
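A sketch of the frame-sampling schedule implied by the table, assuming 240 decoded frames and a stride of 5; classifier names follow the failure taxonomy in deep dive C below:

```python
def frames_to_scan(total_frames: int, stride: int = 5) -> dict[str, list[int]]:
    every_frame = list(range(total_frames))
    # Sampled stream: every 5th frame, plus the first and last frame explicitly.
    sampled = sorted(set(range(0, total_frames, stride)) | {0, total_frames - 1})
    return {
        "csam": every_frame,           # non-negotiable: no frame goes unscanned
        "celebrity_likeness": sampled,
        "ncii": sampled,
        "violence": sampled,
    }

schedule = frames_to_scan(240)
print(len(schedule["csam"]), len(schedule["celebrity_likeness"]))  # 240 49
```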
Quick check
The frame-level post-filter scans every 5th frame for celebrity likeness and NCII, but every frame for CSAM. An interviewer asks you to justify the asymmetry. What is the correct argument?
Deep Dives
Three components deserve the deep dive: DiT economics, the latent-vs-pixel tradeoff, and the video safety stack.
Deep dive A — DiT architecture and step-budget economics
Sora represents a departure from U-Net diffusion architectures (used in Stable Diffusion 1.x/2.x and DALL-E 2) toward the Diffusion Transformer (DiT) family introduced by Peebles and Xie (2023, arXiv:2212.09748). The architectural choice is load-bearing for every infra decision downstream.
- Spatial-temporal patch tokenization. Sora compresses raw video frames into a lower-dimensional latent space using a video VAE (compression factors per OpenAI research notes). The DiT then treats non-overlapping spacetime patches of this compressed latent as tokens, analogous to how a ViT treats image patches. For a 10 s clip at 24 fps compressed to 60 temporal frames at 128×72 spatial resolution, the token count is approximately 60 × (128/8) × (72/8) = 60 × 16 × 9 = 8,640 tokens per forward pass. Compare to an SDXL-class image model operating on roughly 256–1,024 latent tokens: Sora's forward pass processes 8–34× more tokens, which dominates GPU memory bandwidth requirements. (Arithmetic verified; VAE compression factors per OpenAI research notes.)
- Why transformers beat U-Nets at video scale. U-Net architectures use a hierarchical encoder-decoder structure designed for spatial 2D data. Extending to 3D video requires factored spatial-temporal attention (separate attention over space and time) or full 3D convolutions — both of which either sacrifice cross-frame coherence or scale poorly in memory. A transformer backbone with full self-attention over the spacetime-patch token sequence naturally captures long-range temporal dependencies (a camera panning over a scene can reference a frame 8 seconds prior) without architectural surgery. DiT also scales more predictably: the original paper (Table 1) demonstrates that FID improves log-linearly with model size, enabling confident compute-quality tradeoffs at design time.
- Step-budget economics — the 50-step multiplier. Each of the 50 denoising steps runs one full DiT forward pass over all 8,640 tokens. An H100 80 GB has ~2 TB/s memory bandwidth; at FP16 (2 bytes/param), reading a 7B-parameter DiT once takes roughly 14 GB ÷ 2 TB/s = 7 ms — but with batch size 1 and attention over 8,640 tokens, attention alone touches O(n²) cells: ~75 M attention cells per layer × 48 layers ≈ 3.6 B per step. Empirically (community estimate from DiT benchmark data), a single step for a 10 s clip on H100 takes approximately 2–3 s — consistent with the observed ~2 min generation time. This is 60× slower than a typical SDXL image generation (50 steps × ~0.04 s/step), reflecting the ~34× token count increase plus sequential attention costs. (Community estimate; verify against your target hardware.)
- The “fail at step 48” problem is much worse for video. At 2.4 s/step, a failure at step 48 wastes 48 × 2.4 s = 115 s of H100 time — roughly $0.11 per failure at $3.50/hr. At baseline load (20 concurrent clips, 0.5% failure rate) this is marginal, but during a hardware instability event a fleet of ~4,000 concurrent clips (community estimate) spiking to a 5% failure rate produces 4,000 × 30 clip-turnovers/hr × 0.05 = 6,000 failures/hr × $0.11 ≈ $660/hr of pure waste. Step-level checkpointing every 10 steps bounds each loss to the steps since the last checkpoint, cutting the expected waste 5× to ~$132/hr (arithmetic check: 660 ÷ 5 = 132). This is a direct business case for checkpointing that management can evaluate; a worked version of this model follows this list.
- Real-world example — DiT scaling. The original DiT paper (Peebles & Xie, 2023) evaluated 12 DiT configurations from DiT-S/8 (33M params) to DiT-XL/2 (675M params) and demonstrated that FID improves log-linearly as model size increases — roughly a 30× quality improvement for a 20× parameter increase. Sora's production model is substantially larger than DiT-XL/2 (community estimate: several billion parameters), but the scaling relationship gives confident a priori design guidance: if quality is insufficient, scaling the model is the highest-leverage intervention, not adding denoising steps past 50 (diminishing returns above ~40 steps per DDPM ablations). See: arxiv.org/abs/2212.09748, Table 1.
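A worked version of the checkpoint-waste model from the “fail at step 48” bullet; the $0.11-per-failure figure and the 30 turnovers/hr rate are community estimates from above:

```python
def wasted_dollars_per_hour(concurrent_clips, failure_rate, clip_seconds=120,
                            waste_per_failure=0.11, steps=50, checkpoint_every=None):
    clips_per_slot_per_hour = 3600 / clip_seconds  # each GPU slot turns over 30x/hr
    failures_per_hour = concurrent_clips * clips_per_slot_per_hour * failure_rate
    if checkpoint_every is not None:
        # Loss is bounded by the steps since the last checkpoint, not the full clip.
        waste_per_failure *= checkpoint_every / steps
    return failures_per_hour * waste_per_failure

print(wasted_dollars_per_hour(4000, 0.05))                       # $660/hr instability spike
print(wasted_dollars_per_hour(4000, 0.05, checkpoint_every=10))  # $132/hr with checkpoints
print(wasted_dollars_per_hour(20, 0.005))                        # ~$0.33/hr at baseline load
```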
Deep dive B — Latent diffusion vs pixel diffusion at video scale
The choice between latent-space diffusion (Rombach et al., 2022 — the architecture behind Stable Diffusion and Sora) and pixel-space diffusion (DDPM, DALL-E 2) is the most consequential architectural decision in video generation, and the reasoning is different from the image case because the data dimensionality is orders of magnitude higher.
- Why pixel diffusion fails at video scale. A 1080p 24 fps 10 s clip is 240 frames × 1920 × 1080 × 3 channels = ~1.5 GB of raw pixel data. Running the diffusion noise schedule directly on this space means every one of the 50 denoising steps must hold the full 1.5 GB tensor plus its attention activations in VRAM, and stream ~75 GB of cumulative tensor traffic per generation — the activation footprint alone exceeds an H100's 80 GB for a single clip, let alone a batch. Pixel-space diffusion is computationally infeasible for full-resolution video.
- The latent diffusion solution. A video VAE (variational autoencoder) compresses the clip to a much smaller latent representation (~32× overall compression, community estimate). The same 1.5 GB clip compresses to approximately 47 MB — small enough to fit multiple clips in H100 VRAM simultaneously. Diffusion runs entirely in this compressed latent space; only the final decode step (the VAE decoder) maps back to pixels. The decoder is a single pass, not 50 steps, so its compute cost is negligible relative to the denoising loop. (Arithmetic verified: 1.5 GB ÷ 32 ≈ 47 MB. VAE compression factors per OpenAI research notes; dimensional arithmetic sketched after this list.)
- Tradeoff: VAE bottleneck artifact rate. The VAE bottleneck introduces a fidelity ceiling — fine-grained details (text in video, individual hair strands, fast motion) that fall below the VAE's spatial resolution are permanently lost and cannot be recovered by more denoising steps. This manifests as blurry text and motion blur artifacts in Sora outputs — a known limitation acknowledged in the OpenAI research notes. From an eval perspective: add a text-legibility test to the weekly quality eval (generate a clip containing on-screen text; score whether the text is legible at 1× zoom). A regression in this metric points to a VAE quality issue, not a DiT quality issue.
- The serve-time implication. The VAE decoder is a fixed-cost operation at the end of generation (~2–5 s on H100 for a 10 s clip at 1080p — community estimate). It is not parallelizable with the denoising loop and is always on the critical path. Include it in end-to-end latency budgets. A single H100 that takes 120 s for denoising + 5 s for decode has a p99 of 125 s, not 120 s — relevant when every second matters for Plus-tier SLO.
- Real-world example — latent diffusion scalability. Rombach et al. (2022, §4.3) demonstrate that latent diffusion reduces training compute by 2.7× and inference memory by 4× vs pixel-space diffusion at 256×256 resolution. At 1080p these savings are proportionally larger because the pixel-space dimensionality grows with resolution squared. The paper also benchmarks the VAE reconstruction quality (PSNR, SSIM) as a function of compression ratio, showing that 8× spatial compression achieves near-lossless reconstruction on natural images. The practical limit is text rendering, which requires fine spatial structure the VAE cannot preserve at 8×. See: arxiv.org/abs/2112.10752, §4.3.
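The dimensional arithmetic from deep dives A and B in one place; compression factors and latent shapes are the community estimates quoted above:

```python
# Raw pixel tensor vs VAE latent (deep dive B).
frames, width, height, channels = 240, 1920, 1080, 3
raw_bytes = frames * width * height * channels  # ~1.49 GB at 1 byte/channel
latent_bytes = raw_bytes / 32                   # ~47 MB after ~32x VAE compression
print(f"raw: {raw_bytes / 1e9:.2f} GB, latent: {latent_bytes / 1e6:.0f} MB")

# Spacetime token count for the DiT forward pass (deep dive A).
t_frames, lat_w, lat_h, patch = 60, 128, 72, 8
tokens = t_frames * (lat_w // patch) * (lat_h // patch)  # 60 x 16 x 9 = 8,640
print(f"spacetime tokens per forward pass: {tokens}")
```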
Deep dive C — Video safety failure taxonomy (original research)
Video safety has five distinct failure modes that do not all appear in image-generation safety stacks. Each requires a different classifier architecture and a different eval golden set.
| Failure mode | Why it's harder in video | Detection approach | Severity |
|---|---|---|---|
| CSAM / minor harm | A single benign frame followed by a harmful frame is an attack vector with no image analogue | Every-frame scan (not sampled) — non-negotiable gate | Legal / existential |
| Celebrity likeness | A convincing deepfake requires only 2–3 frames of likeness in a 240-frame clip; text filter misses descriptive prompts | Face embedding similarity to indexed celebrity set on every 5th frame + first/last frame | Legal / reputational |
| NCII (intimate) | Motion in video makes the intimacy classifier more ambiguous than static image — classifiers trained on images underperform | Multi-class intimacy classifier with face-in-frame co-occurrence gate; retrain on video-specific golden set | Legal / reputational |
| Violence / gore | A single violent frame in an otherwise benign clip is not caught by clip-level classifiers | Sampled-frame violence classifier (every 5th frame); alert on any single-frame positive above threshold | Policy / reputational |
| Disinformation deepfake | A synthetically generated “real-world” event (explosion, crowd violence) with no safety classification violation — purely a misuse vector | C2PA content credentials (origin attestation) + downstream platform trust signals; undetectable at generation time | Societal — mitigated upstream by provenance |
The key infra insight: CSAM requires every-frame scanning (non-negotiable, adds ~1 s latency for 240 frames on GPU); celebrity likeness and NCII can be sampled (every 5th frame) to contain latency; disinformation deepfakes cannot be caught at generation time and require platform-level provenance infrastructure (C2PA) as the primary mitigation. Design the post-filter pipeline accordingly: CSAM gate runs first and is blocking; other classifiers run in parallel on the sampled-frame stream.
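A sketch of that ordering, assuming classifier callables that return True on a positive; the CSAM gate short-circuits before any sampled classifier runs:

```python
from concurrent.futures import ThreadPoolExecutor

def post_filter(frames, csam_check, sampled_checks, stride=5):
    # Blocking hard gate: any single CSAM positive kills the clip immediately.
    for frame in frames:
        if csam_check(frame):
            raise PermissionError("CSAM gate: clip blocked, incident filed")

    # Celebrity-likeness, NCII, violence run in parallel on the sampled stream.
    sampled = frames[::stride] + [frames[0], frames[-1]]
    with ThreadPoolExecutor() as pool:
        hits = pool.map(lambda check: any(check(f) for f in sampled), sampled_checks)
    if any(hits):
        raise PermissionError("Policy classifier positive: clip blocked")
    return frames  # safe to sign (C2PA) and push to the CDN
```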
Why does latent diffusion make full-resolution video generation computationally feasible when pixel-space diffusion does not?
Break It
Three structural removals, each with a deterministic failure mode and a specific metric that catches it.
Remove the frame-level post-filter
Without per-frame scanning, an adversarial prompt that bypasses text screening runs to completion and delivers a policy-violating clip. The post-filter is the only defense against prompts that are semantically benign in text form but resolve to harmful visual content after the diffusion process (e.g., a prompt for “a person getting a massage” that produces NCII frames due to model behavior). The blast radius is asymmetric: one distributed harmful clip causes reputational and legal damage disproportionate to the generation cost. Detection: weekly adversarial eval golden set shows non-zero clips reaching CDN. Mitigation: restore post-filter; invest in making the per-frame scan fast (<2 s for a 240-frame clip) rather than removing it to save latency.
Remove step-level checkpointing
Any GPU failure after step 30 restarts from step 0 — all 30+ steps of GPU time wasted. At a 0.5% hardware failure rate under sustained load and 20 concurrent generations, the expected waste is 0.005 × (20 concurrent × 3,600 s/hr ÷ 120 s/generation) × 120 GPU-s/failure = 360 GPU-seconds/hr lost. During a GPU hardware instability event (thermal throttling under sustained heavy load, firmware update rolling across the fleet), failure rates can spike to 2–5%, turning a marginal cost item into hundreds of GPU-hours of pure waste. Detection: job retry rate dashboard spikes; p95 end-to-end latency regresses as retried jobs re-queue. The per-job “time in system” metric (time from first enqueue to final delivery) is the right SLO to track — a single retry doubles it. Mitigation: restore checkpointing; accept the 5–15% I/O overhead in exchange for bounded worst-case retry cost.
Remove C2PA signing
Without content credentials, a Sora-generated clip has no machine-verifiable provenance. A user who downloads a clip and re-uploads it anywhere on the web loses all traceability. Misuse scenarios: political disinformation clips, fabricated journalism, and NCII are all much harder to remediate post-hoc without provenance metadata. The cost of C2PA signing is low (~<1 s per clip, CPU-only) and the blast radius of not having it is high. Detection: this is a compliance failure mode, not a latency failure mode — detected only after a high-profile misuse incident when investigators cannot trace the clip's origin. Mitigation: treat C2PA signing as non-optional, not a nice-to-have; budget 1 s of latency for it and make it a blocking gate (clip not delivered to CDN without valid credentials).
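A minimal fail-closed delivery gate, assuming a hypothetical `verify_manifest(clip_bytes)` wrapper around a real C2PA verification call; the actual SDK API is not shown here:

```python
def deliver_to_cdn(clip_bytes, verify_manifest, cdn_put):
    # Fail closed: a clip without valid provenance credentials never reaches the CDN.
    if not verify_manifest(clip_bytes):
        raise RuntimeError("C2PA verification failed; delivery blocked")
    return cdn_put(clip_bytes)
```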
Quick check
Removing C2PA signing saves <1 s latency and eliminates a CPU-only signing sidecar. An engineer argues the cost is worth it. What specific failure mode does this enable that no classifier can prevent?
What does a bad day cost?
Video incidents combine the reputational asymmetry of image-gen with higher GPU cost per incident (a failed video clip costs more than a failed image). The detection-window sensitivity is extreme — a 2-minute vs 30-minute detection window can be a 15× difference in both credits issued and harmful clips delivered.
- DiT numerical explosion on adversarial prompt. A prompt crafted to maximize attention weights causes NaN activations partway through the denoising loop, producing a corrupted latent. The VAE decodes it to visual noise; the post-filter rejects it. Direct cost: 50 steps × 2.4 s/step = 120 GPU-seconds = $0.117 per failure. At a 0.1% trigger rate on 20 generations/s of traffic (the sandbox QPS), expected cost is 0.001 × 20 × $0.117 × 3,600 s/hr = $8.42/hr — marginal. Spike scenario: if a public jailbreak pattern triggers NaN at 10% of traffic for 30 minutes before detection, cost is 0.10 × 20 × $0.117 × 1,800 s = $421 plus the reputational cost of degraded service for 30 minutes. Detection window matters: catching in 2 min (via NaN-rate alarm) costs $28 vs 30 min costs $421 — 15× delta; parameterized in code after this list. (Arithmetic verified: 0.001 × 20 × 0.117 × 3600 = $8.42; 0.10 × 20 × 0.117 × 1800 = $421; 0.10 × 20 × 0.117 × 120 = $28.08.)
- Frame-level CSAM/safety filter bypass. An adversarial prompt variant bypasses both text pre-filter and frame-level post-filter for 15 minutes before a trust-and-safety escalation catches it. At 20 concurrent generations with a 2% adversarial traffic rate, roughly 20 × 0.02 = 0.4 concurrent generation slots are adversarial. Over 900 s (15 min), each slot turns over 900 ÷ 120 = 7.5 times: 0.4 × 7.5 = 3 adversarial clips delivered. Each clip may contain 240 frames of harmful content — far higher blast radius than an image incident. Financial cost: minimal (three clips). Actual cost: legal exposure, regulatory inquiry, platform suspension from app stores. Mitigation: treat CSAM post-filter as a blocking hard gate with 99.99% minimum intercept rate; run red-team adversarial eval weekly with mandatory stop-the-deploy on any new bypass pattern. (Arithmetic verified: 0.4 × (900/120) = 0.4 × 7.5 = 3 clips.)
- Queue starvation during free-tier traffic rush. A viral social campaign triggers a 10× spike in free-tier generation requests. Without a weighted fair-share scheduler with a minimum paid-tier floor, the free-tier queue absorbs all GPU capacity. Plus-tier users start missing their 120 s p95 SLO. Duration: 45 minutes (time to detect via per-tier queue-depth alarm, page on-call, and increase paid-tier lane weight). Affected Plus-tier generations: if Plus generates 5 clips/min normally, and 45 min of degraded service means 50% SLO breach rate, roughly 5 × 45 × 0.5 = 112 SLA-credit events. At $0.20 credit per breached generation, SLA credits total $22.40. Engineering hours: 3 hrs on-call × $300/hr internal cost = $900. Total incident cost:~$922. Detection metric: per-tier p95 queue-wait alarm; healthy Plus p95 is <20 s, SLO breach threshold is >60 s. (Arithmetic verified: 5 × 45 × 0.5 = 112.5 ≈ 112; 112 × 0.20 = $22.40; 3 × 300 = $900; 22.40 + 900 = $922.40.)
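The detection-window arithmetic from the NaN incident, parameterized so alarm-latency targets can be costed directly; the 20 generations/s rate is the sandbox community estimate:

```python
def nan_incident_cost(detect_seconds, trigger_rate=0.10, qps=20, cost_per_clip=0.117):
    # Wasted GPU dollars scale linearly with time-to-detect.
    return trigger_rate * qps * cost_per_clip * detect_seconds

print(f"${nan_incident_cost(120):.0f}")   # ~$28 with a 2-minute alarm
print(f"${nan_incident_cost(1800):.0f}")  # ~$421 with a 30-minute detection window
```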
Sora On-call Runbook
Diffusion step explosion (NaN activations)
MTTR p50 / p99: 12 min / 30 min. Blast radius: affected generations produce visual noise and hit the post-filter; users see failed generations. At 0.1% failure rate, marginal cost. At 10%+ rate (jailbreak pattern), service degrades visibly.
- 1. Detect — NaN-rate alarm fires when >0.5% of active jobs report NaN activations in the denoising loop log. Fires within 60 s of onset. Secondary: post-filter rejection rate spike on a normally-passing prompt category.
- 2. Escalate — On-call ML engineer (primary). If NaN rate >5%, page infra lead — could indicate GPU memory corruption, not just adversarial prompt. Pull a sample of affected prompts for forensic analysis.
- 3. Rollback — Add the adversarial prompt pattern as a hard pre-filter rule (regex or embedding-similarity gate). Deploy in <10 min via feature-flag hot-reload. MTTR from detection to mitigation: ~12 min. Restart any GPU worker showing persistent NaN errors.
- 4. Post — Add the bypass pattern to the adversarial golden set. Run weekly eval to confirm the new pre-filter rule catches it with zero false positives on the benign golden set. Log the prompt pattern and its NaN trigger in the safety incident database.
CSAM/safety post-filter bypass
MTTR p50 / p99: 30 min / 4 hr. Blast radius: policy-violating clips delivered to CDN. Even 1 clip is a legal and existential incident. Reputational damage is disproportionate to number of clips.
- 1. Detect — Trust-and-safety weekly red-team eval — this is the primary detection mechanism for novel bypass patterns. Real-time: user reports + automated frame re-scan of all clips delivered in the past 24 hrs on a rolling basis (sampled at 10% of traffic for CPU cost management).
- 2. Escalate — Immediate: page trust-and-safety lead and legal counsel. Suspend delivery of clips containing human faces while investigation is active. Disable the affected prompt category in the pre-filter as a precaution.
- 3. Rollback — Temporarily tighten post-filter threshold (lower confidence threshold to cast a wider net, accepting higher FP rate as a safety-first tradeoff). Pull and manually review all clips from the 2-hour window before detection. Delete any confirmed policy-violating clips from CDN. MTTR to contain: ~30 min.
- 4. Post — Retrain post-filter with the adversarial examples from the bypass. Expand red-team golden set. Add the bypass pattern to the mandatory pre-filter hard-rules list. If the bypass involved a novel visual transformation, file a model-behavior bug with the Sora research team.
Free-tier traffic rush causing Plus-tier queue starvation
MTTR p50 / p99: 8 min / 25 min. Blast radius: Plus-tier users miss 120 s p95 SLO. SLA credits issued. Engineering on-call hours consumed. Reputational damage if sustained >15 min.
- 1. Detect — Per-tier p95 queue-wait alarm fires when Plus p95 exceeds 60 s (2× SLO). Fires within 2 min of onset. Secondary: Plus-tier generation completion rate drops below 90% of 5-min rolling average.
- 2. Escalate — On-call infra engineer (primary). Pull queue-depth dashboard by tier to confirm starvation. If Plus p95 >120 s (SLO breach), page engineering manager.
- 3. Rollback — Hot-reload scheduler config to increase Plus-tier lane weight from 2 to 4 (temporary). Add free-tier burst cap (max 3 concurrent generations per account) via feature flag. MTTR from detection to mitigation: ~8 min.
- 4. Post — Add auto-scaling rule: if Plus p95 queue wait >30 s, automatically increase Plus-tier lane weight until it returns to baseline. Implement free-tier per-account burst limits as a permanent config (review threshold with PM before deploying).
Quick check
Two incidents hit simultaneously: a NaN-explosion at 10% traffic for 30 min ($421 GPU waste) and a CSAM bypass delivering 3 clips over 15 min (minimal direct cost). Where should the on-call engineer's time go first, and why?
Company Lens
OpenAI loop
Sora is an OpenAI product, so interviewers expect depth on the specific design decisions described in the research notes. Expect questions on spacetime patch tokenization — “Why patches rather than frame-by-frame processing?” (answer: patches capture temporal relationships that per-frame processing cannot; the full spacetime-patch token sequence is what allows Sora to model camera motion and scene continuity across seconds). Expect questions on content policy at video scale — “How do you calibrate the celebrity-likeness classifier when the definition of ‘likeness’ is legally ambiguous?” (answer: separate technical threshold from policy threshold; the classifier emits a confidence score; policy — not engineering — sets the threshold above which generation is blocked, and that threshold is different for a private individual vs a public figure vs a politician). Interviewers will also probe the invite-only rollout strategy — not as a PR decision but as an infra one: “What does invite-only buy you that gradual percentage rollout doesn't?” Answer: invite-only gives you a controlled cohort of users whose prompts are known to be creative and constructive (the red-team can study the realistic distribution before adversarial users have access), and it limits the absolute QPS so GPU fleet scaling can be validated before demand is uncapped. A 1% traffic rollout of an open product still exposes you to the full adversarial internet at 1% scale.
Google / DeepMind loop
Google has Veo (announced Google I/O 2024) using a similar latent-DiT architecture. Expect questions on scaling video generation to web-scale: “How do you serve 1M concurrent video generations on a Borg-managed TPU fleet?” Key differences from H100 GPU serving: TPUs require XLA-compiled model graphs rather than PyTorch eager mode; the latent DiT forward pass must be fully static-shape for efficient TPU compilation, which means variable-duration/resolution inputs need padding to fixed shapes before dispatch, not native variable-length token sequences. Also expect questions on multi-modal grounding: “Veo accepts both text and image prompts (image-to-video). How does that change the architecture?” Answer: the text prompt is encoded by a text encoder (T5 or similar); the image prompt is encoded by a vision encoder (ViT); both conditioning signals are concatenated as cross-attention context for the DiT denoising loop. The safety stack also changes: the image pre-filter now scans the input image for policy violations before generation, adding a third pre-filter layer (text + image + generated frames post).
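A sketch of the bucketing this static-shape constraint forces, with illustrative bucket sizes; the point is the padding waste when a clip's token count lands just above a bucket boundary:

```python
BUCKETS = [2048, 4096, 8192, 16384]  # precompiled token-sequence lengths (illustrative)

def pad_to_bucket(num_tokens: int) -> tuple[int, int]:
    """Return (bucket_size, padding_waste) for a clip's spacetime token count."""
    for bucket in BUCKETS:
        if num_tokens <= bucket:
            return bucket, bucket - num_tokens
    raise ValueError("clip exceeds largest compiled bucket; reject or shard the job")

print(pad_to_bucket(8640))  # (16384, 7744): ~47% of the forward pass is padding
```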
Competitive landscape (2024–2025)
Sora's Dec 2024 public launch was preceded and followed by strong competitive entries. Veo 2 (Google DeepMind, Dec 2024) targets 4K resolution with explicit physics understanding and camera motion control — differentiating on cinematic realism over raw clip duration. MovieGen (Meta, Oct 2024) adds joint video+audio generation via a 30B video + 13B audio model pair, targeting social-media content creation where background music and ambient sound are as important as visuals. Wan2.1 (Alibaba, Jan 2025) is the first open-weight model to match or exceed proprietary SOTA on VBench benchmarks (community report), trained on 70M video clips — the open-weight catch-up cycle that image diffusion went through repeated for video in under 12 months. Interview framing: when asked “how does Sora defend its competitive moat?” the honest answer is that the architectural moat is thin (spacetime DiT + flow matching is replicable); the defensible advantages are OpenAI's safety review infrastructure, C2PA provenance pipeline, and integration with ChatGPT's distribution.
Meta loop
Meta's Movie Gen (announced October 2024) is a 30B-parameter video foundation model (per Meta research blog, community estimate on deployment details). Expect questions on open-source tradeoffs: “If Meta open-sources the video model weights, how does the safety architecture change?” Answer: pre-filter and post-filter must shift from being server-side gates to being model-weight-baked safety — fine-tuning the model to refuse harmful prompts rather than relying on an external classifier that users can bypass by running the weights locally. This connects to RLHF-based safety fine-tuning, not infrastructure-level content filtering. Also expect questions on social-media-scale serving: “Reels serves 500M+ users. How would you integrate video generation into that serving stack?” Key insight: AI-generated video in a social feed has the same CDN-delivery path as user-uploaded video, so the marginal infrastructure cost is in the generation step, not the serving step. Meta's existing video-transcoding and CDN infrastructure (hundreds of PoPs globally) can serve generated clips with essentially no incremental engineering effort — the differentiated challenge is generation fleet management and safety at Reels scale (billions of generations/day is many orders of magnitude beyond Sora's current scale).
Key Takeaways
What to remember for interviews
- 1Latent diffusion (32× compression via video VAE) is the prerequisite for video generation — pixel-space diffusion at 1080p would require ~1.5 GB per denoising step, exceeding H100 VRAM capacity. The VAE bottleneck introduces a fidelity ceiling for fine detail (text, fast motion) that cannot be fixed by adding denoising steps.
- 2A 50-step DiT generation that fails at step 48 costs $0.117 in raw GPU time per clip (community estimate: 120 GPU-s × $3.50/hr). Step-level checkpointing every 10 steps reduces the worst-case retry cost by 83%. At scale, this is the difference between $132/hr and $660/hr in wasted GPU time during hardware instability events.
- 3Video safety requires five distinct classifier types (CSAM, celebrity likeness, NCII, violence, disinformation deepfake) versus two for image-gen. CSAM must scan every frame; others can be sampled. C2PA provenance signing is the only mitigation for disinformation deepfakes — undetectable at generation time by classifiers.
- 4Detection-window sensitivity dominates video incident cost. A 15× difference in cost between 2-minute and 30-minute detection windows for the same underlying failure. Invest in alarm sensitivity before incident-response speed.
- 5Invite-only rollout buys infra validation, not just PR management: it limits absolute QPS while the GPU fleet scaling is confirmed, and exposes only constructive users (not the adversarial internet) to the model before red-team coverage of the full adversarial distribution is complete.
- 6Cache hit rate for video generation is near-zero (~5% from exact-match retries). Never apply LLM prefix-cache sizing assumptions to a video fleet — the SLO-cost tradeoffs are driven entirely by GPU-seconds per clip, not token cache efficiency.
Interview Questions
★★★ A Sora generation fails at step 48 of 50, consuming almost the full GPU budget with no deliverable. Walk through two structural mitigations and quantify the expected wasted GPU-seconds saved by each.
★★☆ Why is a 120-second p99 generation latency for Sora not directly comparable to a 120-second p99 for a long-document LLM response, and how should you design the UX and SLO differently?
★★★ Design the safety stack for a service that generates realistic human faces in video. What are the three hardest failure modes, and how do you detect each before a public incident?
★★☆ The Sora team proposes shipping a free tier that allows unlimited generations but enforces a 480p resolution cap and a 5-second duration cap. As the infra lead, what do you push back on, and what do you add?
Recap quiz
Sora Design Review recap
A 1080p 10-second clip is ~1.5 GB of raw pixel data. After Sora's video VAE, that latent fits in H100 VRAM for batching. What is the approximate compressed size and why does that ratio matter for fleet sizing?
A Sora generation fails at step 48 of 50. Without checkpointing, what is the wasted GPU cost per failure, and by what factor does 10-step checkpointing reduce that waste?
In the NaN-explosion incident, catching the adversarial pattern in 2 minutes costs ~$28 while a 30-minute detection window costs ~$421. What is the detection-window ratio, and which alarm metric catches this fastest?
Celebrity-likeness and NCII classifiers run on every 5th frame of a generated clip, but CSAM scanning must run on every frame. Why is sampled scanning acceptable for the first two but not the third?
Sora's VAE compresses video 32× before diffusion. An engineer proposes adding more denoising steps (e.g., 80 instead of 50) to fix blurry text and motion artifacts in outputs. Why won't more steps help?
A viral social campaign triggers a 10× spike in free-tier generation requests. Without a weighted fair-share scheduler, which SLO fails first and how quickly?
DiT-XL/2 (675M params) achieves FID 2.27 on ImageNet vs U-Net models that plateau around FID 3–4 at similar compute. Why does this scaling advantage make DiT the correct backbone choice for Sora specifically?
Further Reading
- OpenAI Sora — Technical Report (2024) — Primary source. Describes spacetime patch tokenization, variable-duration/resolution training, and the diffusion transformer architecture. All Sora architectural claims in this module are sourced here or labeled as community estimates.
- Scalable Diffusion Models with Transformers — DiT (Peebles & Xie, 2023) — The DiT paper that Sora builds on. Demonstrates that replacing U-Net with a transformer backbone for diffusion improves both quality and scaling behavior. The step-budget economics section draws on DiT compute-cost analysis from Table 1.
- High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022) — The latent-diffusion foundation. Explains why denoising in a compressed latent space (4× or 8× spatial compression) rather than pixel space is computationally tractable at video scale, and how the VAE bottleneck affects fidelity.
- C2PA Content Credentials Specification 2.0 — The open standard for AI-generation provenance metadata. Directly relevant to the C2PA signing component in the architecture and the OpenAI company-lens discussion on watermarking and traceability.
- Andrej Karpathy — Neural Networks: Zero to Hero (2022) — Pedagogical foundation for understanding the transformer backbone that DiT — and Sora — build on. Particularly the attention mechanism and positional encoding intuitions needed to reason about spatial-temporal patch tokenization.