🎬 Case: Design Sora
A 10-second clip costs more GPU-hours than your laptop's lifetime
Text-to-video is video-scale diffusion: expensive, latency-sensitive, and safety-heavy. Sora matters because the same trio appears in every serious video system: step budgets, frame-level safety, and queueing. There is one additional infra question: the latent-vs-pixel diffusion tradeoff, which plays out differently at video scale than in image generation. Cross-links: Image-Gen Design Review (image diffusion baseline) and ChatGPT Design Review (multi-tier GPU fleet serving).
Requirements & SLOs
Working backwards from the user
SLO table
| Metric | Target | Why this value |
|---|---|---|
| p50 queue wait | <15 s | Users will wait for video, but queue time before the progress bar appears starts reading as “broken” above ~15 s |
| p95 end-to-end (Plus tier, 10 s clip) | 120 s | Includes queue wait + 50-step DiT denoising + post-filter + C2PA signing + CDN upload; above 2 min users abandon (community estimate from UX research on async tasks) |
| p95 end-to-end (free tier) | 300 s | Free tier tolerates longer waits; separate target prevents free-tier volume from masking Plus regressions |
| CSAM intercept rate | 99.99%+ | Hard floor — a single bypass is a legal and existential event; treat as a non-negotiable gate, not a p-metric |
| NSFW intercept rate | >99% | Combined pre- and post-filter; one publicly distributed policy-violating clip is a reputational incident |
| Step budget — free tier | 30 steps | Acceptable quality at 480p; limits free-tier GPU cost to ~60% of a full paid generation |
| Step budget — paid tier | 50 steps | Standard quality at 1080p; consistent with Sora research report default |
| Availability | 99.9% / month | ~43 min downtime / month; GPU hardware failures are more frequent under sustained heavy load — multi-AZ pool required |
Eval Harness (design first)
Video eval is harder than image eval: temporal coherence, motion quality, and physics plausibility require either expensive human review or specialized video-quality models. Design the eval tiers before architecture — the metrics determine which components matter most.
Level 1 — Automated checks (every deploy)
- Safety gate smoke test — a fixed set of clearly policy-violating prompts that must be blocked, and benign prompts that must produce a non-empty clip. If either fails, the deploy stops. CSAM golden set is mandatory and never shrinks.
- C2PA credential presence — every generated clip must have valid content credentials embedded; a missing or malformed C2PA manifest fails the check.
- Step-budget enforcement — assert that a free-tier job generates a clip with exactly 30 denoising steps, not 50. Detectable by logging step counts in job metadata. (A minimal gate sketch covering all three checks follows this list.)
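A minimal sketch of the Level-1 gate, assuming a hypothetical `generate(prompt)` client that returns job metadata (denoising step count, clip bytes, C2PA verification result); none of these names are a real Sora API:

```python
from dataclasses import dataclass

@dataclass
class JobResult:
    blocked: bool           # True if the safety gate refused the prompt
    clip_bytes: bytes       # empty if blocked or generation failed
    steps_run: int          # denoising steps recorded in job metadata
    c2pa_manifest_ok: bool  # embedded Content Credentials parsed and verified

def level1_gate(generate, violating_prompts, benign_prompts, free_tier_steps=30):
    """Return True only if every Level-1 check passes; otherwise stop the deploy."""
    # 1. Safety smoke test: every known-violating prompt must be blocked.
    for p in violating_prompts:
        if not generate(p).blocked:
            return False
    for p in benign_prompts:
        r = generate(p)
        # 2. Benign prompts must yield a non-empty clip with valid C2PA credentials.
        if r.blocked or not r.clip_bytes or not r.c2pa_manifest_ok:
            return False
        # 3. Step-budget enforcement: free-tier jobs run exactly 30 steps, not 50.
        if r.steps_run != free_tier_steps:
            return False
    return True
```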
Level 2 — Video quality eval (weekly, 200-clip golden set)
- Temporal coherence — a video-quality model (e.g., FVD: Fréchet Video Distance against a reference distribution of real clips matched by category) measures how “realistic” the motion is. Calibrate FVD thresholds on a 200-clip human-rated golden set where clips are labeled “coherent” vs “flickery/broken.”
- Prompt alignment — LLM judge receives the generated clip (frame samples at 1 fps) and the original prompt; rates subject presence, scene match, and style adherence on 1–5. Target: mean ≥ 3.5 on the golden set; a drop > 0.3 points triggers a model regression alert.
- Frame-level safety on adversarial set — 1,000 known-adversarial prompts run through the full pipeline weekly; every frame scanned by the post-filter. Target: zero policy violations reaching the CDN. Any non-zero count halts the weekly eval and triggers immediate red-team review.
Online bridge
Map offline metrics to online proxies: FVD score should correlate with 7-day retention on the video creation flow; prompt-alignment score should correlate with regeneration rate (users who regenerate more than twice are misaligned). Validate the bridge after each major model update — an offline quality win that does not reduce regeneration rate means the golden set is not representative of real user prompts.
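A sketch of the bridge check, assuming per-release golden-set scores can be joined against cohort-level regeneration rates; the field values and the 0.3 correlation threshold are illustrative assumptions:

```python
import numpy as np

def bridge_holds(offline_alignment, online_regen_rate, min_abs_corr=0.3):
    """Offline prompt-alignment wins should anti-correlate with regeneration rate.

    offline_alignment: per-release mean golden-set alignment score (1-5 scale)
    online_regen_rate: per-release fraction of users regenerating more than twice
    """
    r = np.corrcoef(offline_alignment, online_regen_rate)[0, 1]
    # A real quality win should *reduce* regenerations: expect a negative correlation.
    return r < -min_abs_corr

# Alignment rose across four releases while regen rate fell: the bridge holds.
print(bridge_holds([3.4, 3.6, 3.8, 3.9], [0.22, 0.19, 0.15, 0.14]))  # True
```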
Back-of-Envelope
Sora operates at far lower QPS than a chat product — a 2-minute generation at $0.10–0.20 per clip (community estimate) means even 20 concurrent generations is a meaningful GPU load. The sandbox below is seeded with plausible Sora-scale numbers; all values are community estimates based on public GPU pricing and OpenAI research notes.
Reverse-engineered cost model
Original research: deriving $/10 s clip from first principles.
- GPU time per clip: 50 steps × ~2.4 s/step (estimated from DiT-XL/2 benchmarks on H100, scaled for video latent size) = 120 GPU-seconds per clip. (community estimate)
- Arithmetic check: 120 GPU-s ÷ 3600 s/hr × $3.50/hr = $0.117 per clip in raw GPU time. With 30% infrastructure overhead (storage, networking, safety classifiers): ~$0.15 per 10 s clip. (community estimate)
- Fleet sizing: at 20 concurrent generations (QPS = 0.167 generations/s, each occupying one H100 for 120 s), you need 20 H100s running continuously. At 4,000 H100-equivalents (community estimate for the Sora fleet), the ceiling is ~4,000 simultaneous clips, or ~33 completed generations/s at 120 s per clip — before reserving headroom for retries, evals, and safety classifiers. (community estimate; worked in code below)
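The same model in code so the constants can be swapped for measured values; every number is a community estimate carried over from the bullets above:

```python
STEPS = 50            # paid-tier denoising steps
SEC_PER_STEP = 2.4    # est. from DiT-XL/2 H100 benchmarks, scaled for video latents
H100_PER_HOUR = 3.50  # $/hr, public cloud pricing
OVERHEAD = 0.30       # storage, networking, safety classifiers

gpu_seconds = STEPS * SEC_PER_STEP             # 120 GPU-s per clip
raw_cost = gpu_seconds / 3600 * H100_PER_HOUR  # $0.117 per clip, raw GPU time
loaded_cost = raw_cost * (1 + OVERHEAD)        # ~$0.15 per 10 s clip

concurrency = 20                               # simultaneous generations
qps = concurrency / gpu_seconds                # ~0.167 generations/s
print(f"${raw_cost:.3f} raw, ${loaded_cost:.3f} loaded, {qps:.3f} gen/s")
```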
Sora vs Runway vs Pika — cost and quality comparison (original research, community estimates from public pricing and benchmarks as of 2024):
| System | Architecture | Est. $/10 s clip | Max duration | Temporal coherence | Notes |
|---|---|---|---|---|---|
| Sora (OpenAI) | Latent DiT, spacetime patches | ~$0.15 | Up to 60 s | Very high | (community estimate) Native variable resolution/duration |
| Runway Gen-2/3 | Latent diffusion, U-Net backbone | ~$0.05–0.10 | 10 s (Gen-2) | High | (community estimate) Per public API pricing |
| Pika 1.0 | Latent diffusion (DiT variant) | ~$0.03–0.08 | 3 s default | Medium-high | (community estimate) Lower resolution, shorter duration |
| Kling (Kuaishou) | DiT, 3D causal attention | ~$0.10 | 30 s | Very high | (community estimate) Strong physics — competitive with Sora |
Sora-plausible baseline: p99 generation latency ~120 s, 20 concurrent generations, ~4,000 H100-equivalents at $3.50/hr, 5% cache hit (most prompts unique). All values are community estimates. Vary the inputs to explore how tightening the latency SLO or changing the cache hit rate affects fleet size and cost.
| Output | Value |
|---|---|
| Effective QPS (after cache) | 19 |
| Latency-batch factor | 1.00× |
| GPUs needed | 4,000 (+0% vs baseline) |
| Hourly burn | $14,000 (+0% vs baseline) |
| Cost / request | $0.194 |
| Monthly burn (24×7) | $10,220,000 |
| Bottleneck | Balanced |
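A small function that reproduces the sandbox output above; note the cost-per-request figure spreads hourly burn across all 20 requests/s, cached or not. Inputs are the same community estimates:

```python
def fleet_model(qps=20.0, cache_hit=0.05, gpus=4000, gpu_hr=3.50, hours_per_month=730):
    effective_qps = qps * (1 - cache_hit)          # 19 gen/s actually hit the GPUs
    hourly_burn = gpus * gpu_hr                    # $14,000/hr
    cost_per_request = hourly_burn / (qps * 3600)  # ~$0.194 per request
    monthly_burn = hourly_burn * hours_per_month   # $10.22M at 24x7
    return effective_qps, hourly_burn, cost_per_request, monthly_burn

print(fleet_model())  # (19.0, 14000.0, 0.1944..., 10220000.0)
```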
Architecture
Ten components. Each exists because removing it violates a latency, cost, or safety SLO. The video pipeline extends the image-gen architecture (see Image-Gen Design Review) with two additions: C2PA signing for provenance, and a frame-level post-filter that scans individual video frames rather than a single image.
Sora Video-Generation Serving Pipeline
Client → API Gateway → Safety Pre-filter + Priority Queue → DiT GPU Pool → Step Checkpointer → Frame-level Post-filter → C2PA Signing → CDN → Client
Component justification (Sora-specific notes)
| Component | Why it exists | Video-specific note |
|---|---|---|
| Safety Pre-filter | Blocks policy violations before GPU allocation | Also checks for celebrity names and face-generation descriptors — text-level interception is far cheaper than frame-level post-filtering for these |
| DiT GPU Pool (denoising) | Spatial-temporal patch denoising — 30–50 steps per clip depending on tier | Each forward pass processes the full spatial-temporal latent volume — more expensive per step than image DiT because the temporal dimension adds tokens; a 10 s clip at 24 fps is 240 raw frames, compressed by the video VAE to ~60 latent frames × spatial tokens each |
| Frame-level Post-filter | Per-frame CSAM + NSFW + celebrity-likeness scan | Unlike image post-filtering (one scan), video requires scanning sampled frames (e.g., every 5th frame) plus first and last frame — adds ~1–2 s latency for a 240-frame clip; sampling schedule sketched below the table |
| C2PA Signing | Embeds AI-generation provenance metadata into the video container | No image-gen analogue at same criticality level — video deepfakes are a higher-stakes misuse vector than images, making traceable provenance a trust-and-safety requirement |
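A sketch of the frame-sampling schedule implied by the table, assuming 240 decoded frames and a stride of 5; classifier names follow the failure taxonomy in deep dive C below:

```python
def frames_to_scan(total_frames: int, stride: int = 5) -> dict[str, list[int]]:
    every_frame = list(range(total_frames))
    # Sampled stream: every 5th frame, plus the first and last frame explicitly.
    sampled = sorted(set(range(0, total_frames, stride)) | {0, total_frames - 1})
    return {
        "csam": every_frame,           # non-negotiable: no frame goes unscanned
        "celebrity_likeness": sampled,
        "ncii": sampled,
        "violence": sampled,
    }

schedule = frames_to_scan(240)
print(len(schedule["csam"]), len(schedule["celebrity_likeness"]))  # 240 49
```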
Quick check
The frame-level post-filter scans every 5th frame for celebrity likeness and NCII, but every frame for CSAM. An interviewer asks you to justify the asymmetry. What is the correct argument?
Deep Dives
Three components deserve the deep dive: DiT economics, the latent-vs-pixel tradeoff, and the video safety stack.
Deep dive A — DiT architecture and step-budget economics
Sora represents a departure from U-Net diffusion architectures (used in Stable Diffusion 1.x/2.x and DALL-E 2) toward the Diffusion Transformer (DiT) family introduced by Peebles and Xie (2023, arXiv:2212.09748). The architectural choice is load-bearing for every infra decision downstream.
- Spatial-temporal patch tokenization. Sora compresses raw video frames into a lower-dimensional latent space using a video VAE (compression factors per OpenAI research notes). The DiT then treats non-overlapping spacetime patches of this compressed latent as tokens, analogous to how a ViT treats image patches. For a 10 s clip at 24 fps compressed to 60 temporal frames at 128×72 spatial resolution, the token count is approximately 60 × (128/8) × (72/8) = 60 × 16 × 9 = 8,640 tokens per forward pass. Compare to an SDXL-class image model operating on roughly 256–1,024 latent tokens: Sora's forward pass processes 8–34× more tokens, which dominates GPU memory bandwidth requirements. (Arithmetic verified; VAE compression factors per OpenAI research notes.)
- Why transformers beat U-Nets at video scale. U-Net architectures use a hierarchical encoder-decoder structure designed for spatial 2D data. Extending to 3D video requires factored spatial-temporal attention (separate attention over space and time) or full 3D convolutions — both of which either sacrifice cross-frame coherence or scale poorly in memory. A transformer backbone with full self-attention over the spacetime-patch token sequence naturally captures long-range temporal dependencies (a camera panning over a scene can reference a frame 8 seconds prior) without architectural surgery. DiT also scales more predictably: the original paper (Table 1) demonstrates that FID improves log-linearly with model size, enabling confident compute-quality tradeoffs at design time.
- Step-budget economics — the 50-step multiplier. Each of the 50 denoising steps runs one full DiT forward pass over all 8,640 tokens. An H100 80 GB has ~2 TB/s memory bandwidth; at FP16 (2 bytes/param), reading a 7B-parameter DiT once takes roughly 14 GB ÷ 2 TB/s = 7 ms — but with batch size 1 and attention over 8,640 tokens, attention alone touches O(n²) cells: ~75 M attention cells per layer × 48 layers ≈ 3.6 B per step. Empirically (community estimate from DiT benchmark data), a single step for a 10 s clip on H100 takes approximately 2–3 s — consistent with the observed ~2 min generation time. This is 60× slower than a typical SDXL image generation (50 steps × ~0.04 s/step), reflecting the ~34× token count increase plus sequential attention costs. (Community estimate; verify against your target hardware.)
- The “fail at step 48” problem is much worse for video. At 2.4 s/step, a failure at step 48 wastes 48 × 2.4 s = 115 s of H100 time — roughly $0.11 per failure at $3.50/hr. At baseline load (20 concurrent clips, 0.5% failure rate) this is marginal, but during a hardware instability event a fleet of ~4,000 concurrent clips (community estimate) spiking to a 5% failure rate produces 4,000 × 30 clip-turnovers/hr × 0.05 = 6,000 failures/hr × $0.11 ≈ $660/hr of pure waste. Step-level checkpointing every 10 steps bounds each loss to the steps since the last checkpoint, cutting the expected waste 5× to ~$132/hr (arithmetic check: 660 ÷ 5 = 132). This is a direct business case for checkpointing that management can evaluate; a worked version of this model follows this list.
- Real-world example — DiT scaling. The original DiT paper (Peebles & Xie, 2023) evaluated 12 DiT configurations from DiT-S/8 (33M params) to DiT-XL/2 (675M params) and demonstrated that FID improves log-linearly as model size increases — roughly a 30× quality improvement for a 20× parameter increase. Sora's production model is substantially larger than DiT-XL/2 (community estimate: several billion parameters), but the scaling relationship gives confident a priori design guidance: if quality is insufficient, scaling the model is the highest-leverage intervention, not adding denoising steps past 50 (diminishing returns above ~40 steps per DDPM ablations). See: arxiv.org/abs/2212.09748, Table 1.
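A worked version of the checkpoint-waste model from the “fail at step 48” bullet; the $0.11-per-failure figure and the 30 turnovers/hr rate are community estimates from above:

```python
def wasted_dollars_per_hour(concurrent_clips, failure_rate, clip_seconds=120,
                            waste_per_failure=0.11, steps=50, checkpoint_every=None):
    clips_per_slot_per_hour = 3600 / clip_seconds  # each GPU slot turns over 30x/hr
    failures_per_hour = concurrent_clips * clips_per_slot_per_hour * failure_rate
    if checkpoint_every is not None:
        # Loss is bounded by the steps since the last checkpoint, not the full clip.
        waste_per_failure *= checkpoint_every / steps
    return failures_per_hour * waste_per_failure

print(wasted_dollars_per_hour(4000, 0.05))                       # $660/hr instability spike
print(wasted_dollars_per_hour(4000, 0.05, checkpoint_every=10))  # $132/hr with checkpoints
print(wasted_dollars_per_hour(20, 0.005))                        # ~$0.33/hr at baseline load
```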
Deep dive B — Latent diffusion vs pixel diffusion at video scale
The choice between latent-space diffusion (Rombach et al., 2022 — the architecture behind Stable Diffusion and Sora) and pixel-space diffusion (DDPM, DALL-E 2) is the most consequential architectural decision in video generation, and the reasoning is different from the image case because the data dimensionality is orders of magnitude higher.
- Why pixel diffusion fails at video scale. A 1080p 24 fps 10 s clip is 240 frames × 1920 × 1080 × 3 channels = ~1.5 GB of raw pixel data. Running the diffusion noise schedule directly on this space means every one of the 50 denoising steps must hold the full 1.5 GB tensor plus its attention activations in VRAM, and stream ~75 GB of cumulative tensor traffic per generation — the activation footprint alone exceeds an H100's 80 GB for a single clip, let alone a batch. Pixel-space diffusion is computationally infeasible for full-resolution video.
- The latent diffusion solution. A video VAE (variational autoencoder) compresses the clip to a much smaller latent representation (~32× overall compression, community estimate). The same 1.5 GB clip compresses to approximately 47 MB — small enough to fit multiple clips in H100 VRAM simultaneously. Diffusion runs entirely in this compressed latent space; only the final decode step (the VAE decoder) maps back to pixels. The decoder is a single pass, not 50 steps, so its compute cost is negligible relative to the denoising loop. (Arithmetic verified: 1.5 GB ÷ 32 ≈ 47 MB. VAE compression factors per OpenAI research notes; dimensional arithmetic sketched after this list.)
- Tradeoff: VAE bottleneck artifact rate. The VAE bottleneck introduces a fidelity ceiling — fine-grained details (text in video, individual hair strands, fast motion) that fall below the VAE's spatial resolution are permanently lost and cannot be recovered by more denoising steps. This manifests as blurry text and motion blur artifacts in Sora outputs — a known limitation acknowledged in the OpenAI research notes. From an eval perspective: add a text-legibility test to the weekly quality eval (generate a clip containing on-screen text; score whether the text is legible at 1× zoom). A regression in this metric points to a VAE quality issue, not a DiT quality issue.
- The serve-time implication. The VAE decoder is a fixed-cost operation at the end of generation (~2–5 s on H100 for a 10 s clip at 1080p — community estimate). It is not parallelizable with the denoising loop and is always on the critical path. Include it in end-to-end latency budgets. A single H100 that takes 120 s for denoising + 5 s for decode has a p99 of 125 s, not 120 s — relevant when every second matters for Plus-tier SLO.
- Real-world example — latent diffusion scalability. Rombach et al. (2022, §4.3) demonstrate that latent diffusion reduces training compute by 2.7× and inference memory by 4× vs pixel-space diffusion at 256×256 resolution. At 1080p these savings are proportionally larger because the pixel-space dimensionality grows with resolution squared. The paper also benchmarks the VAE reconstruction quality (PSNR, SSIM) as a function of compression ratio, showing that 8× spatial compression achieves near-lossless reconstruction on natural images. The practical limit is text rendering, which requires fine spatial structure the VAE cannot preserve at 8×. See: arxiv.org/abs/2112.10752, §4.3.
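The dimensional arithmetic from deep dives A and B in one place; compression factors and latent shapes are the community estimates quoted above:

```python
# Raw pixel tensor vs VAE latent (deep dive B).
frames, width, height, channels = 240, 1920, 1080, 3
raw_bytes = frames * width * height * channels  # ~1.49 GB at 1 byte/channel
latent_bytes = raw_bytes / 32                   # ~47 MB after ~32x VAE compression
print(f"raw: {raw_bytes / 1e9:.2f} GB, latent: {latent_bytes / 1e6:.0f} MB")

# Spacetime token count for the DiT forward pass (deep dive A).
t_frames, lat_w, lat_h, patch = 60, 128, 72, 8
tokens = t_frames * (lat_w // patch) * (lat_h // patch)  # 60 x 16 x 9 = 8,640
print(f"spacetime tokens per forward pass: {tokens}")
```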
Deep dive C — Video safety failure taxonomy (original research)
Video safety has five distinct failure modes that do not all appear in image-generation safety stacks. Each requires a different classifier architecture and a different eval golden set.
| Failure mode | Why it's harder in video | Detection approach | Severity |
|---|---|---|---|
| CSAM / minor harm | A single benign frame followed by a harmful frame is an attack vector with no image analogue | Every-frame scan (not sampled) — non-negotiable gate | Legal / existential |
| Celebrity likeness | A convincing deepfake requires only 2–3 frames of likeness in a 240-frame clip; text filter misses descriptive prompts | Face embedding similarity to indexed celebrity set on every 5th frame + first/last frame | Legal / reputational |
| NCII (intimate) | Motion in video makes the intimacy classifier more ambiguous than static image — classifiers trained on images underperform | Multi-class intimacy classifier with face-in-frame co-occurrence gate; retrain on video-specific golden set | Legal / reputational |
| Violence / gore | A single violent frame in an otherwise benign clip is not caught by clip-level classifiers | Sampled-frame violence classifier (every 5th frame); alert on any single-frame positive above threshold | Policy / reputational |
| Disinformation deepfake | A synthetically generated “real-world” event (explosion, crowd violence) with no safety classification violation — purely a misuse vector | C2PA content credentials (origin attestation) + downstream platform trust signals; undetectable at generation time | Societal — mitigated upstream by provenance |
The key infra insight: CSAM requires every-frame scanning (non-negotiable, adds ~1 s latency for 240 frames on GPU); celebrity likeness and NCII can be sampled (every 5th frame) to contain latency; disinformation deepfakes cannot be caught at generation time and require platform-level provenance infrastructure (C2PA) as the primary mitigation. Design the post-filter pipeline accordingly: CSAM gate runs first and is blocking; other classifiers run in parallel on the sampled-frame stream.
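A sketch of that ordering, assuming classifier callables that return True on a positive; the CSAM gate short-circuits before any sampled classifier runs:

```python
from concurrent.futures import ThreadPoolExecutor

def post_filter(frames, csam_check, sampled_checks, stride=5):
    # Blocking hard gate: any single CSAM positive kills the clip immediately.
    for frame in frames:
        if csam_check(frame):
            raise PermissionError("CSAM gate: clip blocked, incident filed")

    # Celebrity-likeness, NCII, violence run in parallel on the sampled stream.
    sampled = frames[::stride] + [frames[0], frames[-1]]
    with ThreadPoolExecutor() as pool:
        hits = pool.map(lambda check: any(check(f) for f in sampled), sampled_checks)
    if any(hits):
        raise PermissionError("Policy classifier positive: clip blocked")
    return frames  # safe to sign (C2PA) and push to the CDN
```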
Why does latent diffusion make full-resolution video generation computationally feasible when pixel-space diffusion does not?
Break It
Three structural removals, each with a deterministic failure mode and a specific metric that catches it.
Remove the frame-level post-filter
Without per-frame scanning, an adversarial prompt that bypasses text screening runs to completion and delivers a policy-violating clip. The post-filter is the only defense against prompts that are semantically benign in text form but resolve to harmful visual content after the diffusion process (e.g., a prompt for “a person getting a massage” that produces NCII frames due to model behavior). The blast radius is asymmetric: one distributed harmful clip causes reputational and legal damage disproportionate to the generation cost. Detection: weekly adversarial eval golden set shows non-zero clips reaching CDN. Mitigation: restore post-filter; invest in making the per-frame scan fast (<2 s for a 240-frame clip) rather than removing it to save latency.
Remove step-level checkpointing
Any GPU failure after step 30 restarts from step 0 — all 30+ steps of GPU time wasted. At a 0.5% hardware failure rate under sustained load and 20 concurrent generations, the expected waste is 0.005 × (20 concurrent × 3,600 s/hr ÷ 120 s/generation) × 120 GPU-s/failure = 360 GPU-seconds/hr lost. During a GPU hardware instability event (thermal throttling under sustained heavy load, firmware update rolling across the fleet), failure rates can spike to 2–5%, turning a marginal cost item into hundreds of GPU-hours of pure waste. Detection: job retry rate dashboard spikes; p95 end-to-end latency regresses as retried jobs re-queue. The per-job “time in system” metric (time from first enqueue to final delivery) is the right SLO to track — a single retry doubles it. Mitigation: restore checkpointing; accept the 5–15% I/O overhead in exchange for bounded worst-case retry cost.
Remove C2PA signing
Without content credentials, a Sora-generated clip has no machine-verifiable provenance. A user who downloads a clip and re-uploads it anywhere on the web loses all traceability. Misuse scenarios: political disinformation clips, fabricated journalism, and NCII are all much harder to remediate post-hoc without provenance metadata. The cost of C2PA signing is low (~<1 s per clip, CPU-only) and the blast radius of not having it is high. Detection: this is a compliance failure mode, not a latency failure mode — detected only after a high-profile misuse incident when investigators cannot trace the clip's origin. Mitigation: treat C2PA signing as non-optional, not a nice-to-have; budget 1 s of latency for it and make it a blocking gate (clip not delivered to CDN without valid credentials).
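A minimal fail-closed delivery gate, assuming a hypothetical `verify_manifest(clip_bytes)` wrapper around a real C2PA verification call; the actual SDK API is not shown here:

```python
def deliver_to_cdn(clip_bytes, verify_manifest, cdn_put):
    # Fail closed: a clip without valid provenance credentials never reaches the CDN.
    if not verify_manifest(clip_bytes):
        raise RuntimeError("C2PA verification failed; delivery blocked")
    return cdn_put(clip_bytes)
```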
Quick check
Removing C2PA signing saves <1 s latency and eliminates a CPU-only signing sidecar. An engineer argues the cost is worth it. What specific failure mode does this enable that no classifier can prevent?
What does a bad day cost?
Video incidents combine the reputational asymmetry of image-gen with higher GPU cost per incident (a failed video clip costs more than a failed image). The detection-window sensitivity is extreme — a 2-minute vs 30-minute detection window can be a 15× difference in both credits issued and harmful clips delivered.
- DiT numerical explosion on adversarial prompt. A prompt crafted to maximize attention weights causes NaN activations partway through the denoising loop, producing a corrupted latent. The VAE decodes it to visual noise; the post-filter rejects it. Direct cost: 50 steps × 2.4 s/step = 120 GPU-seconds = $0.117 per failure. At a 0.1% trigger rate on 20 generations/s of traffic (the sandbox QPS), expected cost is 0.001 × 20 × $0.117 × 3,600 s/hr = $8.42/hr — marginal. Spike scenario: if a public jailbreak pattern triggers NaN at 10% of traffic for 30 minutes before detection, cost is 0.10 × 20 × $0.117 × 1,800 s = $421 plus the reputational cost of degraded service for 30 minutes. Detection window matters: catching in 2 min (via NaN-rate alarm) costs $28 vs 30 min costs $421 — 15× delta; parameterized in code after this list. (Arithmetic verified: 0.001 × 20 × 0.117 × 3600 = $8.42; 0.10 × 20 × 0.117 × 1800 = $421; 0.10 × 20 × 0.117 × 120 = $28.08.)
- Frame-level CSAM/safety filter bypass. An adversarial prompt variant bypasses both text pre-filter and frame-level post-filter for 15 minutes before a trust-and-safety escalation catches it. At 20 concurrent generations with a 2% adversarial traffic rate, roughly 20 × 0.02 = 0.4 concurrent generation slots are adversarial. Over 900 s (15 min), each slot turns over 900 ÷ 120 = 7.5 times: 0.4 × 7.5 = 3 adversarial clips delivered. Each clip may contain 240 frames of harmful content — far higher blast radius than an image incident. Financial cost: minimal (three clips). Actual cost: legal exposure, regulatory inquiry, platform suspension from app stores. Mitigation: treat CSAM post-filter as a blocking hard gate with 99.99% minimum intercept rate; run red-team adversarial eval weekly with mandatory stop-the-deploy on any new bypass pattern. (Arithmetic verified: 0.4 × (900/120) = 0.4 × 7.5 = 3 clips.)
- Queue starvation during free-tier traffic rush. A viral social campaign triggers a 10× spike in free-tier generation requests. Without a weighted fair-share scheduler with a minimum paid-tier floor, the free-tier queue absorbs all GPU capacity. Plus-tier users start missing their 120 s p95 SLO. Duration: 45 minutes (time to detect via per-tier queue-depth alarm, page on-call, and increase paid-tier lane weight). Affected Plus-tier generations: if Plus generates 5 clips/min normally, and 45 min of degraded service means 50% SLO breach rate, roughly 5 × 45 × 0.5 = 112 SLA-credit events. At $0.20 credit per breached generation, SLA credits total $22.40. Engineering hours: 3 hrs on-call × $300/hr internal cost = $900. Total incident cost:~$922. Detection metric: per-tier p95 queue-wait alarm; healthy Plus p95 is <20 s, SLO breach threshold is >60 s. (Arithmetic verified: 5 × 45 × 0.5 = 112.5 ≈ 112; 112 × 0.20 = $22.40; 3 × 300 = $900; 22.40 + 900 = $922.40.)
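The detection-window arithmetic from the NaN incident, parameterized so alarm-latency targets can be costed directly; the 20 generations/s rate is the sandbox community estimate:

```python
def nan_incident_cost(detect_seconds, trigger_rate=0.10, qps=20, cost_per_clip=0.117):
    # Wasted GPU dollars scale linearly with time-to-detect.
    return trigger_rate * qps * cost_per_clip * detect_seconds

print(f"${nan_incident_cost(120):.0f}")   # ~$28 with a 2-minute alarm
print(f"${nan_incident_cost(1800):.0f}")  # ~$421 with a 30-minute detection window
```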
Sora On-call Runbook
Diffusion step explosion (NaN activations)
MTTR p50 / p99: 12 min / 30 min. Blast radius: affected generations produce visual noise and hit the post-filter; users see failed generations. At 0.1% failure rate, marginal cost. At 10%+ rate (jailbreak pattern), service degrades visibly.
- 1. Detect — NaN-rate alarm fires when >0.5% of active jobs report NaN activations in the denoising loop log. Fires within 60 s of onset. Secondary: post-filter rejection rate spike on a normally-passing prompt category.
- 2. Escalate — On-call ML engineer (primary). If NaN rate >5%, page infra lead — could indicate GPU memory corruption, not just adversarial prompt. Pull a sample of affected prompts for forensic analysis.
- 3. Rollback — Add the adversarial prompt pattern as a hard pre-filter rule (regex or embedding-similarity gate). Deploy in <10 min via feature-flag hot-reload. MTTR from detection to mitigation: ~12 min. Restart any GPU worker showing persistent NaN errors.
- 4. Post — Add the bypass pattern to the adversarial golden set. Run weekly eval to confirm the new pre-filter rule catches it with zero false positives on the benign golden set. Log the prompt pattern and its NaN trigger in the safety incident database.
CSAM/safety post-filter bypass
MTTR p50 / p99: 30 min / 4 hr. Blast radius: policy-violating clips delivered to CDN. Even 1 clip is a legal and existential incident. Reputational damage is disproportionate to number of clips.
- 1. Detect — Trust-and-safety weekly red-team eval — this is the primary detection mechanism for novel bypass patterns. Real-time: user reports + automated frame re-scan of all clips delivered in the past 24 hrs on a rolling basis (sampled at 10% of traffic for CPU cost management).
- 2. Escalate — Immediate: page trust-and-safety lead and legal counsel. Suspend delivery of clips containing human faces while investigation is active. Disable the affected prompt category in the pre-filter as a precaution.
- 3. Rollback — Temporarily tighten post-filter threshold (lower confidence threshold to cast a wider net, accepting higher FP rate as a safety-first tradeoff). Pull and manually review all clips from the 2-hour window before detection. Delete any confirmed policy-violating clips from CDN. MTTR to contain: ~30 min.
- 4. Post — Retrain post-filter with the adversarial examples from the bypass. Expand red-team golden set. Add the bypass pattern to the mandatory pre-filter hard-rules list. If the bypass involved a novel visual transformation, file a model-behavior bug with the Sora research team.
Free-tier traffic rush causing Plus-tier queue starvation
MTTR p50 / p99: 8 min / 25 min. Blast radius: Plus-tier users miss 120 s p95 SLO. SLA credits issued. Engineering on-call hours consumed. Reputational damage if sustained >15 min.
- 1. Detect — Per-tier p95 queue-wait alarm fires when Plus p95 exceeds 60 s (2× SLO). Fires within 2 min of onset. Secondary: Plus-tier generation completion rate drops below 90% of 5-min rolling average.
- 2. Escalate — On-call infra engineer (primary). Pull queue-depth dashboard by tier to confirm starvation. If Plus p95 >120 s (SLO breach), page engineering manager.
- 3. Rollback — Hot-reload scheduler config to increase Plus-tier lane weight from 2 to 4 (temporary). Add free-tier burst cap (max 3 concurrent generations per account) via feature flag. MTTR from detection to mitigation: ~8 min.
- 4. Post — Add auto-scaling rule: if Plus p95 queue wait >30 s, automatically increase Plus-tier lane weight until it returns to baseline. Implement free-tier per-account burst limits as a permanent config (review threshold with PM before deploying).
Quick check
Two incidents hit simultaneously: a NaN-explosion at 10% traffic for 30 min ($421 GPU waste) and a CSAM bypass delivering 3 clips over 15 min (minimal direct cost). Where should the on-call engineer's time go first, and why?
Company Lens
OpenAI loop
Sora is an OpenAI product, so interviewers expect depth on the specific design decisions described in the research notes. Expect questions on spacetime patch tokenization — “Why patches rather than frame-by-frame processing?” (answer: patches capture temporal relationships that per-frame processing cannot; the full spacetime-patch token sequence is what allows Sora to model camera motion and scene continuity across seconds). Expect questions on content policy at video scale — “How do you calibrate the celebrity-likeness classifier when the definition of ‘likeness’ is legally ambiguous?” (answer: separate technical threshold from policy threshold; the classifier emits a confidence score; policy — not engineering — sets the threshold above which generation is blocked, and that threshold is different for a private individual vs a public figure vs a politician). Interviewers will also probe the invite-only rollout strategy — not as a PR decision but as an infra one: “What does invite-only buy you that gradual percentage rollout doesn't?” Answer: invite-only gives you a controlled cohort of users whose prompts are known to be creative and constructive (the red-team can study the realistic distribution before adversarial users have access), and it limits the absolute QPS so GPU fleet scaling can be validated before demand is uncapped. A 1% traffic rollout of an open product still exposes you to the full adversarial internet at 1% scale.
Google / DeepMind loop
Google has Veo (announced Google I/O 2024) using a similar latent-DiT architecture. Expect questions on scaling video generation to web-scale: “How do you serve 1M concurrent video generations on a Borg-managed TPU fleet?” Key differences from H100 GPU serving: TPUs require XLA-compiled model graphs rather than PyTorch eager mode; the latent DiT forward pass must be fully static-shape for efficient TPU compilation, which means variable-duration/resolution inputs need padding to fixed shapes before dispatch, not native variable-length token sequences. Also expect questions on multi-modal grounding: “Veo accepts both text and image prompts (image-to-video). How does that change the architecture?” Answer: the text prompt is encoded by a text encoder (T5 or similar); the image prompt is encoded by a vision encoder (ViT); both conditioning signals are concatenated as cross-attention context for the DiT denoising loop. The safety stack also changes: the image pre-filter now scans the input image for policy violations before generation, adding a third pre-filter layer (text + image + generated frames post).
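A sketch of the bucketing this static-shape constraint forces, with illustrative bucket sizes; the point is the padding waste when a clip's token count lands just above a bucket boundary:

```python
BUCKETS = [2048, 4096, 8192, 16384]  # precompiled token-sequence lengths (illustrative)

def pad_to_bucket(num_tokens: int) -> tuple[int, int]:
    """Return (bucket_size, padding_waste) for a clip's spacetime token count."""
    for bucket in BUCKETS:
        if num_tokens <= bucket:
            return bucket, bucket - num_tokens
    raise ValueError("clip exceeds largest compiled bucket; reject or shard the job")

print(pad_to_bucket(8640))  # (16384, 7744): ~47% of the forward pass is padding
```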
Competitive landscape (2024–2025)
Sora's Dec 2024 public launch was preceded and followed by strong competitive entries. Veo 2 (Google DeepMind, Dec 2024) targets 4K resolution with explicit physics understanding and camera motion control — differentiating on cinematic realism over raw clip duration. MovieGen (Meta, Oct 2024) adds joint video+audio generation via a 30B video + 13B audio model pair, targeting social-media content creation where background music and ambient sound are as important as visuals. Wan2.1 (Alibaba, Jan 2025) is the first open-weight model to match or exceed proprietary SOTA on VBench benchmarks (community report), trained on 70M video clips — the open-weight catch-up cycle that image diffusion went through repeated for video in under 12 months. Interview framing: when asked “how does Sora defend its competitive moat?” the honest answer is that the architectural moat is thin (spacetime DiT + flow matching is replicable); the defensible advantages are OpenAI's safety review infrastructure, C2PA provenance pipeline, and integration with ChatGPT's distribution.
Meta loop
Meta's Movie Gen (announced October 2024) is a 30B-parameter video foundation model (per Meta research blog, community estimate on deployment details). Expect questions on open-source tradeoffs: “If Meta open-sources the video model weights, how does the safety architecture change?” Answer: pre-filter and post-filter must shift from being server-side gates to being model-weight-baked safety — fine-tuning the model to refuse harmful prompts rather than relying on an external classifier that users can bypass by running the weights locally. This connects to RLHF-based safety fine-tuning, not infrastructure-level content filtering. Also expect questions on social-media-scale serving: “Reels serves 500M+ users. How would you integrate video generation into that serving stack?” Key insight: AI-generated video in a social feed has the same CDN-delivery path as user-uploaded video, so the marginal infrastructure cost is in the generation step, not the serving step. Meta's existing video-transcoding and CDN infrastructure (hundreds of PoPs globally) can serve generated clips with essentially no incremental engineering effort — the differentiated challenge is generation fleet management and safety at Reels scale (billions of generations/day is many orders of magnitude beyond Sora's current scale).
Key Takeaways
What to remember for interviews
- 1Latent diffusion (32× compression via video VAE) is the prerequisite for video generation — pixel-space diffusion at 1080p would require ~1.5 GB per denoising step, exceeding H100 VRAM capacity. The VAE bottleneck introduces a fidelity ceiling for fine detail (text, fast motion) that cannot be fixed by adding denoising steps.
- 2A 50-step DiT generation that fails at step 48 costs $0.117 in raw GPU time per clip (community estimate: 120 GPU-s × $3.50/hr). Step-level checkpointing every 10 steps reduces the worst-case retry cost by 83%. At scale, this is the difference between $132/hr and $660/hr in wasted GPU time during hardware instability events.
- 3Video safety requires five distinct classifier types (CSAM, celebrity likeness, NCII, violence, disinformation deepfake) versus two for image-gen. CSAM must scan every frame; others can be sampled. C2PA provenance signing is the only mitigation for disinformation deepfakes — undetectable at generation time by classifiers.
- 4Detection-window sensitivity dominates video incident cost. A 15× difference in cost between 2-minute and 30-minute detection windows for the same underlying failure. Invest in alarm sensitivity before incident-response speed.
- 5Invite-only rollout buys infra validation, not just PR management: it limits absolute QPS while the GPU fleet scaling is confirmed, and exposes only constructive users (not the adversarial internet) to the model before red-team coverage of the full adversarial distribution is complete.
- 6Cache hit rate for video generation is near-zero (~5% from exact-match retries). Never apply LLM prefix-cache sizing assumptions to a video fleet — the SLO-cost tradeoffs are driven entirely by GPU-seconds per clip, not token cache efficiency.
Interview Questions
★★★ A Sora generation fails at step 48 of 50, consuming almost the full GPU budget with no deliverable. Walk through two structural mitigations and quantify the expected wasted GPU-seconds saved by each.
★★☆ Why is a 120-second p99 generation latency for Sora not directly comparable to a 120-second p99 for a long-document LLM response, and how should you design the UX and SLO differently?
★★★ Design the safety stack for a service that generates realistic human faces in video. What are the three hardest failure modes, and how do you detect each before a public incident?
★★☆ The Sora team proposes shipping a free tier that allows unlimited generations but enforces a 480p resolution cap and a 5-second duration cap. As the infra lead, what do you push back on, and what do you add?
Recap quiz
Sora Design Review recap
A 1080p 10-second clip is ~1.5 GB of raw pixel data. After Sora's video VAE, that latent fits in H100 VRAM for batching. What is the approximate compressed size and why does that ratio matter for fleet sizing?
A Sora generation fails at step 48 of 50. Without checkpointing, what is the wasted GPU cost per failure, and by what factor does 10-step checkpointing reduce that waste?
In the NaN-explosion incident, catching the adversarial pattern in 2 minutes costs ~$28 while a 30-minute detection window costs ~$421. What is the detection-window ratio, and which alarm metric catches this fastest?
Celebrity-likeness and NCII classifiers run on every 5th frame of a generated clip, but CSAM scanning must run on every frame. Why is sampled scanning acceptable for the first two but not the third?
Sora's VAE compresses video 32× before diffusion. An engineer proposes adding more denoising steps (e.g., 80 instead of 50) to fix blurry text and motion artifacts in outputs. Why won't more steps help?
A viral social campaign triggers a 10× spike in free-tier generation requests. Without a weighted fair-share scheduler, which SLO fails first and how quickly?
DiT-XL/2 (675M params) achieves FID 2.27 on ImageNet vs U-Net models that plateau around FID 3–4 at similar compute. Why does this scaling advantage make DiT the correct backbone choice for Sora specifically?
Further Reading
- OpenAI Sora — Technical Report (2024) — Primary source. Describes spacetime patch tokenization, variable-duration/resolution training, and the diffusion transformer architecture. All Sora architectural claims in this module are sourced here or labeled as community estimates.
- Scalable Diffusion Models with Transformers — DiT (Peebles & Xie, 2023) — The DiT paper that Sora builds on. Demonstrates that replacing U-Net with a transformer backbone for diffusion improves both quality and scaling behavior. The step-budget economics section draws on DiT compute-cost analysis from Table 1.
- High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022) — The latent-diffusion foundation. Explains why denoising in a compressed latent space (4× or 8× spatial compression) rather than pixel space is computationally tractable at video scale, and how the VAE bottleneck affects fidelity.
- C2PA Content Credentials Specification 2.0 — The open standard for AI-generation provenance metadata. Directly relevant to the C2PA signing component in the architecture and the OpenAI company-lens discussion on watermarking and traceability.
- Andrej Karpathy — Neural Networks: Zero to Hero (2022) — Pedagogical foundation for understanding the transformer backbone that DiT — and Sora — build on. Particularly the attention mechanism and positional encoding intuitions needed to reason about spatial-temporal patch tokenization.