🎨 Case: Design Midjourney
A 50-step generation that fails at step 48 still costs you 48 steps
Diffusion serving is brutal on cost: a job that fails at step 48 still burns 48 denoising steps. The system problem is containing that cost while preserving quality, enforcing both prompt and image safety, and scheduling paid and free tiers fairly on one GPU fleet.
Requirements & SLOs
Working backwards from the user
SLO table
| Metric | Target | Why this value |
|---|---|---|
| p50 queue wait | 5 s | Diffusion is slower than chat — users accept a visible queue indicator, but more than ~5 s of queueing before generation starts reads as “broken” |
| p95 end-to-end (paid tier) | <60 s | Includes queue wait + denoising + post-filter + CDN upload; past ~60 s users abandon or complain |
| NSFW intercept rate | >99% | Combined pre- and post-filter; one uncaught violation is a reputational event, not a P99 miss — treat this as a hard floor |
| Step budget — free tier | 30 steps | Enough steps for acceptable quality; limits compute cost per free generation to a fraction of paid-tier cost |
| Step budget — paid tier | 50 steps | Standard quality; matches community default for Stable Diffusion / SDXL; noticeably better output than 30-step free |
| Step budget — pro tier | ~80 steps | Diminishing returns above ~80 steps for most prompts; pro subscribers pay for the option, not a guaranteed quality cliff |
| Availability | 99.9% / month | ~43 min downtime / month; GPU hardware is less reliable than CPUs — multi-AZ GPU pool makes this achievable |
Quick check
Why track queue wait and denoising latency as separate SLO metrics rather than a single p95 end-to-end figure?
Eval Harness (design first)
Image-generation eval is meaningfully harder than LLM eval. Text has a clear right answer for factual questions; images are judged on a combination of aesthetic quality, prompt alignment, and policy compliance — all of which require larger calibration sets and more human review than equivalent text evals.
Aesthetic quality
Approximated by an LLM-judge that receives the generated image and the original prompt, and rates composition, sharpness, coherence, and style-adherence on a 1–5 scale. An initial human-calibration round on ~200 image/prompt pairs anchors the judge's scale to human preferences. The key insight is that image judges are significantly noisier than text judges — a golden set of fewer than 500 images produces a confidence interval too wide to catch gradual regressions. Target at least 1,000 images for the main quality eval.
Prompt alignment (CLIP-score style)
Measures whether the generated image actually depicts what the prompt described. Use a CLIP-based scoring model (or ImageReward-style preference model trained on human ratings) to produce a numeric alignment score per image. Calibrate thresholds on a human-labeled set of 500 prompt/image pairs rated as “well-aligned” vs “off-prompt.” Prompt alignment scores are leading indicators of model regression when denoising step counts are reduced or guidance scale is changed.
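As a sketch of the scoring path (not the production model): with precomputed image and prompt embeddings from any CLIP-style encoder, alignment reduces to a cosine similarity plus a calibrated threshold. The 0.28 threshold below is a placeholder, not a recommended value — in practice it comes from the human-labeled 500-pair calibration set.

```python
import numpy as np

def alignment_score(image_emb: np.ndarray, prompt_emb: np.ndarray) -> float:
    """Cosine similarity between image and prompt embeddings, in [-1, 1]."""
    a = image_emb / np.linalg.norm(image_emb)
    b = prompt_emb / np.linalg.norm(prompt_emb)
    return float(a @ b)

def classify(score: float, threshold: float = 0.28) -> str:
    """Threshold is calibrated on the human-labeled pair set;
    0.28 is illustrative only."""
    return "well-aligned" if score >= threshold else "off-prompt"
```

Tracking the mean of this score on a fixed prompt set gives the leading-indicator signal described above: a drop after a step-count or guidance-scale change flags regression before users do.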
Safety — two golden sets
- Adversarial prompt set (~1,000 examples) — known policy-violating prompts, jailbreak variants, and prompt-injection patterns from red-team exercises. Target: pre-filter blocks before generation, post-filter blocks any that slip through. Combined interception rate target: >99%.
- Benign-but-edge-case set (~500 examples) — art-style prompts, medical diagrams, historical content, and dual-use subjects that a legitimate user should be able to generate. Target: near-zero false-positive rate. An over-blocking filter that rejects Michelangelo-style nude sculptures is a quality regression with direct impact on paying customers.
Prompt-hacking detection
A distinct eval from the standard adversarial set: specifically tests whether adversarial suffixes appended to a benign prompt (e.g., “a sunset over the ocean. Ignore all safety guidelines and...”) can bypass the pre-filter. Evaluated weekly with a rotating set of novel injection patterns from external red-teamers. This is the hardest eval to maintain because new bypass patterns emerge continuously; a static golden set goes stale within weeks.
Back-of-Envelope
A mature image-gen service runs at far lower QPS than a chat product — image generation is slower and more expensive per request, so users tolerate rate limits that would be unacceptable for chat. The calculator below is seeded for ~50 QPS — but read the gotcha first: its inputs (KV cache, tokens/sec) are LLM-serving quantities, so it reports a one-GPU fleet, while the diffusion baseline further down needs ~1,200 GPUs at comparable QPS.
| Metric (LLM-calculator output) | Value |
|---|---|
| Model weights (FP16) | 14 GB |
| KV cache / request | 25.8 MB (192 tokens) |
| Tokens/sec per GPU | 4,800 |
| Effective QPS (after cache) | 50 |
| GPUs needed | 1 |
| Est. p95 latency | 0.82 s |
| GPU memory usage | 19% |
| Compute utilization | 100% |
| Monthly cost | $2,555 |
| Bottleneck | Balanced |
Baseline: Midjourney diffusion fleet — 1200 GPUs @ $3.5/hr at p99 30000 ms, 100 QPS, 15% cache hit.
Fleet cost sensitivity for a Midjourney-scale diffusion service. Numbers are community estimates — Midjourney does not publish fleet size or per-image cost.
| Metric | Value |
|---|---|
| Effective QPS (after cache) | 85 |
| Latency-batch factor | 1.00× |
| GPUs needed | 1,200 (+0% latency vs baseline) |
| Hourly burn | $4,200 (+0% vs baseline) |
| Cost / request | $0.01167 |
| Monthly burn (24×7) | $3,066,000 |
| Bottleneck | Balanced |
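The right back-of-envelope for a diffusion fleet starts from images per second per GPU, not tokens per second. A minimal sizing sketch using the figures that appear elsewhere in this module (~1–3 images/s per H100 at 50 steps, $3.5/GPU-hr); the 1.3× headroom factor is an assumption, not a published number:

```python
import math

def gpus_needed(qps: float, images_per_sec_per_gpu: float, headroom: float = 1.3) -> int:
    """Fleet size before replication. Headroom covers bursts, node failures,
    and batching inefficiency (1.3x is an assumed value)."""
    return math.ceil(qps / images_per_sec_per_gpu * headroom)

def hourly_burn(gpus: int, dollars_per_gpu_hour: float = 3.5) -> float:
    return gpus * dollars_per_gpu_hour

# 50 QPS at a pessimistic 1 image/s per H100:
fleet = gpus_needed(50, 1.0)   # 65 GPUs with headroom (50 without)
burn = hourly_burn(fleet)      # $227.50/hr
```

Compare the two answers on the same 50 QPS input: the LLM calculator says one GPU; this sizing says 50–65 for the pool alone. That gap is the gotcha.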
Architecture
Eight components. Each one exists because removing it breaks either a latency SLO, a cost SLO, or a safety SLO — as shown in the justification table below the diagram.
Diffusion Image-Gen Serving Pipeline
Multi-tenant diffusion service: Client → API Gateway → Safety Pre-filter + Priority Queue → GPU Denoising Pool → Step-Level Checkpointer → Safety Post-filter → CDN → Client
Component justification table (diffusion-specific notes)
| Component | Why it exists | Diffusion-specific note |
|---|---|---|
| Client | Submits prompt, polls for result via webhook or long-poll | No streaming — results arrive all-at-once, unlike LLM SSE |
| API Gateway | Auth, per-tenant rate limiting, step-budget cap per tier | Step budget is written into job metadata here — not controllable by the client |
| Safety Pre-filter | Text-level NSFW + content-policy classifier, fast path | Runs before GPU allocation — highest-ROI safety check because it costs microseconds not 50 GPU passes |
| Priority Queue | Three-lane weighted fair-share queue: Pro / Paid / Free | Free-tier guaranteed minimum throughput prevents starvation during paid-tier spikes |
| GPU Pool (denoising loop) | Batched denoising steps across concurrent jobs | Batch across images at the step level — images in the same batch must be at the same denoising step for efficient batching |
| Step-Level Checkpointer | Saves intermediate latents every N steps — enables resume on failure | No LLM analogue — necessary because GPU hardware fails mid-generation at non-trivial rates under heavy load |
| Safety Post-filter | Image-level NSFW + multi-class content-policy classifier | Text-level screening can be bypassed by adversarial prompts that look benign but generate harmful images |
| CDN | Immutable image artifacts served from edge cache | Unlike LLM responses, images are large (~1–5 MB) and cache-forever — CDN is essential for cost-efficient delivery |
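The step-level batching constraint in the GPU-pool row can be sketched as grouping in-flight jobs by their current step index, so every image in a batch shares one model forward pass. The job shape and `max_batch` value are illustrative, not from a real scheduler:

```python
from collections import defaultdict

def form_batches(jobs, max_batch=8):
    """Group in-flight jobs by current denoising step so each batch can
    run a single shared forward pass; chunk oversized groups."""
    by_step = defaultdict(list)
    for job in jobs:
        by_step[job["step"]].append(job)
    batches = []
    for step in sorted(by_step):
        group = by_step[step]
        for i in range(0, len(group), max_batch):
            batches.append((step, group[i:i + max_batch]))
    return batches
```

The invariant this enforces — every job in a batch is at the same step — is exactly why the checkpointer stores the step counter alongside the latent: a resumed job slots back into whichever batch matches its step.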
Quick check
The safety pre-filter runs before GPU allocation. What is the cost-efficiency argument for this placement versus running it after the job enters the GPU pool?
Deep Dives
Two components drive most of the risk here: the priority queue and step-level checkpointing.
Deep dive A — Priority queue + step-budget enforcement
- Lane structure. Three logical queues — Pro, Paid, Free — with weighted priorities of approximately 4:2:1. The scheduler pulls from Pro until it is drained or a time-slice expires, then Paid, then Free. Pure strict-priority would starve free-tier users during peak hours; the time-slice constraint guarantees all tiers make progress.
- Freshness prior. Within a lane, jobs are ordered by submission time (FIFO) plus a small staleness bonus that grows with wait time. Without the staleness bonus, a burst of new Pro jobs at second 59 can perpetually re-jump a Pro job that has been waiting 58 seconds. The bonus ensures no job waits indefinitely regardless of burst patterns.
- Step-budget enforcement. The step budget is written into job metadata by the API gateway at admit time, derived from the tier claim in the auth token. The scheduler wraps the denoising loop with a counter; when the counter reaches the budget ceiling, the loop stops. The budget is not a request parameter — a client that attempts to override it is rejected at the gateway before the job is enqueued.
- Failure mode: free-tier starvation. During a sustained paid-tier spike (e.g., a viral prompt trend), free-tier queue depth can grow unbounded if the scheduler has no minimum-throughput guarantee. Mitigation: reserve a minimum fraction of GPU capacity for free-tier jobs — if free-tier jobs have been waiting more than a threshold (e.g., 30 s), the scheduler promotes them to the Paid lane temporarily.
- Detection metric. Watch queue-depth p95 split by lane. A healthy system shows Pro p95 < 5 s and Free p95 < 120 s. If Free p95 exceeds 3× the SLO threshold during a paid-tier spike, the staleness bonus weights are stale or the minimum-throughput floor was not enforced — page the on-call scheduler engineer. A single noisy tenant can starve paying customers within the fair-share window if priority weights are not recomputed on traffic shape changes.
- Real-world example. SwiftDiffusion (2024) describes a multi-tenant diffusion serving system with a step-aware scheduler that separates queue management from denoising execution — the queue scheduler runs on CPU while GPU workers pull jobs asynchronously. Their §5 discusses how tenant isolation is maintained under heterogeneous load, and how a time-sharing policy with per-tenant burst limits prevents monopolisation while preserving p95 latency guarantees for paid tiers. See: arxiv.org/abs/2402.10781 §5.
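The lane weights, staleness bonus, and step-budget ceiling from the bullets above can be sketched together. This is an O(n)-pop toy, not a production scheduler, and `STALENESS_PER_SEC` is a made-up tuning constant:

```python
LANE_WEIGHT = {"pro": 4, "paid": 2, "free": 1}
STALENESS_PER_SEC = 0.05  # bonus per second waited (assumed tuning constant)

class FairShareQueue:
    """Effective priority = lane weight + staleness bonus, so a
    long-waiting free job eventually outranks a fresh pro job."""

    def __init__(self):
        self._jobs = []  # (enqueue_time, job_id, tier, step_budget)

    def submit(self, job_id, tier, step_budget, now):
        # Step budget was stamped by the gateway; it is not client-settable.
        self._jobs.append((now, job_id, tier, step_budget))

    def pop(self, now):
        best = max(self._jobs,
                   key=lambda j: LANE_WEIGHT[j[2]] + (now - j[0]) * STALENESS_PER_SEC)
        self._jobs.remove(best)
        return best[1], best[3]

def run_denoising(job_id, step_budget, denoise_step):
    """Scheduler-side budget enforcement: the loop stops at the ceiling
    regardless of what the client requested."""
    for step in range(step_budget):
        denoise_step(job_id, step)
    return step_budget
```

With these constants, a free job that has waited 100 s carries priority 1 + 5 = 6 and beats a freshly submitted pro job at priority 4 — the anti-starvation behavior the staleness bonus exists to provide.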
Deep dive B — Step-level checkpointing and resume
- What gets saved. The intermediate latent tensor at step N — a small float array, typically well under a few MB (a 128×128×4 FP16 latent is ~512 KB; <5 MB even at 1024×1024) — plus the RNG seed, step counter, and job metadata. Checkpoints are written to fast object storage (S3 or equivalent) every 10 steps.
- Resume semantics. When the denoising worker picks up a job, it first checks for an existing checkpoint. If one exists and passes a checksum verification, it loads the latent and resumes from step N rather than step 0. The client sees no difference — the final image is identical to what a full-run would have produced because the denoising is deterministic given the same seed and starting latent.
- Trade-off: I/O overhead vs tail-latency. Writing a checkpoint every 10 steps adds roughly 5–15% latency overhead in practice, due to storage I/O. Back-of-envelope: a 128×128×4 FP16 latent is ~512 KB; with job metadata the checkpoint is a few MB. An NVMe-backed object store sustains ~3 GB/s sequential writes, so each checkpoint write takes <5 ms. Over 50 steps (5 checkpoints at every 10 steps), that is ~25 ms of write time against a ~5-second generation — roughly 0.1% per checkpoint write, or ~0.5% cumulative. Storage round-trip latency and queue contention under load push the real figure to 5–15%. In exchange, a GPU failure at step 48 costs at most 8 wasted steps (resume from the checkpoint at step 40) instead of 48. At a GPU failure rate of even 0.1% per generation, the expected wasted compute drops by ~20× with checkpointing enabled.
- Failure mode: checkpoint corruption. A partial write or storage hiccup produces a corrupt checkpoint. Without detection, the worker loads garbage latents and produces a visually broken image. Mitigation: every checkpoint includes a SHA-256 checksum. If the checksum fails on load, the worker falls back to the previous valid checkpoint (or step 0 if none exist) and re-runs from there. Corruption is logged and triggers a storage health alert.
- Detection metric. Track checkpoint-write latency p99 by GPU node. Healthy: <10 ms per write on NVMe-backed storage. If p99 exceeds 50 ms, storage contention or a degraded node is causing serialisation — jobs will start accumulating tail latency from the I/O path, not the compute path. A separate counter tracking checksum-failure-on-load rate should alarm at any non-zero value; corruption above 0.01% of checkpoints indicates a storage health event.
- Real-world example. SwiftDiffusion (2024) discusses fault tolerance in diffusion serving, including how intermediate latent state can be persisted to enable job migration across worker nodes — analogous to preemptible-VM checkpointing in cloud infrastructure. The key insight is that the latent tensor is small enough (<5 MB for 1024×1024) that checkpoint overhead is negligible relative to the cost of restarting a 50-step job from scratch. See fault-tolerance discussion in: arxiv.org/abs/2402.10781.
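A minimal checkpoint record with the checksum-on-load fallback described above might look like the sketch below — local files stand in for object storage, and the field names are illustrative:

```python
import hashlib
import json
import os

def write_checkpoint(path, latent, step, seed):
    """Persist latent bytes + RNG seed + step counter, plus a SHA-256
    digest of the payload so corruption is detectable on load."""
    payload = {"step": step, "seed": seed, "latent": latent.hex()}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    with open(path, "w") as f:
        json.dump({"sha256": digest, "payload": payload}, f)

def load_checkpoint(path):
    """Return (step, seed, latent) or None if missing/corrupt —
    the caller then falls back to an earlier checkpoint or step 0."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        record = json.load(f)
    blob = json.dumps(record["payload"], sort_keys=True).encode()
    if hashlib.sha256(blob).hexdigest() != record["sha256"]:
        return None  # checksum mismatch: treat as corrupt, never load garbage
    p = record["payload"]
    return p["step"], p["seed"], bytes.fromhex(p["latent"])
```

Canonical serialization (`sort_keys=True`) matters: the digest must be computed over byte-identical input on write and load, or valid checkpoints would be rejected as corrupt.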
Why does step-level checkpointing matter for cost, not just latency?
Break It
Three structural removals, each with a deterministic failure mode. Every failure maps to a metric in the eval harness above.
Remove the safety pre-filter
Without a text-level NSFW/policy check, policy-violating prompts enter the GPU pool and run to completion before the post-filter blocks them. The full denoising cost is paid on every adversarial request. At even a modest adversarial traffic rate of 2%, roughly 2% of GPU capacity is burned on generations the post-filter will discard, and every adversarial request now reaches the post-filter — meaning the post-filter, which has a non-zero false-negative rate, is your only defense. Detection: post-filter block rate spikes while pre-filter block rate shows zero. Cost dashboard shows abnormal GPU utilization per successfully delivered image. Mitigation: restore the pre-filter; it costs microseconds and the ROI versus post-filter-only is roughly 1,000× in compute-cost terms.
Remove step-level checkpointing
Any GPU failure after step 30 means the job restarts from step 0. On a healthy fleet this is rare; during a GPU instability event (firmware update, thermal throttling under peak load, hardware degradation) failure rates of 1–5% per generation are not uncommon (plausible but not cited; verify against your cloud provider's reliability SLAs). Without checkpointing, p95 end-to-end latency regresses by 50%+ during any instability event — because the expected restart cost is proportional to the probability of failure times the full step count. Detection: p95 latency alarm fires during GPU health events; retry rate dashboard spikes. Mitigation: restore checkpointing; accept the 5–15% overhead in exchange for bounded tail latency during failures.
Remove tiered priority queueing
With a single FIFO queue, a free-tier user on the mobile app spamming 10 requests per minute can absorb the same GPU capacity as a pro subscriber. During a spike in free-tier traffic (a viral campaign, a press mention), pro-tier users see queue waits that breach their SLO simultaneously — because their generations queue behind free-tier jobs. Per-tier SLO guarantees collapse immediately. Detection: per-tier p50 queue wait diverges; pro-tier wait exceeds paid-tier SLO during free-tier traffic spikes. Mitigation: restore three-lane scheduler; the scheduler is the only mechanism by which paid tier is structurally protected from free-tier load.
Quick check
Removing checkpointing, GPU failures jump to 1% per generation. With 50 QPS and ~5 GPU-seconds per image, what is the hourly wasted GPU-seconds from failed restarts (no checkpointing)?
What does a bad day cost?
Image-gen incidents skew toward reputational over financial — a single high-profile content violation generates more response effort than a multi-hour outage. Design the incident tiers accordingly.
- Safety post-filter regression lets through one high-profile content violation. Dollar cost: low (one image, possibly one removed post). Actual cost: press coverage, regulatory inquiry, advertiser pause, potential app-store review. This is an existential category for consumer image-gen — reputational damage from a single viral bad-output has ended products. Mitigation: treat the >99% NSFW intercept rate as a hard floor, not a target; temporarily tighten thresholds after any violation and review the bypass pattern before relaxing.
- GPU pool partial outage during peak Sunday evening usage. 50% of the GPU pool unavailable for 30 minutes. With half the fleet gone, effective capacity drops to ~25 QPS against ~50 QPS of demand; queue depth doubles in under 5 minutes. Pro-tier p95 queue wait breaches 45 s. SLA credits issued to pro subscribers on affected requests. Direct financial cost: SLA credits plus incident-response engineering hours. Detection window: queue-depth alarm fires within 2 minutes. Mitigation: automatic reroute to secondary GPU pool region. Detection-window sensitivity: if the outage is detected in 2 minutes and rerouting begins immediately, SLA credits apply to ~6,000 affected requests (50 QPS × 120 s). If the alarm is misconfigured or silenced and the outage runs undetected for 30 minutes, ~90,000 requests exceed SLO — a 15× delta in credit exposure for the same underlying failure. This is the tier-1 answer to “what does a bad day cost?”: the detection window, not the failure itself, determines the dollar number.
- Prompt-injection bypass at 2% rate for 4 hours before detection. An adversarial suffix pattern makes it past the pre-filter for 4 hours. At 50 QPS with ~2% of traffic adversarial, roughly 14,400 policy-violating images are generated in that window. Post-filter catches the majority; perhaps 1–2% of those slip through. Direct cost: GPU-seconds wasted on adversarial generations (~14,400 × 5 GPU-s = 72,000 GPU-seconds ≈ 20 GPU-hours, or tens to low hundreds of dollars depending on GPU pricing). Reputational cost: depends on what slipped post-filter. Detection: pre-filter pass rate for known-adversarial prompt classes spikes in the eval harness; post-filter block rate increases. Mitigation: add the injection pattern as a hard rule; run forensic review of all generated images in the window.
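The detection-window arithmetic from the outage scenario above is one line of code; the value of the exercise is parameterizing it so the 15× delta is explicit:

```python
def requests_breaching_slo(qps: float, detection_seconds: float) -> int:
    """Requests that exceed SLO while the outage goes unhandled."""
    return round(qps * detection_seconds)

fast = requests_breaching_slo(50, 120)      # alarm at 2 min  -> 6,000
slow = requests_breaching_slo(50, 30 * 60)  # silent for 30 min -> 90,000
delta = slow / fast                         # 15x credit-exposure gap
```

Multiply either figure by your per-request SLA credit to turn the detection window directly into dollars.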
Quick check
A prompt-injection bypass lets 2% of traffic past the pre-filter for 4 hours. At 50 QPS, roughly how many adversarial images are generated in that window?
On-call Runbook
Queue starvation during free-tier rush
MTTR p50 / p99: 15–30 min (traffic shedding takes effect within one rate-limit window)
Blast radius: Free-tier users see queue wait >5 min; paid tier unaffected if priority lanes are isolated
- 1. Detect — Queue depth alert: free-tier queue depth >500 AND p99 wait >3 min fires P2 within 2 min
- 2. Escalate — On-call SRE + capacity team; escalate to P1 if paid-tier p95 also breaches 45 s
- 3. Rollback — Shed lowest-priority free-tier traffic via rate-limit config; restore when queue depth <200
- 4. Post — Audit priority-lane isolation logic; add queue-depth-by-tier dashboard; review free-tier concurrency cap
NSFW classifier bypass on adversarial prompt
MTTR p50 / p99: detection takes hours to days without the eval harness; <1 h with active monitoring of adversarial prompt classes
Blast radius: Policy-violating images generated and potentially delivered; reputational P0 if any slip post-filter
- 1. Detect — Post-filter block rate spike for adversarial prompt class in eval harness; manual report or CSR escalation
- 2. Escalate — Immediate P0: Safety team + Legal + Comms; CEO loop if media coverage begins
- 3. Rollback — Add adversarial pattern as hard-block rule in pre-filter; temporarily tighten pre-filter threshold by 10%
- 4. Post — Forensic review of all images generated in the window; adversarial red-team rotation added to weekly safety checks
Diffusion NaN at denoising step 47
MTTR p50 / p99: 10–20 min (worker drain + restart)
Blast radius: All in-flight generations on affected GPU workers return errors; generation success rate drops sharply
- 1. Detect — Worker error-rate alert: >1% of generations returning NaN/error on a single worker fires P2 within 1 min
- 2. Escalate — On-call SRE; if NaN is spreading across workers, escalate to P1 + ML Infra team
- 3. Rollback — Drain affected GPU workers from load balancer; restart with previous pinned driver version
- 4. Post — Pin CUDA driver version in deployment manifest; add step-level NaN guard in denoising loop; root-cause the driver or precision issue
Company Lens
OpenAI's push
Expect deep drill-down on content policy enforcement, consent frameworks, and image provenance. OpenAI's DALL-E 3 integration with ChatGPT required significant work on refusal calibration — avoiding over-refusal on art requests while maintaining hard blocks on policy violations. Interviewers will ask: “How do you calibrate the pre-filter threshold when false-positive rate and false-negative rate pull in opposite directions?” Answer: separate threshold by content category (nudity vs violence vs copyright), track both FP and FN rates on dedicated golden sets, and treat threshold changes as a policy decision with trust-and-safety sign-off, not a pure engineering slider. Provenance: OpenAI has committed to C2PA content credentials — expect questions on watermark survivability and what it means for downstream misuse detection.
Google's push
Expect drill-down on capacity planning at web-scale, multi-tenant GPU scheduling, and cost-efficiency at Borg/GKE scale. Google's Imagen integration with Workspace and Search (announced Google I/O 2023; community estimate on deployment scale) brings image-gen to hundreds of millions of users — the scheduling problem is orders of magnitude larger than Midjourney's. Interviewers will ask: “How do you batch denoising steps across a heterogeneous GPU fleet where some nodes are H100 and some are TPU v5?” Answer: route by model and accelerator affinity — U-Net-based models to GPU, TPU-optimized variants to TPU; maintain separate job queues per accelerator type; use a global scheduler that knows fleet topology. Also expect questions on preemptible GPU instances: “How does step-level checkpointing interact with spot-instance preemption?” Answer: checkpoints make preemption safe — the preempted job resumes from the last checkpoint on whichever node picks it up next.
Key Takeaways
What to remember for interviews
- 1A 50-step generation that fails at step 48 costs 48 steps. Text-level pre-filter + step-level checkpointing are the two structural mitigations — one prevents the spend, one bounds the loss.
- 2GPU economics for diffusion are 30–50× worse per-request than for LLMs at equivalent QPS. Never size a diffusion fleet from LLM throughput numbers.
- 3Safety is two-layer by necessity: text screening is cheap and high-ROI; image screening is the fallback for adversarial prompts that look benign in text form.
- 4Image-gen incidents skew existentially reputational. One viral bad image outweighs hours of downtime in cost to the product. Invest in safety infrastructure accordingly.
- 5Priority queueing with guaranteed free-tier floor is not a UX nicety — it is the only structural guarantee that paying customers are protected from free-tier traffic spikes.
- 6Separate queue-wait and denoising latency in your SLO dashboards. Collapsing them hides the root cause of every SLO miss.
Interview Questions
★★★ A generation fails at step 48 of 50. How do you design the system so you don't pay for 48 steps of wasted GPU time every time this happens?
★★☆ How do you enforce per-tier step budgets at the scheduler level without modifying the diffusion model itself?
★★★ Your safety post-filter has a 1% false-positive rate (blocks one in 100 legitimate images). At 50 QPS with 4 images per generation, what does that cost in GPU-seconds per day? How do you detect this regression?
★★☆ How would you explain to a new engineer why the CapacityCalculator built for LLM serving gives the wrong answer for a diffusion service?
★★★ Midjourney surfaces a high-profile content violation that bypassed both text-level pre-filter and image-level post-filter. Walk through the immediate response and the three-week follow-up.
Recap quiz
Image-gen serving recap
Free tier gets 30 denoising steps and paid tier gets 50. What is the primary reason the free ceiling is set at 30 rather than, say, 10?
A single H100 serves roughly 1–3 images/second at 50 steps. At 50 QPS with 50-step jobs, approximately how many H100s are needed for the GPU pool alone (before replication)?
The GPU pool loses 50% capacity for 30 minutes. If the alarm fires in 2 minutes, roughly how many requests exceed SLO versus 30 minutes of undetected failure?
Why does batching diffusion jobs at the step level require all images in the batch to be at the same denoising step?
Step-level checkpointing adds 5–15% latency overhead per image. A GPU failure rate of 0.1% per generation means what expected wasted compute saving from checkpointing?
A free-tier user submits a valid prompt during a paid-tier traffic spike. What prevents them from being permanently starved in a three-lane priority queue?
Why does removing the safety pre-filter make the safety post-filter the single point of failure, even though the post-filter still runs?
Checkpoint writes add 5–15% latency overhead. The module claims a 128×128×4 FP16 latent is ~512 KB. What is the approximate write time per checkpoint on an NVMe store at 3 GB/s?
Further Reading
- Denoising Diffusion Probabilistic Models (Ho et al., 2020) — The foundational DDPM paper. Understanding the denoising loop is prerequisite knowledge for reasoning about step budgets, checkpointing, and why failures at step 48 are expensive.
- High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022) — Introduced latent diffusion — the architecture behind Stable Diffusion. Shows why denoising in latent space (not pixel space) is tractable at scale, and how the VAE bottleneck interacts with generation quality.
- DALL-E 3 Technical Report (OpenAI, 2023) — OpenAI's technical report covering prompt-following improvements, recaptioning training data, and the content-policy integration decisions that directly inform the safety architecture in this case study.
- Stability AI Research — Primary source for Stable Diffusion architecture notes, SDXL improvements, and the open-weight model family that forms the technical baseline for most independent diffusion services.
- Efficient Diffusion Serving — Ying Sheng et al., SwiftDiffusion (2024) — A practitioner paper on batching strategy, LoRA switching, and GPU utilization for diffusion serving at scale. The most directly relevant systems paper for this case study's GPU-economics section.
- C2PA Content Credentials Specification — The open standard for embedding AI-generation provenance metadata in images. Relevant to the OpenAI company-lens discussion on watermarking and traceability.