
Transformer Math

Module 72 · Design Reviews

🎨 Case: Design Midjourney

A 50-step generation that fails at step 48 still costs you 48 steps


Diffusion serving is brutal on cost: a job that fails at step 48 still burns 48 denoising steps. The system problem is containing that cost while preserving quality, enforcing both prompt and image safety, and scheduling paid and free tiers fairly on one GPU fleet.

📋

Requirements & SLOs

Working backwards from the user

“I type a prompt — ‘astronaut on a neon-lit Tokyo street, cinematic’ — and press Generate. A progress indicator appears. I have four variation thumbnails. I pick one, upscale it, and download a full-resolution PNG. If I am a free user I can do this a limited number of times per day. If I am a Pro subscriber, my generations go faster, the images are higher quality (more denoising steps), and I jump the queue during peak hours. I never see NSFW content I didn't ask for, and the service cannot be used to generate content that violates the platform's policies.”

SLO table

| Metric | Target | Why this value |
| --- | --- | --- |
| p50 queue wait | 5 s | Diffusion is slower than chat — users accept a visible queue indicator, but more than ~5 s of queueing before generation starts reads as “broken” |
| p95 end-to-end (paid tier) | 45 s | Includes queue wait + denoising + post-filter + CDN upload; past ~60 s users abandon or complain |
| NSFW intercept rate | >99% | Combined pre- and post-filter; one uncaught violation is a reputational event, not a P99 miss — treat this as a hard floor |
| Step budget — free tier | 30 steps | Enough steps for acceptable quality; limits compute cost per free generation to a fraction of paid-tier cost |
| Step budget — paid tier | 50 steps | Standard quality; matches the community default for Stable Diffusion / SDXL; noticeably better output than 30-step free |
| Step budget — pro tier | | Diminishing returns above ~80 steps for most prompts; pro subscribers pay for the option, not a guaranteed quality cliff |
| Availability | 99.9% / month | ~43 min downtime / month; GPU nodes are less reliable than CPU nodes — a multi-AZ GPU pool makes this achievable |
✨ Insight · Non-obvious SLO: track queue wait and denoising latency separately. A p95 end-to-end SLO of 45 s can be missed two entirely different ways: the queue is backed up (capacity problem) or individual denoising runs are slow (GPU health problem). Collapsing them into one number hides the root cause and leads to the wrong remediation. Separate dashboards, separate alert thresholds. For a video counterpart with the same queue-starvation dynamics, see the Sora video generation case study. For a cross-system SLO and cost comparison, see the SLO & Cost Compare module.
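One practical way to keep the two signals separate is to emit them as distinct histograms from the serving worker. A minimal sketch using the Python prometheus_client library; the metric names and bucket boundaries are illustrative choices, not values from this module.

```python
from prometheus_client import Histogram

# Queue wait: enqueue -> GPU pickup. Buckets bracket the 5 s p50 target.
QUEUE_WAIT = Histogram(
    "imagegen_queue_wait_seconds",
    "Time a job spends in the priority queue before a GPU worker picks it up",
    buckets=(0.5, 1, 2, 5, 10, 30, 60, 120),
)

# Denoising latency: first to last denoising step, excluding queueing and CDN upload.
DENOISE_LATENCY = Histogram(
    "imagegen_denoise_seconds",
    "Wall-clock time of the denoising loop on the GPU",
    buckets=(1, 2, 5, 10, 20, 45, 60),
)

def run_job(job, queue_entered_at, now, denoise_fn):
    # Recording the two SLO components separately lets a p95 breach be
    # attributed to capacity (queue backed up) or GPU health (slow denoising).
    QUEUE_WAIT.observe(now() - queue_entered_at)
    with DENOISE_LATENCY.time():
        return denoise_fn(job)
```

Alert thresholds then attach to each histogram independently, which is exactly the “separate dashboards, separate alert thresholds” point above.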

Quick check

Trade-off

Why track queue wait and denoising latency as separate SLO metrics rather than a single p95 end-to-end figure?

🧪

Eval Harness (design first)

Image-generation eval is meaningfully harder than LLM eval. Text has a clear right answer for factual questions; images are judged on a combination of aesthetic quality, prompt alignment, and policy compliance — all of which require larger calibration sets and more human review than equivalent text evals.

Aesthetic quality

Approximated by an LLM judge that receives the generated image and the original prompt and rates composition, sharpness, coherence, and style adherence on a 1–5 scale. An initial human-calibration round on ~200 image/prompt pairs anchors the judge's scale to human preferences. The key insight is that image judges are significantly noisier than text judges — a golden set of fewer than 500 images produces a confidence interval too wide to catch gradual regressions. Target at least 1,000 images for the main quality eval.

Prompt alignment (CLIP-score style)

Measures whether the generated image actually depicts what the prompt described. Use a CLIP-based scoring model (or ImageReward-style preference model trained on human ratings) to produce a numeric alignment score per image. Calibrate thresholds on a human-labeled set of 500 prompt/image pairs rated as “well-aligned” vs “off-prompt.” Prompt alignment scores are leading indicators of model regression when denoising step counts are reduced or guidance scale is changed.
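A minimal alignment scorer along these lines, assuming the open-source openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in for whatever CLIP or ImageReward-style model production actually uses; the threshold value is hypothetical and would come from the 500-pair calibration set.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_alignment_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the image and prompt embeddings."""
    inputs = _processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = _model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1).item())

# Hypothetical threshold calibrated on the human-labeled "well-aligned" vs
# "off-prompt" set; generations below it are flagged as potential regressions.
ALIGNMENT_THRESHOLD = 0.25

def is_off_prompt(image: Image.Image, prompt: str) -> bool:
    return clip_alignment_score(image, prompt) < ALIGNMENT_THRESHOLD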

Safety — two golden sets

  • Adversarial prompt set (~1,000 examples) — known policy-violating prompts, jailbreak variants, and prompt-injection patterns from red-team exercises. Target: pre-filter blocks before generation, post-filter blocks any that slip through. Combined interception rate target: >99% (the combined-rate arithmetic is sketched after this list).
  • Benign-but-edge-case set (~500 examples) — art-style prompts, medical diagrams, historical content, and dual-use subjects that a legitimate user should be able to generate. Target: near-zero false-positive rate. An over-blocking filter that rejects Michelangelo-style nude sculptures is a quality regression with direct impact on paying customers.
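The >99% combined target is easiest to reason about as two independent layers. A back-of-envelope sketch; the individual recall figures are assumptions for illustration, not measured numbers from this module.

```python
def combined_interception(pre_recall: float, post_recall: float) -> float:
    """Fraction of policy-violating requests blocked by the pre-filter OR the
    post-filter, assuming their misses are statistically independent."""
    return 1 - (1 - pre_recall) * (1 - post_recall)

# Illustrative: a 95%-recall text pre-filter plus a 90%-recall image post-filter
# clears the 99% floor -- but only while the independence assumption holds.
print(combined_interception(0.95, 0.90))   # 0.995
# Adversarial suffixes attack exactly that assumption: prompts engineered to look
# benign in text also tend to produce images near the post-filter's decision
# boundary, so the misses become correlated and the combined rate degrades.
```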

Prompt-hacking detection

A distinct eval from the standard adversarial set: specifically tests whether adversarial suffixes appended to a benign prompt (e.g., “a sunset over the ocean. Ignore all safety guidelines and...”) can bypass the pre-filter. Evaluated weekly with a rotating set of novel injection patterns from external red-teamers. This is the hardest eval to maintain because new bypass patterns emerge continuously; a static golden set goes stale within weeks.

💡 Tip · Why image evals need larger calibration sets. Human raters agree on text quality ~85% of the time. On images, inter-rater agreement drops to ~70% for aesthetic quality — judges disagree on style, composition, and “good enough.” A smaller set that would give ±3 percentage points on a text eval gives ±6+ on images. Budget for 2× the calibration set size compared to an equivalent text eval, and run monthly human-anchor refreshes on a 100-image subsample to detect judge drift.
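The 2× budget rule falls out of the usual sqrt(variance / n) behaviour of a confidence interval. A quick sketch; the per-item noise figures are assumed for illustration only.

```python
import math

def ci_half_width(per_item_std: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence-interval half-width for a mean judge score
    over n rated images (1-5 scale)."""
    return z * per_item_std / math.sqrt(n)

# Assumed per-item judge noise: ~0.5 points for text evals, ~0.7 for images
# (lower inter-rater agreement shows up as higher per-item variance).
for name, std in (("text", 0.5), ("image", 0.7)):
    for n in (500, 1000, 2000):
        print(f"{name:5s} n={n:4d}  CI half-width ~ +/-{ci_half_width(std, n):.3f}")

# Doubling the per-item variance (0.5^2 -> ~0.7^2) requires roughly doubling n
# to recover the same interval width -- hence budgeting ~2x the calibration set.
```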
🧮

Back-of-Envelope

A mature image-gen service runs at far lower QPS than a chat product — image generation is slower and more expensive per request, so users tolerate rate limits that would be unacceptable for chat. The calculator below is seeded for ~50 QPS. Use it to reason about GPU fleet size — but read the gotcha first.

Scenario: 50 QPS image gen, 50 denoising steps per image, H100 GPUs. (Seeded to show approximate fleet size — but see the gotcha below.)
| Calculator input | Value |
| --- | --- |
| Model size | 7B |
| GPU type | H100-80GB |
| QPS target | 50 |
| Input tokens | 128 |
| Output tokens | 64 |
| Cache hit rate | 0% |

| Calculator output | Value |
| --- | --- |
| Model weights (FP16) | 14 GB |
| KV cache / request | 25.8 MB (192 tokens) |
| Tokens/sec per GPU | 4,800 |
| Effective QPS (after cache) | 50 |
| GPUs needed | 1 |
| Est. p95 latency | 0.82 s |
| GPU memory usage | 19% |
| Compute utilization | 100% |
| Bottleneck | Balanced |
| Monthly cost | $2,555 |

⚠ Warning · Gotcha: This calculator was built for LLMs — it understates GPU-seconds for diffusion because one image generation requires roughly 30–100 denoising forward passes, not 1. Each pass runs the full U-Net or DiT on the spatial latent. Real GPU cost is roughly 30–50× what the calculator shows for equivalent QPS. To size a diffusion fleet correctly, start from GPU-seconds per image (benchmark on your target hardware), not from token counts.
⚠ Warning · The 50-step multiplier. An LLM uses one forward pass for prefill and one per output token. A diffusion model uses one forward pass per denoising step — 50 passes for a standard generation. Each pass processes the full spatial latent (e.g., 128×128×4 for a 1024×1024 output), which is compute-intensive in a way that has no LLM analogue. Rule of thumb: a single H100 serves roughly 1–3 images per second at 50 steps (per SwiftDiffusion, arxiv.org/abs/2402.10781 §4), compared to hundreds of LLM decode tokens per second. Design your fleet sizing from this measured baseline, not from LLM throughput numbers.
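Translating that rule of thumb into a fleet size, a small sketch; the utilization target is an assumption, the $3.50/GPU-hr figure is the baseline below, and the throughput range is the 1–3 images/s per H100 quoted above rather than a measured number.

```python
import math

def diffusion_fleet_size(qps: float, images_per_sec_per_gpu: float,
                         target_utilization: float = 0.6) -> int:
    """GPUs needed so steady-state demand stays under the utilization target."""
    gpu_seconds_per_image = 1.0 / images_per_sec_per_gpu
    demanded = qps * gpu_seconds_per_image   # GPU-seconds needed per wall-clock second
    return math.ceil(demanded / target_utilization)

GPU_HOURLY_USD = 3.5
for ips in (1.0, 2.0, 3.0):
    gpus = diffusion_fleet_size(qps=50, images_per_sec_per_gpu=ips)
    monthly = gpus * GPU_HOURLY_USD * 24 * 30
    print(f"{ips:.0f} img/s per GPU -> {gpus} GPUs, ~${monthly:,.0f}/month")
# Tens of GPUs and five- to six-figure monthly spend -- versus the single GPU and
# ~$2.5k/month the token-based calculator above reports for the same 50 QPS.
```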

Baseline (Midjourney diffusion fleet): 1,200 GPUs @ $3.50/hr at p99 30,000 ms, 100 QPS, 15% cache hit.

Fleet cost sensitivity for a Midjourney-scale diffusion service. Numbers are community estimates — Midjourney does not publish fleet size or per-image cost.

| Parameter | Value |
| --- | --- |
| p99 latency target | 30,000 ms |
| Peak QPS | 100 |
| Cache hit rate | 15% |
| Effective QPS (after cache) | 85 |
| Latency-batch factor | 1.00× |
| GPUs needed | 1,200 (+0% latency vs baseline) |
| Hourly burn | $4,200 (+0% vs baseline) |
| Cost / request | $0.01167 |
| Monthly burn (24×7) | $3,066,000 |
| Bottleneck | Balanced |
⚠ Warning · Gotcha: Cache hits on diffusion are low because most prompts are novel. The 15% cache-hit rate shown is optimistic; in practice it is driven almost entirely by upscale retries and prompt near-duplicates. Compare with video generation (lower cache) or chat (higher cache) to see how product type drives the cost curve.
🏛️

Architecture

Eight components. Each one exists because removing it breaks either a latency SLO, a cost SLO, or a safety SLO — as shown in the justification table below the diagram.

Diffusion Image-Gen Serving Pipeline

Multi-tenant diffusion service: Client → API Gateway → Safety Pre-filter + Priority Queue → GPU Denoising Pool → Step-Level Checkpointer → Safety Post-filter → CDN → Client


Component justification table (diffusion-specific notes)

| Component | Why it exists | Diffusion-specific note |
| --- | --- | --- |
| Client | Submits prompt, polls for result via webhook or long-poll | No streaming — results arrive all at once, unlike LLM SSE |
| API Gateway | Auth, per-tenant rate limiting, step-budget cap per tier | Step budget is written into job metadata here — not controllable by the client |
| Safety Pre-filter | Text-level NSFW + content-policy classifier, fast path | Runs before GPU allocation — highest-ROI safety check because it costs microseconds, not 50 GPU passes |
| Priority Queue | Three-lane weighted fair-share queue: Pro / Paid / Free | Free-tier guaranteed minimum throughput prevents starvation during paid-tier spikes |
| GPU Pool (denoising loop) | Runs the batched denoising loop | Batches across images at the step level — images in the same batch must be at the same denoising step for efficient batching (see the sketch below the table) |
| Step-Level Checkpointer | Saves intermediate latents every N steps — enables resume on failure | No LLM analogue — necessary because GPU hardware fails mid-generation at non-trivial rates under heavy load |
| Safety Post-filter | Image-level NSFW + multi-class content-policy classifier | Text-level screening can be bypassed by adversarial prompts that look benign but generate harmful images |
| CDN | Immutable image artifacts served from edge cache | Unlike LLM responses, images are large (~1–5 MB) and cache forever — CDN is essential for cost-efficient delivery |
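The step-level batching constraint from the GPU Pool row, sketched as scheduler logic. Here denoise_step is a stand-in for one batched U-Net/DiT forward pass (e.g., stacking latents along the batch dimension), not a real library call.

```python
from collections import defaultdict

def schedule_step_batches(active_jobs, max_batch=8):
    """Group in-flight jobs by their current denoising step so each group can
    share a single forward pass -- the denoiser is conditioned on the timestep,
    so latents at different steps cannot share one pass."""
    by_step = defaultdict(list)
    for job in active_jobs:
        by_step[job.current_step].append(job)
    for step, jobs in sorted(by_step.items()):
        for i in range(0, len(jobs), max_batch):
            yield step, jobs[i:i + max_batch]

def run_one_round(active_jobs, denoise_step):
    for step, batch in schedule_step_batches(active_jobs):
        latents = [job.latent for job in batch]
        new_latents = denoise_step(latents, step)   # one shared forward pass
        for job, latent in zip(batch, new_latents):
            job.latent = latent
            job.current_step += 1
```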

Quick check

Trade-off

The safety pre-filter runs before GPU allocation. What is the cost-efficiency argument for this placement versus running it after the job enters the GPU pool?

🔬

Deep Dives


Two components drive most of the risk here: the priority queue and step-level checkpointing.

Deep dive A — Priority queue + step-budget enforcement

  • Lane structure. Three logical queues — Pro, Paid, Free — with weighted priorities of approximately 4:2:1. The scheduler pulls from Pro until it is drained or a time-slice expires, then Paid, then Free. Pure strict-priority would starve free-tier users during peak hours; the time-slice constraint guarantees all tiers make progress.
  • Freshness prior. Within a lane, jobs are ordered by submission time (FIFO) plus a small staleness bonus that grows with wait time. Without the staleness bonus, a burst of new Pro jobs at second 59 can perpetually jump ahead of a Pro job that has been waiting 58 seconds. The bonus ensures no job waits indefinitely regardless of burst patterns.
  • Step-budget enforcement. The step budget is written into job metadata by the API gateway at admit time, derived from the tier claim in the auth token. The scheduler wraps the denoising loop with a counter; when the counter reaches the budget ceiling, the loop stops. The budget is not a request parameter — a client that attempts to override it is rejected at the gateway before the job is enqueued.
  • Failure mode: free-tier starvation. During a sustained paid-tier spike (e.g., a viral prompt trend), free-tier queue depth can grow unbounded if the scheduler has no minimum-throughput guarantee. Mitigation: reserve a minimum fraction of GPU capacity for free-tier jobs — if free-tier jobs have been waiting more than a threshold (e.g., 30 s), the scheduler promotes them to the Paid lane temporarily (a minimal scheduler sketch follows this list).
  • Detection metric. Watch queue-depth p95 split by lane. A healthy system shows Pro p95 < 5 s and Free p95 < 120 s. If Free p95 exceeds 3× the SLO threshold during a paid-tier spike, the staleness bonus weights are stale or the minimum-throughput floor was not enforced — page the on-call scheduler engineer. A single noisy tenant can starve paying customers within the fair-share window if priority weights are not recomputed on traffic shape changes.
  • Real-world example. SwiftDiffusion (2024) describes a multi-tenant diffusion serving system with a step-aware scheduler that separates queue management from denoising execution — the queue scheduler runs on CPU while GPU workers pull jobs asynchronously. Their §5 discusses how tenant isolation is maintained under heterogeneous load, and how a time-sharing policy with per-tenant burst limits prevents monopolisation while preserving p95 latency guarantees for paid tiers. See: arxiv.org/abs/2402.10781 §5.
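A minimal sketch of the lane structure, staleness bonus, and anti-starvation promotion described above. The weights and constants are illustrative, and the single combined score is a simplification of the per-lane time-slicing a production scheduler would use; the step budget rides on the job because the gateway, not the client, stamps it.

```python
import heapq
import itertools
import time
from dataclasses import dataclass

LANE_WEIGHT = {"pro": 4.0, "paid": 2.0, "free": 1.0}   # ~4:2:1 fair-share weights
STALENESS_BONUS_PER_SEC = 0.05                          # illustrative; tune against queue-depth SLOs
FREE_PROMOTION_AFTER_SEC = 30                           # starved free jobs borrow the Paid lane

@dataclass
class Job:
    prompt: str
    tier: str
    step_budget: int        # stamped by the API gateway at admit time, never client-supplied
    submitted_at: float

class ThreeLaneQueue:
    def __init__(self):
        self._lanes = {"pro": [], "paid": [], "free": []}
        self._seq = itertools.count()    # tie-breaker keeps pops FIFO within a lane

    def submit(self, job: Job) -> None:
        heapq.heappush(self._lanes[job.tier], (job.submitted_at, next(self._seq), job))

    def _score(self, tier: str, job: Job, now: float) -> float:
        waited = now - job.submitted_at
        if tier == "free" and waited > FREE_PROMOTION_AFTER_SEC:
            tier = "paid"                # anti-starvation promotion
        return LANE_WEIGHT[tier] + waited * STALENESS_BONUS_PER_SEC

    def pop_next(self):
        now = time.time()
        candidates = [(self._score(t, lane[0][2], now), t)
                      for t, lane in self._lanes.items() if lane]
        if not candidates:
            return None
        _, best_tier = max(candidates)
        return heapq.heappop(self._lanes[best_tier])[2]
```

The denoising worker then enforces the budget simply by looping `for step in range(job.step_budget)`; there is nothing for a client to override downstream.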

Deep dive B — Step-level checkpointing and resume

  • What gets saved. The intermediate latent tensor at step N (a small float array, ~128 KB at 128×128×4 in FP16 for a 1024×1024 output) plus the RNG seed, step counter, and job metadata. Checkpoints are written to fast object storage (S3 or equivalent) every 10 steps.
  • Resume semantics. When the denoising worker picks up a job, it first checks for an existing checkpoint. If one exists and passes a checksum verification, it loads the latent and resumes from step N rather than step 0. The client sees no difference — the final image is identical to what a full-run would have produced because the denoising is deterministic given the same seed and starting latent.
  • Trade-off: I/O overhead vs tail-latency. Writing a checkpoint every 10 steps adds roughly 5–15% latency overhead per image due to storage I/O. Back-of-envelope: a 128×128×4 FP16 latent is ~128 KB; with RNG state and job metadata the checkpoint is a few hundred KB. An NVMe-backed object store sustains ~3 GB/s sequential writes, so each checkpoint write completes in well under 5 ms including round-trip. Over 50 steps (5 checkpoints at every 10 steps), that is ~25 ms of write time against a ~5-second generation — roughly 0.1% per checkpoint write, or ~0.5% cumulative. Storage round-trip latency and queue contention under load push the real figure to 5–15%. In exchange, a GPU failure at step 48 wastes only the ~8 steps since the last checkpoint (resume from step 40) instead of all 48. At a GPU failure rate of even 0.1% per generation, the expected wasted compute drops by roughly 5–10× with checkpointing enabled. (A minimal save/resume sketch follows this list.)
  • Failure mode: checkpoint corruption. A partial write or storage hiccup produces a corrupt checkpoint. Without detection, the worker loads garbage latents and produces a visually broken image. Mitigation: every checkpoint includes a SHA-256 checksum. If the checksum fails on load, the worker falls back to the previous valid checkpoint (or step 0 if none exist) and re-runs from there. Corruption is logged and triggers a storage health alert.
  • Detection metric. Track checkpoint-write latency p99 by GPU node. Healthy: <10 ms per write on NVMe-backed storage. If p99 exceeds 50 ms, storage contention or a degraded node is causing serialisation — jobs will start accumulating tail latency from the I/O path, not the compute path. A separate counter tracking checksum-failure-on-load rate should alarm at any non-zero value; corruption above 0.01% of checkpoints indicates a storage health event.
  • Real-world example. SwiftDiffusion (2024) discusses fault tolerance in diffusion serving, including how intermediate latent state can be persisted to enable job migration across worker nodes — analogous to preemptible-VM checkpointing in cloud infrastructure. The key insight is that the latent tensor is small enough (well under 5 MB for 1024×1024) that checkpoint overhead is negligible relative to the cost of restarting a 50-step job from scratch. See fault-tolerance discussion in: arxiv.org/abs/2402.10781.
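A sketch of the save/resume path: latent plus RNG seed plus step counter, guarded by a SHA-256 checksum. The store object is a stand-in for any S3-compatible put/get client, and the job fields and denoise_step callable are assumptions for illustration.

```python
import hashlib
import pickle

CHECKPOINT_EVERY = 10    # steps between checkpoint writes

def save_checkpoint(store, job_id, step, latent, rng_seed) -> None:
    """Serialize latent + seed + step counter and append a SHA-256 checksum."""
    payload = pickle.dumps({"step": step, "latent": latent, "seed": rng_seed})
    digest = hashlib.sha256(payload).hexdigest().encode()
    store.put(f"checkpoints/{job_id}", payload + b"|" + digest)

def load_checkpoint(store, job_id):
    """Return (step, latent, seed) from the latest valid checkpoint, else None."""
    blob = store.get(f"checkpoints/{job_id}")
    if blob is None:
        return None
    payload, sep, digest = blob.rpartition(b"|")
    if not sep or hashlib.sha256(payload).hexdigest().encode() != digest:
        return None          # corrupt checkpoint: caller falls back to step 0 and alerts
    state = pickle.loads(payload)
    return state["step"], state["latent"], state["seed"]

def denoise_with_resume(store, job, denoise_step, total_steps=50):
    resumed = load_checkpoint(store, job.id)
    step, latent, seed = resumed if resumed else (0, job.initial_latent, job.seed)
    while step < total_steps:
        latent = denoise_step(latent, step, seed)   # deterministic given seed + latent
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(store, job.id, step, latent, seed)
    return latent
```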
✨ Insight · Two deep dives, not four. The priority queue and checkpointer are the components with the highest blast radius if wrong — the queue affects every user's wait time and every paying customer's SLO, and the checkpointer determines what every GPU failure costs. Every other component (CDN, post-filter, API gateway) has a clear off-the-shelf design with well-understood failure modes. Deep-dive the novel parts; reference-design the commodity parts.
Quick Check

Why does step-level checkpointing matter for cost, not just latency?

🔧

Break It

Three structural removals, each with a deterministic failure mode. Every failure maps to a metric in the eval harness above.

Remove the safety pre-filter

Without a text-level NSFW/policy check, policy-violating prompts enter the GPU pool and run to completion before the post-filter blocks them. The full denoising cost is paid on every adversarial request. At even a modest adversarial traffic rate of 2%, this doubles effective GPU cost on malicious traffic and guarantees a post-filter interception event on every adversarial generation — meaning the post-filter, which has a non-zero false-negative rate, is your only defense. Detection: post-filter block rate spikes while pre-filter block rate shows zero. Cost dashboard shows abnormal GPU utilization per successful delivered image. Mitigation: restore pre-filter; it costs microseconds and the ROI versus post-filter-only is roughly 1,000× in compute-cost terms.

Remove step-level checkpointing

Any GPU failure after step 30 means the job restarts from step 0. On a healthy fleet this is rare; during a GPU instability event (firmware update, thermal throttling under peak load, hardware degradation) failure rates of 1–5% per generation are not uncommon (plausible but not cited; verify against your cloud provider's reliability SLAs). Without checkpointing, p95 end-to-end latency regresses by 50%+ during any instability event — because the expected restart cost is proportional to the probability of failure times the full step count. Detection: p95 latency alarm fires during GPU health events; retry rate dashboard spikes. Mitigation: restore checkpointing; accept the 5–15% overhead in exchange for bounded tail latency during failures.

Remove tiered priority queueing

With a single FIFO queue, a free-tier user on the mobile app spamming 10 requests per minute can absorb the same GPU capacity as a pro subscriber. During a spike in free-tier traffic (a viral campaign, a press mention), pro-tier users see queue waits that breach their SLO simultaneously — because their generations queue behind free-tier jobs. Per-tier SLO guarantees collapse immediately. Detection: per-tier p50 queue wait diverges; pro-tier wait exceeds paid-tier SLO during free-tier traffic spikes. Mitigation: restore three-lane scheduler; the scheduler is the only mechanism by which paid tier is structurally protected from free-tier load.

Quick check

Derivation

Removing checkpointing, GPU failures jump to 1% per generation. With 50 QPS and ~5 GPU-seconds per image, what is the hourly wasted GPU-seconds from failed restarts (no checkpointing)?

💸

What does a bad day cost?

Image-gen incidents skew toward reputational over financial — a single high-profile content violation generates more response effort than a multi-hour outage. Design the incident tiers accordingly.

  • Safety post-filter regression lets through one high-profile content violation. Dollar cost: low (one image, possibly one removed post). Actual cost: press coverage, regulatory inquiry, advertiser pause, potential app-store review. This is an existential category for consumer image-gen — reputational damage from a single viral bad-output has ended products. Mitigation: treat the >99% NSFW intercept rate as a hard floor, not a target; temporarily tighten thresholds after any violation and review the bypass pattern before relaxing.
  • GPU pool partial outage during peak Sunday evening usage. 50% of the GPU pool unavailable for 30 minutes. At 50 QPS steady load, effective capacity drops to ~25 QPS; queue depth doubles in under 5 minutes. Pro-tier p95 queue wait breaches 45 s. SLA credits issued to pro subscribers on affected requests. Direct financial cost: SLA credits plus incident-response engineering hours. Detection window: queue-depth alarm fires within 2 minutes. Mitigation: automatic reroute to secondary GPU pool region. Detection-window sensitivity: if the outage is detected in 2 minutes and rerouting begins immediately, SLA credits apply to ~6,000 affected requests (50 QPS × 120 s). If the alarm is misconfigured or silenced and the outage runs undetected for 30 minutes, ~90,000 requests exceed SLO — a 15× delta in credit exposure for the same underlying failure. This is the tier-1 answer to “what does a bad day cost?”: the detection window, not the failure itself, determines the dollar number. (The arithmetic is sketched after this list.)
  • Prompt-injection bypass at 2% rate for 4 hours before detection. An adversarial suffix pattern makes it past the pre-filter for 4 hours. At 50 QPS with ~2% of traffic adversarial, roughly 14,400 policy-violating images are generated in that window. Post-filter catches the majority; perhaps 1–2% of those slip through. Direct cost: GPU-seconds wasted on adversarial generations (~14,400 × 5 GPU-s = 72,000 GPU-seconds ≈ 20 GPU-hours, or tens to low hundreds of dollars depending on GPU pricing). Reputational cost: depends on what slipped post-filter. Detection: pre-filter pass rate for known-adversarial prompt classes spikes in the eval harness; post-filter block rate increases. Mitigation: add the injection pattern as a hard rule; run forensic review of all generated images in the window.
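The detection-window and prompt-injection numbers above, as a quick calculation (the $3.50/GPU-hr price is the baseline figure from the back-of-envelope section):

```python
def requests_over_slo(qps: float, undetected_seconds: float) -> int:
    """Worst case: every request during the undetected window misses its SLO."""
    return int(qps * undetected_seconds)

print(requests_over_slo(50, 120))     # alarm at 2 min    ->  6,000 requests
print(requests_over_slo(50, 1800))    # silent for 30 min -> 90,000 requests (15x)

def adversarial_waste(qps, adversarial_frac, hours, gpu_sec_per_image, usd_per_gpu_hr=3.5):
    images = qps * adversarial_frac * hours * 3600
    gpu_hours = images * gpu_sec_per_image / 3600
    return int(images), gpu_hours * usd_per_gpu_hr

print(adversarial_waste(50, 0.02, 4, 5))   # ~14,400 images, ~$70 of wasted GPU time
```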
⚠ Warning · Asymmetry: image-gen incidents skew toward reputational, not financial. An LLM service that goes down costs SLA credits. An image-gen service that generates one viral bad image costs the trust of an entire user base and potentially triggers regulatory action. The engineering implication: invest in safety infrastructure at a level that looks disproportionate relative to the financial downside — because the reputational downside is existential.

Quick check

Derivation

A prompt-injection bypass lets 2% of traffic past the pre-filter for 4 hours. At 50 QPS, roughly how many adversarial images are generated in that window?

🚨

On-call Runbook

Queue starvation during free-tier rush

MTTR p50 / p99: 15–30 min (traffic shedding takes effect within one rate-limit window)

Blast radius: Free-tier users see queue wait >5 min; paid tier unaffected if priority lanes are isolated

  1. Detect: Queue-depth alert: free-tier queue depth >500 AND p99 wait >3 min fires a P2 within 2 min
  2. Escalate: On-call SRE + capacity team; escalate to P1 if paid-tier p95 also breaches 45 s
  3. Rollback: Shed lowest-priority free-tier traffic via rate-limit config; restore when queue depth <200
  4. Post: Audit priority-lane isolation logic; add queue-depth-by-tier dashboard; review free-tier concurrency cap

NSFW classifier bypass on adversarial prompt

MTTR p50 / p99: detection takes hours to days without the eval harness, <1 h with active monitoring of adversarial prompt classes

Blast radius: Policy-violating images generated and potentially delivered; reputational P0 if any slip post-filter

  1. Detect: Post-filter block rate spike for the adversarial prompt class in the eval harness; manual report or CSR escalation
  2. Escalate: Immediate P0: Safety team + Legal + Comms; CEO loop if media coverage begins
  3. Rollback: Add the adversarial pattern as a hard-block rule in the pre-filter; temporarily tighten the pre-filter threshold by 10%
  4. Post: Forensic review of all images generated in the window; adversarial red-team rotation added to weekly safety checks

Diffusion NaN at denoising step 47

MTTR p50 / p99: 10–20 min (worker drain + restart)

Blast radius: All in-flight generations on affected GPU workers return error; generation success rate drops sharply

  1. Detect: Worker error-rate alert: >1% of generations returning NaN/error on a single worker fires a P2 within 1 min
  2. Escalate: On-call SRE; if NaN is spreading across workers, escalate to P1 + ML Infra team
  3. Rollback: Drain affected GPU workers from the load balancer; restart with the previous pinned driver version
  4. Post: Pin the CUDA driver version in the deployment manifest; add a step-level NaN guard in the denoising loop; root-cause the driver or precision issue
🏢

Company Lens

OpenAI's push

Expect deep drill-down on content policy enforcement, consent frameworks, and image provenance. OpenAI's DALL-E 3 integration with ChatGPT required significant work on refusal calibration — avoiding over-refusal on art requests while maintaining hard blocks on policy violations. Interviewers will ask: “How do you calibrate the pre-filter threshold when false-positive rate and false-negative rate pull in opposite directions?” Answer: separate threshold by content category (nudity vs violence vs copyright), track both FP and FN rates on dedicated golden sets, and treat threshold changes as a policy decision with trust-and-safety sign-off, not a pure engineering slider. Provenance: OpenAI has committed to C2PA content credentials — expect questions on watermark survivability and what it means for downstream misuse detection.

Google's push

Expect drill-down on capacity planning at web-scale, multi-tenant GPU scheduling, and cost-efficiency at Borg/GKE scale. Google's Imagen integration with Workspace and Search (announced Google I/O 2023; community estimate on deployment scale) brings image-gen to hundreds of millions of users — the scheduling problem is orders of magnitude larger than Midjourney's. Interviewers will ask: “How do you batch denoising steps across a heterogeneous GPU fleet where some nodes are H100 and some are TPU v5?” Answer: route by model family — U-Net models to GPU, TPU-optimized variants to TPU; maintain separate job queues per accelerator type; use a global scheduler that knows fleet topology. Also expect questions on preemptible GPU instances: “How does step-level checkpointing interact with spot-instance preemption?” Answer: checkpoints make preemption safe — the preempted job resumes from the last checkpoint on whichever node picks it up next.

🧠

Key Takeaways

What to remember for interviews

  1. A 50-step generation that fails at step 48 costs 48 steps. Text-level pre-filter + step-level checkpointing are the two structural mitigations — one prevents the spend, one bounds the loss.
  2. GPU economics for diffusion are 30–50× worse per-request than for LLMs at equivalent QPS. Never size a diffusion fleet from LLM throughput numbers.
  3. Safety is two-layer by necessity: text screening is cheap and high-ROI; image screening is the fallback for adversarial prompts that look benign in text form.
  4. Image-gen incidents skew existentially reputational. One viral bad image outweighs hours of downtime in cost to the product. Invest in safety infrastructure accordingly.
  5. Priority queueing with a guaranteed free-tier floor is not a UX nicety — it is the only structural guarantee that paying customers are protected from free-tier traffic spikes.
  6. Separate queue-wait and denoising latency in your SLO dashboards. Collapsing them hides the root cause of every SLO miss.
🎯

Interview Questions


A generation fails at step 48 of 50. How do you design the system so you don't pay for 48 steps of wasted GPU time every time this happens?

★★★
OpenAI · Google

How do you enforce per-tier step budgets at the scheduler level without modifying the diffusion model itself?

★★☆
Google · Meta

Your safety post-filter has a 1% false-positive rate (blocks one in 100 legitimate images). At 50 QPS with 4 images per generation, what does that cost in GPU-seconds per day? How do you detect this regression?

★★★
OpenAI · Anthropic

How would you explain to a new engineer why the CapacityCalculator built for LLM serving gives the wrong answer for a diffusion service?

★★☆
Google · Meta

Midjourney surfaces a high-profile content violation that bypassed both text-level pre-filter and image-level post-filter. Walk through the immediate response and the three-week follow-up.

★★★
OpenAI · Anthropic
🧠

Recap quiz


Image-gen serving recap

Trade-off

Free tier gets 30 denoising steps and paid tier gets 50. What is the primary reason the free ceiling is set at 30 rather than, say, 10?

Derivation

A single H100 serves roughly 1–3 images/second at 50 steps. At 50 QPS with 50-step jobs, approximately how many H100s are needed for the GPU pool alone (before replication)?

Derivation

The GPU pool loses 50% capacity for 30 minutes. If the alarm fires in 2 minutes, roughly how many requests exceed SLO versus 30 minutes of undetected failure?

Derivation

Why does batching diffusion jobs at the step level require all images in the batch to be at the same denoising step?

Derivation

Step-level checkpointing adds 5–15% latency overhead per image. A GPU failure rate of 0.1% per generation means what expected wasted compute saving from checkpointing?

Trade-off

A free-tier user submits a valid prompt during a paid-tier traffic spike. What prevents them from being permanently starved in a three-lane priority queue?

Trade-off

Why does removing the safety pre-filter make the safety post-filter the single point of failure, even though the post-filter still runs?

Derivation

Checkpoint writes add 5–15% latency overhead. The module puts a checkpoint at a few hundred KB. What is the approximate write time for a ~500 KB checkpoint on an NVMe store sustaining 3 GB/s?
📚

Further Reading