🎨 Diffusion Basics
Add Gaussian noise for 1000 steps until the image is pure static. Now learn to reverse it. Somehow this beats GANs at photorealism.
Diffusion models generate images by learning to reverse a noise process. Start with clean data, gradually add noise until it becomes pure Gaussian noise, then train a network to reverse each step. DDPM established the foundations, Latent Diffusion made it practical by working in a compressed latent space, and DiT replaced U-Nets with Transformers for better scaling.
Image → Noise → Image
Forward and Reverse Diffusion Process
What you're seeing: The forward process (top row, left → right) gradually destroys a clean image by adding Gaussian noise at each step until only pure noise remains. The reverse process (bottom row, right → left) runs backwards — a neural network learns to undo each noise step, recovering a clean image from pure noise. What to try: Trace x₀ → x_T along the top, then follow the bottom row right-to-left to see how inference works.
Forward Process (Add Noise) and Reverse Process (Denoise)
Use the buttons to step through the forward diffusion process. The top row shows the forward process (clean to noise), the bottom row shows the reverse (noise to clean). At each step, the forward process adds Gaussian noise, and the reverse process learns to remove it.
The Intuition
Forward process — just add Gaussian noise. At each timestep t, we add a small amount of noise to the image. After T steps (typically T = 1000), the image becomes indistinguishable from pure Gaussian noise. This is fixed, not learned.
Reverse process — learn to denoise. A neural network takes a noisy image and the timestep, and predicts the noise that was added. By subtracting the predicted noise, we recover a slightly cleaner image. Repeating this for all steps generates a clean image from pure noise.
Training objective — predict the noise. Sample a clean image, pick a random timestep, add noise, and train the network to predict that noise. That's it. The loss is just MSE between the true noise and the predicted noise.
Latent diffusion — work in VAE latent space, not pixel space. A 512×512 RGB image has 786K dimensions. A VAE compresses this to a 64×64×4 latent (16K dimensions) — a ~48× reduction. Diffusion in this compact space is dramatically cheaper, which is why Stable Diffusion can run on consumer GPUs.
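A quick sanity check of that arithmetic in plain Python:

pixel_dims = 512 * 512 * 3    # 786,432 values per image in pixel space
latent_dims = 64 * 64 * 4     # 16,384 values per latent
print(pixel_dims, latent_dims, pixel_dims / latent_dims)  # 786432 16384 48.0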
DiT (Diffusion Transformer) — replace U-Net with a Transformer. Patchify the latent into tokens, inject timestep via adaptive layer norm, and use standard transformer blocks. DiT scales more predictably and leverages the transformer ecosystem (FlashAttention, tensor parallelism).
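A minimal sketch of those two ideas — patchifying the latent into tokens and injecting the timestep through adaptive layer norm. This is a simplified stand-in (the real DiT uses adaLN-Zero with gating), and all class and variable names here are illustrative:

import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    """Illustrative transformer block with adaLN timestep conditioning (not the official DiT code)."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Timestep embedding -> per-block scale/shift (the "adaptive" part of adaLN)
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x, t_emb):
        # x: (B, N_tokens, dim), t_emb: (B, dim)
        shift1, scale1, shift2, scale2 = self.ada(t_emb).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1[:, None]) + shift1[:, None]
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2[:, None]) + shift2[:, None]
        return x + self.mlp(h)

# Patchify a 64x64x4 latent into 8x8 patches -> 64 tokens of dim 4*8*8 = 256
latent = torch.randn(2, 4, 64, 64)
tokens = latent.unfold(2, 8, 8).unfold(3, 8, 8)          # (2, 4, 8, 8, 8, 8)
tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(2, 64, 256)
block = DiTBlockSketch(dim=256, n_heads=4)
t_emb = torch.randn(2, 256)   # timestep embedding, assumed precomputed elsewhere
out = block(tokens, t_emb)    # (2, 64, 256)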
DDIM (Denoising Diffusion Implicit Models) — reformulates DDPM's stochastic reverse SDE as a deterministic ODE, enabling large step skips without accumulating error. Sampling with 20–50 steps achieves image quality close to 1000-step DDPM — a 20–50x inference speedup with no retraining.
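A sketch of one deterministic DDIM step (the eta = 0 case), assuming the same alphas_cumprod schedule and noise-prediction model used in the DDPM code later on this page; the function name and calling convention are illustrative:

import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_cumprod):
    """One deterministic DDIM step (eta = 0): jump from timestep t to t_prev."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else alphas_cumprod.new_tensor(1.0)
    eps = model(x_t, torch.tensor([t], device=x_t.device))
    # Predict the clean image implied by the current noise estimate
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # Re-noise it to the (possibly much earlier) timestep t_prev, with no added randomness
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

# Usage sketch: 50 evenly spaced steps instead of 1000
# timesteps = torch.linspace(999, 0, 50).long()
# for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
#     x = ddim_step(model, x, int(t), int(t_prev), alphas_cumprod)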
Classifier-free guidance (CFG) — controls how closely the output follows the text prompt. During training, randomly drop the text conditioning (roughly 10% of the time). At inference, amplify the difference between conditional and unconditional predictions. Higher guidance scale = more text adherence but less diversity.
Flow Matching (2023–2024) — replaces the score-based diffusion objective with a simpler regression target: directly predict the velocity field that transports noise to data along straight paths. Instead of learning to denoise at every timestep, the model learns a vector field v(x, t) such that integrating it carries pure noise to a real image in one continuous flow. Flow matching is faster to train and generates better samples with fewer steps than DDPM — often ~20 steps instead of 1000. Stable Diffusion 3 and FLUX both use rectified flow, a special case where the paths between noise and data are straight lines (hence “rectified”), making the ODE easier to solve with large steps.
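A sketch of the rectified-flow training objective under those assumptions: sample a point on the straight line between a real image and pure noise, then regress the model's output onto the line's constant velocity. The function name is illustrative:

import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0):
    """Rectified-flow objective sketch: x_t lies on the straight line between
    data x0 (t = 0) and noise z (t = 1); the target velocity is the constant z - x0."""
    B = x0.shape[0]
    z = torch.randn_like(x0)                    # pure-noise endpoint of the path
    t = torch.rand(B, device=x0.device)         # uniform time in [0, 1]
    t_ = t.view(B, *([1] * (x0.dim() - 1)))     # broadcast over image dims
    x_t = (1 - t_) * x0 + t_ * z                # point on the straight path
    v_pred = model(x_t, t)                      # model predicts the velocity
    return F.mse_loss(v_pred, z - x0)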
Consistency Models (Song et al., 2023) — map any noisy point directly to the clean image in a single step, without iterative denoising. The key insight: all noisy versions of the same image lie on the same ODE trajectory, so a consistency model is trained to output the same clean image regardless of which noise level it receives. Training enforces this via a consistency loss: two adjacent noise levels for the same image should produce the same output. The result is one-step (or few-step) generation while maintaining quality close to 50-step diffusion — a 25–50x speedup at inference time with minimal quality loss.
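A simplified sketch of the consistency training loss: the same image, noised to two adjacent noise levels with the same noise draw, must map to the same output, with an EMA copy of the model providing the target. This omits the boundary parameterization and loss weighting of the full method; names are illustrative:

import torch
import torch.nn.functional as F

def consistency_training_loss(model, ema_model, x0, sigmas, n):
    """Adjacent noise levels sigmas[n] < sigmas[n+1] of the same image should
    map to the same clean output. ema_model is a slow-moving copy of model."""
    z = torch.randn_like(x0)
    x_hi = x0 + sigmas[n + 1] * z           # more-noised version
    x_lo = x0 + sigmas[n] * z               # less-noised version (same noise draw)
    pred = model(x_hi, sigmas[n + 1])       # student sees the noisier point
    with torch.no_grad():
        target = ema_model(x_lo, sigmas[n])  # EMA target sees the cleaner point
    return F.mse_loss(pred, target)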
Quick check
Stable Diffusion compresses a 512×512 image to a 64×64×4 latent before diffusion. If instead you ran diffusion in pixel space, what is the biggest bottleneck?
Why does Stable Diffusion work in latent space instead of pixel space?
Step-by-Step Derivation
Forward Process q(x_t | x_0)
Add noise at each step with variance schedule β_1, …, β_T: q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) · x_{t−1}, β_t · I). The key insight: with α_t = 1 − β_t and ᾱ_t = α_1 · α_2 · … · α_t, we can jump directly to any timestep t in closed form:

q(x_t | x_0) = N(x_t; √ᾱ_t · x_0, (1 − ᾱ_t) · I)
Equivalently via reparameterization: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, where ε ~ N(0, I).
Reverse Process p_θ(x_{t−1} | x_t)
The learned reverse process denoises one step at a time. The network ε_θ(x_t, t) predicts the noise added at step t:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t² · I), where μ_θ(x_t, t) = (1/√α_t) · (x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t))
Noise Prediction Loss (DDPM)
The simplified training objective — just predict the noise:

L_simple = E_{x_0, t, ε} [ ‖ ε − ε_θ(x_t, t) ‖² ]
Classifier-Free Guidance (CFG)
At inference, interpolate between unconditional and conditional predictions. Guidance scale w amplifies the text signal:

ε_guided = ε_uncond + w · (ε_cond − ε_uncond)
When w = 1, the output is the pure conditional prediction (no guidance). When w ≈ 7.5 (typical), outputs strongly follow the prompt. Higher w = more adherence but less diversity and risk of artifacts.
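In code, guided prediction is two forward passes per denoising step, one with the text embedding and one with a null (empty-prompt) embedding. A sketch assuming a noise-prediction model that accepts a conditioning argument:

import torch

@torch.no_grad()
def cfg_epsilon(model, x_t, t, cond_emb, null_emb, w=7.5):
    """Classifier-free guidance: amplify the conditional direction by w.
    Two forward passes per denoising step (conditional + unconditional)."""
    eps_cond = model(x_t, t, cond_emb)      # with the text prompt
    eps_uncond = model(x_t, t, null_emb)    # with the empty / dropped prompt
    return eps_uncond + w * (eps_cond - eps_uncond)

In practice the conditional and unconditional inputs are usually concatenated into one batch so the two passes run as a single forward call.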
PyTorch: DDPM Training Step
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, noise_schedule):
    """One DDPM training step — predict the noise."""
    batch_size = x0.shape[0]
    T = len(noise_schedule.alphas_cumprod)

    # 1. Sample random timesteps
    t = torch.randint(0, T, (batch_size,), device=x0.device)

    # 2. Sample noise
    epsilon = torch.randn_like(x0)

    # 3. Create noisy image: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar_t = noise_schedule.alphas_cumprod[t][:, None, None, None]
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * epsilon

    # 4. Predict noise
    epsilon_pred = model(x_t, t)

    # 5. Simple MSE loss
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss

PyTorch implementation
# DDPM noise schedule and single denoising step
import torch
import torch.nn.functional as F

def make_cosine_schedule(T: int = 1000):
    """Cosine beta schedule (Improved DDPM)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + 0.008) / 1.008 * torch.pi / 2) ** 2
    alphas_cumprod = f / f[0]
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(0, 0.999).float()

def ddpm_reverse_step(model, x_t, t, betas):
    """One DDPM denoising step: x_t -> x_{t-1}.
    Uses the σ_t² = β_t variance choice from Ho et al., 2020."""
    alpha_bar = (1 - betas[:t+1]).prod()
    alpha = 1 - betas[t]
    t_tensor = torch.tensor([t], device=x_t.device)
    eps_pred = model(x_t, t_tensor)
    # Predicted x_{t-1} mean
    mu = (x_t - betas[t] / (1 - alpha_bar).sqrt() * eps_pred) / alpha.sqrt()
    noise = torch.randn_like(x_t) if t > 0 else 0
    sigma_t = betas[t].sqrt()  # fixed-variance choice σ_t = sqrt(β_t)
    return mu + sigma_t * noise

Quick check
In the CFG formula eps_guided = eps_uncond + w*(eps_cond - eps_uncond), setting w=7.5 requires two U-Net forward passes per step. If a single pass takes 20 ms on your GPU, how long does DDIM at 50 steps take with CFG?
Break It — See What Happens
Quick check
You reduce DDPM inference from 1000 to just 5 steps (no DDIM, just naive subsampling). The output is blurry and artifact-heavy. Why does DDPM specifically fail at 5 steps while DDIM succeeds at 20?
Real-World Numbers
| Model | Params | Architecture | Details |
|---|---|---|---|
| Stable Diffusion 1.5 | ~1B total | U-Net + VAE | Latent diffusion, 64×64 latent, 512×512 output, open-source. U-Net denoiser ≈ 860M params; full pipeline (U-Net + CLIP-L text encoder + VAE) ≈ 1B. |
| DALL-E 3 | Not disclosed | Not disclosed | Recaptioned training data for prompt following; OpenAI has not published parameter count or architecture. |
| Imagen 3 | Not disclosed | Not disclosed | Google; earlier Imagen family used cascaded U-Nets, but Imagen 3 architecture/params are not public. |
| Stable Diffusion 3 | 0.45B – 8B | MMDiT (Esser et al., 2024) | Rectified-flow transformer with joint text-image attention; paper reports a scaling study from 450M to 8B parameters. |
| FLUX.1 [dev] | 12B | Rectified flow transformer | Black Forest Labs; model card lists 12B parameters, rectified flow training. |
| DiT-XL/2 | 675M | DiT (ViT-style transformer) | First diffusion transformer; FID 2.27 on ImageNet 256×256; proved transformer scaling for diffusion. |
Quick check
DiT-XL/2 (675M params) achieved SOTA FID on ImageNet 256×256. FLUX.1 [dev] uses 12B parameters — roughly 18x more. What does this scaling buy?
Video Diffusion Transformers (2024–2025)
The same DiT + latent diffusion recipe that powers image generation was extended to video in 2024. The core architectural extension: replace 2D spatial patches with 3D spacetime patches (height × width × time). This lets the transformer model temporal relationships the same way it models spatial ones — without a separate temporal attention stack.
| Model | Released | Max Res / Duration | Key Contribution |
|---|---|---|---|
| Sora (OpenAI) | Dec 2024 (public) | 1080p / 60s | DiT on spacetime patches; flow matching; 32× latent compression |
| Veo 2 (Google DeepMind) | Dec 2024 | 4K / ~60s | Physics understanding; camera motion control; 4K resolution |
| MovieGen (Meta) | Oct 2024 | 1080p HD / 16s | 30B video + 13B audio model; joint video+audio generation |
| (Alibaba) | Jan 2025 | — | Open-source SOTA; 70M clip training set; beats proprietary on VBench (community report) |
Deep dive: Sora — spacetime patches and flow matching
Sora compresses video into latent space using a video VAE achieving ~32× spatial + temporal compression. The latent is then split into non-overlapping 3D spacetime patches (e.g. 2 frames × 8×8 spatial) which become tokens for the DiT.
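A shape-level sketch of 3D spacetime patchification, using the 2 frames × 8×8 spatial patch size the text gives as an example. The actual video VAE, channel count, and patch size of Sora are not public, so all shapes here are illustrative:

import torch

# Video latent: (batch, channels, frames, height, width) -- shapes illustrative
latent = torch.randn(1, 8, 16, 64, 64)
pt, ph, pw = 2, 8, 8                       # spacetime patch: 2 frames x 8x8 spatial

B, C, T, H, W = latent.shape
patches = (latent
           .unfold(2, pt, pt)              # (B, C, T/pt, H,    W,    pt)
           .unfold(3, ph, ph)              # (B, C, T/pt, H/ph, W,    pt, ph)
           .unfold(4, pw, pw))             # (B, C, T/pt, H/ph, W/pw, pt, ph, pw)
tokens = patches.permute(0, 2, 3, 4, 1, 5, 6, 7).reshape(
    B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)
print(tokens.shape)  # (1, 512, 1024): 8*8*8 = 512 spacetime tokens, each C*pt*ph*pw = 1024 dims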
Unlike per-frame models, the attention mechanism sees temporal and spatial context simultaneously. A token representing frame 5, position (64, 32) directly attends to tokens at frame 3, position (64, 40) — enabling coherent camera motion and object tracking across seconds without explicit temporal encoding tricks.
Sora uses flow matching (not DDPM) for the denoising process: the model learns to predict a straight-line velocity field from noise to the clean video latent, enabling generation in ~20–50 steps rather than 1000.
Deep dive: MovieGen — joint video + audio generation
Meta's MovieGen (Oct 2024) pairs a 30B video transformer with a separate 13B audio transformer. The two models are trained jointly with a shared temporal alignment objective: audio tokens must be synchronised to video frames at 25 fps.
The key architectural challenge: video and audio have very different token rates. At 25 fps, 16 seconds of video produces 400 frame-level tokens; audio at 24 kHz with a compression codec produces ~600 audio tokens per second, or 9,600 for the same clip. Aligning these requires a cross-modal attention mechanism with explicit temporal position embeddings shared between the video and audio streams.
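The mismatch in numbers, using the figures quoted above:

# Video vs. audio token counts for a 16-second clip (rates taken from the text)
fps, audio_tokens_per_sec, seconds = 25, 600, 16
video_tokens = fps * seconds                    # 400 frame-level tokens
audio_tokens = audio_tokens_per_sec * seconds   # 9,600 audio tokens
print(video_tokens, audio_tokens, audio_tokens / video_tokens)  # 400 9600 24.0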
Key Takeaways
What to remember for interviews
1. The forward process gradually adds Gaussian noise over T steps until the image becomes pure noise — it is fixed, not learned, and can be computed in closed form for any timestep t.
2. The reverse process is a neural network trained to predict the noise added at each step; subtracting the predicted noise iteratively recovers a clean image from pure noise.
3. Latent diffusion (Stable Diffusion) runs the entire process in VAE latent space — ~48× fewer dimensions than pixels — making high-resolution generation practical on consumer GPUs.
4. Classifier-free guidance (CFG) amplifies the gap between conditional and unconditional predictions at inference: higher guidance scale means stronger text adherence but less diversity. Sweet spot is typically w = 5–9.
5. Flow matching (SD3, FLUX) replaces iterative denoising with learning a straight-line velocity field from noise to data, enabling high-quality generation in ~20 steps instead of 1000.
6. Video diffusion (Sora, Veo 2, MovieGen) extends the DiT + latent recipe to 3D spacetime patches — the same attention mechanism models temporal coherence that spatial attention models layout, without a separate temporal stack.
Diffusion LLMs (2024–2025)
Diffusion models have dominated image and video generation — but can the same masked-denoising recipe replace autoregressive (AR) transformers for text? Two 2025 papers make the strongest case yet: LLaDA (language diffusion) and Mercury Coder (commercial code model).
| Axis | Autoregressive LLM | Diffusion LLM |
|---|---|---|
| Decoding | Left-to-right, sequential tokens | Parallel denoising of all positions |
| Context | Causal mask (past only) | Bidirectional (full sequence visible) |
| Inference latency | O(N) serial steps | Parallel over all positions; a small number of denoising steps (T ≪ N) |
| Quality at scale | Strong, well-understood scaling laws | Competitive on benchmarks; harder to scale to frontier |
| Training stability | Mature; cross-entropy loss | Harder: masked token prediction, multi-step objective |
| KV cache | Yes — amortises past computation | No standard KV cache; full-sequence attention each step |
LLaDA — Large Language Diffusion with mAsking (Feb 2025)
Renmin University + Ant Group trained an 8B-parameter masked diffusion language model that is competitive with LLaMA-3 8B on general reasoning and instruction following, despite using no autoregressive objective. Training uses a simple masked token prediction loss: randomly mask a fraction of tokens, predict them all in parallel. At inference, the model iteratively unmasks tokens over T denoising steps, with each step predicting all masked positions simultaneously.
Source: arxiv:2502.09992 (GSAI, Renmin University + Ant Group, Feb 2025)
Mercury Coder — Inception Labs (Feb 2025)
The first commercial diffusion LLM, targeting code generation. Mercury Coder achieves over 1,000 tokens/second on NVIDIA H100 GPUs, roughly 10× the throughput of comparable autoregressive models, by decoding all output positions in parallel across a small number of denoising steps. Quality on HumanEval and code benchmarks reaches competitive levels with similarly-sized AR models (per Inception Labs blog, Feb 2025).
Source: inceptionlabs.ai/news/introducing-mercury (Feb 2025)
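A sketch of the parallel-unmasking inference loop both models describe: start from a fully masked sequence and, at each denoising step, commit the highest-confidence predictions while leaving the rest masked. Real systems use more careful remasking and confidence schedules; all names here are illustrative:

import torch

@torch.no_grad()
def masked_diffusion_sample(model, seq_len, mask_token_id, num_steps=8, device="cpu"):
    """Parallel decoding sketch: unmask roughly 1/num_steps of positions per step,
    keeping the highest-confidence predictions at each step."""
    ids = torch.full((1, seq_len), mask_token_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = model(ids)                            # (1, L, vocab)
        conf, pred = logits.softmax(-1).max(-1)        # confidence + argmax per position
        still_masked = ids == mask_token_id
        if not still_masked.any():
            break
        conf = conf.masked_fill(~still_masked, -1.0)   # only consider masked slots
        # Number of positions to commit this step
        n_commit = max(1, int(still_masked.sum().item() / (num_steps - step)))
        commit = torch.zeros_like(still_masked)
        commit.view(-1)[conf.view(-1).topk(n_commit).indices] = True
        ids[commit] = pred[commit]
    return ids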
PyTorch: masked diffusion training objective (LLaDA-style)
import torch
import torch.nn.functional as F

def masked_diffusion_loss(
    model,          # bidirectional transformer
    input_ids,      # (B, L) token ids
    mask_token_id,  # id of the [MASK] token
):
    """
    LLaDA-style masked diffusion training objective.
    Sample a mask rate uniformly in [0, 1] per sequence, mask that fraction of
    tokens, then predict all masked positions in parallel with cross-entropy.
    """
    B, L = input_ids.shape
    # Sample a random mask rate in [0, 1] for each sequence
    rates = torch.rand(B, 1, device=input_ids.device)          # (B, 1)
    mask = torch.rand(B, L, device=input_ids.device) < rates   # (B, L) bool
    noisy_ids = input_ids.clone()
    noisy_ids[mask] = mask_token_id  # replace with [MASK]

    # Model predicts logits over vocab at every position
    logits = model(noisy_ids)  # (B, L, vocab_size)

    # Only compute loss on masked positions
    loss = F.cross_entropy(
        logits[mask],     # (N_masked, vocab_size)
        input_ids[mask],  # (N_masked,)
    )
    return loss

Recap quiz
DDPM requires 1000 denoising steps at inference. Which technique achieves comparable quality in 20–50 steps without retraining?
Why does running diffusion in the VAE latent space (64×64×4) rather than pixel space (512×512×3) reduce compute so dramatically?
A model team wants to ship a text-to-image API with p50 latency under 2 s on a single A100. DDPM with T=1000 takes ~40 s. Which change alone is sufficient?
Stable Diffusion outputs show a specific type of artifact: fine details like hair, text, and fingers are blurry even when large structures are sharp. Which component is most responsible?
During training, CFG randomly drops text conditioning 10% of the time. At inference with w=7.5, the guided noise prediction is: eps_guided = eps_uncond + 7.5*(eps_cond - eps_uncond). What does w=0 produce?
Why does the DDPM training objective predict the noise epsilon rather than directly predicting the clean image x_0?
An interviewer asks: how does DiT (Diffusion Transformer) differ architecturally from the U-Net used in Stable Diffusion 1.x, and what does DiT gain?
Further Reading
- Denoising Diffusion Probabilistic Models — (Ho et al., 2020) — the foundational DDPM paper that revived diffusion models for image generation.
- High-Resolution Image Synthesis with Latent Diffusion Models — (Rombach et al., 2022) — introduced latent diffusion, the architecture behind Stable Diffusion.
- Scalable Diffusion Models with Transformers (DiT) — (Peebles & Xie, 2023) — replaced U-Net with a Vision Transformer, showing clean scaling laws for diffusion.
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis — (Esser et al., 2024) — Stability's Stable Diffusion 3 paper: rectified flow training with MMDiT. FLUX.1 uses the same rectified-flow family but is a separate model from Black Forest Labs with no equivalent paper.
- Classifier-Free Diffusion Guidance — (Ho & Salimans, 2022) — the CFG paper. Explains how to trade sample diversity for prompt fidelity without a separate classifier model.
- Lilian Weng — What are Diffusion Models? — Comprehensive deep-dive covering DDPM, score matching, DDIM, and the connection to stochastic differential equations.
- Yannic Kilcher — DALL·E 2 / Diffusion Models Explained (YouTube) — Visual walkthrough of latent diffusion, CLIP guidance, and how modern text-to-image systems combine these ideas.
- The Illustrated Stable Diffusion — Jay Alammar — step-by-step visual breakdown of the full Stable Diffusion pipeline: text encoder, UNet denoiser, VAE decoder, and CLIP guidance.
Interview Questions
Explain the forward and reverse processes in DDPM. Why is the forward process fixed (not learned)? ★★☆
Why does noise prediction work? Intuitively, why predict the noise rather than the clean image directly? ★★☆
Why does Stable Diffusion work in latent space instead of pixel space? What are the tradeoffs? ★★☆
How does DiT (Diffusion Transformer) differ from the U-Net architecture traditionally used in diffusion models? ★★★
What is classifier-free guidance (CFG) and what happens when guidance scale is too low or too high? ★★★