
Transformer Math

Module 34 · Architectures

🎨 Diffusion Basics

Add Gaussian noise for 1000 steps until the image is pure static. Now learn to reverse it. Somehow this beats GANs at photorealism.


Diffusion models generate images by learning to reverse a noise process. Start with clean data, gradually add noise until it becomes pure Gaussian noise, then train a network to reverse each step. DDPM established the foundations, Latent Diffusion made it practical by working in a compressed latent space, and DiT replaced U-Nets with Transformers for better scaling.

🎨

Image → Noise → Image

Forward and Reverse Diffusion Process

What you're seeing: The forward process (top row, left → right) gradually destroys a clean image by adding Gaussian noise at each step until only pure noise remains. The reverse process (bottom row, right → left) runs backwards — a neural network learns to undo each noise step, recovering a clean image from pure noise. What to try: Trace x₀ → x_T along the top, then follow the bottom row right-to-left to see how inference works.

Diagram — Forward process q (add Gaussian noise, left → right): x₀ (clean) → x₁ (slight noise) → x₂ (more noise) → … → x_{T-1} (heavy noise) → x_T (pure noise). q(x_t | x_{t-1}) follows a fixed schedule — no learning required. Reverse process p_θ (learned denoising, right → left). Training: learn to reverse each step. Inference: sample x_T ~ N(0, I), then denoise step by step to get x₀.
🎮

Forward and Reverse Diffusion

What you’re seeing: the two-phase diffusion process — the forward pass progressively adds Gaussian noise until the image is pure noise (left to right), and the learned reverse pass denoises step-by-step to recover a sample (right to left). What to try: notice how the forward and reverse processes are symmetric in number of steps, but only the reverse pass requires a neural network.

Forward Process (Add Noise) and Reverse Process (Denoise)

Use the buttons to step through the forward diffusion process. The top row shows the forward process (clean to noise), the bottom row shows the reverse (noise to clean). At each step, the forward process adds Gaussian noise, and the reverse process learns to remove it.

Forward process q(x_t | x_0), add noise: t = 0 (clean image) → 200 → 400 → 600 → 800 → 1000 (pure noise). Reverse process p_θ(x_{t-1} | x_t), denoise: t = 1000 → 800 → 600 → 400 → 200 → 0.
💡

The Intuition

Forward process — just add Gaussian noise. At each timestep t, we add a small amount of noise to the image. After T steps (typically T = 1000), the image becomes indistinguishable from pure Gaussian noise. This is fixed, not learned.

Reverse process — learn to denoise. A neural network takes a noisy image and the timestep, and predicts the noise that was added. By subtracting the predicted noise, we recover a slightly cleaner image. Repeating this for all steps generates a clean image from pure noise.

Training objective — predict the noise. Sample a clean image, pick a random timestep, add noise, and train the network to predict that noise. That's it. The loss is just MSE between the true noise and the predicted noise.

Latent diffusion — work in VAE latent space, not pixel space. A 512×512 RGB image has ~786K dimensions. A VAE compresses this to a 64×64×4 latent (~16K dimensions) — roughly a 48× reduction. Diffusion in this compact space is dramatically cheaper, which is why Stable Diffusion can run on consumer GPUs.
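A quick sanity check of that compression factor, as plain arithmetic:

python
pixel_dims = 512 * 512 * 3    # RGB pixel space: 786,432 values per image
latent_dims = 64 * 64 * 4     # Stable Diffusion latent: 16,384 values
print(pixel_dims, latent_dims, pixel_dims / latent_dims)  # 786432 16384 48.0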

DiT (Diffusion Transformer) — replace U-Net with a Transformer. Patchify the latent into tokens, inject timestep via adaptive layer norm, and use standard transformer blocks. DiT scales more predictably and leverages the transformer ecosystem (FlashAttention, tensor parallelism).
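To make "inject timestep via adaptive layer norm" concrete, here is a minimal sketch of a DiT-style block. It is a simplified illustration under stated assumptions: real DiT uses adaLN-Zero (gates initialised to zero) plus patchify/unpatchify around the blocks, and the class and argument names below are made up for the example.

python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Sketch of a DiT-style block: timestep conditioning via adaptive LayerNorm."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Timestep embedding -> per-block shift/scale/gate parameters
        self.adaLN = nn.Linear(dim, 6 * dim)

    def forward(self, x, t_emb):
        # x: (B, N, dim) patch tokens; t_emb: (B, dim) timestep embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1[:, None]) + shift1[:, None]
        x = x + gate1[:, None] * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2[:, None]) + shift2[:, None]
        x = x + gate2[:, None] * self.mlp(h)
        return x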

DDIM (Denoising Diffusion Implicit Models) — reformulates DDPM's stochastic reverse SDE as a deterministic ODE, enabling large step skips without accumulating error. Sampling with only 20–50 steps achieves image quality close to 1000-step DDPM — a 20–50× inference speedup with no retraining.
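A sketch of one deterministic DDIM update (the η = 0 case), using the same ε-prediction convention as the rest of this page: predict x₀ from the current sample, then re-noise it to an earlier timestep t_prev, which may skip many DDPM steps. The function and argument names are illustrative.

python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_cumprod):
    """Deterministic DDIM step (eta = 0): x_t -> x_{t_prev}, with t_prev possibly far below t."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.ones(())
    eps = model(x_t, torch.tensor([t], device=x_t.device))
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # estimate of the clean image
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # re-noise to level t_prev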

Classifier-free guidance (CFG) — controls how closely the output follows the text prompt. During training, randomly drop the text conditioning (typically ~10% of the time). At inference, amplify the difference between conditional and unconditional predictions. Higher guidance scale = more text adherence but less diversity.

Flow Matching (2023–2024) — replaces the score-based diffusion objective with a simpler regression target: directly predict the velocity field that transports noise to data along straight paths. Instead of learning to denoise at every timestep, the model learns a vector field v(x, t) such that integrating it carries pure noise to a real image in one continuous flow. Flow matching is faster to train and generates better samples with fewer steps than DDPM — high-quality samples in roughly 20 steps instead of 1000. Stable Diffusion 3 and FLUX both use rectified flow, a special case where the paths between noise and data are straight lines (hence "rectified"), making the ODE easier to solve with large steps.
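A minimal conditional flow-matching (rectified-flow) training loss, assuming a model v_theta(x_t, t) that predicts a velocity. Conventions for the path direction vary between papers, so treat this as one common choice rather than the exact SD3/FLUX recipe:

python
import torch
import torch.nn.functional as F

def rectified_flow_loss(v_theta, x0):
    """Regress the constant velocity along the straight path between noise and data."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)    # t ~ U(0, 1)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))         # reshape for broadcasting
    x_t = (1 - t_) * noise + t_ * x0                 # straight path: noise at t=0, data at t=1
    target_v = x0 - noise                            # d x_t / dt is constant along the path
    return F.mse_loss(v_theta(x_t, t), target_v)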

Consistency Models (Song et al., 2023) — map any noisy point directly to the clean image in a single step, without iterative denoising. The key insight: all noisy versions of the same image lie on the same ODE trajectory, so a consistency model is trained to output the same clean image regardless of which noise level it receives. Training enforces this via a consistency loss: two adjacent noise levels for the same image should produce the same output. The result is one-step (or few-step) generation while maintaining quality close to 50-step diffusion — a 25–50× speedup at inference time with minimal quality loss.
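A stripped-down sketch of that consistency loss: the same image noised to two adjacent levels should map to the same output, with an EMA copy of the model providing the target. Real consistency models also add a skip-connection parameterisation that forces f(x, σ_min) = x; that detail is omitted here, and the function and argument names are illustrative.

python
import torch
import torch.nn.functional as F

def consistency_training_loss(f_theta, f_ema, x0, sigmas, n):
    """
    f_theta / f_ema: online and EMA copies of the consistency model,
    both mapping (noisy x, sigma) -> estimate of the clean image.
    sigmas: increasing noise levels sigma_0 < sigma_1 < ...; n indexes the lower level.
    """
    z = torch.randn_like(x0)
    x_hi = x0 + sigmas[n + 1] * z         # same image, adjacent higher noise level
    x_lo = x0 + sigmas[n] * z             # same image, lower noise level (same z!)
    pred_hi = f_theta(x_hi, sigmas[n + 1])
    with torch.no_grad():
        pred_lo = f_ema(x_lo, sigmas[n])  # target from the slowly-updated EMA model
    return F.mse_loss(pred_hi, pred_lo)   # outputs along the same trajectory must agree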

✨ Insight · The genius of diffusion: the forward process is trivial (just add noise), which gives a simple training target (predict the noise). All the complexity lives in the neural network — and we know how to scale those.

Quick check

Derivation

Stable Diffusion compresses a 512×512 image to a 64×64×4 latent before diffusion. If instead you ran diffusion in pixel space, what is the biggest bottleneck?

Quick Check

Why does Stable Diffusion work in latent space instead of pixel space?

📐

Step-by-Step Derivation

Forward Process q(x_t | x_0)

Add noise at each step t with variance schedule β_t. Define α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s. The key insight: we can jump directly to any timestep t using ᾱ_t:

q(x_t | x_0) = N(x_t; √ᾱ_t · x_0, (1 − ᾱ_t) I)

Equivalently via reparameterization: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, where ε ~ N(0, I).

Reverse Process p_θ

The learned reverse process denoises one step at a time. The network ε_θ(x_t, t) predicts the noise added at step t:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² I), where μ_θ(x_t, t) = (1/√α_t) · (x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t))

Noise Prediction Loss (DDPM)

The simplified training objective — just predict the noise:

L_simple(θ) = E_{x₀, t, ε} [ ‖ε − ε_θ(x_t, t)‖² ]

💡 Tip · This is a simple MSE loss. Sample x₀ from data, sample t ~ Uniform{1, …, T}, sample ε ~ N(0, I), compute x_t, predict ε_θ(x_t, t), minimize MSE.

Classifier-Free Guidance (CFG)

At inference, interpolate between unconditional and conditional predictions. Guidance scale w amplifies the text signal:

ε_guided = ε_uncond + w · (ε_cond − ε_uncond)

When w = 1, the output is the pure conditional prediction (no guidance). When w ≈ 5–9 (7.5 is typical), outputs strongly follow the prompt. Higher w = more adherence but less diversity and risk of artifacts.
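In code, the guided prediction costs two forward passes per denoising step: one with the prompt embedding and one with the "null" embedding used when text was dropped during training. A minimal sketch (the model signature is an assumption, not a specific library API):

python
import torch

@torch.no_grad()
def cfg_noise_prediction(model, x_t, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance: amplify the conditional signal by the guidance scale w."""
    eps_cond = model(x_t, t, text_emb)      # conditioned on the prompt
    eps_uncond = model(x_t, t, null_emb)    # unconditional ("dropped text") prediction
    return eps_uncond + w * (eps_cond - eps_uncond)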

PyTorch: DDPM Training Step

python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, noise_schedule):
    """One DDPM training step — predict the noise."""
    batch_size = x0.shape[0]
    T = len(noise_schedule.alphas_cumprod)

    # 1. Sample random timesteps
    t = torch.randint(0, T, (batch_size,), device=x0.device)

    # 2. Sample noise
    epsilon = torch.randn_like(x0)

    # 3. Create noisy image: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar_t = noise_schedule.alphas_cumprod[t][:, None, None, None]
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * epsilon

    # 4. Predict noise
    epsilon_pred = model(x_t, t)

    # 5. Simple MSE loss
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
PyTorch implementation
# DDPM noise schedule and single denoising step
import torch
import torch.nn.functional as F

def make_cosine_schedule(T: int = 1000):
    """Cosine beta schedule (Improved DDPM)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + 0.008) / 1.008 * torch.pi / 2) ** 2
    alphas_cumprod = f / f[0]
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(0, 0.999).float()

def ddpm_reverse_step(model, x_t, t, betas):
    """One DDPM denoising step: x_t -> x_{t-1}.
    Uses the σ_t² = β_t variance choice from Ho et al., 2020."""
    alpha_bar = (1 - betas[:t+1]).prod()
    alpha = 1 - betas[t]
    t_tensor = torch.tensor([t], device=x_t.device)
    eps_pred = model(x_t, t_tensor)
    # Predicted x_{t-1} mean
    mu = (x_t - betas[t] / (1 - alpha_bar).sqrt() * eps_pred) / alpha.sqrt()
    noise = torch.randn_like(x_t) if t > 0 else 0
    sigma_t = betas[t].sqrt()  # fixed-variance choice σ_t = sqrt(β_t)
    return mu + sigma_t * noise
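
Putting the two pieces together, here is a full (slow, 1000-step) sampling loop. It is a sketch that reuses make_cosine_schedule and ddpm_reverse_step from the block above; the U-Net itself is assumed.

python
@torch.no_grad()
def ddpm_sample(model, shape, T=1000, device="cuda"):
    """Start from pure Gaussian noise and apply the learned reverse step T times."""
    betas = make_cosine_schedule(T).to(device)
    x = torch.randn(shape, device=device)          # x_T ~ N(0, I)
    for t in reversed(range(T)):
        x = ddpm_reverse_step(model, x, t, betas)  # x_t -> x_{t-1}
    return x                                       # final x_0 estimate

# Example: sample four 64x64x4 latents (e.g., for a latent-diffusion denoiser)
# samples = ddpm_sample(unet, shape=(4, 4, 64, 64))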

Quick check

Derivation

In the CFG formula eps_guided = eps_uncond + w*(eps_cond - eps_uncond), setting w=7.5 requires two U-Net forward passes per step. If a single pass takes 20 ms on your GPU, how long does DDIM at 50 steps take with CFG?

🔧

Break It — See What Happens

Too few diffusion steps (e.g., 5 instead of 50)
No classifier-free guidance (w = 1)

Quick check

Trade-off

You reduce DDPM inference from 1000 to just 5 steps (no DDIM, just naive subsampling). The output is blurry and artifact-heavy. Why does DDPM specifically fail at 5 steps while DDIM succeeds at 20?

📊

Real-World Numbers

Model | Params | Architecture | Details
Stable Diffusion 1.5 | ~1B total | U-Net + VAE | Latent diffusion, 64×64 latent, 512×512 output, open-source. U-Net denoiser ≈ 860M; full pipeline (U-Net + CLIP-L text encoder + VAE) ≈ 1B.
DALL-E 3 | Not disclosed | Not disclosed | Recaptioned training data for prompt following; OpenAI has not published parameter count or architecture.
Imagen 3 | Not disclosed | Not disclosed | Google; earlier Imagen family used cascaded U-Nets, but Imagen 3 architecture/params are not public.
Stable Diffusion 3 | 0.45B – 8B | MMDiT (Esser et al., 2024) | Rectified-flow transformer with joint text-image attention; paper reports a scaling study from 450M to 8B parameters.
FLUX.1 [dev] | 12B | Rectified flow transformer | Black Forest Labs; model card lists 12B parameters, rectified flow training.
DiT-XL/2 | 675M | ViT | First DiT; state-of-the-art FID on ImageNet 256×256; proved transformer scaling.
✨ Insight · The trend: U-Net → Transformer, pixel space → latent space, 1000-step DDPM sampling → few-step samplers (DDIM, flow matching). Each generation made diffusion faster and higher quality. DiT + latent + CFG is the current winning formula.

Quick check

Trade-off

DiT-XL/2 (675M params) achieved SOTA FID on ImageNet 256×256. FLUX.1 [dev] uses 12B parameters — roughly 18x more. What does this scaling buy?

🎬

Video Diffusion Transformers (2024–2025)

The same DiT + latent diffusion recipe that powers image generation was extended to video in 2024. The core architectural extension: replace 2D spatial patches with 3D spacetime patches (height × width × time). This lets the transformer model temporal relationships the same way it models spatial ones — without a separate temporal attention stack.

Model | Released | Max Res / Duration | Key Contribution
Sora (OpenAI) | Dec 2024 (public) | 1080p / 60s | DiT on spacetime patches; flow matching; 32× latent compression
Veo 2 (Google DeepMind) | Dec 2024 | 4K / ~60s | Physics understanding; camera motion control; 4K resolution
MovieGen (Meta) | Oct 2024 | 1080p HD / 16s | 30B video + 13B audio model; joint video+audio generation
Wan2.1 (Alibaba) | Jan 2025 | — | Open-source SOTA; 70M clip training set; beats proprietary on VBench (community report)
Deep dive: Sora — spacetime patches and flow matching

Sora compresses video into latent space using a video VAE achieving ~32× spatial + temporal compression. The latent is then split into non-overlapping 3D spacetime patches (e.g. 2 frames × 8×8 spatial) which become tokens for the DiT.
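To make the patch geometry concrete, here is a sketch of turning a video latent into spacetime-patch tokens with plain tensor reshapes. The 2 × 8 × 8 patch size follows the example above; everything else (shapes, names) is illustrative, not Sora's actual implementation.

python
import torch

def patchify_spacetime(latent, pt=2, ph=8, pw=8):
    """latent: (B, C, T, H, W) video latent -> (B, N, C*pt*ph*pw) tokens,
    where N = (T/pt) * (H/ph) * (W/pw) spacetime patches."""
    B, C, T, H, W = latent.shape
    x = latent.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)       # (B, T/pt, H/ph, W/pw, C, pt, ph, pw)
    return x.reshape(B, -1, C * pt * ph * pw)   # one token per 3D spacetime patch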

Unlike per-frame models, the attention mechanism sees temporal and spatial context simultaneously. A token representing frame 5, position (64, 32) directly attends to tokens at frame 3, position (64, 40) — enabling coherent camera motion and object tracking across seconds without explicit temporal encoding tricks.

Sora uses flow matching (not DDPM) for the denoising process: the model learns to predict a straight-line velocity field from noise to the clean video latent, enabling generation in ~20–50 steps rather than 1000.

Source: OpenAI Sora technical overview, Dec 2024

Deep dive: MovieGen — joint video + audio generation

Meta's MovieGen (Oct 2024) pairs a 30B video transformer with a separate 13B audio transformer. The two models are trained jointly with a shared temporal alignment objective: audio tokens must be synchronised to video frames at 25 fps.

The key architectural challenge: video and audio have very different token rates. At 25 fps, 16 seconds of video produces 400 frame-level tokens; audio at 24 kHz with a compression codec produces ~600 audio tokens per second, or 9,600 for the same clip. Aligning these requires a cross-modal attention mechanism with explicit temporal position embeddings shared between the video and audio streams.

Source: MovieGen, arxiv:2410.13720, Oct 2024

✨ Insight · Wan2.1 (Jan 2025) is the first open-source model to outperform proprietary video generators on VBench (community report, arxiv:2412.09626). This follows the same open-source catch-up trajectory as image diffusion: Stable Diffusion matched DALL-E 2 quality roughly 12 months after DALL-E 2 launched. For video, the gap closed in under 12 months from Sora's Feb 2024 announcement to Wan2.1's Jan 2025 release.
🧠

Key Takeaways

What to remember for interviews

  1. The forward process gradually adds Gaussian noise over T steps until the image becomes pure noise — it is fixed, not learned, and can be computed in closed form for any timestep t.
  2. The reverse process is a neural network trained to predict the noise added at each step; subtracting the predicted noise iteratively recovers a clean image from pure noise.
  3. Latent diffusion (Stable Diffusion) runs the entire process in VAE latent space — ~48× fewer dimensions than pixels — making high-resolution generation practical on consumer GPUs.
  4. Classifier-free guidance (CFG) amplifies the gap between conditional and unconditional predictions at inference: higher guidance scale means stronger text adherence but less diversity. Sweet spot is typically w = 5–9.
  5. Flow matching (SD3, FLUX) replaces iterative denoising with learning a straight-line velocity field from noise to data, enabling high-quality generation in ~20 steps instead of 1000.
  6. Video diffusion (Sora, Veo 2, MovieGen) extends the DiT + latent recipe to 3D spacetime patches — the same attention mechanism models temporal coherence that spatial attention models layout, without a separate temporal stack.
🧬

Diffusion LLMs (2024–2025)

Diffusion models have dominated image and video generation — but can the same masked-denoising recipe replace autoregressive (AR) transformers for text? Two 2025 papers make the strongest case yet: LLaDA (language diffusion) and Mercury Coder (commercial code model).

Axis | Autoregressive LLM | Diffusion LLM
Decoding | Left-to-right, sequential tokens | Parallel denoising of all positions
Context | Causal mask (past only) | Bidirectional (full sequence visible)
Inference latency | O(N) serial steps | O(T) parallel full-sequence passes, with far fewer denoising steps than tokens
Quality at scale | Strong, well-understood scaling laws | Competitive on benchmarks; harder to scale to frontier
Training stability | Mature; cross-entropy loss | Harder: masked token prediction, multi-step objective
KV cache | Yes — amortises past computation | No standard KV cache; full-sequence attention each step

LLaDA — Large Language Diffusion with mAsking (Feb 2025)

Renmin University + Ant Group trained an 8B-parameter diffusion LLM that is competitive with LLaMA-3 8B on general reasoning and instruction following, despite using no autoregressive objective. Training uses a simple masked token prediction loss: randomly mask a fraction of tokens, predict them all in parallel. At inference, the model iteratively unmasks tokens over T denoising steps, with each step predicting all masked positions simultaneously.

Source: arxiv:2502.09992 (GSAI, Renmin University + Ant Group, Feb 2025)

Mercury Coder — Inception Labs (Feb 2025)

The first commercial diffusion LLM, targeting code generation. Mercury Coder reports throughput above 1,000 tokens/second on H100-class GPUs, roughly 10× the throughput of comparable autoregressive models, by decoding all output positions in parallel across a small number of denoising steps. Quality on HumanEval and other code benchmarks is competitive with similarly sized AR models (per Inception Labs blog, Feb 2025).

Source: inceptionlabs.ai/news/introducing-mercury (Feb 2025)

✨ Insight · Interview angle: Why can't diffusion LLMs just replace AR models? Three gaps remain: (1) No KV cache — each denoising step attends to the full sequence, so memory scales with steps × sequence length. (2) Harder long-context scaling — AR models benefit from causal masking and position-independent KV reuse; diffusion models re-attend globally each step. (3) Training instability at scale — masked diffusion loss optimises over all masking rates simultaneously, making learning dynamics harder to tune than a single cross-entropy objective. The throughput win (10×) is real for fixed-length generation (code, structured output) — less so for open-ended chat where output length is unpredictable.
PyTorch: masked diffusion training objective (LLaDA-style)
python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(
    model,          # bidirectional transformer
    input_ids,      # (B, L) token ids
    mask_token_id,  # id of the [MASK] token
):
    """
    LLaDA-style masked diffusion training objective.
    Sample a masking rate uniformly in [0, 1] per sequence, replace that fraction
    of tokens with [MASK], then predict all masked positions with cross-entropy.
    """
    B, L = input_ids.shape
    # Sample a random mask rate in [0, 1] for each sequence
    rates = torch.rand(B, 1, device=input_ids.device)          # (B, 1)
    mask = torch.rand(B, L, device=input_ids.device) < rates   # (B, L) bool

    noisy_ids = input_ids.clone()
    noisy_ids[mask] = mask_token_id  # replace with [MASK]

    # Model predicts logits over vocab at every position
    logits = model(noisy_ids)  # (B, L, vocab_size)

    # Only compute loss on masked positions
    loss = F.cross_entropy(
        logits[mask],           # (N_masked, vocab_size)
        input_ids[mask],        # (N_masked,)
    )
    return loss
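
For the inference side, here is one way to implement iterative unmasking with confidence-based selection. This is a simplified sketch that continues the module above, not the exact LLaDA sampler; the unmasking schedule and scoring rule are illustrative choices.

python
@torch.no_grad()
def masked_diffusion_generate(model, prompt_ids, gen_len, mask_token_id, steps=32):
    """Start fully masked, then over `steps` rounds keep the highest-confidence
    predictions and leave the rest masked for later rounds."""
    B = prompt_ids.shape[0]
    resp = torch.full((B, gen_len), mask_token_id, dtype=torch.long, device=prompt_ids.device)
    n_done = 0
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, resp], dim=1))[:, prompt_ids.shape[1]:]
        probs, preds = logits.softmax(-1).max(-1)         # per-position confidence and argmax token
        still_masked = resp == mask_token_id
        n_target = gen_len * (step + 1) // steps          # cumulative unmasking schedule
        n_new = n_target - n_done
        if n_new == 0:
            continue
        probs = probs.masked_fill(~still_masked, -1.0)    # never re-select settled positions
        idx = probs.topk(n_new, dim=-1).indices           # most confident still-masked positions
        newly = torch.zeros_like(still_masked).scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
        resp = torch.where(newly, preds, resp)
        n_done = n_target
    return resp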
🧠

Recap quiz

Derivation

DDPM requires 1000 denoising steps at inference. Which technique achieves comparable quality in 20–50 steps without retraining?

Derivation

Why does running diffusion in the VAE latent space (64×64×4) rather than pixel space (512×512×3) reduce compute so dramatically?

Trade-off

A model team wants to ship a text-to-image API with p50 latency under 2 s on a single A100. DDPM with T=1000 takes ~40 s. Which change alone is sufficient?

Trade-off

Stable Diffusion outputs show a specific type of artifact: fine details like hair, text, and fingers are blurry even when large structures are sharp. Which component is most responsible?

Derivation

During training, CFG randomly drops text conditioning 10% of the time. At inference with w=7.5, the guided noise prediction is: eps_guided = eps_uncond + 7.5*(eps_cond - eps_uncond). What does w=0 produce?

Derivation

Why does the DDPM training objective predict the noise epsilon rather than directly predicting the clean image x_0?

Trade-off

An interviewer asks: how does DiT (Diffusion Transformer) differ architecturally from the U-Net used in Stable Diffusion 1.x, and what does DiT gain?

📚

Further Reading

🎯

Interview Questions


Explain the forward and reverse processes in DDPM. Why is the forward process fixed (not learned)?

★★☆
Google · OpenAI

Why does noise prediction work? Intuitively, why predict the noise rather than the clean image directly?

★★☆
Google · Meta

Why does Stable Diffusion work in latent space instead of pixel space? What are the tradeoffs?

★★☆
Meta · Google

How does DiT (Diffusion Transformer) differ from the U-Net architecture traditionally used in diffusion models?

★★★
OpenAI · Meta

What is classifier-free guidance (CFG) and what happens when guidance scale is too low or too high?

★★★
Google · OpenAI