
Transformer Math

Module 15 · Training

🔥 GPU & Mixed Precision

bf16 training uses half the memory with zero accuracy loss — why wasn't this the default?


Training a frontier-scale model on a single A100 would be utterly impractical; rough estimates put it at tens of thousands of years, though exact figures depend on undisclosed training compute. The bottleneck isn't arithmetic, it's memory bandwidth. Understanding where data lives on the GPU, and how number formats trade precision for speed, is the difference between a 2× speedup and a 10× one.

🔺

GPU Memory Hierarchy

What you're seeing: The five memory tiers on a modern GPU (e.g., A100). Each level is faster but smaller. The key insight: attention spends most of its time moving data between HBM and compute cores — not doing arithmetic.

| Tier | Size | Bandwidth / latency |
| --- | --- | --- |
| Registers | ~256 KB per SM | thread-local, ~1-cycle access |
| Shared memory / L1 | ~192 KB per SM (A100) | ~19 TB/s · ~5 cycles |
| L2 cache | ~40 MB | ~6 TB/s · ~100 cycles |
| HBM / global memory | 80 GB | ~2–3.35 TB/s |
| CPU RAM (via PCIe) | up to ~TB | ~32–64 GB/s |

Tiers run from the fastest and smallest (registers) to the largest and slowest (CPU RAM over PCIe).
✨ Insight · Attention is memory-bandwidth-bound, not compute-bound. Standard self-attention reads and writes O(N²) elements for O(N²) ops, so its arithmetic intensity is roughly 1 FLOP/byte. An A100's ridge point is ~156 FLOP/byte. Flash Attention fixes this by keeping tiles in on-chip SRAM instead of round-tripping the N×N matrix through HBM, pushing attention from memory-bound toward compute-bound.
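To make the ridge-point argument concrete, here is a minimal roofline sketch in plain Python. It uses the A100 bf16 figures quoted in this module (the ~2.0 TB/s bandwidth is what the 156 FLOP/B ridge implies); the "large matmul" intensity is just an illustrative value.

python
# Roofline model: attainable throughput = min(peak FLOP/s, bandwidth * intensity).
PEAK_FLOPS = 312e12      # A100 bf16 tensor-core peak, FLOP/s
PEAK_BW    = 2.0e12      # A100 HBM bandwidth, bytes/s (implied by the 156 FLOP/B ridge)
RIDGE      = PEAK_FLOPS / PEAK_BW   # ~156 FLOP/byte

def attainable_tflops(intensity_flop_per_byte: float) -> float:
    """Performance is capped either by compute or by memory traffic."""
    return min(PEAK_FLOPS, PEAK_BW * intensity_flop_per_byte) / 1e12

# Standard attention moves O(N^2) bytes for O(N^2) FLOPs -> intensity ~1 FLOP/byte.
for name, intensity in [("standard attention", 1.0), ("large matmul", 300.0)]:
    bound = "memory-bound" if intensity < RIDGE else "compute-bound"
    print(f"{name:>18}: {attainable_tflops(intensity):7.1f} TFLOP/s attainable ({bound})")

At intensity 1 the model predicts ~2 TFLOP/s out of a 312 TFLOP/s peak, which is why standard attention leaves the tensor cores idle.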
🔢

Number Format Bit Layouts

Each format divides 32, 16, or 8 bits into sign, exponent, and mantissa. The exponent controls range; the mantissa controls precision.

| Format | Sign / exponent / mantissa | Range | Precision | Notes |
| --- | --- | --- | --- | --- |
| fp32 | 1 / 8 / 23 bits | ±3.4×10³⁸ | ~7 decimal digits | |
| fp16 | 1 / 5 / 10 bits | ⚠ max 65,504 | ~3.3 decimal digits | overflow risk → needs loss scaling |
| bf16 | 1 / 8 / 7 bits | ✓ same range as fp32 | ~2.3 decimal digits | no loss scaling needed |
| fp8 E4M3 | 1 / 4 / 3 bits | | | H100 Transformer Engine; requires per-tensor scaling; ~2× throughput vs bf16 |
🎯 Interview · Interview trap: bf16 has less mantissa precision than fp16 (7 vs 10 bits), yet it's preferred for training. Why? Because the 8-bit exponent gives it fp32's dynamic range (max ~3.4×10³⁸), so overflow is extremely rare for bf16. Precision matters far less than avoiding NaN explosions during training.
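A quick way to see the range difference is to cast a couple of values and watch what survives. A minimal PyTorch sketch (CPU-only; the two test values are chosen purely for illustration):

python
import torch

# fp16: 5-bit exponent, max 65,504, tiny values flush to zero.
# bf16: 8-bit exponent, fp32's dynamic range, fewer mantissa bits.
for value in (70000.0, 1e-8):
    fp16 = torch.tensor(value, dtype=torch.float16)
    bf16 = torch.tensor(value, dtype=torch.bfloat16)
    print(f"{value:g}: fp16 -> {fp16.item():g}, bf16 -> {bf16.item():g}")

# Expected behavior: 70000 overflows to inf in fp16 but stays finite (coarsely rounded) in bf16;
# 1e-8 flushes to 0 in fp16 but remains ~1e-8 in bf16.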

Quick check

Trade-off

bf16 has fewer mantissa bits than fp16 (7 vs 10), yet all major LLMs are trained in bf16. What property makes bf16 the safer choice?
💡

Mixed Precision Training

What breaks without it?

Training purely in fp32 wastes 2× memory and 2× bandwidth, and it leaves tensor cores (the dedicated matrix-multiply units on modern GPUs, optimized for bf16/fp16 matmuls) underused. Training purely in fp16 causes gradients to underflow to zero or overflow to NaN. Mixed precision gets the best of both.

The 3-step recipe

① Master weights: fp32

Stored in full precision. Small gradient updates (e.g., 1e-7) need fp32 to avoid rounding to zero when added to weights. Updated once per optimizer step.

② Forward + backward: bf16

Activations, intermediate tensors, and gradients are computed in bf16: roughly 2× faster matmuls on tensor cores and 2× smaller activation memory.

③ Optimizer step: fp32

Gradients cast to fp32 before AdamW moment accumulation. fp32 master weights updated. Downcast copy sent back to bf16 for next forward pass.

Forward pass (bf16/fp16) → loss → backward pass (bf16/fp16) → gradients → optimizer step (fp32 master weights)
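The recipe can also be written out by hand, which makes the role of the fp32 master copy explicit. The sketch below is illustrative only: a toy weight matrix and a hypothetical train_step, not the autocast path shown later in this section.

python
import torch

master_w = torch.randn(1024, 1024, requires_grad=True)   # ① master weights stay fp32
optimizer = torch.optim.AdamW([master_w], lr=3e-4)

def train_step(x, y):
    w_bf16 = master_w.to(torch.bfloat16)        # bf16 working copy (differentiable cast)
    pred = x.to(torch.bfloat16) @ w_bf16        # ② forward in bf16
    loss = torch.nn.functional.mse_loss(pred.float(), y)
    loss.backward()                             # ② backward; grad lands on the fp32 master
    optimizer.step()                            # ③ AdamW update in full precision
    optimizer.zero_grad()
    return loss.item()

# Hypothetical usage with random data:
print(train_step(torch.randn(32, 1024), torch.randn(32, 1024)))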
Quick check

A model trained with fp16 mixed precision suddenly produces NaN losses on step 1000. What is the most likely cause?

Quick check

Trade-off

An A100 80GB delivers 312 TFLOP/s in bf16 on tensor cores but only 19.5 TFLOP/s in standard fp32. How does mixed-precision training exploit this gap without sacrificing convergence?
📐

Loss Scaling (fp16 Only)

fp16 cannot represent very small magnitudes: its 5-bit exponent bottoms out around 6×10⁻⁸. Gradients in deep networks are often tiny (1e-5 to 1e-8). Without scaling, they flush to zero and the model stops learning.

⚠ Warning · Why bf16 doesn't need this: bf16 shares fp32's 8-bit exponent, so it can represent values down to ~1.2×10⁻³⁸. Gradients of 1e-8 are representable. fp16's 5-bit exponent only reaches ~6×10⁻⁸: borderline at best, zero at worst.

How loss scaling works:

  1. Multiply the loss by a large scale factor S (e.g., S = 2¹⁵ = 32,768) before calling .backward()
  2. By the chain rule, every gradient is scaled by S as well, pushing them into fp16's representable range
  3. Before the optimizer step, divide all gradients by S to recover the true gradient magnitudes
  4. If any gradient is NaN/Inf (overflow still occurred), skip the optimizer step and halve S for the next step

Scaled loss and gradient: L_scaled = S · L, so ∇θL_scaled = S · ∇θL

Optimizer step with unscaled gradient: θ ← θ − η · ∇θL_scaled / S = θ − η · ∇θL
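A small numeric sketch of the first three steps above (toy values only, not the GradScaler internals):

python
import torch

S = 2.0 ** 15                                   # scale factor S = 32,768
tiny_grad = 1e-8                                # a typical small gradient value

unscaled = torch.tensor(tiny_grad, dtype=torch.float16)
scaled   = torch.tensor(tiny_grad * S, dtype=torch.float16)

print(unscaled.item())                          # 0.0 -> flushed to zero, the update is lost
print(scaled.item())                            # ~3.3e-4 -> comfortably inside fp16's range
print(scaled.float().item() / S)                # ~1e-8 -> unscaling in fp32 recovers the gradient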

PyTorch implementation:

torch.autocast + GradScaler (fp16)

python
import torch
from torch.cuda.amp import autocast, GradScaler

model = MyTransformer().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# GradScaler handles loss scaling automatically for fp16
# Not needed for bf16 (same exponent range as fp32)
scaler = GradScaler()

for batch in dataloader:
    inputs, labels = batch
    optimizer.zero_grad()

    # Forward + backward in fp16 automatically
    with autocast(dtype=torch.float16):
        logits = model(inputs)
        loss = criterion(logits, labels)

    # Scale loss → backward → unscale → optimizer step
    scaler.scale(loss).backward()   # grads are scaled
    scaler.unscale_(optimizer)      # unscale before clip
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)          # optimizer updates fp32 master weights
    scaler.update()                 # adjust scale factor for next step
bf16 — no GradScaler needed

torch.autocast bf16 — simpler

python
# bf16 rarely needs GradScaler — fp32-range exponent makes overflow very unlikely
with autocast(dtype=torch.bfloat16):
    logits = model(inputs)
    loss = criterion(logits, labels)

loss.backward()  # gradients in bf16, accumulated in fp32
optimizer.step() # master weights always fp32

Quick check

Derivation

GradScaler multiplies loss by S = 2¹⁵ = 32,768 before backward(). A gradient that would have been 1e-6 without scaling is now 32,768 × 1e-6 = 0.033 in fp16. Why is 0.033 safe but 1e-6 was not?
🔧

Break It — See What Happens

Train in fp32 only (no mixed precision)
Train in fp16 without loss scaling (GradScaler disabled)
📊

Real GPU Specs

All numbers from NVIDIA datasheets. Arithmetic intensity at ridge point = peak TFLOP/s ÷ peak HBM bandwidth.

| GPU | HBM | HBM BW | FP32 | TF32 | BF16 | FP8 | Ridge (BF16) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A100 SXM 80GB | 80 GB | ~2.0 TB/s | 19.5 TF | 156 TF | 312 TF | n/a | 156 FLOP/B |
| H100 SXM 80GB | 80 GB | 3.35 TB/s | 67 TF | 756 TF | 989 TF | 1,979 TF | 295 FLOP/B |
| H200 SXM 141GB | 141 GB | 4.8 TB/s | 67 TF | 756 TF | 989 TF | 1,979 TF | 206 FLOP/B |
✨ Insight · H200 vs H100: Same compute (identical die), but ~43% more HBM bandwidth (4.8 vs 3.35 TB/s) and 141 GB of HBM instead of 80 GB. This directly speeds up memory-bound workloads like attention and token generation. Compute-bound workloads (large matmuls) see no improvement.
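The ridge column can be reproduced from the table's own numbers. A small sketch (the A100 bandwidth of ~2.0 TB/s is the value implied by its 156 FLOP/B ridge):

python
# Ridge point = peak BF16 TFLOP/s ÷ HBM bandwidth (TB/s), i.e. FLOP per byte.
# Kernels whose arithmetic intensity falls below the ridge are memory-bound on that GPU.
specs = {
    "A100 SXM 80GB":  (312, 2.0),
    "H100 SXM 80GB":  (989, 3.35),
    "H200 SXM 141GB": (989, 4.8),
}

for gpu, (bf16_tflops, hbm_tb_s) in specs.items():
    print(f"{gpu}: ridge = {bf16_tflops / hbm_tb_s:.0f} FLOP/byte")

# Memory-bound kernels scale with bandwidth, so H100 -> H200 gives at most 4.8 / 3.35x.
print(f"bandwidth-bound speedup H100 -> H200: {4.8 / 3.35:.2f}x")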

BF16 TFLOP/s comparison (tensor core peak): A100 80GB: 312 TF · H100 80GB: 989 TF · H200 141GB: 989 TF

Quick check

Derivation

The H200 has the same bf16 TFLOP/s as the H100 (989 TF) but its ridge point drops from 295 to 206 FLOP/byte due to higher bandwidth. What does a lower ridge point mean operationally?
🧠

Key Takeaways

What to remember for interviews

  1. Mixed precision uses bf16 for forward/backward (2× faster tensor cores, 2× less memory) and fp32 master weights + optimizer states to prevent gradient underflow.
  2. bf16 is preferred over fp16 for training: the same 8-bit exponent as fp32 gives the same dynamic range (~3.4e38 max), making overflow extremely rare in practice and loss scaling unnecessary.
  3. The roofline model bounds performance by min(peak FLOPS, bandwidth × arithmetic intensity). Standard attention has intensity ~1, far below the ridge, making it memory-bound.
  4. Flash Attention tiles computation into SRAM blocks to avoid materializing the N×N matrix in HBM, moving attention from memory-bound toward compute-bound.
  5. Activation checkpointing (gradient checkpointing) trades ~33% extra compute for O(sqrt(L)) activation memory instead of O(L), essential for training large models (see the sketch below).
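For the checkpointing point above, a minimal sketch of what it looks like with torch.utils.checkpoint; the stack of linear "blocks" is purely hypothetical.

python
import torch
from torch.utils.checkpoint import checkpoint

# Gradient checkpointing: skip storing each block's activations during the forward
# pass and recompute them during backward, trading compute for activation memory.
def forward_with_checkpointing(blocks, x):
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x

# Hypothetical usage:
blocks = torch.nn.ModuleList(torch.nn.Linear(512, 512) for _ in range(8))
out = forward_with_checkpointing(blocks, torch.randn(4, 512, requires_grad=True))
out.sum().backward()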

🎯

Interview Questions


Why is bf16 preferred over fp16 for training large language models?

★☆☆
Google · Anthropic

Explain the roofline model and what it predicts about attention computation.

★★☆
Google · Meta

What is arithmetic intensity, and how does it govern GPU utilization?

★★☆
Meta · OpenAI

Why does Flash Attention help with the memory bandwidth bottleneck?

★★☆
Anthropic · OpenAI

How does gradient accumulation help with limited GPU memory, and what are its tradeoffs?

★☆☆
Google · Meta
🧠

Recap quiz

Trade-off

An A100 80GB runs bf16 tensor-core matmuls at 312 TFLOP/s but fp32 at only 19.5 TFLOP/s. What is the primary reason mixed-precision training does NOT lose accuracy despite running forward+backward in bf16?
Derivation

Training in fp16 with GradScaler disabled fails within ~100 steps. What is the precise failure chain?
Derivation

Standard self-attention has arithmetic intensity ≈ 1 FLOP/byte. The A100's ridge point is ~156 FLOP/byte. What does this imply about standard attention performance relative to the GPU's theoretical peak?
Trade-off

An engineer proposes replacing bf16 mixed precision with fp16 mixed precision to gain extra mantissa precision (10 vs 7 bits). What is the main operational cost?
Trade-off

The H200 SXM has the same BF16 TFLOP/s as the H100 (989 TF) but 4.8 TB/s vs 3.35 TB/s HBM bandwidth. Which workload sees the largest real-world speedup moving from H100 to H200?