
Transformer Math

Module 26 · Inference

📦 Quantization

4-bit Llama-70B fits in 35 GB — down from 140 GB


LLM inference is memory-bandwidth bound, not compute-bound. Each generated token reads the entire model from memory but does relatively little math. Quantization shrinks weights from 32-bit floats to 8-bit or even 4-bit integers — cutting memory 4-8x and speeding up inference proportionally.

📦

The Quantization Pipeline

What you're seeing: The full quantization pipeline — from raw FP16 weights through calibration to packed INT8/INT4 integers, then dequantized back to FP16 at inference time. The key insight: you only pay the memory cost of integers, but the model still "computes" in floating point.

[Pipeline diagram] FP16 weights (2 bytes/param) → calibration (measure activation ranges) → scale & zero-point: q = round(x/scale + zero_point) → INT8/INT4 weights (0.5–1 byte/param) → dequantize at inference: x̂ ≈ (q − zp)·s. Range-mapping example: FP16 [−3.2, 4.1] mapped to 8-bit integers with scale = 7.3 / 255 ≈ 0.0286 and zero_point = round(3.2 / 0.0286) = 112.

scale = (max − min) / 255 = 7.3 / 255 ≈ 0.0286. zero_point = round(−min / scale) = round(3.2 / 0.0286) = 112. This is the uint8 affine scheme (range [0, 255]). Writing s = scale and z = zero_point, quantize via q = clamp(round(x/s + z), 0, 255) and dequantize via x̂ = (q − z)·s.

🎮

Weight Precision: FP32 → FP16 → INT8 → INT4

What you're seeing: The same weight value (0.7832) stored at four precision levels. The bar shows relative memory usage; the error shows precision loss.

What to notice: INT4 uses 8x less memory than FP32, but the weight can only take on 16 discrete values — the error grows.

FP32 (4 bytes) · 1 sign + 8 exp + 23 mantissa · Stored: 0.7832000 · Error: 0 (exact)
FP16 (2 bytes) · 1 sign + 5 exp + 10 mantissa · Stored: 0.7830 · Error: 0.000200
INT8 (1 byte) · 256 discrete levels · Stored: 99 (0.7795) · Error: 0.003672
INT4 (0.5 bytes) · 16 discrete levels · Stored: 5 (0.7143) · Error: 0.068914
| Model | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Llama-2 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| Llama-2 70B | 280 GB | 140 GB | 70 GB | 35 GB |
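These sizes are just parameters × bytes per parameter; a quick sanity-check sketch (the helper name is mine):

```python
def model_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Weight memory in GB: parameters x bits, / 8 bits-per-byte, / 1e9 bytes-per-GB."""
    return n_params * bits_per_param / 8 / 1e9

for bits, fmt in [(32, "FP32"), (16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"Llama-2 70B @ {fmt}: {model_memory_gb(70e9, bits):.1f} GB")
# 70e9 params at 4 bytes = 280 GB; at 0.5 bytes = 35 GB
```

The KV cache and activations add on top of these figures, which is why "fits on an 80 GB A100" needs headroom beyond the raw weight size.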
💡

The Intuition

The bottleneck is memory bandwidth, not compute. Llama-2 70B in FP16 is 140 GB; a single A100 has 80 GB HBM. You can't even load it. INT4 quantization: 70B × 4 bits = 35 GB — fits on one GPU. The model runs 2–3× faster because it reads 4× less data from memory. The quality loss? Less than 1% perplexity increase.

| Format | Bits | 70B model size | Fits on A100 (80 GB)? | Perplexity impact |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | No (needs 4 GPUs) | Baseline |
| FP16 | 16 | 140 GB | No (needs 2 GPUs) | ~0% |
| INT8 | 8 | 70 GB | Yes (barely) | <0.5% |
| INT4 | 4 | 35 GB | Yes (with room) | <1% |

Why quantize? During autoregressive generation, each token reads every weight from memory exactly once. An A100 has 2 TB/s bandwidth and 312 TFLOPS — at 2 bytes per FP16 parameter, that's ~1T FP16 params/sec of memory reads, while compute could sustain ~156T multiply-accumulates/sec. The GPU is starving for data, not math. Smaller weights = more tokens/sec.
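That bandwidth argument turns into a one-line tokens/sec estimate — a back-of-the-envelope sketch using the numbers above (function name and structure are mine, not a production formula):

```python
def decode_tokens_per_sec(n_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_sec: float = 2e12) -> float:
    """Batch-1 decode roofline: each token must stream every weight from HBM once,
    so tokens/sec is bounded by bandwidth / model-size-in-bytes."""
    bytes_per_token = n_params * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token

fp16 = decode_tokens_per_sec(70e9, 2.0)   # ~14 tok/s on an A100 (2 TB/s)
int4 = decode_tokens_per_sec(70e9, 0.5)   # ~57 tok/s — 4x less weight traffic
```

This is an upper bound (it ignores KV-cache reads and dequantization overhead), but it explains the observed 2–3× real-world speedup from INT4.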

Weight-only vs. activation quantization. Weight-only quantization (GPTQ, AWQ) keeps weights in INT4/INT8 and dequantizes on-the-fly during computation. Activation quantization (LLM.int8(), SmoothQuant) also quantizes the input activations, enabling pure integer matmul on specialized hardware. Weight-only is more popular because LLM activations have extreme outliers that are hard to quantize.

PTQ vs. QAT. Post-training quantization (PTQ) quantizes a pre-trained model — fast and easy, works great at INT8. Quantization-aware training (QAT) inserts fake quantization during training so the model learns to tolerate the noise — essential for INT4 and below.
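A minimal sketch of the fake-quantization trick QAT builds on — quantize in the forward pass, let gradients pass straight through (the straight-through estimator). The function name and the symmetric per-tensor scheme are illustrative choices, not a specific library's API:

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric fake-quant: forward sees quantized values, backward sees identity."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed INT4
    scale = x.abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: (x_q - x).detach() + x has gradient d/dx = 1,
    # so the non-differentiable round() is invisible to the backward pass.
    return x + (x_q - x).detach()

w = torch.randn(8, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()
# w.grad is all ones: the model trains as if quantization noise were part of the data
```

Training with this noise in the loop is what lets QAT survive INT4 and below, where pure PTQ falls off a cliff.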

GPTQ uses the inverse Hessian to optimally round each weight column, compensating for rounding errors in remaining weights. One-shot, layer-by-layer — on Llama-2 70B at 4-bit, GPTQ reaches 3.85 perplexity on WikiText-2 vs the 3.32 FP16 baseline.

AWQ (Activation-Aware) observes that only ~1% of weights are salient — they correspond to channels with large activations. AWQ scales these channels up before quantization, effectively giving them more precision. Simpler and faster than GPTQ, better generalization.
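The core AWQ move — per-channel scaling before round-to-nearest — can be sketched as below. This is a heavily simplified illustration (names and the fixed α exponent are mine; real AWQ searches the scaling exponent per layer and folds 1/s into the preceding operation rather than dividing afterward):

```python
import torch

def awq_style_quant(W: torch.Tensor, act_scale: torch.Tensor,
                    bits: int = 4, alpha: float = 0.5) -> torch.Tensor:
    """AWQ-flavored sketch: per-input-channel scaling before round-to-nearest.
    W: (out, in) weights; act_scale: (in,) mean |activation| per input channel."""
    s = act_scale.clamp(min=1e-5) ** alpha          # bigger scale for hot channels
    W_scaled = W * s                                # salient channels get more range
    qmax = 2 ** (bits - 1) - 1
    delta = W_scaled.abs().amax(dim=1, keepdim=True) / qmax   # per-row step size
    W_q = torch.clamp((W_scaled / delta).round(), -qmax, qmax) * delta
    return W_q / s      # undo the scale: salient channels now carry less rounding error
```

The returned tensor is the dequantized effective weight; channels with large activations see proportionally smaller relative error, which is exactly the protection AWQ is after.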


FP8 Training (H100): H100 GPUs include native FP8 tensor cores that achieve . Two formats are used together: E4M3 (4 exponent bits, 3 mantissa) for forward pass weights and activations — higher precision, narrower range — and E5M2 (5 exponent bits, 2 mantissa) for gradients — wider range to capture small gradient values. DeepSeek-V3 trained its 671B MoE model with FP8, cutting training memory ~50% vs BF16 with negligible quality loss. The key techniques: per-tensor scaling factors, loss scaling to prevent gradient underflow, and keeping normalization layers in higher precision.

1-bit LLMs — BitNet b1.58 (2024): Microsoft Research's BitNet b1.58 uses ternary weights {-1, 0, 1}, where the 1.58 refers to log₂(3) ≈ 1.58 bits per weight. The paper reports parity with same-size full-precision LLaMA baselines while substantially reducing memory and matmul cost — because weight-activation multiplication reduces to addition and subtraction. The tradeoff: BitNet requires training from scratch with ternary-aware optimization; you cannot quantize an existing FP16 model to ternary weights without severe quality loss.

AQLM / QuIP# (2024): Lattice-based codebook quantization methods that push toward sub-2-bit weights. QuIP# (Tseng et al.) applies randomized Hadamard transforms to weight matrices before quantization — this incoherence processing makes the weight distribution more uniform, so a small codebook (e.g., 256 entries for 2-bit) can cover the range with less error. AQLM (Egiazarian et al.) uses additive quantization with learned codebooks applied recursively. Both achieve near-lossless 2-bit compression on Llama-2 70B — something round-to-nearest INT2 completely fails at — at the cost of slower quantization setup and slightly higher decode overhead.

✨ Insight · The reason INT4 works at all: neural network weights are highly redundant. Most of the information lives in a small number of salient weights — the rest can be aggressively rounded with minimal quality loss.

NF4 (Normalized Float 4-bit) is a non-uniform quantization format introduced by QLoRA (Dettmers et al., 2023). Standard INT4 uses 16 evenly spaced levels. NF4 instead places quantization levels at the quantiles of a standard normal distribution — matching the empirical distribution of neural network weights, which are approximately Gaussian. This means each level represents an equal fraction of the weight values, minimizing information loss. Concretely, each block of weights is normalized by its absolute maximum, then each value is snapped to the nearest of 16 normal-quantile levels spanning [−1, 1]. NF4 also supports double quantization: the quantization constants (scales) are themselves quantized to FP8, saving an additional 0.5 bits per parameter. Together, NF4 + double quantization enables fitting a 4-bit 70B model (~35 GB of weights) on a single A100 80GB with room for activations and KV cache.
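A sketch of the idea: build a 16-level codebook from normal quantiles and snap absmax-normalized blocks to it. The exact NF4 constants in bitsandbytes differ slightly (they are constructed so that zero is exactly representable); this is an illustrative approximation with names of my choosing:

```python
import torch

def normal_quantile_codebook(levels: int = 16) -> torch.Tensor:
    """Place levels at evenly spaced quantiles of N(0,1), rescaled to [-1, 1]."""
    # offset keeps the probabilities away from 0/1, where the inverse CDF diverges
    p = torch.linspace(0.5 / levels, 1 - 0.5 / levels, levels)
    q = torch.sqrt(torch.tensor(2.0)) * torch.erfinv(2 * p - 1)  # inverse normal CDF
    return q / q.abs().max()                                     # normalize to [-1, 1]

def nf_quantize(x: torch.Tensor, codebook: torch.Tensor):
    """Absmax-normalize a block, then snap each value to the nearest codebook level."""
    absmax = x.abs().max()
    idx = (x / absmax - codebook[:, None]).abs().argmin(dim=0)
    return idx.to(torch.uint8), absmax   # store 4-bit indices + one FP scale per block

codebook = normal_quantile_codebook()
idx, absmax = nf_quantize(torch.randn(64), codebook)
x_hat = codebook[idx.long()] * absmax    # dequantize
```

Because the levels are denser near zero — where most Gaussian weights live — the average rounding error is lower than with 16 evenly spaced INT4 levels.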

Quick check

Derivation

INT4 quantization reduces Llama-2 70B from 140 GB to 35 GB. An A100 has 2 TB/s HBM bandwidth. How many more tokens/sec does INT4 generate vs FP16 at batch=1?

Quick Check

Why is LLM inference memory-bandwidth bound rather than compute-bound?

📐

Step-by-Step Derivation

Uniform Quantization (Asymmetric)

Map a floating-point value x to a b-bit integer:

q = clamp(round(x / s + z), 0, 2^b − 1)

Where s (scale) and z (zero point) are:

s = (x_max − x_min) / (2^b − 1),    z = round(−x_min / s)

Memory Savings

For a model with N parameters:

memory = N × (bits / 8) bytes

A 70B parameter model: FP16 = 70 × 10⁹ × 2 bytes = 140 GB, INT4 = 70 × 10⁹ × 0.5 bytes = 35 GB — fits on a single 48GB GPU.

GPTQ Layer-wise Objective

Minimize the per-layer reconstruction error, whose Hessian is H = 2XXᵀ for calibration inputs X:

argmin_Ŵ ‖WX − ŴX‖²_F

GPTQ processes columns sequentially: quantize column i, then update the remaining columns j > i using row i of the inverse Hessian H⁻¹ to compensate for the rounding error.
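A heavily simplified sketch of that column loop. Real GPTQ works on a Cholesky factorization of the damped inverse Hessian with lazy batched updates and per-group scales; the names and the plain matrix inverse here are mine:

```python
import torch

def gptq_like_quantize(W: torch.Tensor, X: torch.Tensor,
                       bits: int = 4, damp: float = 0.01) -> torch.Tensor:
    """Simplified GPTQ-style rounding: quantize one input column at a time and
    spread its rounding error onto the not-yet-quantized columns.
    W: (out, in) layer weights; X: (in, n_samples) calibration activations."""
    W = W.clone()
    H = 2 * X @ X.T                                        # Hessian, shape (in, in)
    H += damp * H.diag().mean() * torch.eye(H.shape[0])    # damping for stability
    Hinv = torch.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax       # per-row symmetric scale
    for i in range(W.shape[1]):
        q = torch.clamp((W[:, i] / scale[:, 0]).round(), -qmax, qmax)
        err = (W[:, i] - q * scale[:, 0]) / Hinv[i, i]
        W[:, i] = q * scale[:, 0]
        if i + 1 < W.shape[1]:
            # compensate: shift the error onto remaining columns via inv-Hessian row i
            W[:, i + 1:] -= err[:, None] * Hinv[i, i + 1:][None, :]
    return W
```

The calibration data enters only through H: columns whose activations co-vary strongly absorb each other's rounding error, which is why GPTQ beats plain round-to-nearest at the same bit-width.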

PyTorch: 4-bit Loading with bitsandbytes

python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization (QLoRA style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# 70B model now fits in ~35GB VRAM (single A100)

PyTorch: GPTQ via auto-gptq

python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",              # calibration dataset
    tokenizer=tokenizer,       # required for calibration
    group_size=128,            # quantize in groups of 128 weights
    desc_act=True,             # Hessian-based column ordering
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=gptq_config,
    device_map="auto",
)
PyTorch: Manual uint8 Affine Quantization

python
# Manual uint8 affine quantization: compute scale/zero-point, quantize, dequantize
import torch

def quantize_uint8(x: torch.Tensor):
    """Asymmetric per-tensor 8-bit affine quantization (uint8, range [0, 255])."""
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / 255.0          # Δ: maps [x_min, x_max] → [0, 255]
    zero_point = round(-x_min / scale)        # z: offset so x_min maps to 0

    q = torch.clamp(torch.round(x / scale + zero_point), 0, 255).to(torch.uint8)
    return q, scale, zero_point

def dequantize_uint8(q, scale, zero_point):
    return (q.float() - zero_point) * scale

# torch.quantization.quantize_dynamic (PTQ for linear layers)
import torch.nn as nn
model = nn.Linear(512, 512)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

Quick check

Derivation

Asymmetric quantization (zero_point != 0) adds a zero_point parameter per tensor. For what weight distribution does symmetric quantization waste representable levels, and how much?

🔧

Break It — See What Happens

Quantize attention layers too aggressively (INT2)
No calibration data for PTQ

Quick check

Derivation

You run GPTQ on Llama-2 70B without providing any calibration data. What is the quantization quality, and why?

📊

Real-World Numbers

| Model | Method | Memory | Quality (Perplexity) |
|---|---|---|---|
| Llama-2 70B | FP16 (baseline) | 140 GB | 3.32 (WikiText-2) |
| Llama-2 70B | GPTQ 4-bit | 35 GB | 3.85 |
| Llama-2 70B | AWQ 4-bit | 35 GB | 3.74 |
| Llama-2 70B | LLM.int8() | 70 GB | ~3.32 (near-lossless) |
| DeepSeek-V3 | FP8 training | ~50% of FP16 | Negligible loss vs BF16 training |
✨ Insight · AWQ consistently outperforms GPTQ at 4-bit despite being simpler. The key insight: protecting the 1% of salient weights matters more than optimal rounding of all weights. At INT8, LLM.int8() is nearly lossless thanks to mixed-precision outlier handling.

Quick check

Trade-off

AWQ 4-bit achieves 3.74 perplexity on Llama-2 70B while GPTQ 4-bit achieves 3.85. AWQ is also faster to quantize. What tradeoff makes GPTQ potentially preferable despite lower quality?

🚀

SOTA 2024–2025: Beyond INT4 — 1-bit and FP8

BitNet b1.58 — ternary weights trained from scratch (Feb 2024, arxiv:2402.17764)

BitNet b1.58 trains models with ternary weights — 1.58 bits per weight (log₂3 ≈ 1.58). Critically, this is not PTQ (post-training quantization applied to an FP16 model) — the model is trained from scratch with ternary weights via straight-through estimators. Results at 3B parameters: 2.71× faster inference and 3.55× less GPU memory vs an equivalent FP16 model — because ternary weights replace floating-point multiplications with additions and sign flips. Quality matches BF16 baselines at the same model size.

Why 1.58 bits? The ternary quantization math

Given a weight matrix W, BitNet quantizes each weight to W̃ = clamp(round(W/γ), −1, +1), where γ = mean(|W|) is the per-tensor absolute mean. The result is always in {−1, 0, +1}.

At inference, the weight matrix is stored as 2-bit integers (1 byte per 4 weights). The matmul becomes a lookup + accumulate (no floating-point multiply) — this is what enables the CPU speedup.

import torch

def quantize_ternary(W: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """BitNet b1.58 per-tensor ternary quantization (arxiv:2402.17764)."""
    gamma = W.abs().mean()          # per-tensor scale
    W_hat = (W / gamma).round().clamp(-1, 1).to(torch.int8)
    return W_hat, gamma             # store (ternary weights, scale)

def dequant_matmul(x: torch.Tensor, W_hat: torch.Tensor, gamma: float) -> torch.Tensor:
    # W_hat is {-1,0,+1} int8 — cast to float for matmul
    return x @ W_hat.float().T * gamma

FP8 — production standard on H100 (2024–2025)

vLLM defaults to FP8 weight quantization on H100 (vLLM docs). DeepSeek-V3 (671B MoE) was trained in FP8 — the first large open model to demonstrate FP8 pre-training viability. Quality impact: <0.5% accuracy loss. The H100 supports two FP8 formats: E4M3 (4 exponent bits, 3 mantissa) for forward-pass activations and weights (more precision, narrower range), and E5M2 (5 exponent, 2 mantissa) for gradients (wider dynamic range for small gradient values). Combined with FP8 KV cache, FP8 end-to-end roughly doubles effective throughput vs FP16 on H100.
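The range difference between the two formats falls straight out of the bit layout. A small sketch computing the largest finite value of each (the E4M3 used on H100 is the OCP "fn" encoding, which spends the top exponent code on values rather than infinities; the function name is mine):

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_inf: bool) -> float:
    """Largest finite value of a miniature float format.
    ieee_inf=True: top exponent code reserved for inf/NaN (IEEE-style E5M2).
    ieee_inf=False: OCP 'fn' style, where only the all-ones pattern is NaN (E4M3)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_inf:
        max_exp = (2 ** exp_bits - 2) - bias       # top exponent code reserved
        max_man = 2 - 2 ** -man_bits               # mantissa 1.11...1
    else:
        max_exp = (2 ** exp_bits - 1) - bias       # top exponent code usable
        max_man = 2 - 2 ** -(man_bits - 1)         # mantissa 1.11...0 (all-ones = NaN)
    return max_man * 2 ** max_exp

e4m3_max = fp8_max(4, 3, ieee_inf=False)   # 448.0  — narrow range, finer steps
e5m2_max = fp8_max(5, 2, ieee_inf=True)    # 57344.0 — wide range for gradient magnitudes
```

The ~128× wider range of E5M2 is what keeps small gradients from underflowing; the extra mantissa bit of E4M3 is what keeps forward activations accurate.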

FP4 — Blackwell (B200/GB200) next frontier

NVIDIA's B200/GB200 (Blackwell, 2025) adds native FP4 Tensor Core support — the first generation where 4-bit floating-point inference is a first-class hardware feature rather than a software workaround. FP4 offers 2× the throughput of FP8 at the same chip area. Early results from NVIDIA show FP4 inference for LLMs within 1% perplexity of FP8 with per-group scaling. This makes FP4 the likely production default for the 2026 generation of deployments, following the FP16 → FP8 → FP4 trajectory.

Extended quantization comparison (2024–2025)

| Format | Bits/weight | Method | Quality vs FP16 | Hardware support |
|---|---|---|---|---|
| INT4 (AWQ/GPTQ) | 4 | PTQ | <1% ppl increase | A100, H100 (emulated) |
| FP8 (E4M3) | 8 | PTQ or QAT | <0.5% | H100 native tensor cores |
| BitNet b1.58 | 1.58 | Train from scratch | Matches BF16 | Custom add/accumulate kernels (no native support yet) |
| FP4 | 4 | PTQ + per-group scale | ~1% ppl (early) | B200/GB200 native (2025) |
✨ Insight · Interview framing: The quantization frontier is moving in two directions simultaneously. (1) Lower bits for inference: FP8 → FP4 as hardware adds native support; each generation roughly doubles throughput. (2) Train-time quantization: BitNet b1.58 shows that ternary-weight models trained from scratch match FP16 quality — avoiding the accuracy cliff of aggressive PTQ. The practical production choice in 2025 is FP8 on H100/H200, with FP4 on the horizon for Blackwell.
🧠

Key Takeaways

What to remember for interviews

  1. LLM inference is memory-bandwidth bound: each token reads all weights from HBM once but does very little math per byte. Smaller weights = more tokens/sec.
  2. INT4 quantization fits a 70B model (35 GB) on a single A100 80GB with under 1% perplexity increase — enabling 2-3x faster generation vs FP16.
  3. GPTQ uses the inverse Hessian to optimally round weights layer by layer. AWQ identifies the ~1% salient weights (large activation channels) and scales them before quantization.
  4. BitNet b1.58 (2024): ternary {-1,0,+1} weights trained from scratch — 2.71× faster, 3.55× less GPU memory at 3B vs FP16. No floating-point multiply needed.
  5. FP8 is the 2025 production standard on H100 (vLLM default, DeepSeek-V3 trained in FP8, <0.5% accuracy loss). Blackwell B200/GB200 adds native FP4 tensor cores as the next step.
🧠

Recap quiz

Derivation

An A100 has 2 TB/s HBM bandwidth and 312 TFLOPS. Llama-2 70B in FP16 = 140 GB. At batch=1, roughly how many tokens/sec can the GPU generate, and what limits it?

Trade-off

At 4-bit on Llama-2 70B, AWQ achieves 3.74 perplexity vs GPTQ's 3.85 (WikiText-2). AWQ is also faster to run. What is the core reason AWQ beats GPTQ at 4-bit?

Derivation

Llama-2 70B in FP16 needs 140 GB and doesn't fit on one A100 80 GB. After GPTQ 4-bit quantization, how much memory does it need, and what is the perplexity cost on WikiText-2?

Trade-off

LLM.int8() achieves near-lossless INT8 inference for 175B+ models. What is the key mechanism, and why can't you just quantize all activations to INT8 naively?

Derivation

GPTQ requires a calibration dataset (e.g., 128 samples from C4). If you run GPTQ with no calibration data at all, what happens and why?

Trade-off

DeepSeek-V3 used FP8 training with two formats: E4M3 for forward pass weights/activations, E5M2 for gradients. Why the split, and what does each format optimize?

📚

Further Reading

🎯

Interview Questions


Explain the quantization formula. How do you choose scale and zero_point for asymmetric quantization?

★★☆
Google · Meta

Compare GPTQ and AWQ. When would you choose one over the other?

★★★
Meta · Google

What is the difference between weight-only quantization and weight+activation quantization? Why is weight-only more popular for LLMs?

★★☆
Google · Meta

What is quantization-aware training (QAT) and why is it better than post-training quantization (PTQ) at low bit-widths?

★★☆
Google · Meta

Explain the LLM.int8() method. What problem does it solve with outlier features?

★★☆
Meta · Google

How does FP8 training work, and why did DeepSeek-V3 use it? What are the two FP8 formats?

★★★
Google · Meta