📦 Quantization
4-bit Llama-70B fits in 35 GB — down from 140 GB
LLM inference is memory-bandwidth bound, not compute-bound. Each generated token reads the entire model from memory but does relatively little math. Quantization shrinks weights from 32-bit floats to 8-bit or even 4-bit integers — cutting memory 4-8x and speeding up inference proportionally.
The Quantization Pipeline
What you're seeing: The full quantization pipeline — from raw FP16 weights through calibration to packed INT8/INT4 integers, then dequantized back to FP16 at inference time. The key insight: you only pay the memory cost of integers, but the model still "computes" in floating point.
scale = (max - min) / 255 = 7.3 / 255 ≈ 0.0286. zero_point = round(-min / scale) = round(3.2 / 0.0286) = 112. This is the uint8 affine scheme (range [0, 255]). Writing Δ = scale and z = zero_point, quantize via q = clamp(round(x/Δ + z), 0, 255) and dequantize via x̂ = (q − z)·Δ.
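A quick numeric check of this example, assuming the figure's range of [-3.2, 4.1] (which gives the 7.3 spread above):

```python
# Round-trip the weight 0.7832 through the uint8 affine scheme.
# Assumed range from the example: min = -3.2, max = 4.1 (spread 7.3).
scale = (4.1 - (-3.2)) / 255        # ≈ 0.0286
zero_point = round(3.2 / scale)     # 112
x = 0.7832
q = round(x / scale + zero_point)   # quantize -> 139
x_hat = (q - zero_point) * scale    # dequantize -> ≈ 0.773
print(q, x_hat, abs(x - x_hat))     # error < scale/2 ≈ 0.0143
```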
Weight Precision: FP32 → FP16 → INT8 → INT4
What you're seeing: The same weight value (0.7832) stored at four precision levels. The bar shows relative memory usage; the error shows precision loss.
What to notice: INT4 uses 8x less memory than FP32, but the weight can only take on 16 discrete values — the error grows.
| Model | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Llama-2 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| Llama-2 70B | 280 GB | 140 GB | 70 GB | 35 GB |
The Intuition
The bottleneck is memory bandwidth, not compute. A 70B model in FP16 is 140 GB. A single A100 has 80 GB HBM. You can't even load it. INT4 quantization: 70B × 4 bits = 35 GB — fits on one GPU. The model runs 2–3× faster because it reads 4× less data from memory. The quality loss? Less than 1% perplexity increase.
| Format | Bits | 70B model size | Fits on A100 (80 GB)? | Perplexity impact |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | No (needs 4 GPUs) | Baseline |
| FP16 | 16 | 140 GB | No (needs 2 GPUs) | ~0% |
| INT8 | 8 | 70 GB | Yes (barely) | <0.5% |
| INT4 | 4 | 35 GB | Yes (with room) | <1% |
Why quantize? During autoregressive generation, each token reads every weight from memory exactly once. An A100 has 2 TB/s bandwidth and 312 TFLOPS — at 2 bytes per FP16 parameter, that's ~1T FP16 params/sec of memory reads, while compute can handle ~156T params/sec of multiply-adds (312 TFLOPS ÷ 2 FLOPs per parameter). The GPU is starving for data, not math. Smaller weights = more tokens/sec.
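A back-of-envelope sketch of that bound: at batch=1 every generated token streams the full weight set once, so decode speed is roughly bandwidth divided by model size (ignoring KV-cache and activation traffic).

```python
# Memory-bound decode estimate for Llama-2 70B on an A100 (2 TB/s HBM).
BANDWIDTH_BYTES_PER_S = 2e12
for fmt, gigabytes in [("FP16", 140), ("INT8", 70), ("INT4", 35)]:
    tokens_per_s = BANDWIDTH_BYTES_PER_S / (gigabytes * 1e9)
    print(f"{fmt}: ~{tokens_per_s:.0f} tokens/sec")  # ~14, ~29, ~57
```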
Weight-only vs. activation quantization. Weight-only quantization (GPTQ, AWQ) keeps weights in INT4/INT8 and dequantizes on-the-fly during computation. Activation quantization (LLM.int8(), SmoothQuant) also quantizes the input activations, enabling pure integer matmul on specialized hardware. Weight-only is more popular because LLM activations have extreme outliers that are hard to quantize.
PTQ vs. QAT. Post-training quantization (PTQ) quantizes a pre-trained model — fast and easy, works great at INT8. Quantization-aware training (QAT) inserts fake quantization during training so the model learns to tolerate the noise — essential for INT4 and below.
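A minimal sketch of the QAT building block: fake quantization with a straight-through estimator (STE), which rounds in the forward pass but lets gradients flow through as if rounding were the identity. Illustrative, not any specific library's implementation.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake quantization for QAT: forward rounds to the uint8 grid and
    dequantizes; backward passes gradients straight through (STE)."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        q = torch.clamp(torch.round(x / scale + zero_point), 0, 255)
        return (q - zero_point) * scale  # model trains on quantized values

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None      # identity gradient w.r.t. x

# Usage: the model sees quantization noise during training and adapts to it.
w = torch.randn(512, requires_grad=True)
w_q = FakeQuantSTE.apply(w, 0.05, 128)
w_q.sum().backward()                     # gradients reach w despite round()
```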
GPTQ uses the inverse Hessian to optimally round each weight column, compensating for rounding errors in remaining weights. One-shot, layer-by-layer — the GPTQ paper reports quantizing a 175B-parameter model to 3–4 bits in roughly four GPU hours with negligible accuracy loss.
AWQ (Activation-Aware) observes that only ~1% of weights are salient — they correspond to channels with large activations. AWQ scales these channels up before quantization, effectively giving them more precision. Simpler and faster than GPTQ, better generalization.
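A sketch of the core AWQ idea under simplifying assumptions: the paper searches for the best scaling exponent per layer, whereas the fixed alpha=0.5 here is hypothetical, and in practice the inverse scale is folded into the previous layer.

```python
import torch

def awq_style_scaling(W: torch.Tensor, act_mag: torch.Tensor, alpha: float = 0.5):
    """Scale salient input channels up before quantization.
    W: (out_features, in_features); act_mag: mean |activation| per input channel."""
    s = act_mag.clamp(min=1e-5) ** alpha  # big activations -> big scale -> more precision
    W_scaled = W * s                      # broadcast over the input-channel dim
    return W_scaled, s                    # at inference: y = (x / s) @ quant(W_scaled).T
```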
FP8 training (DeepSeek-V3, H100 native): two formats — E4M3 for forward pass (more precision, narrower range), E5M2 for gradients (wider range, less precision). Cuts memory ~50% vs FP16 with negligible quality loss.
FP8 Training (H100): H100 GPUs include native FP8 tensor cores that achieve roughly twice the throughput of FP16. Two formats are used together: E4M3 (4 exponent bits, 3 mantissa) for forward pass weights and activations — higher precision, narrower range — and E5M2 (5 exponent bits, 2 mantissa) for gradients — wider range to capture small gradient values. DeepSeek-V3 trained its 671B MoE model with FP8, cutting training memory ~50% vs BF16 with negligible quality loss. The key techniques: per-tensor scaling factors, loss scaling to prevent gradient underflow, and keeping normalization layers in higher precision.
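A minimal per-tensor scaling sketch using PyTorch's FP8 dtypes (available since PyTorch 2.1; the helper name is illustrative):

```python
import torch

def cast_fp8_e4m3(x: torch.Tensor):
    """Per-tensor scaled FP8 cast: rescale so the max |value| lands at E4M3's
    max representable magnitude (448), cast, and keep the scale for dequant."""
    scale = torch.finfo(torch.float8_e4m3fn).max / x.abs().max()
    return (x * scale).to(torch.float8_e4m3fn), scale

x = torch.randn(4096)
x_fp8, scale = cast_fp8_e4m3(x)
x_back = x_fp8.float() / scale
print((x - x_back).abs().mean())  # small round-trip error
```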
1-bit LLMs — BitNet b1.58 (2024): Microsoft Research's BitNet b1.58 uses ternary weights {-1, 0, 1}, where the 1.58 refers to log₂(3) ≈ 1.58 bits per weight. The paper reports parity with same-size full-precision LLaMA baselines while substantially reducing memory and matmul cost — because weight-activation multiplication reduces to addition and subtraction. The tradeoff: BitNet requires training from scratch with ternary-aware optimization; you cannot quantize an existing FP16 model to ternary weights without severe quality loss.
AQLM / QuIP# (2024): Lattice-based codebook quantization methods that push toward sub-2-bit weights. QuIP# (Tseng et al.) applies randomized Hadamard transforms to weight matrices before quantization — this incoherence processing makes the weight distribution more uniform, so a small codebook (e.g., 256 entries for 2-bit) can cover the range with less error. AQLM (Egiazarian et al.) uses additive quantization with learned codebooks applied recursively. Both achieve near-lossless 2-bit compression on Llama-2 70B — something round-to-nearest INT2 completely fails at — at the cost of slower quantization setup and slightly higher decode overhead.
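A sketch of the incoherence-processing step (randomized Hadamard transform), assuming a power-of-two dimension; the transform is orthogonal, so it can be undone exactly after dequantization.

```python
import torch

def randomized_hadamard(W: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Multiply W by diag(random signs) then an orthonormal Hadamard matrix.
    Outlier weights get spread across all coordinates, flattening the
    distribution so a small codebook covers it with less error."""
    n = W.shape[1]                          # assumed to be a power of two
    H = torch.ones(1, 1)
    while H.shape[0] < n:                   # Sylvester construction
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    H = H / n ** 0.5                        # orthonormal: H @ H.T = I
    torch.manual_seed(seed)
    s = torch.randint(0, 2, (n,)).float() * 2 - 1  # random ±1 signs
    return (W * s) @ H

W = torch.randn(64, 64)
W[0, 0] = 25.0                              # plant an extreme outlier
print(W.abs().max(), randomized_hadamard(W).abs().max())  # outlier flattened
```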
NF4 (Normalized Float 4-bit) is a non-uniform quantization format introduced by QLoRA (Dettmers et al., 2023). Standard INT4 uses 16 evenly spaced levels. NF4 instead places quantization levels at the quantiles of a standard normal distribution — matching the empirical distribution of neural network weights, which are approximately Gaussian. This means each level represents an equal fraction of the weight values, minimizing information loss. Concretely, the 16 levels are quantiles of N(0, 1) rescaled to [-1, 1], so weights near zero — where most of the mass lies — get the finest resolution. NF4 also supports double quantization: the quantization constants (scales) are themselves quantized to FP8, saving an additional 0.5 bits per parameter. Together, NF4 + double quantization enable fitting a 70B model in roughly 35 GB — on a single A100 80GB with room for activations and KV cache.
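The quantile idea in code: an illustrative construction of NF4-style levels. QLoRA fixes an exact 16-entry table; the tail probabilities chosen here are an assumption for the sketch.

```python
import torch

normal = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)   # clip tails: quantiles at 0/1 are infinite
levels = normal.icdf(probs)
levels = levels / levels.abs().max()     # rescale to [-1, 1]
print(levels)  # non-uniform grid: dense near 0 where most weights live
```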
Quick check
INT4 quantization reduces Llama-2 70B from 140 GB to 35 GB. An A100 has 2 TB/s HBM bandwidth. How many more tokens/sec does INT4 generate vs FP16 at batch=1?
Why is LLM inference memory-bandwidth bound rather than compute-bound?
Step-by-Step Derivation
Uniform Quantization (Asymmetric)
Map a floating-point value $x$ to a $b$-bit integer:

$$q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{\Delta} + z\right),\ 0,\ 2^b - 1\right)$$

where $\Delta$ (scale) and $z$ (zero point) are:

$$\Delta = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad z = \operatorname{round}\!\left(-\frac{x_{\min}}{\Delta}\right)$$

Dequantize via $\hat{x} = (q - z)\,\Delta$.
Memory Savings
For a model with $N$ parameters:

$$\text{memory (bytes)} = N \times \frac{\text{bits per weight}}{8}$$

A 70B parameter model: FP16 $= 70\text{B} \times 2$ bytes $= 140$ GB, INT4 $= 70\text{B} \times 0.5$ bytes $= 35$ GB — fits on a single 48GB GPU.
GPTQ Layer-wise Objective
Minimize reconstruction error for each layer, using the Hessian $H = 2XX^\top$ of the layer inputs:

$$\arg\min_{\hat{W}} \left\lVert WX - \hat{W}X \right\rVert_2^2$$

GPTQ processes columns sequentially: quantize column $i$, then update the remaining columns using row $i$ of the inverse Hessian $H^{-1}$ to compensate for the rounding error.
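A simplified sketch of that column loop, omitting the blocking, Cholesky tricks, and group-wise scales of the real implementation; `quant_fn` is a placeholder for round-to-grid:

```python
import torch

def gptq_column_loop(W: torch.Tensor, H_inv: torch.Tensor, quant_fn):
    """Quantize columns left to right; after each column, push its rounding
    error into the not-yet-quantized columns via the inverse Hessian row."""
    W = W.clone()
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        q = quant_fn(W[:, i])                    # round column i to the grid
        Q[:, i] = q
        err = (W[:, i] - q) / H_inv[i, i]        # scaled rounding error
        W[:, i + 1:] -= err[:, None] * H_inv[i, i + 1:][None, :]
    return Q
```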
PyTorch: 4-bit Loading with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# NF4 quantization (QLoRA style)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # normalized float 4-bit
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# 70B model now fits in ~35GB VRAM (single A100)
PyTorch: GPTQ via auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
gptq_config = GPTQConfig(
bits=4,
dataset="c4", # calibration dataset
tokenizer=tokenizer, # required for calibration
group_size=128, # quantize in groups of 128 weights
desc_act=True, # Hessian-based column ordering
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=gptq_config,
device_map="auto",
)
PyTorch implementation
# Manual uint8 affine quantization: compute scale/zero-point, quantize, dequantize
import torch
def quantize_uint8(x: torch.Tensor):
"""Asymmetric per-tensor 8-bit affine quantization (uint8, range [0, 255])."""
x_min, x_max = x.min().item(), x.max().item()
scale = (x_max - x_min) / 255.0 # Δ: maps [x_min, x_max] → [0, 255]
zero_point = round(-x_min / scale) # z: offset so x_min maps to 0
q = torch.clamp(torch.round(x / scale + zero_point), 0, 255).to(torch.uint8)
return q, scale, zero_point
def dequantize_uint8(q, scale, zero_point):
return (q.float() - zero_point) * scale
# torch.quantization.quantize_dynamic (PTQ for linear layers)
import torch.nn as nn
model = nn.Linear(512, 512)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
Quick check
Asymmetric quantization (zero_point != 0) adds a zero_point parameter per tensor. For what weight distribution does symmetric quantization waste representable levels, and how much?
Break It — See What Happens
Quick check
You run GPTQ on Llama-2 70B without providing any calibration data. What is the quantization quality, and why?
Real-World Numbers
| Model | Method | Memory | Quality (Perplexity) |
|---|---|---|---|
| Llama-2 70B | FP16 (baseline) | 140 GB | 3.32 (WikiText-2) |
| Llama-2 70B | GPTQ 4-bit | 35 GB | 3.85 (WikiText-2) |
| Llama-2 70B | AWQ 4-bit | 35 GB | 3.74 (WikiText-2) |
| Llama-2 70B | LLM.int8() | 70 GB | ~baseline (near-lossless) |
| DeepSeek-V3 | FP8 training | ~50% of FP16 | Negligible loss vs BF16 training |
Quick check
AWQ 4-bit achieves 3.74 perplexity on Llama-2 70B while GPTQ 4-bit achieves 3.85. AWQ is also faster to quantize. What tradeoff makes GPTQ potentially preferable despite lower quality?
SOTA 2024–2025: Beyond INT4 — 1-bit and FP8
BitNet b1.58 — ternary weights trained from scratch (Feb 2024, arxiv:2402.17764)
BitNet b1.58 trains models with ternary weights — 1.58 bits per weight (log₂3 ≈ 1.58). Critically, this is not PTQ (post-training quantization applied to an FP16 model) — the model is trained from scratch with ternary weights via straight-through estimators. Results at 3B parameters: 2.71× faster inference, 3.55× less GPU memory, and substantially lower energy cost vs an equivalent FP16 model — because ternary weights replace floating-point multiplications with additions and sign flips. Quality matches BF16 baselines at the same model size.
Why 1.58 bits? The ternary quantization math
Given a weight matrix $W$, BitNet quantizes each weight to $\tilde{W}_{ij} = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{W_{ij}}{\gamma}\right), -1, 1\right)$ where $\gamma = \operatorname{mean}(|W|)$ (per-tensor absolute mean). The result is always in $\{-1, 0, +1\}$.
At inference, the weight matrix is stored as 2-bit integers (1 byte per 4 weights). The matmul becomes a lookup + accumulate (no floating-point multiply) — this is what enables the CPU speedup.
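A sketch of that 2-bit packing; the code mapping here (-1→0, 0→1, +1→2) is an assumption, and real kernels fuse unpacking into the matmul:

```python
import torch

def pack_ternary(w_hat: torch.Tensor) -> torch.Tensor:
    """Pack ternary int8 weights {-1,0,+1} into uint8, 4 weights per byte.
    Assumes w_hat.numel() is divisible by 4."""
    codes = (w_hat.flatten() + 1).to(torch.uint8)  # {-1,0,+1} -> {0,1,2}
    return codes[0::4] | (codes[1::4] << 2) | (codes[2::4] << 4) | (codes[3::4] << 6)

def unpack_ternary(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_ternary: recover a flat int8 tensor of {-1,0,+1}."""
    parts = [(packed >> shift) & 3 for shift in (0, 2, 4, 6)]
    return torch.stack(parts, dim=1).flatten().to(torch.int8) - 1
```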
import torch
def quantize_ternary(W: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
"""BitNet b1.58 per-tensor ternary quantization (arxiv:2402.17764)."""
gamma = W.abs().mean() # per-tensor scale
W_hat = (W / gamma).round().clamp(-1, 1).to(torch.int8)
return W_hat, gamma # store (ternary weights, scale)
def dequant_matmul(x: torch.Tensor, W_hat: torch.Tensor, gamma: float) -> torch.Tensor:
# W_hat is {-1,0,+1} int8 — cast to float for matmul
return x @ W_hat.float().T * gamma
FP8 — production standard on H100 (2024–2025)
vLLM defaults to FP8 weight quantization on H100 (vllm docs). DeepSeek-V3 (671B MoE) was trained in FP8 — the first large open model to demonstrate FP8 pre-training viability. Quality impact: under 0.5% accuracy loss. The H100 supports two FP8 formats: E4M3 (4 exponent bits, 3 mantissa) for forward pass activations and weights (more mantissa precision, narrower range), and E5M2 (5 exponent, 2 mantissa) for gradients (wider dynamic range to capture small gradient values). Combined with FP8 KV cache, FP8 end-to-end roughly doubles effective throughput vs FP16 on H100.
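The range/precision split is visible directly in PyTorch's FP8 dtypes (PyTorch 2.1+), as in this quick comparison sketch:

```python
import torch

e4m3, e5m2 = torch.float8_e4m3fn, torch.float8_e5m2
print(torch.finfo(e4m3).max)  # 448.0   -- narrow range, finer mantissa
print(torch.finfo(e5m2).max)  # 57344.0 -- wide range, coarser mantissa

x = torch.randn(10_000)       # unit-scale values, like normalized activations
for fmt in (e4m3, e5m2):
    err = (x - x.to(fmt).float()).abs().mean()
    print(fmt, f"mean round-trip error: {err:.5f}")  # E4M3 rounds more finely
```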
FP4 — Blackwell (B200/GB200) next frontier
NVIDIA's B200/GB200 (Blackwell, 2025) adds native FP4 Tensor Core support — the first generation where 4-bit floating-point inference is a first-class hardware feature rather than a software workaround. FP4 offers 2× the throughput of FP8 at the same chip area. Early results from NVIDIA show FP4 inference for LLMs within 1% perplexity of FP8 with per-group scaling. This makes FP4 the likely production default for the 2026 generation of deployments, following the FP16 → FP8 → FP4 trajectory.
Extended quantization comparison (2024–2025)
| Format | Bits/weight | Method | Quality vs FP16 | Hardware support |
|---|---|---|---|---|
| INT4 (AWQ/GPTQ) | 4 | PTQ | <1% ppl increase | A100, H100 (emulated) |
| FP8 (E4M3) | 8 | PTQ or QAT | <0.5% accuracy loss | H100 native tensor cores |
| BitNet b1.58 | 1.58 | Train from scratch | Matches BF16 | Custom kernels (no native hardware support) |
| FP4 | 4 | PTQ + per-group scale | ~1% ppl (early) | B200/GB200 native (2025) |
Key Takeaways
What to remember for interviews
1. LLM inference is memory-bandwidth bound: each token reads all weights from HBM once but does very little math per byte. Smaller weights = more tokens/sec.
2. INT4 quantization fits a 70B model (35 GB) on a single A100 80GB with under 1% perplexity increase — enabling 2-3x faster generation vs FP16.
3. GPTQ uses the inverse Hessian to optimally round weights layer by layer. AWQ identifies the ~1% salient weights (large activation channels) and scales them before quantization.
4. BitNet b1.58 (2024): ternary {-1,0,+1} weights trained from scratch — 2.71× faster, 3.55× less GPU memory at 3B vs FP16. No floating-point multiply needed.
5. FP8 is the 2025 production standard on H100 (vLLM default, DeepSeek-V3 trained in FP8, <0.5% accuracy loss). Blackwell B200/GB200 adds native FP4 tensor cores as the next step.
Recap quiz
An A100 has 2 TB/s HBM bandwidth and 312 TFLOPS. Llama-2 70B in FP16 = 140 GB. At batch=1, roughly how many tokens/sec can the GPU generate, and what limits it?
At 4-bit on Llama-2 70B, AWQ achieves 3.74 perplexity vs GPTQ's 3.85 (WikiText-2). AWQ is also faster to run. What is the core reason AWQ beats GPTQ at 4-bit?
Llama-2 70B in FP16 needs 140 GB and doesn't fit on one A100 80 GB. After GPTQ 4-bit quantization, how much memory does it need, and what is the perplexity cost on WikiText-2?
LLM.int8() achieves near-lossless INT8 inference for 175B+ models. What is the key mechanism, and why can't you just quantize all activations to INT8 naively?
GPTQ requires a calibration dataset (e.g., 128 samples from C4). If you run GPTQ with no calibration data at all, what happens and why?
DeepSeek-V3 used FP8 training with two formats: E4M3 for forward pass weights/activations, E5M2 for gradients. Why the split, and what does each format optimize?
Further Reading
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — One-shot weight quantization using approximate second-order information, enabling 3-4 bit models with minimal quality loss.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Protects salient weight channels identified by activation magnitudes, achieving better quality than round-to-nearest at 4-bit.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Migrates quantization difficulty from activations to weights via per-channel smoothing, enabling W8A8 quantization.
- Lilian Weng's Blog — Technical posts on model compression, quantization theory, and efficient inference for LLMs.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Dettmers et al. 2022 — mixed-precision INT8 quantization that handles outlier activations separately, enabling near-lossless 8-bit inference for 175B+ models.
- FP8 Formats for Deep Learning — Micikevicius et al. 2022 — defines E4M3 and E5M2 FP8 formats and training recipes. The format used by DeepSeek-V3 and H100 tensor cores.
Interview Questions
- ★★☆ Explain the quantization formula. How do you choose scale and zero_point for asymmetric quantization?
- ★★★ Compare GPTQ and AWQ. When would you choose one over the other?
- ★★☆ What is the difference between weight-only quantization and weight+activation quantization? Why is weight-only more popular for LLMs?
- ★★☆ What is quantization-aware training (QAT) and why is it better than post-training quantization (PTQ) at low bit-widths?
- ★★☆ Explain the LLM.int8() method. What problem does it solve with outlier features?
- ★★★ How does FP8 training work, and why did DeepSeek-V3 use it? What are the two FP8 formats?