📦 Quantization
4-bit Llama-70B fits in 35 GB — down from 140 GB
LLM inference is memory-bandwidth bound, not compute-bound. Each generated token reads the entire model from memory but does relatively little math. Quantization shrinks weights from 32-bit floats to 8-bit or even 4-bit integers — cutting memory 4-8x and speeding up inference proportionally.
The Quantization Pipeline
What you're seeing: The full quantization pipeline — from raw FP16 weights through calibration to packed INT8/INT4 integers, then dequantized back to FP16 at inference time. The key insight: you only pay the memory cost of integers, but the model still "computes" in floating point.
scale = (max - min) / 255 = 7.3 / 255 ≈ 0.0286. zero_point = round(-min / scale) = round(3.2 / 0.0286) = 112. This is the uint8 affine scheme (range [0, 255]). Writing Δ = scale and z = zero_point, quantize via q = clamp(round(x/Δ + z), 0, 255) and dequantize via x̂ = (q − z)·Δ.
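A quick numeric check of this example, assuming the figure's range of [-3.2, 4.1] (which gives the 7.3 spread above):

```python
# Round-trip the weight 0.7832 through the uint8 affine scheme.
# Assumed range from the example: min = -3.2, max = 4.1 (spread 7.3).
scale = (4.1 - (-3.2)) / 255        # ≈ 0.0286
zero_point = round(3.2 / scale)     # 112
x = 0.7832
q = round(x / scale + zero_point)   # quantize -> 139
x_hat = (q - zero_point) * scale    # dequantize -> ≈ 0.773
print(q, x_hat, abs(x - x_hat))     # error < scale/2 ≈ 0.0143
```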
Weight Precision: FP32 → FP16 → INT8 → INT4
What you're seeing: The same weight value (0.7832) stored at four precision levels. The bar shows relative memory usage; the error shows precision loss.
What to notice: INT4 uses 8x less memory than FP32, but the weight can only take on 16 discrete values — the error grows.
| Model | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Llama-2 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| Llama-2 70B | 280 GB | 140 GB | 70 GB | 35 GB |
The Intuition
The bottleneck is memory bandwidth, not compute. A 70B model in FP16 is 140 GB. A single A100 has 80 GB HBM. You can't even load it. INT4 quantization: 70B × 4 bits = 35 GB — fits on one GPU. The model runs 2–3× faster because it reads 4× less data from memory. The quality loss? Less than 1% perplexity increase.
| Format | Bits | 70B model size | Fits on A100 (80 GB)? | Perplexity impact |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | No (needs 4 GPUs) | Baseline |
| FP16 | 16 | 140 GB | No (needs 2 GPUs) | ~0% |
| INT8 | 8 | 70 GB | Yes (barely) | <0.5% |
| INT4 | 4 | 35 GB | Yes (with room) | <1% |
Why quantize? During autoregressive generation, each token reads every weight from memory exactly once. An A100 has 2 TB/s bandwidth and 312 TFLOPS — at 2 bytes per FP16 parameter, that's ~1T FP16 params/sec of memory reads, while compute can handle ~156T params/sec of multiply-adds (312 TFLOPS ÷ 2 FLOPs per parameter). The GPU is starving for data, not math. Smaller weights = more tokens/sec.
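A back-of-envelope sketch of that bound: at batch=1 every generated token streams the full weight set once, so decode speed is roughly bandwidth divided by model size (ignoring KV-cache and activation traffic).

```python
# Memory-bound decode estimate for Llama-2 70B on an A100 (2 TB/s HBM).
BANDWIDTH_BYTES_PER_S = 2e12
for fmt, gigabytes in [("FP16", 140), ("INT8", 70), ("INT4", 35)]:
    tokens_per_s = BANDWIDTH_BYTES_PER_S / (gigabytes * 1e9)
    print(f"{fmt}: ~{tokens_per_s:.0f} tokens/sec")  # ~14, ~29, ~57
```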
Weight-only vs. activation quantization. Weight-only quantization (GPTQ, AWQ) keeps weights in INT4/INT8 and dequantizes on-the-fly during computation. Activation quantization (LLM.int8(), SmoothQuant) also quantizes the input activations, enabling pure integer matmul on specialized hardware. Weight-only is more popular because LLM activations have extreme outliers that are hard to quantize.
PTQ vs. QAT. Post-training quantization (PTQ) quantizes a pre-trained model — fast and easy, works great at INT8. Quantization-aware training (QAT) inserts fake quantization during training so the model learns to tolerate the noise — essential for INT4 and below.
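A minimal sketch of the QAT building block: fake quantization with a straight-through estimator (STE), which rounds in the forward pass but lets gradients flow through as if rounding were the identity. Illustrative, not any specific library's implementation.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake quantization for QAT: forward rounds to the uint8 grid and
    dequantizes; backward passes gradients straight through (STE)."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        q = torch.clamp(torch.round(x / scale + zero_point), 0, 255)
        return (q - zero_point) * scale  # model trains on quantized values

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None      # identity gradient w.r.t. x

# Usage: the model sees quantization noise during training and adapts to it.
w = torch.randn(512, requires_grad=True)
w_q = FakeQuantSTE.apply(w, 0.05, 128)
w_q.sum().backward()                     # gradients reach w despite round()
```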
GPTQ uses the inverse Hessian to optimally round each weight column, compensating for rounding errors in remaining weights. One-shot, layer-by-layer — the GPTQ paper reports quantizing a 175B-parameter model to 3–4 bits in roughly four GPU hours with negligible accuracy loss.
AWQ (Activation-Aware) observes that only ~1% of weights are salient — they correspond to channels with large activations. AWQ scales these channels up before quantization, effectively giving them more precision. Simpler and faster than GPTQ, better generalization.
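A sketch of the core AWQ idea under simplifying assumptions: the paper searches for the best scaling exponent per layer, whereas the fixed alpha=0.5 here is hypothetical, and in practice the inverse scale is folded into the previous layer.

```python
import torch

def awq_style_scaling(W: torch.Tensor, act_mag: torch.Tensor, alpha: float = 0.5):
    """Scale salient input channels up before quantization.
    W: (out_features, in_features); act_mag: mean |activation| per input channel."""
    s = act_mag.clamp(min=1e-5) ** alpha  # big activations -> big scale -> more precision
    W_scaled = W * s                      # broadcast over the input-channel dim
    return W_scaled, s                    # at inference: y = (x / s) @ quant(W_scaled).T
```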
FP8 training (DeepSeek-V3, H100 native): two formats — E4M3 for forward pass (more precision, narrower range), E5M2 for gradients (wider range, less precision). Cuts memory ~50% vs FP16 with negligible quality loss.
FP8 Training (H100): H100 GPUs include native FP8 tensor cores that achieve roughly twice the throughput of FP16. Two formats are used together: E4M3 (4 exponent bits, 3 mantissa) for forward pass weights and activations — higher precision, narrower range — and E5M2 (5 exponent bits, 2 mantissa) for gradients — wider range to capture small gradient values. DeepSeek-V3 trained its 671B MoE model with FP8, cutting training memory ~50% vs BF16 with negligible quality loss. The key techniques: per-tensor scaling factors, loss scaling to prevent gradient underflow, and keeping normalization layers in higher precision.
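A minimal per-tensor scaling sketch using PyTorch's FP8 dtypes (available since PyTorch 2.1; the helper name is illustrative):

```python
import torch

def cast_fp8_e4m3(x: torch.Tensor):
    """Per-tensor scaled FP8 cast: rescale so the max |value| lands at E4M3's
    max representable magnitude (448), cast, and keep the scale for dequant."""
    scale = torch.finfo(torch.float8_e4m3fn).max / x.abs().max()
    return (x * scale).to(torch.float8_e4m3fn), scale

x = torch.randn(4096)
x_fp8, scale = cast_fp8_e4m3(x)
x_back = x_fp8.float() / scale
print((x - x_back).abs().mean())  # small round-trip error
```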
1-bit LLMs — BitNet b1.58 (2024): Microsoft Research's BitNet b1.58 uses ternary weights {-1, 0, 1}, where the 1.58 refers to log₂(3) ≈ 1.58 bits per weight. The paper reports parity with same-size full-precision LLaMA baselines while substantially reducing memory and matmul cost — because weight-activation multiplication reduces to addition and subtraction. The tradeoff: BitNet requires training from scratch with ternary-aware optimization; you cannot quantize an existing FP16 model to ternary weights without severe quality loss.
AQLM / QuIP# (2024): Lattice-based codebook quantization methods that push toward sub-2-bit weights. QuIP# (Tseng et al.) applies randomized Hadamard transforms to weight matrices before quantization — this incoherence processing makes the weight distribution more uniform, so a small codebook (e.g., 256 entries for 2-bit) can cover the range with less error. AQLM (Egiazarian et al.) uses additive quantization with learned codebooks applied recursively. Both achieve near-lossless 2-bit compression on Llama-2 70B — something round-to-nearest INT2 completely fails at — at the cost of slower quantization setup and slightly higher decode overhead.
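A sketch of the incoherence-processing step (randomized Hadamard transform), assuming a power-of-two dimension; the transform is orthogonal, so it can be undone exactly after dequantization.

```python
import torch

def randomized_hadamard(W: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Multiply W by diag(random signs) then an orthonormal Hadamard matrix.
    Outlier weights get spread across all coordinates, flattening the
    distribution so a small codebook covers it with less error."""
    n = W.shape[1]                          # assumed to be a power of two
    H = torch.ones(1, 1)
    while H.shape[0] < n:                   # Sylvester construction
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    H = H / n ** 0.5                        # orthonormal: H @ H.T = I
    torch.manual_seed(seed)
    s = torch.randint(0, 2, (n,)).float() * 2 - 1  # random ±1 signs
    return (W * s) @ H

W = torch.randn(64, 64)
W[0, 0] = 25.0                              # plant an extreme outlier
print(W.abs().max(), randomized_hadamard(W).abs().max())  # outlier flattened
```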
NF4 (Normalized Float 4-bit) is a non-uniform quantization format introduced by QLoRA (Dettmers et al., 2023). Standard INT4 uses 16 evenly spaced levels. NF4 instead places quantization levels at the quantiles of a standard normal distribution — matching the empirical distribution of neural network weights, which are approximately Gaussian. This means each level represents an equal fraction of the weight values, minimizing information loss. Concretely, the 16 levels are quantiles of N(0, 1) rescaled to [-1, 1], so weights near zero — where most of the mass lies — get the finest resolution. NF4 also supports double quantization: the quantization constants (scales) are themselves quantized to FP8, saving an additional 0.5 bits per parameter. Together, NF4 + double quantization enable fitting a 70B model in roughly 35 GB — on a single A100 80GB with room for activations and KV cache.
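The quantile idea in code: an illustrative construction of NF4-style levels. QLoRA fixes an exact 16-entry table; the tail probabilities chosen here are an assumption for the sketch.

```python
import torch

normal = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)   # clip tails: quantiles at 0/1 are infinite
levels = normal.icdf(probs)
levels = levels / levels.abs().max()     # rescale to [-1, 1]
print(levels)  # non-uniform grid: dense near 0 where most weights live
```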
Quick check
INT4 quantization reduces Llama-2 70B from 140 GB to 35 GB. An A100 has 2 TB/s HBM bandwidth. How many more tokens/sec does INT4 generate vs FP16 at batch=1?
Why is LLM inference memory-bandwidth bound rather than compute-bound?
Step-by-Step Derivation
Uniform Quantization (Asymmetric)
Map a floating-point value $x$ to a $b$-bit integer:

$$q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{\Delta} + z\right),\ 0,\ 2^b - 1\right)$$

where $\Delta$ (scale) and $z$ (zero point) are:

$$\Delta = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad z = \operatorname{round}\!\left(-\frac{x_{\min}}{\Delta}\right)$$

Dequantize via $\hat{x} = (q - z)\,\Delta$.
Memory Savings
For a model with $N$ parameters:

$$\text{memory (bytes)} = N \times \frac{\text{bits per weight}}{8}$$

A 70B parameter model: FP16 $= 70\text{B} \times 2$ bytes $= 140$ GB, INT4 $= 70\text{B} \times 0.5$ bytes $= 35$ GB — fits on a single 48GB GPU.
GPTQ Layer-wise Objective
Minimize reconstruction error for each layer, using the Hessian $H = 2XX^\top$ of the layer inputs:

$$\arg\min_{\hat{W}} \left\lVert WX - \hat{W}X \right\rVert_2^2$$

GPTQ processes columns sequentially: quantize column $i$, then update the remaining columns using row $i$ of the inverse Hessian $H^{-1}$ to compensate for the rounding error.
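A simplified sketch of that column loop, omitting the blocking, Cholesky tricks, and group-wise scales of the real implementation; `quant_fn` is a placeholder for round-to-grid:

```python
import torch

def gptq_column_loop(W: torch.Tensor, H_inv: torch.Tensor, quant_fn):
    """Quantize columns left to right; after each column, push its rounding
    error into the not-yet-quantized columns via the inverse Hessian row."""
    W = W.clone()
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        q = quant_fn(W[:, i])                    # round column i to the grid
        Q[:, i] = q
        err = (W[:, i] - q) / H_inv[i, i]        # scaled rounding error
        W[:, i + 1:] -= err[:, None] * H_inv[i, i + 1:][None, :]
    return Q
```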
PyTorch: 4-bit Loading with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# NF4 quantization (QLoRA style)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # normalized float 4-bit
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# 70B model now fits in ~35GB VRAM (single A100)
PyTorch: GPTQ via auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
gptq_config = GPTQConfig(
bits=4,
dataset="c4", # calibration dataset
tokenizer=tokenizer, # required for calibration
group_size=128, # quantize in groups of 128 weights
desc_act=True, # Hessian-based column ordering
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=gptq_config,
device_map="auto",
)
PyTorch implementation
# Manual uint8 affine quantization: compute scale/zero-point, quantize, dequantize
import torch
def quantize_uint8(x: torch.Tensor):
"""Asymmetric per-tensor 8-bit affine quantization (uint8, range [0, 255])."""
x_min, x_max = x.min().item(), x.max().item()
scale = (x_max - x_min) / 255.0 # Δ: maps [x_min, x_max] → [0, 255]
zero_point = round(-x_min / scale) # z: offset so x_min maps to 0
q = torch.clamp(torch.round(x / scale + zero_point), 0, 255).to(torch.uint8)
return q, scale, zero_point
def dequantize_uint8(q, scale, zero_point):
return (q.float() - zero_point) * scale
# torch.quantization.quantize_dynamic (PTQ for linear layers)
import torch.nn as nn
model = nn.Linear(512, 512)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
Quick check
Asymmetric quantization (zero_point != 0) adds a zero_point parameter per tensor. For what weight distribution does symmetric quantization waste representable levels, and how much?
Break It — See What Happens
Quick check
You run GPTQ on Llama-2 70B without providing any calibration data. What is the quantization quality, and why?
Real-World Numbers
| Model | Method | Memory | Quality (Perplexity) |
|---|---|---|---|
| Llama-2 70B | FP16 (baseline) | 140 GB | 3.32 (WikiText-2) |
| Llama-2 70B | GPTQ 4-bit | 35 GB | 3.85 (WikiText-2) |
| Llama-2 70B | AWQ 4-bit | 35 GB | 3.74 (WikiText-2) |
| Llama-2 70B | LLM.int8() | 70 GB | ~baseline (near-lossless) |
| DeepSeek-V3 | FP8 training | ~50% of FP16 | Negligible loss vs BF16 training |
Quick check
AWQ 4-bit achieves 3.74 perplexity on Llama-2 70B while GPTQ 4-bit achieves 3.85. AWQ is also faster to quantize. What tradeoff makes GPTQ potentially preferable despite lower quality?
SOTA 2024–2025: Beyond INT4 — 1-bit and FP8
BitNet b1.58 — ternary weights trained from scratch (Feb 2024, arxiv:2402.17764)
BitNet b1.58 trains models with ternary weights — 1.58 bits per weight (log₂3 ≈ 1.58). Critically, this is not PTQ (post-training quantization applied to an FP16 model) — the model is trained from scratch with ternary weights via straight-through estimators. Results at 3B parameters: 2.71× faster inference, 3.55× less GPU memory, and substantially lower energy cost vs an equivalent FP16 model — because ternary weights replace floating-point multiplications with additions and sign flips. Quality matches BF16 baselines at the same model size.
Why 1.58 bits? The ternary quantization math
Given a weight matrix $W$, BitNet quantizes each weight to $\tilde{W}_{ij} = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{W_{ij}}{\gamma}\right), -1, 1\right)$ where $\gamma = \operatorname{mean}(|W|)$ (per-tensor absolute mean). The result is always in $\{-1, 0, +1\}$.
At inference, the weight matrix is stored as 2-bit integers (1 byte per 4 weights). The matmul becomes a lookup + accumulate (no floating-point multiply) — this is what enables the CPU speedup.
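A sketch of that 2-bit packing; the code mapping here (-1→0, 0→1, +1→2) is an assumption, and real kernels fuse unpacking into the matmul:

```python
import torch

def pack_ternary(w_hat: torch.Tensor) -> torch.Tensor:
    """Pack ternary int8 weights {-1,0,+1} into uint8, 4 weights per byte.
    Assumes w_hat.numel() is divisible by 4."""
    codes = (w_hat.flatten() + 1).to(torch.uint8)  # {-1,0,+1} -> {0,1,2}
    return codes[0::4] | (codes[1::4] << 2) | (codes[2::4] << 4) | (codes[3::4] << 6)

def unpack_ternary(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_ternary: recover a flat int8 tensor of {-1,0,+1}."""
    parts = [(packed >> shift) & 3 for shift in (0, 2, 4, 6)]
    return torch.stack(parts, dim=1).flatten().to(torch.int8) - 1
```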
import torch
def quantize_ternary(W: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
"""BitNet b1.58 per-tensor ternary quantization (arxiv:2402.17764)."""
gamma = W.abs().mean() # per-tensor scale
W_hat = (W / gamma).round().clamp(-1, 1).to(torch.int8)
return W_hat, gamma # store (ternary weights, scale)
def dequant_matmul(x: torch.Tensor, W_hat: torch.Tensor, gamma: float) -> torch.Tensor:
# W_hat is {-1,0,+1} int8 — cast to float for matmul
return x @ W_hat.float().T * gamma
FP8 — production standard on H100 (2024–2025)
vLLM defaults to FP8 weight quantization on H100 (vllm docs). DeepSeek-V3 (671B MoE) was trained in FP8 — the first large open model to demonstrate FP8 pre-training viability. Quality impact: under 0.5% accuracy loss. The H100 supports two FP8 formats: E4M3 (4 exponent bits, 3 mantissa) for forward pass activations and weights (more mantissa precision, narrower range), and E5M2 (5 exponent, 2 mantissa) for gradients (wider dynamic range to capture small gradient values). Combined with FP8 KV cache, FP8 end-to-end roughly doubles effective throughput vs FP16 on H100.
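The range/precision split is visible directly in PyTorch's FP8 dtypes (PyTorch 2.1+), as in this quick comparison sketch:

```python
import torch

e4m3, e5m2 = torch.float8_e4m3fn, torch.float8_e5m2
print(torch.finfo(e4m3).max)  # 448.0   -- narrow range, finer mantissa
print(torch.finfo(e5m2).max)  # 57344.0 -- wide range, coarser mantissa

x = torch.randn(10_000)       # unit-scale values, like normalized activations
for fmt in (e4m3, e5m2):
    err = (x - x.to(fmt).float()).abs().mean()
    print(fmt, f"mean round-trip error: {err:.5f}")  # E4M3 rounds more finely
```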
FP4 — Blackwell (B200/GB200) next frontier
NVIDIA's B200/GB200 (Blackwell, 2025) adds native FP4 Tensor Core support — the first generation where 4-bit floating-point inference is a first-class hardware feature rather than a software workaround. FP4 offers 2× the throughput of FP8 at the same chip area. Early results from NVIDIA show FP4 inference for LLMs within 1% perplexity of FP8 with per-group scaling. This makes FP4 the likely production default for the 2026 generation of deployments, following the FP16 → FP8 → FP4 trajectory.
Extended quantization comparison (2024–2025)
| Format | Bits/weight | Method | Quality vs FP16 | Hardware support |
|---|---|---|---|---|
| INT4 (AWQ/GPTQ) | 4 | PTQ | <1% ppl increase | A100, H100 (emulated) |
| FP8 (E4M3) | 8 | PTQ or QAT | <0.5% accuracy loss | H100 native tensor cores |
| BitNet b1.58 | 1.58 | Train from scratch | Matches BF16 | Custom kernels (no native hardware support) |
| FP4 | 4 | PTQ + per-group scale | ~1% ppl (early) | B200/GB200 native (2025) |
Key Takeaways
What to remember for interviews
1. LLM inference is memory-bandwidth bound: each token reads all weights from HBM once but does very little math per byte. Smaller weights = more tokens/sec.
2. INT4 quantization fits a 70B model (35 GB) on a single A100 80GB with under 1% perplexity increase — enabling 2-3x faster generation vs FP16.
3. GPTQ uses the inverse Hessian to optimally round weights layer by layer. AWQ identifies the ~1% salient weights (large activation channels) and scales them before quantization.
4. BitNet b1.58 (2024): ternary {-1,0,+1} weights trained from scratch — 2.71× faster, 3.55× less GPU memory at 3B vs FP16. No floating-point multiply needed.
5. FP8 is the 2025 production standard on H100 (vLLM default, DeepSeek-V3 trained in FP8, <0.5% accuracy loss). Blackwell B200/GB200 adds native FP4 tensor cores as the next step.
Recap quiz
An A100 has 2 TB/s HBM bandwidth and 312 TFLOPS. Llama-2 70B in FP16 = 140 GB. At batch=1, roughly how many tokens/sec can the GPU generate, and what limits it?
At 4-bit on Llama-2 70B, AWQ achieves 3.74 perplexity vs GPTQ's 3.85 (WikiText-2). AWQ is also faster to run. What is the core reason AWQ beats GPTQ at 4-bit?
Llama-2 70B in FP16 needs 140 GB and doesn't fit on one A100 80 GB. After GPTQ 4-bit quantization, how much memory does it need, and what is the perplexity cost on WikiText-2?
LLM.int8() achieves near-lossless INT8 inference for 175B+ models. What is the key mechanism, and why can't you just quantize all activations to INT8 naively?
GPTQ requires a calibration dataset (e.g., 128 samples from C4). If you run GPTQ with no calibration data at all, what happens and why?
DeepSeek-V3 used FP8 training with two formats: E4M3 for forward pass weights/activations, E5M2 for gradients. Why the split, and what does each format optimize?
Further Reading
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — One-shot weight quantization using approximate second-order information, enabling 3-4 bit models with minimal quality loss.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Protects salient weight channels identified by activation magnitudes, achieving better quality than round-to-nearest at 4-bit.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Migrates quantization difficulty from activations to weights via per-channel smoothing, enabling W8A8 quantization.
- Lilian Weng's Blog — Technical posts on model compression, quantization theory, and efficient inference for LLMs.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale — Dettmers et al. 2022 — mixed-precision INT8 quantization that handles outlier activations separately, enabling near-lossless 8-bit inference for 175B+ models.
- FP8 Formats for Deep Learning — Micikevicius et al. 2022 — defines E4M3 and E5M2 FP8 formats and training recipes. The format used by DeepSeek-V3 and H100 tensor cores.
Interview Questions
- ★★☆ Explain the quantization formula. How do you choose scale and zero_point for asymmetric quantization?
- ★★★ Compare GPTQ and AWQ. When would you choose one over the other?
- ★★☆ What is the difference between weight-only quantization and weight+activation quantization? Why is weight-only more popular for LLMs?
- ★★☆ What is quantization-aware training (QAT) and why is it better than post-training quantization (PTQ) at low bit-widths?
- ★★☆ Explain the LLM.int8() method. What problem does it solve with outlier features?
- ★★★ How does FP8 training work, and why did DeepSeek-V3 use it? What are the two FP8 formats?