
Transformer Math

Module 28 · Inference

🚀 LLM Deployment

Continuous batching serves 23x more requests than static batching on the same GPU


Why can't you just model.generate() in production? At 1000 requests/second, naive serving wastes 70% of your GPU waiting for sequences to finish. Modern LLM serving is a systems problem: continuous batching keeps GPUs busy, PagedAttention eliminates KV cache fragmentation, and the prefill/decode split reveals why latency and throughput are fundamentally different optimization targets.

🎮

The LLM Serving Stack

What you're seeing: The full request lifecycle from client to GPU and back. Each layer has a distinct job — from load balancing to tokenization to iterative decode.

What to notice: Prefill (step 2) processes all prompt tokens in parallel — fast but compute-heavy. Decode (step 3) is one token at a time — slow and memory-bound. These two phases have different bottlenecks and optimization strategies.

[Diagram] LLM Serving Stack — Request Lifecycle: client (HTTP / SSE) → load balancer (round-robin / hash) → API gateway (auth, rate limit) → model server (vLLM / TGI / TensorRT-LLM with continuous batching) → GPUs (e.g., 2× A100 80GB). Lifecycle: 1. tokenize (text → token IDs), 2. prefill (all prompt tokens in parallel — compute-bound), 3. decode (one token per step via the KV cache — memory-bandwidth-bound), 4. detokenize (token IDs → text), 5. stream SSE chunks back to the client.
💡

Batching Strategies

Static vs Continuous Batching

The problem: A 2000-token sequence takes ~40× longer to decode than a 50-token one. Static batching makes short sequences wait — the GPU idles until the longest sequence in the batch finishes.

Orca (2022) fix: At every decode step, evict finished sequences and insert waiting requests. Batch composition changes every iteration — no idle slots.

[Diagram] Static batching: the batch is fixed until the longest sequence finishes — Seq A (20 tok), Seq B (28 tok), Seq C (14 tok), Seq D (50 tok, longest); GPU utilization ~56%, with idle slots waiting for Seq D. Continuous batching (Orca): new requests are inserted into slots as sequences complete, sustaining ~90% GPU utilization.
✨ Insight · Continuous batching is iteration-level scheduling: instead of scheduling at the request boundary, the scheduler acts at every forward pass. This is why vLLM can sustain substantially higher throughput — the GPU never waits for one slow request. The vLLM paper reports 2–4× throughput over Orca-style systems at the same latency; gains vs. naive static batching are larger and workload-dependent.
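To make iteration-level scheduling concrete, here is a toy Python simulation — not vLLM's actual scheduler, and all names are illustrative: finished sequences leave the batch and queued requests join it at every decode step.

python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy iteration-level scheduler: refill slots at every decode step."""
    queue = deque(requests)          # (request_id, tokens_remaining)
    active, steps = [], 0
    while queue or active:
        # refill free slots every iteration — static batching instead
        # waits until the entire batch has finished
        while queue and len(active) < max_batch:
            active.append(list(queue.popleft()))
        for seq in active:
            seq[1] -= 1              # one decode step per active sequence
        active = [s for s in active if s[1] > 0]   # evict finished sequences
        steps += 1
    return steps                     # total decode iterations

# the 20/28/14/50-token example above with 3 slots: no slot ever idles
print(continuous_batching([("A", 20), ("B", 28), ("C", 14), ("D", 50)]))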

Quick check

Derivation

A static batch has 4 sequences: 20, 28, 14, and 50 tokens. The GPU sits idle for how many token-steps while waiting for the longest sequence, and what does continuous batching eliminate?


PagedAttention: Virtual Memory for KV Cache

The problem: without paging, the server pre-allocates contiguous GPU memory for a sequence's full context window (e.g., 2048 tokens) at request arrival — before a single output token is generated, even if the sequence ends up using only 100 tokens. The resulting fragmentation and over-reservation prevent packing more requests.

vLLM (2023) fix: Divide KV cache into fixed-size pages (16 tokens each). Map virtual sequence positions to physical GPU pages on demand — like OS virtual memory.

[Diagram] Traditional KV cache: contiguous pre-allocation per sequence — Seq 1 pre-allocates 160 tokens but uses 60 (62.5% waste); Seq 2 pre-allocates 160 but uses 80 (50% waste); average KV cache waste ~60%. PagedAttention (vLLM): physical GPU memory is divided into 16-token blocks allocated on demand, with a virtual→physical block table per sequence (e.g., Seq A: virt 0→phys 0 … virt 3→phys 3; Seq B: virt 0→phys 4 …); average waste ~4% (last partial block only). Result: PagedAttention enables 2–4× more concurrent requests in the same GPU memory. The paged design also enables prefix caching (requests with the same prompt prefix share physical blocks) and copy-on-write for beam search (multiple sequences share blocks until they diverge).
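A minimal sketch of the block-table bookkeeping under the same assumptions (16-token pages, allocation on demand). Real vLLM manages physical blocks in CUDA; the class and method names here are illustrative only.

python
BLOCK = 16

class BlockAllocator:
    """Toy PagedAttention bookkeeping: virtual positions -> physical blocks."""
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.tables: dict[str, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: str, pos: int):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK == 0:                      # page boundary: allocate on demand
            table.append(self.free.pop(0))
        return table[pos // BLOCK], pos % BLOCK   # (physical block, offset)

alloc = BlockAllocator(num_physical_blocks=1024)
for t in range(40):                               # a 40-token sequence
    block, off = alloc.append_token("seq-A", t)
# 40 tokens occupy ceil(40/16) = 3 blocks; waste is only the last partial block
print(alloc.tables["seq-A"])                      # e.g. [0, 1, 2]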
Quick check

Why does continuous batching improve GPU utilization compared to static batching?

📐

Prefill vs Decode: Two Different Problems

Prefill (Prompt Processing)

  • All prompt tokens processed in parallel (one forward pass)
  • Produces the initial KV cache entries
  • Bottleneck metric: TTFT (Time To First Token)
  • Optimization: tensor parallelism, chunked prefill, prefix caching, large-batch compute

Decode (Token Generation)

  • One token generated per forward pass (autoregressive)
  • Reads entire KV cache + model weights each step
  • Bottleneck metric: TPS (Tokens Per Second)
  • Optimization: FP8/INT4 quantization, GQA, speculative decoding

Latency Formulas

TTFT scales with prompt length (more tokens to process in prefill):

TTFT ≈ (2 · N_params · L_prompt) / FLOPS_GPU

TPS is bounded by how fast you can read model weights from GPU memory (HBM):

TPS ≈ BW_HBM / bytes(weights)

Example: Llama-70B at FP16 is 140 GB of weights; an A100 has ~2 TB/s of HBM bandwidth, so batch=1 decode tops out near 2000 / 140 ≈ 14 tokens/s. Batching multiple requests amortizes the weight read cost, improving throughput but not per-request latency.
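Plugging the numbers in — a back-of-envelope sketch, not a benchmark:

python
# back-of-envelope decode throughput from the formulas above
weights_gb = 140           # Llama-70B at FP16 (~2 bytes/param)
hbm_gb_per_s = 2000        # A100 HBM bandwidth, ~2 TB/s

tps_batch1 = hbm_gb_per_s / weights_gb     # every decode step re-reads all weights
print(f"batch=1 decode: ~{tps_batch1:.0f} tok/s")   # ~14 tok/s

# batching: one weight read is shared by B sequences per step, so aggregate
# throughput scales with B (until compute becomes the bound)
for batch in (1, 8, 32):
    print(f"batch={batch}: ~{tps_batch1 * batch:.0f} aggregate tok/s "
          f"(per-request still ~{tps_batch1:.0f} tok/s)")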

🎯 Interview · TTFT and TPS are optimized by different techniques. Fast TTFT needs fast compute (prefill phase); high TPS needs high memory bandwidth and quantization (decode phase). Prefill-disaggregated architectures (Splitwise, DistServe) run the two phases on separate GPU pools, eliminating the head-of-line blocking where long prefills stall concurrent decode steps.

Quick check

Trade-off

Your LLM API has p99 TTFT of 800 ms (target: 200 ms) but p99 TPS is acceptable at 60 tok/s. Which optimization addresses the right bottleneck?

📊

Serving Framework Comparison

Llama-3 70B on 4×A100-80GB. Numbers are approximate and depend heavily on workload (prompt length distribution, output length, batch size).

| Framework | Continuous Batching | PagedAttention | Speculative Decoding | Quantization | Multi-GPU | Throughput (tok/s) |
|---|---|---|---|---|---|---|
| vLLM | ✓ | ✓ (original) | ✓ | FP8, INT8, INT4 | TP + PP | ~2,000–3,000 |
| TGI (HuggingFace) | ✓ | ✓ | ✓ | GPTQ, AWQ, FP8 | TP | ~1,500–2,500 |
| TensorRT-LLM (NVIDIA) | ✓ (in-flight) | ✓ | ✓ (draft model) | FP8, INT8, INT4 | TP + PP | ~3,000–5,000 |
| SGLang | ✓ | ✓ | ✓ (EAGLE) | FP8, INT4 | TP | ~2,500–4,000 |

TP = Tensor Parallelism (split each layer across GPUs). PP = Pipeline Parallelism (split layers across GPUs in stages). SGLang adds RadixAttention (prefix caching) for workloads with shared prefixes (e.g., system prompts).
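To see what "split each layer across GPUs" means, here is a toy NumPy sketch of tensor parallelism on a single linear layer — each shard stands in for one GPU's slice of the weight matrix; real implementations run the shards on separate devices and all-gather over NVLink.

python
import numpy as np

x = np.random.randn(1, 4096)               # activations (batch=1)
W = np.random.randn(4096, 8192)            # full weight matrix of one layer

shards = np.split(W, 4, axis=1)            # TP=4: each "GPU" holds 1/4 of the columns
partials = [x @ w for w in shards]         # these matmuls run in parallel, one per GPU
y_tp = np.concatenate(partials, axis=1)    # all-gather the output shards

assert np.allclose(y_tp, x @ W)            # same result as the single-GPU matmul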

💡 Tip · For most production deployments: start with vLLM (easiest to operate, great ecosystem), move to TensorRT-LLM when you need maximum throughput on NVIDIA hardware, and use SGLang when your workload has heavy prefix sharing (RAG, agents with fixed system prompts).

Quick check

Trade-off

Your team needs to serve Llama-70B at maximum throughput on NVIDIA H100s, but can tolerate a 2-week setup time per model update. Which framework is the right choice and why?

🔌

API Design Patterns

Streaming (SSE)

Server-Sent Events: server pushes token chunks as they are generated. Client reads a chunked HTTP response. Each chunk is a JSON delta. Critical for UX — users see output immediately instead of waiting for the full response.
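A minimal server-side sketch of the pattern, assuming FastAPI; generate_tokens is a hypothetical stand-in for the model's decode loop, and the chunk format merely mimics the OpenAI-style JSON delta.

python
import asyncio, json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # stand-in for the model's decode loop — one token per iteration
    for tok in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)
        yield tok

@app.get("/stream")
async def stream(prompt: str):
    async def sse():
        async for tok in generate_tokens(prompt):
            # each SSE event is a JSON delta, like the OpenAI chunk format
            yield f"data: {json.dumps({'delta': {'content': tok}})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")

# run with: uvicorn app:app — the client sees tokens as they are generated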

Token Counting & Billing

Count input tokens at the API gateway before sending to the model (use tiktoken/sentencepiece). Output tokens are counted as they are generated. Bill prompt tokens and completion tokens separately — prompt tokens are cheaper because prefill processes them in one parallel pass, so they consume far less GPU time per token than decode.
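A hedged sketch of gateway-side counting with tiktoken. Note the assumptions: cl100k_base is an OpenAI encoding — use the tokenizer that actually matches your model (Llama uses a sentencepiece/HF tokenizer), and real chat templates add per-message token overhead not counted here.

python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_prompt_tokens(messages: list[dict]) -> int:
    # counts content tokens only; chat templates add per-message overhead
    return sum(len(enc.encode(m["content"])) for m in messages)

messages = [{"role": "user", "content": "Explain KV cache"}]
prompt_tokens = count_prompt_tokens(messages)
# bill prompt_tokens at the (cheaper) input rate; count completion tokens
# as they stream back and bill them at the output rate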

Rate Limiting & Timeouts

Rate limit on tokens-per-minute (TPM), not just requests-per-minute (RPM). A single 100k-token request consumes 100x resources of a 1k-token request. Set request timeout at API gateway (e.g., 120s) plus per-token timeout (e.g., 30s for first token). Abort server-side generation on client disconnect to free GPU slots.
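A minimal token-bucket sketch for TPM limiting — the class and method names (TpmLimiter, allow) are illustrative, not a real library API.

python
import time

class TpmLimiter:
    """Token bucket refilled at tokens_per_minute / 60 per second."""
    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0    # refill rate per second
        self.last = time.monotonic()

    def allow(self, cost: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost                 # charge prompt + estimated output
            return True
        return False

limiter = TpmLimiter(tokens_per_minute=100_000)
if not limiter.allow(cost=1_500):               # cost = prompt + max_tokens estimate
    raise RuntimeError("429: token budget exhausted, retry later")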

Minimal vLLM Serving Setup (Python)

python
from vllm import LLM, SamplingParams

# 1. Launch the model — vLLM handles continuous batching + PagedAttention
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # split across 4 GPUs
    gpu_memory_utilization=0.92,  # leave 8% for CUDA overhead
    max_model_len=8192,           # max context length
    quantization="fp8",           # halves memory usage
    enable_prefix_caching=True,   # cache common prefix KV blocks
)

# 2. Sampling parameters per request
params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    stop=["</s>", "<|eot_id|>"],
)

# 3. Batch inference (offline) — vLLM auto-batches for throughput
prompts = ["Explain transformers in one paragraph", "What is RLHF?"]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)

# 4. Online serving — exposes OpenAI-compatible API
# vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --gpu-memory-utilization 0.92 \
#   --enable-prefix-caching

# 5. Streaming client (OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache"}],
    stream=True,  # SSE — yields chunks as tokens are generated
    max_tokens=256,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # stream to terminal
🔧

Break It — See What Happens

Disable continuous batching → scheduling reverts to request-level batching: the batch waits for its longest sequence and GPU utilization drops toward ~56% (the static-batching figure above)
Disable KV cache paging → KV memory returns to contiguous pre-allocation: ~60% of cache memory is wasted and far fewer requests fit concurrently

Quick check

Derivation

Without PagedAttention, KV-cache waste is ~60%. On a GPU with 80 GB of HBM and a 140 GB Llama-70B model split across 2 GPUs, how does this waste affect maximum concurrent requests?

🚀

SOTA 2024–2025: Production Serving Stack

vLLM V1 refactor (early 2025)

The vLLM v0.6 refactor (EECS-2025-192) simplified the scheduler and execution loop, delivering roughly 5× lower latency compared to v0.5 on standard benchmarks. The rewrite eliminated several layers of Python overhead in the hot path and made the scheduler easier to extend (e.g. for chunked prefill, prefix caching). As of 2025, vLLM V1 is the default for new deployments.

SGLang RadixAttention — prefix-heavy workloads

SGLang (arxiv:2312.07104) uses a radix tree to automatically deduplicate KV cache across requests that share a common prefix (system prompt, RAG context, few-shot examples). On standard workloads, throughput is comparable to vLLM (16,200 tok/s vs vLLM's 12,500 tok/s in one public benchmark); on prefix-heavy workloads (shared system prompt), it reports up to 6.4× higher throughput, because it serves many requests from a single cached prefix instead of re-computing KV for each.
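A toy sketch of the prefix-sharing idea using a plain dict keyed on block-aligned token prefixes. Real SGLang uses a radix tree over token IDs with LRU eviction; everything here is illustrative.

python
BLOCK = 16
kv_cache: dict[tuple, str] = {}          # token-prefix -> handle to cached KV blocks

def longest_cached_prefix(tokens: list[int]) -> int:
    for n in range(len(tokens), 0, -1):  # longest prefix we already computed
        if tuple(tokens[:n]) in kv_cache:
            return n
    return 0

def prefill(tokens: list[int]) -> None:
    hit = longest_cached_prefix(tokens)
    print(f"reuse {hit} tokens, prefill {len(tokens) - hit} tokens")
    # register every block-aligned prefix so later requests can share it
    for n in range(BLOCK, len(tokens) + 1, BLOCK):
        kv_cache[tuple(tokens[:n])] = "kv-block-handle"

system_prompt = list(range(500))          # shared 500-token system prompt
prefill(system_prompt + [1001, 1002])     # reuse 0 tokens, prefill 502
prefill(system_prompt + [2001])           # reuse 496 tokens, prefill 5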

NVIDIA GB200 NVL72 (GA: Feb 2025)

The GB200 NVL72 rack unit (nvidia.com/gb200-nvl72) pairs 36 Grace CPUs with 72 Blackwell B200 GPUs in a single liquid-cooled rack. Key specs: 72 B200 GPUs, 130 TB/s of aggregate NVLink bandwidth, and a 13.5 TB unified HBM3e pool. CoreWeave became the first cloud provider to offer GB200 NVL72 instances at GA (Feb 2025). NVLink-C2C connects each Grace CPU to its GPUs die-to-die, and the rack-scale NVLink fabric lets all 72 GPUs address the unified 13.5 TB HBM3e pool — enabling single-rack inference for 405B+ models without spilling traffic onto InfiniBand.

FP8 as the 2025 production standard

H100 tensor cores natively support FP8 (E4M3 forward, E5M2 gradients). vLLM defaults to FP8 weight quantization on H100 (vllm docs) and DeepSeek-V3 was trained in FP8 — demonstrating that FP8 is no longer just a serving optimization but viable for pre-training at scale. Quality impact: typically under 0.5% accuracy loss vs BF16. Combined with FP8 KV cache (halving KV memory), FP8 end-to-end roughly doubles effective throughput vs FP16 on H100.
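A minimal sketch of per-tensor E4M3 scaling, assuming the ml_dtypes package for the FP8 cast (pip install ml_dtypes); production kernels fuse this into the matmul on H100 tensor cores rather than round-tripping through NumPy.

python
import numpy as np
import ml_dtypes

E4M3_MAX = 448.0                                 # largest finite E4M3 value

w = np.random.randn(4096, 4096).astype(np.float32)
scale = np.abs(w).max() / E4M3_MAX               # per-tensor scale factor
w_fp8 = (w / scale).astype(ml_dtypes.float8_e4m3fn)    # 1 byte per weight
w_hat = w_fp8.astype(np.float32) * scale         # dequantize for comparison

rel_err = np.abs(w_hat - w).mean() / np.abs(w).mean()
print(f"bytes vs FP16: {w_fp8.nbytes / w.astype(np.float16).nbytes:.1f}x, "
      f"mean rel err ~{rel_err:.3f}")            # ~half the memory, small error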

2025 serving stack cheat sheet

| Scenario | Recommended stack |
|---|---|
| General production (A100/H100) | vLLM V1 + FP8 + continuous batching |
| Prefix-heavy (RAG, agents, shared system prompt) | SGLang + RadixAttention |
| Maximum throughput on NVIDIA hardware | TensorRT-LLM + FP8 + in-flight batching |
| 405B+ single-rack inference | GB200 NVL72 (130 TB/s NVLink, 13.5 TB unified HBM) |

Non-NVIDIA Hardware Frontier (2024–2025)

NVIDIA dominates LLM training and inference, but several alternative architectures have reached production or near-production scale. Each trades NVIDIA’s flexibility for a specific advantage — memory-bandwidth, on-chip SRAM, or energy efficiency.

Cerebras WSE-3 (2024)

The WSE-3 (cerebras.ai) eliminates the HBM memory-bandwidth bottleneck by placing all weights on-chip: 4 trillion transistors, 900K AI cores, 44 GB of on-chip SRAM. The result: roughly 2,100 tok/s on Llama 3 70B inference (vendor-reported). The tradeoff is model size — the chip can only serve models that fit in 44 GB on-chip, limiting it to 70B-class models without multi-chip scaling. Available as cloud inference at cerebras.ai.

Groq LPU (2024)

Groq’s Language Processing Unit (groq.com) uses a deterministic, compiler-scheduled execution model with on-chip SRAM. Published benchmarks show roughly 800 tok/s on Llama 3 70B (about 1,665 tok/s with speculative decoding). Like Cerebras, it trades flexibility for throughput — the LPU is a fixed dataflow processor, not a general GPU. Available via GroqCloud at competitive per-token pricing.

Google TPU v6e Trillium (GA late 2024)

Trillium (Google Cloud blog) is Google’s 6th-generation TPU, generally available in late 2024, with a reported 4.7× peak-compute improvement per chip over TPU v5e. Used for Gemini training and inference internally. Available on Google Cloud — the primary choice for organizations with existing GCP infrastructure or JAX/XLA codebases.

AWS Trainium 2 / Inferentia 2 (2024–2025 production adoption)

AWS Trainium 2 (training) and Inferentia 2 (inference) reached broad production adoption in 2024–2025. Anthropic is a major customer, using Trainium for Claude training at scale (per Anthropic/AWS partnership announcements). The key advantage is AWS-ecosystem integration and lower cost-per-token vs. H100 on-demand for steady-state inference workloads. Inferentia 2 supports the Llama, Mistral, and Titan model families natively via the AWS Neuron SDK.

Etched Sohu (announced Jun 2024) — unverified

Etched (etched.com) announced Sohu as the first transformer-only ASIC. By baking the transformer architecture into silicon, it claims throughput on the order of 500,000 tok/s on Llama 70B — a community-estimated, unverified figure. The architectural constraint is fundamental: Sohu cannot run non-transformer models. Worth tracking but not production-ready.

Non-NVIDIA hardware comparison (Llama 3 70B inference, 2025-Q1)

| Chip / Platform | Architecture | Best tok/s (70B) | Use case fit | Vendor lock-in | Public availability |
|---|---|---|---|---|---|
| Cerebras WSE-3 | On-chip SRAM wafer-scale | ~2,100 | Low-latency inference | High (proprietary) | cerebras.ai cloud |
| Groq LPU | Deterministic dataflow | ~800 (1,665 w/ spec-dec) | High-throughput inference | High (proprietary) | GroqCloud |
| Google TPU v6e | Systolic array (JAX/XLA) | N/A (training primary) | Training + serving at scale | High (GCP/JAX) | Google Cloud GA |
| AWS Trainium 2 | NeuronCore v2 | N/A (training primary) | Training (AWS ecosystem) | High (AWS Neuron SDK) | AWS GA |
| Etched Sohu | Transformer-only ASIC | ~500K (claimed) | Transformer inference only | Extreme (ASIC) | Not yet shipping |
| NVIDIA H100 (ref) | CUDA / tensor cores | ~300–600 (4× H100) | Training + any inference | Medium (CUDA ecosystem) | All major clouds |

Throughput figures are vendor-reported or derived from public benchmarks; independent verification varies. H100 throughput for reference uses 4-GPU tensor-parallel vLLM setup. Etched Sohu figure is claimed and unverified (community estimate).

💡 Tip · Interview framing: When asked “why not just use Cerebras or Groq everywhere?” — the answer is the flexibility vs. throughput tradeoff. Cerebras/Groq win on pure tokens/second for fixed models but cannot run arbitrary model architectures, require vendor-specific SDKs, and lack the multi-modal / multi-task flexibility of GPU clusters. NVIDIA’s moat is the CUDA software ecosystem and model-architecture generality, not raw arithmetic throughput.
🧠

Key Takeaways

What to remember for interviews

  1. Continuous batching (Orca) acts at the iteration level — finished sequences are evicted and new requests inserted every decode step, keeping GPU utilization near 90%
  2. PagedAttention maps KV cache to fixed-size physical blocks on demand, cutting memory waste from ~60% to ~4% and enabling 2–4× more concurrent requests
  3. Prefill is compute-bound (parallel, bottleneck = TTFT); decode is memory-bandwidth-bound (autoregressive, bottleneck = TPS) — different phases, different optimization targets
  4. SGLang RadixAttention caches KV for shared prefixes in a radix tree — up to 6.4× throughput on prefix-heavy workloads vs a vLLM baseline. vLLM V1 (2025) delivers ~5× latency reduction over v0.5.
  5. GB200 NVL72 (GA Feb 2025): 72 B200 GPUs, 130 TB/s NVLink, 13.5 TB unified HBM — enables single-rack inference for 405B+ models. FP8 is the 2025 default on H100 (<0.5% accuracy loss vs BF16).
  6. Non-NVIDIA frontier (2024–2025): Cerebras WSE-3 ~2,100 tok/s, Groq LPU ~800 tok/s (1,665 with spec-dec), TPU v6e 4.7× vs v5e — all trade CUDA flexibility for specific throughput/efficiency wins.
🧠

Recap quiz

Derivation

The vLLM paper reports 2–4× throughput over FasterTransformer/Orca at the same latency. What is the primary mechanism behind this gain?

Derivation

Llama-70B at FP16 has 140 GB of weights. An A100 has 2 TB/s HBM bandwidth. What is the approximate batch=1 decode throughput, and why does batching improve it?

Trade-off

TensorRT-LLM achieves roughly 1.5–2× higher throughput than vLLM on NVIDIA hardware. When is vLLM still the better choice for a production deployment?

Trade-off

A serving system processes requests with highly variable prompt lengths (50–8,000 tokens). Why does prefill-disaggregation (Splitwise/DistServe) reduce p99 latency even if total GPU-seconds consumed is unchanged?

Derivation

Beyond reducing KV-cache waste from ~60% to under 4%, PagedAttention enables prefix caching. What makes prefix caching possible in the paged design that was impossible with contiguous pre-allocation?

📚

Further Reading

🎯

Interview Questions


Explain continuous batching and why it improves throughput over static batching.

★★☆
Google · Databricks

What is the difference between TTFT and TPS, and why do different use cases care about different metrics?

★★☆
OpenAI · Anthropic

How does PagedAttention reduce KV cache memory waste from ~60% to under 4%?

★★★
Databricks · OpenAI

Design an LLM serving system for 1000 QPS with p99 latency < 500ms for Llama-70B.

★★★
Google · Meta

Compare prefill-disaggregated architectures (Splitwise, DistServe) vs unified serving. When does disaggregation win?

★★★
Anthropic · Databricks