🚀 LLM Deployment
Continuous batching serves up to 23x more requests than static batching on the same GPU (the exact multiple depends on the static baseline)
Why can't you just call model.generate() in production? At 1,000 requests/second, naive serving wastes ~70% of your GPU time waiting for sequences to finish. Modern LLM serving is a systems problem: continuous batching keeps GPUs busy, PagedAttention eliminates KV cache fragmentation, and the prefill/decode split reveals why latency and throughput are fundamentally different optimization targets.
The LLM Serving Stack
What you're seeing: The full request lifecycle from client to GPU and back. Each layer has a distinct job — from load balancing to tokenization to iterative decode.
What to notice: Prefill (step 2) processes all prompt tokens in parallel — fast but compute-heavy. Decode (step 3) is one token at a time — slow and memory-bound. These two phases have different bottlenecks and optimization strategies.
Batching Strategies
Static vs Continuous Batching
The problem: A 2000-token sequence takes 40x longer than a 50-token one. Static batching makes short sequences wait — GPU idles for the longest.
Orca (2022) fix: At every decode step, evict finished sequences and insert waiting requests. Batch composition changes every iteration — no idle slots.
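The difference is easy to see in a toy simulation (illustrative only — not how a real scheduler is implemented): count how many decode iterations a pool of 4 GPU slots needs to drain a queue of sequences under each strategy.

```python
# Toy simulation of decode scheduling (illustrative only): 4 GPU slots,
# 8 queued sequences. Each sequence needs `length` decode iterations.

def static_total_steps(lengths, slots=4):
    # Static batching: run fixed batches of `slots` sequences; each
    # batch occupies the GPU until its longest member finishes.
    return sum(max(lengths[i:i + slots])
               for i in range(0, len(lengths), slots))

def continuous_batch_steps(lengths, slots=4):
    # Continuous batching: evict finished sequences and admit waiting
    # ones at every iteration, so no slot idles while work remains.
    queue = sorted(lengths, reverse=True)  # pop() admits shortest first
    active, steps = [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop())
        steps += 1
        active = [n - 1 for n in active if n > 1]
    return steps

lengths = [20, 28, 14, 50, 30, 10, 40, 20]
print("static:    ", static_total_steps(lengths), "iterations")    # 90
print("continuous:", continuous_batch_steps(lengths), "iterations")  # 70
```

Here continuous batching finishes the same work in fewer iterations because evicted slots are immediately refilled from the queue instead of idling until the batch's longest sequence completes.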
Quick check
A static batch has 4 sequences: 20, 28, 14, and 50 tokens. The GPU sits idle for how many token-steps while waiting for the longest sequence, and what does continuous batching eliminate?
PagedAttention: Virtual Memory for KV Cache
The problem: the KV cache for a 2048-token sequence pre-allocates contiguous GPU memory at request arrival — before a single output token is generated — even if the sequence ends up using only 100 tokens. The resulting fragmentation prevents packing more requests.
vLLM (2023) fix: Divide KV cache into fixed-size pages (16 tokens each). Map virtual sequence positions to physical GPU pages on demand — like OS virtual memory.
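A minimal sketch of the paging idea (a toy data structure, not vLLM's actual implementation): a per-sequence block table maps logical token positions to physical blocks, and a block is allocated only when the sequence actually grows into it.

```python
# Toy paged KV cache (not vLLM's code): a block table per sequence maps
# logical positions to physical blocks, allocated lazily on demand.

PAGE_SIZE = 16  # tokens per KV block (vLLM's default block size)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, pos: int) -> None:
        # Allocate a new physical block only when the sequence crosses
        # a page boundary — never ahead of time.
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < pos // PAGE_SIZE + 1:
            table.append(self.free.pop())

    def blocks_used(self, seq_id: str) -> int:
        return len(self.block_tables.get(seq_id, []))

cache = PagedKVCache(num_physical_blocks=1024)
for pos in range(100):             # a sequence that only reaches 100 tokens
    cache.append_token("req-0", pos)
print(cache.blocks_used("req-0"))  # 7 blocks (112 token slots), vs the 128
                                   # blocks a 2048-token pre-allocation needs
```

The waste per sequence shrinks from "max context minus actual length" to at most one partially filled page — which is where the ~60% → ~4% figure comes from.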
Why does continuous batching improve GPU utilization compared to static batching?
Prefill vs Decode: Two Different Problems
Prefill (Prompt Processing)
- All prompt tokens processed in parallel (one forward pass)
- Produces the initial KV cache entries
- Compute-bound: the whole prompt saturates the GPU's FLOPs
- Bottleneck metric: TTFT (Time To First Token)
- Optimization: tensor parallelism, large-batch compute
Decode (Token Generation)
- One token generated per forward pass (autoregressive)
- Reads entire KV cache + model weights each step
- Memory-bandwidth-bound: limited by HBM reads, not compute
- Bottleneck metric: TPS (Tokens Per Second)
- Optimization: speculative decoding, GQA, FP8 KV cache
Latency Formulas
TTFT scales with prompt length (more tokens to process in prefill):

`TTFT ≈ prompt_tokens × FLOPs_per_token / GPU_FLOPs`

TPS is bounded by how fast you can read model weights from GPU memory (HBM):

`TPS ≈ HBM_bandwidth / bytes_read_per_token ≈ HBM_bandwidth / model_size` (at batch=1)

Example: Llama-70B at FP16 is ~140 GB of weights; on an A100 with 2 TB/s HBM bandwidth, batch=1 decode tops out around 2000 / 140 ≈ 14 tok/s. Batching multiple requests amortizes the weight read cost, improving throughput but not per-request latency.
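The roofline arithmetic is worth doing once by hand; here it is as a short script (hardware figures are the assumed A100/Llama-70B numbers used in this section):

```python
# Back-of-envelope decode roofline. Assumed figures from the text:
# A100 with ~2 TB/s HBM bandwidth, Llama-70B at FP16 = ~140 GB weights.

HBM_BW = 2e12     # bytes/s
WEIGHTS = 140e9   # bytes (70B params * 2 bytes in FP16)

# Decode is memory-bound: every generated token re-reads all weights.
tps_batch1 = HBM_BW / WEIGHTS
print(f"batch=1 decode ceiling: ~{tps_batch1:.0f} tok/s")

# Batching amortizes the weight read: one pass over the weights serves
# B tokens (one per request), so aggregate throughput scales roughly
# linearly with B until compute becomes the bottleneck.
for batch in (1, 8, 32):
    print(f"batch={batch:>2}: ~{tps_batch1 * batch:.0f} tok/s aggregate")
```

Note what does not change: each individual request still decodes at ~14 tok/s — batching raises throughput, not per-request latency.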
Quick check
Your LLM API has p99 TTFT of 800 ms (target: 200 ms) but p99 TPS is acceptable at 60 tok/s. Which optimization addresses the right bottleneck?
Serving Framework Comparison
Llama-3 70B on 4×A100-80GB. Numbers are approximate and depend heavily on workload (prompt length distribution, output length, batch size).
| Framework | Continuous Batching | PagedAttention | Speculative Decoding | Quantization | Multi-GPU | Throughput (tok/s) |
|---|---|---|---|---|---|---|
| vLLM | ✓ | ✓ (original) | ✓ | FP8, INT8, INT4 | TP + PP | ~2,000–3,000 |
| TGI (HuggingFace) | ✓ | ✓ | ✓ | GPTQ, AWQ, FP8 | TP | ~1,500–2,500 |
| TensorRT-LLM (NVIDIA) | ✓ (in-flight) | ✓ | ✓ (draft model) | FP8, INT8, INT4 | TP + PP | ~3,000–5,000 |
| SGLang | ✓ | ✓ | ✓ (EAGLE) | FP8, INT4 | TP | ~2,500–4,000 |
TP = Tensor Parallelism (split each layer across GPUs). PP = Pipeline Parallelism (split layers across GPUs in stages). SGLang adds RadixAttention (prefix caching) for workloads with shared prefixes (e.g., system prompts).
Quick check
Your team needs to serve Llama-70B at maximum throughput on NVIDIA H100s, but can tolerate a 2-week setup time per model update. Which framework is the right choice and why?
API Design Patterns
Streaming (SSE)
Server-Sent Events: server pushes token chunks as they are generated. Client reads a chunked HTTP response. Each chunk is a JSON delta. Critical for UX — users see output immediately instead of waiting for the full response.
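For reference, the wire format looks like this (a sketch of the SSE chunk shape in the OpenAI-compatible style, not tied to any particular server framework):

```python
# Sketch of an SSE token stream: each chunk is a `data:` line carrying
# a JSON delta, terminated by a blank line; the stream ends with [DONE].
import json

def sse_chunks(tokens):
    for tok in tokens:
        delta = {"choices": [{"delta": {"content": tok}}]}
        yield f"data: {json.dumps(delta)}\n\n"
    yield "data: [DONE]\n\n"

for chunk in sse_chunks(["Hel", "lo", "!"]):
    print(chunk, end="")
```

The client concatenates the `content` deltas as they arrive — which is exactly what the OpenAI SDK loop at the end of this page does.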
Token Counting & Billing
Count input tokens at the API gateway before sending to the model (use tiktoken/sentencepiece). Output tokens are counted as they are generated. Bill prompt tokens and completion tokens separately — prompt tokens are cheaper because prefill processes the whole prompt in one parallel, batched pass.
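A minimal billing helper might look like this (the prices and names below are hypothetical, purely for illustration):

```python
# Hypothetical billing helper — prices are made-up example values.
# Prompt and completion tokens are metered separately, with prompt
# tokens cheaper because prefill is amortized across batched users.

PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}  # USD, assumed

def bill_request(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one request, billed per 1K tokens by category."""
    return (prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + completion_tokens / 1000 * PRICE_PER_1K["completion"])

# An 8k-token prompt with a 500-token answer:
print(f"${bill_request(8000, 500):.4f}")  # $0.0315
```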
Rate Limiting & Timeouts
Rate limit on tokens-per-minute (TPM), not just requests-per-minute (RPM). A single 100k-token request consumes 100x resources of a 1k-token request. Set request timeout at API gateway (e.g., 120s) plus per-token timeout (e.g., 30s for first token). Abort server-side generation on client disconnect to free GPU slots.
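A TPM limiter is just a token bucket whose capacity is denominated in LLM tokens rather than requests. A minimal single-process sketch (a real gateway would keep per-API-key buckets in shared storage such as Redis):

```python
# Minimal token bucket for tokens-per-minute limiting (a sketch, not a
# production limiter — no per-key state, no distributed storage).
import time

class TokenBucket:
    def __init__(self, tpm: int):
        self.capacity = tpm           # bucket size: tokens per minute
        self.tokens = float(tpm)
        self.refill_rate = tpm / 60   # tokens replenished per second
        self.last = time.monotonic()

    def allow(self, cost: int) -> bool:
        # Refill proportionally to elapsed time, then try to spend.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should respond with HTTP 429

bucket = TokenBucket(tpm=100_000)
print(bucket.allow(1_000))    # True  — small request admitted
print(bucket.allow(100_000))  # False — 100k-token request exceeds budget
```

Charging the bucket by token count is what makes a single 100k-token request correctly "cost" 100x a 1k-token one.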
Minimal vLLM Serving Setup (Python)
```python
from vllm import LLM, SamplingParams

# 1. Launch the model — vLLM handles continuous batching + PagedAttention
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # split across 4 GPUs
    gpu_memory_utilization=0.92,  # leave 8% for CUDA overhead
    max_model_len=8192,           # max context length
    quantization="fp8",           # halves memory usage
    enable_prefix_caching=True,   # cache common prefix KV blocks
)

# 2. Sampling parameters per request
params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    stop=["</s>", "<|eot_id|>"],
)

# 3. Batch inference (offline) — vLLM auto-batches for throughput
prompts = ["Explain transformers in one paragraph", "What is RLHF?"]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)

# 4. Online serving — exposes an OpenAI-compatible API
# vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --gpu-memory-utilization 0.92 \
#   --enable-prefix-caching

# 5. Streaming client (OpenAI SDK)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache"}],
    stream=True,  # SSE — yields chunks as tokens are generated
    max_tokens=256,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # stream to terminal
```
Quick check
Without PagedAttention, KV-cache waste is ~60%. With a 140 GB Llama-70B model split across two 80 GB GPUs (70 GB of weights per GPU), how does this waste affect the maximum number of concurrent requests?
SOTA 2024–2025: Production Serving Stack
vLLM V1 refactor (early 2025)
The vLLM V1 refactor (EECS-2025-192) simplified the scheduler and execution loop, delivering roughly 5x lower latency than v0.5 on standard benchmarks. The rewrite eliminated several layers of Python overhead in the hot path and made the scheduler easier to extend (e.g. for chunked prefill, prefix caching). As of 2025, vLLM V1 is the default for new deployments.
SGLang RadixAttention — prefix-heavy workloads
SGLang (arxiv:2312.07104) uses a radix tree to automatically deduplicate KV cache across requests that share a common prefix (system prompt, RAG context, few-shot examples). On standard workloads it performs roughly on par with vLLM; on prefix-heavy workloads (shared system prompt) it pulls ahead — 16,200 tok/s vs vLLM's 12,500 tok/s in one published benchmark — because it serves many requests from a single cached prefix instead of re-computing KV for each.
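The core idea can be sketched with a plain trie over token ids (illustrative only — SGLang's radix tree also handles eviction, reference counting, and path compression):

```python
# Toy prefix-matching trie over token ids (not SGLang's implementation):
# records which prefixes already have cached KV, so a new request only
# needs to compute KV for its unmatched suffix.

class PrefixTrie:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def matched_prefix_len(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

trie = PrefixTrie()
system_prompt = list(range(500))           # pretend: 500 shared prefix tokens
trie.insert(system_prompt + [1001, 1002])  # first request populates the cache

new_request = system_prompt + [2001, 2002, 2003]
hit = trie.matched_prefix_len(new_request)
print(f"KV reuse: {hit} of {len(new_request)} tokens")  # 500 of 503
```

With a long shared system prompt, almost all of each request's prefill cost disappears — which is exactly the regime where SGLang beats a per-request KV scheme.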
NVIDIA GB200 NVL72 (GA: Feb 2025)
The GB200 NVL72 rack unit (nvidia.com/gb200-nvl72) pairs 36 Grace CPUs with 72 Blackwell B200 GPUs in a single liquid-cooled rack. Key specs: 72 B200 GPUs, 130 TB/s aggregate NVLink bandwidth, and a 13.5 TB unified HBM3e pool. CoreWeave became the first cloud provider to offer GB200 NVL72 instances at GA (Feb 2025). NVLink-C2C links each Grace CPU directly to its Blackwell GPUs, and the rack-scale NVLink fabric lets all 72 GPUs share the unified 13.5 TB HBM3e pool — enabling single-rack inference for 405B+ models without crossing slower inter-node networks.
FP8 as the 2025 production standard
H100 tensor cores natively support FP8 (E4M3 forward, E5M2 gradients). vLLM defaults to FP8 weight quantization on H100 (vllm docs) and DeepSeek-V3 was trained in FP8 — demonstrating that FP8 is no longer just a serving optimization but viable for pre-training at scale. Quality impact: typically under 0.5% accuracy loss vs BF16. Combined with FP8 KV cache (halving KV memory), FP8 end-to-end roughly doubles effective throughput vs FP16 on H100.
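The KV-memory arithmetic behind the "roughly doubles" claim can be checked in a few lines (the model shape below is an assumed Llama-70B-like configuration with GQA: 80 layers, 8 KV heads of dimension 128):

```python
# Rough KV cache sizing — assumed Llama-70B-like shape under GQA:
# 80 layers, 8 KV heads, head dim 128, 8192-token context.

layers, kv_heads, head_dim, ctx = 80, 8, 128, 8192

def kv_bytes_per_seq(bytes_per_elem: int) -> int:
    # K and V tensors, per layer, per token
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

fp16 = kv_bytes_per_seq(2)
fp8 = kv_bytes_per_seq(1)
print(f"FP16 KV per 8k-token seq: {fp16 / 1e9:.1f} GB")  # ~2.7 GB
print(f"FP8  KV per 8k-token seq: {fp8 / 1e9:.1f} GB")   # ~1.3 GB
```

Since decode throughput scales with how many sequences fit in the KV budget, halving bytes per element roughly doubles the concurrent batch.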
2025 serving stack cheat sheet
| Scenario | Recommended stack |
|---|---|
| General production (A100/H100) | vLLM V1 + FP8 + continuous batching |
| Prefix-heavy (RAG, agents, shared system prompt) | SGLang + RadixAttention |
| Maximum throughput on NVIDIA hardware | TensorRT-LLM + FP8 + in-flight batching |
| 405B+ single-rack inference | GB200 NVL72 (130 TB/s NVLink, 13.5 TB unified HBM) |
Non-NVIDIA Hardware Frontier (2024–2025)
NVIDIA dominates LLM training and inference, but several alternative architectures have reached production or near-production scale. Each trades NVIDIA's flexibility for a specific advantage — memory bandwidth, on-chip SRAM, or energy efficiency.
Cerebras WSE-3 (2024)
The WSE-3 (cerebras.ai) eliminates the HBM memory bandwidth bottleneck by placing all weights on-chip: 4 trillion transistors, 900K AI cores, 44 GB on-chip SRAM. The result: ~2,100 tok/s on Llama 3 70B — well beyond what GPU-based serving reaches for a single stream. The tradeoff is model size — the chip can only serve models that fit in 44 GB on-chip, limiting it to 70B-class models without multi-chip scaling. Available as cloud inference at cerebras.ai.
Groq LPU (2024)
Groq’s Language Processing Unit (groq.com) uses a deterministic, compiler-scheduled execution model with on-chip SRAM. Published benchmarks show ~800 tok/s on Llama 3 70B, rising to ~1,665 tok/s with speculative decoding. Like Cerebras, it trades flexibility for throughput — the LPU is a fixed dataflow processor, not a general GPU. Available via GroqCloud at competitive per-token pricing.
Google TPU v6e Trillium (GA late 2024)
Trillium (Google Cloud blog) is Google's 6th-generation TPU, generally available in late 2024. Google reports roughly 4.7x the peak compute per chip of TPU v5e. Used for Gemini training and inference internally. Available on Google Cloud — the primary choice for organizations with existing GCP infrastructure or JAX/XLA codebases.
AWS Trainium 2 / Inferentia 2 (2024–2025 production adoption)
AWS Trainium 2 (training) and Inferentia 2 (inference) reached broad production adoption in 2024–2025. Anthropic is a major customer, using Trainium for Claude training at scale (per-source: Anthropic/AWS partnership announcements). The key advantage is AWS-ecosystem integration and lower cost-per-token vs. H100 on-demand for steady-state inference workloads. Inferentia 2 supports Llama, Mistral, and Titan model families natively via AWS Neuron SDK.
Etched Sohu (announced Jun 2024) — unverified
Etched (etched.com) announced Sohu as the first transformer-only ASIC. By baking the transformer architecture into silicon, it claims ~500,000 tok/s on Llama-70B (vendor figure, unverified). The architectural constraint is fundamental: Sohu cannot run non-transformer models. Worth tracking but not production-ready.
Non-NVIDIA hardware comparison (Llama 3 70B inference, 2025-Q1)
| Chip / Platform | Architecture | Best tok/s (70B) | Use case fit | Vendor lock-in | Public availability |
|---|---|---|---|---|---|
| Cerebras WSE-3 | On-chip SRAM wafer-scale | ~2,100 | Low-latency inference | High (proprietary) | cerebras.ai cloud |
| Groq LPU | Deterministic dataflow | ~800 (1,665 w/ spec-dec) | High-throughput inference | High (proprietary) | GroqCloud |
| Google TPU v6e | Systolic array (JAX/XLA) | N/A (training primary) | Training + serving at scale | High (GCP/JAX) | Google Cloud GA |
| AWS Trainium 2 | NeuronCore v2 | N/A (training primary) | Training (AWS ecosystem) | High (AWS Neuron SDK) | AWS GA |
| Etched Sohu | Transformer-only ASIC | ~500K (claimed) | Transformer inference only | Extreme (ASICs) | Not yet shipping |
| NVIDIA H100 (ref) | CUDA / tensor cores | ~300–600 (4× H100) | Training + any inference | Medium (CUDA ecosystem) | All major clouds |
Throughput figures are vendor-reported or derived from public benchmarks; independent verification varies. H100 throughput for reference uses 4-GPU tensor-parallel vLLM setup. Etched Sohu figure is claimed and unverified (community estimate).
Key Takeaways
What to remember for interviews
1. Continuous batching (Orca) acts at the iteration level — finished sequences are evicted and new requests inserted every decode step, keeping GPU utilization near 90%
2. PagedAttention maps KV cache to fixed-size physical blocks on demand, cutting memory waste from ~60% to ~4% and enabling 2-4x more concurrent requests
3. Prefill is compute-bound (parallel, bottleneck = TTFT); decode is memory-bandwidth-bound (autoregressive, bottleneck = TPS) — different phases, different optimization targets
4. SGLang RadixAttention caches KV for shared prefixes in a radix tree — up to 6.4x throughput on cache-friendly benchmarks vs a vLLM baseline. vLLM V1 (2025) delivers ~5x latency reduction over v0.5.
5. GB200 NVL72 (GA Feb 2025): 72 B200 GPUs, 130 TB/s NVLink, 13.5 TB unified HBM — enables single-rack inference for 405B+ models. FP8 is the 2025 default on H100 (<0.5% accuracy loss vs BF16).
6. Non-NVIDIA frontier (2024–2025): Cerebras WSE-3 ~2,100 tok/s, Groq LPU ~800 tok/s (1,665 with spec-dec), TPU v6e 4.7x vs v5e — all trade CUDA flexibility for specific throughput/efficiency wins.
Recap quiz
The vLLM paper reports 2–4× throughput over FasterTransformer/Orca at the same latency. What is the primary mechanism behind this gain?
Llama-70B at FP16 has 140 GB of weights. An A100 has 2 TB/s HBM bandwidth. What is the approximate batch=1 decode throughput, and why does batching improve it?
TensorRT-LLM achieves roughly 1.5–2× higher throughput than vLLM on NVIDIA hardware. When is vLLM still the better choice for a production deployment?
A serving system processes requests with highly variable prompt lengths (50–8,000 tokens). Why does prefill-disaggregation (Splitwise/DistServe) reduce p99 latency even if total GPU-seconds consumed is unchanged?
Beyond reducing KV-cache waste from ~60% to under 4%, PagedAttention enables prefix caching. What makes prefix caching possible in the paged design that was impossible with contiguous pre-allocation?
Further Reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al. 2023 — the vLLM paper introducing PagedAttention (virtual-memory paging for KV cache). The paper reports 2–4× higher throughput than prior systems (FasterTransformer, Orca) at the same latency; much larger headline numbers seen elsewhere depend on the specific static-batching baseline being compared against.
- Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al. 2022 — the continuous batching paper (Orca), showing iteration-level scheduling that keeps GPUs busy instead of waiting for the longest sequence.
- Andrej Karpathy — Let's Build the GPT Tokenizer (YouTube) — Karpathy's hands-on walkthrough of tokenization and inference fundamentals — foundational context for understanding deployment latency.
- NVIDIA TensorRT-LLM Documentation — Official docs for TensorRT-LLM — NVIDIA's high-performance inference library with in-flight batching, FP8 quantization, and multi-GPU tensor parallelism.
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting — Patel et al. 2023 — separates prefill and decode across GPU pools so each phase runs on appropriately provisioned hardware
- SGLang: Efficient Execution of Structured Language Model Programs — Zheng et al. 2023 — RadixAttention for KV cache reuse across requests, 5x throughput on multi-turn workloads
Interview Questions
- ★★☆ Explain continuous batching and why it improves throughput over static batching.
- ★★☆ What is the difference between TTFT and TPS, and why do different use cases care about different metrics?
- ★★☆ How does PagedAttention reduce KV cache memory waste from ~60% to under 4%?
- ★★★ Design an LLM serving system for 1000 QPS with p99 latency < 500ms for Llama-70B.
- ★★★ Compare prefill-disaggregated architectures (Splitwise, DistServe) vs unified serving. When does disaggregation win?