
Transformer Math

Module 28 · Inference

🚀 LLM Deployment

Continuous batching serves 23x more requests than static batching on the same GPU


Why can't you just model.generate() in production? At 1000 requests/second, naive serving wastes 70% of your GPU waiting for sequences to finish. Modern LLM serving is a systems problem: continuous batching keeps GPUs busy, PagedAttention eliminates KV cache fragmentation, and the prefill/decode split reveals why latency and throughput are fundamentally different optimization targets.

🎮

The LLM Serving Stack

What you're seeing: The full request lifecycle from client to GPU and back. Each layer has a distinct job — from load balancing to tokenization to iterative decode.

What to notice: Prefill (step 2) processes all prompt tokens in parallel — fast but compute-heavy. Decode (step 3) is one token at a time — slow and memory-bound. These two phases have different bottlenecks and optimization strategies.

[Diagram] LLM Serving Stack — Request Lifecycle: client (HTTP / SSE) → load balancer (round-robin / hash) → API gateway (auth, rate limit) → model server (vLLM / TGI / TensorRT-LLM with continuous batching) → GPUs (e.g., 2× A100 80GB). Lifecycle: 1. tokenize (text → token IDs), 2. prefill (all prompt tokens in parallel — compute-bound), 3. decode (one token per step via the KV cache — memory-bandwidth-bound), 4. detokenize (token IDs → text), 5. stream SSE chunks back to the client.
💡

Batching Strategies

Static vs Continuous Batching

The problem: A 2000-token sequence takes ~40× longer to decode than a 50-token one. Static batching makes short sequences wait — the GPU idles until the longest sequence in the batch finishes.

Orca (2022) fix: At every decode step, evict finished sequences and insert waiting requests. Batch composition changes every iteration — no idle slots.

[Diagram] Static batching: the batch is fixed until the longest sequence finishes — Seq A (20 tok), Seq B (28 tok), Seq C (14 tok), Seq D (50 tok, longest); GPU utilization ~56%, with idle slots waiting for Seq D. Continuous batching (Orca): new requests are inserted into slots as sequences complete, sustaining ~90% GPU utilization.
✨ Insight · Continuous batching is iteration-level scheduling: instead of scheduling at the request boundary, the scheduler acts at every forward pass. This is why vLLM can sustain substantially higher throughput — the GPU never waits for one slow request. The vLLM paper reports 2–4× throughput over Orca-style systems at the same latency; gains vs. naive static batching are larger and workload-dependent.
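To make iteration-level scheduling concrete, here is a toy Python simulation — not vLLM's actual scheduler, and all names are illustrative: finished sequences leave the batch and queued requests join it at every decode step.

python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy iteration-level scheduler: refill slots at every decode step."""
    queue = deque(requests)          # (request_id, tokens_remaining)
    active, steps = [], 0
    while queue or active:
        # refill free slots every iteration — static batching instead
        # waits until the entire batch has finished
        while queue and len(active) < max_batch:
            active.append(list(queue.popleft()))
        for seq in active:
            seq[1] -= 1              # one decode step per active sequence
        active = [s for s in active if s[1] > 0]   # evict finished sequences
        steps += 1
    return steps                     # total decode iterations

# the 20/28/14/50-token example above with 3 slots: no slot ever idles
print(continuous_batching([("A", 20), ("B", 28), ("C", 14), ("D", 50)]))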

Quick check

Derivation

A static batch has 4 sequences: 20, 28, 14, and 50 tokens. The GPU sits idle for how many token-steps while waiting for the longest sequence, and what does continuous batching eliminate?


PagedAttention: Virtual Memory for KV Cache

The problem: without paging, the server pre-allocates contiguous GPU memory for a sequence's full context window (e.g., 2048 tokens) at request arrival — before a single output token is generated, even if the sequence ends up using only 100 tokens. The resulting fragmentation and over-reservation prevent packing more requests.

vLLM (2023) fix: Divide KV cache into fixed-size pages (16 tokens each). Map virtual sequence positions to physical GPU pages on demand — like OS virtual memory.

[Diagram] Traditional KV cache: contiguous pre-allocation per sequence — Seq 1 pre-allocates 160 tokens but uses 60 (62.5% waste); Seq 2 pre-allocates 160 but uses 80 (50% waste); average KV cache waste ~60%. PagedAttention (vLLM): physical GPU memory is divided into 16-token blocks allocated on demand, with a virtual→physical block table per sequence (e.g., Seq A: virt 0→phys 0 … virt 3→phys 3; Seq B: virt 0→phys 4 …); average waste ~4% (last partial block only). Result: PagedAttention enables 2–4× more concurrent requests in the same GPU memory. The paged design also enables prefix caching (requests with the same prompt prefix share physical blocks) and copy-on-write for beam search (multiple sequences share blocks until they diverge).
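A minimal sketch of the block-table bookkeeping under the same assumptions (16-token pages, allocation on demand). Real vLLM manages physical blocks in CUDA; the class and method names here are illustrative only.

python
BLOCK = 16

class BlockAllocator:
    """Toy PagedAttention bookkeeping: virtual positions -> physical blocks."""
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.tables: dict[str, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: str, pos: int):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK == 0:                      # page boundary: allocate on demand
            table.append(self.free.pop(0))
        return table[pos // BLOCK], pos % BLOCK   # (physical block, offset)

alloc = BlockAllocator(num_physical_blocks=1024)
for t in range(40):                               # a 40-token sequence
    block, off = alloc.append_token("seq-A", t)
# 40 tokens occupy ceil(40/16) = 3 blocks; waste is only the last partial block
print(alloc.tables["seq-A"])                      # e.g. [0, 1, 2]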
Quick check

Why does continuous batching improve GPU utilization compared to static batching?

📐

Prefill vs Decode: Two Different Problems

Prefill (Prompt Processing)

  • All prompt tokens processed in parallel (one forward pass)
  • Produces the initial KV cache entries
  • Bottleneck metric: TTFT (Time To First Token)
  • Optimization: tensor parallelism, chunked prefill, prefix caching, large-batch compute

Decode (Token Generation)

  • One token generated per forward pass (autoregressive)
  • Reads entire KV cache + model weights each step
  • Bottleneck metric: TPS (Tokens Per Second)
  • Optimization: FP8/INT4 quantization, GQA, speculative decoding

Latency Formulas

TTFT scales with prompt length (more tokens to process in prefill):

TTFT ≈ (2 · N_params · L_prompt) / FLOPS_GPU

TPS is bounded by how fast you can read model weights from GPU memory (HBM):

TPS ≈ BW_HBM / bytes(weights)

Example: Llama-70B at FP16 is 140 GB of weights; an A100 has ~2 TB/s of HBM bandwidth, so batch=1 decode tops out near 2000 / 140 ≈ 14 tokens/s. Batching multiple requests amortizes the weight read cost, improving throughput but not per-request latency.
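Plugging the numbers in — a back-of-envelope sketch, not a benchmark:

python
# back-of-envelope decode throughput from the formulas above
weights_gb = 140           # Llama-70B at FP16 (~2 bytes/param)
hbm_gb_per_s = 2000        # A100 HBM bandwidth, ~2 TB/s

tps_batch1 = hbm_gb_per_s / weights_gb     # every decode step re-reads all weights
print(f"batch=1 decode: ~{tps_batch1:.0f} tok/s")   # ~14 tok/s

# batching: one weight read is shared by B sequences per step, so aggregate
# throughput scales with B (until compute becomes the bound)
for batch in (1, 8, 32):
    print(f"batch={batch}: ~{tps_batch1 * batch:.0f} aggregate tok/s "
          f"(per-request still ~{tps_batch1:.0f} tok/s)")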

🎯 Interview · TTFT and TPS are optimized by different techniques. Fast TTFT needs fast compute (prefill phase); high TPS needs high memory bandwidth and quantization (decode phase). Prefill-disaggregated architectures (Splitwise, DistServe) run the two phases on separate GPU pools, eliminating the head-of-line blocking where long prefills stall concurrent decode steps.

Quick check

Trade-off

Your LLM API has p99 TTFT of 800 ms (target: 200 ms) but p99 TPS is acceptable at 60 tok/s. Which optimization addresses the right bottleneck?

📊

Serving Framework Comparison

Llama-3 70B on 4×A100-80GB. Numbers are approximate and depend heavily on workload (prompt length distribution, output length, batch size).

| Framework | Continuous Batching | PagedAttention | Speculative Decoding | Quantization | Multi-GPU | Throughput (tok/s) |
|---|---|---|---|---|---|---|
| vLLM | ✓ | ✓ (original) | ✓ | FP8, INT8, INT4 | TP + PP | ~2,000–3,000 |
| TGI (HuggingFace) | ✓ | ✓ | ✓ | GPTQ, AWQ, FP8 | TP | ~1,500–2,500 |
| TensorRT-LLM (NVIDIA) | ✓ (in-flight) | ✓ | ✓ (draft model) | FP8, INT8, INT4 | TP + PP | ~3,000–5,000 |
| SGLang | ✓ | ✓ | ✓ (EAGLE) | FP8, INT4 | TP | ~2,500–4,000 |

TP = Tensor Parallelism (split each layer across GPUs). PP = Pipeline Parallelism (split layers across GPUs in stages). SGLang adds RadixAttention (prefix caching) for workloads with shared prefixes (e.g., system prompts).
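To see what "split each layer across GPUs" means, here is a toy NumPy sketch of tensor parallelism on a single linear layer — each shard stands in for one GPU's slice of the weight matrix; real implementations run the shards on separate devices and all-gather over NVLink.

python
import numpy as np

x = np.random.randn(1, 4096)               # activations (batch=1)
W = np.random.randn(4096, 8192)            # full weight matrix of one layer

shards = np.split(W, 4, axis=1)            # TP=4: each "GPU" holds 1/4 of the columns
partials = [x @ w for w in shards]         # these matmuls run in parallel, one per GPU
y_tp = np.concatenate(partials, axis=1)    # all-gather the output shards

assert np.allclose(y_tp, x @ W)            # same result as the single-GPU matmul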

💡 Tip · For most production deployments: start with vLLM (easiest to operate, great ecosystem), move to TensorRT-LLM when you need maximum throughput on NVIDIA hardware, and use SGLang when your workload has heavy prefix sharing (RAG, agents with fixed system prompts).

Quick check

Trade-off

Your team needs to serve Llama-70B at maximum throughput on NVIDIA H100s, but can tolerate a 2-week setup time per model update. Which framework is the right choice and why?

🔌

API Design Patterns

Streaming (SSE)

Server-Sent Events: server pushes token chunks as they are generated. Client reads a chunked HTTP response. Each chunk is a JSON delta. Critical for UX — users see output immediately instead of waiting for the full response.
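A minimal server-side sketch of the pattern, assuming FastAPI; generate_tokens is a hypothetical stand-in for the model's decode loop, and the chunk format merely mimics the OpenAI-style JSON delta.

python
import asyncio, json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # stand-in for the model's decode loop — one token per iteration
    for tok in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)
        yield tok

@app.get("/stream")
async def stream(prompt: str):
    async def sse():
        async for tok in generate_tokens(prompt):
            # each SSE event is a JSON delta, like the OpenAI chunk format
            yield f"data: {json.dumps({'delta': {'content': tok}})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")

# run with: uvicorn app:app — the client sees tokens as they are generated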

Token Counting & Billing

Count input tokens at the API gateway before sending to the model (use tiktoken/sentencepiece). Output tokens are counted as they are generated. Bill prompt tokens and completion tokens separately — prompt tokens are cheaper because prefill processes them in one parallel pass, so they consume far less GPU time per token than decode.
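A hedged sketch of gateway-side counting with tiktoken. Note the assumptions: cl100k_base is an OpenAI encoding — use the tokenizer that actually matches your model (Llama uses a sentencepiece/HF tokenizer), and real chat templates add per-message token overhead not counted here.

python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_prompt_tokens(messages: list[dict]) -> int:
    # counts content tokens only; chat templates add per-message overhead
    return sum(len(enc.encode(m["content"])) for m in messages)

messages = [{"role": "user", "content": "Explain KV cache"}]
prompt_tokens = count_prompt_tokens(messages)
# bill prompt_tokens at the (cheaper) input rate; count completion tokens
# as they stream back and bill them at the output rate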

Rate Limiting & Timeouts

Rate limit on tokens-per-minute (TPM), not just requests-per-minute (RPM). A single 100k-token request consumes 100x resources of a 1k-token request. Set request timeout at API gateway (e.g., 120s) plus per-token timeout (e.g., 30s for first token). Abort server-side generation on client disconnect to free GPU slots.
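A minimal token-bucket sketch for TPM limiting — the class and method names (TpmLimiter, allow) are illustrative, not a real library API.

python
import time

class TpmLimiter:
    """Token bucket refilled at tokens_per_minute / 60 per second."""
    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0    # refill rate per second
        self.last = time.monotonic()

    def allow(self, cost: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost                 # charge prompt + estimated output
            return True
        return False

limiter = TpmLimiter(tokens_per_minute=100_000)
if not limiter.allow(cost=1_500):               # cost = prompt + max_tokens estimate
    raise RuntimeError("429: token budget exhausted, retry later")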

Minimal vLLM Serving Setup (Python)

python
from vllm import LLM, SamplingParams

# 1. Launch the model — vLLM handles continuous batching + PagedAttention
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # split across 4 GPUs
    gpu_memory_utilization=0.92,  # leave 8% for CUDA overhead
    max_model_len=8192,           # max context length
    quantization="fp8",           # halves memory usage
    enable_prefix_caching=True,   # cache common prefix KV blocks
)

# 2. Sampling parameters per request
params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
    stop=["</s>", "<|eot_id|>"],
)

# 3. Batch inference (offline) — vLLM auto-batches for throughput
prompts = ["Explain transformers in one paragraph", "What is RLHF?"]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)

# 4. Online serving — exposes OpenAI-compatible API
# vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --gpu-memory-utilization 0.92 \
#   --enable-prefix-caching

# 5. Streaming client (OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache"}],
    stream=True,  # SSE — yields chunks as tokens are generated
    max_tokens=256,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # stream to terminal
🔧

Break It — See What Happens

Disable continuous batching → scheduling reverts to request-level batching: the batch waits for its longest sequence and GPU utilization drops toward ~56% (the static-batching figure above)
Disable KV cache paging → KV memory returns to contiguous pre-allocation: ~60% of cache memory is wasted and far fewer requests fit concurrently

Quick check

Derivation

Without PagedAttention, KV-cache waste is ~60%. On a GPU with 80 GB of HBM and a 140 GB Llama-70B model split across 2 GPUs, how does this waste affect maximum concurrent requests?

🚀

SOTA 2024–2025: Production Serving Stack

vLLM V1 refactor (early 2025)

The vLLM v0.6 refactor (EECS-2025-192) simplified the scheduler and execution loop, delivering roughly 5× lower latency compared to v0.5 on standard benchmarks. The rewrite eliminated several layers of Python overhead in the hot path and made the scheduler easier to extend (e.g. for chunked prefill, prefix caching). As of 2025, vLLM V1 is the default for new deployments.

SGLang RadixAttention — prefix-heavy workloads

SGLang (arxiv:2312.07104) uses a radix tree to automatically deduplicate KV cache across requests that share a common prefix (system prompt, RAG context, few-shot examples). On standard workloads, throughput is comparable to vLLM (16,200 tok/s vs vLLM's 12,500 tok/s in one public benchmark); on prefix-heavy workloads (shared system prompt), it reports up to 6.4× higher throughput, because it serves many requests from a single cached prefix instead of re-computing KV for each.
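A toy sketch of the prefix-sharing idea using a plain dict keyed on block-aligned token prefixes. Real SGLang uses a radix tree over token IDs with LRU eviction; everything here is illustrative.

python
BLOCK = 16
kv_cache: dict[tuple, str] = {}          # token-prefix -> handle to cached KV blocks

def longest_cached_prefix(tokens: list[int]) -> int:
    for n in range(len(tokens), 0, -1):  # longest prefix we already computed
        if tuple(tokens[:n]) in kv_cache:
            return n
    return 0

def prefill(tokens: list[int]) -> None:
    hit = longest_cached_prefix(tokens)
    print(f"reuse {hit} tokens, prefill {len(tokens) - hit} tokens")
    # register every block-aligned prefix so later requests can share it
    for n in range(BLOCK, len(tokens) + 1, BLOCK):
        kv_cache[tuple(tokens[:n])] = "kv-block-handle"

system_prompt = list(range(500))          # shared 500-token system prompt
prefill(system_prompt + [1001, 1002])     # reuse 0 tokens, prefill 502
prefill(system_prompt + [2001])           # reuse 496 tokens, prefill 5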

NVIDIA GB200 NVL72 (GA: Feb 2025)

The GB200 NVL72 rack unit (nvidia.com/gb200-nvl72) pairs 36 Grace CPUs with 72 Blackwell B200 GPUs in a single liquid-cooled rack. Key specs: 72 B200 GPUs, 130 TB/s of aggregate NVLink bandwidth, and a 13.5 TB unified HBM3e pool. CoreWeave became the first cloud provider to offer GB200 NVL72 instances at GA (Feb 2025). NVLink-C2C connects each Grace CPU to its GPUs die-to-die, and the rack-scale NVLink fabric lets all 72 GPUs address the unified 13.5 TB HBM3e pool — enabling single-rack inference for 405B+ models without spilling traffic onto InfiniBand.

FP8 as the 2025 production standard

H100 tensor cores natively support FP8 (E4M3 forward, E5M2 gradients). vLLM defaults to FP8 weight quantization on H100 (vllm docs) and DeepSeek-V3 was trained in FP8 — demonstrating that FP8 is no longer just a serving optimization but viable for pre-training at scale. Quality impact: typically under 0.5% accuracy loss vs BF16. Combined with FP8 KV cache (halving KV memory), FP8 end-to-end roughly doubles effective throughput vs FP16 on H100.
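A minimal sketch of per-tensor E4M3 scaling, assuming the ml_dtypes package for the FP8 cast (pip install ml_dtypes); production kernels fuse this into the matmul on H100 tensor cores rather than round-tripping through NumPy.

python
import numpy as np
import ml_dtypes

E4M3_MAX = 448.0                                 # largest finite E4M3 value

w = np.random.randn(4096, 4096).astype(np.float32)
scale = np.abs(w).max() / E4M3_MAX               # per-tensor scale factor
w_fp8 = (w / scale).astype(ml_dtypes.float8_e4m3fn)    # 1 byte per weight
w_hat = w_fp8.astype(np.float32) * scale         # dequantize for comparison

rel_err = np.abs(w_hat - w).mean() / np.abs(w).mean()
print(f"bytes vs FP16: {w_fp8.nbytes / w.astype(np.float16).nbytes:.1f}x, "
      f"mean rel err ~{rel_err:.3f}")            # ~half the memory, small error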

2025 serving stack cheat sheet

| Scenario | Recommended stack |
|---|---|
| General production (A100/H100) | vLLM V1 + FP8 + continuous batching |
| Prefix-heavy (RAG, agents, shared system prompt) | SGLang + RadixAttention |
| Maximum throughput on NVIDIA hardware | TensorRT-LLM + FP8 + in-flight batching |
| 405B+ single-rack inference | GB200 NVL72 (130 TB/s NVLink, 13.5 TB unified HBM) |

Non-NVIDIA Hardware Frontier (2024–2025)

NVIDIA dominates LLM training and inference, but several alternative architectures have reached production or near-production scale. Each trades NVIDIA’s flexibility for a specific advantage — memory-bandwidth, on-chip SRAM, or energy efficiency.

Cerebras WSE-3 (2024)

The WSE-3 (cerebras.ai) eliminates the HBM memory-bandwidth bottleneck by placing all weights on-chip: 4 trillion transistors, 900K AI cores, 44 GB of on-chip SRAM. The result: roughly 2,100 tok/s on Llama 3 70B inference (vendor-reported). The tradeoff is model size — the chip can only serve models that fit in 44 GB on-chip, limiting it to 70B-class models without multi-chip scaling. Available as cloud inference at cerebras.ai.

Groq LPU (2024)

Groq’s Language Processing Unit (groq.com) uses a deterministic, compiler-scheduled execution model with on-chip SRAM. Published benchmarks show roughly 800 tok/s on Llama 3 70B (about 1,665 tok/s with speculative decoding). Like Cerebras, it trades flexibility for throughput — the LPU is a fixed dataflow processor, not a general GPU. Available via GroqCloud at competitive per-token pricing.

Google TPU v6e Trillium (GA late 2024)

Trillium (Google Cloud blog) is Google’s 6th-generation TPU, generally available in late 2024, with a reported 4.7× peak-compute improvement per chip over TPU v5e. Used for Gemini training and inference internally. Available on Google Cloud — the primary choice for organizations with existing GCP infrastructure or JAX/XLA codebases.

AWS Trainium 2 / Inferentia 2 (2024–2025 production adoption)

AWS Trainium 2 (training) and Inferentia 2 (inference) reached broad production adoption in 2024–2025. Anthropic is a major customer, using Trainium for Claude training at scale (per Anthropic/AWS partnership announcements). The key advantage is AWS-ecosystem integration and lower cost-per-token vs. H100 on-demand for steady-state inference workloads. Inferentia 2 supports the Llama, Mistral, and Titan model families natively via the AWS Neuron SDK.

Etched Sohu (announced Jun 2024) — unverified

Etched (etched.com) announced Sohu as the first transformer-only ASIC. By baking the transformer architecture into silicon, it claims throughput on the order of 500,000 tok/s on Llama 70B — a community-estimated, unverified figure. The architectural constraint is fundamental: Sohu cannot run non-transformer models. Worth tracking but not production-ready.

Non-NVIDIA hardware comparison (Llama 3 70B inference, 2025-Q1)

| Chip / Platform | Architecture | Best tok/s (70B) | Use case fit | Vendor lock-in | Public availability |
|---|---|---|---|---|---|
| Cerebras WSE-3 | On-chip SRAM wafer-scale | ~2,100 | Low-latency inference | High (proprietary) | cerebras.ai cloud |
| Groq LPU | Deterministic dataflow | ~800 (1,665 w/ spec-dec) | High-throughput inference | High (proprietary) | GroqCloud |
| Google TPU v6e | Systolic array (JAX/XLA) | N/A (training primary) | Training + serving at scale | High (GCP/JAX) | Google Cloud GA |
| AWS Trainium 2 | NeuronCore v2 | N/A (training primary) | Training (AWS ecosystem) | High (AWS Neuron SDK) | AWS GA |
| Etched Sohu | Transformer-only ASIC | ~500K (claimed) | Transformer inference only | Extreme (ASIC) | Not yet shipping |
| NVIDIA H100 (ref) | CUDA / tensor cores | ~300–600 (4× H100) | Training + any inference | Medium (CUDA ecosystem) | All major clouds |

Throughput figures are vendor-reported or derived from public benchmarks; independent verification varies. H100 throughput for reference uses 4-GPU tensor-parallel vLLM setup. Etched Sohu figure is claimed and unverified (community estimate).

💡 Tip · Interview framing: When asked “why not just use Cerebras or Groq everywhere?” — the answer is the flexibility vs. throughput tradeoff. Cerebras/Groq win on pure tokens/second for fixed models but cannot run arbitrary model architectures, require vendor-specific SDKs, and lack the multi-modal / multi-task flexibility of GPU clusters. NVIDIA’s moat is the CUDA software ecosystem and model-architecture generality, not raw arithmetic throughput.
🧠

Key Takeaways

What to remember for interviews

  1. Continuous batching (Orca) acts at the iteration level — finished sequences are evicted and new requests inserted every decode step, keeping GPU utilization near 90%
  2. PagedAttention maps KV cache to fixed-size physical blocks on demand, cutting memory waste from ~60% to ~4% and enabling 2–4× more concurrent requests
  3. Prefill is compute-bound (parallel, bottleneck = TTFT); decode is memory-bandwidth-bound (autoregressive, bottleneck = TPS) — different phases, different optimization targets
  4. SGLang RadixAttention caches KV for shared prefixes in a radix tree — up to 6.4× throughput on prefix-heavy workloads vs a vLLM baseline. vLLM V1 (2025) delivers ~5× latency reduction over v0.5.
  5. GB200 NVL72 (GA Feb 2025): 72 B200 GPUs, 130 TB/s NVLink, 13.5 TB unified HBM — enables single-rack inference for 405B+ models. FP8 is the 2025 default on H100 (<0.5% accuracy loss vs BF16).
  6. Non-NVIDIA frontier (2024–2025): Cerebras WSE-3 ~2,100 tok/s, Groq LPU ~800 tok/s (1,665 with spec-dec), TPU v6e 4.7× vs v5e — all trade CUDA flexibility for specific throughput/efficiency wins.
🧠

Recap quiz

Derivation

The vLLM paper reports 2–4× throughput over FasterTransformer/Orca at the same latency. What is the primary mechanism behind this gain?

Derivation

Llama-70B at FP16 has 140 GB of weights. An A100 has 2 TB/s HBM bandwidth. What is the approximate batch=1 decode throughput, and why does batching improve it?

Trade-off

TensorRT-LLM achieves roughly 1.5–2× higher throughput than vLLM on NVIDIA hardware. When is vLLM still the better choice for a production deployment?

Trade-off

A serving system processes requests with highly variable prompt lengths (50–8,000 tokens). Why does prefill-disaggregation (Splitwise/DistServe) reduce p99 latency even if total GPU-seconds consumed is unchanged?

Derivation

Beyond reducing KV-cache waste from ~60% to under 4%, PagedAttention enables prefix caching. What makes prefix caching possible in the paged design that was impossible with contiguous pre-allocation?

📚

Further Reading

🎯

Interview Questions


Explain continuous batching and why it improves throughput over static batching.

★★☆
Google · Databricks

What is the difference between TTFT and TPS, and why do different use cases care about different metrics?

★★☆
OpenAI · Anthropic

How does PagedAttention reduce KV cache memory waste from ~60% to under 4%?

★★★
Databricks · OpenAI

Design an LLM serving system for 1000 QPS with p99 latency < 500ms for Llama-70B.

★★★
Google · Meta

Compare prefill-disaggregated architectures (Splitwise, DistServe) vs unified serving. When does disaggregation win?

★★★
Anthropic · Databricks