🔤 Tokenization
Why can't GPT count letters in "strawberry"?
Interactive Sandbox
What you’re seeing: the full tokenization-to-embedding pipeline — raw text is split into BPE subword tokens, each token is looked up in the embedding table to produce a dense vector, then positional encodings are added before the first Transformer layer. What to try: use the interactive tokenizer below to see exactly where word boundaries fall and which rare words get split into multiple tokens.
BPE (Byte Pair Encoding) is still the standard tokenizer for all major LLMs in 2024-2025 — GPT-4 (cl100k_base), Llama-3 (128K vocab), Claude, and Mistral all use BPE variants. The algorithm starts with individual bytes/characters, then repeatedly merges the most frequent adjacent pair. After ~100K merges:
- Common words like "the" become a single token
- Rare words like "tokenization" get split into subwords: "token" + "ization"
- Any text can be encoded — no unknown words, ever
Type any text below to see how BPE breaks it down, or switch to character/word mode to see why BPE is the sweet spot.
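If you prefer code to prose, here is a minimal, self-contained sketch of the merge loop described above. The toy corpus, word frequencies, and the 5-merge budget are invented for illustration; production tokenizers operate on raw bytes and run on the order of 100K merges with heavily optimized implementations.
# Toy BPE training loop — illustrative sketch, not the production algorithm
from collections import Counter
# Toy corpus: each word is pre-split into characters and mapped to its frequency
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}
def count_pairs(corpus):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs
def merge_pair(corpus, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged
for step in range(5):  # real tokenizers run ~100K merges, not 5
    best = count_pairs(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best}")
# First merges: ('e','s') then ('es','t') — frequent pairs become new vocabulary entries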
How BPE builds tokens — iterative pair merges
Token boundaries — where GPT-4 splits “Hello, world! I'm learning.”
Note: “I” and “'m” are separate tokens — the model never sees the full word “I'm”.
BPE result — real GPT-4 tokenizer (cl100k_base)
6 tokens
BPE (Byte Pair Encoding): The sweet spot. Common words are single tokens; rare words split into subword pieces. This is the actual GPT-4 tokenizer — the same one used in production.
Cross-Language Token Comparison
Key insight: "The weather is nice today." → 6 tokens (26 chars, ~4.3 chars/token). "今天天气很好。" → 9 tokens (7 chars, ~0.8 chars/token).
English averages ~4 characters per token. Chinese averages ~1-2 characters per token. Same meaning in Chinese typically costs 1.5-2x more tokens — directly impacting API cost and context window usage. Try typing equivalent sentences in both to compare!
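You can reproduce this comparison directly with tiktoken; the token counts in the comments are the ones reported above for cl100k_base.
# Compare the token cost of equivalent English and Chinese sentences
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer
for text in ["The weather is nice today.", "今天天气很好。"]:
    ids = enc.encode(text)
    print(f"{len(text):>3} chars -> {len(ids)} tokens : {text}")
# English: 26 chars -> 6 tokens (~4.3 chars/token)
# Chinese:  7 chars -> 9 tokens (~0.8 chars/token)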
The Intuition
Models don't understand text — only numbers. BPE strikes a balance between character-level and word-level: common words (like "the") become a single token, while rare words (like "tokenization") are split into several subword pieces.
After tokenization, each token ID indexes into a large table (the embedding matrix) to retrieve a high-dimensional vector. This vector is the token's "identity card" in the model's eyes — it carries the semantic information that the subsequent Attention layers will process.
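To see the "sweet spot" claim concretely, count the units each granularity produces for the same sentence — a short sketch using tiktoken (the sample sentence is arbitrary; exact counts depend on the text you choose):
# Character-level vs word-level vs BPE granularity for the same text
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is the first step of every LLM pipeline."
chars = list(text)          # character-level: tiny vocab, very long sequences
words = text.split()        # word-level: short sequences, but huge/open-ended vocab
tokens = enc.encode(text)   # BPE: mid-sized vocab, moderate sequence length
print(len(chars), "characters |", len(words), "words |", len(tokens), "BPE tokens")
print([enc.decode([t]) for t in tokens])  # inspect where the subword splits fall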
Quick check
Character-level tokenization has a vocabulary of only ~100–256 tokens. Why does this make it worse than BPE for a 1B-parameter model, not better?
Why does Chinese text use ~2-3x more tokens than English for the same meaning?
SentencePiece & Unigram LM — The Other Major Algorithm
BPE is a bottom-up algorithm: start with characters, greedily merge the most frequent pair. SentencePiece's Unigram Language Model tokenizer (Kudo & Richardson 2018) is top-down: start with a huge vocabulary, then prune tokens whose removal increases corpus loss the least. This produces a probabilistic model over segmentations, not a single deterministic one.
The key difference is that Unigram LM can assign probabilities to all possible segmentations of a word, not just the greedy BPE one. During training, it samples from this distribution — a form of subword regularization that exposes the model to multiple segmentations of the same word, improving robustness to tokenization variation.
| Property | BPE | Unigram LM |
|---|---|---|
| Direction | Bottom-up (merge) | Top-down (prune) |
| Segmentation | Deterministic | Probabilistic (samples) |
| Training benefit | None | Subword regularization |
| Used by | GPT-4, Llama, Mistral | T5, mT5, XLNet |
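To see subword regularization in practice, the sentencepiece library exposes sampled segmentation directly. The sketch below assumes you have a reasonably large plain-text file to train on; the file names, vocab size, and sampling hyperparameters are placeholders, not recommendations.
# Train a small Unigram LM tokenizer and sample alternative segmentations
import sentencepiece as spm
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="unigram",
    vocab_size=8000, model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="unigram.model")
# Deterministic (highest-probability) segmentation — what you use at inference
print(sp.encode("tokenization", out_type=str))
# Sampled segmentations — each call may split the word differently (training-time regularization)
for _ in range(3):
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))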
Quick check
T5 and mT5 use SentencePiece Unigram LM rather than BPE. Which property of Unigram LM makes it specifically better suited for massively multilingual models covering 100+ languages?
Step-by-Step Derivation
Step 1: Tokenize text into IDs
Input text is split into subword tokens, each mapped to an integer ID from the vocabulary of size |V|.
Step 2: Embedding lookup
The embedding matrix E ∈ ℝ^{|V|×d} stores a d-dimensional vector for each token in the vocabulary. Lookup is just indexing a row:
Embedding lookup — token ID selects a row from matrix E
Full sequence embedding
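In symbols: for a sequence of token IDs t₁, …, tₙ, the sequence embedding X ∈ ℝ^{n×d} is built by stacking the selected rows of E — row i of X is simply row tᵢ of E (xᵢ = E[tᵢ]). No matrix multiplication is involved; the lookup is pure indexing, which is why it stays cheap even for 100K-row embedding tables.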
PyTorch: tokenization plus embedding lookup
# BPE tokenization + embedding lookup
import torch
import torch.nn as nn
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("Hello world") # [9906, 1917]
embed = nn.Embedding(num_embeddings=100256, embedding_dim=768)
x = embed(torch.tensor(token_ids)) # [2, 768]
Parameter count
GPT-3: |V| = 50,257, d_model = 12,288 → |V| × d_model ≈ 617M embedding params out of 175B total ≈ 0.35%
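The same arithmetic in a few lines (the 175B total is the published approximate figure):
# Embedding parameter count and fraction for GPT-3 175B
vocab_size, d_model, total_params = 50_257, 12_288, 175e9
embed_params = vocab_size * d_model               # 617,558,016 ≈ 617M
print(embed_params, embed_params / total_params)  # ≈ 0.0035, i.e. ~0.35% of all parameters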
Quick check
GPT-3 175B: |V|=50,257, d_model=12,288, 96 transformer layers. Each layer has ~12d² non-embedding params. Compute the approximate embedding fraction.
PyTorch implementation
# BPE tokenization with tiktoken (real GPT-4 tokenizer)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 / GPT-3.5-turbo encoding
text = "Hello, world! How are you today?"
token_ids = enc.encode(text)
print(token_ids) # [9906, 11, 1917, 0, 2650, 527, 499, 3432, 30]
print(len(token_ids)) # 9 tokens (vs. 6 whitespace words — BPE is subword)
# Decode back to text
decoded = enc.decode(token_ids)
print(decoded) # "Hello, world! How are you today?"
# BPE merge step (simplified):
# 1. Start with character-level tokens
# 2. Count all adjacent pairs
# 3. Merge the most frequent pair into a new token
# 4. Repeat until vocab_size reached (GPT-4: ~100K merges)
Break It — See What Happens
Toggle these to see how the tokenizer in the Play section above changes. Watch the token count explode or the vocab become unmanageable.
Real-World Numbers
| Model | Vocab size | d_model | Embed params | % of total |
|---|---|---|---|---|
| GPT-3 175B | 50,257 | 12,288 | ~617M | ~0.35% |
| GPT-4 | ~100,000 | — | — | — |
| Llama-3 70B | ~128,000 | 8,192 | ~1.05B | ~1.5% |
| GPT-2 124M | 50,257 | 768 | ~38.6M | ~31% |
| Llama-2 7B | 32,000 | 4,096 | ~131M | ~1.9% |
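A few lines of Python reproduce the embedding-parameter column above (GPT-4 is omitted because its hidden size is not public; total parameter counts are the approximate published figures, so the percentages are approximate too):
# Reproduce the "Embed params" and "% of total" columns
models = {  # name: (vocab size, d_model, approx. total params)
    "GPT-3 175B":  (50_257, 12_288, 175e9),
    "Llama-3 70B": (128_256, 8_192, 70e9),
    "GPT-2 124M":  (50_257,    768, 124e6),
    "Llama-2 7B":  (32_000,  4_096, 7e9),
}
for name, (vocab, d_model, total) in models.items():
    embed = vocab * d_model
    print(f"{name:12s} {embed / 1e6:8.1f}M embed params  {100 * embed / total:5.2f}% of total")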
Quick check
Llama-3 70B grew its vocab from 32K (Llama-2) to 128K, adding ~786M embedding params. At what model scale does this vocab-size upgrade break even — i.e., when does the sequence-length saving justify the parameter overhead?
Key Takeaways
What to remember for interviews
1. BPE starts from raw bytes and iteratively merges the most frequent adjacent pair — no hand-crafted rules
2. Vocabulary size is a key hyperparameter (32K–200K): larger vocab means shorter sequences and faster attention, but a bigger embedding table
3. Tokenization determines the model's 'vision' of text — the model only ever sees token IDs, never characters
4. Multilingual text often uses more tokens per word than English, creating both a cost and a quality gap
Recap quiz
GPT-3 has a 50,257-token vocabulary and d_model=12,288. Its embedding table is 617M params — only 0.35% of 175B total. Why does this fraction shrink so dramatically as model size grows?
Llama-3 increased its vocabulary from Llama-2's 32K to 128K tokens. For a fixed input text, what is the primary serving benefit, and what is the cost?
GPT-2 introduced byte-level BPE with 256 base tokens. What problem does this solve that character-level BPE with 128 ASCII base tokens does not?
GPT-2 ties the input embedding matrix E ∈ ℝ^{|V|×d} with the output projection. For GPT-2 Small (|V|=50,257, d=768), how many parameters does this save, and what is the regularization effect?
Chinese text uses ~2–3× more tokens than English for the same semantic content under GPT-4's cl100k_base tokenizer. What is the practical inference cost implication when serving a Chinese-language chatbot at the same QPS as an English one?
BPE is deterministic — “tokenization” always splits the same way. SentencePiece Unigram LM samples from a distribution of segmentations during training. What practical quality benefit does this provide, and why doesn't it hurt inference?
A GPT-2 Small model (124M params total) spends ~31% of its parameters on the embedding table alone. If you were designing a specialized code model with only 50M total parameters, what tokenizer design change would most effectively reduce this fraction?
Further Reading
- Neural Machine Translation of Rare Words with Subword Units (BPE) — Sennrich et al. 2016 — the paper that introduced Byte Pair Encoding for NLP tokenization.
- SentencePiece: A simple and language independent subword tokenizer — Kudo & Richardson 2018 — unigram language model tokenizer used by T5, mT5, XLNet, and many multilingual models.
- OpenAI Tiktoken (cl100k_base) — Production BPE tokenizer powering GPT-4. Fast Rust implementation with Python bindings.
- Andrej Karpathy — Let's Build GPT — Builds a GPT from scratch including the tokenization step — great for seeing BPE in context of the full pipeline.
- Andrej Karpathy — Let's Build the GPT Tokenizer — 2-hour deep-dive building the GPT-4 BPE tokenizer from scratch — covers byte-level BPE, special tokens, and tiktoken internals.
- Tokenizer Summary — Hugging Face docs — HuggingFace reference covering every major algorithm — WordPiece, Unigram, BPE, byte-level BPE — with concrete examples of merge rules.
- Lilian Weng — Large Transformer Model Inference Optimization — Covers vocabulary size tradeoffs and how tokenization choices affect inference throughput and model capacity.
Interview Questions
- ★☆☆ Why BPE over word-level or character-level?
- ★☆☆ Why can't LLMs count letters in "strawberry"?
- ★★☆ Embedding param count — what % of total model?
- ★★☆ Weight tying: share embedding and output projection?
- ★☆☆ How does the tokenizer handle out-of-vocabulary (OOV) words? Why is this important?
- ★★☆ What is the relationship between vocabulary size and model performance? What's the tradeoff?
- ★★☆ How would you handle a new language that wasn't in the original training data?