
Transformer Math

Module 1 · The Transformer

🔤 Tokenization

Why can't GPT count letters in "strawberry"?

🎮

Interactive Sandbox

What you’re seeing: the full tokenization-to-embedding pipeline — raw text is split into BPE subword tokens, each token is looked up in the embedding table to produce a dense vector, then positional encodings are added before the first Transformer layer. What to try: use the interactive tokenizer below to see exactly where word boundaries fall and which rare words get split into multiple tokens.

[Diagram: Tokenization & Embedding Pipeline — raw text "The cat sat" → Tokenizer (BPE) → tokens "The", "▁cat", "▁sat" → token IDs 464, 3797, 3332 → Embedding Lookup (|V| × d table) → + Positional Encoding (sin/cos) → final input matrix [n, d]]

BPE (Byte Pair Encoding) is still the standard tokenizer for all major LLMs in 2024-2025 — GPT-4 (cl100k_base), Llama-3 (128K vocab), Claude, and Mistral all use BPE variants. The algorithm starts with individual bytes/characters, then repeatedly merges the most frequent adjacent pair. After ~100K merges:

  • Common words like "the" become a single token
  • Rare words like "tokenization" get split into subwords: "token" + "ization"
  • Any text can be encoded — no unknown words, ever

Type any text below to see how BPE breaks it down, or switch to character/word mode to see why BPE is the sweet spot.

How BPE builds tokens — iterative pair merges

[Diagram: BPE merge steps on "lower" — STEP 0: characters l·o·w·e·r; STEP 1: merge "l"+"o" → "lo"; STEP 2: merge "lo"+"w" → "low". Final tokens: "lowest" → 2 tokens (vs 6 chars); "tokenization" → 2 tokens (vs 12 chars)]
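To make the merge loop concrete, here is a minimal, self-contained sketch of BPE training on a toy corpus. The helper names and toy word frequencies are illustrative assumptions — real tokenizers like tiktoken operate on raw bytes over a web-scale corpus and are far more optimized — but the greedy count-and-merge logic is the same idea.

# Minimal BPE training loop on a toy corpus — illustrative sketch only,
# not the production tiktoken implementation (which works on raw bytes).
from collections import Counter

def pair_counts(words):
    # words: dict mapping a tuple of symbols -> corpus frequency
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge(words, pair):
    # Replace every occurrence of `pair` with one merged symbol
    out = {}
    for symbols, freq in words.items():
        new, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                new.append(symbols[i]); i += 1
        out[tuple(new)] = freq
    return out

words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}  # toy corpus
for step in range(4):
    best = pair_counts(words).most_common(1)[0][0]   # most frequent adjacent pair
    words = merge(words, best)
    print(f"step {step}: merge {best} -> {list(words)}")

Run this loop for ~100K merges on a web-scale corpus and you get a merge table of the kind behind cl100k_base.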

Token boundaries — where GPT-4 splits “Hello, world! I'm learning.”

"Hello" → 9906 · "," → 11 · " world" → 1917 · "!" → 0 · " I" → 358 · "'m" → 2846 · " learning" → 4477 · "." → 13

Note: “I” and “'m” are separate tokens — the model never sees the full word “I'm”.

BPE result — real GPT-4 tokenizer (cl100k_base)

"Hello" → 9906 · "," → 11 · " how" → 1268 · " are" → 527 · " you" → 499 · "?" → 30

6 tokens

BPE (Byte Pair Encoding): The sweet spot. Common words are single tokens; rare words split into subword pieces. This is the actual GPT-4 tokenizer — the same one used in production.

Cross-Language Token Comparison

English — 6 tokens (GPT-4 BPE): "The" → 791 · " weather" → 9282 · " is" → 374 · " nice" → 6555 · " today" → 3432 · "." → 13

Chinese — 9 tokens (GPT-4 BPE): "今天天气很好。" splits into 9 byte-level subword tokens

Key insight: "The weather is nice today." → 6 tokens (26 chars, ~4.3 chars/token). "今天天气很好。" → 9 tokens (7 chars, ~0.8 chars/token).

English averages ~4 characters per token. Chinese averages ~1-2 characters per token. Same meaning in Chinese typically costs 1.5-2x more tokens — directly impacting API cost and context window usage. Try typing equivalent sentences in both to compare!
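You can reproduce the chars-per-token gap yourself with tiktoken. Exact counts depend on the sentences you pick; the sketch below uses the two example sentences from the comparison above.

# Measure chars-per-token for the two example sentences with the real
# GPT-4 tokenizer (cl100k_base). Exact counts vary with the input text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["The weather is nice today.", "今天天气很好。"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(text)} chars, {len(ids)} tokens, "
          f"{len(text) / len(ids):.1f} chars/token")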

💡

The Intuition

Models don't understand text — only numbers. BPE strikes a balance between character-level and word-level: common words (like "the") become a single token, while rare words (like "tokenization") are split into several subword pieces.

After tokenization, each token ID indexes a row of a large table (the embedding matrix) to get a high-dimensional vector. This vector is the token's "identity card" in the model's eyes — it carries the semantic information that subsequent Attention layers will process.

💡 Tip · Think of BPE as a compression algorithm for language. Frequent patterns get short codes (single tokens), rare patterns get longer codes (multiple tokens). Just like how Huffman coding assigns shorter bit sequences to more frequent characters.

Quick check

Trade-off

Character-level tokenization has a vocabulary of only ~100–256 tokens. Why does this make it worse than BPE for a 1B-parameter model, not better?

Quick check

Why does Chinese text use ~2-3x more tokens than English for the same meaning?

🔬

SentencePiece & Unigram LM — The Other Major Algorithm

BPE is a bottom-up algorithm: start with characters and greedily merge the most frequent adjacent pair. SentencePiece's Unigram Language Model tokenizer (Kudo & Richardson 2018) is top-down: start with a huge candidate vocabulary, then prune the tokens whose removal increases corpus loss the least. This produces a probabilistic model over segmentations rather than a single deterministic one.

The key difference is that Unigram LM can assign probabilities to all possible segmentations of a word, not just the greedy BPE one. During training, it samples from this distribution — a form of subword regularization that exposes the model to multiple segmentations of the same word, improving robustness to tokenization variation.
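A minimal sketch of what subword regularization looks like in code, assuming you already have a trained SentencePiece Unigram model on disk (the model path below is a placeholder, not a file shipped with this page):

# Subword regularization with SentencePiece Unigram LM: each call can return a
# different segmentation of the same word.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")  # placeholder path
for _ in range(3):
    # enable_sampling draws from the segmentation distribution;
    # alpha smooths it, nbest_size=-1 samples over all candidates
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# e.g. ['▁token', 'ization'] on one call, ['▁to', 'ken', 'ization'] on another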

Property | BPE | Unigram LM
Direction | Bottom-up (merge) | Top-down (prune)
Segmentation | Deterministic | Probabilistic (samples)
Training benefit | None | Subword regularization
Used by | GPT-4, Llama, Mistral | T5, mT5, XLNet
✨ Insight · SentencePiece operates directly on raw text, removing the need for pre-tokenization (whitespace splitting). This makes it truly language-agnostic — essential for multilingual models like mT5 that cover 100+ languages (Kudo & Richardson 2018).

Quick check

Trade-off

T5 and mT5 use SentencePiece Unigram LM rather than BPE. Which property of Unigram LM makes it specifically better suited for massively multilingual models covering 100+ languages?

📐

Step-by-Step Derivation

Step 1: Tokenize text into IDs

Input text is split into subword tokens, each mapped to an integer ID from a vocabulary of size |V|.

Step 2: Embedding lookup

The embedding matrix E ∈ ℝ^{|V|×d} stores a d-dimensional vector for each token in the vocabulary. Lookup is just indexing a row:

Embedding lookup — token ID selects a row from matrix E

[Diagram: Embedding lookup — token IDs ("Hello" = 9906, " world" = 1917) each select one row of the matrix E [|V| × d] (50,257 rows × 768 cols for GPT-2), yielding vectors x₁, x₂ ∈ ℝ^d, e.g. x₁ = [0.12, −0.34, … 768 dims]]

Full sequence embedding: for token IDs (t₁, …, tₙ), stacking the looked-up rows gives X = [E[t₁]; E[t₂]; …; E[tₙ]] ∈ ℝ^{n×d}.

PyTorch: tokenization plus embedding lookup

# BPE tokenization + embedding lookup
import torch
import torch.nn as nn
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("Hello world")  # [9906, 1917]

embed = nn.Embedding(num_embeddings=enc.n_vocab, embedding_dim=768)  # n_vocab = 100,277 for cl100k_base
x = embed(torch.tensor(token_ids))  # [2, 768]
✨ Insight · The embedding lookup is not a matrix multiplication — it is a table lookup (indexing rows). It costs O(d) per token, not the O(|V|·d) a one-hot matrix multiply would. The embedding matrix is learned during training: similar tokens end up with similar vectors.
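A quick sanity check of that claim — the sketch below (toy sizes, illustrative only) shows that indexing rows of E gives exactly what a one-hot matrix multiply would, without the wasted work:

# Embedding lookup vs. one-hot matmul — same result, very different cost.
import torch
import torch.nn.functional as F

V, d = 1000, 8                      # toy vocab size and embedding dim
E = torch.randn(V, d)               # embedding matrix
ids = torch.tensor([42, 7])         # two token IDs

lookup = E[ids]                                   # row indexing: O(n·d)
one_hot = F.one_hot(ids, num_classes=V).float()   # [2, V]
matmul = one_hot @ E                              # equivalent, but O(n·|V|·d)
print(torch.allclose(lookup, matmul))             # True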

Parameter count

GPT-3: |V| = 50,257, d_model = 12,288 → 50,257 × 12,288 ≈ 617M embedding params out of 175B total ≈ 0.35%
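The same arithmetic in a few lines of Python, using the figures quoted in this section:

# GPT-3 embedding parameter count.
V, d_model, total = 50_257, 12_288, 175e9
embed = V * d_model                       # one d_model-dim row per vocab entry
print(f"{embed / 1e6:.1f}M params")       # ~617.6M
print(f"{embed / total:.2%} of total")    # ~0.35%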

Quick check

Derivation

GPT-3 175B: |V|=50,257, d_model=12,288, 96 transformer layers. Each layer has ~16d² non-embedding params. Compute the approximate embedding fraction.

PyTorch implementation
# BPE tokenization with tiktoken (real GPT-4 tokenizer)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / Claude encoding

text = "Hello, world! How are you today?"
token_ids = enc.encode(text)
print(token_ids)       # [9906, 11, 1917, 0, 2650, 527, 499, 3432, 30]
print(len(token_ids))  # 9 tokens for 6 words — punctuation gets its own tokens

# Decode back to text
decoded = enc.decode(token_ids)
print(decoded)         # "Hello, world! How are you today?"

# BPE merge step (simplified):
# 1. Start with character-level tokens
# 2. Count all adjacent pairs
# 3. Merge the most frequent pair into a new token
# 4. Repeat until vocab_size reached (GPT-4: ~100K merges)
🔧

Break It — See What Happens

Toggle these to see how the tokenizer in the Play section above changes. Watch the token count explode or the vocab become unmanageable.

Force character-level tokenization
Force word-level tokenization
📊

Real-World Numbers

Model | Vocab |V| | d_model | Embed params | % of total
GPT-3 175B | 50,257 | 12,288 | ~617M | ~0.35%
GPT-4 | ~100,000 | not public | — | —
Llama-3 70B | ~128,000 | 8,192 | ~1.05B | ~1.5%
GPT-2 124M | 50,257 | 768 | ~38.6M | ~31%
Llama-2 7B | 32,000 | 4,096 | ~131M | ~1.9%
💡 Tip · Notice the trend: larger models have a tiny embedding %. GPT-3's embedding is only 0.35% of 175B params. But GPT-2 (124M) spends 31% on embeddings! This is why small models use smaller vocabs — and why Llama-3 could afford to jump to 128K vocab with 70B params.
✨ Insight · Llama-3 quadrupled the vocab from Llama-2's 32K to 128K, mainly for better multilingual and code coverage. Larger vocab = shorter sequences = faster inference, at the cost of a larger embedding table.
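A back-of-envelope check of that trade-off, using the round numbers quoted in this section (Llama-3's actual vocab is 128,256):

# Cost side of the Llama-2 -> Llama-3 vocab jump at d_model = 8,192.
old_vocab, new_vocab, d_model = 32_000, 128_000, 8_192
extra = (new_vocab - old_vocab) * d_model
print(f"{extra / 1e6:.0f}M extra embedding params")   # ~786M
# Benefit side: a larger vocab packs the same text into fewer tokens, so
# sequences are shorter -> smaller KV cache and fewer attention FLOPs.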

Quick check

Trade-off

Llama-3 70B grew its vocab from 32K (Llama-2) to 128K, adding ~786M embedding params. At what model scale does this vocab-size upgrade break even — i.e., when does the sequence-length saving justify the parameter overhead?

🧠

Key Takeaways

What to remember for interviews

  1. BPE starts from raw bytes and iteratively merges the most frequent adjacent pair — no hand-crafted rules
  2. Vocabulary size is a key hyperparameter (32K–200K): larger vocab means shorter sequences and faster attention, but a bigger embedding table
  3. Tokenization determines the model's 'vision' of text — the model only ever sees token IDs, never characters
  4. Multilingual text often uses more tokens per word than English, creating both a cost and a quality gap
🧠

Recap quiz

Derivation

GPT-3 has a 50,257-token vocabulary and d_model=12,288. Its embedding table is 617M params — only 0.35% of 175B total. Why does this fraction shrink so dramatically as model size grows?

Trade-off

Llama-3 increased its vocabulary from Llama-2's 32K to 128K tokens. For a fixed input text, what is the primary serving benefit, and what is the cost?

Derivation

GPT-2 introduced byte-level BPE with 256 base tokens. What problem does this solve that character-level BPE with 128 ASCII base tokens does not?

Derivation

GPT-2 ties the input embedding matrix E ∈ ℝ^{|V|×d} with the output projection. For GPT-2 Small (|V|=50,257, d=768), how many parameters does this save, and what is the regularization effect?

Trade-off

Chinese text uses ~2–3× more tokens than English for the same semantic content under GPT-4's cl100k_base tokenizer. What is the practical inference cost implication when serving a Chinese-language chatbot at the same QPS as an English one?

Trade-off

BPE is deterministic — “tokenization” always splits the same way. SentencePiece Unigram LM samples from a distribution of segmentations during training. What practical quality benefit does this provide, and why doesn't it hurt inference?

Trade-off

A GPT-2 Small model (124M params total) spends ~31% of its parameters on the embedding table alone. If you were designing a specialized code model with only 50M total parameters, what tokenizer design change would most effectively reduce this fraction?

📚

Further Reading

🎯

Interview Questions


Why BPE over word-level or character-level?

★☆☆
Google · OpenAI

Why can't LLMs count letters in "strawberry"?

★☆☆
Google · Meta

Embedding param count — what % of total model?

★★☆
Google · Databricks

Weight tying: share embedding and output projection?

★★☆
OpenAI · Anthropic

How does the tokenizer handle out-of-vocabulary (OOV) words? Why is this important?

★☆☆
Google · Meta

What is the relationship between vocabulary size and model performance? What's the tradeoff?

★★☆
Databricks · OpenAI

How would you handle a new language that wasn't in the original training data?

★★☆
AnthropicGoogle