
Transformer Math

Module 1 · The Transformer

🔤 Tokenization

Why can't GPT count letters in "strawberry"?

🎮

Interactive Sandbox

What you’re seeing: the full tokenization-to-embedding pipeline — raw text is split into BPE subword tokens, each token is looked up in the embedding table to produce a dense vector, then positional encodings are added before the first Transformer layer. What to try: use the interactive tokenizer below to see exactly where word boundaries fall and which rare words get split into multiple tokens.

[Diagram: Tokenization & Embedding Pipeline — raw text "The cat sat" → Tokenizer (BPE) → tokens "The", "▁cat", "▁sat" → token IDs 464, 3797, 3332 → Embedding Lookup (|V| × d table) → + Positional Encoding (sin/cos) → final input matrix [n, d]]

BPE (Byte Pair Encoding) is still the standard tokenizer for all major LLMs in 2024-2025 — GPT-4 (cl100k_base), Llama-3 (128K vocab), Claude, and Mistral all use BPE variants. The algorithm starts with individual bytes/characters, then repeatedly merges the most frequent adjacent pair. After ~100K merges:

  • Common words like "the" become a single token
  • Rare words like "tokenization" get split into subwords: "token" + "ization"
  • Any text can be encoded — no unknown words, ever

Type any text below to see how BPE breaks it down, or switch to character/word mode to see why BPE is the sweet spot.

How BPE builds tokens — iterative pair merges

[Diagram: BPE merge steps on "lower" — STEP 0: characters l·o·w·e·r; STEP 1: merge "l"+"o" → "lo"; STEP 2: merge "lo"+"w" → "low". Final tokens: "lowest" → 2 tokens (vs 6 chars); "tokenization" → 2 tokens (vs 12 chars)]
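To make the merge loop concrete, here is a minimal, self-contained sketch of BPE training on a toy corpus. The helper names and toy word frequencies are illustrative assumptions — real tokenizers like tiktoken operate on raw bytes over a web-scale corpus and are far more optimized — but the greedy count-and-merge logic is the same idea.

# Minimal BPE training loop on a toy corpus — illustrative sketch only,
# not the production tiktoken implementation (which works on raw bytes).
from collections import Counter

def pair_counts(words):
    # words: dict mapping a tuple of symbols -> corpus frequency
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge(words, pair):
    # Replace every occurrence of `pair` with one merged symbol
    out = {}
    for symbols, freq in words.items():
        new, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                new.append(symbols[i]); i += 1
        out[tuple(new)] = freq
    return out

words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}  # toy corpus
for step in range(4):
    best = pair_counts(words).most_common(1)[0][0]   # most frequent adjacent pair
    words = merge(words, best)
    print(f"step {step}: merge {best} -> {list(words)}")

Run this loop for ~100K merges on a web-scale corpus and you get a merge table of the kind behind cl100k_base.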

Token boundaries — where GPT-4 splits “Hello, world! I'm learning.”

"Hello" → 9906 · "," → 11 · " world" → 1917 · "!" → 0 · " I" → 358 · "'m" → 2846 · " learning" → 4477 · "." → 13

Note: “I” and “'m” are separate tokens — the model never sees the full word “I'm”.

BPE result — real GPT-4 tokenizer (cl100k_base)

"Hello" → 9906 · "," → 11 · " how" → 1268 · " are" → 527 · " you" → 499 · "?" → 30

6 tokens

BPE (Byte Pair Encoding): The sweet spot. Common words are single tokens; rare words split into subword pieces. This is the actual GPT-4 tokenizer — the same one used in production.

Cross-Language Token Comparison

English — 6 tokens (GPT-4 BPE): "The" → 791 · " weather" → 9282 · " is" → 374 · " nice" → 6555 · " today" → 3432 · "." → 13

Chinese — 9 tokens (GPT-4 BPE): "今天天气很好。" splits into 9 byte-level subword tokens

Key insight: "The weather is nice today." → 6 tokens (26 chars, ~4.3 chars/token). "今天天气很好。" → 9 tokens (7 chars, ~0.8 chars/token).

English averages ~4 characters per token. Chinese averages ~1-2 characters per token. Same meaning in Chinese typically costs 1.5-2x more tokens — directly impacting API cost and context window usage. Try typing equivalent sentences in both to compare!
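You can reproduce the chars-per-token gap yourself with tiktoken. Exact counts depend on the sentences you pick; the sketch below uses the two example sentences from the comparison above.

# Measure chars-per-token for the two example sentences with the real
# GPT-4 tokenizer (cl100k_base). Exact counts vary with the input text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["The weather is nice today.", "今天天气很好。"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(text)} chars, {len(ids)} tokens, "
          f"{len(text) / len(ids):.1f} chars/token")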

💡

The Intuition

Models don't understand text — only numbers. BPE strikes a balance between character-level and word-level: common words (like "the") become a single token, while rare words (like "tokenization") are split into several subword pieces.

After tokenization, each token ID indexes a row of a large table (the embedding matrix) to get a high-dimensional vector. This vector is the token's "identity card" in the model's eyes — it carries the semantic information that subsequent Attention layers will process.

💡 Tip · Think of BPE as a compression algorithm for language. Frequent patterns get short codes (single tokens), rare patterns get longer codes (multiple tokens). Just like how Huffman coding assigns shorter bit sequences to more frequent characters.

Quick check

Trade-off

Character-level tokenization has a vocabulary of only ~100–256 tokens. Why does this make it worse than BPE for a 1B-parameter model, not better?

Quick check

Why does Chinese text use ~2-3x more tokens than English for the same meaning?

🔬

SentencePiece & Unigram LM — The Other Major Algorithm

BPE is a bottom-up algorithm: start with characters and greedily merge the most frequent adjacent pair. SentencePiece's Unigram Language Model tokenizer (Kudo & Richardson 2018) is top-down: start with a huge candidate vocabulary, then prune the tokens whose removal increases corpus loss the least. This produces a probabilistic model over segmentations rather than a single deterministic one.

The key difference is that Unigram LM can assign probabilities to all possible segmentations of a word, not just the greedy BPE one. During training, it samples from this distribution — a form of subword regularization that exposes the model to multiple segmentations of the same word, improving robustness to tokenization variation.
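A minimal sketch of what subword regularization looks like in code, assuming you already have a trained SentencePiece Unigram model on disk (the model path below is a placeholder, not a file shipped with this page):

# Subword regularization with SentencePiece Unigram LM: each call can return a
# different segmentation of the same word.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")  # placeholder path
for _ in range(3):
    # enable_sampling draws from the segmentation distribution;
    # alpha smooths it, nbest_size=-1 samples over all candidates
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# e.g. ['▁token', 'ization'] on one call, ['▁to', 'ken', 'ization'] on another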

Property | BPE | Unigram LM
Direction | Bottom-up (merge) | Top-down (prune)
Segmentation | Deterministic | Probabilistic (samples)
Training benefit | None | Subword regularization
Used by | GPT-4, Llama, Mistral | T5, mT5, XLNet
✨ Insight · SentencePiece operates directly on raw text, removing the need for pre-tokenization (whitespace splitting). This makes it truly language-agnostic — essential for multilingual models like mT5 that cover 100+ languages (Kudo & Richardson 2018).

Quick check

Trade-off

T5 and mT5 use SentencePiece Unigram LM rather than BPE. Which property of Unigram LM makes it specifically better suited for massively multilingual models covering 100+ languages?

📐

Step-by-Step Derivation

Step 1: Tokenize text into IDs

Input text is split into subword tokens, each mapped to an integer ID from a vocabulary of size |V|.

Step 2: Embedding lookup

The embedding matrix E ∈ ℝ^{|V|×d} stores a d-dimensional vector for each token in the vocabulary. Lookup is just indexing a row:

Embedding lookup — token ID selects a row from matrix E

[Diagram: Embedding lookup — token IDs ("Hello" = 9906, " world" = 1917) each select one row of the matrix E [|V| × d] (50,257 rows × 768 cols for GPT-2), yielding vectors x₁, x₂ ∈ ℝ^d, e.g. x₁ = [0.12, −0.34, … 768 dims]]

Full sequence embedding: for token IDs (t₁, …, tₙ), stacking the looked-up rows gives X = [E[t₁]; E[t₂]; …; E[tₙ]] ∈ ℝ^{n×d}.

PyTorch: tokenization plus embedding lookup

# BPE tokenization + embedding lookup
import torch
import torch.nn as nn
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("Hello world")  # [9906, 1917]

embed = nn.Embedding(num_embeddings=enc.n_vocab, embedding_dim=768)  # n_vocab = 100,277 for cl100k_base
x = embed(torch.tensor(token_ids))  # [2, 768]
✨ Insight · The embedding lookup is not a matrix multiplication — it is a table lookup (indexing rows). It costs O(d) per token, not the O(|V|·d) a one-hot matrix multiply would. The embedding matrix is learned during training: similar tokens end up with similar vectors.
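A quick sanity check of that claim — the sketch below (toy sizes, illustrative only) shows that indexing rows of E gives exactly what a one-hot matrix multiply would, without the wasted work:

# Embedding lookup vs. one-hot matmul — same result, very different cost.
import torch
import torch.nn.functional as F

V, d = 1000, 8                      # toy vocab size and embedding dim
E = torch.randn(V, d)               # embedding matrix
ids = torch.tensor([42, 7])         # two token IDs

lookup = E[ids]                                   # row indexing: O(n·d)
one_hot = F.one_hot(ids, num_classes=V).float()   # [2, V]
matmul = one_hot @ E                              # equivalent, but O(n·|V|·d)
print(torch.allclose(lookup, matmul))             # True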

Parameter count

GPT-3: |V| = 50,257, d_model = 12,288 → 50,257 × 12,288 ≈ 617M embedding params out of 175B total ≈ 0.35%
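The same arithmetic in a few lines of Python, using the figures quoted in this section:

# GPT-3 embedding parameter count.
V, d_model, total = 50_257, 12_288, 175e9
embed = V * d_model                       # one d_model-dim row per vocab entry
print(f"{embed / 1e6:.1f}M params")       # ~617.6M
print(f"{embed / total:.2%} of total")    # ~0.35%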

Quick check

Derivation

GPT-3 175B: |V|=50,257, d_model=12,288, 96 transformer layers. Each layer has ~16d² non-embedding params. Compute the approximate embedding fraction.

PyTorch implementation
# BPE tokenization with tiktoken (real GPT-4 tokenizer)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / Claude encoding

text = "Hello, world! How are you today?"
token_ids = enc.encode(text)
print(token_ids)       # [9906, 11, 1917, 0, 2650, 527, 499, 3432, 30]
print(len(token_ids))  # 9 tokens for 6 words — punctuation gets its own tokens

# Decode back to text
decoded = enc.decode(token_ids)
print(decoded)         # "Hello, world! How are you today?"

# BPE merge step (simplified):
# 1. Start with character-level tokens
# 2. Count all adjacent pairs
# 3. Merge the most frequent pair into a new token
# 4. Repeat until vocab_size reached (GPT-4: ~100K merges)
🔧

Break It — See What Happens

Toggle these to see how the tokenizer in the Play section above changes. Watch the token count explode or the vocab become unmanageable.

Force character-level tokenization
Force word-level tokenization
📊

Real-World Numbers

Model | Vocab |V| | d_model | Embed params | % of total
GPT-3 175B | 50,257 | 12,288 | ~617M | ~0.35%
GPT-4 | ~100,000 | not public | — | —
Llama-3 70B | ~128,000 | 8,192 | ~1.05B | ~1.5%
GPT-2 124M | 50,257 | 768 | ~38.6M | ~31%
Llama-2 7B | 32,000 | 4,096 | ~131M | ~1.9%
💡 Tip · Notice the trend: larger models have a tiny embedding %. GPT-3's embedding is only 0.35% of 175B params. But GPT-2 (124M) spends 31% on embeddings! This is why small models use smaller vocabs — and why Llama-3 could afford to jump to 128K vocab with 70B params.
✨ Insight · Llama-3 quadrupled the vocab from Llama-2's 32K to 128K, mainly for better multilingual and code coverage. Larger vocab = shorter sequences = faster inference, at the cost of a larger embedding table.
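A back-of-envelope check of that trade-off, using the round numbers quoted in this section (Llama-3's actual vocab is 128,256):

# Cost side of the Llama-2 -> Llama-3 vocab jump at d_model = 8,192.
old_vocab, new_vocab, d_model = 32_000, 128_000, 8_192
extra = (new_vocab - old_vocab) * d_model
print(f"{extra / 1e6:.0f}M extra embedding params")   # ~786M
# Benefit side: a larger vocab packs the same text into fewer tokens, so
# sequences are shorter -> smaller KV cache and fewer attention FLOPs.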

Quick check

Trade-off

Llama-3 70B grew its vocab from 32K (Llama-2) to 128K, adding ~786M embedding params. At what model scale does this vocab-size upgrade break even — i.e., when does the sequence-length saving justify the parameter overhead?

🧠

Key Takeaways

What to remember for interviews

  1. BPE starts from raw bytes and iteratively merges the most frequent adjacent pair — no hand-crafted rules
  2. Vocabulary size is a key hyperparameter (32K–200K): larger vocab means shorter sequences and faster attention, but a bigger embedding table
  3. Tokenization determines the model's 'vision' of text — the model only ever sees token IDs, never characters
  4. Multilingual text often uses more tokens per word than English, creating both a cost and a quality gap
🧠

Recap quiz

Derivation

GPT-3 has a 50,257-token vocabulary and d_model=12,288. Its embedding table is 617M params — only 0.35% of 175B total. Why does this fraction shrink so dramatically as model size grows?

Trade-off

Llama-3 increased its vocabulary from Llama-2's 32K to 128K tokens. For a fixed input text, what is the primary serving benefit, and what is the cost?

Derivation

GPT-2 introduced byte-level BPE with 256 base tokens. What problem does this solve that character-level BPE with 128 ASCII base tokens does not?

Derivation

GPT-2 ties the input embedding matrix E ∈ ℝ^{|V|×d} with the output projection. For GPT-2 Small (|V|=50,257, d=768), how many parameters does this save, and what is the regularization effect?

Trade-off

Chinese text uses ~2–3× more tokens than English for the same semantic content under GPT-4's cl100k_base tokenizer. What is the practical inference cost implication when serving a Chinese-language chatbot at the same QPS as an English one?

Trade-off

BPE is deterministic — “tokenization” always splits the same way. SentencePiece Unigram LM samples from a distribution of segmentations during training. What practical quality benefit does this provide, and why doesn't it hurt inference?

Trade-off

A GPT-2 Small model (124M params total) spends ~31% of its parameters on the embedding table alone. If you were designing a specialized code model with only 50M total parameters, what tokenizer design change would most effectively reduce this fraction?

📚

Further Reading

🎯

Interview Questions


Why BPE over word-level or character-level?

★☆☆
Google · OpenAI

Why can't LLMs count letters in "strawberry"?

★☆☆
Google · Meta

Embedding param count — what % of total model?

★★☆
Google · Databricks

Weight tying: share embedding and output projection?

★★☆
OpenAI · Anthropic

How does the tokenizer handle out-of-vocabulary (OOV) words? Why is this important?

★☆☆
Google · Meta

What is the relationship between vocabulary size and model performance? What's the tradeoff?

★★☆
Databricks · OpenAI

How would you handle a new language that wasn't in the original training data?

★★☆
AnthropicGoogle