
Transformer Math

Module 2 · The Transformer

📊 Embeddings

Why is 'king' - 'man' + 'woman' = 'queen'?


The tokenizer converts text into integer IDs. But a model can't do math on bare integers — embeddings turn each token ID into a dense vector of learned numbers that encode meaning, syntax, and relationships. This is the first learnable layer in every Transformer.

📐

Embedding Space Geometry

What you're seeing: A 2D projection of a high-dimensional embedding space. Words cluster by semantic category — royalty words (king, queen, prince) land near each other, animals cluster separately, and verbs form their own region. The dashed arrows illustrate the famous analogy: king − man + woman ≈ queen.

What to try: Trace the dashed arrows — subtract the "man" vector from "king", add "woman", and the result lands near "queen". Notice the parallelogram shape: king→queen mirrors man→woman.

[Interactive figure: 2D embedding-space projection with royalty, animal, and action clusters; dashed "− man" / "+ woman" arrows trace the gender offset, showing king − man + woman ≈ queen.]
[Interactive figure: Embedding lookup — token ID → matrix row → dense vector. Token IDs (The=464, cat=3797, sat=3290, on=319, the=262, mat=2603) index rows of the embedding matrix E (vocab × d_model). Lookup is O(d) — just grab row 3797; no matrix multiply needed.]
🎮

Embedding Lookup Table

What you're seeing: Each token ID is an index into a giant matrix. The row at that index is the token's embedding vector. Click a row to highlight it.

Sentence: "The cat sat on the mat" → 6 tokens → 6 vector lookups. Real models use 768-4096 dimensions; this shows 8 for clarity.

Token           ID     Embedding Vector (d=8)
The             464    [0.12, -0.34, 0.78, 0.01, -0.56, 0.23, -0.89, 0.45]
cat (similar)   3797   [0.67, 0.15, -0.42, 0.88, 0.03, -0.71, 0.29, -0.14]
sat (similar)   3290   [0.55, 0.08, -0.39, 0.81, 0.11, -0.63, 0.34, -0.22]
on              319    [-0.23, 0.91, 0.05, -0.67, 0.44, 0.18, -0.52, 0.76]
the             262    [0.11, -0.31, 0.75, -0.02, -0.53, 0.21, -0.85, 0.42]
mat (similar)   2603   [0.61, 0.19, -0.45, 0.83, 0.07, -0.68, 0.31, -0.18]
💡 Tip · Notice "cat", "sat", and "mat" have similar vectors (high cosine similarity) because they appear in similar contexts. "on" looks very different — it's a function word, not a noun/verb.
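The tip above is easy to verify directly. A quick check using the 8-dimensional toy vectors from the lookup table:

```python
import numpy as np

# Vectors copied from the lookup table above (d=8 toy example).
cat = np.array([0.67, 0.15, -0.42, 0.88, 0.03, -0.71, 0.29, -0.14])
sat = np.array([0.55, 0.08, -0.39, 0.81, 0.11, -0.63, 0.34, -0.22])
on  = np.array([-0.23, 0.91, 0.05, -0.67, 0.44, 0.18, -0.52, 0.76])

def cosine(a, b):
    """Cosine similarity: dot product of the unit-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"cos(cat, sat) = {cosine(cat, sat):.2f}")  # high — similar contexts
print(f"cos(cat, on)  = {cosine(cat, on):.2f}")   # negative — function word
```

The noun/verb vectors come out with cosine similarity above 0.9, while "on" lands on the negative side.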
💡

The Intuition

Why not one-hot vectors? A one-hot vector for "cat" in a 50K vocabulary is a 50,000-dimensional vector with a single 1 at position 3,797 and zeros everywhere else. "Cat" and "kitten" have zero similarity — their dot product is 0. You can't do arithmetic on them, and stored densely as float32, each one-hot vector consumes ~200 KB per token. Embeddings compress this to ~768 dimensions where "cat" and "kitten" are close neighbors.

Representation   Dimensions   cat·kitten similarity   Memory per token
One-hot          50,257       0.0                     200 KB
Embedding        768          0.85                    3 KB
[Interactive figure: One-hot vs embedding for "cat" (token 3797). One-hot: 50,257 dims, a single 1 at position 3797, ~200 KB, similarity(cat, kitten) = 0.0, arithmetic impossible. Embedding: 768 learned, all-non-zero dims, ~3 KB, similarity(cat, kitten) ≈ 0.85, king − man + woman ≈ queen. 65× fewer dimensions · semantic similarity encoded · arithmetic possible.]
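The memory column is simple arithmetic at float32 precision (4 bytes per value); a back-of-envelope check:

```python
# Memory per token at float32 (4 bytes per value), GPT-2 sizes.
V, d = 50_257, 768
onehot_kb = V * 4 / 1024    # dense float32 one-hot: ~196 KB
embed_kb = d * 4 / 1024     # one embedding row: exactly 3 KB
print(f"one-hot: {onehot_kb:.0f} KB, embedding: {embed_kb:.0f} KB")
```

This is where the "~200 KB vs ~3 KB" figures in the table come from.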

An embedding is a lookup table. The model maintains a matrix of shape (vocab_size × d_model). Given token ID 3797 ("cat"), it grabs row 3797 — no multiplication, just an index operation. That row is the learned representation for "cat".

Weight tying is a key trick: the same embedding matrix is reused at the output layer (transposed) to predict the next token. This means the model's input space and output space share the same learned geometry. If "cat" and "kitten" are nearby in embedding space, the model can naturally predict either as a continuation.

Dimensionality ranges from 768 (GPT-2) to 4,096 (Llama-2) to even larger. Each dimension doesn't have a human-interpretable meaning — the model learns to distribute information across all dimensions in whatever way minimizes loss.

✨ Insight · Think of it this way: one-hot encoding puts each word on its own axis (50K dimensions, no similarity between any pair). Embeddings compress those 50K axes down to 768-4096 dimensions where similar words end up near each other. It's learned dimensionality reduction.

Quick check

Trade-off

A 50,257-vocab one-hot vector stored as float32 occupies ~200 KB per token. A 768-dim embedding occupies ~3 KB. Beyond memory, what is the deeper reason embeddings are preferred for learning?

Quick Check

Why does weight tying help?

📐

Step-by-Step Derivation

Embedding Lookup

Given a token ID t, the embedding is simply row t of the embedding matrix E:

e_t = E[t, :]

This is equivalent to multiplying a one-hot vector by the embedding matrix — e_t = onehot(t) · E — but the lookup is O(d) while the matmul would be O(V × d).
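A toy NumPy check of this equivalence (random matrix, reduced sizes for the demo — real models use 50,257 × 768 and larger):

```python
import numpy as np

# Random embedding matrix at toy scale: V rows, d columns.
rng = np.random.default_rng(0)
V, d = 5_000, 64
E = rng.standard_normal((V, d)).astype(np.float32)

token_id = 3797                     # "cat" in the running example
lookup = E[token_id]                # O(d): grab one row by index

onehot = np.zeros(V, dtype=np.float32)
onehot[token_id] = 1.0
matmul = onehot @ E                 # O(V*d): multiply-and-sum every row

assert np.allclose(lookup, matmul)  # identical result, very different cost
```

Frameworks implement `nn.Embedding` as the indexing form for exactly this reason.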

Weight Tying

The output logits are computed by projecting the hidden state through the transposed embedding matrix:

logits = h_L · Eᵀ, so logit_j = h_L · e_j

💡 Tip · The logit for token j is the dot product h_L · e_j — literally measuring how close the hidden state is to that token's embedding. Higher similarity = higher probability after softmax.
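A minimal PyTorch sketch (toy sizes, untrained weights) confirming that with tied weights each logit really is this dot product:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
V, d = 100, 16
embedding = nn.Embedding(V, d)
output_proj = nn.Linear(d, V, bias=False)
output_proj.weight = embedding.weight  # weight tying: same Parameter object

h = torch.randn(d)       # stand-in for the final hidden state h_L
logits = output_proj(h)  # shape (V,): one logit per vocabulary token

j = 42
# logit_j equals the dot product of h with token j's embedding row
assert torch.allclose(logits[j], h @ embedding.weight[j], atol=1e-5)
```

Because the Linear layer stores its weight as (out_features, in_features) = (V, d), assigning the embedding matrix directly gives the transposed projection for free.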

Dimensionality & Parameter Count

The embedding matrix accounts for a significant fraction of total parameters: embedding params = vocab_size × d_model.

For GPT-2: 50,257 × 768 = 38,597,376 ≈ 38.6M. With weight tying, this is shared with the output layer, saving another 38.6M.

PyTorch implementation
import torch.nn as nn

class TransformerWithTiedEmbeddings(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # Embedding lookup table: vocab_size rows, d_model columns
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Output projection reuses embedding weights (weight tying)
        self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
        self.output_proj.weight = self.embedding.weight  # TIED!

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> integers
        x = self.embedding(token_ids)  # (batch, seq_len, d_model)
        logits = self.output_proj(x)   # (batch, seq_len, vocab_size)
        return logits

# GPT-2 small: 50257 vocab, 768 dim
model = TransformerWithTiedEmbeddings(vocab_size=50257, d_model=768)
print(f"Embedding params: {50257 * 768:,}")  # 38,597,376

Quick check

Derivation

With weight tying in GPT-2, the logit for token j is computed as h_L · e_j (dot product of final hidden state with token j's embedding). What does it mean in geometric terms for the model to “predict token j with high probability”?

🔧

Break It — See What Happens

Reduce embedding dimension to 8
Random embeddings (no training)
📊

Real-World Numbers

Model           d_model   Vocab Size     Embed Params     Weight Tied?
GPT-2           768       50,257         ~38.6M           Yes
GPT-3 (175B)    12,288    50,257         ~617M            Yes
GPT-4 (est.) *  ~4096+    ~100K (est.)   ~400M+ (est.)    Likely
Llama-2 (7B)    4,096     32,000         ~131M            No
Llama-3 (8B)    4,096     128,256        ~525M            No
BERT-base       768       30,522         ~23.4M           Yes

* GPT-4 architecture details not officially confirmed by OpenAI — figures are community estimates from leaks and benchmarking.

✨ Insight · Larger vocab = fewer tokens per text = faster inference, but the embedding table becomes a bigger fraction of total params. This is a real engineering tradeoff.
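The Llama param counts in the table follow directly from params = vocab_size × d_model:

```python
# Embedding-table sizes for the two Llama generations (same d_model).
d_model = 4096
llama2 = 32_000 * d_model    # 131,072,000  ≈ 131M params
llama3 = 128_256 * d_model   # 525,336,576  ≈ 525M params
print(f"Llama-2 embed params: {llama2 / 1e6:.0f}M")
print(f"Llama-3 embed params: {llama3 / 1e6:.0f}M")
```

A 4× vocabulary expansion at fixed d_model is a 4× larger embedding table — the tradeoff the insight above describes.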

Quick check

Trade-off

Llama-3 (8B, d_model=4096) uses a 128,256-token vocab giving 525M embedding params, while Llama-2 (7B, d_model=4096) used 32,000 tokens giving 131M params. A deployment engineer asks: does the larger vocab help or hurt inference latency?

🔬

How Embeddings Are Trained: Word2Vec vs LLM Embeddings

Word2Vec (Mikolov et al., 2013) showed that useful embeddings emerge from a simple self-supervised objective: given a center word, predict its context (Skip-gram), or given context words, predict the center (CBOW). The key insight is that the embedding is a side effect of the prediction task — the network is never told what "similar" means, but words that appear in similar contexts end up with similar vectors.

LLM embeddings are trained differently. The embedding table is initialized randomly and updated end-to-end by next-token prediction loss via backpropagation. There is no separate embedding training phase — the embeddings co-evolve with attention and FFN weights across all 96+ layers. This makes LLM embeddings contextual in aggregate: although the table itself is static (one row per token), after attention layers each token's representation depends on its neighbors. The raw embedding table encodes mostly syntactic and frequency-based information; rich semantics emerge only after the first few attention layers.
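The Skip-gram objective above can be sketched in a few dozen lines. This is a minimal Skip-gram-with-negative-sampling demo on a toy corpus — illustrative of the training loop, not the reference Word2Vec implementation (corpus, window size, and hyperparameters are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 16
W_in = rng.normal(0, 0.1, (V, d))    # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(200):
    for pos, center in enumerate(corpus):
        for off in (-1, 1):                    # context window of 1
            if not 0 <= pos + off < len(corpus):
                continue
            c, o = idx[center], idx[corpus[pos + off]]
            neg = int(rng.integers(V))         # one random negative sample
            for tgt, label in ((o, 1.0), (neg, 0.0)):
                # Logistic loss on the dot product; gradient = (p - label)
                g = sigmoid(W_in[c] @ W_out[tgt]) - label
                grad_out = g * W_in[c]
                grad_in = g * W_out[tgt]
                W_out[tgt] -= lr * grad_out
                W_in[c] -= lr * grad_in

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" and "dog" share identical contexts ({the, sat}) in this corpus,
# so their learned vectors should drift toward each other.
print(f"cos(cat, dog) = {cos(W_in[idx['cat']], W_in[idx['dog']]):.2f}")
```

Note that similarity is never supervised here: it emerges because tokens with the same context distribution receive the same gradient pressure.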

Property             Word2Vec                          LLM Embeddings
Training signal      Context window prediction         Next-token prediction (full LM)
Contextuality        Static — one vector per word      Static table, contextual after attention
Granularity          Word-level                        Subword (BPE/Unigram)
Analogy arithmetic   Strong (king − man + woman ≈ queen)   Weaker in raw table, stronger in context
[Figure: Vector space for king − man + woman ≈ queen — dim 1 (gender-ish) vs dim 2 (royalty). The parallelogram shows the gender offset and royalty offset are consistent across word pairs.]
💡 Tip · The famous "king − man + woman ≈ queen" result works in Word2Vec because the static vectors directly encode semantic relationships. In GPT-style models, this arithmetic is weaker at the embedding table level — instead, the model distributes semantic reasoning across attention and FFN layers rather than compressing it all into a single lookup table. Sentence-BERT (Reimers & Gurevych, 2019) extended BERT with contrastive fine-tuning to produce sentence embeddings well-calibrated for cosine similarity — the foundation of modern embedding APIs.
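A toy illustration of the parallelogram, using hand-built 2-D vectors whose dimensions are chosen as [gender, royalty]. Real embeddings satisfy the analogy only approximately; the toy geometry makes it exact:

```python
import numpy as np

# Hypothetical 2-D vectors: dim 0 = gender, dim 1 = royalty.
vecs = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]  # [-1, 1]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - result))
print(nearest)  # → queen
```

In real analogy evaluations the query words themselves are usually excluded from the nearest-neighbor search; here the result lands exactly on "queen" so no exclusion is needed.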
🧠

Key Takeaways

What to remember for interviews

  1. Embeddings map discrete token IDs to dense vectors where semantic similarity is captured by geometric proximity — 'cat' and 'kitten' are close neighbors; one-hot vectors are not.
  2. The embedding matrix is a simple lookup table (vocab_size × d_model); given a token ID, retrieving its vector is an index operation, not a matrix multiplication.
  3. Weight tying reuses the embedding matrix as the output projection layer, saving millions of parameters and forcing a shared semantic geometry between input encoding and next-token prediction.
  4. Byte-level BPE eliminates out-of-vocabulary entirely — any word decomposes into known subword tokens, each with its own embedding row.
  5. Embedding dimension (768–4096) is a capacity-compute tradeoff determined empirically; too small underfits semantic distinctions, too large wastes compute with diminishing returns.
🧠

Recap quiz

Trade-off

GPT-2 Small uses d_model=768 for a 50,257-token vocabulary, giving ~38.6M embedding params out of 124M total. Why is 768 a better choice than 64 or 8192 for this model size?

Derivation

GPT-2 uses weight tying: the output projection is W_embed^T. If you un-tied the weights on GPT-2 Small, how many additional parameters would you add, and what geometry constraint would you lose?

Trade-off

Llama-3 expanded vocabulary from 32K to 128K tokens, increasing the embedding table from 131M to 525M params (8B model, d_model=4096). What is the concrete serving benefit, and what is the cost?

Trade-off

Word2Vec Skip-gram and LLM next-token prediction both train embeddings via self-supervision. What is the key structural difference that makes LLM embeddings contextual while Word2Vec embeddings are static?

Derivation

A one-hot vector for a token in a 50,257-vocab GPT-2 model stored as float32 consumes ~200 KB. The same token as a 768-dim float32 embedding consumes ~3 KB. Why does this ~65× compression not lose information?

Derivation

Why does BPE tokenization with a 50K-token vocabulary guarantee zero out-of-vocabulary (OOV) failures at the embedding lookup step, even for neologisms like “hallucinate” in 2015?

Trade-off

BERT-base and GPT-2 Small both use d_model=768 with ~30K and 50K vocab respectively. Why does BERT produce better sentence-level semantic similarity than GPT-2, even though both use the same embedding dimension?

📚

Further Reading

🎯

Interview Questions


Why do large language models use embedding dimensions of 768-4096 instead of, say, 64 or 16384?

★☆☆
Google · OpenAI

Explain weight tying between input embeddings and the output projection layer. Why does it help?

★★☆
Google · Meta

How do subword tokenizers like BPE handle out-of-vocabulary (OOV) words at the embedding level?

★☆☆
OpenAI · Google

How would you measure whether trained embeddings capture semantic similarity? What would you expect?

★★☆
Google · Meta

Why are learned embeddings superior to one-hot encodings for representing tokens?

★☆☆
Google · Anthropic

How do positional encodings interact with token embeddings, and why can't embeddings alone capture position?

★★☆
Google · OpenAI