
Transformer Math

Module 2 · The Transformer

📊 Embeddings

Why is 'king' - 'man' + 'woman' = 'queen'?


The tokenizer converts text into integer IDs. But a model can't do math on bare integers — embeddings turn each token ID into a dense vector of learned numbers that encode meaning, syntax, and relationships. This is the first learnable layer in every Transformer.

📐

Embedding Space Geometry

What you're seeing: A 2D projection of a high-dimensional embedding space. Words cluster by semantic category — royalty words (king, queen, prince) land near each other, animals cluster separately, and verbs form their own region. The dashed arrows illustrate the famous analogy: king − man + woman ≈ queen.

What to try: Trace the dashed arrows — subtract the "man" vector from "king", add "woman", and the result lands near "queen". Notice the parallelogram shape: king→queen mirrors man→woman.

[Interactive figure: 2D embedding-space projection with royalty, animal, and action clusters; dashed "− man" / "+ woman" arrows trace the gender offset, showing king − man + woman ≈ queen.]
[Interactive figure: Embedding lookup — token ID → matrix row → dense vector. Token IDs (The=464, cat=3797, sat=3290, on=319, the=262, mat=2603) index rows of the embedding matrix E (vocab × d_model). Lookup is O(d) — just grab row 3797; no matrix multiply needed.]
🎮

Embedding Lookup Table

What you're seeing: Each token ID is an index into a giant matrix. The row at that index is the token's embedding vector. Click a row to highlight it.

Sentence: "The cat sat on the mat" → 6 tokens → 6 vector lookups. Real models use 768-4096 dimensions; this shows 8 for clarity.

Token           ID     Embedding Vector (d=8)
The             464    [0.12, -0.34, 0.78, 0.01, -0.56, 0.23, -0.89, 0.45]
cat (similar)   3797   [0.67, 0.15, -0.42, 0.88, 0.03, -0.71, 0.29, -0.14]
sat (similar)   3290   [0.55, 0.08, -0.39, 0.81, 0.11, -0.63, 0.34, -0.22]
on              319    [-0.23, 0.91, 0.05, -0.67, 0.44, 0.18, -0.52, 0.76]
the             262    [0.11, -0.31, 0.75, -0.02, -0.53, 0.21, -0.85, 0.42]
mat (similar)   2603   [0.61, 0.19, -0.45, 0.83, 0.07, -0.68, 0.31, -0.18]
💡 Tip · Notice "cat", "sat", and "mat" have similar vectors (high cosine similarity) because they appear in similar contexts. "on" looks very different — it's a function word, not a noun/verb.
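The tip above is easy to verify directly. A quick check using the 8-dimensional toy vectors from the lookup table:

```python
import numpy as np

# Vectors copied from the lookup table above (d=8 toy example).
cat = np.array([0.67, 0.15, -0.42, 0.88, 0.03, -0.71, 0.29, -0.14])
sat = np.array([0.55, 0.08, -0.39, 0.81, 0.11, -0.63, 0.34, -0.22])
on  = np.array([-0.23, 0.91, 0.05, -0.67, 0.44, 0.18, -0.52, 0.76])

def cosine(a, b):
    """Cosine similarity: dot product of the unit-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"cos(cat, sat) = {cosine(cat, sat):.2f}")  # high — similar contexts
print(f"cos(cat, on)  = {cosine(cat, on):.2f}")   # negative — function word
```

The noun/verb vectors come out with cosine similarity above 0.9, while "on" lands on the negative side.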
💡

The Intuition

Why not one-hot vectors? A one-hot vector for "cat" in a 50K vocabulary is a 50,000-dimensional vector with a single 1 at position 3,797 and zeros everywhere else. "Cat" and "kitten" have zero similarity — their dot product is 0. You can't do arithmetic on them, and stored densely as float32, each one-hot vector consumes ~200 KB per token. Embeddings compress this to ~768 dimensions where "cat" and "kitten" are close neighbors.

Representation   Dimensions   cat·kitten similarity   Memory per token
One-hot          50,257       0.0                     200 KB
Embedding        768          0.85                    3 KB
[Interactive figure: One-hot vs embedding for "cat" (token 3797). One-hot: 50,257 dims, a single 1 at position 3797, ~200 KB, similarity(cat, kitten) = 0.0, arithmetic impossible. Embedding: 768 learned, all-non-zero dims, ~3 KB, similarity(cat, kitten) ≈ 0.85, king − man + woman ≈ queen. 65× fewer dimensions · semantic similarity encoded · arithmetic possible.]
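The memory column is simple arithmetic at float32 precision (4 bytes per value); a back-of-envelope check:

```python
# Memory per token at float32 (4 bytes per value), GPT-2 sizes.
V, d = 50_257, 768
onehot_kb = V * 4 / 1024    # dense float32 one-hot: ~196 KB
embed_kb = d * 4 / 1024     # one embedding row: exactly 3 KB
print(f"one-hot: {onehot_kb:.0f} KB, embedding: {embed_kb:.0f} KB")
```

This is where the "~200 KB vs ~3 KB" figures in the table come from.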

An embedding is a lookup table. The model maintains a matrix of shape (vocab_size × d_model). Given token ID 3797 ("cat"), it grabs row 3797 — no multiplication, just an index operation. That row is the learned representation for "cat".

Weight tying is a key trick: the same embedding matrix is reused at the output layer (transposed) to predict the next token. This means the model's input space and output space share the same learned geometry. If "cat" and "kitten" are nearby in embedding space, the model can naturally predict either as a continuation.

Dimensionality ranges from 768 (GPT-2) to 4,096 (Llama-2) to even larger. Each dimension doesn't have a human-interpretable meaning — the model learns to distribute information across all dimensions in whatever way minimizes loss.

✨ Insight · Think of it this way: one-hot encoding puts each word on its own axis (50K dimensions, no similarity between any pair). Embeddings compress those 50K axes down to 768-4096 dimensions where similar words end up near each other. It's learned dimensionality reduction.

Quick check

Trade-off

A 50,257-vocab one-hot vector stored as float32 occupies ~200 KB per token. A 768-dim embedding occupies ~3 KB. Beyond memory, what is the deeper reason embeddings are preferred for learning?

Quick Check

Why does weight tying help?

📐

Step-by-Step Derivation

Embedding Lookup

Given a token ID t, the embedding is simply row t of the embedding matrix E:

e_t = E[t, :]

This is equivalent to multiplying a one-hot vector by the embedding matrix — e_t = onehot(t) · E — but the lookup is O(d) while the matmul would be O(V × d).
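A toy NumPy check of this equivalence (random matrix, reduced sizes for the demo — real models use 50,257 × 768 and larger):

```python
import numpy as np

# Random embedding matrix at toy scale: V rows, d columns.
rng = np.random.default_rng(0)
V, d = 5_000, 64
E = rng.standard_normal((V, d)).astype(np.float32)

token_id = 3797                     # "cat" in the running example
lookup = E[token_id]                # O(d): grab one row by index

onehot = np.zeros(V, dtype=np.float32)
onehot[token_id] = 1.0
matmul = onehot @ E                 # O(V*d): multiply-and-sum every row

assert np.allclose(lookup, matmul)  # identical result, very different cost
```

Frameworks implement `nn.Embedding` as the indexing form for exactly this reason.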

Weight Tying

The output logits are computed by projecting the hidden state through the transposed embedding matrix:

logits = h_L · Eᵀ, so logit_j = h_L · e_j

💡 Tip · The logit for token j is the dot product h_L · e_j — literally measuring how close the hidden state is to that token's embedding. Higher similarity = higher probability after softmax.
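A minimal PyTorch sketch (toy sizes, untrained weights) confirming that with tied weights each logit really is this dot product:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
V, d = 100, 16
embedding = nn.Embedding(V, d)
output_proj = nn.Linear(d, V, bias=False)
output_proj.weight = embedding.weight  # weight tying: same Parameter object

h = torch.randn(d)       # stand-in for the final hidden state h_L
logits = output_proj(h)  # shape (V,): one logit per vocabulary token

j = 42
# logit_j equals the dot product of h with token j's embedding row
assert torch.allclose(logits[j], h @ embedding.weight[j], atol=1e-5)
```

Because the Linear layer stores its weight as (out_features, in_features) = (V, d), assigning the embedding matrix directly gives the transposed projection for free.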

Dimensionality & Parameter Count

The embedding matrix accounts for a significant fraction of total parameters: embedding params = vocab_size × d_model.

For GPT-2: 50,257 × 768 = 38,597,376 ≈ 38.6M. With weight tying, this is shared with the output layer, saving another 38.6M.

PyTorch implementation
import torch.nn as nn

class TransformerWithTiedEmbeddings(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # Embedding lookup table: vocab_size rows, d_model columns
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Output projection reuses embedding weights (weight tying)
        self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
        self.output_proj.weight = self.embedding.weight  # TIED!

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> integers
        x = self.embedding(token_ids)  # (batch, seq_len, d_model)
        logits = self.output_proj(x)   # (batch, seq_len, vocab_size)
        return logits

# GPT-2 small: 50257 vocab, 768 dim
model = TransformerWithTiedEmbeddings(vocab_size=50257, d_model=768)
print(f"Embedding params: {50257 * 768:,}")  # 38,597,376

Quick check

Derivation

With weight tying in GPT-2, the logit for token j is computed as h_L · e_j (dot product of final hidden state with token j's embedding). What does it mean in geometric terms for the model to “predict token j with high probability”?

🔧

Break It — See What Happens

Reduce embedding dimension to 8
Random embeddings (no training)
📊

Real-World Numbers

Model           d_model   Vocab Size     Embed Params     Weight Tied?
GPT-2           768       50,257         ~38.6M           Yes
GPT-3 (175B)    12,288    50,257         ~617M            Yes
GPT-4 (est.) *  ~4096+    ~100K (est.)   ~400M+ (est.)    Likely
Llama-2 (7B)    4,096     32,000         ~131M            No
Llama-3 (8B)    4,096     128,256        ~525M            No
BERT-base       768       30,522         ~23.4M           Yes

* GPT-4 architecture details not officially confirmed by OpenAI — figures are community estimates from leaks and benchmarking.

✨ Insight · Larger vocab = fewer tokens per text = faster inference, but the embedding table becomes a bigger fraction of total params. This is a real engineering tradeoff.
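The Llama param counts in the table follow directly from params = vocab_size × d_model:

```python
# Embedding-table sizes for the two Llama generations (same d_model).
d_model = 4096
llama2 = 32_000 * d_model    # 131,072,000  ≈ 131M params
llama3 = 128_256 * d_model   # 525,336,576  ≈ 525M params
print(f"Llama-2 embed params: {llama2 / 1e6:.0f}M")
print(f"Llama-3 embed params: {llama3 / 1e6:.0f}M")
```

A 4× vocabulary expansion at fixed d_model is a 4× larger embedding table — the tradeoff the insight above describes.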

Quick check

Trade-off

Llama-3 (8B, d_model=4096) uses a 128,256-token vocab giving 525M embedding params, while Llama-2 (7B, d_model=4096) used 32,000 tokens giving 131M params. A deployment engineer asks: does the larger vocab help or hurt inference latency?

🔬

How Embeddings Are Trained: Word2Vec vs LLM Embeddings

Word2Vec (Mikolov et al., 2013) showed that useful embeddings emerge from a simple self-supervised objective: given a center word, predict its context (Skip-gram), or given context words, predict the center (CBOW). The key insight is that the embedding is a side effect of the prediction task — the network is never told what "similar" means, but words that appear in similar contexts end up with similar vectors.

LLM embeddings are trained differently. The embedding table is initialized randomly and updated end-to-end by next-token prediction loss via backpropagation. There is no separate embedding training phase — the embeddings co-evolve with attention and FFN weights across all 96+ layers. This makes LLM embeddings contextual in aggregate: although the table itself is static (one row per token), after attention layers each token's representation depends on its neighbors. The raw embedding table encodes mostly syntactic and frequency-based information; rich semantics emerge only after the first few attention layers.
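The Skip-gram objective above can be sketched in a few dozen lines. This is a minimal Skip-gram-with-negative-sampling demo on a toy corpus — illustrative of the training loop, not the reference Word2Vec implementation (corpus, window size, and hyperparameters are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 16
W_in = rng.normal(0, 0.1, (V, d))    # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(200):
    for pos, center in enumerate(corpus):
        for off in (-1, 1):                    # context window of 1
            if not 0 <= pos + off < len(corpus):
                continue
            c, o = idx[center], idx[corpus[pos + off]]
            neg = int(rng.integers(V))         # one random negative sample
            for tgt, label in ((o, 1.0), (neg, 0.0)):
                # Logistic loss on the dot product; gradient = (p - label)
                g = sigmoid(W_in[c] @ W_out[tgt]) - label
                grad_out = g * W_in[c]
                grad_in = g * W_out[tgt]
                W_out[tgt] -= lr * grad_out
                W_in[c] -= lr * grad_in

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" and "dog" share identical contexts ({the, sat}) in this corpus,
# so their learned vectors should drift toward each other.
print(f"cos(cat, dog) = {cos(W_in[idx['cat']], W_in[idx['dog']]):.2f}")
```

Note that similarity is never supervised here: it emerges because tokens with the same context distribution receive the same gradient pressure.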

Property             Word2Vec                          LLM Embeddings
Training signal      Context window prediction         Next-token prediction (full LM)
Contextuality        Static — one vector per word      Static table, contextual after attention
Granularity          Word-level                        Subword (BPE/Unigram)
Analogy arithmetic   Strong (king − man + woman ≈ queen)   Weaker in raw table, stronger in context
[Figure: Vector space for king − man + woman ≈ queen — dim 1 (gender-ish) vs dim 2 (royalty). The parallelogram shows the gender offset and royalty offset are consistent across word pairs.]
💡 Tip · The famous "king − man + woman ≈ queen" result works in Word2Vec because the static vectors directly encode semantic relationships. In GPT-style models, this arithmetic is weaker at the embedding table level — instead, the model distributes semantic reasoning across attention and FFN layers rather than compressing it all into a single lookup table. Sentence-BERT (Reimers & Gurevych, 2019) extended BERT with contrastive fine-tuning to produce sentence embeddings well-calibrated for cosine similarity — the foundation of modern embedding APIs.
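A toy illustration of the parallelogram, using hand-built 2-D vectors whose dimensions are chosen as [gender, royalty]. Real embeddings satisfy the analogy only approximately; the toy geometry makes it exact:

```python
import numpy as np

# Hypothetical 2-D vectors: dim 0 = gender, dim 1 = royalty.
vecs = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]  # [-1, 1]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - result))
print(nearest)  # → queen
```

In real analogy evaluations the query words themselves are usually excluded from the nearest-neighbor search; here the result lands exactly on "queen" so no exclusion is needed.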
🧠

Key Takeaways

What to remember for interviews

  1. Embeddings map discrete token IDs to dense vectors where semantic similarity is captured by geometric proximity — 'cat' and 'kitten' are close neighbors; one-hot vectors are not.
  2. The embedding matrix is a simple lookup table (vocab_size × d_model); given a token ID, retrieving its vector is an index operation, not a matrix multiplication.
  3. Weight tying reuses the embedding matrix as the output projection layer, saving millions of parameters and forcing a shared semantic geometry between input encoding and next-token prediction.
  4. Byte-level BPE eliminates out-of-vocabulary entirely — any word decomposes into known subword tokens, each with its own embedding row.
  5. Embedding dimension (768–4096) is a capacity-compute tradeoff determined empirically; too small underfits semantic distinctions, too large wastes compute with diminishing returns.
🧠

Recap quiz

Trade-off

GPT-2 Small uses d_model=768 for a 50,257-token vocabulary, giving ~38.6M embedding params out of 124M total. Why is 768 a better choice than 64 or 8192 for this model size?

Derivation

GPT-2 uses weight tying: the output projection is W_embed^T. If you un-tied the weights on GPT-2 Small, how many additional parameters would you add, and what geometry constraint would you lose?

Trade-off

Llama-3 expanded vocabulary from 32K to 128K tokens, increasing the embedding table from 131M to 525M params (8B model, d_model=4096). What is the concrete serving benefit, and what is the cost?

Trade-off

Word2Vec Skip-gram and LLM next-token prediction both train embeddings via self-supervision. What is the key structural difference that makes LLM embeddings contextual while Word2Vec embeddings are static?

Derivation

A one-hot vector for a token in a 50,257-vocab GPT-2 model stored as float32 consumes ~200 KB. The same token as a 768-dim float32 embedding consumes ~3 KB. Why does this ~65× compression not lose information?

Derivation

Why does BPE tokenization with a 50K-token vocabulary guarantee zero out-of-vocabulary (OOV) failures at the embedding lookup step, even for neologisms like “hallucinate” in 2015?

Trade-off

BERT-base and GPT-2 Small both use d_model=768 with ~30K and 50K vocab respectively. Why does BERT produce better sentence-level semantic similarity than GPT-2, even though both use the same embedding dimension?

📚

Further Reading

🎯

Interview Questions


Why do large language models use embedding dimensions of 768-4096 instead of, say, 64 or 16384?

★☆☆
Google · OpenAI

Explain weight tying between input embeddings and the output projection layer. Why does it help?

★★☆
Google · Meta

How do subword tokenizers like BPE handle out-of-vocabulary (OOV) words at the embedding level?

★☆☆
OpenAI · Google

How would you measure whether trained embeddings capture semantic similarity? What would you expect?

★★☆
Google · Meta

Why are learned embeddings superior to one-hot encodings for representing tokens?

★☆☆
Google · Anthropic

How do positional encodings interact with token embeddings, and why can't embeddings alone capture position?

★★☆
Google · OpenAI