📊 Embeddings
Why is 'king' - 'man' + 'woman' = 'queen'?
The tokenizer converts text into integer IDs. But a model can't do math on bare integers — embeddings turn each token ID into a dense vector of learned numbers that encode meaning, syntax, and relationships. This is the first learnable layer in every Transformer.
Embedding Space Geometry
What you're seeing: A 2D projection of a high-dimensional embedding space. Words cluster by semantic category — royalty words (king, queen, prince) land near each other, animals cluster separately, and verbs form their own region. The dashed arrows illustrate the famous analogy: king − man + woman ≈ queen.
What to try: Trace the dashed arrows — subtract the "man" vector from "king", add "woman", and the result lands near "queen". Notice the parallelogram shape: king→queen mirrors man→woman.
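Below is a minimal sketch of the same arithmetic with made-up 4-dimensional vectors (real models learn hundreds to thousands of dimensions; these numbers are invented purely to make the parallelogram visible):

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-d embeddings, invented so the analogy works exactly
vocab = {
    "king":  torch.tensor([0.9, 0.8, 0.1, 0.0]),
    "queen": torch.tensor([0.9, 0.1, 0.8, 0.0]),
    "man":   torch.tensor([0.1, 0.8, 0.1, 0.2]),
    "woman": torch.tensor([0.1, 0.1, 0.8, 0.2]),
    "cat":   torch.tensor([0.0, 0.1, 0.1, 0.9]),
}

target = vocab["king"] - vocab["man"] + vocab["woman"]

# Rank every word by cosine similarity to the result of the arithmetic
sims = {w: F.cosine_similarity(target, v, dim=0).item() for w, v in vocab.items()}
for w, s in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(f"{w:>6}: {s:.3f}")  # "queen" tops the list
```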
Embedding Lookup Table
What you're seeing: Each token ID is an index into a giant matrix. The row at that index is the token's embedding vector. Click a row to highlight it.
Sentence: "The cat sat on the mat" → 6 tokens → 6 vector lookups. Real models use 768-4096 dimensions; this shows 8 for clarity.
| Token | ID | Embedding Vector (d=8) |
|---|---|---|
| The | 464 | [0.12, -0.34, 0.78, 0.01, -0.56, 0.23, -0.89, 0.45] |
| cat (similar) | 3797 | [0.67, 0.15, -0.42, 0.88, 0.03, -0.71, 0.29, -0.14] |
| sat (similar) | 3290 | [0.55, 0.08, -0.39, 0.81, 0.11, -0.63, 0.34, -0.22] |
| on | 319 | [-0.23, 0.91, 0.05, -0.67, 0.44, 0.18, -0.52, 0.76] |
| the | 262 | [0.11, -0.31, 0.75, -0.02, -0.53, 0.21, -0.85, 0.42] |
| mat (similar) | 2603 | [0.61, 0.19, -0.45, 0.83, 0.07, -0.68, 0.31, -0.18] |
The Intuition
Why not one-hot vectors? A one-hot vector for "cat" in a 50K vocabulary is a 50,000-dimensional vector with a single 1 at position 3,797 and zeros everywhere else. "Cat" and "kitten" have zero similarity — their dot product is 0. You can't do meaningful arithmetic on them, and stored densely as float32 a one-hot vector consumes ~200 KB per token. Embeddings compress this to ~768 dimensions where "cat" and "kitten" are close neighbors (cosine similarity ≈ 0.85 in the table below).
| Representation | Dimensions | cat·kitten similarity | Memory per token |
|---|---|---|---|
| One-hot | 50,257 | 0.0 | 200 KB |
| Embedding | 768 | 0.85 | 3 KB |
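A runnable sketch of the contrast (kitten's token ID is hypothetical, and the "trained" table is faked by making kitten's row a noisy copy of cat's):

```python
import torch
import torch.nn.functional as F

V, d = 50_257, 768
cat_id, kitten_id = 3797, 21401  # kitten's ID is made up for illustration

# One-hot: any two distinct tokens have dot product exactly 0
cat_oh = F.one_hot(torch.tensor(cat_id), num_classes=V).float()
kit_oh = F.one_hot(torch.tensor(kitten_id), num_classes=V).float()
print(cat_oh @ kit_oh)                                   # tensor(0.)
print(f"{cat_oh.numel() * 4 / 1024:.0f} KB per token")   # ~196 KB as float32

# Dense embeddings: similarity is graded. Fake a trained table by
# making kitten's row a noisy copy of cat's row.
E = torch.randn(V, d)
E[kitten_id] = E[cat_id] + 0.3 * torch.randn(d)
print(F.cosine_similarity(E[cat_id], E[kitten_id], dim=0))  # high, ~0.96
print(f"{d * 4 / 1024:.0f} KB per token")                   # 3 KB
```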
An embedding is a lookup table. The model maintains a matrix of shape (vocab_size × d_model); for GPT-2 that is 50,257 × 768. Given token ID 3797 ("cat"), it grabs row 3797 — no multiplication, just an index operation. That row is the learned representation for "cat".
Weight tying is a key trick: the same embedding matrix is reused at the output layer (transposed) to predict the next token. This means the model's input space and output space share the same learned geometry. If "cat" and "kitten" are nearby in embedding space, the model can naturally predict either as a continuation.
Dimensionality ranges from 768 (GPT-2) to 4,096 (Llama-2) and beyond. Each dimension doesn't have a human-interpretable meaning — the model learns to distribute information across all dimensions in whatever way minimizes loss.
Quick check
A 50,257-vocab one-hot vector stored as float32 occupies ~200 KB per token. A 768-dim embedding occupies ~3 KB. Beyond memory, what is the deeper reason embeddings are preferred for learning?
Why does weight tying help?
Step-by-Step Derivation
Embedding Lookup
Given a token ID t, the embedding is simply the corresponding row of the embedding matrix E (shape V × d):

x = E[t]

This is equivalent to multiplying a one-hot vector by the embedding matrix, x = onehot(t) · E, but the lookup is O(d) while the matmul would be O(V × d).
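A quick check of that equivalence with a random stand-in matrix:

```python
import torch
import torch.nn.functional as F

V, d = 50_257, 768
E = torch.randn(V, d)   # stand-in embedding matrix, one row per token
t = 3797                # token ID for "cat"

lookup = E[t]                                                    # O(d): index the row
matmul = F.one_hot(torch.tensor(t), num_classes=V).float() @ E   # O(V*d): full matmul

print(torch.allclose(lookup, matmul))  # True: same result, very different cost
```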
Weight Tying
The output logits are computed by projecting the final hidden state h through the transposed embedding matrix:

logits = h · E^T   (one score per vocabulary entry, shape V)

So the logit for token j is h · e_j, the dot product of the hidden state with token j's embedding row.
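A tiny sketch of this projection with random stand-in weights (E and h are made up here; in a trained model E is the learned embedding table and h comes from the final layer):

```python
import torch

V, d = 50_257, 768
E = torch.randn(V, d)        # stand-in for the tied embedding matrix
h = torch.randn(d)           # stand-in final hidden state for one position

logits = h @ E.T             # shape (V,): logit j is the dot product h . e_j
j = logits.argmax().item()   # highest logit = embedding row best aligned with h
print(logits.shape, j)
```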
Dimensionality & Parameter Count
The embedding matrix accounts for a significant fraction of total parameters:

params_embed = V × d

For GPT-2: 50,257 × 768 = 38,597,376 ≈ 38.6M, roughly 31% of the 124M total. With weight tying, this matrix is shared with the output layer, saving another 38.6M parameters.
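The arithmetic as a quick sanity check (124M is the commonly cited GPT-2 Small total):

```python
V, d = 50_257, 768
embed_params = V * d                   # 38,597,376
total_params = 124_000_000             # GPT-2 Small, approximate
print(f"{embed_params:,} params = {embed_params / total_params:.0%} of the model")
# Weight tying means these same ~38.6M params also serve as the output layer.
```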
PyTorch implementation
```python
import torch.nn as nn

class TransformerWithTiedEmbeddings(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # Embedding lookup table: vocab_size rows, d_model columns
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Output projection reuses embedding weights (weight tying)
        self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
        self.output_proj.weight = self.embedding.weight  # TIED!

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> integers
        x = self.embedding(token_ids)    # (batch, seq_len, d_model)
        logits = self.output_proj(x)     # (batch, seq_len, vocab_size)
        return logits

# GPT-2 small: 50257 vocab, 768 dim
model = TransformerWithTiedEmbeddings(vocab_size=50257, d_model=768)
print(f"Embedding params: {50257 * 768:,}")  # 38,597,376
```

Quick check
With weight tying in GPT-2, the logit for token j is computed as h_L · e_j (dot product of final hidden state with token j's embedding). What does it mean in geometric terms for the model to “predict token j with high probability”?
Real-World Numbers
| Model | d_model | Vocab Size | Embed Params | Weight Tied? |
|---|---|---|---|---|
| GPT-2 | 768 | 50,257 | 38.6M | Yes |
| GPT-3 (175B) | 12,288 | 50,257 | 617M | Yes |
| GPT-4 (est.) * | ~4096+ | ~100K (est.) | ~400M+ (est.) | Likely |
| Llama-2 (7B) | 4,096 | 32,000 | 131M | No |
| Llama-3 (8B) | 4,096 | 128,256 | 525M | No |
| BERT-base | 768 | 30,522 | 23.4M | Yes |
* GPT-4 architecture details not officially confirmed by OpenAI — figures are community estimates from leaks and benchmarking.
Quick check
Llama-3 (8B, d_model=4096) uses a 128,256-token vocab giving 525M embedding params, while Llama-2 (7B, d_model=4096) used 32,000 tokens giving 131M params. A deployment engineer asks: does the larger vocab help or hurt inference latency?
How Embeddings Are Trained: Word2Vec vs LLM Embeddings
Word2Vec (Mikolov et al., 2013) showed that useful embeddings emerge from a simple self-supervised objective: given a center word, predict its context (Skip-gram), or given context words, predict the center (CBOW). The key insight is that the embedding is a side effect of the prediction task: the network is never told what "similar" means, but words that appear in similar contexts end up with similar vectors.
LLM embeddings are trained differently. The embedding table is initialized randomly and updated end-to-end by next-token prediction loss via backpropagation. There is no separate embedding training phase — the embeddings co-evolve with attention and FFN weights across all 96+ layers. This makes LLM embeddings contextual in aggregate: although the table itself is static (one row per token), after attention layers each token's representation depends on its neighbors. The raw embedding table encodes mostly syntactic and frequency-based information; rich semantics emerge only after the first few attention layers.
| Property | Word2Vec | LLM Embeddings |
|---|---|---|
| Training signal | Context window prediction | Next-token prediction (full LM) |
| Contextuality | Static — one vector per word | Static table, contextual after attention |
| Granularity | Word-level | Subword (BPE/Unigram) |
| Analogy arithmetic | Strong (king − man + woman = queen) | Weaker in raw table, stronger in context |
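For contrast with the LLM setup above, here is a minimal Skip-gram sketch; the corpus, window size, and full-softmax loss are simplifications (real Word2Vec uses negative sampling or hierarchical softmax):

```python
import torch
import torch.nn as nn

# Toy corpus and vocab (illustrative only)
corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 16

emb_in = nn.Embedding(V, d)        # the word-vector table we actually keep
out = nn.Linear(d, V, bias=False)  # context-prediction head, discarded after training
opt = torch.optim.Adam([*emb_in.parameters(), *out.parameters()], lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    for i, center in enumerate(corpus):
        # Skip-gram: the center word predicts each word in a +/-2 window
        for j in range(max(0, i - 2), min(len(corpus), i + 3)):
            if j == i:
                continue
            logits = out(emb_in(torch.tensor(idx[center])))
            loss = loss_fn(logits.unsqueeze(0), torch.tensor([idx[corpus[j]]]))
            opt.zero_grad()
            loss.backward()
            opt.step()

# emb_in.weight now holds static word vectors, learned as a side effect
print(emb_in.weight.shape)  # (V, d)
```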
Key Takeaways
What to remember for interviews
- 1Embeddings map discrete token IDs to dense vectors where semantic similarity is captured by geometric proximity — 'cat' and 'kitten' are close neighbors, one-hot vectors are not.
- 2The embedding matrix is a simple lookup table (vocab_size × d_model); given a token ID, retrieving its vector is an index operation, not a matrix multiplication.
- 3Weight tying reuses the embedding matrix as the output projection layer, saving millions of parameters and forcing a shared semantic geometry between input encoding and next-token prediction.
- 4Byte-level BPE eliminates out-of-vocabulary entirely — any word decomposes into known subword tokens, each with its own embedding row.
- 5Embedding dimension (768–4096) is a capacity-compute tradeoff determined empirically; too small underfits semantic distinctions, too large wastes compute with diminishing returns.
Recap quiz
GPT-2 Small uses d_model=768 for a 50,257-token vocabulary, giving ~38.6M embedding params out of 124M total. Why is 768 a better choice than 64 or 8192 for this model size?
GPT-2 uses weight tying: the output projection is W_embed^T. If you un-tied the weights on GPT-2 Small, how many additional parameters would you add, and what geometry constraint would you lose?
Llama-3 expanded vocabulary from 32K to 128K tokens, increasing the embedding table from 131M to 525M params (8B model, d_model=4096). What is the concrete serving benefit, and what is the cost?
Word2Vec Skip-gram and LLM next-token prediction both train embeddings via self-supervision. What is the key structural difference that makes LLM embeddings contextual while Word2Vec embeddings are static?
A one-hot vector for a token in a 50,257-vocab GPT-2 model stored as float32 consumes ~200 KB. The same token as a 768-dim float32 embedding consumes ~3 KB. Why does this 67× compression not lose information?
Why does BPE tokenization with a 50K-token vocabulary guarantee zero out-of-vocabulary (OOV) failures at the embedding lookup step, even for neologisms coined after the vocabulary was built (like the LLM-era sense of "hallucinate")?
BERT-base and GPT-2 Small both use d_model=768 with ~30K and 50K vocab respectively. Why does BERT produce better sentence-level semantic similarity than GPT-2, even though both use the same embedding dimension?
Further Reading
- Efficient Estimation of Word Representations in Vector Space (Word2Vec) — Mikolov et al. 2013 — introduced Skip-gram and CBOW, the foundation of modern word embeddings.
- GloVe: Global Vectors for Word Representation — Pennington et al. 2014 — combines count-based and predictive methods for learning word vectors.
- Using the Output Embedding to Improve Language Models (Weight Tying) — Press & Wolf 2017 — shows tying input and output embedding weights improves perplexity and saves parameters.
- The Illustrated Transformer — Jay Alammar — Visual explanation of how token embeddings are constructed and combined with positional encodings.
- LLM Visualization — Brendan Bycroft — 3D walkthrough showing exactly how the embedding matrix maps token IDs to vectors in a real GPT model.
- Andrej Karpathy — Let's Build GPT — Builds the token embedding table and positional encoding from scratch in PyTorch — best hands-on introduction to the embedding layer.
- 3Blue1Brown — But what is a GPT? Visual intro to Transformers — Visual explanation of how token embeddings work and how meaning is encoded in high-dimensional vector space — geometric intuition.
- The Illustrated BERT — Jay Alammar — how BERT reuses the Transformer encoder with bidirectional attention and masked language modeling, producing contextual embeddings that outperform static word vectors.
- Chris Olah — Deep Learning, NLP, and Representations — Olah 2014 — visual intuition for how neural networks learn representations of language, connecting word embeddings to deeper network features.
- Chris Olah — Neural Networks, Manifolds, and Topology — Olah 2014 — how neural networks warp data manifolds to make them linearly separable, foundational for understanding embedding geometry.
Interview Questions
- ★☆☆ Why do large language models use embedding dimensions of 768-4096 instead of, say, 64 or 16384?
- ★★☆ Explain weight tying between input embeddings and the output projection layer. Why does it help?
- ★☆☆ How do subword tokenizers like BPE handle out-of-vocabulary (OOV) words at the embedding level?
- ★★☆ How would you measure whether trained embeddings capture semantic similarity? What would you expect?
- ★☆☆ Why are learned embeddings superior to one-hot encodings for representing tokens?
- ★★☆ How do positional encodings interact with token embeddings, and why can't embeddings alone capture position?