📍 Positional Encoding
"cat ate fish" = "fish ate cat" without this
Interactive Sandbox
What you're seeing: Each row is one dimension of the positional encoding vector. Each column is a position in the sequence (word 0, 1, 2...). Colors show the value: blue = negative, white = zero, red = positive.
Try this: Top rows oscillate fast (high frequency = fine position info), bottom rows oscillate slowly (low frequency = coarse). Click two positions to see how similar their encodings are — nearby positions have high dot products.
The Intuition
Attention is permutation-equivariant — without positional information, the self-attention operation has no inherent notion of order. The representation computed for each token depends only on content, not position: "cat ate fish" and "fish ate cat" would produce the same set of output vectors, just in a different order (the outputs permute with the inputs). This is clearly unacceptable.
Diagram 1 — Without PE, attention sees a bag of tokens, not a sequence
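To make the equivariance concrete, here is a minimal sketch (added for illustration, not from the original text) using bare dot-product self-attention with identity projections: permuting the input rows simply permutes the output rows.

PyTorch (sketch): permutation equivariance without PE

import torch

torch.manual_seed(0)
d = 8
x = torch.randn(3, d)           # "cat ate fish" as three token embeddings
perm = torch.tensor([2, 1, 0])  # "fish ate cat"

def self_attention(x):
    # Plain dot-product self-attention, no projections, no positional info
    scores = x @ x.T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ x

out = self_attention(x)
out_perm = self_attention(x[perm])
print(torch.allclose(out[perm], out_perm))  # True: outputs permute with the inputs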
Positional Encoding gives each position a unique "fingerprint", letting the model know who comes before whom. This fingerprint is simply added to the token embedding: input_pos = embedding_pos + PE(pos). The heatmap above shows the PE value at each position and dimension.
Diagram 2 — Adding positional fingerprint to the token embedding
Diagram 3 — Each dimension pair uses a different frequency
Quick check
Without PE, “cat ate fish” and “fish ate cat” produce the same set of output vectors in permuted order. An interviewer asks: is this permutation-invariant or permutation-equivariant?
Without positional encoding, what would attention compute for 'cat ate fish' vs 'fish ate cat'?
Step-by-Step Derivation
Step 1: The sinusoidal encoding formula
For position pos, at dimensions 2i and 2i+1:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
PyTorch: sinusoidal positional encoding
# Sinusoidal positional encoding
import math
import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # [seq_len, d_model]

pe = positional_encoding(128, 512)

Step 2: Why sin AND cos? — Rotation for relative position
Key insight: PE(pos + k) can be expressed as a linear transformation (a rotation matrix) of PE(pos), where the rotation angle depends only on the offset k:

[ PE(pos+k, 2i)   ]   [  cos(k·ω_i)   sin(k·ω_i) ] [ PE(pos, 2i)   ]
[ PE(pos+k, 2i+1) ] = [ -sin(k·ω_i)   cos(k·ω_i) ] [ PE(pos, 2i+1) ]

where ω_i = 1 / 10000^(2i/d_model). This follows from the angle-addition identities:

sin(pos·ω_i + k·ω_i) = sin(pos·ω_i)·cos(k·ω_i) + cos(pos·ω_i)·sin(k·ω_i)
cos(pos·ω_i + k·ω_i) = cos(pos·ω_i)·cos(k·ω_i) - sin(pos·ω_i)·sin(k·ω_i)
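As a quick sanity check (a sketch added here, not part of the original derivation), the rotation property can be verified numerically for a single dimension pair; the variable w below stands for ω_i:

PyTorch (sketch): verifying PE(pos+k) = R(k) · PE(pos)

import math
import torch

d_model, i, k, pos = 512, 10, 7, 25
w = 1.0 / (10000 ** (2 * i / d_model))   # frequency for dimension pair i

def pe_pair(p):
    # (sin, cos) values of dimension pair i at position p
    return torch.tensor([math.sin(p * w), math.cos(p * w)])

R = torch.tensor([[math.cos(k * w),  math.sin(k * w)],
                  [-math.sin(k * w), math.cos(k * w)]])
print(torch.allclose(R @ pe_pair(pos), pe_pair(pos + k), atol=1e-6))  # True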
Step 3: Multi-scale frequencies
The wavelength for each dimension pair i:

λ_i = 2π · 10000^(2i/d_model)

At i = 0, the wavelength is 2π ≈ 6.28 (high frequency, distinguishes adjacent positions). At i = d/2 - 1, the wavelength is about 2π · 10000 ≈ 62,832 (low frequency, encodes long-range distances). You can clearly see this frequency gradient in the heatmap.
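To see the frequency gradient numerically, here is a small sketch (d_model = 512 assumed) that prints the wavelength of a few dimension pairs:

Python (sketch): wavelengths across dimension pairs

import math

d_model = 512
for i in [0, 64, 128, 192, 255]:
    wavelength = 2 * math.pi * (10000 ** (2 * i / d_model))
    print(f"pair {i:3d}: wavelength ~ {wavelength:,.1f} positions")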
Quick check
The sinusoidal PE formula uses 10000 as the base: PE(pos,2i) = sin(pos / 10000^(2i/d)). At i=d/2-1, the wavelength is 2π×10000 ≈ 62,832. What design constraint does this satisfy?
PyTorch implementation
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

PyTorch: Rotary Position Embedding (RoPE)
def apply_rotary_emb(x, freqs):
    """Apply RoPE to query or key tensors.

    x: (batch, seq_len, n_heads, d_k)
    freqs: (seq_len, d_k // 2)
    """
    # Split into pairs and rotate
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)
    # Broadcast freqs to match x: (seq_len, d/2) → (1, seq_len, 1, d/2)
    freqs_complex = freqs_complex[None, :, None, :]
    x_rotated = torch.view_as_real(x_complex * freqs_complex).flatten(-2)
    return x_rotated.type_as(x)

Break It — See What Happens
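The interactive widget is not reproduced here. As one experiment you can run yourself (a sketch using a hypothetical single_freq_pe helper), give every dimension pair the same frequency and watch distant positions collide:

PyTorch (sketch): breaking PE with a single shared frequency

import torch

def single_freq_pe(seq_len, d_model, freq=1.0):
    # Like sinusoidal PE, but every dimension pair shares one frequency
    pos = torch.arange(seq_len).float().unsqueeze(1)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

pe = single_freq_pe(64, 512)
# Positions roughly one full wavelength apart (2*pi ~ 6.28) get near-identical encodings:
print((pe[0] - pe[44]).abs().max())  # ~0.018: positions 0 and 44 are nearly indistinguishable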
Real-World Numbers
| Model | d_model | PE Type | Max Length | Notes |
|---|---|---|---|---|
| Transformer (2017) | 512 | Sinusoidal | no hard limit | The original — sin/cos with d=512 |
| GPT-3 (2020) | 12288 | Learned | 2048 | Learned absolute PE, can't extrapolate |
| Llama-3.1 70B (2024) | 8192 | RoPE | 128K | Rotary PE — the modern standard |
| MPT-7B-StoryWriter (2023) | 4096 | ALiBi | 65K | No PE — linear attention bias |
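For reference, the linear attention bias in the last row (ALiBi, also described in the recap quiz) can be sketched in a few lines. This is an illustrative sketch, not a full implementation; the per-head slopes 1/2, 1/4, ... match the geometric series the ALiBi paper uses for 8 heads.

PyTorch (sketch): ALiBi bias added to attention scores

import torch

def alibi_bias(seq_len, n_heads=8):
    # Head-specific slopes m: 1/2, 1/4, ..., 1/256 for 8 heads (simplified)
    slopes = torch.tensor([2.0 ** -(h + 1) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).clamp(max=0).float()  # -(i - j) for keys j <= query i
    return slopes[:, None, None] * dist[None, :, :]            # (n_heads, seq, seq), added to scores

bias = alibi_bias(6)
print(bias[0])  # head 0: penalty grows linearly with distance from the query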
Quick check
You are designing a new 7B language model and must choose between RoPE and ALiBi. Your primary requirement is 128K-token context with efficient fine-tuning for extension. Which do you choose?
RoPE — Rotary Position Embedding in Detail
Sinusoidal PE adds position information to the token embedding once, before it enters the Transformer. RoPE instead rotates the Query and Key vectors at every attention layer, encoding position directly into the dot product that determines attention weights.
For a query q at position m and a key k at position n, RoPE applies a rotation matrix to each before the attention dot product:

(R(m) q) · (R(n) k) = q · (R(n - m) k)

so the attention score depends only on the relative offset n - m, not on the absolute positions.
This is the key advantage: the model inherently learns relative position relationships without any special parameterization.
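A quick numerical illustration (a sketch with a toy 2-dimensional head, not the full implementation below): rotating q by its position and k by its position leaves the dot product dependent only on the offset.

PyTorch (sketch): RoPE scores depend only on the offset

import math
import torch

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]]) @ v

q = torch.tensor([0.3, -1.2])
k = torch.tensor([0.7, 0.5])
theta = 0.01                                          # per-pair frequency

s1 = rotate(q, 5 * theta) @ rotate(k, 9 * theta)      # positions (5, 9)
s2 = rotate(q, 105 * theta) @ rotate(k, 109 * theta)  # positions (105, 109), same offset 4
print(torch.allclose(s1, s2, atol=1e-6))  # True: the score depends only on n - m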
In practice, RoPE splits each head dimension into pairs and applies a 2D rotation with frequency θ_i = 10000^(-2i/d_head) per pair — exactly the same geometric intuition as sinusoidal PE, but applied inside the attention computation rather than at the input:
PyTorch: rotary position embedding
import torch

def precompute_freqs_cis(d_head: int, seq_len: int, base: float = 10000.0):
    # Per-pair frequencies theta_i = base^(-2i/d_head), returned as complex phases
    freqs = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))
    t = torch.arange(seq_len)
    freqs = torch.outer(t, freqs)  # (seq_len, d_head // 2) angles m * theta_i
    return torch.polar(torch.ones_like(freqs), freqs)

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq/xk: (batch, seq_len, n_heads, d_head)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Reshape freqs for broadcasting: (seq_len, d/2) → (1, seq_len, 1, d/2)
    freqs_cis = freqs_cis[None, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

Quick check
A model using RoPE is trained on sequences up to 4K tokens; RoPE guarantees that the attention score between positions m and n depends only on (m - n). At inference on an 8K-token sequence, what can you say about token pairs with relative distance ≤ 4K?
Key Takeaways
What to remember for interviews
1. Self-attention is permutation-equivariant — outputs permute with the inputs. Without PE, 'cat ate fish' and 'fish ate cat' produce the same set of output vectors in different order.
2. Sinusoidal PE is absolute: each position gets a fixed fingerprint added to the input embedding before the Transformer sees it.
3. RoPE encodes relative position by rotating Query and Key vectors at every attention layer — the dot product q·k depends only on the offset (m−n), not the absolute positions m or n.
4. RoPE enables better length extrapolation than learned or sinusoidal PE, and is now the de facto standard (Llama, Mistral, Qwen, DeepSeek).
Recap quiz
Without positional encoding, self-attention is said to be permutation-equivariant. What does this mean precisely?
The sinusoidal PE uses base 10000 in the divisor term. If you changed the base to 100, what would break first?
RoPE (Rotary PE) and sinusoidal PE both use the same frequency progression (base 10000). What is the structural advantage of RoPE that makes it better for length extrapolation?
ALiBi adds no positional embedding at all — instead it subtracts m * |i-j| from attention scores, where m is head-specific. Why does this approach extrapolate to longer sequences better than learned absolute PE?
The sinusoidal PE has the property that PE(pos+k) = R(k) * PE(pos) for a rotation matrix R(k). What does this enable the Transformer to do that raw absolute positions (e.g., integer 0, 1, 2...) cannot?
The original Transformer paper (2017) tested learned absolute PE vs sinusoidal PE and found nearly identical results. Why do most modern models (GPT, Llama) prefer RoPE over both?
Further Reading
- Attention Is All You Need — Vaswani et al. 2017 — introduced sinusoidal positional encodings in the original Transformer.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) — Su et al. 2021 — rotary embeddings used by LLaMA, Mistral, and most modern LLMs. Encodes relative position via rotation matrices.
- Train Short, Test Long: Attention with Linear Biases (ALiBi) — Press et al. 2022 — adds linear bias to attention scores instead of positional embeddings. Enables length extrapolation.
- The Illustrated Transformer — Jay Alammar — Visual walkthrough of sinusoidal positional encodings — shows how sine/cosine patterns encode position.
- Rotary Embeddings: A Relative Revolution (EleutherAI blog) — Intuitive introduction to RoPE with derivations — explains why rotation in complex space encodes relative position elegantly.
- YaRN: Efficient Context Window Extension of Large Language Models — Peng et al. 2023 — extends RoPE context windows without full fine-tuning via interpolation. Technique used by Llama models for context extension.
Interview Questions
Sin/cos vs learned PE — tradeoffs?
★★☆ How does sin/cos encode RELATIVE position? Derive it using the angle-addition identities.
★★★ If PE(pos+k) is a linear function of PE(pos), what kind of geometric transformation is it?
Why different frequencies for different dimensions?
★★★ What would happen if all dimensions used the same frequency?
Length extrapolation problem → RoPE / ALiBi
★★☆ How does RoPE (Rotary Position Embedding) differ from sinusoidal PE? Why has it become standard?
★★★ What is ALiBi (Attention with Linear Biases)? How does it compare to RoPE?
★★☆ If you had to design a position encoding from scratch for very long sequences (1M+ tokens), what would you consider?