
Transformer Math

Module 3 · The Transformer

📍 Positional Encoding

"cat ate fish" = "fish ate cat" without this

🎮

Interactive Sandbox

What you're seeing: Each row is one dimension of the positional encoding vector. Each column is a position in the sequence (word 0, 1, 2...). Colors show the value: blue = negative, white = zero, red = positive.

Try this: Top rows oscillate fast (high frequency = fine position info), bottom rows oscillate slowly (low frequency = coarse). Click two positions to see how similar their encodings are — nearby positions have high dot products.

Click two positions to compare their PE dot product.
Color scale: −1 (blue) · 0 (white) · +1 (red)
💡

The Intuition

Attention is permutation-equivariant — without positional information, the self-attention operation has no inherent notion of order. The representation computed for each token depends only on content, not position: "cat ate fish" and "fish ate cat" would produce the same set of output vectors, just in a different order (the outputs permute with the inputs). This is clearly unacceptable.

Diagram 1 — Without PE, attention sees a bag of tokens, not a sequence

Sentence A ("cat ate fish") → Attention → output vectors [v₀, v₁, v₂]; Sentence B ("fish ate cat") → Attention → the same output vectors, in permuted order.
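To make this concrete, here is a minimal sketch (a toy single-head attention with random projection matrices and no positional encoding; all sizes and names are made up for illustration) showing that permuting the input rows simply permutes the output rows:

# Toy check: self-attention without PE is permutation-equivariant
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # random projections

def self_attention(x):                    # x: (seq_len, d), no positional info
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(3, d)                     # "cat ate fish"
perm = torch.tensor([2, 1, 0])            # "fish ate cat"
out_a = self_attention(x)
out_b = self_attention(x[perm])
print(torch.allclose(out_b, out_a[perm], atol=1e-5))  # True: same vectors, permuted order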

Positional Encoding gives each position a unique "fingerprint", letting the model know who comes before whom. This fingerprint is added to the token embedding: x_pos = TokenEmbedding(token) + PE(pos). The heatmap above shows the PE value at each position and dimension.

Diagram 2 — Adding positional fingerprint to the token embedding

Token embedding for "cat" at position 0 (0.32, −0.71, 0.55, 0.18, ...) + sinusoidal PE(pos=0) (1.00, 0.00, 0.99, 0.01, ...) = position-aware embedding fed into the Transformer.
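In code, this addition is a single element-wise sum. A minimal sketch (made-up token ids and sizes, reusing the positional_encoding helper defined in Step 1 below):

# Sketch: position-aware input = token embedding + PE (hypothetical ids/sizes)
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512
token_emb = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[17, 42, 99]])               # e.g. "cat ate fish", shape (1, 3)
pe = positional_encoding(tokens.size(1), d_model)   # from Step 1, shape (3, 512)
x = token_emb(tokens) + pe.unsqueeze(0)             # (1, 3, 512), fed into the Transformer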

Diagram 3 — Each dimension pair uses a different frequency

PE value vs. position (0 → 20) for three dimensions: dim 0, high frequency (nearby positions differ); dim 100, mid frequency; dim 511, low frequency (far positions differ).
✨ Insight · Try clicking two different positions above — you'll see their PE vector dot product. Closer positions have larger dot products; farther positions have smaller ones. This is how the model perceives "distance".

Quick check

Recall

Without PE, “cat ate fish” and “fish ate cat” produce the same set of output vectors in permuted order. An interviewer asks: is this permutation-invariant or permutation-equivariant?

Quick Check

Without positional encoding, what would attention compute for 'cat ate fish' vs 'fish ate cat'?

📐

Step-by-Step Derivation

Step 1: The sinusoidal encoding formula

For position pos, at dimension pair (2i, 2i+1):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

PyTorch: sinusoidal positional encoding

# Sinusoidal positional encoding
import math
import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # [seq_len, d_model]

pe = positional_encoding(128, 512)

Step 2: Why sin AND cos? — Rotation for relative position

Key insight: PE(pos+k) can be expressed as a linear transformation (a rotation matrix) of PE(pos), where the rotation angle depends only on the offset k:

[PE(pos+k, 2i), PE(pos+k, 2i+1)] = R(k·ω_i) · [PE(pos, 2i), PE(pos, 2i+1)]

where ω_i = 1 / 10000^(2i/d_model). This follows from the angle-addition identities:

sin((pos+k)·ω_i) = sin(pos·ω_i)·cos(k·ω_i) + cos(pos·ω_i)·sin(k·ω_i)
cos((pos+k)·ω_i) = cos(pos·ω_i)·cos(k·ω_i) − sin(pos·ω_i)·sin(k·ω_i)

💡 Tip · Because the rotation matrix depends only on k and not on pos, the map from PE(pos) to PE(pos+k) is purely a function of k. The model can learn relative position relationships like "3 tokens ago" regardless of the absolute position.
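Here is a quick numerical check of this rotation property (a toy d_model and a single dimension pair; the 2×2 matrix is built directly from the identities above):

# Verify PE(pos+k) = R(k·ω_i) · PE(pos) for one sin/cos pair
import math
import torch

d_model, i = 8, 1
omega = 1.0 / (10000 ** (2 * i / d_model))           # frequency of dimension pair i

def pe_pair(pos):
    return torch.tensor([math.sin(pos * omega), math.cos(pos * omega)])

pos, k = 7, 3
theta = k * omega                                    # rotation angle depends only on k
R = torch.tensor([[ math.cos(theta), math.sin(theta)],
                  [-math.sin(theta), math.cos(theta)]])
print(torch.allclose(R @ pe_pair(pos), pe_pair(pos + k), atol=1e-6))  # True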

Step 3: Multi-scale frequencies

The wavelength for dimension pair i:

λ_i = 2π · 10000^(2i/d_model)

At i = 0, the wavelength is 2π (high frequency, distinguishes adjacent positions). At i = d/2 − 1, the wavelength approaches 2π·10000 ≈ 62,832 (low frequency, encodes long-range distances). You can clearly see this frequency gradient in the heatmap.
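You can compute the wavelengths directly (assuming d_model = 512 and base 10000, as in the original paper):

# Wavelength per dimension pair: lambda_i = 2*pi * 10000^(2i/d_model)
import math
import torch

d_model = 512
i = torch.arange(0, d_model // 2).float()
wavelengths = 2 * math.pi * 10000 ** (2 * i / d_model)
print(wavelengths[0].item())    # ≈ 6.28: finest scale, distinguishes adjacent positions
print(wavelengths[-1].item())   # ≈ 60,600: coarsest scale, close to 2π·10000 ≈ 62,832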

Quick check

Derivation

The sinusoidal PE formula uses 10000 as the base: PE(pos,2i) = sin(pos / 10000^(2i/d)). At i=d/2-1, the wavelength is 2π×10000 ≈ 62,832. What design constraint does this satisfy?

PyTorch implementation
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
        self.register_buffer('pe', pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
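
A quick usage sketch (hypothetical batch and sequence sizes; continues from the class above):

pe = SinusoidalPE(d_model=512)
x = torch.randn(2, 10, 512)    # (batch, seq_len, d_model) token embeddings
out = pe(x)                    # same shape; PE for positions 0..9 added to each embedding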

PyTorch: Rotary Position Embedding (RoPE)

def apply_rotary_emb(x, freqs):
    """Apply RoPE to query or key tensors.
    x: (batch, seq_len, n_heads, d_k)
    freqs: (seq_len, d_k // 2)
    """
    # Split into pairs and rotate
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)
    # Broadcast freqs to match x: (seq_len, d/2) → (1, seq_len, 1, d/2)
    freqs_complex = freqs_complex[None, :, None, :]
    x_rotated = torch.view_as_real(x_complex * freqs_complex).flatten(-2)
    return x_rotated.type_as(x)
🔧

Break It — See What Happens

Remove Positional Encoding
📊

Real-World Numbers

Model | d_model | PE Type | Max Length | Notes
Transformer (2017) | 512 | Sinusoidal | — | The original — sin/cos with d=512
GPT-3 (2020) | 12288 | Learned | 2048 | Learned absolute PE, can't extrapolate
Llama-3.1 70B (2024) | 8192 | RoPE | 128K | Rotary PE — the modern standard
MPT-7B-StoryWriter (2023) | 4096 | ALiBi | 65K | No PE added; linear bias on attention scores
💡 Tip · Almost no modern model still uses the original sinusoidal PE. RoPE has become the de facto standard (Llama, Mistral, Qwen, DeepSeek all use RoPE). The value of sinusoidal PE lies in understanding the core idea — position information encoded through frequencies, relative position achieved through rotation. RoPE is essentially this idea applied directly to Q/K vectors.

Quick check

Trade-off

You are designing a new 7B language model and must choose between RoPE and ALiBi. Your primary requirement is 128K-token context with efficient fine-tuning for extension. Which do you choose?

🔬

RoPE — Rotary Position Embedding in Detail

Sinusoidal PE adds position information to the token embedding once, before it enters the Transformer. RoPE instead rotates the Query and Key vectors at every attention layer, encoding position directly into the dot product that determines attention weights.

For a query at position m and a key at position n, RoPE applies a position-dependent rotation before the attention dot product:

q_m = R(m·θ) W_q x_m,   k_n = R(n·θ) W_k x_n
q_m · k_n = (W_q x_m)ᵀ R((n−m)·θ) (W_k x_n)

so the attention score depends only on the relative offset n − m.

This is the key advantage: the model inherently learns relative position relationships without any special parameterization.

In practice, RoPE splits each head dimension into pairs and applies a 2D rotation with frequency θ_i = 10000^(−2i/d_head) per pair — exactly the same geometric intuition as sinusoidal PE, but applied inside the attention computation rather than at the input:

PyTorch: rotary position embedding

import torch

def precompute_freqs_cis(d_head: int, seq_len: int, base: float = 10000.0):
    freqs = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))
    t = torch.arange(seq_len)
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq/xk: (batch, seq_len, n_heads, d_head)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Reshape freqs for broadcasting: (seq_len, d/2) → (1, seq_len, 1, d/2)
    freqs_cis = freqs_cis[None, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
✨ Insight · RoPE naturally decays attention scores with distance: as the relative distance |m − n| grows, high-frequency dimensions rotate rapidly and their contributions to the dot product average to zero. This gives RoPE a built-in locality bias without any explicit masking — distant tokens are automatically de-emphasized (Su et al. 2021).
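A toy empirical sketch of this decay (random content vectors and made-up sizes, reusing precompute_freqs_cis and apply_rotary_emb from above): rotate the same content vector to position 0 as a query and to position n as a key, and the average score shrinks as the offset n grows.

# Toy check: average q·k after RoPE decays with relative distance
import torch

d_head, seq_len, n_trials = 64, 256, 256
freqs_cis = precompute_freqs_cis(d_head, seq_len)         # helper defined above

x = torch.randn(n_trials, 1, 1, d_head).repeat(1, seq_len, 1, 1)  # same content at every position
q_rot, k_rot = apply_rotary_emb(x, x, freqs_cis)

q0 = q_rot[:, 0, 0, :]                                    # content rotated to position 0
scores = torch.einsum('bd,bnd->bn', q0, k_rot[:, :, 0, :])
mean_score = scores.mean(dim=0)
print(mean_score[0].item(), mean_score[16].item(), mean_score[128].item())
# ≈ 64 at offset 0, then noticeably smaller at larger offsets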

Quick check

Trade-off

RoPE is trained on sequences up to 4K tokens and guarantees the attention score between positions m and n depends only on (m-n). At inference on an 8K-token sequence, what can you say about token pairs with relative distance ≤ 4K?

🧠

Key Takeaways

What to remember for interviews

  1. Self-attention is permutation-equivariant — outputs permute with the inputs. Without PE, 'cat ate fish' and 'fish ate cat' produce the same set of output vectors in different order.
  2. Sinusoidal PE is absolute: each position gets a fixed fingerprint added to the input embedding before the Transformer sees it.
  3. RoPE encodes relative position by rotating Query and Key vectors at every attention layer — the dot product q·k depends only on the offset (m−n), not the absolute positions m or n.
  4. RoPE enables better length extrapolation than learned or sinusoidal PE, and is now the de facto standard (Llama, Mistral, Qwen, DeepSeek).
🧠

Recap quiz

Recall

Without positional encoding, self-attention is said to be permutation-equivariant. What does this mean precisely?

Derivation

The sinusoidal PE uses base 10000 in the divisor term. If you changed the base to 100, what would break first?

Trade-off

RoPE (Rotary PE) and sinusoidal PE both use the same frequency progression (base 10000). What is the structural advantage of RoPE that makes it better for length extrapolation?

Trade-off

ALiBi adds no positional embedding at all — instead it subtracts m * |i-j| from attention scores, where m is head-specific. Why does this approach extrapolate to longer sequences better than learned absolute PE?

Derivation

The sinusoidal PE has the property that PE(pos+k) = R(k) * PE(pos) for a rotation matrix R(k). What does this enable the Transformer to do that raw absolute positions (e.g., integer 0, 1, 2...) cannot?

Trade-off

The original Transformer paper (2017) tested learned absolute PE vs sinusoidal PE and found nearly identical results. Why do most modern models (GPT, Llama) prefer RoPE over both?

📚

Further Reading

🎯

Interview Questions


Sin/cos vs learned PE — tradeoffs?

★★☆
Google · OpenAI

How does sin/cos encode RELATIVE position? Derive it using the angle-addition identities.

★★★
OpenAI · Anthropic

If PE(pos+k) is a linear function of PE(pos), what kind of geometric transformation is it?

Why different frequencies for different dimensions?

★★★
OpenAI · Anthropic

What would happen if all dimensions used the same frequency?

Length extrapolation problem → RoPE / ALiBi

★★☆
Google · OpenAI · Anthropic · Meta · Databricks

How does RoPE (Rotary Position Embedding) differ from sinusoidal PE? Why has it become standard?

★★★
OpenAI · Anthropic

What is ALiBi (Attention with Linear Biases)? How does it compare to RoPE?

★★☆
Databricks · Anthropic

If you had to design a position encoding from scratch for very long sequences (1M+ tokens), what would you consider?

★★★
Anthropic · OpenAI