
Transformer Math

Module 3 · The Transformer

📍 Positional Encoding

"cat ate fish" = "fish ate cat" without this

🎮

Interactive Sandbox

What you're seeing: Each row is one dimension of the positional encoding vector. Each column is a position in the sequence (word 0, 1, 2...). Colors show the value: blue = negative, white = zero, red = positive.

Try this: Top rows oscillate fast (high frequency = fine position info), bottom rows oscillate slowly (low frequency = coarse). Click two positions to see how similar their encodings are — nearby positions have high dot products.

Click two positions to compare their PE dot product.
Color scale: −1 (blue) · 0 (white) · +1 (red)
💡

The Intuition

Attention is permutation-equivariant — without positional information, the self-attention operation has no inherent notion of order. The representation computed for each token depends only on content, not position: "cat ate fish" and "fish ate cat" would produce the same set of output vectors, just in a different order (the outputs permute with the inputs). This is clearly unacceptable.

Diagram 1 — Without PE, attention sees a bag of tokens, not a sequence

Sentence A ("cat ate fish") → Attention → output vectors [v₀, v₁, v₂]; Sentence B ("fish ate cat") → Attention → the same output vectors, in permuted order.
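To make this concrete, here is a minimal sketch (a toy single-head attention with random projection matrices and no positional encoding; all sizes and names are made up for illustration) showing that permuting the input rows simply permutes the output rows:

# Toy check: self-attention without PE is permutation-equivariant
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # random projections

def self_attention(x):                    # x: (seq_len, d), no positional info
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(3, d)                     # "cat ate fish"
perm = torch.tensor([2, 1, 0])            # "fish ate cat"
out_a = self_attention(x)
out_b = self_attention(x[perm])
print(torch.allclose(out_b, out_a[perm], atol=1e-5))  # True: same vectors, permuted order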

Positional Encoding gives each position a unique "fingerprint", letting the model know who comes before whom. This fingerprint is added to the token embedding: x_pos = TokenEmbedding(token) + PE(pos). The heatmap above shows the PE value at each position and dimension.

Diagram 2 — Adding positional fingerprint to the token embedding

Token embedding for "cat" at position 0 (0.32, −0.71, 0.55, 0.18, ...) + sinusoidal PE(pos=0) (1.00, 0.00, 0.99, 0.01, ...) = position-aware embedding fed into the Transformer.
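In code, this addition is a single element-wise sum. A minimal sketch (made-up token ids and sizes, reusing the positional_encoding helper defined in Step 1 below):

# Sketch: position-aware input = token embedding + PE (hypothetical ids/sizes)
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512
token_emb = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[17, 42, 99]])               # e.g. "cat ate fish", shape (1, 3)
pe = positional_encoding(tokens.size(1), d_model)   # from Step 1, shape (3, 512)
x = token_emb(tokens) + pe.unsqueeze(0)             # (1, 3, 512), fed into the Transformer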

Diagram 3 — Each dimension pair uses a different frequency

PE value vs. position (0 → 20) for three dimensions: dim 0, high frequency (nearby positions differ); dim 100, mid frequency; dim 511, low frequency (far positions differ).
✨ Insight · Try clicking two different positions above — you'll see their PE vector dot product. Closer positions have larger dot products; farther positions have smaller ones. This is how the model perceives "distance".

Quick check

Recall

Without PE, “cat ate fish” and “fish ate cat” produce the same set of output vectors in permuted order. An interviewer asks: is this permutation-invariant or permutation-equivariant?

Quick Check

Without positional encoding, what would attention compute for 'cat ate fish' vs 'fish ate cat'?

📐

Step-by-Step Derivation

Step 1: The sinusoidal encoding formula

For position pos, at dimension pair (2i, 2i+1):

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

PyTorch: sinusoidal positional encoding

# Sinusoidal positional encoding
import math
import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # [seq_len, d_model]

pe = positional_encoding(128, 512)

Step 2: Why sin AND cos? — Rotation for relative position

Key insight: PE(pos+k) can be expressed as a linear transformation (a rotation matrix) of PE(pos), where the rotation angle depends only on the offset k:

[PE(pos+k, 2i), PE(pos+k, 2i+1)] = R(k·ω_i) · [PE(pos, 2i), PE(pos, 2i+1)]

where ω_i = 1 / 10000^(2i/d_model). This follows from the angle-addition identities:

sin((pos+k)·ω_i) = sin(pos·ω_i)·cos(k·ω_i) + cos(pos·ω_i)·sin(k·ω_i)
cos((pos+k)·ω_i) = cos(pos·ω_i)·cos(k·ω_i) − sin(pos·ω_i)·sin(k·ω_i)

💡 Tip · Because the rotation matrix depends only on k and not on pos, the map from PE(pos) to PE(pos+k) is purely a function of k. The model can learn relative position relationships like "3 tokens ago" regardless of the absolute position.
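Here is a quick numerical check of this rotation property (a toy d_model and a single dimension pair; the 2×2 matrix is built directly from the identities above):

# Verify PE(pos+k) = R(k·ω_i) · PE(pos) for one sin/cos pair
import math
import torch

d_model, i = 8, 1
omega = 1.0 / (10000 ** (2 * i / d_model))           # frequency of dimension pair i

def pe_pair(pos):
    return torch.tensor([math.sin(pos * omega), math.cos(pos * omega)])

pos, k = 7, 3
theta = k * omega                                    # rotation angle depends only on k
R = torch.tensor([[ math.cos(theta), math.sin(theta)],
                  [-math.sin(theta), math.cos(theta)]])
print(torch.allclose(R @ pe_pair(pos), pe_pair(pos + k), atol=1e-6))  # True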

Step 3: Multi-scale frequencies

The wavelength for dimension pair i:

λ_i = 2π · 10000^(2i/d_model)

At i = 0, the wavelength is 2π (high frequency, distinguishes adjacent positions). At i = d/2 − 1, the wavelength approaches 2π·10000 ≈ 62,832 (low frequency, encodes long-range distances). You can clearly see this frequency gradient in the heatmap.
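You can compute the wavelengths directly (assuming d_model = 512 and base 10000, as in the original paper):

# Wavelength per dimension pair: lambda_i = 2*pi * 10000^(2i/d_model)
import math
import torch

d_model = 512
i = torch.arange(0, d_model // 2).float()
wavelengths = 2 * math.pi * 10000 ** (2 * i / d_model)
print(wavelengths[0].item())    # ≈ 6.28: finest scale, distinguishes adjacent positions
print(wavelengths[-1].item())   # ≈ 60,600: coarsest scale, close to 2π·10000 ≈ 62,832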

Quick check

Derivation

The sinusoidal PE formula uses 10000 as the base: PE(pos,2i) = sin(pos / 10000^(2i/d)). At i=d/2-1, the wavelength is 2π×10000 ≈ 62,832. What design constraint does this satisfy?

PyTorch implementation
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
        self.register_buffer('pe', pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
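
A quick usage sketch (hypothetical batch and sequence sizes; continues from the class above):

pe = SinusoidalPE(d_model=512)
x = torch.randn(2, 10, 512)    # (batch, seq_len, d_model) token embeddings
out = pe(x)                    # same shape; PE for positions 0..9 added to each embedding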

PyTorch: Rotary Position Embedding (RoPE)

def apply_rotary_emb(x, freqs):
    """Apply RoPE to query or key tensors.
    x: (batch, seq_len, n_heads, d_k)
    freqs: (seq_len, d_k // 2)
    """
    # Split into pairs and rotate
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)
    # Broadcast freqs to match x: (seq_len, d/2) → (1, seq_len, 1, d/2)
    freqs_complex = freqs_complex[None, :, None, :]
    x_rotated = torch.view_as_real(x_complex * freqs_complex).flatten(-2)
    return x_rotated.type_as(x)
🔧

Break It — See What Happens

Remove Positional Encoding
📊

Real-World Numbers

Model | d_model | PE Type | Max Length | Notes
Transformer (2017) | 512 | Sinusoidal | — | The original — sin/cos with d=512
GPT-3 (2020) | 12288 | Learned | 2048 | Learned absolute PE, can't extrapolate
Llama-3.1 70B (2024) | 8192 | RoPE | 128K | Rotary PE — the modern standard
MPT-7B-StoryWriter (2023) | 4096 | ALiBi | 65K | No PE added; linear bias on attention scores
💡 Tip · Almost no modern model still uses the original sinusoidal PE. RoPE has become the de facto standard (Llama, Mistral, Qwen, DeepSeek all use RoPE). The value of sinusoidal PE lies in understanding the core idea — position information encoded through frequencies, relative position achieved through rotation. RoPE is essentially this idea applied directly to Q/K vectors.

Quick check

Trade-off

You are designing a new 7B language model and must choose between RoPE and ALiBi. Your primary requirement is 128K-token context with efficient fine-tuning for extension. Which do you choose?

🔬

RoPE — Rotary Position Embedding in Detail

Sinusoidal PE adds position information to the token embedding once, before it enters the Transformer. RoPE instead rotates the Query and Key vectors at every attention layer, encoding position directly into the dot product that determines attention weights.

For a query at position m and a key at position n, RoPE applies a position-dependent rotation before the attention dot product:

q_m = R(m·θ) W_q x_m,   k_n = R(n·θ) W_k x_n
q_m · k_n = (W_q x_m)ᵀ R((n−m)·θ) (W_k x_n)

so the attention score depends only on the relative offset n − m.

This is the key advantage: the model inherently learns relative position relationships without any special parameterization.

In practice, RoPE splits each head dimension into pairs and applies a 2D rotation with frequency θ_i = 10000^(−2i/d_head) per pair — exactly the same geometric intuition as sinusoidal PE, but applied inside the attention computation rather than at the input:

PyTorch: rotary position embedding

import torch

def precompute_freqs_cis(d_head: int, seq_len: int, base: float = 10000.0):
    freqs = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))
    t = torch.arange(seq_len)
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)

def apply_rotary_emb(xq, xk, freqs_cis):
    # xq/xk: (batch, seq_len, n_heads, d_head)
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Reshape freqs for broadcasting: (seq_len, d/2) → (1, seq_len, 1, d/2)
    freqs_cis = freqs_cis[None, :, None, :]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
✨ Insight · RoPE naturally decays attention scores with distance: as the relative distance |m − n| grows, high-frequency dimensions rotate rapidly and their contributions to the dot product average to zero. This gives RoPE a built-in locality bias without any explicit masking — distant tokens are automatically de-emphasized (Su et al. 2021).
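A toy empirical sketch of this decay (random content vectors and made-up sizes, reusing precompute_freqs_cis and apply_rotary_emb from above): rotate the same content vector to position 0 as a query and to position n as a key, and the average score shrinks as the offset n grows.

# Toy check: average q·k after RoPE decays with relative distance
import torch

d_head, seq_len, n_trials = 64, 256, 256
freqs_cis = precompute_freqs_cis(d_head, seq_len)         # helper defined above

x = torch.randn(n_trials, 1, 1, d_head).repeat(1, seq_len, 1, 1)  # same content at every position
q_rot, k_rot = apply_rotary_emb(x, x, freqs_cis)

q0 = q_rot[:, 0, 0, :]                                    # content rotated to position 0
scores = torch.einsum('bd,bnd->bn', q0, k_rot[:, :, 0, :])
mean_score = scores.mean(dim=0)
print(mean_score[0].item(), mean_score[16].item(), mean_score[128].item())
# ≈ 64 at offset 0, then noticeably smaller at larger offsets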

Quick check

Trade-off

RoPE is trained on sequences up to 4K tokens and guarantees the attention score between positions m and n depends only on (m-n). At inference on an 8K-token sequence, what can you say about token pairs with relative distance ≤ 4K?

🧠

Key Takeaways

What to remember for interviews

  1. Self-attention is permutation-equivariant — outputs permute with the inputs. Without PE, 'cat ate fish' and 'fish ate cat' produce the same set of output vectors in different order.
  2. Sinusoidal PE is absolute: each position gets a fixed fingerprint added to the input embedding before the Transformer sees it.
  3. RoPE encodes relative position by rotating Query and Key vectors at every attention layer — the dot product q·k depends only on the offset (m−n), not the absolute positions m or n.
  4. RoPE enables better length extrapolation than learned or sinusoidal PE, and is now the de facto standard (Llama, Mistral, Qwen, DeepSeek).
🧠

Recap quiz

Recall

Without positional encoding, self-attention is said to be permutation-equivariant. What does this mean precisely?

Derivation

The sinusoidal PE uses base 10000 in the divisor term. If you changed the base to 100, what would break first?

Trade-off

RoPE (Rotary PE) and sinusoidal PE both use the same frequency progression (base 10000). What is the structural advantage of RoPE that makes it better for length extrapolation?

Trade-off

ALiBi adds no positional embedding at all — instead it subtracts m * |i-j| from attention scores, where m is head-specific. Why does this approach extrapolate to longer sequences better than learned absolute PE?

Derivation

The sinusoidal PE has the property that PE(pos+k) = R(k) * PE(pos) for a rotation matrix R(k). What does this enable the Transformer to do that raw absolute positions (e.g., integer 0, 1, 2...) cannot?

Trade-off

The original Transformer paper (2017) tested learned absolute PE vs sinusoidal PE and found nearly identical results. Why do most modern models (GPT, Llama) prefer RoPE over both?

📚

Further Reading

🎯

Interview Questions


Sin/cos vs learned PE — tradeoffs?

★★☆
Google · OpenAI

How does sin/cos encode RELATIVE position? Derive it using the angle-addition identities.

★★★
OpenAI · Anthropic

If PE(pos+k) is a linear function of PE(pos), what kind of geometric transformation is it?

Why different frequencies for different dimensions?

★★★
OpenAI · Anthropic

What would happen if all dimensions used the same frequency?

Length extrapolation problem → RoPE / ALiBi

★★☆
Google · OpenAI · Anthropic · Meta · Databricks

How does RoPE (Rotary Position Embedding) differ from sinusoidal PE? Why has it become standard?

★★★
OpenAI · Anthropic

What is ALiBi (Attention with Linear Biases)? How does it compare to RoPE?

★★☆
Databricks · Anthropic

If you had to design a position encoding from scratch for very long sequences (1M+ tokens), what would you consider?

★★★
Anthropic · OpenAI