
Transformer Math

Module 30 · Architectures

👁️ Vision Transformers & CLIP

Split a photo into 196 patches and a Transformer sees it as text


CNNs dominated computer vision for a decade. Vision Transformers (ViT) showed that the same transformer architecture used for language works for images too — just split the image into patches and treat them as tokens. CLIP took this further by training ViT with language supervision, enabling zero-shot visual understanding.

👁️

From Pixels to Patches to Tokens

ViT slices a 224×224 image into non-overlapping patches, projects each patch to a d_model-dimensional vector, prepends a learnable [CLS] token, adds position embeddings, and feeds the resulting sequence into a standard Transformer encoder. The final [CLS] embedding goes to a linear classification head.

What to notice: the image loses its 2D structure immediately after patching — the Transformer sees a flat sequence, just like text. Position embeddings re-inject spatial information.

[Diagram: ① 224×224 image → ② flatten + linearly project 14×14 = 196 patches (16×16 px each) to d_model → ③ 197 tokens ([CLS], P1 … P196) + position embeddings → ④ Transformer encoder (MHA + FFN, ×L layers) → ⑤ [CLS] → MLP head → "cat" 🐱]
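To make the pipeline concrete, here is a minimal PyTorch sketch of the five steps (patchify, project, add [CLS] and positions, encode, classify). It assumes ViT-B/16 hyperparameters (d_model=768, 12 heads, 12 layers) and uses PyTorch's built-in nn.TransformerEncoder in place of ViT's own block implementation; treat it as a sketch for intuition, not a reference implementation.

python
import torch
import torch.nn as nn

# Illustrative sketch; class name and layer choices are not from the ViT paper's code.
class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, d_model=768,
                 n_heads=12, n_layers=12, n_classes=1000):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2                  # 196
        # ①–②: split into patches and linearly project each one
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        # ③: learnable [CLS] token + position embeddings for 197 tokens
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        # ④: standard Transformer encoder (pre-LN, GELU, roughly ViT-B sized)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # ⑤: classification head on the final [CLS] embedding
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                  # (B, 3, 224, 224)
        x = self.patchify(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        cls = self.cls.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos          # (B, 197, 768)
        x = self.encoder(x)                                # (B, 197, 768)
        return self.head(x[:, 0])                          # logits from [CLS]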
🎮

ViT Pipeline: Image to Classification

A Vision Transformer processes an image in four steps — patches become tokens, just like words in a sentence:

[Interactive diagram: 224×224 image → 16×16 patches (196) → linear projection E ∈ ℝ^(768 × D), each patch → D-dim embedding → token sequence of 197 ([CLS], p1·E, p2·E, …) + positional embeddings E_pos → Transformer encoder (L layers) → [CLS] → MLP → class. CLIP panel: image encoder (ViT) and text encoder produce embeddings scored by cosine similarity under a contrastive loss; zero-shot uses prompts "a photo of a {class}" and picks the highest similarity.]

Quick check

Derivation

A 224×224 image uses 16×16 patches. What is the sequence length fed to the Transformer encoder (including the [CLS] token)?

💡

The Intuition

Image patches = text tokens. This is the key ViT insight. A 224×224 image split into 16×16 patches gives you 196 "visual words." Each patch is flattened and projected into an embedding — exactly like a word embedding. Positional embeddings tell the model where each patch was in the image, and a [CLS] token aggregates everything for classification (just like BERT).

CLIP: align images and text in a shared space. Instead of training a classifier on fixed labels, CLIP trains two encoders (image + text) to put matching pairs close together and non-matching pairs far apart. Given a batch of N pairs, there are N correct matches and N² − N negatives. This contrastive objective creates an embedding space where "a photo of a golden retriever" is near actual golden retriever images.

Zero-shot classification falls out naturally. To classify an image, encode it and encode text prompts for each class ("a photo of a dog," "a photo of a cat," etc.). Pick the class whose text embedding has the highest cosine similarity to the image embedding. No fine-tuning, no labeled training data for the target task.
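A minimal sketch of that inference procedure, assuming pre-trained image_encoder, text_encoder, and tokenize callables (stand-ins for whatever CLIP implementation you use):

python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Build one prompt per class, embed everything, pick the most similar class.
    prompts = [f"a photo of a {c}" for c in class_names]
    txt = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # (K, D) unit vectors
    img = F.normalize(image_encoder(image), dim=-1)              # (1, D) unit vector
    sims = img @ txt.T                                           # cosine similarities, (1, K)
    return class_names[sims.argmax().item()]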

MAE: self-supervised ViT without labels. Masked Autoencoders (He et al., 2021) pre-train ViT by masking 75% of the image patches and training the model to reconstruct the missing pixels. No captions, no class labels — just reconstruction from partial views. This works because images are spatially redundant: nearby patches are correlated, so the model must learn semantic structure to reconstruct masked regions, not just copy neighbors. MAE-pretrained ViT-H reaches 87.8% ImageNet top-1 after end-to-end fine-tuning — comparable to supervised training, at a fraction of the annotation cost. (Linear probing is much lower, around 73.5% in the same paper.) The 75% masking rate is intentionally aggressive; lower rates (25-50%) are too easy and don't force semantic understanding.
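A hedged sketch of the 75% random-masking step, following the shuffle-and-keep idea from the MAE paper (function and variable names here are illustrative):

python
import torch

def random_mask(patch_tokens, mask_ratio=0.75):
    # patch_tokens: (B, N, D) patch embeddings, no [CLS]
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))               # 49 of 196 patches at 75% masking
    noise = torch.rand(B, N, device=patch_tokens.device)
    ids_shuffle = noise.argsort(dim=1)               # a random permutation per image
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patch_tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle                      # encoder sees only the visible 25%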

✨ Insight · CNNs bake in assumptions about images (local filters, translation equivariance). ViT assumes nothing — it must learn everything from data. This makes ViT data-hungry but ultimately more powerful: with enough data, no inductive bias beats the wrong inductive bias.

Quick check

Trade-off

ViT and CNN are both fed the same 1,000 labeled images. Which model achieves higher validation accuracy and why?

Quick Check

Why does ViT need more training data than CNNs to achieve the same accuracy?

📐

Step-by-Step Derivation

ViT Patch Embedding

An image x ∈ ℝ^(H×W×C) is split into N = HW/P² non-overlapping patches of size P×P. Each patch is flattened to a vector of length P²C and projected via E ∈ ℝ^((P²C) × D):

z₀ = [x_cls; x_p¹E; x_p²E; …; x_pᴺE] + E_pos

where x_cls is a learnable classification token and E_pos ∈ ℝ^((N+1) × D) encodes spatial position. For a 224×224 image with P=16: N = (224/16)² = 196 patches.

CLIP Contrastive Loss (InfoNCE)

For a batch of N image-text pairs, maximize similarity for the N matching pairs (Iᵢ, Tᵢ) while minimizing it for the non-matching pairs. The temperature τ controls distribution sharpness:

L_{I→T} = −(1/N) Σᵢ log [ exp(sim(Iᵢ, Tᵢ)/τ) / Σⱼ exp(sim(Iᵢ, Tⱼ)/τ) ]

where sim(·, ·) is cosine similarity. The symmetric text-to-image loss L_{T→I} swaps the roles of images and texts. Total loss = (L_{I→T} + L_{T→I}) / 2.
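A compact sketch of this symmetric loss in PyTorch, assuming img_emb and txt_emb are the batch's raw encoder outputs (CLIP learns the temperature; a fixed value is used here for simplicity):

python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)                        # (N, D) unit vectors
    txt = F.normalize(txt_emb, dim=-1)                        # (N, D)
    logits = img @ txt.T / temperature                        # (N, N): sim(I_i, T_j) / τ
    targets = torch.arange(img.shape[0], device=img.device)   # diagonal = matching pairs
    loss_i2t = F.cross_entropy(logits, targets)               # image → text
    loss_t2i = F.cross_entropy(logits.T, targets)             # text → image
    return (loss_i2t + loss_t2i) / 2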

Cosine Similarity

The core distance metric in CLIP's shared embedding space. Ranges from -1 (opposite) to +1 (identical direction):

sim(u, v) = (u · v) / (‖u‖ ‖v‖)

PyTorch: ViT Patch Embedding

python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for 224/16

        # Conv2d with kernel=stride=patch_size is equivalent to
        # splitting into patches and linearly projecting each one
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, embed_dim))

    def forward(self, x):  # x: (B, 3, 224, 224)
        B = x.shape[0]
        x = self.proj(x)                  # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, D)

        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)          # (B, 197, D)
        x = x + self.pos_embed                   # add positional encoding
        return x  # (B, 197, D) — ready for transformer encoder
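A quick shape check of the module above (continues from the block, so the imports are already in scope):

python
patch_embed = PatchEmbedding()
imgs = torch.randn(2, 3, 224, 224)   # batch of 2 RGB images
tokens = patch_embed(imgs)
print(tokens.shape)                  # torch.Size([2, 197, 768])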

Quick check

Derivation

In the ViT patch embedding, the linear projection matrix E has shape P²C × D. For ViT-B/16 on RGB images (P=16, C=3, D=768), what is E's parameter count?

🔧

Break It — See What Happens

Use 32×32 patches instead of 16×16
Remove position embeddings

Quick check

Trade-off

Switching from 16×16 to 32×32 patches cuts the sequence from 196 to 49 tokens. Roughly how much does this reduce self-attention FLOPs, and what accuracy risk does it introduce?

📊

Real-World Numbers

Model | Type | Details
ViT-L/16 | Vision Transformer | pre-trained on JFT-300M
ViT-H/14 | Vision Transformer | 16 heads, 32 layers
CLIP ViT-L/14@336px | Contrastive V-L | 76.2% ImageNet zero-shot (paper's best CLIP model; base ViT-L/14 at 224px is ~75.3%)
DINOv2 ViT-g/14 | Self-supervised | 1.1B params, 86.5% ImageNet linear probe, no labels during pre-training
SigLIP | Contrastive V-L | Sigmoid loss (no softmax over batch), better at small batch sizes, used in PaLI and Gemini
✨ Insight · The evolution: ViT proved transformers work for vision (2020). CLIP proved language supervision beats labels (2021). DINOv2 proved self-supervised ViT features rival supervised ones (2023). SigLIP simplified contrastive training. Today, ViT is the default vision backbone for multimodal models.
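"Simplified contrastive training" here means SigLIP's pairwise sigmoid objective: each image-text pair in the batch becomes an independent binary decision, so no softmax over the whole batch is needed. A hedged sketch (t and b are learned scalars in the paper; fixed values roughly matching the paper's initialization are used here):

python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T * t + b                               # (N, N) pairwise logits
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]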
🧠

Key Takeaways

What to remember for interviews

  1. ViT splits an image into fixed-size patches (typically 16×16), projects each into an embedding, and feeds the sequence to a standard transformer encoder — treating patches exactly like text tokens.
  2. A learnable [CLS] token is prepended; its final representation is used for classification. Positional embeddings encode where each patch was in the original image.
  3. ViT is data-hungry because it lacks CNN inductive biases (local filters, translation equivariance) — it must learn spatial relationships from scratch.
  4. CLIP trains image and text encoders together with a contrastive loss on 400M image-text pairs, creating a shared embedding space that enables zero-shot classification without any task-specific labels.
  5. DINOv2 learns rich visual features purely from images (no text supervision) via self-distillation, excelling at dense prediction tasks like segmentation and depth estimation.
🧠

Recap quiz

Derivation

A ViT-B/16 receives a 224×224 RGB image. How many tokens does the Transformer encoder see, and why is it 197 rather than 196?

Derivation

You halve the patch size from 16×16 to 8×8 on a 224×224 image. By what factor does self-attention compute cost increase, and what accuracy tradeoff do you get?

Trade-off

ViT uses a [CLS] token for classification while ResNet uses global average pooling (GAP). What is the key functional difference, and when does [CLS] have an advantage?

Trade-off

ViT-B/16 underperforms ResNet-50 when trained only on ImageNet-1K (1.3M images) but surpasses it after pre-training on JFT-300M (300M images). What architectural property makes ViT data-hungry?

Derivation

CLIP achieves 76.2% ImageNet zero-shot without training on any ImageNet labels. What is the exact inference procedure for zero-shot classification?

Trade-off

MAE (He et al., 2021) uses a 75% patch masking rate during pre-training rather than the 15% token masking rate used in BERT. Why is the higher masking rate beneficial for images but not for text?

Trade-off

If you remove positional embeddings from ViT, what happens to the model's representation and what class of tasks is most affected?

📚

Further Reading

🎯

Interview Questions


How does ViT convert an image into tokens for a transformer?

★★☆
Google · Meta

Explain CLIP's contrastive learning objective and how it enables zero-shot classification.

★★☆
Google · OpenAI

Why does ViT need large datasets but CNNs don't?

★★☆
Google · Meta

Compare CLIP's zero-shot classification to supervised approaches. What are the tradeoffs?

★★★
Meta · OpenAI

What are the resolution/token tradeoffs in vision transformers?

★★★
Google · Anthropic

How does DINOv2 differ from CLIP, and what are its advantages for downstream tasks?

★★★
Meta · Google