👁️ Vision Transformers & CLIP
Split a photo into 196 patches and a Transformer sees it as text
CNNs dominated computer vision for a decade. Vision Transformers (ViT) showed that the same transformer architecture used for language works for images too — just split the image into patches and treat them as tokens. CLIP took this further by training ViT with language supervision, enabling zero-shot visual understanding.
From Pixels to Patches to Tokens
ViT slices a 224×224 image into non-overlapping 16×16 patches, projects each patch to a d_model-dimensional vector, prepends a learnable [CLS] token, adds position embeddings, and feeds the result into a standard Transformer encoder. The final [CLS] embedding goes to a linear classification head.
What to notice: the image loses its 2D structure immediately after patching — the Transformer sees a flat sequence, just like text. Position embeddings re-inject spatial information.
ViT Pipeline: Image to Classification
A Vision Transformer processes an image in four steps — patches become tokens, just like words in a sentence:
Quick check
A 224×224 image uses 16×16 patches. What is the sequence length fed to the Transformer encoder (including the [CLS] token)?
The Intuition
Image patches = text tokens. This is the key ViT insight. A 224×224 image split into 16×16 patches gives you 196 “visual words.” Each patch is flattened and projected into an embedding — exactly like a word embedding. Positional embeddings tell the model where each patch was in the image, and a [CLS] token aggregates everything for classification (just like BERT).
CLIP: align images and text in a shared space. Instead of training a classifier on fixed labels, CLIP trains two encoders (image + text) to put matching pairs close together and non-matching pairs far apart. Given a batch of N pairs, there are N correct matches and N² − N negatives. This contrastive objective creates an embedding space where “a photo of a golden retriever” is near actual golden retriever images.
Zero-shot classification falls out naturally. To classify an image, encode it and encode text prompts for each class (“a photo of a dog,” “a photo of a cat,” etc.). Pick the class whose text embedding has the highest cosine similarity to the image embedding. No fine-tuning, no labeled training data for the target task.
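The procedure is just normalize, dot product, argmax. A minimal sketch — the stand-in random tensors below substitute for real CLIP encoder outputs, and `zero_shot_classify` is a hypothetical helper, not a CLIP API:

```python
# Sketch of CLIP-style zero-shot classification. The embeddings here are
# random stand-ins; in practice they come from pretrained CLIP encoders.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose text-prompt embedding is most similar to the image."""
    img = F.normalize(image_emb, dim=-1)   # unit-norm image embedding (1, D)
    txt = F.normalize(text_embs, dim=-1)   # unit-norm prompt embeddings (K, D)
    sims = img @ txt.T                     # cosine similarities (1, K)
    return class_names[sims.argmax().item()]

torch.manual_seed(0)
classes = ["dog", "cat", "car"]
text_embs = torch.randn(3, 512)                          # stand-in encoded prompts
image_emb = text_embs[1:2] + 0.1 * torch.randn(1, 512)   # image "near" the cat prompt
print(zero_shot_classify(image_emb, text_embs, classes))  # → cat
```

Note that nothing here is task-specific: swapping in new class names only changes the text prompts, not the model.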
MAE: self-supervised ViT without labels. Masked Autoencoders (He et al., 2021) pre-train ViT by masking 75% of the patches and training the model to reconstruct the missing pixels. No captions, no class labels — just reconstruction from partial views. This works because images are spatially redundant: nearby patches are correlated, so the model must learn semantic structure to reconstruct masked regions, not just copy neighbors. MAE-pretrained ViT-H reaches 87.8% ImageNet top-1 after end-to-end fine-tuning — comparable to supervised training, at a fraction of the annotation cost. (Linear probing is much lower, around 73.5% in the same paper.) The 75% masking rate is intentionally aggressive; lower rates (25-50%) are too easy and don't force semantic understanding.
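The masking step can be sketched with the shuffle-and-keep trick used in the MAE paper — score every patch with random noise, sort, and keep only the lowest-scoring 25% (shapes and names below are illustrative):

```python
# MAE-style random patch masking: hide 75% of patch tokens before the encoder.
import torch

def random_masking(x, mask_ratio=0.75):
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                 # one random score per patch
    ids_shuffle = noise.argsort(dim=1)       # random permutation of patch indices
    ids_keep = ids_shuffle[:, :len_keep]     # indices of the visible patches
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                  # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return x_visible, mask

tokens = torch.randn(2, 196, 768)            # 196 patch tokens per image
visible, mask = random_masking(tokens)
print(visible.shape)    # (2, 49, 768) — only 49 of 196 patches reach the encoder
print(mask.sum(dim=1))  # 147 masked patches per image
```

A side benefit of the high masking rate: the heavy encoder only processes the 25% visible patches, which makes pre-training substantially cheaper.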
Quick check
A ViT and a CNN are both trained on the same 1,000 labeled images. Which model achieves higher validation accuracy and why?
Why does ViT need more training data than CNNs to achieve the same accuracy?
Step-by-Step Derivation
ViT Patch Embedding
An image \(x \in \mathbb{R}^{H \times W \times C}\) is split into \(N = HW/P^2\) patches of size \(P \times P\). Each patch is flattened to a vector of length \(P^2 C\) and projected via a learnable matrix \(E \in \mathbb{R}^{(P^2 C) \times D}\):

$$z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{\text{pos}}$$

where \(x_{\text{class}}\) is a learnable classification token and \(E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}\) encodes spatial position. For a 224×224 image with P=16: \(N = (224/16)^2 = 196\) patches.
CLIP Contrastive Loss (InfoNCE)
For a batch of \(N\) image-text pairs, maximize similarity for matching pairs while minimizing it for non-matching pairs. Temperature \(\tau\) controls distribution sharpness:

$$\mathcal{L}_{i \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j)/\tau)}$$

where \(\mathrm{sim}(\cdot,\cdot)\) is cosine similarity. The symmetric text-to-image loss \(\mathcal{L}_{t \to i}\) swaps the roles. Total loss = \((\mathcal{L}_{i \to t} + \mathcal{L}_{t \to i})/2\).
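With normalized embeddings, both directions of the loss reduce to cross-entropy over a similarity matrix whose diagonal holds the matching pairs. A minimal sketch, assuming pre-computed embeddings of shape (N, D):

```python
# Symmetric InfoNCE loss as used by CLIP, on pre-computed embeddings.
import torch
import torch.nn.functional as F

def clip_loss(image_embs, text_embs, temperature=0.07):
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    logits = img @ txt.T / temperature            # (N, N) cosine sims / tau
    targets = torch.arange(len(logits))           # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)   # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets) # text -> image direction
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(imgs, txts))   # scalar loss; near log(8) for random embeddings
```

Because every other example in the batch serves as a negative, larger batches give more (and harder) negatives — one reason CLIP trains with very large batch sizes.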
Cosine Similarity
The core distance metric in CLIP's shared embedding space. Ranges from -1 (opposite) to +1 (identical direction):

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$$
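A concrete check of the key property — cosine similarity ignores magnitude and measures only direction:

```python
# Cosine similarity: direction matters, magnitude doesn't.
import torch
import torch.nn.functional as F

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 4.0, 6.0])     # same direction, twice the length
w = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction

cos = lambda a, b: (a @ b) / (a.norm() * b.norm())
print(cos(u, v))  # 1.0
print(cos(u, w))  # -1.0
print(torch.allclose(cos(u, v), F.cosine_similarity(u, v, dim=0)))  # True
```

This is why CLIP L2-normalizes embeddings before the dot product: on unit vectors, the dot product *is* the cosine similarity.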
PyTorch: ViT Patch Embedding
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for 224/16
        # Conv2d with kernel=stride=patch_size is equivalent to
        # splitting into patches and linearly projecting each one
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, embed_dim))

    def forward(self, x):  # x: (B, 3, 224, 224)
        B = x.shape[0]
        x = self.proj(x)                        # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, D)
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)          # (B, 197, D)
        x = x + self.pos_embed                  # add positional encoding
        return x                                # (B, 197, D) — ready for transformer encoder
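The Conv2d-equals-patchify claim in the comment can be verified directly with `unfold` — a quick sketch on a small 2×2 grid of patches (names and sizes here are illustrative):

```python
# Verify: Conv2d(kernel=stride=P) == flatten each PxP patch, then multiply
# by the conv weights reshaped into a (D, P*P*C) projection matrix.
import torch
import torch.nn as nn

P, C, D = 16, 3, 8
conv = nn.Conv2d(C, D, kernel_size=P, stride=P, bias=False)
x = torch.randn(1, C, 2 * P, 2 * P)               # 2x2 grid -> 4 patches

# Path 1: the conv used in PatchEmbedding
out_conv = conv(x).flatten(2).transpose(1, 2)     # (1, 4, D)

# Path 2: explicit patch extraction + linear projection
patches = nn.functional.unfold(x, kernel_size=P, stride=P)  # (1, P*P*C, 4)
W = conv.weight.reshape(D, -1)                    # (D, P*P*C), same ordering as unfold
out_linear = (W @ patches).transpose(1, 2)        # (1, 4, D)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```

The conv formulation is preferred in practice only because it is a single fused op; the two paths compute identical numbers.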
Quick check
In the ViT patch embedding, the linear projection matrix E has shape P²C × D. For ViT-B/16 on RGB images (P=16, C=3, D=768), what is E's parameter count?
Break It — See What Happens
Quick check
Switching from 16×16 to 32×32 patches cuts the sequence from 196 to 49 tokens. Roughly how much does this reduce self-attention FLOPs, and what accuracy risk does it introduce?
Real-World Numbers
| Model | Type | Details |
|---|---|---|
| ViT-L/16 | Vision Transformer | 307M params, 87.76% ImageNet top-1 (pre-trained on JFT-300M) |
| ViT-H/14 | Vision Transformer | 632M params, d_model=1280, 16 heads, 32 layers |
| CLIP ViT-L/14@336px | Contrastive V-L | 76.2% zero-shot ImageNet top-1 (paper's best CLIP model; base ViT-L/14 at 224px is ~75.3%) |
| DINOv2 ViT-g/14 | Self-supervised | 1.1B params, 86.5% ImageNet linear probe, no labels during pre-training |
| SigLIP | Contrastive V-L | Sigmoid loss (no softmax over batch), better at small batch sizes, used in PaLI and Gemini |
Key Takeaways
What to remember for interviews
1. ViT splits an image into fixed-size patches (typically 16×16), projects each into an embedding, and feeds the sequence to a standard transformer encoder — treating patches exactly like text tokens.
2. A learnable [CLS] token is prepended; its final representation is used for classification. Positional embeddings encode where each patch was in the original image.
3. ViT is data-hungry because it lacks CNN inductive biases (local filters, translation equivariance) — it must learn spatial relationships from scratch.
4. CLIP trains image and text encoders together with a contrastive loss on 400M image-text pairs, creating a shared embedding space that enables zero-shot classification without any task-specific labels.
5. DINOv2 learns rich visual features purely from images (no text supervision) via self-distillation, excelling at dense prediction tasks like segmentation and depth estimation.
Recap quiz
A ViT-B/16 receives a 224×224 RGB image. How many tokens does the Transformer encoder see, and why is it 197 rather than 196?
You halve the patch size from 16×16 to 8×8 on a 224×224 image. By what factor does self-attention compute cost increase, and what accuracy tradeoff do you get?
ViT uses a [CLS] token for classification while ResNet uses global average pooling (GAP). What is the key functional difference, and when does [CLS] have an advantage?
ViT-B/16 underperforms ResNet-50 when trained only on ImageNet-1K (1.3M images) but surpasses it after pre-training on JFT-300M (300M images). What architectural property makes ViT data-hungry?
CLIP achieves 76.2% ImageNet zero-shot without training on any ImageNet labels. What is the exact inference procedure for zero-shot classification?
MAE (He et al., 2021) uses a 75% patch masking rate during pre-training rather than the 15% token masking rate used in BERT. Why is the higher masking rate beneficial for images but not for text?
If you remove positional embeddings from ViT, what happens to the model's representation and what class of tasks is most affected?
Further Reading
- An Image is Worth 16x16 Words (ViT) — Dosovitskiy et al. 2020 — the original Vision Transformer paper showing pure-attention models match CNNs at scale.
- Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al. 2021 — contrastive learning between image and text encoders enabling zero-shot classification.
- The Illustrated ViT — Jay Alammar — Visual walkthrough of patch embedding, position encoding, and the CLS token in Vision Transformers.
Interview Questions
- How does ViT convert an image into tokens for a transformer? ★★☆
- Explain CLIP's contrastive learning objective and how it enables zero-shot classification. ★★☆
- Why does ViT need large datasets but CNNs don't? ★★☆
- Compare CLIP's zero-shot classification to supervised approaches. What are the tradeoffs? ★★★
- What are the resolution/token tradeoffs in vision transformers? ★★★
- How does DINOv2 differ from CLIP, and what are its advantages for downstream tasks? ★★★