👁️ Vision Transformers & CLIP
Split a photo into 196 patches and a Transformer sees it as text
CNNs dominated computer vision for a decade. Vision Transformers (ViT) showed that the same transformer architecture used for language works for images too — just split the image into patches and treat them as tokens. CLIP took this further by training ViT with language supervision, enabling zero-shot visual understanding.
From Pixels to Patches to Tokens
ViT slices a 224×224 image into non-overlapping 16×16 patches, projects each patch to a d_model-dimensional vector, prepends a learnable [CLS] token, adds position embeddings, and feeds the result into a standard Transformer encoder. The final [CLS] embedding goes to a linear classification head.
What to notice: the image loses its 2D structure immediately after patching — the Transformer sees a flat sequence, just like text. Position embeddings re-inject spatial information.
ViT Pipeline: Image to Classification
A Vision Transformer processes an image in four steps — patches become tokens, just like words in a sentence:
Quick check
A 224×224 image uses 16×16 patches. What is the sequence length fed to the Transformer encoder (including the [CLS] token)?
The Intuition
Image patches = text tokens. This is the key ViT insight. A 224×224 image split into 16×16 patches gives you 196 “visual words.” Each patch is flattened and projected into an embedding — exactly like a word embedding. Positional embeddings tell the model where each patch was in the image, and a [CLS] token aggregates everything for classification (just like BERT).
CLIP: align images and text in a shared space. Instead of training a classifier on fixed labels, CLIP trains two encoders (image + text) to put matching pairs close together and non-matching pairs far apart. Given a batch of N pairs, there are N correct matches and N² − N negatives. This contrastive objective creates an embedding space where “a photo of a golden retriever” is near actual golden retriever images.
Zero-shot classification falls out naturally. To classify an image, encode it and encode text prompts for each class (“a photo of a dog,” “a photo of a cat,” etc.). Pick the class whose text embedding has the highest cosine similarity to the image embedding. No fine-tuning, no labeled training data for the target task.
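The procedure is just normalize, dot product, argmax. A minimal sketch — the stand-in random tensors below substitute for real CLIP encoder outputs, and `zero_shot_classify` is a hypothetical helper, not a CLIP API:

```python
# Sketch of CLIP-style zero-shot classification. The embeddings here are
# random stand-ins; in practice they come from pretrained CLIP encoders.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose text-prompt embedding is most similar to the image."""
    img = F.normalize(image_emb, dim=-1)   # unit-norm image embedding (1, D)
    txt = F.normalize(text_embs, dim=-1)   # unit-norm prompt embeddings (K, D)
    sims = img @ txt.T                     # cosine similarities (1, K)
    return class_names[sims.argmax().item()]

torch.manual_seed(0)
classes = ["dog", "cat", "car"]
text_embs = torch.randn(3, 512)                          # stand-in encoded prompts
image_emb = text_embs[1:2] + 0.1 * torch.randn(1, 512)   # image "near" the cat prompt
print(zero_shot_classify(image_emb, text_embs, classes))  # → cat
```

Note that nothing here is task-specific: swapping in new class names only changes the text prompts, not the model.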
MAE: self-supervised ViT without labels. Masked Autoencoders (He et al., 2021) pre-train ViT by masking 75% of the patches and training the model to reconstruct the missing pixels. No captions, no class labels — just reconstruction from partial views. This works because images are spatially redundant: nearby patches are correlated, so the model must learn semantic structure to reconstruct masked regions, not just copy neighbors. MAE-pretrained ViT-H reaches 87.8% ImageNet top-1 after end-to-end fine-tuning — comparable to supervised training, at a fraction of the annotation cost. (Linear probing is much lower, around 73.5% in the same paper.) The 75% masking rate is intentionally aggressive; lower rates (25-50%) are too easy and don't force semantic understanding.
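The masking step can be sketched with the shuffle-and-keep trick used in the MAE paper — score every patch with random noise, sort, and keep only the lowest-scoring 25% (shapes and names below are illustrative):

```python
# MAE-style random patch masking: hide 75% of patch tokens before the encoder.
import torch

def random_masking(x, mask_ratio=0.75):
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                 # one random score per patch
    ids_shuffle = noise.argsort(dim=1)       # random permutation of patch indices
    ids_keep = ids_shuffle[:, :len_keep]     # indices of the visible patches
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                  # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return x_visible, mask

tokens = torch.randn(2, 196, 768)            # 196 patch tokens per image
visible, mask = random_masking(tokens)
print(visible.shape)    # (2, 49, 768) — only 49 of 196 patches reach the encoder
print(mask.sum(dim=1))  # 147 masked patches per image
```

A side benefit of the high masking rate: the heavy encoder only processes the 25% visible patches, which makes pre-training substantially cheaper.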
Quick check
A ViT and a CNN are both trained on the same 1,000 labeled images. Which model achieves higher validation accuracy and why?
Why does ViT need more training data than CNNs to achieve the same accuracy?
Step-by-Step Derivation
ViT Patch Embedding
An image \(x \in \mathbb{R}^{H \times W \times C}\) is split into \(N = HW/P^2\) patches of size \(P \times P\). Each patch is flattened to a vector of length \(P^2 C\) and projected via a learnable matrix \(E \in \mathbb{R}^{(P^2 C) \times D}\):

$$z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{\text{pos}}$$

where \(x_{\text{class}}\) is a learnable classification token and \(E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}\) encodes spatial position. For a 224×224 image with P=16: \(N = (224/16)^2 = 196\) patches.
CLIP Contrastive Loss (InfoNCE)
For a batch of \(N\) image-text pairs, maximize similarity for matching pairs while minimizing it for non-matching pairs. Temperature \(\tau\) controls distribution sharpness:

$$\mathcal{L}_{i \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(I_i, T_j)/\tau)}$$

where \(\mathrm{sim}(\cdot,\cdot)\) is cosine similarity. The symmetric text-to-image loss \(\mathcal{L}_{t \to i}\) swaps the roles. Total loss = \((\mathcal{L}_{i \to t} + \mathcal{L}_{t \to i})/2\).
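With normalized embeddings, both directions of the loss reduce to cross-entropy over a similarity matrix whose diagonal holds the matching pairs. A minimal sketch, assuming pre-computed embeddings of shape (N, D):

```python
# Symmetric InfoNCE loss as used by CLIP, on pre-computed embeddings.
import torch
import torch.nn.functional as F

def clip_loss(image_embs, text_embs, temperature=0.07):
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    logits = img @ txt.T / temperature            # (N, N) cosine sims / tau
    targets = torch.arange(len(logits))           # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)   # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets) # text -> image direction
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(imgs, txts))   # scalar loss; near log(8) for random embeddings
```

Because every other example in the batch serves as a negative, larger batches give more (and harder) negatives — one reason CLIP trains with very large batch sizes.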
Cosine Similarity
The core distance metric in CLIP's shared embedding space. Ranges from -1 (opposite) to +1 (identical direction):

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$$
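A concrete check of the key property — cosine similarity ignores magnitude and measures only direction:

```python
# Cosine similarity: direction matters, magnitude doesn't.
import torch
import torch.nn.functional as F

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 4.0, 6.0])     # same direction, twice the length
w = torch.tensor([-1.0, -2.0, -3.0])  # opposite direction

cos = lambda a, b: (a @ b) / (a.norm() * b.norm())
print(cos(u, v))  # 1.0
print(cos(u, w))  # -1.0
print(torch.allclose(cos(u, v), F.cosine_similarity(u, v, dim=0)))  # True
```

This is why CLIP L2-normalizes embeddings before the dot product: on unit vectors, the dot product *is* the cosine similarity.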
PyTorch: ViT Patch Embedding
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for 224/16
        # Conv2d with kernel=stride=patch_size is equivalent to
        # splitting into patches and linearly projecting each one
        self.proj = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, embed_dim))

    def forward(self, x):  # x: (B, 3, 224, 224)
        B = x.shape[0]
        x = self.proj(x)                        # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, D)
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1)          # (B, 197, D)
        x = x + self.pos_embed                  # add positional encoding
        return x                                # (B, 197, D) — ready for transformer encoder
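The Conv2d-equals-patchify claim in the comment can be verified directly with `unfold` — a quick sketch on a small 2×2 grid of patches (names and sizes here are illustrative):

```python
# Verify: Conv2d(kernel=stride=P) == flatten each PxP patch, then multiply
# by the conv weights reshaped into a (D, P*P*C) projection matrix.
import torch
import torch.nn as nn

P, C, D = 16, 3, 8
conv = nn.Conv2d(C, D, kernel_size=P, stride=P, bias=False)
x = torch.randn(1, C, 2 * P, 2 * P)               # 2x2 grid -> 4 patches

# Path 1: the conv used in PatchEmbedding
out_conv = conv(x).flatten(2).transpose(1, 2)     # (1, 4, D)

# Path 2: explicit patch extraction + linear projection
patches = nn.functional.unfold(x, kernel_size=P, stride=P)  # (1, P*P*C, 4)
W = conv.weight.reshape(D, -1)                    # (D, P*P*C), same ordering as unfold
out_linear = (W @ patches).transpose(1, 2)        # (1, 4, D)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```

The conv formulation is preferred in practice only because it is a single fused op; the two paths compute identical numbers.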
Quick check
In the ViT patch embedding, the linear projection matrix E has shape P²C × D. For ViT-B/16 on RGB images (P=16, C=3, D=768), what is E's parameter count?
Break It — See What Happens
Quick check
Switching from 16×16 to 32×32 patches cuts the sequence from 196 to 49 tokens. Roughly how much does this reduce self-attention FLOPs, and what accuracy risk does it introduce?
Real-World Numbers
| Model | Type | Details |
|---|---|---|
| ViT-L/16 | Vision Transformer | 307M params, 87.76% ImageNet top-1 (pre-trained on JFT-300M) |
| ViT-H/14 | Vision Transformer | 632M params, d_model=1280, 16 heads, 32 layers |
| CLIP ViT-L/14@336px | Contrastive V-L | 76.2% zero-shot ImageNet top-1 (paper's best CLIP model; base ViT-L/14 at 224px is ~75.3%) |
| DINOv2 ViT-g/14 | Self-supervised | 1.1B params, 86.5% ImageNet linear probe, no labels during pre-training |
| SigLIP | Contrastive V-L | Sigmoid loss (no softmax over batch), better at small batch sizes, used in PaLI and Gemini |
Key Takeaways
What to remember for interviews
1. ViT splits an image into fixed-size patches (typically 16×16), projects each into an embedding, and feeds the sequence to a standard transformer encoder — treating patches exactly like text tokens.
2. A learnable [CLS] token is prepended; its final representation is used for classification. Positional embeddings encode where each patch was in the original image.
3. ViT is data-hungry because it lacks CNN inductive biases (local filters, translation equivariance) — it must learn spatial relationships from scratch.
4. CLIP trains image and text encoders together with a contrastive loss on 400M image-text pairs, creating a shared embedding space that enables zero-shot classification without any task-specific labels.
5. DINOv2 learns rich visual features purely from images (no text supervision) via self-distillation, excelling at dense prediction tasks like segmentation and depth estimation.
Recap quiz
A ViT-B/16 receives a 224×224 RGB image. How many tokens does the Transformer encoder see, and why is it 197 rather than 196?
You halve the patch size from 16×16 to 8×8 on a 224×224 image. By what factor does self-attention compute cost increase, and what accuracy tradeoff do you get?
ViT uses a [CLS] token for classification while ResNet uses global average pooling (GAP). What is the key functional difference, and when does [CLS] have an advantage?
ViT-B/16 underperforms ResNet-50 when trained only on ImageNet-1K (1.3M images) but surpasses it after pre-training on JFT-300M (300M images). What architectural property makes ViT data-hungry?
CLIP achieves 76.2% ImageNet zero-shot without training on any ImageNet labels. What is the exact inference procedure for zero-shot classification?
MAE (He et al., 2021) uses a 75% patch masking rate during pre-training rather than the 15% token masking rate used in BERT. Why is the higher masking rate beneficial for images but not for text?
If you remove positional embeddings from ViT, what happens to the model's representation and what class of tasks is most affected?
Further Reading
- An Image is Worth 16x16 Words (ViT) — Dosovitskiy et al. 2020 — the original Vision Transformer paper showing pure-attention models match CNNs at scale.
- Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al. 2021 — contrastive learning between image and text encoders enabling zero-shot classification.
- The Illustrated ViT — Jay Alammar — Visual walkthrough of patch embedding, position encoding, and the CLS token in Vision Transformers.
Interview Questions
- How does ViT convert an image into tokens for a transformer? ★★☆
- Explain CLIP's contrastive learning objective and how it enables zero-shot classification. ★★☆
- Why does ViT need large datasets but CNNs don't? ★★☆
- Compare CLIP's zero-shot classification to supervised approaches. What are the tradeoffs? ★★★
- What are the resolution/token tradeoffs in vision transformers? ★★★
- How does DINOv2 differ from CLIP, and what are its advantages for downstream tasks? ★★★