🖼️ Multimodal LLMs
GPT-4V sees your image as 85 extra tokens in the prompt
Vision Transformers encode images into tokens. Multimodal LLMs take the next step: feed those visual tokens into a language model alongside text, enabling models like GPT-4V, Claude, and Gemini to see, reason about, and describe images. The architecture is surprisingly simple — a vision encoder, a projection layer, and an LLM.
How Multimodal Models See
Early fusion architecture — used by LLaVA and believed to be employed by GPT-4V and Gemini (closed-source; architecture not publicly confirmed). An image and a text prompt are encoded into separate token streams, then concatenated and fed to a single LLM decoder.
The projection layer is the critical bridge: it maps vision encoder outputs (e.g., 1024-dim CLIP space) into the LLM's embedding dimension (e.g., 4096-dim), so the LLM can treat image patches as ordinary tokens.
Multimodal LLM Pipeline
What you're seeing: the four-stage pixel-to-token pipeline — vision encoder (CLIP) extracts patch features, a projection layer maps them into the LLM's embedding space, they are concatenated with text tokens, and the decoder generates output. What to try: trace the image patch count from encoder to projection — that number is the visual token budget the LLM must attend over, which is why Q-Former and token compression exist.
The Intuition
The core idea is remarkably simple. A vision encoder (ViT/CLIP) converts an image into a sequence of embeddings. A projection layer maps those embeddings into the LLM's token space. Then you just concatenate visual tokens with text tokens and let the LLM's self-attention handle cross-modal reasoning. The LLM doesn't need to “know” it's looking at an image — it just sees more tokens.
LLaVA showed this works with a linear layer. Connect a frozen CLIP vision encoder to Vicuna with a single linear projection. Stage 1: train only the projection on 595K image-text pairs (feature alignment). Stage 2: fine-tune the projection + LLM on 158K visual instruction examples (visual instruction tuning). That's it — surprisingly competitive with far more complex architectures.
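A minimal sketch of which parameters each stage updates (assuming a LLaVA-style module with vision_encoder, visual_proj, and llm attributes, like the MultimodalLLM class further down this page):

# Hypothetical sketch of LLaVA's two training stages: only the set of
# trainable parameters changes between them.
def set_trainable(model, stage: int):
    # Vision encoder stays frozen in both stages
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # Projection layer is trained in both stages
    for p in model.visual_proj.parameters():
        p.requires_grad = True
    # LLM is frozen in Stage 1 (feature alignment),
    # unfrozen in Stage 2 (visual instruction tuning)
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)

# Stage 1: align visual features with the LLM embedding space
# set_trainable(model, stage=1)   # ~595K image-text pairs
# Stage 2: visual instruction tuning
# set_trainable(model, stage=2)   # ~158K instruction examples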
Cross-attention vs early fusion. Flamingo uses cross-attention: the text model attends to visual features via separate cross-attention layers (Q from text, K/V from vision). This is parameter-efficient and allows caching visual features. LLaVA uses early fusion (concatenate all tokens and let self-attention mix them freely); GPT-4V/Claude/Gemini are widely assumed to follow a similar early-fusion-style design, though their exact internals are not publicly disclosed. Early fusion is more expensive but allows richer cross-modal interaction at every layer.
Resolution is the key tradeoff. More pixels means more patches means more tokens means more context consumed. A 1024×1024 image at 16×16 patches produces 4096 visual tokens — a huge chunk of context window. Production models use tiling (split into crops), dynamic resolution, and token compression to manage this.
Q-Former: compressing visual tokens. LLaVA's linear projection passes all patch tokens to the LLM. BLIP-2 (Li et al., 2023) introduced the Query Transformer (Q-Former): a small transformer with a fixed set of learned query tokens (e.g., 32) that attend to the visual features via cross-attention and produce a fixed-length output regardless of image resolution. This decouples the LLM from the vision encoder — the LLM always sees exactly 32 tokens, never 196. The tradeoff: Q-Former loses fine-grained spatial detail (bad for OCR, counting), but dramatically reduces context consumption. Flamingo's perceiver resampler uses the same idea, compressing any image into 64 tokens. The field has largely moved away from Q-Former toward tile-based high-res encoding because fine-grained detail matters more for real tasks than token count.
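A minimal sketch of the learned-query idea (not BLIP-2's full Q-Former, which adds self-attention blocks and a text-interaction branch): a fixed set of query vectors cross-attends to however many patch features the encoder produces, so the output length stays constant.

import torch
import torch.nn as nn

class LearnedQueryResampler(nn.Module):
    """Compress a variable number of patch features into a fixed number of tokens."""
    def __init__(self, num_queries=32, vision_dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, patch_features):                      # (B, N_patches, Dv), N can vary
        B = patch_features.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)     # (B, 32, Dv)
        out, _ = self.cross_attn(q, patch_features, patch_features)
        return out                                           # (B, 32, Dv) regardless of N_patches

# e.g. 4096 patch tokens from a 1024x1024 image -> always 32 output tokens
# resampler = LearnedQueryResampler()
# compressed = resampler(torch.randn(2, 4096, 1024))        # (2, 32, 1024)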
Quick check
LLaVA skips Stage 2 (visual instruction tuning) and only does Stage 1 alignment on 595K pairs. What breaks?
What is the key difference between cross-attention and early fusion in multimodal LLMs?
Step-by-Step Derivation
Cross-Attention for Vision-Language
In cross-attention architectures (e.g., Flamingo), text queries attend to visual keys and values. The text representation is enriched with visual information while keeping the streams separate:
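With text hidden states $X_t$ and visual features $X_v$ (queries from text, keys/values from vision):

$$\mathrm{CrossAttn}(X_t, X_v) = \mathrm{softmax}\!\left(\frac{(X_t W_Q)\,(X_v W_K)^\top}{\sqrt{d_k}}\right)(X_v W_V)$$

Each text position can pull in visual information, but the visual tokens never enter the decoder's own sequence.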
Visual Token Count
The number of visual tokens depends on image resolution and patch size. For tiled high-resolution images:
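Assuming each tile is encoded independently with patch size $P$, plus an optional downscaled overview of the whole image:

$$N_{\text{visual}} = \sum_{i=1}^{n_{\text{tiles}}} \frac{H_i}{P} \cdot \frac{W_i}{P} \;+\; \frac{H_{\text{ov}}}{P} \cdot \frac{W_{\text{ov}}}{P}$$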
Example: a 1024×1024 image tiled into 4 crops of 512×512, with P=16: each crop yields (512/16)² = 1024 tokens and the 224×224 overview yields 196, so the total is 4 × 1024 + 196 = 4292 visual tokens (4 tile encodings + 1 overview at 224×224).
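The same arithmetic as a small helper (a sketch; tile and overview sizes are the ones from the example above):

def visual_token_count(tile_sizes, patch=16, overview=(224, 224)):
    """Total visual tokens for a tiled image plus an optional low-res overview."""
    tokens = sum((h // patch) * (w // patch) for h, w in tile_sizes)
    if overview is not None:
        tokens += (overview[0] // patch) * (overview[1] // patch)
    return tokens

# 1024x1024 image split into 4 crops of 512x512, plus a 224x224 overview:
print(visual_token_count([(512, 512)] * 4))   # 4 * 1024 + 196 = 4292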
Visual Projection (LLaVA-style)
Map vision encoder outputs to the LLM embedding space via a learned projection:
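For patch features $Z_v \in \mathbb{R}^{N \times d_v}$ (e.g., $d_v = 1024$) and LLM embedding width $d_t$ (e.g., $d_t = 4096$):

$$T_v = Z_v W_{\text{proj}} + b, \qquad W_{\text{proj}} \in \mathbb{R}^{d_v \times d_t}, \quad T_v \in \mathbb{R}^{N \times d_t}$$

The projected rows of $T_v$ are then treated exactly like text token embeddings.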
LLaVA uses a single linear layer. More complex adapters (Q-Former in BLIP-2, perceiver resampler in Flamingo) compress variable-length visual features into a fixed number of tokens.
PyTorch: Multimodal Forward Pass (LLaVA-style)
import torch
import torch.nn as nn

class MultimodalLLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # frozen CLIP ViT
        self.llm = llm                                    # pre-trained LLM
        self.visual_proj = nn.Linear(vision_dim, llm_dim) # the bridge

    def forward(self, image, text_input_ids, text_embeds):
        # Step 1: Encode image into visual features (encoder stays frozen)
        with torch.no_grad():
            visual_features = self.vision_encoder(image)   # (B, 576, Dv)

        # Step 2: Project visual features into LLM embedding space
        visual_tokens = self.visual_proj(visual_features)   # (B, 576, Dt)

        # Step 3: Interleave with text embeddings
        # text_embeds: (B, T, Dt) from the LLM's embedding layer
        # (text_input_ids would be used to build training labels; unused in this forward pass)
        multimodal_input = torch.cat([
            visual_tokens,   # visual tokens first
            text_embeds      # then text tokens
        ], dim=1)            # (B, 576 + T, Dt)

        # Step 4: LLM processes everything via self-attention
        output = self.llm(inputs_embeds=multimodal_input)
        return output        # next-token predictions over full sequence

PyTorch implementation
# CLIP-style contrastive loss (InfoNCE)
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPContrastiveLoss(nn.Module):
    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = nn.Parameter(torch.tensor(temperature))

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor):
        # Normalize embeddings to unit sphere
        img = F.normalize(image_features, dim=-1)   # (B, D)
        txt = F.normalize(text_features, dim=-1)    # (B, D)

        # Cosine similarity matrix, scaled by temperature
        logits = (img @ txt.T) / self.temperature   # (B, B)

        # Symmetric InfoNCE: diagonal = positive pairs
        labels = torch.arange(len(logits), device=logits.device)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2

Quick check
A 1024×1024 image at 16×16 patches = 4096 tokens. You want to keep it under 500 tokens using tiling. What is the minimum number of 512×512 tiles needed, and how many tokens does each tile produce?
Break It — See What Happens
Quick check
You remove the vision encoder entirely and instead pass raw pixel values as floating-point tokens to the LLM. What is the main failure mode?
Real-World Numbers
| Model | Architecture | Details |
|---|---|---|
| GPT-4V | Likely early fusion (unconfirmed) | OpenAI docs: low-detail mode uses a 512×512 representation (~85 tokens); high-detail mode adds ~170 tokens per extra 512×512 tile on top of the low-detail overview. |
| Claude Vision | Likely early fusion (unconfirmed) | Up to ~1500 visual tokens, strong on documents, charts, and structured visual content |
| Gemini | Natively multimodal | Trained multimodal from scratch (not bolted-on), handles image/audio/video natively |
| LLaVA-1.5 | Early fusion | CLIP ViT-L/14 at 336px + Vicuna-13B, 576 visual tokens per image, open-source |
| Flamingo | Cross-attention | Perceiver resampler compresses each image to 64 tokens, few-shot learning pioneer |
Quick check
GPT-4V low-detail mode uses ~85 tokens for any image. High-detail mode adds ~170 tokens per 512×512 tile. For a 1024×1024 image in high-detail mode, what is the total token count?
Native Multimodal Frontier (2024–2025)
Through 2023, multimodal models were LLMs with vision adapters bolted on. 2024–2025 saw a shift to native multimodality: text, images, audio, and video processed end-to-end in the same token stream from pre-training, not fine-tuning.
GPT-4o (May 2024) — first natively multimodal frontier model
GPT-4V was widely understood to be text-only GPT-4 with a bolted-on vision encoder. GPT-4o was trained end-to-end on text, audio, and image tokens simultaneously. Key architectural consequence: the model can reason across modalities in a single forward pass — no intermediate text-only bottleneck. Voice response latency dropped from multiple seconds in the old cascaded speech pipeline (~2.8–5.4s) to ~320ms on average, per OpenAI's announcement and subsequent community measurements.
Source: OpenAI, “Hello GPT-4o”, May 2024
Deep dive: Llama 4 early fusion (Apr 2025)
Llama 4 (Scout and Maverick) uses early fusion: image patch tokens are interleaved with text tokens from the very first transformer layer. This contrasts with the adapter-style approach used in most prior open models, where a separately pre-trained vision encoder is bolted onto an existing LLM through a projection layer and visual tokens are only injected at the input of the already-trained decoder.
Early fusion allows every transformer layer to attend image-to-text and text-to-image. The architectural cost is that the image tokeniser must produce tokens in the same embedding space as the text vocabulary from the start — requiring joint pre-training rather than the simpler two-stage LLaVA recipe. Llama 4 also combines early fusion with MoE (interleaved dense and MoE layers), making it one of the first open-weight native-multimodal MoE models.
Source: Meta AI, Llama 4 blog, Apr 2025
Deep dive: Gemini 2.5 Flash/Pro — 1M context over video + audio
Gemini 2.5 Flash and Pro support a 1M-token context window natively over video frames and audio, not just text. A 1-hour video at 1 fps produces ~3,600 frames; at ~256 tokens per frame (a typical per-frame budget for a ViT-style encoder), that is ~921K tokens — approaching the full 1M context budget.
The engineering challenge at 1M context is attention memory: naive O(n²) attention would materialize an n × n score matrix with roughly 10¹² entries at n ≈ 1M — terabytes at float16 per attention head. Gemini reportedly uses ring attention / flash attention variants with aggressive sequence parallelism across TPU pods to make this tractable.
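Back-of-envelope arithmetic behind those numbers (a sketch; the 256 tokens-per-frame figure is the assumption carried over from the paragraph above):

frames = 60 * 60 * 1                # 1 hour of video sampled at 1 fps
tokens_per_frame = 256              # assumed per-frame visual token cost
n = frames * tokens_per_frame
print(f"{n:,}")                     # 921,600 tokens -- close to the 1M budget

# Naive attention materializes an n x n score matrix per head (float16 = 2 bytes/entry)
score_matrix_tb = n * n * 2 / 1e12
print(round(score_matrix_tb, 1))    # ~1.7 TB per head if materialized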
Deep dive: Qwen2.5-VL and V-JEPA 2 — open-source + world models
Qwen2.5-VL (2025) is the open-source SOTA for document and video understanding as of early 2025. It introduces native dynamic resolution (no fixed crop grid) and a position-aware vision encoder that preserves spatial relationships across tiled crops — critical for document-layout tasks, where resizing to a fixed resolution or pooling away patch positions destroys layout structure.
Source: Qwen2.5-VL, arxiv:2502.13923
V-JEPA 2 (Meta, 2025) takes a different axis: rather than generative multimodality, it learns a self-supervised video world model trained on 1M hours of video without labels. The model predicts future video representations in latent space (not pixel space), making it computationally feasible. V-JEPA 2 achieves strong physical reasoning benchmarks (object permanence, intuitive physics) without language supervision.
Source: V-JEPA 2, arxiv:2506.09985
Frontier Mentions (2024–2025)
Five domains where multimodal architectures pushed well beyond text+image in 2024–2025. Each is a pointer — follow the source for depth.
| Domain | System (date) | What's new |
|---|---|---|
| AI for science | AlphaFold 3 (Google DeepMind, May 2024); ESM-3 (EvolutionaryScale, Jun 2024) | AF3's unified diffusion head predicts protein + DNA + RNA + ligand structures jointly (Nature, May 2024). nature.com. ESM-3 is a generative protein language model reasoning jointly over sequence, structure, and function tokens. evolutionaryscale.ai |
| Embodied / robotics | π0 (Physical Intelligence, Oct 2024); NVIDIA Cosmos (Jan 2025) | π0 is the first general robot foundation model: a VLM backbone with a flow-matching diffusion action head, trained across diverse robot morphologies. pi.website. Cosmos generates physically plausible world-state videos for robot simulation and data augmentation. nvidia.com |
| World models | Genie 2 (DeepMind, Dec 2024) | Single image → interactive 3D world with consistent physics, lighting, and agent control. Demonstrates that video world models can generalize from a single frame, closing the gap between generative video and simulation. deepmind.google |
| Native speech | Moshi (Kyutai, Sep 2024); Whisper large-v3-turbo (OpenAI, Oct 2024) | Moshi is a full-duplex speech LLM — it listens and speaks simultaneously, with latency low enough for natural turn-taking (arxiv:2410.00037). Whisper large-v3-turbo cuts the decoder to 4 layers for substantially faster inference vs large-v3 with minimal accuracy loss. huggingface.co |
| Open-source SOTA | Qwen2.5-VL (2025); Llama 4 Scout/Maverick (2025) | Covered in the Deep Dives above. See also this module's frontier section and Long Context for how 1M+ context intersects multimodal video processing. |
Key Takeaways
What to remember for interviews
1. Multimodal LLMs work by encoding images into patch tokens with a ViT, projecting them into the LLM's embedding space, then concatenating with text tokens for joint self-attention.
2. LLaVA proved a single linear projection layer is sufficient to bridge a frozen CLIP encoder and an LLM — architectural complexity matters less than training data quality.
3. Early fusion (LLaVA, GPT-4V, Gemini) concatenates all tokens and lets self-attention mix modalities freely; cross-attention (Flamingo) keeps streams separate and is more parameter-efficient but limits interaction depth.
4. Resolution is the core tradeoff: a 1024×1024 image at 16×16 patches produces 4096 tokens. Production models use tiling, dynamic resolution, and token compression to manage context window cost.
5. Visual instruction tuning (Stage 2 of LLaVA training) is essential — without it, the model can describe images but cannot follow complex instructions about visual content.
6. Native multimodality (GPT-4o, Llama 4) processes all modalities end-to-end from pre-training; late-fusion adapters (LLaVA, BLIP-2) are cheaper to train but create a modality bottleneck that limits cross-modal reasoning.
Recap quiz
LLaVA Stage 1 trains on 595K pairs and Stage 2 on 158K examples. Which components are updated in each stage?
A 1024×1024 image at 16×16 patches produces 4096 tokens. A 336×336 image with CLIP ViT-L/14 produces 576. Which tiling strategy for a 1024px image reduces this to ~340 tokens?
Flamingo compresses variable-length visual features to 64 tokens via a perceiver resampler. What is the primary cost of this compression?
An early-fusion multimodal LLM processes a 576-token image + 50-token text prompt. How does total attention cost compare to text-only with 50 tokens?
Cross-attention (Flamingo) vs early fusion (LLaVA): when is cross-attention the better architectural choice?
LLaVA uses a simple linear projection to bridge CLIP and the LLM. What does this imply about where the cross-modal understanding actually comes from?
Further Reading
- Visual Instruction Tuning (LLaVA) — Liu et al. 2023 — visual instruction tuning connecting a vision encoder to an LLM
- Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. 2022 — few-shot multimodal learning with interleaved image-text inputs
- GPT-4V System Card — OpenAI 2023 — safety evaluations and capabilities of GPT-4 with vision
- LLaVA-OneVision: Easy Visual Task Transfer — Li et al. 2024 — single model handles single-image, multi-image, and video tasks; shows how to unify vision tasks with one instruction-tuned model
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks — Chen et al. 2023 — scaling ViT to 6B parameters and aligning with LLMs; shows dynamic resolution handling for OCR-heavy tasks
- Generalized Visual Language Models — Lilian Weng — comprehensive overview of vision-language model architectures, from dual encoders (CLIP) to decoder-only multimodal LLMs
Interview Questions
How do multimodal LLMs like GPT-4V process images alongside text? ★★☆
What is the difference between cross-attention and early fusion for vision-language models? ★★★
How does LLaVA achieve visual instruction following? ★★☆
What are the resolution/token tradeoffs in multimodal LLMs, and how do production models handle them? ★★★
Compare the architectural approaches of GPT-4V, Claude Vision, Gemini, and LLaVA. What are the key design tradeoffs? ★★★