🖼️ Multimodal LLMs
GPT-4V sees your image as 85 extra tokens in the prompt
Vision Transformers encode images into tokens. Multimodal LLMs take the next step: feed those visual tokens into a language model alongside text, enabling models like GPT-4V, Claude, and Gemini to see, reason about, and describe images. The architecture is surprisingly simple — a vision encoder, a projection layer, and an LLM.
How Multimodal Models See
Early fusion architecture — used by LLaVA and believed to be employed by GPT-4V and Gemini (closed-source; architecture not publicly confirmed). An image and a text prompt are encoded into separate token streams, then concatenated and fed to a single LLM decoder.
The projection layer is the critical bridge: it maps vision encoder outputs (e.g., 1024-dim CLIP space) into the LLM's embedding dimension (e.g., 4096-dim), so the LLM can treat image patches as ordinary tokens.
Multimodal LLM Pipeline
What you're seeing: the four-stage pixel-to-token pipeline — vision encoder (CLIP) extracts patch features, a projection layer maps them into the LLM's embedding space, they are concatenated with text tokens, and the decoder generates output. What to try: trace the image patch count from encoder to projection — that number is the visual token budget the LLM must attend over, which is why Q-Former and token compression exist.
The Intuition
The core idea is remarkably simple. A vision encoder (ViT/CLIP) converts an image into a sequence of embeddings. A projection layer maps those embeddings into the LLM's token space. Then you just concatenate visual tokens with text tokens and let the LLM's self-attention handle cross-modal reasoning. The LLM doesn't need to “know” it's looking at an image — it just sees more tokens.
LLaVA showed this works with a linear layer. Connect a frozen CLIP vision encoder to Vicuna with a single linear projection. Stage 1: train only the projection on 595K image-text pairs (feature alignment). Stage 2: fine-tune the projection + LLM on 158K visual instruction examples (visual instruction tuning). That's it — surprisingly competitive with far more complex architectures.
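A minimal sketch of which parameters each stage updates (assuming a LLaVA-style module with vision_encoder, visual_proj, and llm attributes, like the MultimodalLLM class further down this page):

# Hypothetical sketch of LLaVA's two training stages: only the set of
# trainable parameters changes between them.
def set_trainable(model, stage: int):
    # Vision encoder stays frozen in both stages
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # Projection layer is trained in both stages
    for p in model.visual_proj.parameters():
        p.requires_grad = True
    # LLM is frozen in Stage 1 (feature alignment),
    # unfrozen in Stage 2 (visual instruction tuning)
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)

# Stage 1: align visual features with the LLM embedding space
# set_trainable(model, stage=1)   # ~595K image-text pairs
# Stage 2: visual instruction tuning
# set_trainable(model, stage=2)   # ~158K instruction examples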
Cross-attention vs early fusion. Flamingo uses cross-attention: the text model attends to visual features via separate cross-attention layers (Q from text, K/V from vision). This is parameter-efficient and allows caching visual features. LLaVA uses early fusion (concatenate all tokens and let self-attention mix them freely); GPT-4V/Claude/Gemini are widely assumed to follow a similar early-fusion-style design, though their exact internals are not publicly disclosed. Early fusion is more expensive but allows richer cross-modal interaction at every layer.
Resolution is the key tradeoff. More pixels means more patches means more tokens means more context consumed. A 1024×1024 image at 16×16 patches produces 4096 visual tokens — a huge chunk of context window. Production models use tiling (split into crops), dynamic resolution, and token compression to manage this.
Q-Former: compressing visual tokens. LLaVA's linear projection passes all patch tokens to the LLM. BLIP-2 (Li et al., 2023) introduced the Query Transformer (Q-Former): a small transformer with a fixed set of learned query tokens (e.g., 32) that attend to the visual features via cross-attention and produce a fixed-length output regardless of image resolution. This decouples the LLM from the vision encoder — the LLM always sees exactly 32 tokens, never 196. The tradeoff: Q-Former loses fine-grained spatial detail (bad for OCR, counting), but dramatically reduces context consumption. Flamingo's perceiver resampler uses the same idea, compressing any image into 64 tokens. The field has largely moved away from Q-Former toward tile-based high-res encoding because fine-grained detail matters more for real tasks than token count.
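A minimal sketch of the learned-query idea (not BLIP-2's full Q-Former, which adds self-attention blocks and a text-interaction branch): a fixed set of query vectors cross-attends to however many patch features the encoder produces, so the output length stays constant.

import torch
import torch.nn as nn

class LearnedQueryResampler(nn.Module):
    """Compress a variable number of patch features into a fixed number of tokens."""
    def __init__(self, num_queries=32, vision_dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, patch_features):                      # (B, N_patches, Dv), N can vary
        B = patch_features.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)     # (B, 32, Dv)
        out, _ = self.cross_attn(q, patch_features, patch_features)
        return out                                           # (B, 32, Dv) regardless of N_patches

# e.g. 4096 patch tokens from a 1024x1024 image -> always 32 output tokens
# resampler = LearnedQueryResampler()
# compressed = resampler(torch.randn(2, 4096, 1024))        # (2, 32, 1024)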
Quick check
LLaVA skips Stage 2 (visual instruction tuning) and only does Stage 1 alignment on 595K pairs. What breaks?
What is the key difference between cross-attention and early fusion in multimodal LLMs?
Step-by-Step Derivation
Cross-Attention for Vision-Language
In cross-attention architectures (e.g., Flamingo), text queries attend to visual keys and values. The text representation is enriched with visual information while keeping the streams separate:
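With text hidden states $X_t$ and visual features $X_v$ (queries from text, keys/values from vision):

$$\mathrm{CrossAttn}(X_t, X_v) = \mathrm{softmax}\!\left(\frac{(X_t W_Q)\,(X_v W_K)^\top}{\sqrt{d_k}}\right)(X_v W_V)$$

Each text position can pull in visual information, but the visual tokens never enter the decoder's own sequence.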
Visual Token Count
The number of visual tokens depends on image resolution and patch size. For tiled high-resolution images:
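Assuming each tile is encoded independently with patch size $P$, plus an optional downscaled overview of the whole image:

$$N_{\text{visual}} = \sum_{i=1}^{n_{\text{tiles}}} \frac{H_i}{P} \cdot \frac{W_i}{P} \;+\; \frac{H_{\text{ov}}}{P} \cdot \frac{W_{\text{ov}}}{P}$$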
Example: a 1024×1024 image tiled into 4 crops of 512×512, with P=16: each crop yields (512/16)² = 1024 tokens and the 224×224 overview yields 196, so the total is 4 × 1024 + 196 = 4292 visual tokens (4 tile encodings + 1 overview at 224×224).
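The same arithmetic as a small helper (a sketch; tile and overview sizes are the ones from the example above):

def visual_token_count(tile_sizes, patch=16, overview=(224, 224)):
    """Total visual tokens for a tiled image plus an optional low-res overview."""
    tokens = sum((h // patch) * (w // patch) for h, w in tile_sizes)
    if overview is not None:
        tokens += (overview[0] // patch) * (overview[1] // patch)
    return tokens

# 1024x1024 image split into 4 crops of 512x512, plus a 224x224 overview:
print(visual_token_count([(512, 512)] * 4))   # 4 * 1024 + 196 = 4292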
Visual Projection (LLaVA-style)
Map vision encoder outputs to the LLM embedding space via a learned projection:
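For patch features $Z_v \in \mathbb{R}^{N \times d_v}$ (e.g., $d_v = 1024$) and LLM embedding width $d_t$ (e.g., $d_t = 4096$):

$$T_v = Z_v W_{\text{proj}} + b, \qquad W_{\text{proj}} \in \mathbb{R}^{d_v \times d_t}, \quad T_v \in \mathbb{R}^{N \times d_t}$$

The projected rows of $T_v$ are then treated exactly like text token embeddings.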
LLaVA uses a single linear layer. More complex adapters (Q-Former in BLIP-2, perceiver resampler in Flamingo) compress variable-length visual features into a fixed number of tokens.
PyTorch: Multimodal Forward Pass (LLaVA-style)
import torch
import torch.nn as nn

class MultimodalLLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # frozen CLIP ViT
        self.llm = llm                                    # pre-trained LLM
        self.visual_proj = nn.Linear(vision_dim, llm_dim) # the bridge

    def forward(self, image, text_input_ids, text_embeds):
        # Step 1: Encode image into visual features (encoder stays frozen)
        with torch.no_grad():
            visual_features = self.vision_encoder(image)   # (B, 576, Dv)

        # Step 2: Project visual features into LLM embedding space
        visual_tokens = self.visual_proj(visual_features)   # (B, 576, Dt)

        # Step 3: Interleave with text embeddings
        # text_embeds: (B, T, Dt) from the LLM's embedding layer
        # (text_input_ids would be used to build training labels; unused in this forward pass)
        multimodal_input = torch.cat([
            visual_tokens,   # visual tokens first
            text_embeds      # then text tokens
        ], dim=1)            # (B, 576 + T, Dt)

        # Step 4: LLM processes everything via self-attention
        output = self.llm(inputs_embeds=multimodal_input)
        return output        # next-token predictions over full sequence

PyTorch implementation
# CLIP-style contrastive loss (InfoNCE)
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPContrastiveLoss(nn.Module):
    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = nn.Parameter(torch.tensor(temperature))

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor):
        # Normalize embeddings to unit sphere
        img = F.normalize(image_features, dim=-1)   # (B, D)
        txt = F.normalize(text_features, dim=-1)    # (B, D)

        # Cosine similarity matrix, scaled by temperature
        logits = (img @ txt.T) / self.temperature   # (B, B)

        # Symmetric InfoNCE: diagonal = positive pairs
        labels = torch.arange(len(logits), device=logits.device)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2

Quick check
A 1024×1024 image at 16×16 patches = 4096 tokens. You want to keep it under 500 tokens using tiling. What is the minimum number of 512×512 tiles needed, and how many tokens does each tile produce?
Break It — See What Happens
Quick check
You remove the vision encoder entirely and instead pass raw pixel values as floating-point tokens to the LLM. What is the main failure mode?
Real-World Numbers
| Model | Architecture | Details |
|---|---|---|
| GPT-4V | Likely early fusion (unconfirmed) | OpenAI docs: low-detail mode uses a 512×512 representation (~85 tokens); high-detail mode adds ~170 tokens per extra 512×512 tile on top of the low-detail overview. |
| Claude Vision | Likely early fusion (unconfirmed) | Up to ~1500 visual tokens, strong on documents, charts, and structured visual content |
| Gemini | Natively multimodal | Trained multimodal from scratch (not bolted-on), handles image/audio/video natively |
| LLaVA-1.5 | Early fusion | CLIP ViT-L/14 at 336px + Vicuna-13B, 576 visual tokens per image, open-source |
| Flamingo | Cross-attention | Perceiver resampler compresses each image to 64 tokens, few-shot learning pioneer |
Quick check
GPT-4V low-detail mode uses ~85 tokens for any image. High-detail mode adds ~170 tokens per 512×512 tile. For a 1024×1024 image in high-detail mode, what is the total token count?
Native Multimodal Frontier (2024–2025)
Through 2023, multimodal models were LLMs with vision adapters bolted on. 2024–2025 saw a shift to native multimodality: text, images, audio, and video processed end-to-end in the same token stream from pre-training, not fine-tuning.
GPT-4o (May 2024) — first natively multimodal frontier model
GPT-4V was widely understood to be text-only GPT-4 with a bolted-on vision encoder. GPT-4o was trained end-to-end on text, audio, and image tokens simultaneously. Key architectural consequence: the model can reason across modalities in a single forward pass — no intermediate text-only bottleneck. Voice response latency dropped from multiple seconds in the old cascaded speech pipeline (~2.8–5.4s) to ~320ms on average, per OpenAI's announcement and subsequent community measurements.
Source: OpenAI, “Hello GPT-4o”, May 2024
Deep dive: Llama 4 early fusion (Apr 2025)
Llama 4 (Scout and Maverick) uses early fusion: image patch tokens are interleaved with text tokens from the very first transformer layer. This contrasts with the adapter-style approach used in most prior open models, where a separately pre-trained vision encoder is bolted onto an existing LLM through a projection layer and visual tokens are only injected at the input of the already-trained decoder.
Early fusion allows every transformer layer to attend image-to-text and text-to-image. The architectural cost is that the image tokeniser must produce tokens in the same embedding space as the text vocabulary from the start — requiring joint pre-training rather than the simpler two-stage LLaVA recipe. Llama 4 also combines early fusion with MoE (interleaved dense and MoE layers), making it one of the first open-weight native-multimodal MoE models.
Source: Meta AI, Llama 4 blog, Apr 2025
Deep dive: Gemini 2.5 Flash/Pro — 1M context over video + audio
Gemini 2.5 Flash and Pro support a 1M-token context window natively over video frames and audio, not just text. A 1-hour video at 1 fps produces ~3,600 frames; at ~256 tokens per frame (a typical per-frame budget for a ViT-style encoder), that is ~921K tokens — approaching the full 1M context budget.
The engineering challenge at 1M context is attention memory: naive O(n²) attention would materialize an n × n score matrix with roughly 10¹² entries at n ≈ 1M — terabytes at float16 per attention head. Gemini reportedly uses ring attention / flash attention variants with aggressive sequence parallelism across TPU pods to make this tractable.
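Back-of-envelope arithmetic behind those numbers (a sketch; the 256 tokens-per-frame figure is the assumption carried over from the paragraph above):

frames = 60 * 60 * 1                # 1 hour of video sampled at 1 fps
tokens_per_frame = 256              # assumed per-frame visual token cost
n = frames * tokens_per_frame
print(f"{n:,}")                     # 921,600 tokens -- close to the 1M budget

# Naive attention materializes an n x n score matrix per head (float16 = 2 bytes/entry)
score_matrix_tb = n * n * 2 / 1e12
print(round(score_matrix_tb, 1))    # ~1.7 TB per head if materialized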
Deep dive: Qwen2.5-VL and V-JEPA 2 — open-source + world models
Qwen2.5-VL (2025) is the open-source SOTA for document and video understanding as of early 2025. It introduces native dynamic resolution (no fixed crop grid) and a position-aware vision encoder that preserves spatial relationships across tiled crops — critical for document-layout tasks, where resizing to a fixed resolution or pooling away patch positions destroys layout structure.
Source: Qwen2.5-VL, arxiv:2502.13923
V-JEPA 2 (Meta, 2025) takes a different axis: rather than generative multimodality, it learns a self-supervised video world model trained on 1M hours of video without labels. The model predicts future video representations in latent space (not pixel space), making it computationally feasible. V-JEPA 2 achieves strong physical reasoning benchmarks (object permanence, intuitive physics) without language supervision.
Source: V-JEPA 2, arxiv:2506.09985
Frontier Mentions (2024–2025)
Five domains where multimodal architectures pushed well beyond text+image in 2024–2025. Each is a pointer — follow the source for depth.
| Domain | System (date) | What's new |
|---|---|---|
| AI for science | AlphaFold 3 (Google DeepMind, May 2024); ESM-3 (EvolutionaryScale, Jun 2024) | AF3's unified diffusion head predicts protein + DNA + RNA + ligand structures jointly (Nature, May 2024). nature.com. ESM-3 is a generative protein language model reasoning jointly over sequence, structure, and function tokens. evolutionaryscale.ai |
| Embodied / robotics | π0 (Physical Intelligence, Oct 2024); NVIDIA Cosmos (Jan 2025) | π0 is the first general robot foundation model: a VLM backbone with a flow-matching diffusion action head, trained across diverse robot morphologies. pi.website. Cosmos generates physically plausible world-state videos for robot simulation and data augmentation. nvidia.com |
| World models | Genie 2 (DeepMind, Dec 2024) | Single image → interactive 3D world with consistent physics, lighting, and agent control. Demonstrates that video world models can generalize from a single frame, closing the gap between generative video and simulation. deepmind.google |
| Native speech | Moshi (Kyutai, Sep 2024); Whisper large-v3-turbo (OpenAI, Oct 2024) | Moshi is a full-duplex speech LLM — it listens and speaks simultaneously, with latency low enough for natural turn-taking (arxiv:2410.00037). Whisper large-v3-turbo cuts the decoder to 4 layers for substantially faster inference vs large-v3 with minimal accuracy loss. huggingface.co |
| Open-source SOTA | Qwen2.5-VL (2025); Llama 4 Scout/Maverick (2025) | Covered in the Deep Dives above. See also this module's frontier section and Long Context for how 1M+ context intersects multimodal video processing. |
Key Takeaways
What to remember for interviews
1. Multimodal LLMs work by encoding images into patch tokens with a ViT, projecting them into the LLM's embedding space, then concatenating with text tokens for joint self-attention.
2. LLaVA proved a single linear projection layer is sufficient to bridge a frozen CLIP encoder and an LLM — architectural complexity matters less than training data quality.
3. Early fusion (LLaVA, GPT-4V, Gemini) concatenates all tokens and lets self-attention mix modalities freely; cross-attention (Flamingo) keeps streams separate and is more parameter-efficient but limits interaction depth.
4. Resolution is the core tradeoff: a 1024×1024 image at 16×16 patches produces 4096 tokens. Production models use tiling, dynamic resolution, and token compression to manage context window cost.
5. Visual instruction tuning (Stage 2 of LLaVA training) is essential — without it, the model can describe images but cannot follow complex instructions about visual content.
6. Native multimodality (GPT-4o, Llama 4) processes all modalities end-to-end from pre-training; late-fusion adapters (LLaVA, BLIP-2) are cheaper to train but create a modality bottleneck that limits cross-modal reasoning.
Recap quiz
LLaVA Stage 1 trains on 595K pairs and Stage 2 on 158K examples. Which components are updated in each stage?
A 1024×1024 image at 16×16 patches produces 4096 tokens. A 336×336 image with CLIP ViT-L/14 produces 576. Which tiling strategy for a 1024px image reduces this to ~340 tokens?
Flamingo compresses variable-length visual features to 64 tokens via a perceiver resampler. What is the primary cost of this compression?
An early-fusion multimodal LLM processes a 576-token image + 50-token text prompt. How does total attention cost compare to text-only with 50 tokens?
Cross-attention (Flamingo) vs early fusion (LLaVA): when is cross-attention the better architectural choice?
LLaVA uses a simple linear projection to bridge CLIP and the LLM. What does this imply about where the cross-modal understanding actually comes from?
Further Reading
- Visual Instruction Tuning (LLaVA) — Liu et al. 2023 — visual instruction tuning connecting a vision encoder to an LLM
- Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. 2022 — few-shot multimodal learning with interleaved image-text inputs
- GPT-4V System Card — OpenAI 2023 — safety evaluations and capabilities of GPT-4 with vision
- LLaVA-OneVision: Easy Visual Task Transfer — Li et al. 2024 — single model handles single-image, multi-image, and video tasks; shows how to unify vision tasks with one instruction-tuned model
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks — Chen et al. 2023 — scaling ViT to 6B parameters and aligning with LLMs; shows dynamic resolution handling for OCR-heavy tasks
- Generalized Visual Language Models — Lilian Weng — comprehensive overview of vision-language model architectures, from dual encoders (CLIP) to decoder-only multimodal LLMs
Interview Questions
How do multimodal LLMs like GPT-4V process images alongside text? ★★☆
What is the difference between cross-attention and early fusion for vision-language models? ★★★
How does LLaVA achieve visual instruction following? ★★☆
What are the resolution/token tradeoffs in multimodal LLMs, and how do production models handle them? ★★★
Compare the architectural approaches of GPT-4V, Claude Vision, Gemini, and LLaVA. What are the key design tradeoffs? ★★★