
Transformer Math

Module 22 · Training

🧬 Model Merging

Three fine-tuned models. SLERP into one. The merge wins on tasks none of them mastered alone — and you spent zero GPU-hours.


You have already spent the GPU budget. Three separate fine-tunes exist on disk: one specialized for code, one for math reasoning, one for medical Q&A. None of them is best at all three. Retraining a single model on a combined dataset would take days and could degrade each specialty.

Model merging is the alternative: arithmetic on weight checkpoints. Add the task-specific deltas — the difference between a fine-tuned model and its base — directly to the base model's weights. No forward pass. No gradient. Zero GPU-hours. The merged model often outperforms any individual fine-tune on held-out benchmarks.

This is not an ensemble (which multiplies inference cost). The merged checkpoint is a single model with the same parameter count as the base. Serving cost is identical.

What you’re seeing

Two fine-tuned weight vectors (A and B) on the unit sphere. The straight chord is LERP — it cuts through the interior and shrinks the magnitude. The arc is SLERP — it stays on the sphere and preserves the norm.

[Interactive figure: Model A and Model B on the unit sphere. The LERP midpoint has norm < max(||A||, ||B||); the SLERP midpoint has norm = ||A|| = ||B||.]

What to try: run mergekit-yaml with merge_method: slerp and sweep t from 0.3 to 0.7 to find the blend that best satisfies your eval suite.
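A minimal sketch of that sweep, written as Python that emits one mergekit YAML per value of t. Model names and the layer count are placeholders; check the mergekit README for the exact config schema.

python
import yaml

def slerp_config(t: float) -> dict:
    # Two hypothetical fine-tunes of the same 32-layer base model
    return {
        "merge_method": "slerp",
        "base_model": "your-org/model-a",
        "slices": [{
            "sources": [
                {"model": "your-org/model-a", "layer_range": [0, 32]},
                {"model": "your-org/model-b", "layer_range": [0, 32]},
            ]
        }],
        "parameters": {"t": t},   # blend factor: 0 = model-a, 1 = model-b
        "dtype": "bfloat16",
    }

for t in (0.3, 0.4, 0.5, 0.6, 0.7):
    with open(f"slerp_t{t:.1f}.yml", "w") as f:
        yaml.dump(slerp_config(t), f)
    # then: mergekit-yaml slerp_t0.3.yml ./merged-t0.3 ... and score each merge on your eval suite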

💡

Why Does Merging Work?

The problem. Two fine-tunes that start from the same pre-trained checkpoint are not independent. The pre-training landscape has already carved out a basin of weights that know about language. Fine-tuning keeps the model inside that basin — it moves the weights to a new region that also knows the target task, but it doesn't escape the original basin. Both fine-tunes live in the same valley.

Linear mode connectivity (Frankle et al., 2020) is the empirical finding that when two models share the same initialization, there exists a low-loss path between their weight vectors. You can walk from one to the other along a straight line in weight space without crossing a high-loss barrier. This is what makes interpolation work — you stay in the valley the whole time.

Task vectors. Ilharco et al. (2022) framed this as arithmetic on task vectors: τ = θ_finetuned − θ_base. Each task vector encodes the directional change from base to fine-tune. Adding two task vectors to the base adds both sets of task-specific changes. The result is a model that has been pulled in both directions simultaneously. Empirically, for tasks that do not conflict, the pulls reinforce.
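A minimal sketch of task-vector arithmetic on raw state dicts. The checkpoint paths are hypothetical, and the sketch assumes all three checkpoints share identical keys and shapes.

python
import torch

base = torch.load("base.pt")         # hypothetical paths to state dicts
ft_math = torch.load("ft_math.pt")
ft_code = torch.load("ft_code.pt")

lam = 1.0                            # scaling factor for the combined task vectors
merged = {}
for name, w_base in base.items():
    tau_math = ft_math[name] - w_base    # task vector: fine-tune minus base
    tau_code = ft_code[name] - w_base
    merged[name] = w_base + lam * (tau_math + tau_code)

torch.save(merged, "merged.pt")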

✨ Insight

Merging vs Ensembling — Ensembling averages output logits at inference time: serving N models costs N× the memory and compute per forward pass. Merging averages weights before inference. The merged checkpoint is a single model — same memory, same latency, same throughput as any individual fine-tune. For production serving, merging is almost always preferred over ensembling when quality parity holds.
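A toy contrast, assuming HuggingFace-style causal LMs whose forward pass returns .logits: ensembling pays N forward passes per request, while merging pays a one-time weight average and serves a single model.

python
import torch

def ensemble_logits(models, input_ids):
    # Ensembling: N forward passes per input, logits averaged at serving time
    with torch.no_grad():
        return torch.stack([m(input_ids).logits for m in models]).mean(dim=0)

def average_weights(state_dicts):
    # Merging: average the weights once, offline; serve one model afterwards
    return {
        k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
        for k in state_dicts[0]
    }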

| Method | Key idea | Handles sign conflicts | GPU cost |
|---|---|---|---|
| LERP / Task Arithmetic | Weighted sum of task vectors | No | 0 |
| SLERP | Arc interpolation on unit sphere | No | 0 |
| TIES | Trim + elect sign + disjoint avg | Yes | 0 |
| DARE + TIES | Random drop + rescale + TIES | Yes | 0 |
| Model Soups | Greedy checkpoint averaging | Partial | 0 |
| Ensembling | Average output logits at runtime | N/A | N × base cost |
Quick Check

Two models are fine-tuned from Llama-3-8B. Model A specializes in Spanish translation; Model B in Python code. You merge them with SLERP at t=0.5. Which claim best describes the result?

📐

Merge Algorithms: The Math

1. SLERP

Spherical Linear Interpolation treats parameter tensors as vectors on a high-dimensional unit sphere. For two weight vectors θ_A and θ_B separated by angle Ω:

SLERP(θ_A, θ_B; t) = [sin((1 − t)·Ω) / sin Ω] · θ_A + [sin(t·Ω) / sin Ω] · θ_B

SLERP — spherical interpolation at blending factor t ∈ [0,1]

At t = 0 you get θ_A; at t = 1 you get θ_B. The sine weighting keeps ‖SLERP(θ_A, θ_B; t)‖ = ‖θ_A‖ for all t when ‖θ_A‖ = ‖θ_B‖.
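A minimal sketch of SLERP applied to two flattened weight tensors, following the formula above. Mergekit's implementation adds per-layer and per-filter handling on top of this.

python
import torch

def slerp(theta_a: torch.Tensor, theta_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    a, b = theta_a.flatten().float(), theta_b.flatten().float()
    # Angle between the two weight vectors, measured on the unit sphere
    cos_omega = torch.clamp((a / a.norm()) @ (b / b.norm()), -1.0, 1.0)
    omega = torch.arccos(cos_omega)
    if omega < eps:
        out = (1 - t) * a + t * b          # nearly parallel: fall back to LERP
    else:
        out = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return out.reshape(theta_a.shape)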

2. TIES-Merging (Trim · Elect Sign · Disjoint Merge)

Given n fine-tunes, compute a task vector τ_i = θ_i − θ_base for each. Then:

  1. Trim: keep only the top-k% of entries of each τ_i by magnitude. Zeroes out the (100 − k)% lowest-magnitude changes.
  2. Elect Sign: γ_p = sign(Σ_i τ_{i,p}) over the trimmed vectors. Majority-vote sign per parameter position p.
  3. Disjoint Merge: τ_p = (1 / |A_p|) · Σ_{i ∈ A_p} τ_{i,p}, where A_p = {i : sign(τ_{i,p}) = γ_p}. Only models that agree with the elected sign are averaged at each position.

θ_merged = θ_base + λ · τ

TIES final merge (per parameter position p, with scaling factor λ)
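A minimal sketch of the three steps on raw state dicts, assuming float tensors with identical keys across checkpoints. Mergekit's TIES implementation handles dtypes, embeddings, and tokenizers on top of this.

python
import torch

def ties_merge(base: dict, finetunes: list, k: float = 0.2, lam: float = 1.0) -> dict:
    merged = {}
    for name, w_base in base.items():
        w_base = w_base.float()
        taus = torch.stack([ft[name].float() - w_base for ft in finetunes])
        # 1. Trim: keep only the top-k fraction of entries by magnitude, per fine-tune
        flat = taus.abs().flatten(start_dim=1)
        n_keep = max(1, int(k * flat.shape[1]))
        thresh = flat.topk(n_keep, dim=1).values[:, -1].view(-1, *([1] * (taus.dim() - 1)))
        trimmed = torch.where(taus.abs() >= thresh, taus, torch.zeros_like(taus))
        # 2. Elect sign: majority sign of the summed trimmed deltas, per parameter
        elected = torch.sign(trimmed.sum(dim=0))
        # 3. Disjoint merge: average only the deltas whose sign matches the elected sign
        agree = (torch.sign(trimmed) == elected) & (trimmed != 0)
        tau = (trimmed * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = w_base + lam * tau
    return merged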

3. DARE

Drop And REscale sparsifies each task vector before merging:

m_j ~ Bernoulli(1 − p),   τ̂_j = (m_j · τ_j) / (1 − p)

DARE: drop fraction p independently per parameter, rescale survivors by 1/(1 − p)

With p = 0.9, 90% of deltas are zeroed. The survivors are scaled up by 10× so their expectation matches the original delta. DARE is typically composed with TIES after the drop: DARE-TIES.
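A minimal sketch of the drop-and-rescale step applied to one task vector; the merge itself then proceeds with TIES or plain addition.

python
import torch

def dare(tau: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    # Keep each delta with probability (1 - p), zero the rest
    mask = (torch.rand_like(tau.float()) >= p).to(tau.dtype)
    # Rescale survivors by 1/(1 - p) so the expected delta is unchanged
    return tau * mask / (1.0 - p)

# Usage sketch (hypothetical state dicts): sparsify each task vector before merging
# tau_sparse = dare(ft_state[name] - base_state[name], p=0.9)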

4. Model Soups (Weight Averaging)

Average the weights of multiple checkpoints fine-tuned from the same base (different hyperparameters or epochs). The “greedy soup” variant adds a checkpoint only if the running average improves validation accuracy:

θ_soup = (1 / K) · Σ_{k=1}^{K} θ_k

Uniform model soup (K checkpoints)

Model soups exploit flat loss basins: interpolating between nearby minima in the same basin reaches regions with lower curvature and better generalization than any single minimum.
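A minimal sketch of the greedy soup loop. It assumes `checkpoints` is a list of state dicts sorted by individual validation accuracy and `evaluate` is your validation metric; both are hypothetical.

python
def greedy_soup(checkpoints: list, evaluate) -> dict:
    # Values are torch tensors; start the soup from the best single checkpoint
    soup = {k: v.clone().float() for k, v in checkpoints[0].items()}
    n, best = 1, evaluate(soup)
    for ckpt in checkpoints[1:]:
        # Candidate soup: running average extended by one more checkpoint
        candidate = {k: (soup[k] * n + ckpt[k].float()) / (n + 1) for k in soup}
        score = evaluate(candidate)
        if score >= best:            # keep the checkpoint only if validation improves
            soup, n, best = candidate, n + 1, score
    return soup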

Python: merge two HuggingFace models with Mergekit
python
# pip install mergekit
import yaml
import subprocess

# Define a TIES merge of two Mistral-7B fine-tunes
merge_config = {
    "merge_method": "ties",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "models": [
        {
            "model": "your-org/mistral-7b-math",
            "parameters": {"density": 0.5, "weight": 1.0},
        },
        {
            "model": "your-org/mistral-7b-code",
            "parameters": {"density": 0.5, "weight": 1.0},
        },
    ],
    "parameters": {
        "normalize": True,
        "int8_mask": True,
    },
    "dtype": "bfloat16",
}

with open("/tmp/merge_config.yml", "w") as f:
    yaml.dump(merge_config, f)

subprocess.run([
    "mergekit-yaml",
    "/tmp/merge_config.yml",
    "/tmp/merged-model",
    "--cuda",          # GPU acceleration
    "--low-cpu-memory",  # stream layers to avoid OOM
], check=True)

# Load and test the merged model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/tmp/merged-model")
model = AutoModelForCausalLM.from_pretrained(
    "/tmp/merged-model", torch_dtype="auto"
)
print("Merge successful. Parameter count:", sum(p.numel() for p in model.parameters()))
💥

Break It

SLERP two models that disagree on most tasks
Merge models from different base checkpoints
🔢

Real Numbers

TIES-Merging benchmark (Yadav et al., 2023)

Merging 8 fine-tuned ViT-B/32 models with TIES achieves 72.1% average accuracy across 8 tasks, vs 68.7% for Task Arithmetic and 65.0% for uniform averaging. Source: Table 2 in arxiv:2306.01708.

DARE on LLaMA-family models (Yu et al., 2023)

With drop rate p = 0.9 on WizardMath-7B fine-tuned from LLaMA-2-7B, DARE retains accuracy within 1 percentage point of the original fine-tune while zeroing 90% of delta parameters. Source: arxiv:2311.03099 Table 3.

Model Soups on ViT-G/14 CLIP (Wortsman et al., 2022)

Greedy soup over 72 fine-tunes achieves 90.94% ImageNet top-1 accuracy, vs 90.74% for the best single fine-tune — +0.2pp “for free” by averaging weights. Source: arxiv:2203.05482 Table 2.

Open LLM Leaderboard — merged models dominate top-10 (2024)

Multiple top-10 entries on the HuggingFace Open LLM Leaderboard (circa early 2024) used mergekit to combine Mistral-7B or LLaMA-2-13B fine-tunes. Community observation documented in the Mergekit GitHub readme (arcee-ai/mergekit).

🧠

Key Takeaways

What to remember for interviews

  1. Model merging combines fine-tuned checkpoints via weight-space arithmetic — zero GPU cost at merge time, same inference cost as a single model.
  2. Linear mode connectivity is the prerequisite: both models must be fine-tuned from the same pre-trained base to share a loss basin.
  3. SLERP preserves vector magnitude on the weight hypersphere; LERP chord-interpolation shrinks it, degrading performance.
  4. TIES resolves inter-task interference by voting on the sign of each parameter delta before averaging, typically outperforming naive Task Arithmetic by 3–5pp on multi-task benchmarks.
  5. DARE drops 90% of task-vector deltas at random and rescales survivors by 1/(1 − p), enabling high-sparsity merges with near-zero accuracy loss.
  6. Model Soups average checkpoint weights from nearby minima to reach flatter loss regions with better generalization than any single run.
🧠

Check Your Understanding

Derivation

Why does SLERP work better than naive linear interpolation (LERP) when merging two fine-tuned models?

Recall

TIES-Merging trims the task vectors before combining them. What is the primary purpose of the trimming step?

Trade-off

DARE randomly drops parameter deltas and then rescales the survivors. Why is the rescaling step necessary?

Trade-off

Model Soups and SLERP both rely on linear mode connectivity. When does this property break down?

🎯

Interview Questions


You have two Mistral-7B fine-tunes: one for code generation, one for math reasoning. Explain how TIES-Merging would combine them. Walk through all three steps — Trim, Elect Sign, Disjoint Merge — and describe what each step prevents.

★★☆
Google · Anthropic · Meta

Why can you not use model merging to combine two models that were fine-tuned from different base checkpoints (e.g., Llama-3-8B and Mistral-7B-v0.3)? What is the theoretical reason?

★★☆
OpenAI · Anthropic

A team merges five LoRA fine-tunes using Task Arithmetic (simple addition of task vectors with equal weight). Benchmark scores are below the best individual fine-tune. What went wrong, and how would you fix it?

★★★
Google · Meta

Explain Model Soups and why averaging model checkpoints from the same training run (greedy soup) outperforms any single checkpoint.

★★☆
Google · Meta · Anthropic
📚

Further Reading