🧬 Model Merging
Three fine-tuned models. SLERP into one. The merge wins on tasks none of them mastered alone — and you spent zero GPU-hours.
You have already spent the GPU budget. Three separate fine-tunes exist on disk: one specialized for code, one for math reasoning, one for medical Q&A. None of them is best at all three. Retraining a single model on a combined dataset would take days and could degrade each specialty.
Model merging is the alternative: arithmetic on weight checkpoints. Add the task-specific deltas — the difference between a fine-tuned model and its base — directly to the base model's weights. No forward pass. No gradient. Zero GPU-hours. The merged model often outperforms any individual fine-tune on held-out benchmarks.
This is not an ensemble (which multiplies inference cost). The merged checkpoint is a single model with the same parameter count as the base. Serving cost is identical.
What you’re seeing
Two fine-tuned weight vectors (A and B) on the unit sphere. The straight chord is LERP — it cuts through the interior and shrinks the magnitude. The arc is SLERP — it stays on the sphere and preserves the norm.
What to try: run mergekit-yaml with merge_method: slerp and sweep t from 0.3 to 0.7 to find the blend that best satisfies your eval suite.
Why Does Merging Work?
The problem. Two fine-tunes that start from the same pre-trained checkpoint are not independent. The pre-training landscape has already carved out a basin of weights that know about language. Fine-tuning keeps the model inside that basin — it moves the weights to a new region that also knows the target task, but it doesn't escape the original basin. Both fine-tunes live in the same valley.
Linear mode connectivity (Frankle et al., 2020) is the empirical finding that when two models share the same initialization, there exists a low-loss path between their weight vectors. You can walk from one to the other along a straight line in weight space without crossing a high-loss barrier. This is what makes interpolation work — you stay in the valley the whole time.
Task vectors. Ilharco et al. (2022) framed this as arithmetic on task vectors: τ = θ_fine-tuned − θ_base, so a merge is simply θ_base + τ_A + τ_B. Each task vector encodes the directional change from base to fine-tune. Adding two task vectors to the base adds both sets of task-specific changes. The result is a model that has been pulled in both directions simultaneously. Empirically, for tasks that do not conflict, the pulls reinforce.
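To make the arithmetic concrete, here is a toy NumPy sketch on flat weight vectors. Real checkpoints would be merged tensor-by-tensor from their state dicts; all numbers and names below are made up for illustration.

```python
import numpy as np

# Toy weights: one base checkpoint plus two hypothetical fine-tunes.
base = np.array([0.5, -0.2, 0.1, 0.0])
ft_math = np.array([0.7, -0.2, 0.1, 0.3])   # hypothetical math fine-tune
ft_code = np.array([0.5, 0.1, -0.2, 0.0])   # hypothetical code fine-tune

tau_math = ft_math - base    # task vector: what math tuning changed
tau_code = ft_code - base    # task vector: what code tuning changed

# Task arithmetic: apply both deltas to the base at once.
merged = base + tau_math + tau_code
# merged equals [0.7, 0.1, -0.2, 0.3]: both pulls applied simultaneously.
```

Because the two task vectors touch mostly disjoint positions here, the merge inherits both specializations without interference.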
Merging vs Ensembling — Ensembling averages output logits at inference time. It costs 2× memory and compute per forward pass, and scales linearly with the number of models. Merging averages weights before inference. The merged checkpoint is a single model — same memory, same latency, same throughput as any individual fine-tune. For production serving, merging is almost always preferred over ensembling when quality parity holds.
| Method | Key idea | Handles sign conflicts | GPU cost |
|---|---|---|---|
| LERP / Task Arithmetic | Weighted sum of task vectors | No | 0 |
| SLERP | Arc interpolation on unit sphere | No | 0 |
| TIES | Trim + elect sign + disjoint avg | Yes | 0 |
| DARE + TIES | Random drop + rescale + TIES | Yes | 0 |
| Model Soups | Greedy checkpoint averaging | Partial | 0 |
| Ensembling | Average output logits at runtime | N/A | N × base cost |
Two models are fine-tuned from Llama-3-8B. Model A specializes in Spanish translation; Model B in Python code. You merge them with SLERP at t=0.5. Which claim best describes the result?
Merge Algorithms: The Math
1. SLERP
Spherical Linear Interpolation treats parameter tensors as vectors on a high-dimensional unit sphere. For two weight vectors θ_A and θ_B separated by angle Ω (cos Ω = θ_A · θ_B for unit vectors):

SLERP — spherical interpolation at blending factor t ∈ [0,1]

SLERP(θ_A, θ_B; t) = sin((1−t)Ω)/sin(Ω) · θ_A + sin(tΩ)/sin(Ω) · θ_B

At t = 0 you get θ_A; at t = 1 you get θ_B. The sine weighting keeps ‖θ(t)‖ = 1 for all t when ‖θ_A‖ = ‖θ_B‖ = 1.
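A minimal NumPy sketch of spherical interpolation on flat weight vectors (this is a simplified reading of the formula, not mergekit's implementation; the parallel-vector fallback threshold is an arbitrary choice):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical interpolation between weight vectors a and b at blend factor t."""
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))  # angle between the vectors
    if omega < eps:  # nearly parallel: fall back to plain LERP
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
# The SLERP midpoint stays on the unit circle (norm 1.0),
# while the LERP midpoint [0.5, 0.5] has norm ~0.707.
```

The norm contrast is the whole argument for SLERP: the chord midpoint shrinks the weights, the arc midpoint does not.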
2. TIES-Merging (Trim · Elect Sign · Disjoint Merge)
Given n fine-tunes, compute a task vector τ_t = θ_t − θ_base for each t. Then:
- Trim: keep only the top-k% of each τ_t by magnitude. Zeroes out the (100−k)% lowest-magnitude changes.
- Elect Sign: γ_p = sgn(Σ_t τ_t,p). The sign with the larger total magnitude wins at each parameter position p.
- Disjoint Merge: τ_p = (1/|A_p|) Σ_{t ∈ A_p} τ_t,p, where A_p = {t : sgn(τ_t,p) = γ_p}. Only models that agree with the elected sign are averaged at each position.

TIES final merge (per parameter position p)

θ_merged,p = θ_base,p + λ · τ_p
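The three steps can be sketched in NumPy on flat task vectors (a simplified reading of the paper, not the mergekit implementation; the k and lam defaults are illustrative):

```python
import numpy as np

def ties_merge(base, finetunes, k=0.2, lam=1.0):
    """TIES sketch: trim, elect sign, disjoint merge over flat weight vectors."""
    taus = np.stack([ft - base for ft in finetunes])      # task vectors, one row each
    # Trim: keep the top-k fraction of each task vector by magnitude.
    thresh = np.quantile(np.abs(taus), 1 - k, axis=1, keepdims=True)
    taus = np.where(np.abs(taus) >= thresh, taus, 0.0)
    # Elect sign: per position, the sign with the larger total mass wins.
    gamma = np.sign(taus.sum(axis=0))
    # Disjoint merge: average only the entries agreeing with the elected sign.
    agree = (np.sign(taus) == gamma) & (taus != 0)
    counts = np.maximum(agree.sum(axis=0), 1)             # avoid divide-by-zero
    merged_tau = (taus * agree).sum(axis=0) / counts
    return base + lam * merged_tau
```

With two toy fine-tunes that agree on one large delta and conflict (equal and opposite) on another, the agreeing position survives intact while the conflicting one is suppressed rather than averaged into a meaningless midpoint.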
3. DARE
Drop And REscale sparsifies each task vector before merging:
DARE: drop fraction p independently per parameter, rescale by 1/(1−p)

τ̃ = (m ⊙ τ) / (1 − p), where m_i ~ Bernoulli(1 − p)

With p = 0.9, 90% of deltas are zeroed. The survivors are scaled up by 10× so their expectation matches the original delta. DARE is typically composed with TIES after the drop: DARE-TIES.
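The drop-and-rescale step is two lines of NumPy (an illustrative sketch; seed and array shape are arbitrary):

```python
import numpy as np

def dare(tau: np.ndarray, p: float = 0.9, seed: int = 0) -> np.ndarray:
    """Zero each delta with probability p, rescale survivors by 1/(1-p)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(tau.shape) >= p        # each entry survives with prob 1 - p
    return tau * keep / (1.0 - p)

tau = np.ones(100_000)
sparse = dare(tau, p=0.9)
# ~90% of entries are now zero; the mean stays near 1.0 because
# each survivor is scaled by 10x, making the estimate unbiased.
```

The rescaling is what keeps E[τ̃] = τ: without it, the merged delta would shrink by a factor of (1 − p) and the fine-tuned behavior would mostly vanish.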
4. Model Soups (Weight Averaging)
Average the weights of multiple checkpoints fine-tuned from the same base (different hyperparameters or epochs). The “greedy soup” variant adds a checkpoint only if the running average improves validation accuracy:
Uniform model soup (K checkpoints)

θ_soup = (1/K) Σ_{k=1..K} θ_k

Model soups exploit flat loss basins: interpolating between nearby minima in the same basin reaches regions with lower curvature and better generalization than any single minimum.
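The greedy variant can be sketched as follows, assuming checkpoints are flat weight vectors sorted by individual validation accuracy and `evaluate` is a caller-supplied validation metric (both are assumptions for this sketch, not part of any library API):

```python
import numpy as np

def greedy_soup(checkpoints, evaluate):
    """Add each checkpoint to the soup only if the running average improves."""
    soup = [checkpoints[0]]                  # start from the best single model
    best = evaluate(checkpoints[0])
    for ckpt in checkpoints[1:]:
        candidate = np.mean(soup + [ckpt], axis=0)
        score = evaluate(candidate)
        if score >= best:                    # keep the ingredient only if it helps
            soup.append(ckpt)
            best = score
    return np.mean(soup, axis=0)
```

The uniform soup is simply `np.mean(checkpoints, axis=0)`; the greedy variant protects against a single bad checkpoint dragging the average out of the basin.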
Python: merge two HuggingFace models with Mergekit
# pip install mergekit
import yaml
import subprocess

# Define a TIES merge of two Mistral-7B fine-tunes
merge_config = {
    "merge_method": "ties",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "models": [
        {
            "model": "your-org/mistral-7b-math",
            "parameters": {"density": 0.5, "weight": 1.0},
        },
        {
            "model": "your-org/mistral-7b-code",
            "parameters": {"density": 0.5, "weight": 1.0},
        },
    ],
    "parameters": {
        "normalize": True,
        "int8_mask": True,
    },
    "dtype": "bfloat16",
}

with open("/tmp/merge_config.yml", "w") as f:
    yaml.dump(merge_config, f)

subprocess.run([
    "mergekit-yaml",
    "/tmp/merge_config.yml",
    "/tmp/merged-model",
    "--cuda",            # GPU acceleration
    "--low-cpu-memory",  # stream layers to avoid OOM
], check=True)

# Load and test the merged model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/tmp/merged-model")
model = AutoModelForCausalLM.from_pretrained(
    "/tmp/merged-model", torch_dtype="auto"
)
print("Merge successful. Parameter count:", sum(p.numel() for p in model.parameters()))
Real Numbers
TIES-Merging benchmark (Yadav et al., 2023)
Merging 8 fine-tuned ViT-B/32 models with TIES achieves 72.1% average accuracy across 8 tasks, vs 68.7% for Task Arithmetic and 65.0% for uniform averaging. Source: Table 2 in arxiv:2306.01708.
DARE on LLaMA-family models (Yu et al., 2023)
With drop rate p = 0.9 on WizardMath-7B fine-tuned from LLaMA-2-7B, DARE retains accuracy within 1 percentage point of the original fine-tune while zeroing 90% of delta parameters. Source: arxiv:2311.03099 Table 3.
Model Soups on ViT-G/14 CLIP (Wortsman et al., 2022)
Greedy soup over 72 fine-tunes achieves 90.94% ImageNet top-1 accuracy, vs 90.74% for the best single fine-tune — +0.2pp "for free" by averaging weights. Source: arxiv:2203.05482 Table 2.
Open LLM Leaderboard — merged models dominate top-10 (2024)
Multiple top-10 entries on the HuggingFace Open LLM Leaderboard (circa early 2024) used mergekit to combine Mistral-7B or LLaMA-2-13B fine-tunes. Community observation documented in the Mergekit GitHub readme (arcee-ai/mergekit).
Key Takeaways
What to remember for interviews
1. Model merging combines fine-tuned checkpoints via weight-space arithmetic — zero GPU cost at merge time, same inference cost as a single model.
2. Linear mode connectivity is the prerequisite: both models must be fine-tuned from the same pre-trained base to share a loss basin.
3. SLERP preserves vector magnitude on the weight hypersphere; LERP chord-interpolation shrinks it, degrading performance.
4. TIES resolves inter-task interference by voting on the sign of each parameter delta before averaging, typically outperforming naive Task Arithmetic by 3–5pp on multi-task benchmarks.
5. DARE drops 90% of task-vector deltas at random and rescales survivors by 1/(1−p), enabling high-sparsity merges with near-zero accuracy loss.
6. Model Soups average checkpoint weights from nearby minima to reach flatter loss regions with better generalization than any single run.
Check Your Understanding
Why does SLERP work better than naive linear interpolation (LERP) when merging two fine-tuned models?
TIES-Merging trims the task vectors before combining them. What is the primary purpose of the trimming step?
DARE randomly drops parameter deltas and then rescales the survivors. Why is the rescaling step necessary?
Model Soups and SLERP both rely on linear mode connectivity. When does this property break down?
Interview Questions
★★☆ You have two Mistral-7B fine-tunes: one for code generation, one for math reasoning. Explain how TIES-Merging would combine them. Walk through all three steps — Trim, Elect Sign, Disjoint Merge — and describe what each step prevents.
★★☆ Why can you not use model merging to combine two models that were fine-tuned from different base checkpoints (e.g., Llama-3-8B and Mistral-7B-v0.3)? What is the theoretical reason?
★★★ A team merges five LoRA fine-tunes using Task Arithmetic (simple addition of task vectors with equal weight). Benchmark scores are below the best individual fine-tune. What went wrong, and how would you fix it?
★★☆ Explain Model Soups and why averaging checkpoints fine-tuned from the same base (greedy soup) can outperform any single checkpoint.
Further Reading
- Editing Models with Task Arithmetic (Ilharco et al., 2022) — Foundational paper showing that arithmetic on fine-tuned weight differences enables multi-task editing, forgetting, and analogy without retraining.
- TIES-Merging (Yadav et al., 2023) — Trim-Elect Sign-Disjoint Merge: three-step algorithm that cuts inter-task interference by resolving parameter sign conflicts before averaging.
- DARE: Language Model Merging by Random Delta Drop (Yu et al., 2023) — Drop 90% of task-vector deltas at random, rescale survivors by 1/(1-p), then merge — enables combining many models with near-zero accuracy loss.
- Model Soups (Wortsman et al., 2022) — Average fine-tuned checkpoint weights from the same pre-trained base to reach flatter loss regions and improve accuracy over any single run.
- Mergekit — toolkit for merging large language models — Production-grade open-source library implementing SLERP, TIES, DARE, Task Arithmetic, and Model Soups on HuggingFace models.
- Sebastian Raschka — Model Merging, Mixtures of Experts, and Towards Continual Learning — Accessible walkthrough of merge algorithms with intuitive diagrams and concrete Mergekit YAML examples.