🔧 Fine-tuning & LoRA
LoRA trains ~0.1% of parameters and matches full fine-tuning quality on most tasks
Full fine-tuning updates every parameter — expensive and risks catastrophic forgetting. LoRA adds tiny trainable low-rank matrices that adapt the model with a fraction of a percent of the parameters. QLoRA adds 4-bit quantization so you can fine-tune a 65B model on a single 48GB GPU (per the original paper; the technique generalizes to 70B-class models).
LoRA Architecture
What you’re seeing: a frozen pre-trained weight matrix W alongside the trainable low-rank decomposition BA — during the forward pass, the adapter output BA·x is added to the frozen W·x, so only the small A and B matrices are updated. What to try: adjust the rank slider in the configurator below to see how it trades off parameter count against expressiveness.
Configure LoRA settings and see the impact on memory and trainable parameters. Adjust rank, alpha, target modules, base model, and quantization to explore the tradeoffs.
| Metric | Value |
|---|---|
| Trainable Parameters | 16.78M |
| % of Total Params | 0.2397% |
| Model Weights | 14.0 GB |
| LoRA Adapters | 34 MB |
| Optimizer States | 137 MB |
| Activations (est.) | 1.6 GB |
| Total VRAM | 15.8 GB |
| Fits on GPU | 24GB: Yes · 48GB: Yes · 80GB: Yes |
The Intuition
Full fine-tuning updates all parameters. For a 70B model, that means 140GB of gradients and optimizer states — you need multiple A100s just for the optimizer. Worse, updating all params risks catastrophic forgetting: the model improves on your task but forgets general capabilities.
LoRA exploits a key insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full d×d matrix, LoRA adds ΔW = BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×d} with r ≪ d.
QLoRA goes further: quantize the frozen base model to 4-bit, keep LoRA adapters in FP16. The base model shrinks from 140GB to ~35GB, and only the tiny LoRA matrices need gradients. Result: fine-tune Llama-3 70B on a single 48GB GPU.
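The memory claim is easy to sanity-check with a back-of-envelope estimate. The sketch below is my own illustration (not from the QLoRA paper), assuming 0.5 bytes/param for 4-bit base weights and FP16 adapters with Adam optimizer state; it ignores activations, KV cache, and quantization overhead:

```python
def qlora_vram_gb(base_params_billions, lora_params_millions):
    """Rough QLoRA VRAM estimate.
    Base weights: 4-bit quantized = 0.5 bytes/param.
    Adapters: FP16 weights (2 B) + FP16 grads (2 B) + Adam states (8 B)."""
    base_gb = base_params_billions * 1e9 * 0.5 / 1e9
    adapter_gb = lora_params_millions * 1e6 * (2 + 2 + 8) / 1e9
    return base_gb + adapter_gb

# 70B base in 4-bit plus a 33M-param adapter stack
print(round(qlora_vram_gb(70, 33), 1))  # 35.4 (matches the ~35GB figure above)
```

The adapter contribution is under half a gigabyte, which is why only the base model's quantization level really matters for fitting on one GPU.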
Catastrophic Forgetting — Quantified: Full fine-tuning on a narrow task dataset can dramatically degrade performance on tasks the model was not fine-tuned on. Luo et al. (2023) showed that continual instruction tuning of BLOOMZ-7.1B substantially degraded MMLU accuracy. LoRA largely avoids this: because the base weights are frozen and the adapter adds only a low-rank correction, the original knowledge is structurally preserved, and LoRA-tuned models empirically retain far more of their general capabilities. This structural forgetting resistance is the second major reason (after memory cost) that LoRA has become the default fine-tuning method.
Worked Example — LoRA Parameter Count (Llama-2 7B)
Setup: d_model = 4096, 32 layers, LoRA applied to Q, K, V, O projections (4 matrices per layer), rank r = 16.
Per matrix: 2 × 4096 × 16 = 131,072 params
Per layer: 4 × 131,072 = 524,288 params
Total: 32 × 524,288 = 16,777,216 ≈ 16.8M params
Share of 7B base model: 16.8M / 7,000M ≈ 0.24% — yet matches full fine-tuning quality on most tasks.
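The worked example above can be reproduced in a few lines:

```python
d_model, n_layers, rank = 4096, 32, 16
targets_per_layer = 4  # LoRA applied to Q, K, V, O projections

per_matrix = 2 * d_model * rank      # A is r×d, B is d×r
per_layer = targets_per_layer * per_matrix
total = n_layers * per_layer

print(per_matrix, per_layer, total)   # 131072 524288 16777216
print(f"{total / 7e9:.2%}")           # 0.24% of a 7B base model
```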
DoRA — Weight-Decomposed LoRA: Liu et al. (2024) observed that LoRA updates tend to change the direction of weight vectors much more easily than their magnitude — but full fine-tuning updates both. DoRA explicitly decomposes each weight matrix into magnitude (a scalar per output channel) and direction (a unit-norm matrix), then applies LoRA only to the direction component while keeping the magnitude trainable directly. This better mimics the update pattern of full fine-tuning and produces consistent quality gains over plain LoRA with the same rank and no additional inference cost, since the decomposition can be merged back before deployment.
LoRA+ (2024): Hayou et al. observed that LoRA treats the A and B matrices symmetrically with the same learning rate, but they play different roles — A initializes with random Kaiming uniform values and projects input down, while B initializes to zero and projects back up. LoRA+ sets a higher learning rate for B than A (typically a fixed ratio of ~16x), since B is closer to the output and starts from zero. This asymmetric LR schedule better matches the gradient flow dynamics of full fine-tuning. The practical result: ~2% quality improvement on standard benchmarks with no additional parameters or inference overhead — just a one-line hyperparameter change.
Spectrum (2024): Rather than applying LoRA uniformly to all layers, Spectrum uses Signal-to-Noise Ratio (SNR) per layer to decide which layers to fine-tune. The intuition: layers with high SNR (signal dominates noise in their weight spectra) carry more learned structure and benefit more from adaptation, while low-SNR layers are noisier and contribute less. Spectrum computes SNR from the singular value decomposition of each weight matrix — high singular values relative to the noise floor indicate high SNR. Fine-tuning only the top-SNR layers reduces compute by 30–50% with minimal quality loss, and often outperforms full-layer LoRA because it avoids adapting layers that aren't meaningfully task-relevant.
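The SNR idea can be sketched with plain numpy. This is a simplified stand-in for Spectrum's actual criterion (the paper derives its noise floor from random matrix theory; the median split below is an illustrative shortcut of mine):

```python
import numpy as np

def layer_snr(w, noise_quantile=0.5):
    """Crude SNR proxy: energy in large singular values vs the rest.
    Spectrum uses a principled noise floor; here we simply split at
    the median singular value for illustration."""
    s = np.linalg.svd(w, compute_uv=False)
    floor = np.quantile(s, noise_quantile)
    signal = np.sum(s[s > floor] ** 2)
    noise = np.sum(s[s <= floor] ** 2) + 1e-12
    return signal / noise

rng = np.random.default_rng(0)
structured = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))  # low-rank, strong signal
noisy = rng.normal(size=(256, 256))                                 # pure noise

print(layer_snr(structured) > layer_snr(noisy))  # True, the structured layer ranks higher
```

A real implementation would rank all layers by this score and unfreeze only the top fraction.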
Quick check
Why does LoRA use r ≪ d instead of training the full d×d weight matrix directly?
Method Comparison
| Metric | Full FT | LoRA | QLoRA | DoRA | Prefix Tuning |
|---|---|---|---|---|---|
| Trainable params | 100% | 0.01-1% | 0.01-1% | 0.01-1% | ~0.001% |
| GPU memory (7B) | ~112GB | ~16GB | ~10GB | ~16GB | ~14GB |
| Quality vs full FT | 100% | ~98% | ~97% | ~99%+ | ~90% |
| Inference overhead | None | None (merged) | None (merged) | None (merged) | Extra tokens |
| Multi-tenant | No | Yes (hot-swap) | Yes | Yes | Yes |
LoRA with rank r=16 on a 4096x4096 weight matrix trains how many parameters?
Step-by-Step Derivation
LoRA: Low-Rank Adaptation
Freeze the pre-trained weight W0 and add a low-rank update ΔW = BA. During inference, merge ΔW into W0 — zero additional latency:
Original forward pass (frozen): h = W0·x
LoRA forward pass (only A and B train): h = W0·x + (α/r)·BA·x
Merged at inference (zero latency overhead): W′ = W0 + (α/r)·BA, so h = W′·x
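The merged-weights equivalence is easy to verify numerically. A small numpy check with toy dimensions of my choosing:

```python
import numpy as np
rng = np.random.default_rng(0)

d, r, alpha = 64, 8, 16
scale = alpha / r
W0 = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d)) * 0.01  # down-projection (trained)
B = rng.normal(size=(d, r))         # up-projection (trained)
x = rng.normal(size=(d,))

h_adapter = W0 @ x + scale * (B @ (A @ x))  # LoRA forward pass
W_merged = W0 + scale * (B @ A)             # fold adapter into base
h_merged = W_merged @ x                     # single matmul at inference

print(np.allclose(h_adapter, h_merged))  # True
```

Since W′·x = (W0 + (α/r)BA)·x distributes exactly, merging changes nothing about the outputs, only the number of matmuls.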
Parameter Savings
Full matrix: d² parameters. LoRA: 2dr parameters. Concrete numbers for d = 4096:
At r=8 the adapter is 65K params vs 16.7M — a 256× reduction. At r=64 it's 524K (3.1%), still 32× cheaper:
| Rank r | Trainable params | % of d×d |
|---|---|---|
| r=8 | 65,536 | 0.4% |
| r=16 | 131,072 | 0.8% |
| r=64 | 524,288 | 3.1% |
| full | 16,777,216 | 100% |
PyTorch: LoRA Layer (rank=8, alpha=16)
class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad_(False)  # freeze base
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)  # B=0 → adapter is a no-op at init (ΔW=0)
        self.scale = alpha / rank

    def forward(self, x):
        return self.W(x) + self.B(self.A(x)) * self.scale

PyTorch: LoRA Layer (with merge for inference)
class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=16, alpha=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad = False  # freeze base
        self.A = nn.Linear(in_dim, rank, bias=False)   # down-project
        self.B = nn.Linear(rank, out_dim, bias=False)  # up-project
        self.scale = alpha / rank
        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)  # start at zero delta

    def forward(self, x):
        return self.W(x) + self.B(self.A(x)) * self.scale

    def merge(self):
        """Merge adapter into base weight — zero inference overhead."""
        self.W.weight.data += (self.B.weight @ self.A.weight) * self.scale
        self.W.weight.requires_grad = False
        # After merge, A and B can be discarded
Practical: Fine-tuning Workflow
1. Data Preparation: Format as instruction/response pairs in the model's chat template. Mask instruction tokens in the loss — train only on responses.
2. OpenAI Fine-tuning API: Upload JSONL with {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. Cost: ~$8/1M training tokens (GPT-3.5-turbo era pricing; check openai.com/pricing for current rates — newer models are priced per training hour).
3. HuggingFace + QLoRA: Use peft library with BitsAndBytesConfig(load_in_4bit=True). Set r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"].
4. Evaluation: Track train/val loss, run qualitative evals every N steps. Watch for train-val divergence (overfitting signal). Merge adapters for production serving.
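Step 1's loss masking is worth seeing concretely. A minimal sketch using the -100 ignore-index convention (PyTorch's cross-entropy and the Hugging Face collators skip these positions); the token IDs below are made up:

```python
IGNORE_INDEX = -100  # positions with this label contribute zero loss

def mask_instruction(input_ids, prompt_len):
    """Labels for response-only SFT: copy input_ids, but replace every
    instruction token with IGNORE_INDEX so the loss is computed only
    on the assistant's response tokens."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# toy example: first 4 tokens are the instruction, the rest is the response
ids = [101, 2054, 2003, 1996, 3437, 2003, 102]
labels = mask_instruction(ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 3437, 2003, 102]
```

Without this mask the model is also trained to reproduce instructions, which wastes capacity and can degrade response quality.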
Python: QLoRA Setup with Hugging Face PEFT
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch
# 1. Load base model in 4-bit (NF4 quantization)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 — best for LLMs
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # nested quantization saves ~0.4 GB
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# 2. Attach LoRA adapters in BF16
lora_config = LoraConfig(
r=16, # rank — start here, tune if needed
lora_alpha=32, # scale = alpha/r = 2.0
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 16,777,216 || all params: 6,742,609,920 || trainable%: 0.2489
# 3. Train with SFTTrainer (trl library)
# 4. Merge + save for production
model = model.merge_and_unload()  # folds BA into W0, removes adapter

Choosing Rank r — Decision Guide
| Task type | Recommended rank | Rationale |
|---|---|---|
| Style / format change | r=4–8 | Low-rank subspace sufficient for surface changes |
| Instruction following | r=16 | Standard choice; matches most published results |
| Domain adaptation | r=32–64 | Larger shift requires more expressive adapter |
| Code / math (large shift) | r=64–128 | High intrinsic dimensionality of task; validate on held-out set |
Rule of thumb: start at r=16, double if val loss plateaus, halve if train/val diverge. Never go above r=128 without profiling memory — beyond that you are paying full fine-tuning cost.
Quick check
The LoRA scaling factor alpha/r is set to alpha=32, r=16, giving scale=2.0. What happens if you double the rank to r=32 but keep alpha=32?
Quick check
If LoRA is applied only to attention projection matrices (Q, K, V, O) and not to FFN layers, which fine-tuning task is most likely to suffer?
Real-World Numbers
Method Comparison (7B model)
| Method | Trainable % | Memory | Quality vs Full FT |
|---|---|---|---|
| Full fine-tune | 100% | 4× model size | Baseline |
| LoRA r=8 | 0.1–0.4% | 1.1× model size | 95–100% |
| LoRA r=64 | 0.8–3% | 1.3× model size | 98–100% |
| QLoRA (4-bit) | 0.1% | 0.3× model size | 93–97% |
Concrete GPU Requirements
| Setup | Trainable Params | GPU Memory |
|---|---|---|
| Llama-2 7B full FT | 7B (100%) | ~56GB (multi-GPU) |
| Llama-2 7B LoRA r=16 | 4.2M (0.06%) | ~16GB (single GPU) |
| Llama-2 7B QLoRA r=16 | 4.2M (0.06%) | ~6GB (single GPU) |
| Llama-2 70B QLoRA r=16 | 33M (0.05%) | fits a single 48GB GPU |
| OpenAI fine-tuning API | Unknown (managed) | ~$8/1M training tokens (GPT-3.5-turbo era; see openai.com/pricing) |
SFT Dataset Reference Points
Production SFT pipelines are smaller than you might expect. InstructGPT's supervised fine-tuning stage used roughly 13K labeler-written demonstrations, followed by LIMA (2023), which used just 1,000 carefully curated examples. Both demonstrate that data quality and labeler expertise matter far more than raw example count.
Data Quality vs. Quantity — LIMA (2023)
Zhou et al. (2023) showed that 1,000 carefully curated examples can match models trained on 52K examples (Alpaca) in human preference evals. The implication for fine-tuning strategy: invest in data curation — format consistency, response quality, instruction diversity — rather than raw volume. A small, high-quality dataset beats a large noisy one.
Key Takeaways
What to remember for interviews
1. LoRA exploits the low intrinsic rank of weight updates — trains B·A where r ≪ d
2. QLoRA quantizes the base model to 4-bit, keeping LoRA adapters in FP16
3. DoRA decomposes weights into magnitude + direction, applying LoRA only to direction
4. LoRA adapters are tiny (10-50MB) and can be hot-swapped at serving time
5. Always validate on a held-out set — if train loss drops but val loss doesn't, reduce rank
Recap quiz
LoRA rank r=16 is applied to a single 4096×4096 attention projection. How many trainable parameters does this add?
QLoRA lets you fine-tune a 65B model on one 48 GB GPU. Which combination of techniques makes this possible?
LIMA showed 1,000 curated examples match Alpaca (52K examples) on human preference evals. What does this imply for fine-tuning data strategy?
A team is fine-tuning a 7B model on 500 code-generation examples. They see train loss drop to 0.1 but val loss plateau at 1.8 after epoch 2. What is the most likely fix?
LoRA initializes B=0 and A with Kaiming uniform. What is the effect at the start of training?
Why can't you full fine-tune a 70B model on a single 48 GB GPU, even with FP16 mixed precision?
Further Reading
- LoRA: Low-Rank Adaptation of Large Language Models — The original LoRA paper — freeze base weights, train low-rank decomposition matrices A and B.
- QLoRA: Efficient Finetuning of Quantized LLMs — 4-bit NormalFloat quantization + LoRA adapters, enabling 65B model fine-tuning on a single 48GB GPU.
- PEFT: Parameter-Efficient Fine-Tuning (Hugging Face docs) — Hugging Face library implementing LoRA, prefix tuning, prompt tuning, and other PEFT methods.
- Lilian Weng's Blog — In-depth posts on fine-tuning methods, PEFT variants, and alignment techniques.
- The Power of Scale for Parameter-Efficient Prompt Tuning — Lester et al. 2021 — prompt tuning trains only soft prompt tokens while freezing the entire model. Matches full fine-tuning at 11B scale.
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning — Xu et al. 2023 — comprehensive survey comparing LoRA, adapters, prefix tuning, and prompt tuning across benchmarks.
- Andrej Karpathy — The State of GPT (Microsoft Build 2023) — Covers the SFT and RLHF fine-tuning pipeline end-to-end, including practical data requirements and training tips.
Interview Questions
- Explain LoRA. Why does it work despite training so few parameters? ★★☆
- What is QLoRA and how does it enable fine-tuning 70B models on a single GPU? ★★★
- How do you choose the LoRA rank r? What are the tradeoffs? ★★☆
- Compare full fine-tuning, LoRA, and prompt tuning. When would you use each? ★☆☆
- How do you prepare data for fine-tuning? What are common pitfalls? ★★☆
- How do you detect and prevent overfitting during fine-tuning? ★★☆