
Transformer Math

Module 17 · Training

🔧 Fine-tuning & LoRA

LoRA trains 0.1% of parameters and matches full fine-tuning


Full fine-tuning updates every parameter — expensive and risks catastrophic forgetting. LoRA adds tiny trainable matrices that adapt the model while training well under 1% of its parameters. QLoRA adds 4-bit quantization so you can fine-tune a 65B model on a single 48 GB GPU (per the original paper; the technique generalizes to 70B-class models).

🎮

LoRA Architecture

What you’re seeing: a frozen pre-trained weight matrix W alongside the trainable low-rank decomposition BA — during the forward pass, the adapter output BA·x is added to the frozen W·x, so only the small A and B matrices are updated. What to try: adjust the rank slider in the configurator below to see how it trades off parameter count against expressiveness.

[Diagram] LoRA: Low-Rank Adaptation — the input x passes through the frozen weight W (d×d, ❄️) and through the trainable pair A (r×d) and B (d×r, 🔥); the two outputs are summed, so W' = W + BA. Only A and B are trained (r << d) — e.g., d=4096, r=16 is ~0.8% of the parameters of a single matrix.

Configure LoRA settings and see the impact on memory and trainable parameters. Adjust rank, alpha, target modules, base model, and quantization to explore the tradeoffs.

Metric | Value
Trainable Parameters | 16.78M
% of Total Params | 0.2397%
Model Weights | 14.0 GB
LoRA Adapters | 34 MB
Optimizer States | 137 MB
Activations (est.) | 1.6 GB
Total VRAM | 15.8 GB
Fits on GPU | 24GB: Yes · 48GB: Yes · 80GB: Yes
💡

The Intuition

Full fine-tuning updates all parameters. For a 70B model, that means ~140 GB just for FP16 gradients, plus several times that for Adam optimizer states — you need multiple A100s for the optimizer alone. Worse, updating all parameters risks catastrophic forgetting: the model improves on your task but forgets general capabilities.

LoRA exploits a key insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full matrix, LoRA adds ΔW = BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×d), with r ≪ d.

QLoRA goes further: quantize the frozen base model to 4-bit, keep LoRA adapters in FP16. The base model shrinks from 140GB to ~35GB, and only the tiny LoRA matrices need gradients. Result: fine-tune Llama-3 70B on a single 48GB GPU.
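A rough back-of-the-envelope for where that memory goes — an illustrative sketch only, assuming LoRA rank 16 on four d=8192 projections per layer (Llama-class 70B models use GQA, so the real adapter is somewhat smaller) and ignoring framework overhead:

Python: QLoRA Memory Back-of-the-Envelope (70B)

python
# Illustrative QLoRA memory estimate for a 70B model — assumptions, not measurements
params = 70e9
base_4bit = params * 0.5 / 1e9            # 4-bit frozen base weights ≈ 35 GB
lora_params = 80 * 4 * 2 * 8192 * 16      # 80 layers × 4 projections × 2dr ≈ 84M
adapters_fp16 = lora_params * 2 / 1e9     # BF16/FP16 adapters ≈ 0.17 GB
optimizer_fp32 = lora_params * 8 / 1e9    # Adam m+v in FP32 ≈ 0.67 GB
print(f"base {base_4bit:.0f} GB, adapters {adapters_fp16:.2f} GB, optimizer {optimizer_fp32:.2f} GB")
# Activations, quantization constants, and CUDA overhead add several more GB —
# which is why the practical target is a 48 GB GPU, not 36 GB.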

✨ Insight · Think of LoRA as adding a small correction lens to a telescope. The main mirror (pre-trained weights) stays fixed. The correction lens (LoRA matrices) is tiny but precisely shaped to fix the specific aberration (task adaptation) you care about.

Catastrophic Forgetting — Quantified: Full fine-tuning on a narrow task dataset can dramatically degrade performance on tasks the model was not fine-tuned on. Luo et al. (2023) showed that continual instruction tuning of BLOOMZ-7.1B substantially degraded MMLU accuracy. LoRA largely avoids this: because the base weights are frozen and the adapter adds only a low-rank correction, the original knowledge is structurally preserved. Empirically, LoRA-tuned models retain far more of their general-benchmark performance than fully fine-tuned ones. This structural forgetting resistance is the second major reason (after memory cost) that LoRA has become the default fine-tuning method.

Worked Example — LoRA Parameter Count (Llama-2 7B)

Setup: d_model = 4096, 32 layers, LoRA applied to Q, K, V, O projections (4 matrices per layer), rank r = 16.

Per matrix: 2 × 4096 × 16 = 131,072 params

Per layer: 4 × 131,072 = 524,288 params

Total: 32 × 524,288 = 16,777,216 ≈ 16.8M params

Share of the 7B model: 16.8M / 7,000M ≈ 0.24% — yet LoRA matches full fine-tuning quality on most tasks.
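The same arithmetic as a short script (values match the worked example above):

Python: LoRA Parameter Count

python
d_model, n_layers, rank = 4096, 32, 16
per_matrix = 2 * d_model * rank        # A (r×d) + B (d×r) = 131,072
per_layer = 4 * per_matrix             # Q, K, V, O projections = 524,288
total = n_layers * per_layer           # 16,777,216 ≈ 16.8M
print(f"{total:,} trainable params = {total / 7e9:.2%} of a 7B model")  # ~0.24%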

DoRA — Weight-Decomposed LoRA: Liu et al. (2024) observed that LoRA updates tend to change the direction of weight vectors much more easily than their magnitude — but full fine-tuning updates both. DoRA explicitly decomposes each weight matrix into magnitude (a scalar per output channel) and direction (a unit-norm matrix), then applies LoRA only to the direction component while keeping the magnitude trainable directly. This better mimics the update pattern of full fine-tuning and produces consistent quality gains over plain LoRA with the same rank and no additional inference cost, since the decomposition can be merged back before deployment.
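A minimal sketch of the DoRA idea, not the paper's reference implementation — the exact normalization, initialization, and merge details differ:

PyTorch: DoRA-style Layer (sketch)

python
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Sketch: per-channel magnitude + unit-norm direction, with LoRA on the direction."""
    def __init__(self, in_dim, out_dim, rank=16, alpha=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad_(False)          # frozen base (copy pre-trained weights in here)
        # trainable magnitude: one scalar per output channel, init from the frozen row norms
        self.m = nn.Parameter(self.W.weight.norm(dim=1, keepdim=True).clone())
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)                # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        delta = (self.B.weight @ self.A.weight) * self.scale   # low-rank directional update
        W_dir = self.W.weight + delta
        W_dir = W_dir / W_dir.norm(dim=1, keepdim=True)        # unit-norm direction per channel
        return F.linear(x, self.m * W_dir)                     # re-apply the learned magnitude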

LoRA+ (2024): Hayou et al. observed that LoRA treats the A and B matrices symmetrically with the same learning rate, but they play different roles — A initializes with random Kaiming uniform values and projects input down, while B initializes to zero and projects back up. LoRA+ sets a higher learning rate for B than A (typically a fixed ratio of ~16x), since B is closer to the output and starts from zero. This asymmetric LR schedule better matches the gradient flow dynamics of full fine-tuning. The practical result: ~2% quality improvement on standard benchmarks with no additional parameters or inference overhead — just a one-line hyperparameter change.
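In code this is just optimizer parameter groups — a sketch that assumes the LoRALinear naming (attributes A and B) used in the implementations later in this section; the 16× ratio is the paper's suggested default, not a universal constant:

PyTorch: LoRA+ Learning-Rate Split

python
def lora_plus_param_groups(model, base_lr=2e-4, ratio=16):
    """Assign B matrices a higher learning rate than A matrices (LoRA+ heuristic)."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue                                  # skip frozen base weights
        if ".A." in name or name.endswith("A.weight"):
            a_params.append(p)
        elif ".B." in name or name.endswith("B.weight"):
            b_params.append(p)
    return [
        {"params": a_params, "lr": base_lr},          # down-projection A: base LR
        {"params": b_params, "lr": base_lr * ratio},  # up-projection B: ~16× higher LR
    ]

# usage: optimizer = torch.optim.AdamW(lora_plus_param_groups(model), weight_decay=0.0)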

Spectrum (2024): Rather than applying LoRA uniformly to all layers, Spectrum uses Signal-to-Noise Ratio (SNR) per layer to decide which layers to fine-tune. The intuition: layers with high SNR (signal dominates noise in their weight spectra) carry more learned structure and benefit more from adaptation, while low-SNR layers are noisier and contribute less. Spectrum computes SNR from the singular value decomposition of each weight matrix — high singular values relative to noise floor indicate high SNR. Fine-tuning only the top-SNR layers reduces compute by 30–50% with minimal quality loss, and often outperforms full-layer LoRA because it avoids adapting layers that aren't meaningfully task-relevant.
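A rough illustration of the layer-selection idea — score each weight matrix with a simple SVD-based signal measure and leave only the top fraction trainable. The actual Spectrum SNR estimator is more principled, so treat this purely as a sketch:

Python: SNR-Based Layer Selection (sketch)

python
import torch

def snr_proxy(weight: torch.Tensor, frac: float = 0.25) -> float:
    """Crude SNR proxy: mean of the largest singular values over the smallest ones."""
    s = torch.linalg.svdvals(weight.float())   # returned in descending order
    k = max(1, int(len(s) * frac))
    return (s[:k].mean() / s[-k:].mean()).item()

def select_trainable(named_weights, keep_frac: float = 0.3):
    """Return names of the highest-scoring matrices; freeze everything else."""
    scored = sorted(named_weights, key=lambda kv: snr_proxy(kv[1]), reverse=True)
    n_keep = max(1, int(len(scored) * keep_frac))
    return {name for name, _ in scored[:n_keep]}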

Quick check

Trade-off

Why does LoRA use r ≪ d instead of training the full d×d weight matrix directly?

📋

Method Comparison

Metric | Full FT | LoRA | QLoRA | DoRA | Prefix Tuning
Trainable params | 100% | 0.01–1% | 0.01–1% | 0.01–1% | ~0.001%
GPU memory (7B) | ~112GB | ~16GB | ~10GB | ~16GB | ~14GB
Quality vs full FT | 100% | ~98% | ~97% | ~99%+ | ~90%
Inference overhead | None | None (merged) | None (merged) | None (merged) | Extra tokens
Multi-tenant | No | Yes (hot-swap) | Yes | Yes | Yes
Quick Check

LoRA with rank r=16 on a 4096x4096 weight matrix trains how many parameters?

📐

Step-by-Step Derivation

LoRA: Low-Rank Adaptation

Freeze the pre-trained weight W₀ ∈ ℝ^(d×d) and add a low-rank update ΔW = BA, with B ∈ ℝ^(d×r) and A ∈ ℝ^(r×d). During inference, merge ΔW into W₀ — zero additional latency:

Original forward pass (frozen): h = W₀x

LoRA forward pass (only A and B train): h = W₀x + (α/r)·BAx

Merged at inference (zero latency overhead): W' = W₀ + (α/r)·BA, so h = W'x

✨ Insight · B initializes to zero so the adapter starts as a no-op (ΔW = BA = 0) — training begins from the pre-trained model's exact output, not a random perturbation.

Parameter Savings

Full matrix: d² parameters. LoRA: 2dr parameters. Concrete numbers for d = 4096:

At r = 8 the adapter is 65K params vs 16.7M — a 256× reduction. At r = 64 it's 524K (~3%), still 32× cheaper:

Rank r | Trainable params | % of d×d
r=8 | 65,536 | 0.4%
r=16 | 131,072 | 0.8%
r=64 | 524,288 | 3.1%
full | 16,777,216 | 100%
💡 Tip · LoRA is typically applied to the attention projection matrices (Q, K, V, O). For a model with L layers and 4 projections each, total trainable params = L × 4 × 2dr.

PyTorch: LoRA Layer (rank=8, alpha=16)

python
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad_(False)  # Freeze base
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)  # B=0 → adapter is a no-op at init (ΔW=0)
        self.scale = alpha / rank

    def forward(self, x):
        return self.W(x) + self.B(self.A(x)) * self.scale
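
One way to wire this into an existing model — a hypothetical helper that swaps matching nn.Linear projections for LoRALinear; the module names (q_proj, v_proj) are an assumption about the model's naming scheme, and it assumes bias-free projections as in Llama-style attention:

PyTorch: Injecting LoRA into Attention Projections

python
import torch.nn as nn

def inject_lora(model, target_names=("q_proj", "v_proj"), rank=8, alpha=16):
    """Replace selected nn.Linear layers with LoRALinear, preserving the pre-trained weights."""
    for module in model.modules():
        for child_name, child in list(module.named_children()):
            if child_name in target_names and isinstance(child, nn.Linear):
                lora = LoRALinear(child.in_features, child.out_features, rank, alpha)
                lora.W.weight.data.copy_(child.weight.data)   # keep the frozen base weights
                setattr(module, child_name, lora)
    return model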

PyTorch: LoRA Layer (with merge for inference)

python
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=16, alpha=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad = False  # freeze base

        self.A = nn.Linear(in_dim, rank, bias=False)   # down-project
        self.B = nn.Linear(rank, out_dim, bias=False)   # up-project
        self.scale = alpha / rank

        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)  # start at zero delta

    def forward(self, x):
        return self.W(x) + self.B(self.A(x)) * self.scale

    def merge(self):
        """Merge adapter into base weight — zero inference overhead."""
        self.W.weight.data += (self.B.weight @ self.A.weight) * self.scale
        self.W.weight.requires_grad = False
        # After merge, A and B can be discarded
PyTorch implementation
# LoRA: inject low-rank adapters into attention projections
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear with LoRA adaptation."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)
        self.W0.weight.requires_grad_(False)      # freeze base weights
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)             # B=0: adapter contributes zero update at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.W0(x) + self.B(self.A(x)) * self.scale  # W0 + BA

    def merge(self):
        """Merge into W0 for zero-overhead inference."""
        self.W0.weight.data += (self.B.weight @ self.A.weight) * self.scale

Practical: Fine-tuning Workflow

1. Data Preparation: Format as instruction/response pairs in the model's chat template. Mask instruction tokens in the loss — train only on responses (see the masking sketch after this list).

2. OpenAI Fine-tuning API: Upload JSONL with {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. Cost: ~$8/1M training tokens (GPT-3.5-turbo era pricing; check openai.com/pricing for current rates — newer models are priced per training hour).

3. HuggingFace + QLoRA: Use peft library with BitsAndBytesConfig(load_in_4bit=True). Set r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"].

4. Evaluation: Track train/val loss, run qualitative evals every N steps. Watch for train-val divergence (overfitting signal). Merge adapters for production serving.
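
A minimal sketch of the loss masking from step 1, assuming a Hugging Face tokenizer — prompt tokens get the label -100 so cross-entropy is computed only on response tokens:

Python: Masking Instruction Tokens in the Loss

python
IGNORE_INDEX = -100  # positions with this label are ignored by PyTorch cross-entropy

def build_example(tokenizer, prompt: str, response: str, max_len: int = 2048):
    """Tokenize prompt + response; keep the loss only on the response."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}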

Python: QLoRA Setup with Hugging Face PEFT

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# 1. Load base model in 4-bit (NF4 quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # nested quantization saves ~0.4 GB
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# 2. Attach LoRA adapters in BF16
lora_config = LoraConfig(
    r=16,                             # rank — start here, tune if needed
    lora_alpha=32,                    # scale = alpha/r = 2.0
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 16,777,216 || all params: 6,742,609,920 || trainable%: 0.2489

# 3. Train with SFTTrainer (trl library)
# 4. Merge + save for production
model = model.merge_and_unload()   # folds BA into W0, removes adapter

Choosing Rank r — Decision Guide

Task type | Recommended rank | Rationale
Style / format change | r=4–8 | Low-rank subspace sufficient for surface changes
Instruction following | r=16 | Standard choice; matches most published results
Domain adaptation | r=32–64 | Larger shift requires more expressive adapter
Code / math (large shift) | r=64–128 | High intrinsic dimensionality of task; validate on held-out set

Rule of thumb: start at r=16, double if val loss plateaus, halve if train and val loss diverge. Profile memory before going above r=128 — at very high ranks the adapter's parameter count and optimizer state become a meaningful fraction of what full fine-tuning would cost.

Quick check

Derivation

The LoRA scaling factor alpha/r is set to alpha=32, r=16, giving scale=2.0. What happens if you double the rank to r=32 but keep alpha=32?

🔧

Break It — See What Happens

Set rank too low (r=1)
Set rank too high (r=512)
Full fine-tune on small dataset (1K examples)
LoRA rank too high (r=256)
Apply LoRA only to attention layers (skip FFN)

Quick check

Trade-off

If LoRA is applied only to attention projection matrices (Q, K, V, O) and not to FFN layers, which fine-tuning task is most likely to suffer?

📊

Real-World Numbers

Method Comparison (7B model)

Method | Trainable % | Memory | Quality vs Full FT
Full fine-tune | 100% | 4× model size | Baseline
LoRA r=8 | 0.1–0.4% | 1.1× model size | 95–100%
LoRA r=64 | 0.8–3% | 1.3× model size | 98–100%
QLoRA (4-bit) | 0.1% | 0.3× model size | 93–97%

Concrete GPU Requirements

Setup | Trainable Params | GPU Memory / Cost
Llama-2 7B full FT | 7B (100%) | ~56GB (multi-GPU)
Llama-2 7B LoRA r=16 | 4.2M (0.06%) | ~16GB (single GPU)
Llama-2 7B QLoRA r=16 | 4.2M (0.06%) | ~6GB (single GPU)
Llama-2 70B QLoRA r=16 | 33M (0.05%) | ~48GB (single GPU)
OpenAI fine-tuning API | Unknown (managed) | ~$8/1M training tokens (GPT-3.5-turbo era; see openai.com/pricing)
✨ Insight · Llama-2 7B full fine-tuning requires ~112GB VRAM with standard mixed-precision training (FP16 weights + FP32 Adam states). The ~56GB figure in the table assumes 2-GPU FSDP sharding. QLoRA cuts single-GPU memory to ~6GB — the same task on a consumer RTX 4090. For 70B models: full FT needs ~560GB (8×A100), QLoRA fits in ~48GB (1×A100).
💡 Tip · LoRA adapters are tiny files (10–50MB) that can be hot-swapped at serving time. This enables multi-tenant setups: one base model, many task-specific adapters loaded on demand. Because the expensive base weights are shared across all tenants and only the small adapters differ, per-customer fine-tuned models become economically viable.
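
The serving-side pattern looks roughly like this with the peft library — load_adapter and set_adapter are real PEFT APIs, but the adapter names and paths here are hypothetical:

Python: Hot-Swapping LoRA Adapters (PEFT)

python
from peft import PeftModel

# base_model: an already-loaded AutoModelForCausalLM shared by all tenants
model = PeftModel.from_pretrained(base_model, "adapters/customer_a", adapter_name="customer_a")
model.load_adapter("adapters/customer_b", adapter_name="customer_b")

model.set_adapter("customer_a")   # route tenant A's requests through adapter A
# ... generate for customer A ...
model.set_adapter("customer_b")   # switch tenants without reloading the multi-GB base model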

SFT Dataset Reference Points

Production SFT pipelines are smaller than you might expect. InstructGPT's supervised fine-tuning stage used roughly 13K labeler-written demonstrations, followed a year later by LIMA's 1,000 hand-curated examples. Both demonstrate that data quality and labeler expertise matter far more than raw example count.

Data Quality vs. Quantity — LIMA (2023)

Zhou et al. (2023) showed that a 65B LLaMA model fine-tuned on just 1,000 carefully curated examples matched Alpaca (trained on 52K examples) on human preference evals. The implication for fine-tuning strategy: invest in data curation — format consistency, response quality, instruction diversity — rather than raw volume. A small, high-quality dataset beats a large noisy one.

🧠

Key Takeaways

What to remember for interviews

  1. LoRA exploits the low intrinsic rank of weight updates — trains B·A where r << d
  2. QLoRA quantizes the base model to 4-bit, keeping LoRA adapters in FP16
  3. DoRA decomposes weights into magnitude + direction, applying LoRA only to direction
  4. LoRA adapters are tiny (10-50MB) and can be hot-swapped at serving time
  5. Always validate on a held-out set — if train loss drops but val loss doesn't, reduce rank
🧠

Recap quiz

Derivation

LoRA rank r=16 is applied to a single 4096×4096 attention projection. How many trainable parameters does this add?

Trade-off

QLoRA lets you fine-tune a 65B model on one 48 GB GPU. Which combination of techniques makes this possible?

Trade-off

LIMA showed 1,000 curated examples match Alpaca (52K examples) on human preference evals. What does this imply for fine-tuning data strategy?

Trade-off

A team is fine-tuning a 7B model on 500 code-generation examples. They see train loss drop to 0.1 but val loss plateau at 1.8 after epoch 2. What is the most likely fix?

Derivation

LoRA initializes B=0 and A with Kaiming uniform. What is the effect at the start of training?

Derivation

Why can't you full fine-tune a 70B model on a single 48 GB GPU, even with FP16 mixed precision?

📚

Further Reading

🎯

Interview Questions


Explain LoRA. Why does it work despite training so few parameters?

★★☆
Meta · OpenAI

What is QLoRA and how does it enable fine-tuning 70B models on a single GPU?

★★★
Meta · Databricks

How do you choose the LoRA rank r? What are the tradeoffs?

★★☆
OpenAI · Meta

Compare full fine-tuning, LoRA, and prompt tuning. When would you use each?

★☆☆
Google · OpenAI

How do you prepare data for fine-tuning? What are common pitfalls?

★★☆
OpenAI · Databricks

How do you detect and prevent overfitting during fine-tuning?

★★☆
Anthropic · OpenAI