
Transformer Math

Module 17 · Training

🔧 Fine-tuning & LoRA

LoRA trains 0.1% of parameters and matches full fine-tuning


Full fine-tuning updates every parameter — expensive and risks catastrophic forgetting. LoRA adds tiny trainable matrices that adapt the model while training well under 1% of its parameters. QLoRA adds 4-bit quantization so you can fine-tune a 65B model on a single 48 GB GPU (per the original paper; the technique generalizes to 70B-class models).

🎮

LoRA Architecture

What you’re seeing: a frozen pre-trained weight matrix W alongside the trainable low-rank decomposition BA — during the forward pass, the adapter output BA·x is added to the frozen W·x, so only the small A and B matrices are updated. What to try: adjust the rank slider in the configurator below to see how it trades off parameter count against expressiveness.

[Diagram] LoRA: Low-Rank Adaptation — the input x passes through the frozen weight W (d×d, ❄️) and through the trainable pair A (r×d) and B (d×r, 🔥); the two outputs are summed, so W' = W + BA. Only A and B are trained (r << d) — e.g., d=4096, r=16 is ~0.8% of the parameters of a single matrix.

Configure LoRA settings and see the impact on memory and trainable parameters. Adjust rank, alpha, target modules, base model, and quantization to explore the tradeoffs.

Metric | Value
Trainable Parameters | 16.78M
% of Total Params | 0.2397%
Model Weights | 14.0 GB
LoRA Adapters | 34 MB
Optimizer States | 137 MB
Activations (est.) | 1.6 GB
Total VRAM | 15.8 GB
Fits on GPU | 24GB: Yes · 48GB: Yes · 80GB: Yes
💡

The Intuition

Full fine-tuning updates all parameters. For a 70B model, that means ~140 GB just for FP16 gradients, plus several times that for Adam optimizer states — you need multiple A100s for the optimizer alone. Worse, updating all parameters risks catastrophic forgetting: the model improves on your task but forgets general capabilities.

LoRA exploits a key insight: the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full matrix, LoRA adds ΔW = BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×d), with r ≪ d.

QLoRA goes further: quantize the frozen base model to 4-bit, keep LoRA adapters in FP16. The base model shrinks from 140GB to ~35GB, and only the tiny LoRA matrices need gradients. Result: fine-tune Llama-3 70B on a single 48GB GPU.
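A rough back-of-the-envelope for where that memory goes — an illustrative sketch only, assuming LoRA rank 16 on four d=8192 projections per layer (Llama-class 70B models use GQA, so the real adapter is somewhat smaller) and ignoring framework overhead:

Python: QLoRA Memory Back-of-the-Envelope (70B)

python
# Illustrative QLoRA memory estimate for a 70B model — assumptions, not measurements
params = 70e9
base_4bit = params * 0.5 / 1e9            # 4-bit frozen base weights ≈ 35 GB
lora_params = 80 * 4 * 2 * 8192 * 16      # 80 layers × 4 projections × 2dr ≈ 84M
adapters_fp16 = lora_params * 2 / 1e9     # BF16/FP16 adapters ≈ 0.17 GB
optimizer_fp32 = lora_params * 8 / 1e9    # Adam m+v in FP32 ≈ 0.67 GB
print(f"base {base_4bit:.0f} GB, adapters {adapters_fp16:.2f} GB, optimizer {optimizer_fp32:.2f} GB")
# Activations, quantization constants, and CUDA overhead add several more GB —
# which is why the practical target is a 48 GB GPU, not 36 GB.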

✨ Insight · Think of LoRA as adding a small correction lens to a telescope. The main mirror (pre-trained weights) stays fixed. The correction lens (LoRA matrices) is tiny but precisely shaped to fix the specific aberration (task adaptation) you care about.

Catastrophic Forgetting — Quantified: Full fine-tuning on a narrow task dataset can dramatically degrade performance on tasks the model was not fine-tuned on. Luo et al. (2023) showed that continual instruction tuning of BLOOMZ-7.1B substantially degraded MMLU accuracy. LoRA largely avoids this: because the base weights are frozen and the adapter adds only a low-rank correction, the original knowledge is structurally preserved. Empirically, LoRA-tuned models retain far more of their general-benchmark performance than fully fine-tuned ones. This structural forgetting resistance is the second major reason (after memory cost) that LoRA has become the default fine-tuning method.

Worked Example — LoRA Parameter Count (Llama-2 7B)

Setup: d_model = 4096, 32 layers, LoRA applied to Q, K, V, O projections (4 matrices per layer), rank r = 16.

Per matrix: 2 × 4096 × 16 = 131,072 params

Per layer: 4 × 131,072 = 524,288 params

Total: 32 × 524,288 = 16,777,216 ≈ 16.8M params

Share of the 7B model: 16.8M / 7,000M ≈ 0.24% — yet LoRA matches full fine-tuning quality on most tasks.
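The same arithmetic as a short script (values match the worked example above):

Python: LoRA Parameter Count

python
d_model, n_layers, rank = 4096, 32, 16
per_matrix = 2 * d_model * rank        # A (r×d) + B (d×r) = 131,072
per_layer = 4 * per_matrix             # Q, K, V, O projections = 524,288
total = n_layers * per_layer           # 16,777,216 ≈ 16.8M
print(f"{total:,} trainable params = {total / 7e9:.2%} of a 7B model")  # ~0.24%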

DoRA — Weight-Decomposed LoRA: Liu et al. (2024) observed that LoRA updates tend to change the direction of weight vectors much more easily than their magnitude — but full fine-tuning updates both. DoRA explicitly decomposes each weight matrix into magnitude (a scalar per output channel) and direction (a unit-norm matrix), then applies LoRA only to the direction component while keeping the magnitude trainable directly. This better mimics the update pattern of full fine-tuning and produces consistent quality gains over plain LoRA with the same rank and no additional inference cost, since the decomposition can be merged back before deployment.
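A minimal sketch of the DoRA idea, not the paper's reference implementation — the exact normalization, initialization, and merge details differ:

PyTorch: DoRA-style Layer (sketch)

python
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Sketch: per-channel magnitude + unit-norm direction, with LoRA on the direction."""
    def __init__(self, in_dim, out_dim, rank=16, alpha=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad_(False)          # frozen base (copy pre-trained weights in here)
        # trainable magnitude: one scalar per output channel, init from the frozen row norms
        self.m = nn.Parameter(self.W.weight.norm(dim=1, keepdim=True).clone())
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)                # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        delta = (self.B.weight @ self.A.weight) * self.scale   # low-rank directional update
        W_dir = self.W.weight + delta
        W_dir = W_dir / W_dir.norm(dim=1, keepdim=True)        # unit-norm direction per channel
        return F.linear(x, self.m * W_dir)                     # re-apply the learned magnitude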

LoRA+ (2024): Hayou et al. observed that LoRA treats the A and B matrices symmetrically with the same learning rate, but they play different roles — A initializes with random Kaiming uniform values and projects input down, while B initializes to zero and projects back up. LoRA+ sets a higher learning rate for B than A (typically a fixed ratio of ~16x), since B is closer to the output and starts from zero. This asymmetric LR schedule better matches the gradient flow dynamics of full fine-tuning. The practical result: ~2% quality improvement on standard benchmarks with no additional parameters or inference overhead — just a one-line hyperparameter change.
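In code this is just optimizer parameter groups — a sketch that assumes the LoRALinear naming (attributes A and B) used in the implementations later in this section; the 16× ratio is the paper's suggested default, not a universal constant:

PyTorch: LoRA+ Learning-Rate Split

python
def lora_plus_param_groups(model, base_lr=2e-4, ratio=16):
    """Assign B matrices a higher learning rate than A matrices (LoRA+ heuristic)."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue                                  # skip frozen base weights
        if ".A." in name or name.endswith("A.weight"):
            a_params.append(p)
        elif ".B." in name or name.endswith("B.weight"):
            b_params.append(p)
    return [
        {"params": a_params, "lr": base_lr},          # down-projection A: base LR
        {"params": b_params, "lr": base_lr * ratio},  # up-projection B: ~16× higher LR
    ]

# usage: optimizer = torch.optim.AdamW(lora_plus_param_groups(model), weight_decay=0.0)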

Spectrum (2024): Rather than applying LoRA uniformly to all layers, Spectrum uses Signal-to-Noise Ratio (SNR) per layer to decide which layers to fine-tune. The intuition: layers with high SNR (signal dominates noise in their weight spectra) carry more learned structure and benefit more from adaptation, while low-SNR layers are noisier and contribute less. Spectrum computes SNR from the singular value decomposition of each weight matrix — high singular values relative to noise floor indicate high SNR. Fine-tuning only the top-SNR layers reduces compute by 30–50% with minimal quality loss, and often outperforms full-layer LoRA because it avoids adapting layers that aren't meaningfully task-relevant.
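A rough illustration of the layer-selection idea — score each weight matrix with a simple SVD-based signal measure and leave only the top fraction trainable. The actual Spectrum SNR estimator is more principled, so treat this purely as a sketch:

Python: SNR-Based Layer Selection (sketch)

python
import torch

def snr_proxy(weight: torch.Tensor, frac: float = 0.25) -> float:
    """Crude SNR proxy: mean of the largest singular values over the smallest ones."""
    s = torch.linalg.svdvals(weight.float())   # returned in descending order
    k = max(1, int(len(s) * frac))
    return (s[:k].mean() / s[-k:].mean()).item()

def select_trainable(named_weights, keep_frac: float = 0.3):
    """Return names of the highest-scoring matrices; freeze everything else."""
    scored = sorted(named_weights, key=lambda kv: snr_proxy(kv[1]), reverse=True)
    n_keep = max(1, int(len(scored) * keep_frac))
    return {name for name, _ in scored[:n_keep]}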

Quick check

Trade-off

Why does LoRA use r ≪ d instead of training the full d×d weight matrix directly?

📋

Method Comparison

Metric | Full FT | LoRA | QLoRA | DoRA | Prefix Tuning
Trainable params | 100% | 0.01–1% | 0.01–1% | 0.01–1% | ~0.001%
GPU memory (7B) | ~112GB | ~16GB | ~10GB | ~16GB | ~14GB
Quality vs full FT | 100% | ~98% | ~97% | ~99%+ | ~90%
Inference overhead | None | None (merged) | None (merged) | None (merged) | Extra tokens
Multi-tenant | No | Yes (hot-swap) | Yes | Yes | Yes
Quick Check

LoRA with rank r=16 on a 4096x4096 weight matrix trains how many parameters?

📐

Step-by-Step Derivation

LoRA: Low-Rank Adaptation

Freeze the pre-trained weight W₀ ∈ ℝ^(d×d) and add a low-rank update ΔW = BA, with B ∈ ℝ^(d×r) and A ∈ ℝ^(r×d). During inference, merge ΔW into W₀ — zero additional latency:

Original forward pass (frozen): h = W₀x

LoRA forward pass (only A and B train): h = W₀x + (α/r)·BAx

Merged at inference (zero latency overhead): W' = W₀ + (α/r)·BA, so h = W'x

✨ Insight · B initializes to zero so the adapter starts as a no-op (ΔW = BA = 0) — training begins from the pre-trained model's exact output, not a random perturbation.

Parameter Savings

Full matrix: d² parameters. LoRA: 2dr parameters. Concrete numbers for d = 4096:

At r = 8 the adapter is 65K params vs 16.7M — a 256× reduction. At r = 64 it's 524K (~3%), still 32× cheaper:

Rank r | Trainable params | % of d×d
r=8 | 65,536 | 0.4%
r=16 | 131,072 | 0.8%
r=64 | 524,288 | 3.1%
full | 16,777,216 | 100%
💡 Tip · LoRA is typically applied to the attention projection matrices (Q, K, V, O). For a model with L layers and 4 projections each, total trainable params = L × 4 × 2dr.

PyTorch: LoRA Layer (rank=8, alpha=16)

python
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad_(False)  # Freeze base
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)  # B=0 → adapter is a no-op at init (ΔW=0)
        self.scale = alpha / rank

    def forward(self, x):
        return self.W(x) + self.B(self.A(x)) * self.scale
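
One way to wire this into an existing model — a hypothetical helper that swaps matching nn.Linear projections for LoRALinear; the module names (q_proj, v_proj) are an assumption about the model's naming scheme, and it assumes bias-free projections as in Llama-style attention:

PyTorch: Injecting LoRA into Attention Projections

python
import torch.nn as nn

def inject_lora(model, target_names=("q_proj", "v_proj"), rank=8, alpha=16):
    """Replace selected nn.Linear layers with LoRALinear, preserving the pre-trained weights."""
    for module in model.modules():
        for child_name, child in list(module.named_children()):
            if child_name in target_names and isinstance(child, nn.Linear):
                lora = LoRALinear(child.in_features, child.out_features, rank, alpha)
                lora.W.weight.data.copy_(child.weight.data)   # keep the frozen base weights
                setattr(module, child_name, lora)
    return model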

PyTorch: LoRA Layer (with merge for inference)

python
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=16, alpha=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.W.weight.requires_grad = False  # freeze base

        self.A = nn.Linear(in_dim, rank, bias=False)   # down-project
        self.B = nn.Linear(rank, out_dim, bias=False)   # up-project
        self.scale = alpha / rank

        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)  # start at zero delta

    def forward(self, x):
        return self.W(x) + self.B(self.A(x)) * self.scale

    def merge(self):
        """Merge adapter into base weight — zero inference overhead."""
        self.W.weight.data += (self.B.weight @ self.A.weight) * self.scale
        self.W.weight.requires_grad = False
        # After merge, A and B can be discarded
PyTorch implementation
# LoRA: inject low-rank adapters into attention projections
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear with LoRA adaptation."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)
        self.W0.weight.requires_grad_(False)      # freeze base weights
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)             # B=0: adapter contributes zero update at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.W0(x) + self.B(self.A(x)) * self.scale  # W0 + BA

    def merge(self):
        """Merge into W0 for zero-overhead inference."""
        self.W0.weight.data += (self.B.weight @ self.A.weight) * self.scale

Practical: Fine-tuning Workflow

1. Data Preparation: Format as instruction/response pairs in the model's chat template. Mask instruction tokens in the loss — train only on responses (see the masking sketch after this list).

2. OpenAI Fine-tuning API: Upload JSONL with {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}. Cost: ~$8/1M training tokens (GPT-3.5-turbo era pricing; check openai.com/pricing for current rates — newer models are priced per training hour).

3. HuggingFace + QLoRA: Use peft library with BitsAndBytesConfig(load_in_4bit=True). Set r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"].

4. Evaluation: Track train/val loss, run qualitative evals every N steps. Watch for train-val divergence (overfitting signal). Merge adapters for production serving.
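
A minimal sketch of the loss masking from step 1, assuming a Hugging Face tokenizer — prompt tokens get the label -100 so cross-entropy is computed only on response tokens:

Python: Masking Instruction Tokens in the Loss

python
IGNORE_INDEX = -100  # positions with this label are ignored by PyTorch cross-entropy

def build_example(tokenizer, prompt: str, response: str, max_len: int = 2048):
    """Tokenize prompt + response; keep the loss only on the response."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}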

Python: QLoRA Setup with Hugging Face PEFT

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch

# 1. Load base model in 4-bit (NF4 quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 — best for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # nested quantization saves ~0.4 GB
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# 2. Attach LoRA adapters in BF16
lora_config = LoraConfig(
    r=16,                             # rank — start here, tune if needed
    lora_alpha=32,                    # scale = alpha/r = 2.0
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 16,777,216 || all params: 6,742,609,920 || trainable%: 0.2489

# 3. Train with SFTTrainer (trl library)
# 4. Merge + save for production
model = model.merge_and_unload()   # folds BA into W0, removes adapter

Choosing Rank r — Decision Guide

Task type | Recommended rank | Rationale
Style / format change | r=4–8 | Low-rank subspace sufficient for surface changes
Instruction following | r=16 | Standard choice; matches most published results
Domain adaptation | r=32–64 | Larger shift requires more expressive adapter
Code / math (large shift) | r=64–128 | High intrinsic dimensionality of task; validate on held-out set

Rule of thumb: start at r=16, double if val loss plateaus, halve if train and val loss diverge. Profile memory before going above r=128 — at very high ranks the adapter's parameter count and optimizer state become a meaningful fraction of what full fine-tuning would cost.

Quick check

Derivation

The LoRA scaling factor alpha/r is set to alpha=32, r=16, giving scale=2.0. What happens if you double the rank to r=32 but keep alpha=32?

🔧

Break It — See What Happens

Set rank too low (r=1)
Set rank too high (r=512)
Full fine-tune on small dataset (1K examples)
LoRA rank too high (r=256)
Apply LoRA only to attention layers (skip FFN)

Quick check

Trade-off

If LoRA is applied only to attention projection matrices (Q, K, V, O) and not to FFN layers, which fine-tuning task is most likely to suffer?

📊

Real-World Numbers

Method Comparison (7B model)

Method | Trainable % | Memory | Quality vs Full FT
Full fine-tune | 100% | 4× model size | Baseline
LoRA r=8 | 0.1–0.4% | 1.1× model size | 95–100%
LoRA r=64 | 0.8–3% | 1.3× model size | 98–100%
QLoRA (4-bit) | 0.1% | 0.3× model size | 93–97%

Concrete GPU Requirements

Setup | Trainable Params | GPU Memory / Cost
Llama-2 7B full FT | 7B (100%) | ~56GB (multi-GPU)
Llama-2 7B LoRA r=16 | 4.2M (0.06%) | ~16GB (single GPU)
Llama-2 7B QLoRA r=16 | 4.2M (0.06%) | ~6GB (single GPU)
Llama-2 70B QLoRA r=16 | 33M (0.05%) | ~48GB (single GPU)
OpenAI fine-tuning API | Unknown (managed) | ~$8/1M training tokens (GPT-3.5-turbo era; see openai.com/pricing)
✨ Insight · Llama-2 7B full fine-tuning requires ~112GB VRAM with standard mixed-precision training (FP16 weights + FP32 Adam states). The ~56GB figure in the table assumes 2-GPU FSDP sharding. QLoRA cuts single-GPU memory to ~6GB — the same task on a consumer RTX 4090. For 70B models: full FT needs ~560GB (8×A100), QLoRA fits in ~48GB (1×A100).
💡 Tip · LoRA adapters are tiny files (10–50MB) that can be hot-swapped at serving time. This enables multi-tenant setups: one base model, many task-specific adapters loaded on demand. Because the expensive base weights are shared across all tenants and only the small adapters differ, per-customer fine-tuned models become economically viable.
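
The serving-side pattern looks roughly like this with the peft library — load_adapter and set_adapter are real PEFT APIs, but the adapter names and paths here are hypothetical:

Python: Hot-Swapping LoRA Adapters (PEFT)

python
from peft import PeftModel

# base_model: an already-loaded AutoModelForCausalLM shared by all tenants
model = PeftModel.from_pretrained(base_model, "adapters/customer_a", adapter_name="customer_a")
model.load_adapter("adapters/customer_b", adapter_name="customer_b")

model.set_adapter("customer_a")   # route tenant A's requests through adapter A
# ... generate for customer A ...
model.set_adapter("customer_b")   # switch tenants without reloading the multi-GB base model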

SFT Dataset Reference Points

Production SFT pipelines are smaller than you might expect. InstructGPT's supervised fine-tuning stage used roughly 13K labeler-written demonstrations, followed a year later by LIMA's 1,000 hand-curated examples. Both demonstrate that data quality and labeler expertise matter far more than raw example count.

Data Quality vs. Quantity — LIMA (2023)

Zhou et al. (2023) showed that a 65B LLaMA model fine-tuned on just 1,000 carefully curated examples matched Alpaca (trained on 52K examples) on human preference evals. The implication for fine-tuning strategy: invest in data curation — format consistency, response quality, instruction diversity — rather than raw volume. A small, high-quality dataset beats a large noisy one.

🧠

Key Takeaways

What to remember for interviews

  1. LoRA exploits the low intrinsic rank of weight updates — trains B·A where r << d
  2. QLoRA quantizes the base model to 4-bit, keeping LoRA adapters in FP16
  3. DoRA decomposes weights into magnitude + direction, applying LoRA only to direction
  4. LoRA adapters are tiny (10-50MB) and can be hot-swapped at serving time
  5. Always validate on a held-out set — if train loss drops but val loss doesn't, reduce rank
🧠

Recap quiz

Derivation

LoRA rank r=16 is applied to a single 4096×4096 attention projection. How many trainable parameters does this add?

Trade-off

QLoRA lets you fine-tune a 65B model on one 48 GB GPU. Which combination of techniques makes this possible?

Trade-off

LIMA showed 1,000 curated examples match Alpaca (52K examples) on human preference evals. What does this imply for fine-tuning data strategy?

Trade-off

A team is fine-tuning a 7B model on 500 code-generation examples. They see train loss drop to 0.1 but val loss plateau at 1.8 after epoch 2. What is the most likely fix?

Derivation

LoRA initializes B=0 and A with Kaiming uniform. What is the effect at the start of training?

Derivation

Why can't you full fine-tune a 70B model on a single 48 GB GPU, even with FP16 mixed precision?

📚

Further Reading

🎯

Interview Questions


Explain LoRA. Why does it work despite training so few parameters?

★★☆
Meta · OpenAI

What is QLoRA and how does it enable fine-tuning 70B models on a single GPU?

★★★
Meta · Databricks

How do you choose the LoRA rank r? What are the tradeoffs?

★★☆
OpenAI · Meta

Compare full fine-tuning, LoRA, and prompt tuning. When would you use each?

★☆☆
Google · OpenAI

How do you prepare data for fine-tuning? What are common pitfalls?

★★☆
OpenAI · Databricks

How do you detect and prevent overfitting during fine-tuning?

★★☆
Anthropic · OpenAI