Day 20: Model Adapters & LoRA
Learning Objectives
- Understand LoRA and why it's the dominant fine-tuning approach
- Learn how to swap adapters without reloading the base model
- Benchmark adapter swap latency
Theory (15 min)
What is LoRA?
Low-Rank Adaptation: Instead of fine-tuning all 7B parameters, train small rank-decomposition matrices that adapt the model's behaviour.
Base Model Weights (7B params, frozen)
W = Wโ + AยทB
A: rank r ร d (e.g., 8 ร 4096) โ Trained
B: d ร rank r (e.g., 4096 ร 8) โ Trained
Key numbers: - Full fine-tune: 7B parameters trained = 14GB of gradients - LoRA fine-tune: 2 ร 8 ร 4096 ร 32 layers = 2M params trained = 8MB
LoRA is ~1000x cheaper than full fine-tuning.
Adapter Serving
The real power: swap adapters at inference time without reloading the base model.
Time: โโโโโถ [Base Model loaded in GPU memory] โโโโโโโโ
โฒ โฒ โฒ โฒ
โ โ โ โ
Adapter A Adapter B Adapter C Adapter D
(coding) (chat) (medical) (legal)
weight weight weight weight
~8MB ~8MB ~8MB ~8MB
Swap latency: Load new adapter weights โ matrix multiply. Takes ~1ms.
Compare to loading a full model: 1-10 seconds.
Use Cases
| Scenario | Full Model | LoRA Adapter |
|---|---|---|
| Model load | 30s (7B q4) | 1ms |
| Storage per task | 3.5GB | 8MB |
| Multi-task serving | Nร resources | Nร tiny adapters |
| Per-user customization | Impossible | Trivial |
Hands-on (15 min)
Download, Apply, and Benchmark LoRA Adapters
# Using llama.cpp's LoRA support
# (llama.cpp supports --lora and --lora-base flags)
# Example: Download a LoRA adapter
# huggingface-cli download zixuan/Qwen2.5-3B-Chat-lora --local-dir ./lora
# Apply at inference
llama-cli -m /models/qwen2.5-3b-q4_K_M.gguf \
--lora ./lora/adapter.bin \
-p "Write a Python function to sort a list" \
-n 100
#!/usr/bin/env python3
"""lora-benchmark.py โ measure adapter swap overhead."""
import time
import subprocess
import json
# Stub โ Ayva will expand with:
# - Fine-tune a LoRA adapter on a custom dataset (using unsloth or PEFT)
# - Compare base model vs LoRA output qualitatively
# - Measure: load time, inference speed, memory delta
# - Multi-adapter routing (semantic router from Day 4 directs to correct adapter)
# - Adapter hot-swap without restarting the server
# - Per-user adapter serving architecture
MODEL = "/models/qwen2.5-3b-q4_K_M.gguf"
LORA_PATH = "./lora/adapter.bin" # adjust to your adapter
PROMPT = "Explain machine learning in simple terms."
# Baseline: no LoRA
print("Baseline (no adapter)...")
start = time.time()
result = subprocess.run(
["llama-cli", "-m", MODEL, "-n", "50", "-p", PROMPT,
"--no-display-prompt", "-ngl", "99"],
capture_output=True, text=True, timeout=60,
)
baseline_time = time.time() - start
baseline_output = result.stdout
print(f" โ {baseline_time:.2f}s")
# With LoRA (if adapter exists)
if os.path.exists(LORA_PATH):
print("With LoRA adapter...")
start = time.time()
result = subprocess.run(
["llama-cli", "-m", MODEL, "--lora", LORA_PATH, "-n", "50",
"-p", PROMPT, "--no-display-prompt", "-ngl", "99"],
capture_output=True, text=True, timeout=60,
)
lora_time = time.time() - start
lora_output = result.stdout
print(f" โ {lora_time:.2f}s")
print(f"Overhead: {((lora_time/baseline_time)-1)*100:.1f}%")
# Quality: check if outputs differ
if baseline_output[:100] != lora_output[:100]:
print("โจ LoRA changed the output (expected)")
else:
print("โ ๏ธ Outputs identical โ adapter may not be active")
else:
print("โ ๏ธ No LoRA adapter found at", LORA_PATH)
Questions for Ayva: - How to serve multiple LoRA adapters concurrently (like a model router)? - What's the practical limit on number of adapters per base model? - How does LoRA rank affect quality vs speed?
Key Takeaways
- LoRA adapters are ~1000x cheaper than full fine-tuning
- Adapter swapping enables multi-task serving without reloading the base model
- Swap overhead is ~1ms โ orders of magnitude faster than model reloading
- LoRA is the standard for practical model customisation in production