Day 20: Model Adapters & LoRA


Learning Objectives

Understand LoRA and why it's the dominant fine-tuning approach
Learn how to swap adapters without reloading the base model
Benchmark adapter swap latency

Theory (15 min)

What is LoRA?

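Low-Rank Adaptation (LoRA): instead of updating all of a model's parameters (7B in the running example), freeze them and train a pair of small low-rank matrices for each adapted weight matrix. Their product is added to the frozen weight to adapt the model's behaviour.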

Base Model Weights (7B params, frozen)
    W = W₀ + B·A

    W₀: d × d      (e.g., 4096 × 4096) ← Frozen
    A: rank r × d  (e.g., 8 × 4096)   ← Trained
    B: d × rank r  (e.g., 4096 × 8)   ← Trained
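To make the shapes concrete, here is a minimal sketch of a LoRA forward pass in pure Python. The sizes (d = 4, r = 2) and all matrix values are made up for illustration. It uses the convention that B (d × r) multiplies A (r × d), so the product is d × d and can be added to the frozen weight.

```python
# Toy LoRA forward pass in pure Python (illustrative values, not a real model).

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Multiply matrix P (n x k) by matrix Q (k x m)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def matadd(M, N):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(M, N)]

# Toy sizes: d = 4, r = 2 (the text uses d = 4096, r = 8).
W0 = [[1, 0, 0, 0],
      [0, 1, 0, 0],
      [0, 0, 1, 0],
      [0, 0, 0, 1]]   # frozen base weight, d x d
A = [[1, 2, 0, 0],
     [0, 0, 1, 1]]    # trained, r x d
B = [[1, 0],
     [0, 1],
     [2, 0],
     [0, 3]]          # trained, d x r

x = [1, 1, 1, 1]

# Unmerged path: base output plus the low-rank correction B·(A·x).
# Only two thin matrix-vector products are added on top of the base model.
h_unmerged = [b + c for b, c in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Merged path: fold the adapter into the weight once, W = W0 + B·A.
W = matadd(W0, matmul(B, A))
h_merged = matvec(W, x)

print(h_unmerged)  # [4, 3, 7, 7]
print(h_merged)    # [4, 3, 7, 7]
```

Both paths give the same output. Keeping the adapter unmerged is what makes swapping cheap, since the base weights are never modified.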

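Key numbers:
- Full fine-tune: 7B parameters trained = 14GB of gradients at 16-bit precision (before optimizer state)
- LoRA fine-tune: 2 × 8 × 4096 × 32 layers ≈ 2.1M params trained ≈ 8MB at 32-bit precision (≈ 4MB at 16-bit), assuming one adapted 4096 × 4096 matrix per layer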

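By trainable-parameter count, LoRA is roughly 3,000x smaller than full fine-tuning in this example (7B / 2.1M). The compute saving is smaller than that ratio, because every training step still runs forward and backward through the frozen base model.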
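The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the `matrices_per_layer` parameter is an assumption added here: the source counts one adapted matrix per layer, while real configurations often adapt several.

```python
def lora_params(d, r, n_layers, matrices_per_layer=1):
    """Trainable parameters when each adapted d x d matrix gets A (r x d) and B (d x r)."""
    return 2 * r * d * n_layers * matrices_per_layer

def size_mb(n_params, bytes_per_param):
    """Storage in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6

n = lora_params(d=4096, r=8, n_layers=32)
print(n)  # 2097152
print(f"fp32: {size_mb(n, 4):.1f} MB, fp16: {size_mb(n, 2):.1f} MB")
print(f"7B / LoRA params: {7e9 / n:.0f}x")

# Parameter count grows linearly with rank.
for rank in (4, 8, 16, 64):
    print(rank, lora_params(d=4096, r=rank, n_layers=32))
```

Doubling the rank or the number of adapted matrices per layer doubles the adapter size, which is still small next to the base model.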

Adapter Serving

The real power: swap adapters at inference time without reloading the base model.

Time:  ────▶ [Base Model loaded in GPU memory] ◀───────
          ▲         ▲         ▲         ▲
          │         │         │         │
     Adapter A  Adapter B  Adapter C  Adapter D
     (coding)   (chat)     (medical)  (legal)
     weight      weight     weight     weight
     ~8MB       ~8MB       ~8MB       ~8MB

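Swap latency: load the new adapter weights, then apply them as an extra low-rank matrix multiply. With the adapter already in memory, this takes on the order of 1ms.

Compare to loading a full model: seconds to tens of seconds, depending on model size and storage speed.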
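The serving pattern can be sketched as a toy class. `AdapterServer` and its methods are hypothetical names for illustration, not part of llama.cpp or any library, and scalars stand in for the weight matrices. The point is that switching adapters changes only which small set of weights is active, while the base weight is loaded once and never touched.

```python
class AdapterServer:
    """Toy model of adapter serving: one shared base, many small adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # loaded once, never reloaded
        self.adapters = {}              # name -> (a, b), scalar stand-ins for A and B
        self.active = None

    def register(self, name, a, b):
        """Store an adapter; cheap because adapters are small."""
        self.adapters[name] = (a, b)

    def activate(self, name):
        """Swap the active adapter without touching the base weight."""
        if name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        y = self.base_weight * x
        if self.active is not None:
            a, b = self.adapters[self.active]
            y += b * (a * x)  # scalar stand-in for B·(A·x)
        return y


server = AdapterServer(base_weight=2.0)
server.register("coding", a=1.0, b=0.5)
server.register("chat", a=2.0, b=1.0)

print(server.forward(3.0))   # base only: 6.0
server.activate("coding")
print(server.forward(3.0))   # 7.5
server.activate("chat")
print(server.forward(3.0))   # 12.0
```

A real server would hold the adapter tensors in GPU memory next to the base model; the control flow is the same.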

Use Cases

Scenario	Full Model	LoRA Adapter
Model load	30s (7B q4)	~1ms (adapter swap)
Storage per task	3.5GB	8MB
Multi-task serving	N× resources	1 base + N× tiny adapters
Per-user customization	Impractical	Trivial
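A quick sketch of how the storage row scales with the number of tasks, using the table's figures (3.5GB per quantized 7B model, 8MB per adapter). The function name is hypothetical, and the LoRA total assumes one shared copy of the base model.

```python
def storage_gb(n_tasks, base_gb=3.5, adapter_mb=8.0):
    """Return (full-model total, LoRA total) in GB for n_tasks specialisations."""
    full = n_tasks * base_gb                      # one full model copy per task
    lora = base_gb + n_tasks * adapter_mb / 1000  # one shared base plus small adapters
    return full, lora

for n in (1, 10, 100):
    full, lora = storage_gb(n)
    print(f"{n:>3} tasks: full = {full:.2f} GB, LoRA = {lora:.2f} GB")
```

For a single task the adapter approach costs slightly more, since the base model is still needed. The gap opens up as the number of tasks grows.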

Hands-on (15 min)

Download, Apply, and Benchmark LoRA Adapters

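# Using llama.cpp's LoRA support
# (--lora applies an adapter; older builds also accepted --lora-base.
#  Recent builds expect the adapter in GGUF format, so adjust the file name below if needed.)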

# Example: Download a LoRA adapter
# huggingface-cli download zixuan/Qwen2.5-3B-Chat-lora --local-dir ./lora

# Apply at inference
llama-cli -m /models/qwen2.5-3b-q4_K_M.gguf \
  --lora ./lora/adapter.bin \
  -p "Write a Python function to sort a list" \
  -n 100

#!/usr/bin/env python3
"""lora-benchmark.py — measure adapter swap overhead."""
import os
import subprocess
import time

# Stub: Ayva will expand with:
# - Fine-tune a LoRA adapter on a custom dataset (using unsloth or PEFT)
# - Compare base model vs LoRA output qualitatively
# - Measure: load time, inference speed, memory delta
# - Multi-adapter routing (semantic router from Day 4 directs to correct adapter)
# - Adapter hot-swap without restarting the server
# - Per-user adapter serving architecture

MODEL = "/models/qwen2.5-3b-q4_K_M.gguf"
LORA_PATH = "./lora/adapter.bin"  # adjust to your adapter

PROMPT = "Explain machine learning in simple terms."

# Each run starts a fresh llama-cli process, so the timings include model load.
# They show the end-to-end cost of attaching an adapter, not in-memory swap latency.
# Greedy decoding (--temp 0) with a fixed seed keeps the two outputs comparable.
COMMON_ARGS = ["-n", "50", "-p", PROMPT, "--no-display-prompt", "-ngl", "99",
               "--temp", "0", "--seed", "42"]


def timed_run(extra_args):
    """Run llama-cli once; return (elapsed seconds, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(
        ["llama-cli", "-m", MODEL, *extra_args, *COMMON_ARGS],
        capture_output=True, text=True, timeout=60,
    )
    elapsed = time.perf_counter() - start
    if result.returncode != 0:
        raise RuntimeError(f"llama-cli failed: {result.stderr[-500:]}")
    return elapsed, result.stdout


# Baseline: no LoRA
print("Baseline (no adapter)...")
baseline_time, baseline_output = timed_run([])
print(f"  → {baseline_time:.2f}s")

# With LoRA (if adapter exists)
if os.path.exists(LORA_PATH):
    print("With LoRA adapter...")
    lora_time, lora_output = timed_run(["--lora", LORA_PATH])
    print(f"  → {lora_time:.2f}s")
    print(f"Overhead: {((lora_time/baseline_time)-1)*100:.1f}%")

    # Quality: check if outputs differ
    if baseline_output[:100] != lora_output[:100]:
        print("✨ LoRA changed the output (expected)")
    else:
        print("⚠️  Outputs identical: adapter may not be active")
else:
    print("⚠️  No LoRA adapter found at", LORA_PATH)
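A single timed run is noisy, since disk caching and GPU warm-up can dominate the first launch. One possible refinement, sketched here with hypothetical helper names, is to repeat each command and compare medians. The demo uses the Python interpreter as a stand-in command so the sketch runs without llama.cpp installed; substitute the llama-cli argument lists to benchmark for real.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3, timeout=60):
    """Run cmd `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(cmd, capture_output=True, text=True,
                       timeout=timeout, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def overhead_percent(baseline_s, candidate_s):
    """Relative slowdown of candidate versus baseline, in percent."""
    return (candidate_s / baseline_s - 1) * 100


if __name__ == "__main__":
    # Stand-in command (assumption): replace with the llama-cli argument list.
    demo_cmd = [sys.executable, "-c", "pass"]
    print(f"median: {time_command(demo_cmd):.3f}s")
    print(f"example overhead: {overhead_percent(2.0, 2.5):.1f}%")
```

The median is less sensitive to a slow first run than the mean, which matters when the first launch pays for a cold disk cache.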

Questions for Ayva:
- How to serve multiple LoRA adapters concurrently (like a model router)?
- What's the practical limit on number of adapters per base model?
- How does LoRA rank affect quality vs speed?

Key Takeaways

LoRA trains orders of magnitude fewer parameters than full fine-tuning (roughly 3,000x fewer in the 7B example)
Adapter swapping enables multi-task serving without reloading the base model
Swap overhead is on the order of 1ms, orders of magnitude faster than reloading the model
LoRA is the standard for practical model customisation in production
