🧠 AI System Design

Day 20: Model Adapters & LoRA

📂 Serving & Inference · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand LoRA and why it's the dominant fine-tuning approach
  • Learn how to swap adapters without reloading the base model
  • Benchmark adapter swap latency

Theory (15 min)

What is LoRA?

Low-Rank Adaptation: Instead of fine-tuning all 7B parameters, train small rank-decomposition matrices that adapt the model's behaviour.

Base Model Weights (7B params, frozen)
    W = W₀ + B·A

    A: r × d  (e.g., 8 × 4096)   ← Trained
    B: d × r  (e.g., 4096 × 8)   ← Trained
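
The shapes above can be checked numerically. With A of shape r × d and B of shape d × r, the product that gives a d × d update is B·A. The sketch below is a minimal NumPy illustration, not a training recipe: it uses a small d so it runs instantly, and random values stand in for trained matrices. It shows that adding the update to the base weight gives the same output as leaving the base weight untouched and adding a low-rank side path.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                              # small d for a quick demo; the text uses d = 4096

W0 = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01    # stand-in for the trained r x d matrix
B = rng.standard_normal((d, r)) * 0.01    # stand-in for the trained d x r matrix

x = rng.standard_normal(d)

# Merged form: build the adapted weight once, then do a single matmul.
W = W0 + B @ A
y_merged = W @ x

# Unmerged form: leave W0 untouched and add the low-rank path on the side.
y_side = W0 @ x + B @ (A @ x)

lora_matches = np.allclose(y_merged, y_side)
update_rank = np.linalg.matrix_rank(B @ A)   # cannot exceed r
print(lora_matches, update_rank)
```

The unmerged form is what makes swapping cheap: the base weight never changes, only the small pair (A, B) does.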

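Key numbers:

  • Full fine-tune: 7B parameters trained = 14GB of gradients in fp16 (2 bytes each)
  • LoRA fine-tune: 2 × 8 × 4096 × 32 layers ≈ 2.1M parameters trained ≈ 8MB in fp32 (about 4MB in fp16), assuming one adapted 4096 × 4096 matrix per layer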

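Measured by trainable parameters, that is roughly 3,000x fewer than full fine-tuning, so "~1000x cheaper" is a conservative rule of thumb. The saving applies to gradients and optimizer state; the frozen base weights still have to sit in memory during training.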
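
The arithmetic behind these figures is easy to reproduce. The sketch below assumes, as the numbers above do, rank 8, hidden size 4096, 32 layers, and a single adapted matrix per layer; real adapters often target several matrices per layer, which scales the count accordingly.

```python
d, r, layers = 4096, 8, 32
base_params = 7_000_000_000

# A (r x d) plus B (d x r) for one adapted matrix in each layer.
lora_params = 2 * r * d * layers
ratio = base_params / lora_params


def megabytes(n_params, bytes_per_param):
    """Size in MB (10^6 bytes) for a given parameter count and precision."""
    return n_params * bytes_per_param / 1e6


print(f"LoRA params: {lora_params:,}")
print(f"Adapter size: {megabytes(lora_params, 4):.1f} MB fp32, "
      f"{megabytes(lora_params, 2):.1f} MB fp16")
print(f"Full fine-tune gradients: {megabytes(base_params, 2) / 1000:.0f} GB fp16")
print(f"Trainable-parameter ratio: ~{ratio:,.0f}x")
```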

Adapter Serving

The real power: swap adapters at inference time without reloading the base model.

Time:  ────▶ [Base Model loaded in GPU memory] ◀───────
          ▲         ▲         ▲         ▲
          │         │         │         │
     Adapter A  Adapter B  Adapter C  Adapter D
     (coding)   (chat)     (medical)  (legal)
     weight     weight     weight     weight
     ~8MB       ~8MB       ~8MB       ~8MB

Swap latency: load the new adapter weights, then apply them with a small matrix multiply. With the adapter already in memory this takes on the order of 1ms.

Compare to loading a full model: seconds to tens of seconds, depending on model size and storage speed.
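
The idea can be sketched with a toy model in NumPy. The `AdapterHost` class and its method names are hypothetical and only illustrate the principle; this is not how llama.cpp or any particular server implements swapping. The base weight is loaded once, and a swap only changes which small (B, A) pair the forward pass uses.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8


class AdapterHost:
    """Hypothetical holder for one frozen weight plus named LoRA adapters."""

    def __init__(self, base_weight):
        self.base = base_weight      # loaded once, never modified
        self.adapters = {}           # name -> (B, A)
        self.active = None

    def register(self, name, B, A):
        self.adapters[name] = (B, A)

    def swap(self, name):
        if name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name           # only a reference changes; the base stays put

    def forward(self, x):
        y = self.base @ x
        if self.active is not None:
            B, A = self.adapters[self.active]
            y = y + B @ (A @ x)      # low-rank side path
        return y


host = AdapterHost(rng.standard_normal((d, d)))
for name in ("coding", "chat", "medical", "legal"):
    host.register(name,
                  rng.standard_normal((d, r)) * 0.01,
                  rng.standard_normal((r, d)) * 0.01)

x = rng.standard_normal(d)
base_id = id(host.base)

start = time.perf_counter()
host.swap("coding")
y_coding = host.forward(x)
host.swap("legal")
y_legal = host.forward(x)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"two swaps + two forward passes: {elapsed_ms:.3f} ms")
```

Timings from this toy say nothing about a real GPU server; the point is only that the base array is never copied or reloaded between swaps.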

Use Cases

Scenario                 Full Model       LoRA Adapter
Model load               30s (7B q4)      1ms
Storage per task         3.5GB            8MB
Multi-task serving       N× resources     N× tiny adapters
Per-user customization   Impossible       Trivial
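
Because each adapter is small, per-user customization mostly becomes a caching problem: keep the recently used adapters resident and load the rest on demand. The sketch below is one possible shape for that, assuming a least-recently-used policy; `AdapterCache` and `fake_loader` are hypothetical names, and the loader is a stand-in for reading adapter weights from disk.

```python
from collections import OrderedDict


class AdapterCache:
    """Hypothetical LRU cache that keeps at most `capacity` adapters resident."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader             # callable: adapter_id -> adapter weights
        self._items = OrderedDict()
        self.loads = 0                   # number of cache misses so far

    def get(self, adapter_id):
        if adapter_id in self._items:
            self._items.move_to_end(adapter_id)   # mark as most recently used
            return self._items[adapter_id]
        weights = self.loader(adapter_id)
        self.loads += 1
        self._items[adapter_id] = weights
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)       # evict the least recently used
        return weights

    def resident(self):
        return list(self._items)


def fake_loader(adapter_id):
    # Stand-in for reading a few MB of adapter weights from disk.
    return f"weights-for-{adapter_id}"


cache = AdapterCache(capacity=2, loader=fake_loader)
for user in ["alice", "bob", "alice", "carol", "alice"]:
    cache.get(user)

print(cache.resident(), cache.loads)
```

In this run "bob" is evicted when "carol" arrives, because "alice" was touched more recently.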

Hands-on (15 min)

Download, Apply, and Benchmark LoRA Adapters

# Using llama.cpp's LoRA support
# (--lora applies an adapter; --lora-base appears only in older builds,
#  so check `llama-cli --help` for the flags your version accepts)

# Example: Download a LoRA adapter
# huggingface-cli download zixuan/Qwen2.5-3B-Chat-lora --local-dir ./lora

# Apply at inference
llama-cli -m /models/qwen2.5-3b-q4_K_M.gguf \
  --lora ./lora/adapter.bin \
  -p "Write a Python function to sort a list" \
  -n 100

#!/usr/bin/env python3
"""lora-benchmark.py: measure adapter swap overhead."""
import os
import subprocess
import time

# Stub: Ayva will expand with:
# - Fine-tune a LoRA adapter on a custom dataset (using unsloth or PEFT)
# - Compare base model vs LoRA output qualitatively
# - Measure: load time, inference speed, memory delta
# - Multi-adapter routing (semantic router from Day 4 directs to correct adapter)
# - Adapter hot-swap without restarting the server
# - Per-user adapter serving architecture

MODEL = "/models/qwen2.5-3b-q4_K_M.gguf"
LORA_PATH = "./lora/adapter.bin"  # adjust to your adapter

PROMPT = "Explain machine learning in simple terms."

# Each run launches a fresh llama-cli process, so the timings include
# process start and model load, not only the adapter cost.
# --temp 0 makes decoding greedy, so the output comparison below is meaningful.

# Baseline: no LoRA
print("Baseline (no adapter)...")
start = time.perf_counter()
result = subprocess.run(
    ["llama-cli", "-m", MODEL, "-n", "50", "-p", PROMPT,
     "--no-display-prompt", "-ngl", "99", "--temp", "0"],
    capture_output=True, text=True, timeout=60,
)
baseline_time = time.perf_counter() - start
baseline_output = result.stdout
print(f"  → {baseline_time:.2f}s")

# With LoRA (if adapter exists)
if os.path.exists(LORA_PATH):
    print("With LoRA adapter...")
    start = time.perf_counter()
    result = subprocess.run(
        ["llama-cli", "-m", MODEL, "--lora", LORA_PATH, "-n", "50",
         "-p", PROMPT, "--no-display-prompt", "-ngl", "99", "--temp", "0"],
        capture_output=True, text=True, timeout=60,
    )
    lora_time = time.perf_counter() - start
    lora_output = result.stdout
    print(f"  → {lora_time:.2f}s")
    print(f"Overhead: {((lora_time/baseline_time)-1)*100:.1f}%")

    # Quality: check if outputs differ
    if baseline_output[:100] != lora_output[:100]:
        print("✨ LoRA changed the output (expected)")
    else:
        print("⚠️  Outputs identical: adapter may not be active")
else:
    print("⚠️  No LoRA adapter found at", LORA_PATH)

Questions for Ayva:

  • How to serve multiple LoRA adapters concurrently (like a model router)?
  • What's the practical limit on number of adapters per base model?
  • How does LoRA rank affect quality vs speed?
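
As a starting point for the first question, one common idea is to run the shared base matmul once for the whole batch and add a per-request low-rank path chosen by an adapter index. The NumPy sketch below is only an illustration of that idea under toy sizes; the array layout and variable names are assumptions, and a production server would use fused GPU kernels instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters = 64, 4, 3

W0 = rng.standard_normal((d, d))                      # shared frozen weight
A = rng.standard_normal((n_adapters, r, d)) * 0.01    # stacked A matrices
B = rng.standard_normal((n_adapters, d, r)) * 0.01    # stacked B matrices

X = rng.standard_normal((5, d))                # five requests in one batch
adapter_ids = np.array([0, 2, 1, 0, 2])        # adapter chosen per request

# Shared base matmul for the whole batch.
Y = X @ W0.T
# Per-request low-rank path: gather each row's (A, B), then two small matmuls.
H = np.einsum("brd,bd->br", A[adapter_ids], X)        # shape (batch, r)
Y = Y + np.einsum("bdr,br->bd", B[adapter_ids], H)    # shape (batch, d)

# Reference: handle each request alone with its own merged weight.
Y_ref = np.stack([(W0 + B[i] @ A[i]) @ x for i, x in zip(adapter_ids, X)])
batched_matches = np.allclose(Y, Y_ref)
print(batched_matches)
```

The gathered adapter tensors grow with batch size times rank, which hints at why rank and the number of resident adapters both matter for the other two questions.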


Key Takeaways

  • LoRA adapters are ~1000x cheaper than full fine-tuning
  • Adapter swapping enables multi-task serving without reloading the base model
  • Swap overhead is ~1ms, orders of magnitude faster than model reloading
  • LoRA is the standard for practical model customisation in production

References