Day 4: Load Balancing & Routing
Learning Objectives
- Understand why AI systems need load balancing beyond simple round-robin
- Learn semantic routing: classify → route to specialised model
- Set up a basic load balancer across inference server instances
Theory (15 min)
Why Load Balance Inference?
Multiple reasons to run more than one model instance:
- Throughput: one server can only handle N concurrent requests
- Specialisation: different models for different tasks (chat vs code vs embedding)
- Fault tolerance: if one instance crashes, others take over
- A/B testing: slowly roll out new model versions
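The A/B testing point can be illustrated with a deterministic traffic split. This is a minimal sketch, not a production rollout tool: the function name `pick_version`, the labels `stable` and `canary`, and the 10% default share are all assumptions made for illustration.

```python
import hashlib


def pick_version(request_id: str, canary_share: float = 0.10) -> str:
    """Deterministically assign a request (or user) ID to a model version.

    Hashing the ID instead of calling random() means the same ID always
    lands on the same version, so one user sees consistent behaviour.
    """
    if not 0.0 <= canary_share <= 1.0:
        raise ValueError("canary_share must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Reduce the hash to one of 10,000 buckets (hypothetical granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_share * 10_000 else "stable"


if __name__ == "__main__":
    ids = [f"user-{i}" for i in range(10_000)]
    share = sum(pick_version(i) == "canary" for i in ids) / len(ids)
    print(f"canary share: {share:.3f}")  # roughly 0.10, not exact
```

Raising `canary_share` step by step shifts more traffic to the new version. IDs that were already on the canary stay there, because their bucket number does not change.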
Classic LB Algorithms
| Algorithm | How it works | Best for |
|---|---|---|
| Round-robin | Server A, B, C, A, B, C, … | Equal-capacity servers |
| Least connections | Send to server with fewest active | Variable request duration |
| IP hash | Always route same client to same server | Session affinity |
| Random | Pick one at random | Simple, good enough under load |
| Weighted | Round-robin weighted by capacity | Heterogeneous hardware |
Problem: These don't know what kind of request is coming. All requests are treated equally.
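Two of the algorithms from the table can be sketched in a few lines to show how little they know about the request itself. The class and method names below (`RoundRobin`, `LeastConnections`, `pick`, `acquire`, `release`) are illustrative assumptions, not taken from any particular load balancer.

```python
import itertools


class RoundRobin:
    """Cycle through the servers in a fixed order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Pick the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self) -> str:
        # min() returns the first minimum, so ties go to the earliest server.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1


if __name__ == "__main__":
    rr = RoundRobin(["A", "B", "C"])
    print([rr.pick() for _ in range(6)])  # ['A', 'B', 'C', 'A', 'B', 'C']

    lc = LeastConnections(["A", "B"])
    first = lc.acquire()   # 'A' (tie, first server wins)
    second = lc.acquire()  # 'B'
    lc.release(second)     # B's request finishes quickly
    print(lc.acquire())    # 'B' again, because A is still busy
```

Neither `pick()` nor `acquire()` takes the prompt as an argument, which is exactly the limitation described above.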
Semantic Routing (Model-Aware)
The real power: route by intent. A classifier inspects the request and sends it to the best-suited model.
Request: "Write a React component for a dropdown"
    │
    ▼
Classifier → "code generation"
    │
    ▼
Route to code-specialised model (DeepSeek Coder, CodeLlama)

vs

Request: "I'm feeling anxious about my deadline"
    │
    ▼
Classifier → "emotional support"
    │
    ▼
Route to instruction-tuned chat model
How it works:
1. A classifier (a small model, a rule set, or keyword matching) tags the intent
2. The router looks up the intent → model mapping
3. The request is forwarded to the right backend
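The first two steps can be sketched without any network code. This is a minimal sketch using an ordered rule set. The keyword lists, the intent labels, and the backend names `code-model` and `chat-model` are hypothetical, and step 3 (forwarding) is left out.

```python
import re
from typing import Callable

# Step 1: an ordered rule set. Each rule is (intent, predicate); first match wins.
RULES: list[tuple[str, Callable[[str], bool]]] = [
    ("code", lambda p: re.search(
        r"\b(function|component|class|bug|compile)\b", p, re.I) is not None),
    ("emotional_support", lambda p: re.search(
        r"\b(anxious|stressed|worried|lonely)\b", p, re.I) is not None),
]

# Step 2: intent -> backend mapping (hypothetical backend names).
ROUTES = {
    "code": "code-model",
    "emotional_support": "chat-model",
}
DEFAULT_BACKEND = "chat-model"  # used when no rule matches


def classify(prompt: str) -> str:
    """Return the first matching intent, or 'unknown'."""
    for intent, matches in RULES:
        if matches(prompt):
            return intent
    return "unknown"


def resolve(intent: str) -> str:
    """Look up the backend for an intent, falling back to the default."""
    return ROUTES.get(intent, DEFAULT_BACKEND)


if __name__ == "__main__":
    for prompt in (
        "Write a React component for a dropdown",
        "I'm feeling anxious about my deadline",
        "What is the capital of France?",
    ):
        intent = classify(prompt)
        print(f"{intent:18} -> {resolve(intent)}")
```

Keeping classification and lookup separate means the keyword rules could later be swapped for a small model without touching the routing table.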
When Routing Matters Most
Systems with heterogeneous model fleets benefit most:
- OpenAI: GPT-4 for complex, GPT-3.5 for simple, DALL-E for images
- Perplexity: different models for search vs summarisation
- Your stack: llama.cpp for chat, a small embedding model for RAG
Hands-on (15 min)
Set Up a Simple LB with nginx
# inference-lb.conf: weighted round-robin across 2 llama.cpp instances
upstream inference_cluster {
    server 127.0.0.1:8080 weight=2;  # instance 1 (faster quant)
    server 127.0.0.1:8081 weight=1;  # instance 2 (higher quality)
}

server {
    listen 9000;

    location /v1/completions {
        proxy_pass http://inference_cluster;
        proxy_read_timeout 120s;  # generation can take a while
        proxy_set_header Host $host;
    }

    location /health {
        proxy_pass http://inference_cluster/health;
    }
}
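To build intuition for what `weight=2` and `weight=1` do, the sketch below simulates a "smooth" weighted round-robin, the scheme nginx's default upstream balancing is generally described as using. It is a simplified model written for illustration, not nginx's actual code, and the function name `smooth_weighted_rr` is an assumption.

```python
def smooth_weighted_rr(weights: dict[str, int], n: int) -> list[str]:
    """Return the first n picks of a smooth weighted round-robin.

    Each round, every server's running score grows by its weight; the
    highest score wins and is then reduced by the total weight. This
    spreads picks out instead of sending bursts to the heavier server.
    """
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for server, weight in weights.items():
            current[server] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


if __name__ == "__main__":
    picks = smooth_weighted_rr({"127.0.0.1:8080": 2, "127.0.0.1:8081": 1}, 6)
    print(picks)
    print({s: picks.count(s) for s in set(picks)})
```

With the weights from the config, six requests split 4 to 2 in favour of port 8080. The picks are interleaved (8080, 8081, 8080, and so on) rather than grouped.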
Or with a Python Router
#!/usr/bin/env python3
"""semantic-router.py: route requests by detected intent."""
import asyncio
import re

import httpx


class SemanticRouter:
    def __init__(self):
        # One completion endpoint per intent.
        # Note: "fast" expects a third server listening on port 8082.
        self.backends = {
            "code": "http://localhost:8080/v1/completions",
            "chat": "http://localhost:8081/v1/completions",
            "fast": "http://localhost:8082/v1/completions",
        }

    def classify_intent(self, prompt: str) -> str:
        """Simple keyword-based intent classification."""
        code_keywords = r'\b(code|function|class|def |import |react|typescript|python|javascript|api)\b'
        if re.search(code_keywords, prompt, re.IGNORECASE):
            return "code"
        # Short prompts (fewer than 10 words) go to the fast model.
        if len(prompt.split()) < 10:
            return "fast"
        return "chat"

    async def route(self, prompt: str, **kwargs):
        """Classify the prompt, forward it, and return (intent, text)."""
        intent = self.classify_intent(prompt)
        backend_url = self.backends[intent]
        try:
            async with httpx.AsyncClient(timeout=30) as cli:
                resp = await cli.post(backend_url, json={
                    "prompt": prompt,
                    "max_tokens": kwargs.get("max_tokens", 100),
                })
                resp.raise_for_status()  # surface 4xx/5xx instead of a KeyError
                result = resp.json()["choices"][0]["text"]
        except (httpx.HTTPError, KeyError, IndexError, ValueError) as e:
            # Connection failures, bad status codes, or unexpected response shape.
            result = f"[error: {e}]"
        return intent, result


# Test
async def test():
    router = SemanticRouter()
    prompts = [
        "Write a Python function to sort a list",
        "How are you today?",
        "hi",
    ]
    for p in prompts:
        intent, result = await router.route(p, max_tokens=30)
        print(f"[{intent.upper()}] {p}")
        print(f"  → {result[:80]}...\n")


if __name__ == "__main__":
    asyncio.run(test())
Run it:
cd /tmp
python3 semantic-router.py
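Quick nginx LB test (if Docker and nginx are available). Start the two llama.cpp servers that the nginx config above balances across:
# Assumption: the GGUF file lives in /path/to/models on the host; adjust the -v mount to your setup.
docker run -d --name llama-a -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2.5-3b-q4.gguf -c 4096 --port 8080
docker run -d --name llama-b -p 8081:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2.5-3b-q4.gguf -c 4096 --port 8080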
Key Takeaways
- Classic load balancing treats all inference requests equally, but they aren't
- Semantic routing classifies intent and sends requests to the best model for the job
- Heterogeneous model fleets (fast/small + big/smart) can be more cost-effective than sending every request to one big model
- nginx/Traefik work well for Layer-7 routing; add a sidecar for semantic routing