Day 4: Load Balancing & Routing


Learning Objectives

Understand why AI systems need load balancing beyond simple round-robin
Learn semantic routing: classify → route to specialised model
Set up a basic load balancer across inference server instances

Theory (15 min)

Why Load Balance Inference?

There are several reasons to run more than one model instance:

Throughput — one server can only handle N concurrent requests
Specialisation — different models for different tasks (chat vs code vs embedding)
Fault tolerance — if one instance crashes, others take over
A/B testing — slowly roll out new model versions

Classic LB Algorithms

Algorithm	How it works	Best for
Round-robin	Server A, B, C, A, B, C…	Equal-capacity servers
Least connections	Send to the server with the fewest active connections	Variable request duration
IP hash	Always route same client to same server	Session affinity
Random	Pick one at random	Simple, good enough under load
Weighted	Round-robin weighted by capacity	Heterogeneous hardware

Problem: None of these algorithms look at what the request is asking for. A code-generation prompt and a one-word greeting are treated identically.
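As a rough illustration of the table above, three of the classic algorithms fit in a few lines of Python. This is a minimal sketch, not a production balancer: the server names and the function names are hypothetical, and the IP hash uses CRC32 purely for convenience.

```python
import itertools
import zlib

SERVERS = ["A", "B", "C"]  # hypothetical backend names


def round_robin(servers):
    """Return an iterator that cycles A, B, C, A, B, C, ..."""
    return itertools.cycle(servers)


def least_connections(active):
    """Pick the server with the fewest in-flight requests.

    `active` maps server name -> number of currently open requests.
    """
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to the same server on every call (session affinity)."""
    digest = zlib.crc32(client_ip.encode("utf-8"))
    return servers[digest % len(servers)]


if __name__ == "__main__":
    rr = round_robin(SERVERS)
    print([next(rr) for _ in range(5)])
    print(least_connections({"A": 4, "B": 1, "C": 3}))
    print(ip_hash("203.0.113.7", SERVERS))
```

Note that none of these functions ever receives the prompt, which is exactly the limitation described above.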

Semantic Routing (Model-Aware)

The real power: route by intent. A classifier inspects the request and sends it to the best-suited model.

Request: "Write a React component for a dropdown"
  │
  ▼
Classifier → "code generation"
  │
  ▼
Route to code-specialised model (DeepSeek Coder, CodeLlama)

vs

Request: "I'm feeling anxious about my deadline"
  │
  ▼
Classifier → "emotional support"
  │
  ▼
Route to instruction-tuned chat model

How it works:

1. A classifier (could be a small model, a rule set, or keywords) tags the intent
2. The router looks up the intent → model mapping
3. The request is forwarded to the right backend

When Routing Matters Most

Systems with heterogeneous model fleets benefit most:

OpenAI: GPT-4 for complex, GPT-3.5 for simple, DALL-E for images
Perplexity: different models for search vs summarisation
Your stack: llama.cpp for chat, a small embedding model for RAG

Hands-on (15 min)

Set Up a Simple LB with nginx

# inference-lb.conf: weighted round-robin across 2 llama.cpp instances
upstream inference_cluster {
    server 127.0.0.1:8080 weight=2;   # instance 1 (faster quant)
    server 127.0.0.1:8081 weight=1;   # instance 2 (higher quality)
}

server {
    listen 9000;

    location /v1/completions {
        proxy_pass http://inference_cluster;
        proxy_read_timeout 120s;
        proxy_set_header Host $host;
    }

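    # Each health request is proxied to one instance (whichever nginx picks),
    # so a single 200 here does not prove that every instance is up.
    location /health {
        proxy_pass http://inference_cluster/health;
    }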
}
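With weight=2 and weight=1, roughly two out of every three requests should land on port 8080. To see what order that produces, here is a small Python simulation of a "smooth" weighted round-robin, the scheme nginx is generally described as using for weighted upstreams. Treat it as an illustration of the idea rather than a copy of nginx's implementation; the function name is made up for this example.

```python
def smooth_weighted_rr(weights, n):
    """Return the first n picks of a smooth weighted round-robin.

    weights: dict mapping server name -> positive integer weight.
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server gains its weight each round...
        for name, weight in weights.items():
            current[name] += weight
        # ...the current leader is chosen and pushed back by the total weight.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


if __name__ == "__main__":
    weights = {"127.0.0.1:8080": 2, "127.0.0.1:8081": 1}
    for server in smooth_weighted_rr(weights, 6):
        print(server)
```

The picks interleave (8080, 8081, 8080, then the pattern repeats) instead of sending two requests in a row to the heavier server before touching the lighter one.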

Or with a Python Router

#!/usr/bin/env python3
"""semantic-router.py — route by intent detection."""
import re
import httpx

class SemanticRouter:
    def __init__(self):
        self.backends = {
            "code": "http://localhost:8080/v1/completions",
            "chat": "http://localhost:8081/v1/completions",
            "fast": "http://localhost:8082/v1/completions",
        }

    def classify_intent(self, prompt: str) -> str:
        """Simple keyword-based intent classification."""
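        # Whole-word match; no trailing spaces inside the group, which would
        # stop "def" or "import" from matching at the end of a prompt
        code_keywords = r'\b(code|function|class|def|import|react|typescript|python|javascript|api)\b'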
        if re.search(code_keywords, prompt, re.IGNORECASE):
            return "code"

        short = len(prompt.split()) < 10
        if short:
            return "fast"

        return "chat"

    async def route(self, prompt: str, **kwargs):
        intent = self.classify_intent(prompt)
        backend_url = self.backends[intent]

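        try:
            async with httpx.AsyncClient(timeout=30) as cli:
                resp = await cli.post(backend_url, json={
                    "prompt": prompt,
                    "max_tokens": kwargs.get("max_tokens", 100),
                })
                # Surface 4xx/5xx responses instead of failing later with a KeyError
                resp.raise_for_status()
                result = resp.json()["choices"][0]["text"]
        except Exception as e:
            # Demo-level handling: report the failure in place of a completion
            result = f"[error: {e}]"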

        return intent, result

# Test
import asyncio

async def test():
    router = SemanticRouter()
    prompts = [
        "Write a Python function to sort a list",
        "How are you today?",  # under 10 words, so this is classified "fast", not "chat"
        "hi",
    ]
    for p in prompts:
        intent, result = await router.route(p, max_tokens=30)
        print(f"[{intent.upper()}] {p}")
        print(f"  → {result[:80]}...\n")

asyncio.run(test())

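Run it (this assumes the script is saved as /tmp/semantic-router.py and httpx is installed; any backend that is not running shows up as an [error: ...] result):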

cd /tmp
python3 semantic-router.py
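The router above gives up as soon as its chosen backend fails. The fault-tolerance idea from the theory section could be sketched as an ordered list of backends that are tried in turn. Everything here is illustrative: `route_with_fallback` and `fake_send` are hypothetical names, and `fake_send` stands in for the real HTTP call so the example runs without any servers.

```python
import asyncio


async def route_with_fallback(prompt, backends, send):
    """Try each backend URL in order; return (url, text) from the first success.

    `send` is an async callable (url, prompt) -> str that raises on failure.
    """
    errors = []
    for url in backends:
        try:
            return url, await send(url, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


async def fake_send(url, prompt):
    """Stand-in for an HTTP call: the backend on port 8080 is 'down'."""
    if url.endswith(":8080"):
        raise ConnectionError("connection refused")
    return f"echo from {url}: {prompt}"


if __name__ == "__main__":
    backends = ["http://localhost:8080", "http://localhost:8081"]
    print(asyncio.run(route_with_fallback("hi", backends, fake_send)))
```

In a real router, `send` would wrap the httpx call, and the backend list for each intent would start with the preferred model and end with a general-purpose one.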

Quick nginx LB test: start two llama.cpp server containers for nginx to balance across (requires Docker):

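# Assumption: the GGUF file lives in ./models on the host; adjust the mount to your setup.
# --host 0.0.0.0 makes the server reachable through the published port.
docker run -d --name llama-a -p 8080:8080 \
  -v "$PWD/models:/models" \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2.5-3b-q4.gguf -c 4096 --host 0.0.0.0 --port 8080

docker run -d --name llama-b -p 8081:8080 \
  -v "$PWD/models:/models" \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2.5-3b-q4.gguf -c 4096 --host 0.0.0.0 --port 8080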
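Before pointing nginx at the two containers, it can help to confirm that each one answers on its own port. The sketch below probes the /health path on both instances using only the standard library. The function names are hypothetical, and the probe is injectable so the logic can be exercised without live servers.

```python
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        return False


def check_backends(urls, probe=http_probe):
    """Map each health URL to True (up) or False (down)."""
    return {url: probe(url) for url in urls}


if __name__ == "__main__":
    urls = [
        "http://127.0.0.1:8080/health",
        "http://127.0.0.1:8081/health",
    ]
    for url, ok in check_backends(urls).items():
        print(f"{url}: {'up' if ok else 'down'}")
```

A model that is still loading may report as down for a while, so a "down" result right after `docker run` is not necessarily a failure.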

Key Takeaways

Classic load balancing treats all inference requests equally — but they aren't
Semantic routing classifies intent and sends requests to the best model for the job
Heterogeneous model fleets (fast/small + big/smart) can be more cost-effective than one big model, because simple requests never reach the expensive model
nginx/Traefik work well for Layer-7 routing; add a sidecar for semantic routing
