🧠 AI System Design

Day 4: Load Balancing & Routing

📂 Foundations · 📖 15 min read

Learning Objectives

  • Understand why AI systems need load balancing beyond simple round-robin
  • Learn semantic routing: classify → route to specialised model
  • Set up a basic load balancer across inference server instances

Theory (15 min)

Why Load Balance Inference?

There are several reasons to run more than one model instance:

  1. Throughput — one server can only handle N concurrent requests
  2. Specialisation — different models for different tasks (chat vs code vs embedding)
  3. Fault tolerance — if one instance crashes, others take over
  4. A/B testing — slowly roll out new model versions
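
A gradual rollout needs a stable way to decide which users see the new version. One common approach, sketched here under assumptions, hashes a user ID into a bucket from 0 to 99 and sends the lowest buckets to the new model. The function name `pick_version`, the labels "stable" and "canary", and the `user-N` IDs are all made up for illustration.

```python
import hashlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Deterministically assign a user to the 'stable' or 'canary' model version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]  # hypothetical user IDs
    share = sum(pick_version(u, 10) == "canary" for u in users) / len(users)
    print(f"canary share at 10%: {share:.1%}")
```

Because the bucket depends only on the user ID, a user keeps the same version between requests, and raising `canary_percent` only ever moves users from stable to canary, never back.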

Classic LB Algorithms

| Algorithm         | How it works                            | Best for                       |
|-------------------|-----------------------------------------|--------------------------------|
| Round-robin       | Server A, B, C, A, B, C…                | Equal-capacity servers         |
| Least connections | Send to server with fewest active       | Variable request duration      |
| IP hash           | Always route same client to same server | Session affinity               |
| Random            | Pick one at random                      | Simple, good enough under load |
| Weighted          | Round-robin weighted by capacity        | Heterogeneous hardware         |
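
To make two of these concrete, here is a minimal sketch of round-robin and least-connections selection. The class names and the server labels "A", "B", "C" are illustrative; a real balancer would also handle health checks and concurrent access.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # min() returns the first server with the lowest count, so ties go
        # to whichever server was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobin(["A", "B", "C"])
print([rr.pick() for _ in range(6)])

lc = LeastConnections(["A", "B"])
first = lc.acquire()    # goes to A and stays busy (a long generation)
second = lc.acquire()   # goes to B
lc.release(second)      # B finishes quickly
print(lc.acquire())     # B again, because A is still busy
```

Least connections suits inference well because generation time varies widely with prompt and output length, so an even split of request counts is not an even split of work.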

Problem: These don't know what kind of request is coming. All requests are treated equally.

Semantic Routing (Model-Aware)

The real power: route by intent. A classifier inspects the request and sends it to the best-suited model.

Request: "Write a React component for a dropdown"
  │
  ▼
Classifier → "code generation"
  │
  ▼
Route to code-specialised model (DeepSeek Coder, CodeLlama)

vs

Request: "I'm feeling anxious about my deadline"
  │
  ▼
Classifier → "emotional support"
  │
  ▼
Route to instruction-tuned chat model

How it works:

  1. Classifier (could be a small model, a rule set, or keywords) tags the intent
  2. Router looks up intent → model mapping
  3. Request is forwarded to the right backend
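
The first two steps can be sketched as a classifier plus a lookup table with a default. The keyword lists, the model names "code-model" and "chat-model", and the "unknown" label are assumptions for illustration; step 3 (forwarding) is left out so the sketch stays self-contained.

```python
# Step 2 data: intent -> model mapping (model names are placeholders).
INTENT_TO_MODEL = {
    "code generation": "code-model",
    "emotional support": "chat-model",
}
DEFAULT_MODEL = "chat-model"  # used when the classifier has no confident tag


def classify(prompt: str) -> str:
    """Step 1: tag the intent with a tiny keyword rule set."""
    text = prompt.lower()
    if any(word in text for word in ("react", "function", "component")):
        return "code generation"
    if any(word in text for word in ("anxious", "stressed", "worried")):
        return "emotional support"
    return "unknown"


def route(prompt: str):
    """Step 2: look up the model for the intent, falling back to the default."""
    intent = classify(prompt)
    return intent, INTENT_TO_MODEL.get(intent, DEFAULT_MODEL)


for p in ("Write a React component for a dropdown",
          "I'm feeling anxious about my deadline",
          "What's the capital of France?"):
    print(route(p), "<-", p)
```

The default matters: any classifier will meet prompts it cannot tag, and those should still reach a general-purpose model rather than fail.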

When Routing Matters Most

Systems with heterogeneous model fleets benefit most:

  • OpenAI: GPT-4 for complex, GPT-3.5 for simple, DALL-E for images
  • Perplexity: different models for search vs summarisation
  • Your stack: llama.cpp for chat, a small embedding model for RAG


Hands-on (15 min)

Set Up a Simple LB with nginx

# inference-lb.conf: weighted round-robin (2:1) across 2 llama.cpp instances
upstream inference_cluster {
    server 127.0.0.1:8080 weight=2;   # instance 1 (faster quant)
    server 127.0.0.1:8081 weight=1;   # instance 2 (higher quality)
}

server {
    listen 9000;

    location /v1/completions {
        proxy_pass http://inference_cluster;
        proxy_read_timeout 120s;
        proxy_set_header Host $host;
    }

    location /health {
        proxy_pass http://inference_cluster/health;
    }
}
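
To see what `weight=2` and `weight=1` mean for the request order, the following simulates a "smooth" weighted round-robin, which is the scheme nginx's default balancer is generally described as using. It is an illustrative simulation, not nginx's code, and the function name is made up.

```python
def smooth_weighted_rr(weights: dict, n: int) -> list:
    """Simulate n picks of smooth weighted round-robin over {server: weight}."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server gains its weight, the current leader is chosen,
        # and the leader is then pushed back by the total weight.
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = smooth_weighted_rr({"127.0.0.1:8080": 2, "127.0.0.1:8081": 1}, 6)
for server in picks:
    print(server)
```

Over any six consecutive requests the instance on port 8080 receives four and the one on 8081 receives two, interleaved rather than sent in bursts.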

Or with a Python Router

#!/usr/bin/env python3
"""semantic-router.py — route by intent detection."""
import re

import httpx  # third-party: pip install httpx

class SemanticRouter:
    def __init__(self):
        self.backends = {
            "code": "http://localhost:8080/v1/completions",
            "chat": "http://localhost:8081/v1/completions",
            "fast": "http://localhost:8082/v1/completions",
        }

    def classify_intent(self, prompt: str) -> str:
        """Simple keyword-based intent classification."""
        code_keywords = r'\b(code|function|class|def |import |react|typescript|python|javascript|api)\b'
        if re.search(code_keywords, prompt, re.IGNORECASE):
            return "code"

        short = len(prompt.split()) < 10
        if short:
            return "fast"

        return "chat"

    async def route(self, prompt: str, **kwargs):
        intent = self.classify_intent(prompt)
        backend_url = self.backends[intent]

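        try:
            async with httpx.AsyncClient(timeout=30) as cli:
                resp = await cli.post(backend_url, json={
                    "prompt": prompt,
                    "max_tokens": kwargs.get("max_tokens", 100),
                })
                resp.raise_for_status()  # surface 4xx/5xx instead of a confusing KeyError
                result = resp.json()["choices"][0]["text"]
        except (httpx.HTTPError, KeyError, IndexError, ValueError) as e:
            # Network failure, bad status code, or an unexpected response shape
            result = f"[error: {e}]"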

        return intent, result

# Test
import asyncio

async def test():
    router = SemanticRouter()
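    prompts = [
        # keyword match -> "code"
        "Write a Python function to sort a list",
        # 10+ words and no code keywords -> "chat"
        "I'm feeling anxious about my deadline and can't focus, any advice?",
        # under 10 words -> "fast"
        "hi",
    ]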
    for p in prompts:
        intent, result = await router.route(p, max_tokens=30)
        print(f"[{intent.upper()}] {p}")
        print(f"  → {result[:80]}...\n")

asyncio.run(test())

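Run it (assumes the script is saved as /tmp/semantic-router.py and httpx is installed, e.g. with pip install httpx):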

cd /tmp
python3 semantic-router.py
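
The router above reports an error when its chosen backend is down. To get the fault tolerance described in the theory section, the request can be retried on another backend instead. This is a minimal sketch: `route_with_failover` and `fake_send` are hypothetical names, and the sender is passed in as a parameter so the example runs without any server.

```python
import asyncio


async def route_with_failover(payload, backends, send):
    """Try each backend URL in order; return (url, result) from the first success."""
    errors = []
    for url in backends:
        try:
            return url, await send(url, payload)
        except Exception as exc:  # sketch only: catch narrower errors in real code
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


async def fake_send(url, payload):
    """Stand-in for an HTTP call: the instance on port 8080 is 'down'."""
    if url.startswith("http://localhost:8080/"):
        raise ConnectionError("connection refused")
    return f"ok from {url}"


async def demo():
    backends = [
        "http://localhost:8080/v1/completions",
        "http://localhost:8081/v1/completions",
    ]
    return await route_with_failover({"prompt": "hi"}, backends, fake_send)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the real router, `send` would wrap the `httpx` POST, and the backend list for each intent would name the preferred model first and a general-purpose fallback second.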

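Quick nginx LB test (if nginx and Docker are available). First start two llama.cpp servers for nginx to balance across; /path/to/models is a placeholder for the host directory that holds the GGUF file:

docker run -d --name llama-a -p 8080:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2.5-3b-q4.gguf -c 4096 --host 0.0.0.0 --port 8080

docker run -d --name llama-b -p 8081:8080 \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2.5-3b-q4.gguf -c 4096 --host 0.0.0.0 --port 8080

With both containers up, load inference-lb.conf into nginx and send requests to port 9000.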

Key Takeaways

  • Classic load balancing treats all inference requests equally — but they aren't
  • Semantic routing classifies intent and sends requests to the best model for the job
  • Heterogeneous model fleets (fast/small + big/smart) can be more cost-effective than one big model, because simple requests never reach the expensive model
  • nginx/Traefik work well for Layer-7 routing; add a sidecar for semantic routing

References