🧠 AI System Design

Day 7: Mini-Project - AI Gateway

Learning Objectives

  • Integrate everything from Days 1–6 into a single, runnable AI gateway
  • Understand how components compose in a microservice architecture
  • Containerise and test the full stack

Theory (15 min)

The Gateway Pattern

An AI gateway is the single entry point for all inference traffic. It handles:

                    ┌────────────────┐
                    │   AI Gateway   │
                    │                │
Client ──HTTP──▶    ├─ Rate Limiter  │
                    ├─ Auth          │
                    ├─ Cache         │───▶ Inference Server (llama.cpp)
                    ├─ Router        │
                    ├─ Logger        │
                    └────────────────┘

What it provides:

  • Single URL for clients: simplifies front-end code
  • Cross-cutting concerns in one place (auth, rate limit, cache, logging)
  • Backend abstraction: swap models without changing client code
  • Observability: one place to monitor all traffic
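
Backend abstraction can be as small as a lookup table inside the gateway. The following is a minimal sketch, assuming clients send an optional `model` field in the request body; `BACKENDS`, `DEFAULT_MODEL`, `pick_backend`, and both URLs are hypothetical names chosen for illustration and are not part of today's script.

```python
# Hypothetical routing table: model name -> backend URL (illustrative values only).
BACKENDS = {
    "small": "http://localhost:8080/v1/completions",
    "large": "http://localhost:8081/v1/completions",
}
DEFAULT_MODEL = "small"


def pick_backend(request_body, backends=BACKENDS, default=DEFAULT_MODEL):
    """Return the backend URL for the model named in the request body."""
    model = request_body.get("model", default)
    # Unknown model names fall back to the default backend instead of failing.
    return backends.get(model, backends[default])


print(pick_backend({"prompt": "Hello!"}))
print(pick_backend({"prompt": "Hello!", "model": "large"}))
```

Clients keep calling the same gateway URL; swapping or adding a model only changes the table.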

Architecture Decisions

Today's mini-project combines:

  1. Rate limiter (Day 6): protects the server from overload
  2. Response cache (Day 3): short-circuits duplicate requests
  3. Request logger: structured logging per request
  4. Health endpoint: Kubernetes/Docker readiness checks

All of this runs as a single Python service, containerised with Docker and pointed at your existing llama.cpp server.

Why a Gateway vs Ad-hoc Middleware?

Approach            Pros                                        Cons
Gateway service     Centralised, language-agnostic, reusable    Extra network hop
Middleware in app   No extra hop                                Tied to framework
Sidecar proxy       Transparent to app                          More complex orchestration

For your VPS: A gateway service (single container) is the sweet spot.


Hands-on (15 min)

Build and Run the AI Gateway

#!/usr/bin/env python3
"""ai-gateway.py: unified inference gateway."""
import hashlib
import json
import time
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from collections import OrderedDict
import threading
import urllib.request

# ── Config ──────────────────────────────────────────────────────────────
LLM_URL = os.getenv("LLM_URL", "http://localhost:8080/v1/completions")
GATEWAY_PORT = int(os.getenv("GATEWAY_PORT", "9001"))

# ── Rate Limiter (Token Bucket) ─────────────────────────────────────────
class TokenBucket:
    def __init__(self, rate=5, capacity=10):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
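        self.last_refill = time.monotonic()  # monotonic clock: unaffected by wall-clock jumps
        self.lock = threading.Lock()

    def consume(self):
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at the bucket capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now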
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# ── Response Cache (LRU) ────────────────────────────────────────────────
class LRUCache:
    def __init__(self, capacity=64, ttl=300):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.ttl = ttl

    def _key(self, prompt, **params):
        raw = f"{prompt}|{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt, **params):
        key = self._key(prompt, **params)
        if key in self.cache:
            entry = self.cache[key]
            if time.time() - entry["ts"] < self.ttl:
                self.cache.move_to_end(key)
                return entry["response"]
            del self.cache[key]
        return None

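    def set(self, prompt, response, **params):
        key = self._key(prompt, **params)
        if key in self.cache:
            # Refreshing an existing entry: no eviction needed
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            # Evict the least recently used entry
            self.cache.popitem(last=False)
        self.cache[key] = {"response": response, "ts": time.time()}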

# ── Gateway Handler ─────────────────────────────────────────────────────
rate_limiter = TokenBucket()
cache = LRUCache()

class GatewayHandler(BaseHTTPRequestHandler):
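    def do_POST(self):
        start = time.time()
        content_len = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(content_len)
        # Reject malformed or non-object JSON instead of crashing the handler
        try:
            data = json.loads(body)
        except ValueError:
            data = None
        if not isinstance(data, dict):
            self._respond(400, {"error": "request body must be a JSON object"})
            return
        prompt = data.get("prompt", "")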

        # 1. Rate limit check
        if not rate_limiter.consume():
            self._respond(429, {"error": "rate limit exceeded"})
            return

        # 2. Cache check (compare against None so an empty completion still counts as a hit)
        cached = cache.get(prompt, max_tokens=data.get("max_tokens", 50))
        if cached is not None:
            elapsed = time.time() - start
            self._respond(200, {
                "cached": True,
                "elapsed_ms": round(elapsed * 1000),
                "choices": [{"text": cached}],
            })
            return

        # 3. Forward to LLM
        try:
            req = urllib.request.Request(
                LLM_URL,
                data=json.dumps(data).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=30) as resp:
                llm_response = json.loads(resp.read())
            text = llm_response["choices"][0]["text"]
        except Exception as e:
            self._respond(502, {"error": f"LLM error: {e}"})
            return

        # 4. Cache the response, keyed on the same parameters as the lookup in step 2
        cache.set(prompt, text, max_tokens=data.get("max_tokens", 50))

        elapsed = time.time() - start
        self._respond(200, {
            "cached": False,
            "elapsed_ms": round(elapsed * 1000),
            "choices": [{"text": text}],
        })

    def do_GET(self):
        if self.path == "/health":
            self._respond(200, {"status": "ok"})
        elif self.path == "/stats":
            self._respond(200, {
                "cache_size": len(cache.cache),
                "cache_capacity": cache.capacity,
            })
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, status, data):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())

    def log_message(self, format, *args):
        pass  # silent (or add structured logging here)


if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", GATEWAY_PORT), GatewayHandler)
    print(f"🚀 AI Gateway running on port {GATEWAY_PORT} → {LLM_URL}")
    print(f"   Rate limit: {rate_limiter.rate}/s, burst {rate_limiter.capacity}")
    print(f"   Cache: {cache.capacity} entries, {cache.ttl}s TTL")
    print(f"   Health: http://localhost:{GATEWAY_PORT}/health")
    server.serve_forever()
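
The script leaves `log_message` silent, so the request logger from the component list is not yet wired in. One possible shape for it is sketched below, assuming one JSON object per line on stdout is acceptable; `log_request` and its field names are hypothetical, not part of the script above.

```python
import json
import sys
import time


def log_request(method, path, status, elapsed_ms, cached=None, stream=sys.stdout):
    """Write one JSON object per request (JSON Lines) and return the record."""
    record = {
        "ts": round(time.time(), 3),  # wall-clock timestamp in seconds
        "method": method,
        "path": path,
        "status": status,
        "elapsed_ms": elapsed_ms,
        "cached": cached,  # True/False for completions, None for other routes
    }
    stream.write(json.dumps(record) + "\n")
    stream.flush()
    return record


log_request("POST", "/v1/completions", 200, 12, cached=True)
```

Inside `GatewayHandler`, a call such as `log_request(self.command, self.path, status, elapsed_ms)` could be placed in `_respond`, since `BaseHTTPRequestHandler` exposes the request method and path as `self.command` and `self.path`.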

Run It

# 1. Save the script
cat > /tmp/ai-gateway.py << 'EOF'
# [paste the script above]
EOF

# 2. Run alongside your existing llama.cpp server
LLM_URL=http://localhost:8080/v1/completions python3 /tmp/ai-gateway.py &

# 3. Test
curl -s http://localhost:9001/health
curl -s http://localhost:9001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello!","max_tokens":20}'

# 4. Test cache (second call is faster and returns "cached": true)
curl -s http://localhost:9001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello!","max_tokens":20}'

# 5. Test rate limit (rapid calls): expect 200s until the burst of 10 is spent, then 429s
for i in $(seq 1 15); do
  curl -s -o /dev/null -w "%{http_code} " \
    http://localhost:9001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt":"test","max_tokens":5}'
done
echo ""
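
To see why the loop in step 5 prints a run of 200s followed by 429s, the token bucket can be replayed offline with an explicit clock. This is a sketch only: `SimulatedBucket` and `simulate` are hypothetical helpers that copy the refill rule from `TokenBucket` but take the current time as an argument instead of reading the system clock.

```python
class SimulatedBucket:
    """Same refill rule as TokenBucket in the script, driven by an explicit clock."""

    def __init__(self, rate=5, capacity=10):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = 0.0

    def consume(self, now):
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate(n_requests, spacing_s, rate=5, capacity=10):
    """Return the status code each request would get when sent every spacing_s seconds."""
    bucket = SimulatedBucket(rate, capacity)
    return [200 if bucket.consume(i * spacing_s) else 429 for i in range(n_requests)]


print(simulate(15, 0.0))   # all 15 requests arrive at the same instant
print(simulate(15, 0.25))  # one request every 250 ms
```

The first call prints ten 200s followed by five 429s, because only the initial burst of 10 tokens is available. The second prints fifteen 200s, because 250 ms of refill at 5 tokens/s restores more than the single token each request spends. The real curl loop falls between these cases: each request takes some time, so a few extra requests may succeed.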

Docker Compose Integration

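Add the gateway to your compose stack. The `build: .` entry assumes a Dockerfile next to the compose file that copies ai-gateway.py into a Python image and runs it: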

services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports: ["8080:8080"]
    volumes: [./models:/models]
    command: -m /models/qwen2.5-3b-q4.gguf -c 4096 --port 8080

  ai-gateway:
    build: .
    ports: ["9001:9001"]
    environment:
      LLM_URL: http://llama-server:8080/v1/completions
      GATEWAY_PORT: "9001"
    depends_on: [llama-server]
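
In its short form, `depends_on` only controls start order; it does not wait for a service to be ready. A small poller against the gateway's `/health` endpoint is one way to check readiness after `docker compose up`. The sketch below assumes the `{"status": "ok"}` response from the script above; `wait_until_healthy` and its parameters are hypothetical names.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, timeout_s=30.0, interval_s=0.5):
    """Poll `url` until it returns HTTP 200 with {"status": "ok"}; False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200 and json.loads(resp.read()).get("status") == "ok":
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # not up yet: connection refused, timeout, or non-JSON body
        time.sleep(interval_s)
    return False
```

For example, `wait_until_healthy("http://localhost:9001/health")` returns `True` once the gateway answers. Note that this only confirms the gateway process is up; the `/health` handler in the script does not probe the llama.cpp backend.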

Key Takeaways

  • An AI gateway consolidates rate limiting, caching, routing, and logging into one service
  • The gateway pattern decouples clients from inference backends
  • Each component (rate limiter, cache) is independently swappable
  • This is the architecture used by OpenAI, Anthropic, and every major AI API provider
