Day 7: Mini-Project — AI Gateway


Learning Objectives

Integrate everything from Days 1–6 into a single, runnable AI gateway
Understand how components compose in a microservice architecture
Containerise and test the full stack

Theory (15 min)

The Gateway Pattern

An AI gateway is the single entry point for all inference traffic. It handles:

                    ┌──────────────┐
                    │  AI Gateway   │
                    │               │
Client ──HTTP──▶    ├─ Rate Limiter│
                    ├─ Auth        │
                    ├─ Cache       │───▶ Inference Server (llama.cpp)
                    ├─ Router      │
                    ├─ Logger      │
                    └──────────────┘

What it provides:

- Single URL for clients, which simplifies front-end code
- Cross-cutting concerns in one place (auth, rate limit, cache, logging)
- Backend abstraction: swap models without changing client code
- Observability: one place to monitor all traffic
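
To make the backend-abstraction point concrete, here is a minimal sketch of a routing table. The model names, the second backend URL, and the resolve_backend helper are hypothetical and not part of today's script; the only idea shown is that clients name a model and the gateway decides where the request goes.

```python
# Hypothetical routing table: model name -> backend URL (illustrative values only).
BACKENDS = {
    "small": "http://localhost:8080/v1/completions",
    "large": "http://localhost:8081/v1/completions",
}
DEFAULT_MODEL = "small"


def resolve_backend(request_body: dict) -> str:
    """Pick a backend URL from the request's optional "model" field."""
    model = request_body.get("model", DEFAULT_MODEL)
    # Unknown model names fall back to the default rather than failing.
    return BACKENDS.get(model, BACKENDS[DEFAULT_MODEL])


print(resolve_backend({"prompt": "Hello!", "model": "large"}))
print(resolve_backend({"prompt": "Hello!"}))
```

Swapping a model then means editing one dictionary entry in the gateway; client code keeps calling the same URL.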

Architecture Decisions

Today's mini-project combines:

1. Rate limiter (Day 6): protects the server from overload
2. Response cache (Day 3): short-circuits duplicate requests
3. Request logger: structured logging per request (left as a stub in log_message in the script below)
4. Health endpoint: Kubernetes/Docker readiness checks

All of this runs as a single Python service, containerised with Docker and pointed at your existing llama.cpp server.

Why a Gateway vs Ad-hoc Middleware?

Approach	Pros	Cons
Gateway service	Centralised, language-agnostic, reusable	Extra network hop
Middleware in app	No extra hop	Tied to framework
Sidecar proxy	Transparent to app	More complex orchestration

For your VPS, a gateway service in a single container is the sweet spot: the extra hop stays on the same host, and the cross-cutting logic lives in one place.

Hands-on (15 min)

Build and Run the AI Gateway

#!/usr/bin/env python3
"""ai-gateway.py — unified inference gateway."""
import hashlib
import json
import time
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from collections import OrderedDict
import threading
import urllib.request

# ── Config ──────────────────────────────────────────────────────────────
LLM_URL = os.getenv("LLM_URL", "http://localhost:8080/v1/completions")
GATEWAY_PORT = int(os.getenv("GATEWAY_PORT", "9001"))

# ── Rate Limiter (Token Bucket) ─────────────────────────────────────────
class TokenBucket:
    def __init__(self, rate=5, capacity=10):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
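        self.last_refill = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
        self.lock = threading.Lock()

    def consume(self):
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now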
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# ── Response Cache (LRU) ────────────────────────────────────────────────
class LRUCache:
    def __init__(self, capacity=64, ttl=300):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.ttl = ttl

    def _key(self, prompt, **params):
        raw = f"{prompt}|{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt, **params):
        key = self._key(prompt, **params)
        if key in self.cache:
            entry = self.cache[key]
            if time.time() - entry["ts"] < self.ttl:
                self.cache.move_to_end(key)
                return entry["response"]
            del self.cache[key]
        return None

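    def set(self, prompt, response, **params):
        key = self._key(prompt, **params)
        # Evict the least recently used entry only when adding a new key
        if key not in self.cache and len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        self.cache[key] = {"response": response, "ts": time.time()}
        self.cache.move_to_end(key)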

# ── Gateway Handler ─────────────────────────────────────────────────────
rate_limiter = TokenBucket()
cache = LRUCache()

class GatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        start = time.time()
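        content_len = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(content_len)
        try:
            data = json.loads(body)
            prompt = data.get("prompt", "")
        except (ValueError, AttributeError):
            # Malformed JSON, or a JSON value that is not an object
            self._respond(400, {"error": "request body must be a JSON object"})
            return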

        # 1. Rate limit check
        if not rate_limiter.consume():
            self._respond(429, {"error": "rate limit exceeded"})
            return

        # 2. Cache check
        cached = cache.get(prompt, max_tokens=data.get("max_tokens", 50))
        if cached is not None:  # an empty string is still a valid cached response
            elapsed = time.time() - start
            self._respond(200, {
                "cached": True,
                "elapsed_ms": round(elapsed * 1000),
                "choices": [{"text": cached}],
            })
            return

        # 3. Forward to LLM
        try:
            req = urllib.request.Request(
                LLM_URL,
                data=json.dumps(data).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=30) as resp:
                llm_response = json.loads(resp.read())
            text = llm_response["choices"][0]["text"]
        except Exception as e:
            self._respond(502, {"error": f"LLM error: {e}"})
            return

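        # 4. Cache the response under the same key parameters used by cache.get
        #    (passing **data would also pass "prompt" twice and raise TypeError)
        cache.set(prompt, text, max_tokens=data.get("max_tokens", 50))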

        elapsed = time.time() - start
        self._respond(200, {
            "cached": False,
            "elapsed_ms": round(elapsed * 1000),
            "choices": [{"text": text}],
        })

    def do_GET(self):
        if self.path == "/health":
            self._respond(200, {"status": "ok"})
        elif self.path == "/stats":
            self._respond(200, {
                "cache_size": len(cache.cache),
                "cache_capacity": cache.capacity,
            })
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, status, data):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())

    def log_message(self, format, *args):
        pass  # silent (or add structured logging here)


if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", GATEWAY_PORT), GatewayHandler)
    print(f"🚀 AI Gateway running on port {GATEWAY_PORT} → {LLM_URL}")
    print(f"   Rate limit: {rate_limiter.rate}/s, burst {rate_limiter.capacity}")
    print(f"   Cache: {cache.capacity} entries, {cache.ttl}s TTL")
    print(f"   Health: http://localhost:{GATEWAY_PORT}/health")
    server.serve_forever()
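
The script above silences log_message and leaves structured logging as an exercise. One possible shape for it is sketched below: a single JSON object per request. The log_request helper and its field names are assumptions for illustration, not something defined earlier in the course.

```python
import json
import sys
import time


def log_request(method, path, status, elapsed_ms, cached=None, stream=sys.stdout):
    """Write one JSON object per request and return the serialized line."""
    record = {
        "ts": round(time.time(), 3),
        "method": method,
        "path": path,
        "status": status,
        "elapsed_ms": elapsed_ms,
    }
    # Only POST responses know whether they were served from cache.
    if cached is not None:
        record["cached"] = cached
    line = json.dumps(record)
    print(line, file=stream, flush=True)
    return line


log_request("POST", "/v1/completions", 200, 412, cached=False)
```

Inside the handler, this could be called just before each _respond, passing self.command and self.path along with the status code. One line per request keeps the output easy to filter with standard JSON tools.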

Run It

# 1. Save the script
cat > /tmp/ai-gateway.py << 'EOF'
# [paste the script above]
EOF

# 2. Run alongside your existing llama.cpp server
LLM_URL=http://localhost:8080/v1/completions python3 /tmp/ai-gateway.py &

# 3. Test (give the backgrounded gateway a moment to bind its port)
sleep 1
curl -s http://localhost:9001/health
curl -s http://localhost:9001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello!","max_tokens":20}'

# 4. Test cache (second call is faster)
curl -s http://localhost:9001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello!","max_tokens":20}'

# 5. Test rate limit (rapid calls)
for i in $(seq 1 15); do
  curl -s -o /dev/null -w "%{http_code} " \
    http://localhost:9001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt":"test","max_tokens":5}'
done
echo ""
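
To predict what the rate-limit loop should print, the token bucket can be replayed offline with explicit timestamps. The sketch below mirrors the refill logic of TokenBucket.consume with the script's defaults (rate 5, capacity 10). The 10 ms spacing between requests is an assumption; real curl loops are slower, and the first uncached call also waits on the LLM, so more requests may be admitted in practice.

```python
def simulate_bucket(arrival_times, rate=5.0, capacity=10.0):
    """Return a list of booleans: True where a request would be admitted."""
    tokens = capacity
    last = arrival_times[0] if arrival_times else 0.0
    admitted = []
    for now in arrival_times:
        # Same refill rule as TokenBucket.consume: elapsed time * rate, capped.
        tokens = min(capacity, tokens + (now - last) * rate)
        last = now
        if tokens >= 1:
            tokens -= 1
            admitted.append(True)
        else:
            admitted.append(False)
    return admitted


# 15 requests arriving 10 ms apart (assumed spacing for a fast loop).
arrivals = [i * 0.01 for i in range(15)]
result = simulate_bucket(arrivals)
print(sum(result), "admitted,", len(result) - sum(result), "rejected")
# prints: 10 admitted, 5 rejected
```

Under that assumption the loop would show ten 200s followed by five 429s: the burst capacity is spent after ten requests, and 10 ms of refill at 5 tokens per second adds only 0.05 tokens per request.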

Docker Compose Integration

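Add the gateway to your compose stack. The build context (.) must contain a Dockerfile that copies ai-gateway.py into a Python base image and runs it: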

services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports: ["8080:8080"]
    volumes: [./models:/models]
    command: -m /models/qwen2.5-3b-q4.gguf -c 4096 --port 8080

  ai-gateway:
    build: .
    ports: ["9001:9001"]
    environment:
      LLM_URL: http://llama-server:8080/v1/completions
      GATEWAY_PORT: "9001"
    depends_on: [llama-server]
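
Note that the short depends_on form only orders container startup; it does not wait for a service to be ready. A small polling helper against the gateway's /health endpoint is one way to bridge that gap in test scripts. The wait_for_health function below is a hypothetical sketch, not part of the course code.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout=30.0, interval=1.0):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval)
    return False


# Example (requires the stack to be running):
# wait_for_health("http://localhost:9001/health", timeout=10.0)
```

It returns True as soon as the endpoint answers with 200 and False if the deadline passes first, so a test script can fail fast instead of sending requests at a gateway that is still starting.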

Key Takeaways

An AI gateway consolidates rate limiting, caching, routing, and logging into one service
The gateway pattern decouples clients from inference backends
Each component (rate limiter, cache) is independently swappable
The same pattern, at much larger scale, is common in front of hosted AI APIs
