Day 7: Mini-Project - AI Gateway
Learning Objectives
- Integrate everything from Days 1-6 into a single, runnable AI gateway
- Understand how components compose in a microservice architecture
- Containerise and test the full stack
Theory (15 min)
The Gateway Pattern
An AI gateway is the single entry point for all inference traffic. It handles:
                      ┌────────────────┐
                      │   AI Gateway   │
                      │                │
    Client ──HTTP──▶  ├─ Rate Limiter  │
                      ├─ Auth          │
                      ├─ Cache         │ ───▶ Inference Server (llama.cpp)
                      ├─ Router        │
                      ├─ Logger        │
                      └────────────────┘
What it provides:
- Single URL for clients: simplifies front-end code
- Cross-cutting concerns in one place (auth, rate limit, cache, logging)
- Backend abstraction: swap models without changing client code
- Observability: one place to monitor all traffic
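The backend-abstraction point can be made concrete with a small routing table. This is only an illustrative sketch: the model names, the backend URLs, and the `resolve_backend` helper are hypothetical and are not part of the gateway script in this lesson, which forwards everything to a single `LLM_URL`.

```python
# Hypothetical routing table: public model name -> internal backend URL.
# Neither the names nor the URLs come from the lesson; they are placeholders.
BACKENDS = {
    "small": "http://llama-small:8080/v1/completions",
    "large": "http://llama-large:8080/v1/completions",
}
DEFAULT_MODEL = "small"


def resolve_backend(request_body: dict) -> str:
    """Pick a backend URL from the request's optional "model" field."""
    model = request_body.get("model", DEFAULT_MODEL)
    try:
        return BACKENDS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None


if __name__ == "__main__":
    print(resolve_backend({"prompt": "Hello!"}))
    print(resolve_backend({"prompt": "Hello!", "model": "large"}))
```

Clients keep posting to the same gateway URL; swapping or adding a model only changes the table.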
Architecture Decisions
Today's mini-project combines:
1. Rate limiter (Day 6): protects the server from overload
2. Response cache (Day 3): short-circuits duplicate requests
3. Request logger: structured logging per request (left as a stub in the script below)
4. Health endpoint: Kubernetes/Docker readiness checks
All of this runs as a single Python service, containerised with Docker and pointed at your existing llama.cpp server.
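Structured logging per request could look something like the sketch below, which emits one JSON object per line. The helper name `format_request_log` and the field names are assumptions made for illustration; the gateway script itself does not define them.

```python
import json
import logging
import sys
import time

logger = logging.getLogger("ai-gateway")


def format_request_log(method, path, status, elapsed_ms, cached=None):
    """Build one JSON log line describing a handled request."""
    record = {
        "ts": round(time.time(), 3),
        "method": method,
        "path": path,
        "status": status,
        "elapsed_ms": elapsed_ms,
    }
    # Only inference requests have a cache outcome; omit the field otherwise.
    if cached is not None:
        record["cached"] = cached
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    logger.info(format_request_log("POST", "/v1/completions", 200, 12, cached=True))
```

One way to wire this in would be to call the helper from the handler's response path, so that every status code, including 429 and 502, produces a line.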
Why a Gateway vs Ad-hoc Middleware?
| Approach | Pros | Cons |
|---|---|---|
| Gateway service | Centralised, language-agnostic, reusable | Extra network hop |
| Middleware in app | No extra hop | Tied to framework |
| Sidecar proxy | Transparent to app | More complex orchestration |
For your VPS, a gateway service in a single container is the sweet spot: the extra hop stays on the same host, and the gateway works with any client language.
Hands-on (15 min)
Build and Run the AI Gateway
#!/usr/bin/env python3
"""ai-gateway.py: unified inference gateway."""
import hashlib
import json
import time
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from collections import OrderedDict
import threading
import urllib.request
# ── Config ──────────────────────────────────────────────────────────────
LLM_URL = os.getenv("LLM_URL", "http://localhost:8080/v1/completions")
GATEWAY_PORT = int(os.getenv("GATEWAY_PORT", "9001"))
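# ── Rate Limiter (Token Bucket) ─────────────────────────────────────────
class TokenBucket:
    """Global token bucket: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate=5, capacity=10):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        # monotonic() is unaffected by wall-clock adjustments
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def consume(self):
        """Take one token if available; return False when the bucket is empty."""
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False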
# ── Response Cache (LRU) ────────────────────────────────────────────────
class LRUCache:
    """LRU cache with a per-entry TTL, keyed on prompt plus parameters."""

    def __init__(self, capacity=64, ttl=300):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.ttl = ttl

    def _key(self, prompt, **params):
        raw = f"{prompt}|{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt, **params):
        key = self._key(prompt, **params)
        if key in self.cache:
            entry = self.cache[key]
            if time.time() - entry["ts"] < self.ttl:
                self.cache.move_to_end(key)  # mark as most recently used
                return entry["response"]
            del self.cache[key]  # entry has expired
        return None

    def set(self, prompt, response, **params):
        key = self._key(prompt, **params)
        # Evict the least recently used entry only when adding a new key.
        if key not in self.cache and len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        self.cache[key] = {"response": response, "ts": time.time()}
        self.cache.move_to_end(key)
# ── Gateway Handler ─────────────────────────────────────────────────────
rate_limiter = TokenBucket()
cache = LRUCache()


class GatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        start = time.time()
        content_len = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(content_len)
        try:
            data = json.loads(body)
            prompt = data.get("prompt", "")
            max_tokens = data.get("max_tokens", 50)
        except (ValueError, AttributeError):
            # Malformed JSON, or a JSON value that is not an object.
            self._respond(400, {"error": "request body must be a JSON object"})
            return

        # 1. Rate limit check
        if not rate_limiter.consume():
            self._respond(429, {"error": "rate limit exceeded"})
            return

        # 2. Cache check (key is prompt + max_tokens; other params are ignored)
        cached = cache.get(prompt, max_tokens=max_tokens)
        if cached is not None:
            elapsed = time.time() - start
            self._respond(200, {
                "cached": True,
                "elapsed_ms": round(elapsed * 1000),
                "choices": [{"text": cached}],
            })
            return

        # 3. Forward to LLM
        try:
            req = urllib.request.Request(
                LLM_URL,
                data=json.dumps(data).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=30) as resp:
                llm_response = json.loads(resp.read())
            text = llm_response["choices"][0]["text"]
        except Exception as e:
            self._respond(502, {"error": f"LLM error: {e}"})
            return

        # 4. Cache the response under the same key used for the lookup.
        #    (Passing **data here would collide with the `prompt` argument
        #    and would never match the key built in step 2.)
        cache.set(prompt, text, max_tokens=max_tokens)
        elapsed = time.time() - start
        self._respond(200, {
            "cached": False,
            "elapsed_ms": round(elapsed * 1000),
            "choices": [{"text": text}],
        })

    def do_GET(self):
        if self.path == "/health":
            self._respond(200, {"status": "ok"})
        elif self.path == "/stats":
            self._respond(200, {
                "cache_size": len(cache.cache),
                "cache_capacity": cache.capacity,
            })
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, status, data):
        payload = json.dumps(data).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # silent (or add structured logging here)
if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", GATEWAY_PORT), GatewayHandler)
    print(f"AI Gateway running on port {GATEWAY_PORT} -> {LLM_URL}")
    print(f"  Rate limit: {rate_limiter.rate}/s, burst {rate_limiter.capacity}")
    print(f"  Cache: {cache.capacity} entries, {cache.ttl}s TTL")
    print(f"  Health: http://localhost:{GATEWAY_PORT}/health")
    server.serve_forever()
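A cache lookup can only hit when the write derives its key from exactly the same prompt and parameters as the read. The standalone sketch below copies the script's hashing scheme into a free function (`cache_key` is a name introduced here for illustration) to show which calls produce matching keys.

```python
import hashlib
import json


def cache_key(prompt, **params):
    """Same scheme as LRUCache._key: hash the prompt plus the sorted params."""
    raw = f"{prompt}|{json.dumps(params, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()


if __name__ == "__main__":
    read_key = cache_key("Hello!", max_tokens=20)
    same_params = cache_key("Hello!", max_tokens=20)
    extra_params = cache_key("Hello!", max_tokens=20, temperature=0.7)
    print(read_key == same_params)   # True: identical inputs, identical key
    print(read_key == extra_params)  # False: an extra parameter changes the key
```

Because `sort_keys=True` normalises ordering, the order in which keyword arguments are passed does not affect the key; only the set of parameters and their values does.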
Run It
# 1. Save the script
cat > /tmp/ai-gateway.py << 'EOF'
# [paste the script above]
EOF
# 2. Run alongside your existing llama.cpp server
LLM_URL=http://localhost:8080/v1/completions python3 /tmp/ai-gateway.py &
# 3. Test
curl -s http://localhost:9001/health
curl -s http://localhost:9001/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt":"Hello!","max_tokens":20}'
# 4. Test cache (second call is faster)
curl -s http://localhost:9001/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt":"Hello!","max_tokens":20}'
# 5. Test rate limit (rapid calls): expect 200s until the burst of roughly 10 is used up, then 429s
for i in $(seq 1 15); do
curl -s -o /dev/null -w "%{http_code} " \
http://localhost:9001/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt":"test","max_tokens":5}'
done
echo ""
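The burst behaviour from step 5 can also be reproduced deterministically, without a running server. The sketch below is a simplified variant of the gateway's token bucket: it drops the lock (the demo is single-threaded) and takes an injectable clock. `FakeClock` and `simulate_burst` are hypothetical helpers added for the demonstration.

```python
import time


class FakeClock:
    """Manually advanced clock so the demo does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class TokenBucket:
    """Same refill rule as the gateway's bucket, with an injectable clock."""

    def __init__(self, rate=5, capacity=10, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last_refill = clock()

    def consume(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate_burst(requests=15):
    """Fire `requests` instant calls, wait one simulated second, fire six more."""
    clock = FakeClock()
    bucket = TokenBucket(rate=5, capacity=10, clock=clock)
    burst = [200 if bucket.consume() else 429 for _ in range(requests)]
    clock.now += 1.0  # one second later, 5 tokens have been refilled
    after_wait = [200 if bucket.consume() else 429 for _ in range(6)]
    return burst, after_wait


if __name__ == "__main__":
    burst, after_wait = simulate_burst()
    print(burst)       # ten 200s followed by five 429s
    print(after_wait)  # five 200s followed by one 429
```

Against the real gateway the split is less exact, because tokens refill while each request is in flight.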
Docker Compose Integration
Add the gateway to your compose stack. The ai-gateway service is built from the current directory, so a Dockerfile for ai-gateway.py must sit next to the compose file:
services:
  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports: ["8080:8080"]
    volumes: [./models:/models]
    command: -m /models/qwen2.5-3b-q4.gguf -c 4096 --port 8080
  ai-gateway:
    build: .
    ports: ["9001:9001"]
    environment:
      LLM_URL: http://llama-server:8080/v1/completions
      GATEWAY_PORT: "9001"
    depends_on: [llama-server]
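In its short list form, depends_on only controls the order in which containers start; it does not wait for the service inside to be ready. A test script can poll the gateway's /health endpoint before sending traffic. The sketch below is one possible way to do that; `wait_until_healthy` and its default timings are assumptions, not part of the lesson's script.

```python
import http.client
import json
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers 200 with {"status": "ok"}, or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                body = json.loads(resp.read())
                if resp.status == 200 and body.get("status") == "ok":
                    return True
        except (urllib.error.URLError, OSError,
                http.client.HTTPException, ValueError, AttributeError):
            pass  # not ready yet: refused connection, timeout, or unexpected body
        time.sleep(interval)
    return False


# Example (assumes the gateway is published on port 9001 as in the compose file):
#   wait_until_healthy("http://localhost:9001/health")
```

The function returns False instead of raising, so a caller can decide whether to retry, print diagnostics, or abort.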
Key Takeaways
- An AI gateway consolidates rate limiting, caching, routing, and logging into one service
- The gateway pattern decouples clients from inference backends
- Each component (rate limiter, cache) is independently swappable
- Hosted AI API providers commonly place a similar gateway layer in front of their inference backends