Day 22: Observability

Learning Objectives

Understand the three pillars of observability in AI systems
Learn what metrics matter for inference (different from web apps)
Add OpenTelemetry instrumentation to your inference server

Theory (15 min)

Why Observability for AI?

Traditional web observability tracks request rate, error rate, and latency (the RED method: Rate, Errors, Duration).

AI observability adds:

- Token throughput (tokens/sec per model, per user)
- Cache hit/miss rates
- Quantisation level serving each request
- Input/output token counts (cost tracking)
- Context window utilisation
- Hallucination / refusal rate

The Three Pillars

1. Logs — structured events per request

{"timestamp": "...", "level": "info", "event": "inference_complete",
 "user_id": "abc", "prompt_tokens": 125, "completion_tokens": 43,
 "model": "qwen-3b", "latency_ms": 2340, "cached": false}
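
A log line like the one above can be produced with the standard library alone. The following is a minimal sketch, assuming a hypothetical `JsonFormatter` class and a hypothetical `fields` key passed through `extra`; neither name comes from the course code.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def build_logger(name: str = "inference") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "inference_complete",
        extra={"fields": {"user_id": "abc", "prompt_tokens": 125,
                          "completion_tokens": 43, "model": "qwen-3b",
                          "latency_ms": 2340, "cached": False}},
    )
```

Keeping one JSON object per line means the output stays greppable and can later be shipped to a log backend without changing the call sites.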

2. Metrics — aggregated counters and distributions

llm_tokens_total{model="qwen-3b",user="abc"} 15000
llm_latency_seconds{quantile="0.5"} 2.3
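
The two lines above follow the Prometheus text exposition format: metric name, optional labels in braces, then the value. As a rough sketch of how such lines could be rendered by hand (in production you would more likely use a Prometheus client library), the hypothetical helpers `render_sample` and `render_counter` below are illustrative names only.

```python
def _escape(value) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Format one sample line: name{key="value",...} value."""
    if labels:
        body = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render a counter with its HELP/TYPE header and one line per label set."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "llm_tokens_total",
        "Total tokens processed.",
        [({"model": "qwen-3b", "user": "abc"}, 15000)],
    ))
```

Note that a per-user label creates one time series per user, so this style suits a small, known user set better than an open-ended one.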

3. Traces — end-to-end request path across services

Gateway → Rate Limiter → Cache Lookup → LLM Call → Response
  (5ms)     (0.2ms)       (0.1ms)        (2340ms)   (0.5ms)
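
To make the idea of a trace concrete, here is a minimal in-process sketch that times nested steps and links each span to its parent. The `Tracer` class is a hypothetical teaching aid, not the OpenTelemetry API; a real deployment would use an OTel SDK instead.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects timed spans that share one trace ID (illustrative only)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name: str):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self._stack.pop()
            self.spans.append(record)


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("gateway"):
        with tracer.span("rate_limiter"):
            pass
        with tracer.span("cache_lookup"):
            pass
        with tracer.span("llm_call"):
            time.sleep(0.01)  # stand-in for the model call
    for s in tracer.spans:
        print(s["name"], s["parent_id"], s["duration_ms"])
```

Child spans finish before their parent, so `tracer.spans` lists the inner steps first and `gateway` last.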

Key Metrics for AI Systems

Metric	Type	What It Tells You
tokens_per_second	Gauge	Model throughput
latency_ms (p50/p95/p99)	Distribution	User experience
cache_hit_ratio	Gauge	Cache effectiveness
context_utilisation	Gauge	How much context window used
tokens_per_dollar	Gauge	Cost efficiency
queue_depth	Gauge	Backlog pressure
requests_in_flight	Gauge	Current load
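
Several of the metrics in the table are simple ratios over values the server already knows. The sketch below shows one plausible way to derive three of them; the function names, the 4096-token context window, and the cost figure are all assumptions chosen for illustration.

```python
def tokens_per_second(completion_tokens: int, generation_seconds: float) -> float:
    """Throughput of generated tokens over the time spent generating."""
    return completion_tokens / generation_seconds if generation_seconds > 0 else 0.0


def context_utilisation(prompt_tokens: int, completion_tokens: int,
                        context_window: int) -> float:
    """Fraction of the context window consumed by this request (0.0 to 1.0)."""
    return (prompt_tokens + completion_tokens) / context_window


def tokens_per_dollar(total_tokens: int, cost_dollars: float) -> float:
    """Cost efficiency: tokens served per dollar of compute spend."""
    return total_tokens / cost_dollars if cost_dollars > 0 else 0.0


if __name__ == "__main__":
    # Token counts taken from the log example earlier; the rest is hypothetical.
    print(round(tokens_per_second(43, 2.34), 2))
    print(round(context_utilisation(125, 43, context_window=4096), 3))
    print(round(tokens_per_dollar(15000, cost_dollars=0.02)))
```

If only end-to-end latency is available, the throughput figure also includes prompt processing time, so it understates pure generation speed.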

OpenTelemetry (OTel)

OpenTelemetry is a vendor-neutral standard for generating, collecting, and exporting telemetry (logs, metrics, traces):

- The OTel Collector receives telemetry and exports it to your backend
- Backend options: Grafana+Tempo, Datadog, New Relic, self-hosted SigNoz
- Lightweight alternative: Prometheus + structured logs (grep/awk)
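
For traces to span several services, each hop has to pass along a trace identifier. OpenTelemetry's default propagation uses the W3C Trace Context `traceparent` header, which has the shape `version-traceid-parentid-flags`. The sketch below builds and parses that header with the standard library only; the function names are hypothetical, and an OTel SDK would normally do this for you.

```python
import re
import secrets

# Version 00: 32 hex chars of trace ID, 16 of parent (span) ID, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace and return its traceparent header value."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return the header's parts as a dict, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 1)}


def child_traceparent(header: str) -> str:
    """Keep the incoming trace ID but mint a new span ID for the next hop."""
    parent = parse_traceparent(header)
    if parent is None:
        return new_traceparent()
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = new_traceparent()
    print(incoming)
    print(child_traceparent(incoming))
```

A gateway would read `traceparent` from the incoming request, call something like `child_traceparent`, and send the result on the outgoing call to the inference server.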

Hands-on (15 min)

Add OTel-Compliant Metrics to the Gateway

#!/usr/bin/env python3
"""observability.py — add metrics and structured logging to the AI gateway."""
import json
import time
from collections import defaultdict

# Stub — Ayva will expand with:
# - Prometheus client integration (/metrics endpoint)
# - OpenTelemetry exporter (OTLP)
# - Trace context propagation across services
# - Per-user token tracking (cost attribution)
# - Grafana dashboard JSON
# - Alert rules (high latency, low cache hit rate)
# - Integration with the AI Gateway from Day 7

class MetricsCollector:
    """Simple in-memory metrics collector (would use Prometheus in prod)."""
    def __init__(self):
        self.token_total = 0
        self.request_count = 0
        self.cache_hits = 0
        self.cache_misses = 0
        self.latencies = []
        self.user_tokens = defaultdict(int)

    def record_request(self, user_id: str, tokens: int, latency: float, cached: bool):
        self.request_count += 1
        self.token_total += tokens
        self.user_tokens[user_id] += tokens
        self.latencies.append(latency)
        if cached:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def snapshot(self) -> dict:
        """Summarise recorded requests; latencies are stored in seconds, reported in ms."""
        sorted_l = sorted(self.latencies)
        n = len(sorted_l)
        cache_total = self.cache_hits + self.cache_misses
        return {
            "requests": self.request_count,
            "total_tokens": self.token_total,
            "cache_hit_ratio": round(self.cache_hits / cache_total, 3) if cache_total else 0,
            "avg_latency_ms": round(sum(self.latencies) / n * 1000, 1) if n else 0,
            # Percentiles by index into the sorted list (no interpolation)
            "p50_latency_ms": round(sorted_l[n // 2] * 1000, 1) if n else 0,
            "p95_latency_ms": round(sorted_l[min(int(n * 0.95), n - 1)] * 1000, 1) if n else 0,
            # Five heaviest users by token count, for cost attribution
            "top_users": dict(sorted(self.user_tokens.items(),
                                      key=lambda x: -x[1])[:5]),
        }


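# Structured logging helper: one JSON object per line on stdout
def log_event(event: str, level: str = "info", **fields):
    record = {
        # UTC timestamp; the trailing "Z" makes the timezone explicit
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "event": event,
        **fields,
    }
    print(json.dumps(record, default=str))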


# Usage in a request handler
def handle_inference(prompt: str, user_id: str = "anonymous"):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    # ... do inference ...
    elapsed = time.perf_counter() - start

    log_event(
        "inference_complete",
        user_id=user_id,
        # Whitespace word count: a rough proxy, not a real token count
        prompt_length=len(prompt.split()),
        latency_ms=round(elapsed * 1000),
        model="qwen2.5-3b-q4_K_M",
    )


if __name__ == "__main__":
    print("Observability demo")
    collector = MetricsCollector()

    # Simulate requests
    for i in range(20):
        user = f"user-{i % 3}"
        cached = i > 10  # simulate cache warming up
        collector.record_request(user, tokens=50 + i, latency=1.5 + i * 0.1, cached=cached)
        time.sleep(0.05)

    snap = collector.snapshot()
    print("\n📊 Metrics Snapshot:")
    print(json.dumps(snap, indent=2))

Questions for Ayva:

- What Prometheus metrics are most useful for llama.cpp inference?
- How to set up a Grafana dashboard for a single-server inference stack?
- What alert rules prevent silent degradation?
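
As a starting point for the alerting question, the sketch below checks a snapshot dict of the shape returned by `MetricsCollector.snapshot()` against two thresholds. The function name `evaluate_alerts` and every threshold value are hypothetical placeholders, not recommendations; real limits depend on your model and hardware.

```python
def evaluate_alerts(snapshot: dict,
                    max_p95_ms: float = 5000.0,
                    min_cache_hit_ratio: float = 0.2,
                    min_requests: int = 10) -> list:
    """Return a list of human-readable alert messages for one snapshot."""
    alerts = []
    # With very few requests the percentiles and ratios are too noisy to act on.
    if snapshot.get("requests", 0) < min_requests:
        return alerts
    p95 = snapshot.get("p95_latency_ms", 0)
    if p95 > max_p95_ms:
        alerts.append(f"high latency: p95 {p95} ms exceeds {max_p95_ms} ms")
    ratio = snapshot.get("cache_hit_ratio", 0)
    if ratio < min_cache_hit_ratio:
        alerts.append(f"low cache hit ratio: {ratio} below {min_cache_hit_ratio}")
    return alerts


if __name__ == "__main__":
    sample = {"requests": 20, "p95_latency_ms": 6200.0, "cache_hit_ratio": 0.1}
    for message in evaluate_alerts(sample):
        print("ALERT:", message)
```

Running a check like this on a timer, and logging each alert as a structured event, gives basic coverage before a full Prometheus alerting setup exists.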

Key Takeaways

AI observability adds domain-specific metrics (tokens, cache hits, context usage)
The three pillars (logs, metrics, traces) give different views of the same system
Start with structured logging, add Prometheus metrics, then traces as needed
Cost attribution (tokens per user) is essential for production AI systems
