🧠 AI System Design

Day 22: Observability

📂 Production & Case Studies · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand the three pillars of observability in AI systems
  • Learn which metrics matter for inference and how they differ from web-app metrics
  • Add OpenTelemetry instrumentation to your inference server

Theory (15 min)

Why Observability for AI?

Traditional web observability tracks request rate, error rate, and latency (the RED method: Rate, Errors, Duration).

AI observability adds:

  • Token throughput (tokens/sec per model, per user)
  • Cache hit/miss rates
  • Quantisation level serving each request
  • Input/output token counts (cost tracking)
  • Context window utilisation
  • Hallucination / refusal rate
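Two of these signals, context window utilisation and cost, can be derived directly from the token counts on each request. The sketch below is illustrative only: `RequestUsage`, the 4096-token window, and the per-1k-token prices are assumptions, not values from this course or any provider.

```python
from dataclasses import dataclass


@dataclass
class RequestUsage:
    """Token counts for one request (hypothetical container)."""
    prompt_tokens: int
    completion_tokens: int


def context_utilisation(usage: RequestUsage, context_window: int) -> float:
    """Fraction of the model's context window consumed by this request."""
    return (usage.prompt_tokens + usage.completion_tokens) / context_window


def request_cost(usage: RequestUsage, usd_per_1k_prompt: float,
                 usd_per_1k_completion: float) -> float:
    """Cost in USD, pricing prompt and completion tokens separately."""
    return (usage.prompt_tokens / 1000 * usd_per_1k_prompt
            + usage.completion_tokens / 1000 * usd_per_1k_completion)


usage = RequestUsage(prompt_tokens=125, completion_tokens=43)
# Assumed values: a 4096-token window and made-up prices per 1k tokens.
print(f"context used: {context_utilisation(usage, context_window=4096):.1%}")
print(f"cost: ${request_cost(usage, 0.0005, 0.0015):.6f}")
```

Logging both numbers per request makes it possible to spot prompts that approach the context limit and to attribute spend to individual users later.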

The Three Pillars

1. Logs — structured events per request

{"timestamp": "...", "level": "info", "event": "inference_complete",
 "user_id": "abc", "prompt_tokens": 125, "completion_tokens": 43,
 "model": "qwen-3b", "latency_ms": 2340, "cached": false}

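One way to produce records shaped like the example above is a JSON formatter on top of Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class and the convention of passing fields under `extra={"fields": ...}` are assumptions made for illustration, not part of the course's gateway.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


logger = logging.getLogger("inference")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("inference_complete", extra={"fields": {
    "user_id": "abc", "prompt_tokens": 125, "completion_tokens": 43,
    "model": "qwen-3b", "latency_ms": 2340, "cached": False,
}})
```

Keeping one JSON object per line means the output stays easy to filter with standard text tools or to ship to a log backend unchanged.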
2. Metrics — aggregated counters and distributions

llm_tokens_total{model="qwen-3b",user="abc"} 15000
llm_latency_seconds{quantile="0.5"} 2.3
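The two sample lines above follow the Prometheus text exposition format. As a rough sketch of how such lines are built (in practice a client library would do this for you), the helper names `escape_label` and `render_sample` below are hypothetical:

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line: name{label="value",...} value."""
    if labels:
        body = ",".join(f'{k}="{escape_label(v)}"' for k, v in labels.items())
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


lines = [
    "# TYPE llm_tokens_total counter",
    render_sample("llm_tokens_total", {"model": "qwen-3b", "user": "abc"}, 15000),
    "# TYPE llm_latency_seconds summary",
    render_sample("llm_latency_seconds", {"quantile": "0.5"}, 2.3),
]
print("\n".join(lines))
```

Note that a per-user label creates one time series per user, so it only scales to a small, bounded user set; for large user populations, per-user totals are usually better kept in logs.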

3. Traces — end-to-end request path across services

Gateway → Rate Limiter → Cache Lookup → LLM Call → Response
  (5ms)     (0.2ms)       (0.1ms)        (2340ms)   (0.5ms)
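The per-stage timings in a trace like this come from spans: timed sections that share one trace ID. The sketch below imitates that idea with the standard library only. The `Trace` class is a hypothetical stand-in for a real tracing SDK, and the stage bodies are placeholders.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects timed spans that share one trace ID (hypothetical helper)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []  # (name, duration_ms) in completion order

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raises.
            self.spans.append((name, (time.perf_counter() - start) * 1000))


trace = Trace()
with trace.span("rate_limiter"):
    pass  # placeholder: check the user's quota
with trace.span("cache_lookup"):
    pass  # placeholder: look up the prompt in the cache
with trace.span("llm_call"):
    time.sleep(0.01)  # placeholder for the model call

for name, ms in trace.spans:
    print(f"{trace.trace_id[:8]} {name:<14} {ms:8.2f} ms")
```

A real tracer additionally propagates the trace ID across service boundaries (typically in request headers), which this single-process sketch does not attempt.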

Key Metrics for AI Systems

Metric                     Type          What It Tells You
tokens_per_second          Gauge         Model throughput
latency_ms (p50/p95/p99)   Distribution  User experience
cache_hit_ratio            Gauge         Cache effectiveness
context_utilisation        Gauge         How much context window used
tokens_per_dollar          Gauge         Cost efficiency
queue_depth                Gauge         Backlog pressure
requests_in_flight         Gauge         Current load
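A gauge such as tokens_per_second is usually computed over a recent time window rather than over the whole process lifetime. The following is a minimal sketch of one way to do that. `ThroughputGauge` and its 10-second window are assumptions, and timestamps are passed in explicitly so the example is deterministic (a server would pass `time.monotonic()`).

```python
from collections import deque


class ThroughputGauge:
    """Tokens per second averaged over a sliding time window."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens), oldest first
        self.total = 0         # tokens currently inside the window

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, tokens = self.events.popleft()
            self.total -= tokens

    def record(self, tokens: int, now: float) -> None:
        self.events.append((now, tokens))
        self.total += tokens
        self._evict(now)

    def value(self, now: float) -> float:
        self._evict(now)
        return self.total / self.window_s


gauge = ThroughputGauge(window_s=10.0)
gauge.record(200, now=0.0)
gauge.record(300, now=4.0)
print(gauge.value(now=5.0))   # both events are inside the window
print(gauge.value(now=12.0))  # the first event has expired
```

Dividing by the full window length slightly understates throughput during the first seconds after startup; that trade-off keeps the gauge simple and stable.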

OpenTelemetry (OTel)

OpenTelemetry is a vendor-neutral standard for producing and exporting telemetry (logs, metrics, traces):

  • The OTel Collector receives telemetry and exports it to your backend
  • Backend options: Grafana + Tempo, Datadog, New Relic, self-hosted SigNoz
  • Lightweight alternative: Prometheus + structured logs (grep/awk)
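The lightweight route can go a long way before a full backend is needed: if every request is logged as one JSON line, a few lines of Python (or grep/awk) can answer cost and cache questions. The sketch below assumes log records shaped like the example earlier in this lesson; `LOG_LINES` is made-up sample data and `summarise` is a hypothetical helper.

```python
import json
from collections import Counter

# Made-up sample data: three JSON log lines plus one non-JSON line.
LOG_LINES = """\
{"event": "inference_complete", "user_id": "abc", "prompt_tokens": 125, "completion_tokens": 43, "latency_ms": 2340, "cached": false}
{"event": "inference_complete", "user_id": "xyz", "prompt_tokens": 80, "completion_tokens": 20, "latency_ms": 12, "cached": true}
{"event": "inference_complete", "user_id": "abc", "prompt_tokens": 200, "completion_tokens": 60, "latency_ms": 3100, "cached": false}
not json, for example a stray stack trace line
"""


def summarise(lines):
    """Aggregate per-user token totals and the cache hit ratio."""
    tokens_by_user = Counter()
    hits = total = 0
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON record
        if not isinstance(rec, dict) or rec.get("event") != "inference_complete":
            continue
        total += 1
        hits += bool(rec.get("cached"))
        tokens_by_user[rec["user_id"]] += rec["prompt_tokens"] + rec["completion_tokens"]
    return {
        "requests": total,
        "cache_hit_ratio": hits / total if total else 0.0,
        "tokens_by_user": dict(tokens_by_user),
    }


print(json.dumps(summarise(LOG_LINES.splitlines()), indent=2))
```

This works for a single server with modest traffic; once several services are involved, a collector and a metrics backend become the more practical choice.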


Hands-on (15 min)

Add OTel-Compliant Metrics to the Gateway

#!/usr/bin/env python3
"""observability.py — add metrics and structured logging to the AI gateway."""
import json
import time
import os
from collections import defaultdict
from http.server import HTTPServer, BaseHTTPRequestHandler

# Stub — Ayva will expand with:
# - Prometheus client integration (/metrics endpoint)
# - OpenTelemetry exporter (OTLP)
# - Trace context propagation across services
# - Per-user token tracking (cost attribution)
# - Grafana dashboard JSON
# - Alert rules (high latency, low cache hit rate)
# - Integration with the AI Gateway from Day 7

class MetricsCollector:
    """Simple in-memory metrics collector (would use Prometheus in prod)."""
    def __init__(self):
        self.token_total = 0
        self.request_count = 0
        self.cache_hits = 0
        self.cache_misses = 0
        self.latencies = []
        self.user_tokens = defaultdict(int)

    def record_request(self, user_id: str, tokens: int, latency: float, cached: bool):
        self.request_count += 1
        self.token_total += tokens
        self.user_tokens[user_id] += tokens
        self.latencies.append(latency)
        if cached:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def snapshot(self) -> dict:
        sorted_l = sorted(self.latencies)
        n = len(sorted_l)
        cache_total = self.cache_hits + self.cache_misses

        def percentile_ms(pct: int) -> float:
            """Nearest-rank percentile of the recorded latencies, in ms."""
            if not n:
                return 0
            rank = (n * pct + 99) // 100  # integer ceil(n * pct / 100)
            return round(sorted_l[rank - 1] * 1000, 1)

        return {
            "requests": self.request_count,
            "total_tokens": self.token_total,
            "cache_hit_ratio": round(self.cache_hits / cache_total, 3) if cache_total else 0,
            "avg_latency_ms": round(sum(self.latencies) / n * 1000, 1) if n else 0,
            "p50_latency_ms": percentile_ms(50),
            "p95_latency_ms": percentile_ms(95),
            "top_users": dict(sorted(self.user_tokens.items(),
                                      key=lambda x: -x[1])[:5]),
        }


# Structured logging helper
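def log_event(event: str, level: str = "info", **fields):
    """Print one JSON log line; timestamps are UTC (marked with 'Z')."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "event": event,
        **fields,
    }
    print(json.dumps(record, default=str))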


# Usage in a request handler
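def handle_inference(prompt: str, user_id: str = "anonymous"):
    # perf_counter is monotonic, so elapsed time is immune to clock changes.
    start = time.perf_counter()
    # ... do inference ...
    elapsed = time.perf_counter() - start

    log_event(
        "inference_complete",
        user_id=user_id,
        # Whitespace word count: a rough proxy, not a tokenizer count.
        prompt_words=len(prompt.split()),
        latency_ms=round(elapsed * 1000),
        model="qwen2.5-3b-q4_K_M",
    )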


if __name__ == "__main__":
    print("Observability demo")
    collector = MetricsCollector()

    # Simulate requests
    for i in range(20):
        user = f"user-{i % 3}"
        cached = i > 10  # simulate cache warming up
        collector.record_request(user, tokens=50 + i, latency=1.5 + i * 0.1, cached=cached)
        time.sleep(0.05)

    snap = collector.snapshot()
    print("\n📊 Metrics Snapshot:")
    print(json.dumps(snap, indent=2))

Questions for Ayva:

  • What Prometheus metrics are most useful for llama.cpp inference?
  • How to set up a Grafana dashboard for a single-server inference stack?
  • What alert rules prevent silent degradation?


Key Takeaways

  • AI observability adds domain-specific metrics (tokens, cache hits, context usage)
  • The three pillars (logs, metrics, traces) give different views of the same system
  • Start with structured logging, add Prometheus metrics, then traces as needed
  • Cost attribution (tokens per user) is essential for production AI systems

References