Day 22: Observability
Learning Objectives
- Understand the three pillars of observability in AI systems
- Learn which metrics matter for inference (they differ from web apps)
- Add OpenTelemetry instrumentation to your inference server
Theory (15 min)
Why Observability for AI?
Traditional web observability: request rate, error rate, latency (RED method).
AI observability adds:
- Token throughput (tokens/sec per model, per user)
- Cache hit/miss rates
- Quantisation level serving each request
- Input/output token counts (cost tracking)
- Context window utilisation
- Hallucination / refusal rate
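Two of the items above, cost tracking and context window utilisation, reduce to simple arithmetic on the token counts. The following is a minimal sketch, not a billing implementation: the per-1,000-token prices are hypothetical placeholders, and `Usage`, `estimate_cost`, and `context_utilisation` are illustrative names that do not appear elsewhere in this lesson.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in dollars per 1,000 tokens; substitute your own figures.
PRICE_PER_1K_PROMPT = 0.50
PRICE_PER_1K_COMPLETION = 1.50


def estimate_cost(usage: Usage) -> float:
    """Return the estimated dollar cost of one request."""
    prompt_cost = usage.prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
    completion_cost = usage.completion_tokens / 1000 * PRICE_PER_1K_COMPLETION
    return prompt_cost + completion_cost


def context_utilisation(usage: Usage, context_window: int) -> float:
    """Fraction of the context window consumed by prompt plus completion."""
    return (usage.prompt_tokens + usage.completion_tokens) / context_window


if __name__ == "__main__":
    usage = Usage(prompt_tokens=125, completion_tokens=43)
    print(f"cost: ${estimate_cost(usage):.6f}")
    print(f"context used: {context_utilisation(usage, context_window=4096):.1%}")
```

Prompt and completion tokens are priced separately here because many hosted APIs charge them at different rates; for a self-hosted model you might replace the prices with an amortised hardware cost.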
The Three Pillars
1. Logs: structured events per request
{"timestamp": "...", "level": "info", "event": "inference_complete",
"user_id": "abc", "prompt_tokens": 125, "completion_tokens": 43,
"model": "qwen-3b", "latency_ms": 2340, "cached": false}
2. Metrics: aggregated counters and distributions
llm_tokens_total{model="qwen-3b",user="abc"} 15000
llm_latency_seconds{quantile="0.5"} 2.3
3. Traces: end-to-end request path across services
Gateway → Rate Limiter → Cache Lookup → LLM Call → Response
(5ms)     (0.2ms)        (0.1ms)        (2340ms)   (0.5ms)
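The per-stage timings in the trace above can be collected with a small context manager. This is a sketch of the idea only, using the standard library; it is not the OpenTelemetry API, and the `Trace` class and its `span` and `slowest` methods are hypothetical names. A real tracer would also propagate trace and span IDs across services.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (name, duration_ms) pairs for one request."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raises.
            duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append((name, duration_ms))

    def slowest(self):
        """Return the (name, duration_ms) pair with the longest duration."""
        return max(self.spans, key=lambda s: s[1])


if __name__ == "__main__":
    trace = Trace()
    with trace.span("cache_lookup"):
        pass  # placeholder for a real cache lookup
    with trace.span("llm_call"):
        time.sleep(0.02)  # placeholder for the model call
    for name, ms in trace.spans:
        print(f"{name}: {ms:.2f} ms")
    print("slowest:", trace.slowest()[0])
```

In an inference stack the LLM call usually dominates, so the value of a trace is often in confirming that the other stages stay negligible.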
Key Metrics for AI Systems
| Metric | Type | What It Tells You |
|---|---|---|
| tokens_per_second | Gauge | Model throughput |
| latency_ms (p50/p95/p99) | Distribution | User experience |
| cache_hit_ratio | Gauge | Cache effectiveness |
| context_utilisation | Gauge | How much of the context window is used |
| tokens_per_dollar | Gauge | Cost efficiency |
| queue_depth | Gauge | Backlog pressure |
| requests_in_flight | Gauge | Current load |
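The p50/p95/p99 latency figures in the table can be computed from raw samples in more than one way. The sketch below uses the nearest-rank definition (the smallest sample with at least p% of samples at or below it); this is one common convention, not the only one, and `percentile` is an illustrative helper rather than a library function.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of samples, for 0 < p <= 100."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank; multiplying before dividing keeps integer inputs exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [1500, 1600, 1700, 1800, 9000]  # made-up sample data
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With only a handful of samples, high percentiles collapse onto the maximum, which is why p99 is only meaningful once you have a few hundred requests in the window.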
OpenTelemetry (OTel)
OpenTelemetry is a vendor-neutral standard for generating and exporting telemetry (traces, metrics, logs):
- The OTel Collector receives telemetry and exports it to your backend
- Backend options: Grafana + Tempo, Datadog, New Relic, self-hosted SigNoz
- Lightweight alternative: Prometheus + structured logs (grep/awk)
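For the lightweight Prometheus route, a scrape target only has to return plain text in the Prometheus exposition format. The sketch below renders that format by hand using the standard library, so that the structure is visible; in practice you would more likely use a Prometheus client library. `render_metric` and `escape_label` are hypothetical helper names.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as exposition text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "llm_tokens_total", "counter", "Total tokens processed.",
        [({"model": "qwen-3b", "user": "abc"}, 15000)],
    ), end="")
```

Note that a per-user label creates one time series per user. That is fine for a small deployment, but with many users it is usually better to keep per-user attribution in logs instead.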
Hands-on (15 min)
Add OTel-Compliant Metrics to the Gateway
#!/usr/bin/env python3
"""observability.py: add metrics and structured logging to the AI gateway."""
import json
import time
import os
from collections import defaultdict
from http.server import HTTPServer, BaseHTTPRequestHandler
# Stub: Ayva will expand with:
# - Prometheus client integration (/metrics endpoint)
# - OpenTelemetry exporter (OTLP)
# - Trace context propagation across services
# - Per-user token tracking (cost attribution)
# - Grafana dashboard JSON
# - Alert rules (high latency, low cache hit rate)
# - Integration with the AI Gateway from Day 7
class MetricsCollector:
    """Simple in-memory metrics collector (would use Prometheus in prod)."""

    def __init__(self):
        self.token_total = 0
        self.request_count = 0
        self.cache_hits = 0
        self.cache_misses = 0
        self.latencies = []  # per-request latency in seconds
        self.user_tokens = defaultdict(int)

    def record_request(self, user_id: str, tokens: int, latency: float, cached: bool):
        """Record one request; latency is in seconds."""
        self.request_count += 1
        self.token_total += tokens
        self.user_tokens[user_id] += tokens
        self.latencies.append(latency)
        if cached:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def snapshot(self) -> dict:
        """Return aggregate metrics, with latencies converted to milliseconds."""
        sorted_l = sorted(self.latencies)
        n = len(sorted_l)
        cache_total = self.cache_hits + self.cache_misses
        return {
            "requests": self.request_count,
            "total_tokens": self.token_total,
            "cache_hit_ratio": round(self.cache_hits / cache_total, 3) if cache_total else 0,
            "avg_latency_ms": round(sum(self.latencies) / n * 1000, 1) if n else 0,
            "p50_latency_ms": round(sorted_l[n // 2] * 1000, 1) if n else 0,
            "p95_latency_ms": round(sorted_l[int(n * 0.95)] * 1000, 1) if n else 0,
            # Five heaviest users by token count, for cost attribution
            "top_users": dict(sorted(self.user_tokens.items(),
                                     key=lambda x: -x[1])[:5]),
        }
# Structured logging helper
def log_event(event: str, **fields):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
        "event": event,
        **fields,
    }
    print(json.dumps(record, default=str))
# Usage in a request handler
def handle_inference(prompt: str, user_id: str = "anonymous"):
    start = time.time()
    # ... do inference ...
    elapsed = time.time() - start
    log_event(
        "inference_complete",
        user_id=user_id,
        prompt_length=len(prompt.split()),  # whitespace-separated words, not model tokens
        latency_ms=round(elapsed * 1000),
        model="qwen2.5-3b-q4_K_M",
    )
if __name__ == "__main__":
    print("Observability demo")
    collector = MetricsCollector()
    # Simulate requests
    for i in range(20):
        user = f"user-{i % 3}"
        cached = i > 10  # simulate cache warming up
        collector.record_request(user, tokens=50 + i, latency=1.5 + i * 0.1, cached=cached)
        time.sleep(0.05)
    snap = collector.snapshot()
    print("\nMetrics Snapshot:")
    print(json.dumps(snap, indent=2))
Questions for Ayva:
- Which Prometheus metrics are most useful for llama.cpp inference?
- How do you set up a Grafana dashboard for a single-server inference stack?
- Which alert rules prevent silent degradation?
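As one possible starting point for the alert-rule question, thresholds can be checked directly against the dictionary returned by `MetricsCollector.snapshot()`. This is a sketch under assumptions: the threshold values are hypothetical and need tuning against your own baseline, and `check_alerts` is an illustrative name, not part of any alerting tool.

```python
# Hypothetical thresholds; tune them against your own baseline.
THRESHOLDS = {
    "p95_latency_ms": ("max", 5000),
    "cache_hit_ratio": ("min", 0.2),
}


def check_alerts(snapshot: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a human-readable message for every breached threshold."""
    alerts = []
    for key, (kind, limit) in thresholds.items():
        value = snapshot.get(key)
        if value is None:
            continue  # metric not reported; skip rather than guess
        if kind == "max" and value > limit:
            alerts.append(f"{key}={value} exceeds {limit}")
        elif kind == "min" and value < limit:
            alerts.append(f"{key}={value} is below {limit}")
    return alerts


if __name__ == "__main__":
    sample = {"p95_latency_ms": 6200.0, "cache_hit_ratio": 0.45}
    for message in check_alerts(sample):
        print("ALERT:", message)
```

A production setup would normally express the same rules in the monitoring backend, so that they are evaluated over a time window instead of a single snapshot.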
Key Takeaways
- AI observability adds domain-specific metrics (tokens, cache hits, context usage)
- The three pillars (logs, metrics, traces) give different views of the same system
- Start with structured logging, add Prometheus metrics, then traces as needed
- Cost attribution (tokens per user) is essential for production AI systems