Production & Case Studies | 15 min read | Needs expansion
Learning Objectives
Understand why AI model updates need careful rollout (non-deterministic outputs)
Learn shadow, canary, and blue-green deployment strategies
Set up a weighted router that directs 10% of traffic to a new model version
Theory (15 min)
Why AI Rollouts Are Hard
Non-deterministic: Same input → different output (a model version change is invisible in unit tests)
Quality is subjective: A model that scores better on benchmarks may feel worse to users
Regressions are silent: The model doesn't crash; it just answers worse
User trust: A bad experience loses trust faster than a crash
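To make the first point concrete: an exact-match assertion breaks on any rewording, so regression checks usually score similarity rather than equality. The following is a minimal sketch using the standard library's `difflib`. The function names, the 0.8 threshold, and the sample strings are illustrative assumptions, and real evaluations tend to use embedding similarity or a judge model rather than character overlap.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 character-overlap ratio between two answers."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_regression(old_answer: str, new_answer: str, threshold: float = 0.8) -> bool:
    """Flag the new answer when it drifts too far from the old one.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return similarity(old_answer, new_answer) < threshold


old = "The capital of France is Paris."
reworded = "The capital of France is Paris!"
changed = "I cannot answer questions about geography."

print(old == reworded)               # the exact-match check on a trivial rewording
print(is_regression(old, reworded))  # the similarity check on the same pair
print(is_regression(old, changed))   # the similarity check on a real behavior change
```

Character overlap is only a stand-in here; the point is that the comparison has to tolerate harmless variation while still catching a change in behavior.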
Deployment Strategies
Strategy | How It Works | Risk | Traffic Impact | Detection
Shadow | New model receives a copy of every request; its outputs are logged but never served to users | None | None | Compare outputs offline
Canary | 5-10% of traffic goes to the new model, then ramps up gradually | Low | Small subset | Monitor metrics per variant
Blue-Green | Flip all traffic at once between two identical environments | Medium | Instant switch | Pre-deployment validation
A/B | 50/50 split, evaluated statistically | Medium | Half of users | Statistical significance test
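Shadow mode is the least familiar of these strategies, so here is a rough sketch of its core idea: the user always gets the old model's answer, while the new model's answer is recorded in the background for offline comparison. Everything named here (`call_old_model`, `call_new_model`, `shadow_log`, the timeout) is a hypothetical stand-in; real model calls would be HTTP requests and the log would live in durable storage.

```python
import asyncio

shadow_log: list[dict] = []            # assumption: stands in for durable storage
background: set[asyncio.Task] = set()  # keeps references so tasks are not garbage-collected


async def call_old_model(prompt: str) -> str:
    """Hypothetical stand-in for the production model."""
    await asyncio.sleep(0.01)
    return f"old:{prompt}"


async def call_new_model(prompt: str) -> str:
    """Hypothetical stand-in for the candidate model."""
    await asyncio.sleep(0.01)
    return f"new:{prompt}"


async def record_shadow(prompt: str, served: str) -> None:
    """Call the new model and log its output next to what was actually served."""
    try:
        shadow = await asyncio.wait_for(call_new_model(prompt), timeout=5.0)
    except Exception as exc:  # a shadow failure must never reach the user
        shadow = f"<shadow error: {exc!r}>"
    shadow_log.append({"prompt": prompt, "served": served, "shadow": shadow})


async def handle_request(prompt: str) -> str:
    """Serve the old model's answer; shadow the new model in the background."""
    served = await call_old_model(prompt)
    task = asyncio.create_task(record_shadow(prompt, served))
    background.add(task)
    task.add_done_callback(background.discard)
    return served


async def main() -> str:
    answer = await handle_request("hello")
    await asyncio.gather(*background)  # drain shadow calls before exiting
    return answer


if __name__ == "__main__":
    print(asyncio.run(main()))
    print(shadow_log)
```

Because the shadow call runs after the response is produced, a slow or failing candidate model does not change what users see. It does still roughly double inference cost while shadowing is active.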
The Canary Pipeline
Week 1: Shadow mode (collect outputs, no user impact)
  → Compare shadow outputs to production → identify regressions
Week 2: 5% canary
  → Monitor: latency, user feedback, token usage
Week 3: 25% canary
  → Escalate monitoring
Week 4: 50%
  → Full confidence
Week 5: 100% (blue-green flip)
Rollback at ANY stage if metrics degrade.
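The schedule above can be modeled as a small state machine: each review either advances one stage or rolls back to 0% for the new model. This is a sketch under assumptions: the `Rollout` class, the stage names, and the idea of a single boolean health verdict per review are all hypothetical simplifications.

```python
from dataclasses import dataclass

# Stages from the pipeline above: (name, share of user traffic served by the new model)
STAGES = [
    ("shadow", 0.00),
    ("canary-5", 0.05),
    ("canary-25", 0.25),
    ("canary-50", 0.50),
    ("full", 1.00),
]


@dataclass
class Rollout:
    stage: int = 0
    rolled_back: bool = False

    @property
    def new_model_weight(self) -> float:
        """Share of user traffic the new model should serve right now."""
        return 0.0 if self.rolled_back else STAGES[self.stage][1]

    def review(self, metrics_healthy: bool) -> str:
        """Advance one stage if metrics look healthy, otherwise roll back."""
        if self.rolled_back:
            return "rolled-back"
        if not metrics_healthy:
            self.rolled_back = True
            return "rolled-back"
        if self.stage < len(STAGES) - 1:
            self.stage += 1
        return STAGES[self.stage][0]


if __name__ == "__main__":
    rollout = Rollout()
    for healthy in [True, True, False, True]:
        print(rollout.review(healthy), rollout.new_model_weight)
```

In this sketch a rollback is sticky: once triggered, later healthy reviews do not resume the ramp, which forces a human decision before trying again.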
What to Monitor in a Canary
Metric | What to Compare | Action if Degraded
Latency p50/p95 | Old vs new | Rollback if >20% slower
Token usage | Old vs new | Flag if >50% more tokens
Cache hit rate | Old vs new | Rollback if significantly lower
User feedback score | Old vs new | Rollback if lower
Refusal rate | Old vs new | Investigate if different
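The rules in this table translate fairly directly into a comparison function. The sketch below is one possible encoding, not a reference implementation: `VariantMetrics` and `evaluate_canary` are hypothetical names, and the table does not quantify "significantly lower" or "different", so the 10-point and 2-point tolerances are assumptions to tune.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    latency_p95: float     # seconds
    avg_tokens: float      # tokens per response
    cache_hit_rate: float  # 0..1
    feedback_score: float  # e.g. thumbs-up rate, 0..1
    refusal_rate: float    # 0..1


def evaluate_canary(old: VariantMetrics, new: VariantMetrics) -> dict:
    """Apply the table's rules and return the actions they trigger."""
    rollback, flags = [], []
    if new.latency_p95 > old.latency_p95 * 1.20:
        rollback.append("latency p95 more than 20% slower")
    if new.avg_tokens > old.avg_tokens * 1.50:
        flags.append("token usage more than 50% higher")
    # "Significantly lower" is not quantified in the table; 10 points is an assumption.
    if new.cache_hit_rate < old.cache_hit_rate - 0.10:
        rollback.append("cache hit rate significantly lower")
    if new.feedback_score < old.feedback_score:
        rollback.append("user feedback score lower")
    # "Different" is likewise unquantified; 2 points either way is an assumption.
    if abs(new.refusal_rate - old.refusal_rate) > 0.02:
        flags.append("refusal rate differs")
    return {"rollback": bool(rollback), "reasons": rollback, "investigate": flags}


if __name__ == "__main__":
    old = VariantMetrics(1.0, 100, 0.60, 0.80, 0.030)
    new = VariantMetrics(1.3, 120, 0.58, 0.82, 0.031)
    print(evaluate_canary(old, new))
```

A strict "lower feedback score" rule will fire on noise when sample sizes are small, so in practice it is usually paired with a minimum sample count or a significance test.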
Hands-on (15 min)
Build a Weighted A/B Router
#!/usr/bin/env python3
"""ab-router.py: weighted traffic split between model versions."""
import asyncio
import json
import random
import time
from collections import defaultdict

import httpx

# Stub - Ayva will expand with:
# - Shadow mode (log new model output but serve old model output)
# - Metrics comparison (latency, token count per variant)
# - Automated rollback trigger (if error rate > threshold)
# - Gradual ramp-up scheduler (5% -> 10% -> 25% -> 50% -> 100%)
# - Per-user stickiness (same user always gets same variant)
# - Statistical significance test (is the difference real?)
# - Integration with the AI Gateway from Day 7


class ABRouter:
    def __init__(self, variants: dict):
        """
        variants = {"model-a": {"url": "...", "weight": 0.9},
                    "model-b": {"url": "...", "weight": 0.1}}
        Weights are expected to sum to 1.0.
        """
        self.variants = variants
        self.results = defaultdict(list)  # variant -> [latencies in seconds]

    def select_variant(self) -> str:
        """Weighted random selection."""
        r = random.random()
        cumulative = 0.0
        for name, config in self.variants.items():
            cumulative += config["weight"]
            if r < cumulative:
                return name
        # Fallback for rounding error when the weights sum to just under 1.0
        return list(self.variants.keys())[-1]

    async def route(self, prompt: str, variant: str) -> dict:
        """Send the prompt to the chosen variant and record its latency."""
        url = self.variants[variant]["url"]
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        try:
            async with httpx.AsyncClient(timeout=60) as cli:
                resp = await cli.post(url, json={
                    "prompt": prompt,
                    "max_tokens": 50,
                    "temperature": 0.0,
                })
            resp.raise_for_status()  # treat HTTP 4xx/5xx as errors, not results
            latency = time.perf_counter() - start
            data = resp.json()
            self.results[variant].append(latency)
            return {
                "variant": variant,
                "latency": round(latency, 3),
                "text": data["choices"][0]["text"][:100],
            }
        except Exception as e:
            # Failed requests are reported but not counted in latency stats
            return {"variant": variant, "error": str(e)}

    def stats(self) -> dict:
        """Per-variant request count, mean latency, and median (p50) latency."""
        stats = {}
        for variant, latencies in self.results.items():
            if latencies:
                sorted_l = sorted(latencies)
                stats[variant] = {
                    "requests": len(latencies),
                    "avg_latency": round(sum(latencies) / len(latencies), 3),
                    "p50": round(sorted_l[len(sorted_l) // 2], 3),
                }
        return stats


# Demo
async def demo():
    router = ABRouter({
        "model-a": {"url": "http://localhost:8080/v1/completions", "weight": 0.9},
        "model-b": {"url": "http://localhost:8081/v1/completions", "weight": 0.1},
    })
    print("A/B Router: 20 simulated requests (90/10 split)")
    for i in range(20):
        variant = router.select_variant()
        result = await router.route(f"Fact number {i}", variant)
        # Prints ERR in the latency column when the request failed
        print(f"  [{i:>2}] {result['variant']:>8} | "
              f"{result.get('latency', 'ERR'):>5}s")
    print(f"\nStats:\n{json.dumps(router.stats(), indent=2)}")


if __name__ == "__main__":
    asyncio.run(demo())
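One item from the stub list, per-user stickiness, can be sketched without touching the router: hash the user ID to a stable point in [0, 1) and walk the same cumulative weights. The function name `sticky_variant`, the `salt` parameter, and the `user-N` IDs are assumptions made for illustration.

```python
import hashlib


def sticky_variant(user_id: str, weights: dict[str, float], salt: str = "rollout-1") -> str:
    """Map a user to a variant deterministically, respecting the weights."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    # Use the top 53 bits of the hash as an exact float in [0, 1).
    point = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return list(weights)[-1]  # guards against weights summing to slightly under 1


if __name__ == "__main__":
    weights = {"model-a": 0.9, "model-b": 0.1}
    print(sticky_variant("user-42", weights))  # same answer on every call
    counts = {"model-a": 0, "model-b": 0}
    for i in range(10_000):
        counts[sticky_variant(f"user-{i}", weights)] += 1
    print(counts)  # roughly a 90/10 split
```

Changing the salt reshuffles which users land in the canary, which is useful between rollouts so the same people are not always the first to see a new model. Note that raising a weight mid-rollout keeps existing canary users in the canary only if the variant order stays the same.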
Questions for Ayva:
- How to measure statistical significance in LLM A/B tests?
- What's the minimum sample size for reliable model comparison?
- How to handle model-specific caching (different KV cache per variant)?
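As a possible starting point for the first question, and only when feedback is binary (for example thumbs up or down), a two-proportion z-test is a common choice. The sketch below uses only the standard library; the function name and the counts in the usage example are made up for illustration, and graded or pairwise quality scores would call for a different test.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both variants are all-success or all-failure
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: thumbs-up out of rated responses for each variant.
    z, p = two_proportion_z_test(success_a=820, n_a=1000, success_b=790, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

The normal approximation behind this test is unreliable for small samples or rates near 0 or 1, which is part of why the sample-size question matters before trusting any result.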
Key Takeaways
AI model rollouts need more care than regular software (silent regressions, subjective quality)
Shadow → Canary → Blue-Green is the safe deployment pipeline
Monitor variant-specific metrics, not just aggregate
Always have a rollback plan: know the threshold that triggers it