Day 24: A/B Testing & Canary Deployments


Learning Objectives

Understand why AI model updates need careful rollout (non-deterministic outputs)
Learn shadow, canary, and blue-green deployment strategies
Set up a weighted router that directs 10% of traffic to a new model version

Theory (15 min)

Why AI Rollouts Are Hard

Non-deterministic: The same input can produce different outputs, so exact-match unit tests cannot reveal a behavior change between model versions
Quality is subjective: A model that scores better on benchmarks may feel worse to users
Regressions are silent: The model doesn't crash — it just answers worse
User trust: A bad experience loses trust faster than a crash

Deployment Strategies

Strategy	How It Works	Risk	Traffic Impact	Detection
Shadow	New model receives a copy of every request; its outputs are logged but never served	None to users (extra inference cost)	None	Compare outputs offline
Canary	5-10% traffic to new model, ramp up gradually	Low	Small subset	Monitor metrics per variant
Blue-Green	Flip all traffic at once (two identical environments)	Medium	Instant switch	Pre-deployment validation
A/B	50/50 split, statistically evaluate	Medium	Half of users	Statistical significance test
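
As a rough illustration of the shadow row in the table above, the sketch below serves the production model's answer and records the candidate's answer for offline comparison. The names `shadow_call`, `old_model`, `new_model`, and `log` are hypothetical stand-ins, not part of any library, and exact string equality is only a placeholder for a real output comparison.

```python
from typing import Callable

def shadow_call(prompt: str,
                production: Callable[[str], str],
                candidate: Callable[[str], str],
                shadow_log: list) -> str:
    """Serve the production answer; record the candidate answer for offline review."""
    served = production(prompt)
    try:
        shadowed = candidate(prompt)
    except Exception as exc:
        # A failing candidate must never affect what the user receives.
        shadowed = f"<error: {exc}>"
    shadow_log.append({
        "prompt": prompt,
        "production": served,
        "candidate": shadowed,
        "match": served == shadowed,  # placeholder comparison (assumption)
    })
    return served

# Hypothetical stand-ins for two model versions.
def old_model(prompt: str) -> str:
    return prompt.upper()

def new_model(prompt: str) -> str:
    return prompt.upper() if len(prompt) < 10 else prompt.title()

log: list = []
answers = [shadow_call(p, old_model, new_model, log)
           for p in ["hi", "what is a canary release"]]
mismatches = [entry for entry in log if not entry["match"]]
print(f"{len(mismatches)} of {len(log)} shadowed outputs differ")
```

In a real service the candidate call would usually run in the background so that it adds no latency to the served response.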

The Canary Pipeline

Week 1: Shadow mode (collect outputs, no user impact)
   ↓ Compare shadow outputs to production → identify regressions
Week 2: 5% canary
   ↓ Monitor: latency, user feedback, token usage
Week 3: 25% canary
   ↓ Escalate monitoring
Week 4: 50%
   ↓ Full confidence
Week 5: 100% (blue-green flip)

Roll back at any stage if metrics degrade.
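
The staged ramp above could be driven by a small helper like the one below. This is a minimal sketch: the stage values come from the pipeline, while `RAMP_STAGES`, `next_canary_weight`, and the boolean `healthy` signal are hypothetical names chosen for illustration.

```python
# Canary traffic share per stage, taken from the pipeline above (shadow mode = 0.0).
RAMP_STAGES = [0.05, 0.25, 0.50, 1.00]

def next_canary_weight(current: float, healthy: bool) -> float:
    """Return the canary share for the next stage, or 0.0 to roll back."""
    if not healthy:
        return 0.0  # roll back: all traffic returns to the old model
    for stage in RAMP_STAGES:
        if stage > current:
            return stage
    return current  # already at 100%

# Walk through a fully healthy rollout, starting from shadow mode.
weight = 0.0
history = []
for healthy in [True, True, True, True]:
    weight = next_canary_weight(weight, healthy)
    history.append(weight)
print(history)

# A degraded metric at the 25% stage sends all traffic back to the old model.
print(next_canary_weight(0.25, healthy=False))
```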

What to Monitor in a Canary

Metric	What to Compare	Action if Degraded
Latency p50/p95	Old vs new	Roll back if >20% slower
Token usage	Old vs new	Flag if >50% more tokens
Cache hit rate	Old vs new	Roll back if significantly lower
User feedback score	Old vs new	Roll back if lower
Refusal rate	Old vs new	Investigate if different
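
The numeric thresholds in the table can be encoded as a simple per-variant check. The sketch below is one possible shape, not a definitive implementation: the metric keys, the sample numbers, and the `refusal_tolerance` parameter are assumptions made for illustration. Cache hit rate is left out because the table gives no numeric threshold for "significantly lower".

```python
def evaluate_canary(old: dict, new: dict, refusal_tolerance: float = 0.01) -> list:
    """Compare per-variant metrics and return (metric, action) pairs."""
    actions = []
    if new["latency_p95"] > old["latency_p95"] * 1.20:   # >20% slower
        actions.append(("latency_p95", "rollback"))
    if new["avg_tokens"] > old["avg_tokens"] * 1.50:     # >50% more tokens
        actions.append(("avg_tokens", "flag"))
    if new["feedback_score"] < old["feedback_score"]:    # lower user feedback
        actions.append(("feedback_score", "rollback"))
    # "Different" needs a tolerance in practice; 0.01 is an assumed value.
    if abs(new["refusal_rate"] - old["refusal_rate"]) > refusal_tolerance:
        actions.append(("refusal_rate", "investigate"))
    return actions

# Hypothetical sample metrics for the two variants.
old_metrics = {"latency_p95": 1.0, "avg_tokens": 100, "feedback_score": 4.2, "refusal_rate": 0.02}
new_metrics = {"latency_p95": 1.3, "avg_tokens": 120, "feedback_score": 4.3, "refusal_rate": 0.02}

for metric, action in evaluate_canary(old_metrics, new_metrics):
    print(f"{metric}: {action}")
```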

Hands-on (15 min)

Build a Weighted A/B Router

#!/usr/bin/env python3
"""ab-router.py — weighted traffic split between model versions."""
import random
import json
import httpx
import time
from http.server import HTTPServer, BaseHTTPRequestHandler
from collections import defaultdict

# Stub — Ayva will expand with:
# - Shadow mode (log new model output but serve old model output)
# - Metrics comparison (latency, token count per variant)
# - Automated rollback trigger (if error rate > threshold)
# - Gradual ramp-up scheduler (5% → 10% → 25% → 50% → 100%)
# - Per-user stickiness (same user always gets same variant)
# - Statistical significance test (is the difference real?)
# - Integration with the AI Gateway from Day 7

class ABRouter:
    def __init__(self, variants: dict):
        """
        variants = {"model-a": {"url": "...", "weight": 0.9},
                    "model-b": {"url": "...", "weight": 0.1}}
        """
        self.variants = variants
        self.results = defaultdict(list)  # variant → [latencies]

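    def select_variant(self) -> str:
        """Weighted random selection; weights are relative and need not sum to 1."""
        names = list(self.variants)
        weights = [self.variants[name]["weight"] for name in names]
        # random.choices normalizes the weights, so a config that sums to less
        # than 1 no longer silently hands the remainder to the last variant.
        return random.choices(names, weights=weights, k=1)[0]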

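    async def route(self, prompt: str, variant: str) -> dict:
        """Send the prompt to one variant and record its latency."""
        url = self.variants[variant]["url"]
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        try:
            async with httpx.AsyncClient(timeout=60) as cli:
                resp = await cli.post(url, json={
                    "prompt": prompt,
                    "max_tokens": 50,
                    "temperature": 0.0,
                })
                resp.raise_for_status()  # treat 4xx/5xx as errors instead of parsing them
                latency = time.perf_counter() - start
                data = resp.json()
                self.results[variant].append(latency)
                return {
                    "variant": variant,
                    "latency": round(latency, 3),
                    "text": data["choices"][0]["text"][:100],
                }
        except Exception as e:
            # Failed requests are reported to the caller but not added to the latency stats.
            return {"variant": variant, "error": str(e)}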

    def stats(self) -> dict:
        stats = {}
        for variant, latencies in self.results.items():
            if latencies:
                sorted_l = sorted(latencies)
                stats[variant] = {
                    "requests": len(latencies),
                    "avg_latency": round(sum(latencies) / len(latencies), 3),
                    "p50": round(sorted_l[len(sorted_l) // 2], 3),
                }
        return stats


# Demo
import asyncio

async def demo():
    router = ABRouter({
        "model-a": {"url": "http://localhost:8080/v1/completions", "weight": 0.9},
        "model-b": {"url": "http://localhost:8081/v1/completions", "weight": 0.1},
    })

    print("A/B Router — 20 simulated requests (90/10 split)")
    for i in range(20):
        variant = router.select_variant()
        result = await router.route(f"Fact number {i}", variant)
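        # Show the latency with its unit, or ERR when the request failed.
        latency = f"{result['latency']}s" if "latency" in result else "ERR"
        print(f"  [{i:>2}] {result['variant']:>8} | {latency:>6}")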

    print(f"\n📊 Stats:\n{json.dumps(router.stats(), indent=2)}")

if __name__ == "__main__":
    asyncio.run(demo())
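
The stub comments in the script list per-user stickiness as a planned extension. One common way to get it, sketched here under assumptions, is to hash a stable user ID into a bucket in [0, 1) and walk the same cumulative weights as the router. The function name `sticky_variant`, the `salt` parameter, and the user ID format are hypothetical.

```python
import hashlib

def sticky_variant(user_id: str, variants: dict, salt: str = "exp-1") -> str:
    """Map a user to a variant deterministically, so the same user always gets the same one."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    total = sum(cfg["weight"] for cfg in variants.values())
    cumulative = 0.0
    for name, cfg in variants.items():
        cumulative += cfg["weight"] / total
        if bucket < cumulative:
            return name
    return list(variants)[-1]  # guard against floating-point rounding

variants = {"model-a": {"weight": 0.9}, "model-b": {"weight": 0.1}}

# The same user lands on the same variant on every call.
print(sticky_variant("user-42", variants) == sticky_variant("user-42", variants))

# Across many users the split should approximate the configured weights.
assigned = [sticky_variant(f"user-{i}", variants) for i in range(10_000)]
share_b = assigned.count("model-b") / len(assigned)
print(f"model-b share: {share_b:.3f}")
```

Changing the salt reshuffles users between variants, which is useful when a new experiment should not inherit the previous assignment.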

Questions for Ayva:

- How to measure statistical significance in LLM A/B tests?
- What's the minimum sample size for reliable model comparison?
- How to handle model-specific caching (different KV cache per variant)?
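
As one possible starting point for the significance question, and assuming the feedback signal is binary (for example thumbs-up versus thumbs-down), a two-proportion z-test compares the positive rates of the two variants. This is a sketch of a standard textbook test, not a recommendation specific to LLMs; the counts below are made up for illustration, and the normal approximation becomes unreliable for small samples.

```python
import math

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 450 of 500 positive on model-a, 40 of 50 positive on model-b.
z, p = two_proportion_z_test(450, 500, 40, 50)
print(f"z = {z:.2f}, p = {p:.3f}")
```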

Key Takeaways

AI model rollouts need more care than regular software (silent regressions, subjective quality)
Shadow → Canary → Blue-Green is a low-risk rollout sequence: each stage exposes more users only after the previous one looks healthy
Monitor variant-specific metrics, not just aggregate
Always have a rollback plan — know the threshold that triggers it
