🧠 AI System Design

Day 24: A/B Testing & Canary Deployments

📂 Production & Case Studies · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand why AI model updates need careful rollout (non-deterministic outputs)
  • Learn shadow, canary, and blue-green deployment strategies
  • Set up a weighted router that directs 10% of traffic to a new model version

Theory (15 min)

Why AI Rollouts Are Hard

  • Non-deterministic: The same input can produce different outputs, so exact-match unit tests cannot reveal that a new model version behaves differently
  • Quality is subjective: A model that scores better on benchmarks may feel worse to users
  • Regressions are silent: The model doesn't crash — it just answers worse
  • User trust: A bad experience loses trust faster than a crash

Deployment Strategies

| Strategy | How It Works | Risk | Traffic Impact | Detection |
|---|---|---|---|---|
| Shadow | New model receives a copy of every request; its outputs are logged but never returned to users | None | None | Compare outputs offline |
| Canary | 5-10% of traffic goes to the new model, then ramps up gradually | Low | Small subset | Monitor metrics per variant |
| Blue-Green | Flip all traffic at once (two identical environments) | Medium | Instant switch | Pre-deployment validation |
| A/B | 50/50 split, evaluated statistically | Medium | Half of users | Statistical significance test |
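
The Shadow row can be sketched in a few lines. This is a minimal illustration, not the lesson's implementation: `shadow_call`, `old_model`, `new_model`, and the log record layout are all hypothetical names, and the two models are plain functions standing in for real endpoints. The character-level similarity score is only a crude way to surface candidates for offline review.

```python
import difflib


def shadow_call(prompt, prod_model, shadow_model, shadow_log):
    """Return the production output; log the shadow output for offline comparison."""
    prod_out = prod_model(prompt)
    record = {"prompt": prompt, "prod": prod_out, "shadow": None, "similarity": None}
    try:
        shadow_out = shadow_model(prompt)
        record["shadow"] = shadow_out
        # Character-level similarity in [0, 1]: a rough proxy, not a quality metric.
        record["similarity"] = difflib.SequenceMatcher(None, prod_out, shadow_out).ratio()
    except Exception as exc:
        # A failing shadow model must never affect what the user receives.
        record["error"] = str(exc)
    shadow_log.append(record)
    return prod_out


# Stand-in models (hypothetical); real ones would be HTTP calls to each endpoint.
def old_model(prompt):
    return f"Answer to: {prompt}"


def new_model(prompt):
    return f"Answer to: {prompt}!"


log = []
served = shadow_call("What is a canary?", old_model, new_model, log)
print("served to user:", served)
print("logged similarity:", round(log[0]["similarity"], 3))
```

In a real service the shadow request would normally run in the background so that it adds no latency to the user-facing response; the sketch calls it inline only to stay short.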

The Canary Pipeline

Week 1: Shadow mode (collect outputs, no user impact)
   ↓ Compare shadow outputs to production → identify regressions
Week 2: 5% canary
   ↓ Monitor: latency, user feedback, token usage
Week 3: 25% canary
   ↓ Escalate monitoring
Week 4: 50%
   ↓ Full confidence
Week 5: 100% (blue-green flip)

Rollback at ANY stage if metrics degrade.
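
The ramp-up with rollback can be expressed as a small state function. The sketch below assumes the stage fractions from the pipeline above and a single boolean health signal; `STAGES`, `next_stage`, and `weights_for` are hypothetical names, and the variant names `model-a` and `model-b` are borrowed from the router example in the hands-on section.

```python
# Canary share at each stage: shadow (0%), then the fractions from the pipeline.
STAGES = [0.0, 0.05, 0.25, 0.50, 1.0]


def next_stage(current, healthy):
    """Advance one stage when metrics are healthy; otherwise roll back to 0%."""
    if not healthy:
        return 0.0
    idx = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[min(idx + 1, len(STAGES) - 1)]


def weights_for(canary_share):
    """Translate a canary share into router weights for the two variants."""
    return {"model-a": round(1.0 - canary_share, 2), "model-b": canary_share}


# Three healthy weeks followed by a degraded one.
share = 0.0
for week, healthy in enumerate([True, True, True, False], start=1):
    share = next_stage(share, healthy)
    print(f"after week {week}: {weights_for(share)}")
```

Rolling back all the way to 0% is one possible policy; stepping back a single stage is another, and which one fits depends on how costly a bad response is.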

What to Monitor in a Canary

| Metric | What to Compare | Action if Degraded |
|---|---|---|
| Latency p50/p95 | Old vs new | Rollback if >20% slower |
| Token usage | Old vs new | Flag if >50% more tokens |
| Cache hit rate | Old vs new | Rollback if significantly lower |
| User feedback score | Old vs new | Rollback if lower |
| Refusal rate | Old vs new | Investigate if different |
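
The numeric thresholds above can be turned into an automated check. This is a sketch under assumptions: the input dictionaries (`latencies`, `tokens`, `feedback`) and the function names are hypothetical, percentiles use the nearest-rank method, and cache hit rate and refusal rate are left out because the table gives no numeric threshold for them.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_canary(baseline, canary):
    """Return (action, reasons), where action is 'rollback', 'flag', or 'ok'."""
    rollback, flags = [], []

    # Latency: rollback if p50 or p95 is more than 20% slower.
    for pct in (50, 95):
        old = percentile(baseline["latencies"], pct)
        new = percentile(canary["latencies"], pct)
        if new > old * 1.20:
            rollback.append(f"p{pct} latency {new:.2f}s vs {old:.2f}s")

    # Token usage: flag if the average is more than 50% higher.
    old_tokens = sum(baseline["tokens"]) / len(baseline["tokens"])
    new_tokens = sum(canary["tokens"]) / len(canary["tokens"])
    if new_tokens > old_tokens * 1.50:
        flags.append(f"avg tokens {new_tokens:.0f} vs {old_tokens:.0f}")

    # User feedback: rollback if the score is lower.
    if canary["feedback"] < baseline["feedback"]:
        rollback.append("user feedback score dropped")

    if rollback:
        return "rollback", rollback
    if flags:
        return "flag", flags
    return "ok", []


baseline = {"latencies": [1.0] * 20, "tokens": [100] * 20, "feedback": 4.2}
slow_canary = {"latencies": [1.3] * 20, "tokens": [100] * 20, "feedback": 4.2}
print(evaluate_canary(baseline, slow_canary))
```

A strict "lower" comparison on feedback will trigger on noise when the canary has few samples, so a production check would likely add a minimum sample count or a significance test before acting.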

Hands-on (15 min)

Build a Weighted A/B Router

#!/usr/bin/env python3
"""ab-router.py — weighted traffic split between model versions."""
import random
import json
import httpx
import time
from http.server import HTTPServer, BaseHTTPRequestHandler
from collections import defaultdict

# Stub — Ayva will expand with:
# - Shadow mode (log new model output but serve old model output)
# - Metrics comparison (latency, token count per variant)
# - Automated rollback trigger (if error rate > threshold)
# - Gradual ramp-up scheduler (5% → 10% → 25% → 50% → 100%)
# - Per-user stickiness (same user always gets same variant)
# - Statistical significance test (is the difference real?)
# - Integration with the AI Gateway from Day 7

class ABRouter:
    def __init__(self, variants: dict):
        """
        variants = {"model-a": {"url": "...", "weight": 0.9},
                    "model-b": {"url": "...", "weight": 0.1}}
        """
        self.variants = variants
        self.results = defaultdict(list)  # variant → [latencies]

    def select_variant(self) -> str:
        """Weighted random selection."""
        r = random.random()
        cumulative = 0
        for name, config in self.variants.items():
            cumulative += config["weight"]
            if r < cumulative:
                return name
        return list(self.variants.keys())[-1]

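    async def route(self, prompt: str, variant: str) -> dict:
        """Send the prompt to one variant and record its latency."""
        url = self.variants[variant]["url"]
        start = time.perf_counter()  # monotonic clock, suited to measuring intervals
        try:
            async with httpx.AsyncClient(timeout=60) as cli:
                resp = await cli.post(url, json={
                    "prompt": prompt,
                    "max_tokens": 50,
                    "temperature": 0.0,
                })
                resp.raise_for_status()  # treat HTTP 4xx/5xx as errors, not results
                latency = time.perf_counter() - start
                data = resp.json()
                self.results[variant].append(latency)
                return {
                    "variant": variant,
                    "latency": round(latency, 3),
                    "text": data["choices"][0]["text"][:100],
                }
        except Exception as e:
            # Failed requests are reported to the caller but excluded from latency stats.
            return {"variant": variant, "error": str(e)}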

    def stats(self) -> dict:
        stats = {}
        for variant, latencies in self.results.items():
            if latencies:
                sorted_l = sorted(latencies)
                stats[variant] = {
                    "requests": len(latencies),
                    "avg_latency": round(sum(latencies) / len(latencies), 3),
                    "p50": round(sorted_l[len(sorted_l) // 2], 3),
                }
        return stats


# Demo
import asyncio

async def demo():
    router = ABRouter({
        "model-a": {"url": "http://localhost:8080/v1/completions", "weight": 0.9},
        "model-b": {"url": "http://localhost:8081/v1/completions", "weight": 0.1},
    })

    print("A/B Router — 20 simulated requests (90/10 split)")
    for i in range(20):
        variant = router.select_variant()
        result = await router.route(f"Fact number {i}", variant)
        print(f"  [{i:>2}] {result['variant']:>8} | "
              f"{result.get('latency', 'ERR'):>5}s")

    print(f"\n📊 Stats:\n{json.dumps(router.stats(), indent=2)}")

asyncio.run(demo())
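
The stub list mentions per-user stickiness, which `select_variant` does not provide because it draws a fresh random number on every request. One common approach, sketched here as an assumption rather than as the planned implementation, is to hash the user ID into a stable bucket in [0, 1) and walk the same cumulative weights. The function name `sticky_variant` and the `salt` parameter are hypothetical; the `variants` dictionary has the same shape as the one passed to `ABRouter`.

```python
import hashlib


def sticky_variant(user_id, variants, salt="day24-experiment"):
    """Deterministically map a user to a variant according to the weights."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    # First 32 bits of the hash, scaled to a stable bucket in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    cumulative = 0.0
    for name, config in variants.items():
        cumulative += config["weight"]
        if bucket < cumulative:
            return name
    return list(variants)[-1]  # guards against weights that sum to slightly under 1


variants = {
    "model-a": {"url": "http://localhost:8080/v1/completions", "weight": 0.9},
    "model-b": {"url": "http://localhost:8081/v1/completions", "weight": 0.1},
}

assignments = [sticky_variant(f"user-{i}", variants) for i in range(10_000)]
share_b = assignments.count("model-b") / len(assignments)
print(f"model-b share over 10,000 users: {share_b:.3f}")
```

Changing the salt reshuffles every user, which is useful when starting a new experiment so that the same users are not always the ones exposed to canaries.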

Questions for Ayva:

  • How to measure statistical significance in LLM A/B tests?
  • What's the minimum sample size for reliable model comparison?
  • How to handle model-specific caching (different KV cache per variant)?
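
As a possible starting point for the first question: when user feedback is binary (thumbs up or down), a two-proportion z-test is one standard way to check whether the difference between variants is larger than chance would explain. This is an assumption about the feedback format, not the lesson's chosen method, and the function name and sample counts below are illustrative only.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every response succeeded or every response failed: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts: 800/1000 thumbs-up for model-a, 700/1000 for model-b.
z, p = two_proportion_z_test(800, 1000, 700, 1000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

The normal approximation behind this test needs a reasonable number of successes and failures in each group, so it is a poor fit for the first hours of a 5% canary. Graded or free-text quality scores would call for a different test.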


Key Takeaways

  • AI model rollouts need more care than regular software (silent regressions, subjective quality)
  • Shadow → Canary → Blue-Green is a safe default rollout sequence
  • Monitor variant-specific metrics, not just aggregate
  • Always have a rollback plan — know the threshold that triggers it
