🧠 AI System Design

Day 21: Mini-Project — Inference Benchmark Suite

📂 Serving & Inference 📖 15 min read Needs expansion

Learning Objectives

  • Design a benchmark that measures what matters for your workload
  • Sweep multiple parameters and produce a human-readable comparison
  • Understand the interactions between batch size, quantisation, and concurrency

Theory (15 min)

What to Measure

Metric               What It Tells You                             Target
Tokens per second    Raw throughput                                >10 tok/s (interactive), >50 tok/s (batch)
Time to first token  Perceived latency                             <500 ms
P50 latency          Typical user experience                       <3 s
P95 latency          Near-worst case, ignoring extreme outliers    <10 s
VRAM usage           Hardware requirements                         Fits comfortably
Cost per token       Economics                                     Depends on budget
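The latency and throughput rows above can be derived from raw per-request samples. The following is a minimal sketch, assuming the nearest-rank percentile definition (one of several in common use) and hypothetical sample data. The names `percentile` and `summarise` are illustrative and are not part of the suite built later in this lesson.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


def summarise(latencies_s: Sequence[float], output_tokens: Sequence[int]) -> dict:
    """Reduce per-request measurements to headline metrics."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "mean_s": sum(latencies_s) / len(latencies_s),
        # Aggregate throughput: total generated tokens over total request time.
        "tokens_per_s": sum(output_tokens) / sum(latencies_s),
    }


# Hypothetical measurements: nine ordinary requests and one slow outlier.
latencies = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 1.1, 0.9, 1.0, 8.0]
tokens = [100] * 10
stats = summarise(latencies, tokens)
print(stats)
```

In this made-up sample the single slow request pulls the mean well above the median, while the p95 exposes the tail directly. With only ten samples the p95 is simply the slowest request, so collect more samples before reading much into tail figures.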

Benchmark Pitfalls

  1. Cold start: The first request includes model load time. Warm up before measuring.
  2. Prompt cache: Repeating the same prompt lets the server reuse cached work, which inflates results. Vary the prompts.
  3. Warm-up: The first 1-2 requests after an idle period are slow. Run 3+ warm-up requests and discard them.
  4. Background noise: Other processes compete for CPU, GPU, and memory. Run on a quiet system.
  5. Statistical significance: A single run proves little. Repeat each test 5+ times and report the mean and standard deviation.
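Reporting the mean and standard deviation across repeated runs, as the last pitfall suggests, might look like the sketch below. It uses only the standard library. The throughput figures are hypothetical, and `describe_runs` is an illustrative name.

```python
import statistics
from typing import Sequence


def describe_runs(values: Sequence[float]) -> dict:
    """Summarise repeated runs of one configuration."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return {
        "n": len(values),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation: spread relative to the mean
    }


# Hypothetical tokens-per-second results from five repeated runs.
runs_tok_per_s = [41.8, 42.5, 40.9, 43.1, 41.7]
report = describe_runs(runs_tok_per_s)
print(
    f"{report['mean']:.1f} ± {report['stdev']:.1f} tok/s "
    f"(n={report['n']}, CV {report['cv']:.1%})"
)
```

A coefficient of variation that stays high across repeats usually points back to one of the other pitfalls, such as background load or an incomplete warm-up.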

The Parameter Space

Quantisation: [q2_K, q3_K, q4_K_M, q8_0]
Batch size:   [1, 2, 4, 8, 16]
Concurrency:  [1, 2, 4, 8]
Input length: [50, 200, 500, 1000] tokens
Output length:[20, 50, 100, 200] tokens

That's 4 × 5 × 4 × 4 × 4 = 1,280 combinations. You don't need all of them. Test one parameter at a time while holding the others constant.
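The one-parameter-at-a-time strategy can be sketched as a small configuration generator. The parameter values come from the list above. The `BASELINE` choice and the function names are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List

PARAMETER_SPACE: Dict[str, list] = {
    "quantisation": ["q2_K", "q3_K", "q4_K_M", "q8_0"],
    "batch_size": [1, 2, 4, 8, 16],
    "concurrency": [1, 2, 4, 8],
    "input_tokens": [50, 200, 500, 1000],
    "output_tokens": [20, 50, 100, 200],
}

# Assumed baseline: the configuration held constant while one parameter varies.
BASELINE = {
    "quantisation": "q4_K_M",
    "batch_size": 1,
    "concurrency": 1,
    "input_tokens": 200,
    "output_tokens": 100,
}


def full_grid(space: Dict[str, list]) -> List[dict]:
    """Every combination of every parameter value."""
    return [dict(zip(space, combo)) for combo in product(*space.values())]


def one_at_a_time(space: Dict[str, list], baseline: dict) -> List[dict]:
    """The baseline, plus one config per non-baseline value of each parameter."""
    configs = [dict(baseline)]
    for name, values in space.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs


grid = full_grid(PARAMETER_SPACE)
sweep = one_at_a_time(PARAMETER_SPACE, BASELINE)
print(f"full grid: {len(grid)} configs, one-at-a-time sweep: {len(sweep)} configs")
```

With this baseline the sweep needs 17 configurations. The trade-off is that a one-at-a-time sweep cannot reveal interactions between parameters, for example between batch size and concurrency, so a few targeted two-parameter sweeps may still be worth adding.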


Hands-on (15 min)

Build the Benchmark Suite

#!/usr/bin/env python3
"""benchmark-suite.py — comprehensive inference benchmark."""
import time
import json
import csv
import httpx
import asyncio
from dataclasses import dataclass, field
from typing import List

# Stub — Ayva will expand with:
# - All parameter sweeps (quant, batch, concurrency, input_len, output_len)
# - CSV/JSON output for analysis
# - HTML report generation (charts via matplotlib or plotly)
# - Warm-up phase before measurements
# - Multiple prompt variations per test
# - Memory and GPU utilisation logging
# - Comparison with historical runs (version tracking)
# - Integration with the gateway from Week 1 for end-to-end testing

@dataclass
class BenchmarkConfig:
    llm_url: str = "http://localhost:8080/v1/completions"
    model_name: str = "qwen2.5-3b-q4_K_M"
    n_warmup: int = 3
    n_runs: int = 5
    concurrency: int = 1
    max_tokens: int = 100
    temperature: float = 0.0

@dataclass
class BenchmarkResult:
    config: BenchmarkConfig
    latencies: List[float] = field(default_factory=list)
    tokens_counts: List[int] = field(default_factory=list)

    def summary(self) -> dict:
        """Aggregate per-request measurements into headline statistics."""
        if not self.latencies:
            return {"error": "no data"}
        sorted_l = sorted(self.latencies)
        n = len(sorted_l)
        return {
            "avg_latency": round(sum(self.latencies) / n, 3),
            "p50": round(sorted_l[n // 2], 3),
            # With 20 or fewer runs this index is the slowest request.
            "p95": round(sorted_l[int(n * 0.95)], 3),
            "min": round(sorted_l[0], 3),
            "max": round(sorted_l[-1], 3),
            # Aggregate throughput: total tokens over total request time.
            "avg_tokens_per_sec": round(
                sum(self.tokens_counts) / sum(self.latencies), 2
            ),
            "total_tokens": sum(self.tokens_counts),
        }

async def run_benchmark(config: BenchmarkConfig, prompts: List[str]) -> BenchmarkResult:
    result = BenchmarkResult(config=config)
    async with httpx.AsyncClient(timeout=120) as client:

        # Warm-up: results are discarded, so transport errors are ignored here
        for p in prompts[:config.n_warmup]:
            try:
                await client.post(config.llm_url, json={
                    "prompt": p, "max_tokens": 10, "temperature": config.temperature,
                })
            except httpx.HTTPError:
                pass

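        # Benchmarked runs (sequential: config.concurrency is reported but not applied)
        for p in prompts[config.n_warmup:config.n_warmup + config.n_runs]:
            start = time.perf_counter()  # monotonic clock, safe for measuring intervals
            try:
                resp = await client.post(config.llm_url, json={
                    "prompt": p, "max_tokens": config.max_tokens,
                    "temperature": config.temperature,
                })
                elapsed = time.perf_counter() - start
                resp.raise_for_status()  # do not record error responses as results
                data = resp.json()
                text = data["choices"][0]["text"]
                # Prefer the server-reported count when the response carries a
                # "usage" block (assumes an OpenAI-compatible server). Otherwise
                # fall back to a whitespace word count, which only approximates tokens.
                tokens = (data.get("usage") or {}).get("completion_tokens") or len(text.split())
                result.latencies.append(elapsed)
                result.tokens_counts.append(tokens)
            except Exception as e:
                print(f"  ⚠️  Error: {e}")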

    return result


async def main():
    prompts = [
        "Explain machine learning.",
        "What is the capital of France?",
        "Write a haiku about databases.",
        "Compare TCP and UDP.",
        "What is caching?",
        "Describe attention in transformers.",
        "Write a Python decorator.",
        "What is load balancing?",
    ] * 3  # enough for warmup + runs

    config = BenchmarkConfig()
    result = await run_benchmark(config, prompts)
    summary = result.summary()

    print(f"\n📊 Benchmark: {config.model_name}")
    print(f"   Concurrency: {config.concurrency}, Max tokens: {config.max_tokens}")
    print(f"   Runs: {len(result.latencies)} after {config.n_warmup} warm-ups")
    print(f"\n{'Metric':<20} {'Value':<15}")
    print("-" * 35)
    for k, v in summary.items():
        print(f"  {k:<18} {v}")

    # Save to CSV
    with open("benchmark_results.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["run", "latency_s", "tokens", "tokens_per_sec"])
        for i, (lat, tok) in enumerate(zip(result.latencies, result.tokens_counts)):
            w.writerow([i + 1, lat, tok, round(tok / lat, 2)])
    print("\n💾 Saved to benchmark_results.csv")

if __name__ == "__main__":
    asyncio.run(main())

Run it:

python3 /tmp/benchmark-suite.py
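The script above sends requests one at a time, so `config.concurrency` is recorded but has no effect. One way to apply it is to cap the number of in-flight requests with `asyncio.Semaphore`. The sketch below uses a stand-in `fake_send` coroutine instead of a real HTTP call so that it runs without a server. `run_concurrent`, `fake_send`, and `state` are hypothetical names.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def run_concurrent(
    send: Callable[[str], Awaitable[int]],
    prompts: List[str],
    concurrency: int,
) -> List[Tuple[float, int]]:
    """Call send(prompt) for every prompt with at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def timed(prompt: str) -> Tuple[float, int]:
        async with gate:
            # The timer starts after the slot is acquired, so time spent
            # waiting in the client-side queue is excluded.
            start = time.perf_counter()
            tokens = await send(prompt)
            return time.perf_counter() - start, tokens

    return list(await asyncio.gather(*(timed(p) for p in prompts)))


# Stand-in for the real request: sleeps briefly and tracks peak parallelism.
state = {"in_flight": 0, "peak": 0}


async def fake_send(prompt: str) -> int:
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.05)  # simulated generation time
    state["in_flight"] -= 1
    return len(prompt.split())  # pretend each word is one token


demo_prompts = [f"prompt number {i}" for i in range(8)]
results = asyncio.run(run_concurrent(fake_send, demo_prompts, concurrency=4))
print(f"{len(results)} requests done, peak in flight: {state['peak']}")
```

To use this with the suite, `send` would wrap the `client.post(...)` call and return the token count. Starting the timer before acquiring the semaphore instead would include queueing delay, which is closer to what a user sees when the server is saturated.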

Questions for Ayva:

  • How to visualise the benchmark results (latency distribution, throughput curves)?
  • What's the methodology for comparing two inference servers fairly?
  • How to detect if your benchmark results are statistically significant?
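As one possible starting point for the significance question, an exact permutation test checks whether the gap between two sets of runs is larger than relabelling the runs would produce by chance. This is a sketch of one technique among several, not a settled methodology for this course. The latency figures are hypothetical, `permutation_p_value` is an illustrative name, and full enumeration is only practical for small run counts such as the 5 runs used here.

```python
import statistics
from itertools import combinations
from typing import Sequence


def permutation_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(a) + list(b)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Try every way of relabelling len(a) of the pooled runs as "server A".
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        diff = abs(statistics.mean(group_a) - statistics.mean(group_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical mean latencies (seconds) from five runs on each of two servers.
server_a = [1.00, 1.02, 0.98, 1.01, 0.99]
server_b = [1.20, 1.22, 1.18, 1.21, 1.19]
p_value = permutation_p_value(server_a, server_b)
print(f"p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be a relabelling accident. It says nothing about whether the two servers were tested under the same conditions, so the pitfalls listed earlier still apply.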


Key Takeaways

  • Measure what matters for your use case — not the highest possible number
  • Always warm up before benchmarking (cold start is misleading)
  • Sweep one parameter at a time while holding others constant
  • Report percentiles (p50, p95), not just averages: averages hide outliers
