Day 21: Mini-Project - Inference Benchmark Suite
Learning Objectives
- Design a benchmark that measures what matters for your workload
- Sweep multiple parameters and produce a human-readable comparison
- Understand the interactions between batch size, quantisation, and concurrency
Theory (15 min)
What to Measure
| Metric | What It Tells You | Target |
|---|---|---|
| Tokens per second | Raw throughput | >10 tok/s (interactive), >50 tok/s (batch) |
| Time to first token | Perceived latency | <500ms |
| P50 latency | Typical user experience | <3s |
| P95 latency | Tail latency (near worst case, ignoring extreme outliers) | <10s |
| VRAM usage | Hardware requirements | Fit comfortably |
| Cost per token | Economics | Depends on budget |
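Time to first token can only be observed on a streamed response. The following is a minimal sketch, assuming the generated text arrives as an iterable of chunks; `measure_stream` and its `clock` parameter are hypothetical names used for illustration and are not part of any library.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a streamed response.

    `chunks` is any iterable that yields pieces of generated text as they
    arrive (for example, the parsed events of a streaming HTTP response).
    """
    start = clock()
    first: Optional[float] = None
    n_chunks = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first is None:
            first = clock() - start  # time to first token
        n_chunks += 1
    total = clock() - start
    return {"ttft_s": first, "total_s": total, "chunks": n_chunks}
```

A chunk is not necessarily one token, so `chunks` is only a rough progress count. Injecting `clock` keeps the function testable without a live server.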
Benchmark Pitfalls
- Cold start: The first request after the server starts includes model load time. Send a request before you begin measuring.
- Prompt cache: Repeating the same prompt can hit the server's prompt cache and make results look faster than they are. Vary prompts.
- Warm-up: The first 1-2 requests after an idle period are slow. Run 3+ warm-up requests and discard them.
- Background noise: Other processes affect results. Run on a quiet system.
- Statistical significance: Run each test 5+ times. Report the mean and standard deviation.
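As a small illustration of the last point, the sketch below summarises repeated runs with Python's standard `statistics` module. The function name `summarise` and the latency values are made up for the example.

```python
import statistics
from typing import Sequence


def summarise(samples: Sequence[float]) -> dict:
    """Mean, sample standard deviation, and median of repeated runs."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample (n - 1) standard deviation
        "median": statistics.median(samples),
    }


runs = [2.0, 2.5, 1.5, 2.0, 7.0]  # made-up latencies in seconds, one outlier
print(summarise(runs))
```

With these made-up numbers the single slow run pulls the mean to 3.0 s while the median stays at 2.0 s, which is why the spread belongs next to the average.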
The Parameter Space
Quantisation: [q2_K, q3_K, q4_K_M, q8_0]
Batch size: [1, 2, 4, 8, 16]
Concurrency: [1, 2, 4, 8]
Input length: [50, 200, 500, 1000] tokens
Output length: [20, 50, 100, 200] tokens
That's 4 × 5 × 4 × 4 × 4 = 1,280 combinations. You don't need all of them. Test one parameter at a time while holding others constant.
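The one-parameter-at-a-time approach can be sketched as follows. The parameter lists are the ones given above; the `BASELINE` values and the helper name `one_at_a_time` are assumptions chosen for illustration.

```python
from itertools import product

PARAMS = {
    "quant": ["q2_K", "q3_K", "q4_K_M", "q8_0"],
    "batch_size": [1, 2, 4, 8, 16],
    "concurrency": [1, 2, 4, 8],
    "input_len": [50, 200, 500, 1000],
    "output_len": [20, 50, 100, 200],
}

# Hypothetical baseline: the values held constant while one parameter varies.
BASELINE = {"quant": "q4_K_M", "batch_size": 1, "concurrency": 1,
            "input_len": 200, "output_len": 100}


def one_at_a_time(params: dict, baseline: dict) -> list:
    """Baseline config plus every config that differs from it in one parameter."""
    configs = [dict(baseline)]
    for name, values in params.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs


full_grid = list(product(*PARAMS.values()))  # every combination
sweep = one_at_a_time(PARAMS, BASELINE)      # baseline + single-parameter variations
print(len(full_grid), len(sweep))
```

With this baseline the sweep contains 17 configurations: the baseline itself plus one run for each non-baseline value of each parameter. The trade-off is that it cannot reveal interactions between parameters, so follow up with a small grid over any pair that looks coupled.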
Hands-on (15 min)
Build the Benchmark Suite
#!/usr/bin/env python3
"""benchmark-suite.py - comprehensive inference benchmark."""
import time
import json
import csv
import httpx
import asyncio
from dataclasses import dataclass, field
from typing import List
# Stub - Ayva will expand with:
# - All parameter sweeps (quant, batch, concurrency, input_len, output_len)
# - CSV/JSON output for analysis
# - HTML report generation (charts via matplotlib or plotly)
# - Warm-up phase before measurements
# - Multiple prompt variations per test
# - Memory and GPU utilisation logging
# - Comparison with historical runs (version tracking)
# - Integration with the gateway from Week 1 for end-to-end testing
@dataclass
class BenchmarkConfig:
    llm_url: str = "http://localhost:8080/v1/completions"
    model_name: str = "qwen2.5-3b-q4_K_M"
    n_warmup: int = 3
    n_runs: int = 5
    concurrency: int = 1
    max_tokens: int = 100
    temperature: float = 0.0
@dataclass
class BenchmarkResult:
    config: BenchmarkConfig
    latencies: List[float] = field(default_factory=list)
    tokens_counts: List[int] = field(default_factory=list)

    def summary(self) -> dict:
        if not self.latencies:
            return {"error": "no data"}
        sorted_l = sorted(self.latencies)
        n = len(sorted_l)
        return {
            "avg_latency": round(sum(self.latencies) / n, 3),
            "p50": round(sorted_l[n // 2], 3),
            "p95": round(sorted_l[int(n * 0.95)], 3),
            "min": round(sorted_l[0], 3),
            "max": round(sorted_l[-1], 3),
            "avg_tokens_per_sec": round(
                sum(self.tokens_counts) / sum(self.latencies), 2
            ),
            "total_tokens": sum(self.tokens_counts),
        }
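async def run_benchmark(config: BenchmarkConfig, prompts: List[str]) -> BenchmarkResult:
    """Send warm-up requests, then time n_runs requests one at a time.

    Note: config.concurrency is not used yet; requests run sequentially.
    """
    result = BenchmarkResult(config=config)
    async with httpx.AsyncClient(timeout=120) as client:
        # Warm-up: results are discarded, so request failures are ignored
        for p in prompts[:config.n_warmup]:
            try:
                await client.post(config.llm_url, json={
                    "prompt": p, "max_tokens": 10, "temperature": config.temperature,
                })
            except httpx.HTTPError:
                pass
        # Benchmarked runs
        for p in prompts[config.n_warmup:config.n_warmup + config.n_runs]:
            start = time.perf_counter()  # monotonic clock, suited to timing
            try:
                resp = await client.post(config.llm_url, json={
                    "prompt": p, "max_tokens": config.max_tokens,
                    "temperature": config.temperature,
                })
                elapsed = time.perf_counter() - start
                resp.raise_for_status()  # do not record error responses
                data = resp.json()
                text = data["choices"][0]["text"]
                # Assumption: an OpenAI-style "usage" field may be present.
                # If it is missing, fall back to a whitespace word count,
                # which only approximates the token count.
                usage = data.get("usage") or {}
                n_tokens = usage.get("completion_tokens") or len(text.split())
                result.latencies.append(elapsed)
                result.tokens_counts.append(n_tokens)
            except Exception as e:
                print(f"  ⚠️  Error: {e}")
    return result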
async def main():
    prompts = [
        "Explain machine learning.",
        "What is the capital of France?",
        "Write a haiku about databases.",
        "Compare TCP and UDP.",
        "What is caching?",
        "Describe attention in transformers.",
        "Write a Python decorator.",
        "What is load balancing?",
    ] * 3  # enough for warmup + runs
    config = BenchmarkConfig()
    result = await run_benchmark(config, prompts)
    summary = result.summary()
    print(f"\n📊 Benchmark: {config.model_name}")
    print(f"   Concurrency: {config.concurrency}, Max tokens: {config.max_tokens}")
    print(f"   Runs: {len(result.latencies)} after {config.n_warmup} warm-ups")
    print(f"\n{'Metric':<20} {'Value':<15}")
    print("-" * 35)
    for k, v in summary.items():
        print(f"  {k:<18} {v}")
    # Save to CSV
    with open("benchmark_results.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["run", "latency_s", "tokens", "tokens_per_sec"])
        for i, (lat, tok) in enumerate(zip(result.latencies, result.tokens_counts)):
            w.writerow([i + 1, lat, tok, round(tok / lat, 2)])
    print("\n💾 Saved to benchmark_results.csv")


if __name__ == "__main__":
    asyncio.run(main())
Run it:
python3 /tmp/benchmark-suite.py
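The script stores a `concurrency` setting but sends its requests one at a time. One possible way to add concurrent load is to cap in-flight requests with `asyncio.Semaphore`. This is a sketch only: `run_concurrent`, `send`, and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP call.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def run_concurrent(send: Callable[[str], Awaitable[int]],
                         prompts: List[str],
                         concurrency: int) -> dict:
    """Run `send` over all prompts with at most `concurrency` in flight.

    `send` is a hypothetical coroutine that issues one request and returns
    the number of generated tokens.
    """
    sem = asyncio.Semaphore(concurrency)
    latencies: List[float] = []
    tokens: List[int] = []

    async def worker(prompt: str) -> None:
        async with sem:  # cap the number of in-flight requests
            start = time.perf_counter()
            n = await send(prompt)
            latencies.append(time.perf_counter() - start)
            tokens.append(n)

    wall_start = time.perf_counter()
    await asyncio.gather(*(worker(p) for p in prompts))
    wall = time.perf_counter() - wall_start
    return {
        "requests": len(latencies),
        "wall_s": wall,
        # Aggregate throughput uses wall-clock time, not the sum of
        # latencies, because concurrent requests overlap.
        "tokens_per_sec": sum(tokens) / wall if wall > 0 else 0.0,
    }


async def fake_send(prompt: str) -> int:
    """Stand-in for a real HTTP call: sleeps briefly, returns a token count."""
    await asyncio.sleep(0.01)
    return 10


result = asyncio.run(run_concurrent(fake_send, ["a", "b", "c", "d"], 2))
print(result)
```

Per-request latency is timed inside the semaphore, so it excludes time spent waiting for a slot. Whether queueing time should count depends on which experience you want the benchmark to model.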
Questions for Ayva:
- How to visualise the benchmark results (latency distribution, throughput curves)?
- What's the methodology for comparing two inference servers fairly?
- How to detect if your benchmark results are statistically significant?
Key Takeaways
- Measure what matters for your use case, not the highest possible number
- Always warm up before benchmarking (cold start is misleading)
- Sweep one parameter at a time while holding others constant
- Report percentiles (p50, p95), not just averages; averages hide outliers