Day 16: Continuous Batching & Speculative Decoding
Learning Objectives
- Understand continuous batching and why it's the standard for production serving
- Learn speculative decoding as a lossless speedup technique
- Benchmark llama-server with different batch sizes
Theory (15 min)
Static Batching (the old way)
[Req1]────────────────────────────────────────────────[Done]
[Req2]──────────────────────────────────────────[Done]
[Req3]────────────────────────────────────[Done]
↑ Wait for batch to fill ↑
Every request waits for the batch to fill. Short requests sit idle waiting for long ones.
Continuous Batching (vLLM, TGI)
[Req1:t1 | Req1:t2 | Req1:t3]
[Req2:t1 | Req2:t2 | Req3:t1 | Req1:t4 | Req4:t1 | Req3:t2]
↑ Req1 finishes, slot freed
Req3 and Req4 fill empty slots immediately
Key insight: The batch is rescheduled at every decoding step, so requests of different lengths finish at different times. When one request finishes, a new one fills its slot immediately, with no waiting for the rest of the batch.
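The slot-refill behaviour described above can be compared with static batching in a small simulation. This is a minimal sketch, not how any real server schedules work: it assumes every decode step costs the same, ignores prompt processing, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With these lengths the static schedule needs 16 steps and the continuous one needs 10, because the short requests no longer hold slots idle while the long ones finish.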
Memory: vLLM uses PagedAttention. The KV cache is stored in fixed-size pages (like virtual memory), which largely eliminates fragmentation and is reported to enable 2-4x higher throughput.
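The paging idea can be illustrated with a toy block table. This is a sketch of the concept only, not vLLM's implementation: the class name, method names, and sizes are hypothetical, and it tracks page ownership without storing any actual keys or values.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_table = {}  # seq_id -> list of physical page ids
        self.seq_len = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a page only when needed."""
        n = self.seq_len.get(seq_id, 0)
        if n % self.page_size == 0:  # first token, or the current page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            self.block_table.setdefault(seq_id, []).append(self.free_pages.pop())
        self.seq_len[seq_id] = n + 1

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)


cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 pages
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 page
pages_used_a = len(cache.block_table["a"])
pages_used_b = len(cache.block_table["b"])
free_before = len(cache.free_pages)
cache.release("a")
free_after = len(cache.free_pages)
print(pages_used_a, pages_used_b, free_before, free_after)
```

Because pages are allocated on demand and need not be contiguous, the only wasted space per sequence is the unused tail of its last page, and a finished sequence's pages are reusable by any other sequence at once.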
Speculative Decoding
The idea: Use a fast draft model to propose K tokens, then the target model verifies them all in one forward pass.
Draft model (fast, small): "The capital of France is Par"
Target model (big): "Par" → accept (because "Paris")
"P" → accept
"a" → accept
"r" → reject → draft from here
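Saving: 3 out of 4 drafted tokens accepted, so one target forward pass yields 3 accepted tokens plus its own replacement for the rejected one, instead of a single token. The real speedup is lower than this ratio because the draft model's passes also cost time.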
Lossless: The final distribution is identical to running the target model alone. No quality loss.
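The lossless property comes from the accept/reject rule used during verification. The sketch below shows the standard speculative sampling rule for a single position: accept the draft token with probability min(1, p/q), otherwise resample from the renormalised residual max(0, p - q). The two toy distributions and the function name are assumptions for illustration; real systems apply this to model logits over a full vocabulary.

```python
import random


def speculative_step(p, q, rng):
    """Return (token, accepted) with token distributed according to p.

    p: target distribution, q: draft distribution (dicts token -> probability
    over the same vocabulary).
    """
    tokens = list(p)
    draft = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    # Accept the draft token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # On rejection, resample from the residual max(0, p - q), renormalised.
    residual = [max(0.0, p[t] - q[t]) for t in tokens]
    return rng.choices(tokens, weights=residual)[0], False


# Hypothetical next-token distributions for illustration.
p = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.1}  # target model
q = {"Paris": 0.5, "Lyon": 0.4, "Nice": 0.1}  # draft model

rng = random.Random(0)
n = 20000
counts = {t: 0 for t in p}
for _ in range(n):
    token, _accepted = speculative_step(p, q, rng)
    counts[token] += 1
freqs = {t: counts[t] / n for t in p}
print(freqs)
```

The empirical frequencies should track p rather than q, even though every candidate token was proposed by the draft distribution.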
When it works best: Highly predictable outputs (code, structured generation, common phrases).
When it fails: Creative/open-ended generation. The draft model guesses wrong and most tokens are rejected.
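How much the acceptance rate matters can be estimated with a short calculation. This sketch assumes each drafted token is accepted independently with the same probability alpha, a simplification used in the speculative decoding literature rather than something measured here. The parameter draft_cost (cost of one draft pass relative to one target pass) and both function names are illustrative.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per target pass when k tokens are drafted.

    Counts the accepted draft tokens plus the one token the target
    contributes itself: (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost):
    """Rough speedup over plain decoding, charging k draft passes per round."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)


for alpha in (0.0, 0.5, 0.8, 0.95):
    speedup = estimated_speedup(alpha, k=4, draft_cost=0.25)
    print(f"alpha={alpha:.2f}  estimated speedup={speedup:.2f}x")
```

At alpha = 0 with these settings the estimate is 0.5, a slowdown: every draft pass is wasted work. This matches the failure case above, and it is why a draft model has to be both cheap and usually right to pay off.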
Hands-on (15 min)
Benchmark Batch Sizes with llama-server
#!/usr/bin/env python3
"""batch-benchmark.py - test throughput at different batch sizes."""
import asyncio
import time

import httpx

# Stub - Ayva will expand with:
# - Run llama-server with -b / -ub (logical / physical batch size) parameter variation
# - Concurrent request count sweep
# - Measure: tokens/sec, latency p50/p95/p99
# - Compare with vLLM if available
# - Continuous batching visualisation (timeline chart)
# - Speculative decoding test (draft model + target model)

LLM_URL = "http://localhost:8080/v1/completions"


async def send_request(client, prompt, req_id):
    """Send one completion request and record its latency and token count."""
    start = time.perf_counter()
    try:
        resp = await client.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": 50,
            "temperature": 0.0,
        }, timeout=60)
        resp.raise_for_status()
        elapsed = time.perf_counter() - start
        data = resp.json()
        text = data["choices"][0]["text"]
        # Prefer the server-reported token count when the response includes one;
        # otherwise fall back to a whitespace word count, which is only a rough proxy.
        tokens = (data.get("usage") or {}).get("completion_tokens") or len(text.split())
        return {"id": req_id, "latency": elapsed, "tokens": tokens, "ok": True}
    except Exception as e:
        return {"id": req_id, "latency": time.perf_counter() - start, "error": str(e), "ok": False}


async def concurrent_benchmark(n_requests: int, concurrency: int):
    """Run n_requests with at most `concurrency` in flight and summarise the results."""
    prompts = [f"Tell me a fun fact about number {i}." for i in range(n_requests)]
    async with httpx.AsyncClient(timeout=60) as client:
        sem = asyncio.Semaphore(concurrency)

        async def bounded(prompt, i):
            async with sem:
                return await send_request(client, prompt, i)

        wall_start = time.perf_counter()
        results = await asyncio.gather(*(bounded(p, i) for i, p in enumerate(prompts)))
        wall_time = time.perf_counter() - wall_start

    ok = [r for r in results if r["ok"]]
    if not ok:
        return {"error": "all requests failed"}
    latencies = sorted(r["latency"] for r in ok)

    def pct(q):
        # Simple index-based percentile; clamp so the index stays in range.
        return latencies[min(len(latencies) - 1, int(len(latencies) * q))]

    return {
        "n_requests": n_requests,
        "concurrency": concurrency,
        "avg_latency": sum(latencies) / len(latencies),
        "p50": pct(0.50),
        "p95": pct(0.95),
        "p99": pct(0.99),
        # Throughput is completed work divided by wall-clock time, not by mean latency.
        "throughput": len(ok) / wall_time,
        "tokens_per_sec": sum(r["tokens"] for r in ok) / wall_time,
    }


async def main():
    print("Continuous Batching Benchmark")
    print("=" * 40)
    for concurrency in [1, 2, 4, 8]:
        result = await concurrent_benchmark(20, concurrency)
        print(f"\nConcurrency: {concurrency}")
        if "error" in result:
            print(f"  Error: {result['error']}")
            continue
        print(f"  Avg latency: {result['avg_latency']:.2f}s")
        print(f"  P50: {result['p50']:.2f}s")
        print(f"  P95: {result['p95']:.2f}s")
        print(f"  P99: {result['p99']:.2f}s")
        print(f"  Throughput: {result['throughput']:.1f} req/s")
        print(f"  Tokens/sec: {result['tokens_per_sec']:.1f}")


if __name__ == "__main__":
    asyncio.run(main())
Questions for Ayva:
- How does vLLM's PagedAttention compare to llama.cpp's KV cache management?
- What's the practical benefit of speculative decoding on CPU (draft vs target overhead)?
- How to tune --batch-size and --ubatch-size in llama-server?
Key Takeaways
- Continuous batching eliminates idle GPU time by dynamically managing batch slots
- vLLM's PagedAttention reduces KV cache fragmentation, with reported throughput gains of 2-4x
- Speculative decoding provides lossless speedup for predictable outputs
- Benchmark with your actual traffic pattern; synthetic benchmarks lie