🧠 AI System Design

Day 16: Continuous Batching & Speculative Decoding

📂 Serving & Inference 📖 15 min read Needs expansion

Learning Objectives

  • Understand continuous batching and why it's the standard for production serving
  • Learn speculative decoding as a lossless speedup technique
  • Benchmark llama-server with different batch sizes

Theory (15 min)

Static Batching (the old way)

[Req1]────────────────────────────────────────────────[Done]
      [Req2]──────────────────────────────────────────[Done]
            [Req3]────────────────────────────────────[Done]
            ↑ Wait for batch to fill↑

Every request waits for the batch to fill before decoding starts, and the batch runs until its longest sequence finishes. Short requests sit idle in their slots waiting for long ones.

Continuous Batching (vLLM, TGI)

[Req1─t1│Req1─t2│Req1─t3]
[Req2─t1│Req2─t2│Req3─t1│Req1─t4│Req4─t1│Req3─t2]
        ↑ Req1 finishes, slot freed
        Req3 and Req4 fill empty slots immediately

Key insight: Scheduling happens per decoding iteration rather than per batch. Every active request advances by one token per step, and when one request finishes, a waiting request takes its slot at the next step, with no waiting for the rest of the batch.
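To make the difference concrete, here is a toy simulation that counts decoding steps under both policies. It is a sketch under simplifying assumptions: every step costs the same regardless of how many slots are busy, prefill is ignored, and the function names and request lengths are made up for illustration.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled at the very next iteration."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # tokens still to generate, per active slot
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decoding step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

# Hypothetical workload: two long requests mixed with short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 10
```

In this example the second long request starts at step 3 instead of step 9, because it takes a slot freed by a short request.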

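Memory: vLLM uses PagedAttention. The KV cache is stored in fixed-size blocks (pages, as in virtual memory) that are allocated on demand, which removes most fragmentation and over-reservation and lets more requests fit in a batch. The vLLM authors report 2-4x higher throughput than earlier serving systems at similar latency.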
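The paging idea can be sketched in a few lines. This is an illustration only, not vLLM's actual data structures: `BlockTable`, `contiguous_waste`, `paged_waste`, the block sizes and the sequence lengths are all hypothetical.

```python
def contiguous_waste(seq_lens, max_len):
    """KV slots reserved but unused when every request pre-allocates max_len."""
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size):
    """KV slots unused when memory is handed out in fixed-size blocks on demand."""
    # -(-n // block_size) is integer ceiling division.
    return sum(-(-n // block_size) * block_size - n for n in seq_lens)

class BlockTable:
    """Maps one sequence's logical token positions to physical block ids."""

    def __init__(self, free_pool, block_size):
        self.free_pool = free_pool   # shared list of free physical block ids
        self.block_size = block_size
        self.blocks = []             # physical block ids, in logical order
        self.length = 0              # tokens stored so far

    def append_token(self):
        # Grab a new physical block only when the last one is full.
        if self.length % self.block_size == 0:
            self.blocks.append(self.free_pool.pop())
        self.length += 1

    def physical_slot(self, pos):
        """Return (physical block id, offset inside the block) for a token."""
        return self.blocks[pos // self.block_size], pos % self.block_size

seq_lens = [100, 37, 512, 9]
print(contiguous_waste(seq_lens, max_len=2048))  # 7534
print(paged_waste(seq_lens, block_size=16))      # 30

free_pool = [3, 2, 1, 0]
table = BlockTable(free_pool, block_size=4)
for _ in range(6):
    table.append_token()
print(table.blocks, table.physical_slot(5))      # [0, 1] (1, 1)
```

With paging, the only unused memory is the tail of each sequence's last block, so waste is bounded by one block per sequence.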

Speculative Decoding

The idea: Use a fast draft model to propose K tokens, then the target model verifies them all in one forward pass.

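Prompt: "The capital of France is"
Draft model (fast, small) proposes 4 tokens (shown as single characters for illustration): "P" "a" "r" "k"
Target model (big) checks all 4 in one forward pass:
                    "P"   → accept
                    "a"   → accept
                    "r"   → accept
                    "k"   → reject → target emits "i" instead, draft resumes from here
                    Saving: 3 of 4 draft tokens accepted, so one target pass yields 4 tokens (3 accepted + 1 corrected). Real speedup is below 4x because the draft passes also cost time.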

Lossless: The final distribution is identical to running the target model alone. No quality loss.
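The lossless property comes from the accept/reject rule used during verification. The sketch below applies the usual speculative sampling rule to a single token: accept the draft token with probability min(1, p/q), otherwise resample from the normalized residual max(p - q, 0). The function name and the two toy distributions are assumptions made for illustration, and a real implementation works on logits for K tokens at once.

```python
import random
from collections import Counter

def speculative_token(p_target, q_draft, rng):
    """Draw one token via draft-then-verify; returns (token, accepted).

    p_target and q_draft map each token to its probability and must share
    the same keys (toy stand-ins for the two models' output distributions).
    """
    tokens = list(q_draft)
    # 1. The draft model proposes a token from its own distribution q.
    x = rng.choices(tokens, weights=[q_draft[t] for t in tokens])[0]
    # 2. The target accepts it with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # 3. On rejection, resample from the residual max(p - q, 0), renormalized
    #    by rng.choices. This correction makes the output follow p exactly.
    residual = [max(p_target[t] - q_draft[t], 0.0) for t in tokens]
    return rng.choices(tokens, weights=residual)[0], False

p = {"a": 0.6, "b": 0.3, "c": 0.1}   # target distribution (made up)
q = {"a": 0.3, "b": 0.5, "c": 0.2}   # draft distribution (made up)
rng = random.Random(0)
n = 20000
draws = [speculative_token(p, q, rng) for _ in range(n)]
freq = Counter(tok for tok, _ in draws)
accept_rate = sum(acc for _, acc in draws) / n

print({t: round(freq[t] / n, 3) for t in p})  # should land close to p
print(round(accept_rate, 3))                  # should land close to sum(min(p, q)) = 0.7
```

Even though the draft distribution is quite different from the target, the empirical frequencies track the target. Only the acceptance rate, and therefore the speedup, depends on how good the draft is.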

When it works best: Highly predictable outputs (code, structured generation, common phrases).

When it fails: Creative/open-ended generation. The draft model guesses wrong, most tokens are rejected, and the draft overhead can make decoding slower than running the target alone.
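A rough way to see when speculation pays off is to estimate tokens produced per target pass. The sketch below uses a common simplification: each draft token is accepted independently with probability `alpha`, and one draft pass costs `draft_cost` times one target pass. Both parameters and the function names are assumptions for illustration; measured acceptance rates vary with the prompt and model pair.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Assumes each draft token is accepted independently with probability alpha.
    The sum 1 + alpha + ... + alpha**k counts accepted tokens plus the one
    token the target always contributes.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """Tokens per pass divided by the relative cost of k draft passes + 1 target pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)

for alpha in (0.8, 0.5, 0.2):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.1), 2))
```

Under these assumptions, a high acceptance rate gives a clear gain, while a low one drops the estimate below 1, meaning speculation costs more than it saves.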


Hands-on (15 min)

Benchmark Batch Sizes with llama-server

#!/usr/bin/env python3
"""batch-benchmark.py: test throughput at different request concurrency levels."""
import asyncio
import time

import httpx

# Stub: Ayva will expand with:
# - Run llama-server with -b / -ub (logical / physical batch size) parameter variation
# - Concurrent request count sweep
# - Measure: tokens/sec, latency p50/p95/p99
# - Compare with vLLM if available
# - Continuous batching visualisation (timeline chart)
# - Speculative decoding test (draft model + target model)

LLM_URL = "http://localhost:8080/v1/completions"

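async def send_request(client, prompt, req_id):
    """Send one completion request; return its latency and generated-token count."""
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    try:
        resp = await client.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": 50,
            "temperature": 0.0,
        }, timeout=60)
        resp.raise_for_status()  # count HTTP error responses as failures
        elapsed = time.perf_counter() - start
        data = resp.json()
        text = data["choices"][0]["text"]
        # Prefer the server-reported token count; fall back to a rough word
        # count if the response carries no OpenAI-style "usage" field.
        usage = data.get("usage") or {}
        tokens = usage.get("completion_tokens", len(text.split()))
        return {"id": req_id, "latency": elapsed, "tokens": tokens, "ok": True}
    except Exception as e:
        return {"id": req_id, "latency": time.perf_counter() - start, "error": str(e), "ok": False}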

async def concurrent_benchmark(n_requests: int, concurrency: int):
    prompts = [f"Tell me a fun fact about number {i}." for i in range(n_requests)]

    async with httpx.AsyncClient(timeout=60) as client:
        sem = asyncio.Semaphore(concurrency)  # cap the number of in-flight requests

        async def bounded(prompt, i):
            async with sem:
                return await send_request(client, prompt, i)

        tasks = [bounded(p, i) for i, p in enumerate(prompts)]
        wall_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        wall_time = time.perf_counter() - wall_start

    ok = [r for r in results if r["ok"]]
    if not ok:
        return {"error": "all requests failed"}

    latencies = sorted(r["latency"] for r in ok)
    n = len(latencies)
    return {
        "n_requests": n_requests,
        "concurrency": concurrency,
        "failed": n_requests - n,
        "avg_latency": sum(latencies) / n,
        "p50": latencies[n // 2],
        "p95": latencies[min(n - 1, int(n * 0.95))],
        "p99": latencies[min(n - 1, int(n * 0.99))],
        # Throughput is completed work over wall-clock time, not over mean latency.
        "throughput": n / wall_time,
        "tokens_per_sec": sum(r["tokens"] for r in ok) / wall_time,
    }

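async def main():
    print("Continuous Batching Benchmark")
    print("=" * 40)

    for concurrency in [1, 2, 4, 8]:
        result = await concurrent_benchmark(20, concurrency)
        print(f"\nConcurrency: {concurrency}")
        if "error" in result:
            # Every request failed (for example, the server is not running).
            print(f"  Error: {result['error']}")
            continue
        print(f"  Avg latency: {result['avg_latency']:.2f}s")
        print(f"  P50:         {result['p50']:.2f}s")
        print(f"  P95:         {result['p95']:.2f}s")
        print(f"  Throughput:  {result['throughput']:.1f} req/s")
        if "tokens_per_sec" in result:
            print(f"  Tokens/sec:  {result['tokens_per_sec']:.1f}")

if __name__ == "__main__":
    asyncio.run(main())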

Questions for Ayva:

  • How does vLLM's PagedAttention compare to llama.cpp's KV cache management?
  • What's the practical benefit of speculative decoding on CPU (draft vs target overhead)?
  • How to tune --batch-size and --ubatch-size in llama-server?


Key Takeaways

  • Continuous batching reduces idle GPU time by refilling batch slots at every decoding step
  • vLLM's PagedAttention is reported to give 2-4x higher throughput than earlier serving systems
  • Speculative decoding provides lossless speedup for predictable outputs
  • Benchmark with your actual traffic pattern: synthetic benchmarks lie

References