🧠 AI System Design

Day 2: Sync vs Async Inference

📂 Foundations 📖 15 min read

Learning Objectives

  • Understand the three inference modes: sync, async, and streaming
  • Know the latency/throughput/cost tradeoffs of each
  • Build a simple request queuing proxy that batches inference

Theory (15 min)

Inference isn't one thing: there are three fundamentally different modes, each with different architectural implications.

Mode 1: Synchronous (Request-Response)

Client ──▢ Server ──▢ Model ──▢ Response ──▢ Client
         (blocking, holding connection open)

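Pros: Simple, predictable, easy to debug. Cons: The client blocks until the full response is ready, the server holds one open connection per request, and that connection sits idle while the model runs prefill and generates the response.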

When to use: Low-traffic internal tools, quick demos, latency-insensitive batch work.

On your VPS: curl -X POST localhost:8080/v1/completions (this is what you normally do).
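The same blocking call can be sketched in Python with only the standard library. This is a minimal sketch, assuming the server returns an OpenAI-style body with a `choices[0].text` field; the helper names `build_request` and `complete_sync` and the `max_tokens` default are illustrative, not part of any server API.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
LLM_URL = "http://localhost:8080/v1/completions"

def build_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the POST request. The payload fields are assumptions."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        LLM_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete_sync(prompt: str, timeout: float = 30.0) -> str:
    """Send one request and block until the whole completion has arrived."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        data = json.load(resp)
    # Assumed response shape: {"choices": [{"text": "..."}]}
    return data.get("choices", [{}])[0].get("text", "")
```

Calling `complete_sync("Define batching.")` holds the connection open for the entire generation, which is exactly the blocking behaviour shown in the diagram.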

Mode 2: Asynchronous (Queue + Poll)

Client ──▢ Queue ──▢ Worker ──▢ Result Store ──▢ Client (polls)

Pros: Decouples client from server, handles load spikes via queue depth, workers can batch. Cons: Higher perceived latency (you don't get an answer immediately), needs polling or webhook.

When to use: Heavy batch processing, image gen, data pipelines, anything with variable processing time.
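The queue, worker, and result store from the diagram can be sketched in memory as follows. This is a toy illustration, not a production design: `fake_model` is a hypothetical stand-in for the inference call, and a real deployment would likely use an external queue and store rather than a `queue.Queue` and a dict.

```python
import queue
import threading
import uuid

jobs = queue.Queue()              # the queue
results = {}                      # the result store
results_lock = threading.Lock()   # guards the result store across threads

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the real inference call."""
    return prompt.upper()

def submit(prompt: str) -> str:
    """Enqueue a job and return immediately with an id the client can poll."""
    job_id = uuid.uuid4().hex
    with results_lock:
        results[job_id] = {"status": "pending", "output": None}
    jobs.put((job_id, prompt))
    return job_id

def poll(job_id: str) -> dict:
    """Return a snapshot of the job's current state."""
    with results_lock:
        return dict(results[job_id])

def worker() -> None:
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, prompt = item
        output = fake_model(prompt)
        with results_lock:
            results[job_id] = {"status": "done", "output": output}
        jobs.task_done()
```

A client calls `submit(...)`, gets an id back at once, and then calls `poll(id)` until the status is `"done"`; the worker runs in its own thread, so a burst of submissions only grows the queue instead of blocking callers.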

Mode 3: Streaming (SSE / WebSocket)

Client ──▢ Server ──▢ tokens────tokens────tokens───▢ Client
                     (no wait for full response)

Pros: Time-to-first-token (TTFT), rather than total response time, becomes the latency that matters, so responses feel faster to users. Cons: Harder to implement correctly, consumes resources per connection, harder to cache.

When to use: Chat apps, code completion, any interactive use case.
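On the client side, consuming a stream mostly means parsing server-sent events as they arrive. The sketch below assumes an OpenAI-style stream in which each event is a `data: {json}` line and the stream ends with `data: [DONE]`; check your server's documentation, since this format is an assumption here. The function name `iter_sse_tokens` is illustrative.

```python
import json
from typing import Iterable, Iterator

def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from an SSE stream, one at a time."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(payload)
        # Assumed chunk shape: {"choices": [{"text": "..."}]}
        text = chunk.get("choices", [{}])[0].get("text", "")
        if text:
            yield text
```

Because this is a generator, the caller can print each fragment the moment it is parsed, which is what makes the first token visible long before the response is complete.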

The Critical Tradeoff Matrix

Mode       Latency            Throughput       Cost per req   Complexity
Sync       Low (monolithic)   Low              High at scale  Low
Async      High (queue wait)  High (batching)  Lowest         Medium
Streaming  Lowest (TTFT)      Medium           Medium         Medium-High

Batching: The Superpower

The single most impactful optimisation: process multiple requests together.

Without batching:   [Request1] [Request2] [Request3]  = 3 forward passes
With batching:      [Req1│Req2│Req3]                  = 1 forward pass

Modern inference servers (vLLM, TensorRT-LLM, llama.cpp batch mode) do continuous batching: they add new requests to the running batch as others finish generating.
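To see why continuous batching helps, it can be compared with fixed (static) batches in a toy simulation. This is a simplified model, not how any particular server schedules work: each request needs a given number of decode steps, prefill is ignored, and `batch_size` is the number of slots the model can run per step. Both function names are illustrative.

```python
from collections import deque

def static_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Continuous batching: a waiting request takes over a slot as soon as it frees up."""
    waiting = deque(lengths)
    running: list[int] = []   # remaining decode steps per active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [8, 2, 2, 2]  # decode steps needed per request
print(static_batch_steps(lengths, batch_size=2))      # 10
print(continuous_batch_steps(lengths, batch_size=2))  # 8
```

In the static case the short request paired with the 8-step request leaves its slot empty for 6 steps; with continuous batching the other short requests use that slot instead.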


Hands-on (15 min)

Build a Request Queuing Proxy

This creates a simple async queue that batches incoming requests before sending them to your inference server, then returns results.

#!/usr/bin/env python3
"""async-queue-proxy.py: batches inference requests for throughput."""
import asyncio
import httpx
import json
import time
from collections import deque

LLM_URL = "http://localhost:8080/v1/completions"
BATCH_SIZE = 4
FLUSH_INTERVAL = 2.0  # max wait before forcing a batch

pending = deque()
results = {}

async def batch_worker():
    """Continuously flush pending requests as batches."""
    while True:
        await asyncio.sleep(FLUSH_INTERVAL)
        if not pending:
            continue
        batch = []
        while pending and len(batch) < BATCH_SIZE:
            batch.append(pending.popleft())
        if not batch:
            continue

        prompts = [item["prompt"] for item in batch]
        try:
            async with httpx.AsyncClient(timeout=30) as cli:
                resp = await cli.post(LLM_URL, json={
                    "prompt": f"<BATCH>{json.dumps(prompts)}</BATCH>\n",
                    "max_tokens": 50,
                    "temperature": 0.0,
                })
                data = resp.json()
            # naive: same response for all (real batching needs server support)
            result = data.get("choices", [{}])[0].get("text", "")
        except Exception as e:
            result = f"[error: {e}]"

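        # Store each result, then resolve the future its caller is awaiting.
        # Without the on_complete() call, handle_request() would never return.
        for item in batch:
            results[item["id"]] = result
            item["on_complete"]()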

async def handle_request(prompt: str, req_id: str) -> str:
    future = asyncio.get_running_loop().create_future()

    def on_complete():
        if not future.done():
            future.set_result(results.pop(req_id, ""))

    pending.append({"id": req_id, "prompt": prompt, "on_complete": on_complete})
    return await future

async def main():
    # Keep a reference so the background task is not garbage-collected.
    worker = asyncio.create_task(batch_worker())

    # Simulate concurrent requests
    test_prompts = [
        ("q1", "Explain caching in one sentence."),
        ("q2", "What is a load balancer?"),
        ("q3", "Define batching."),
        ("q4", "What is async processing?"),
        ("q5", "List three DB types."),
    ]
    tasks = [handle_request(p, i) for i, p in test_prompts]
    t0 = time.time()
    results_list = await asyncio.gather(*tasks)
    elapsed = time.time() - t0

    for (req_id, prompt), result in zip(test_prompts, results_list):
        print(f"[{req_id}] {prompt}")
        print(f"  → {result[:100]}\n")
    print(f"Processed {len(test_prompts)} requests in {elapsed:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())

Run it:

cd /tmp
python3 async-queue-proxy.py

Compare with sending 5 requests serially and note the throughput difference.

Question: What happens when BATCH_SIZE=1? When FLUSH_INTERVAL is very short?
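One way to reason about the question is to count flush cycles. The sketch below assumes the worker behaves as in the script above: it sleeps for FLUSH_INTERVAL, then sends at most one batch per loop, and all requests are already queued when it starts. It deliberately ignores the time the inference call itself takes, so the result is a lower bound; the function names are illustrative.

```python
import math

def flush_cycles(num_requests: int, batch_size: int) -> int:
    """Number of worker loops needed to drain the queue, one batch per loop."""
    return math.ceil(num_requests / batch_size)

def min_total_wait(num_requests: int, batch_size: int, flush_interval: float) -> float:
    """Lower bound on the time until the last request is answered (sleep time only)."""
    return flush_cycles(num_requests, batch_size) * flush_interval

print(min_total_wait(5, batch_size=4, flush_interval=2.0))  # 4.0
print(min_total_wait(5, batch_size=1, flush_interval=2.0))  # 10.0
```

With BATCH_SIZE=1 every request pays its own flush interval, so the queue adds delay without any batching benefit. A very short FLUSH_INTERVAL cuts the wait but gives requests less time to accumulate, so batches tend to go out partly empty.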


Key Takeaways

  • Three inference modes: sync (simple), async (scalable), streaming (responsive)
  • Batching is the single highest-leverage optimisation for throughput
  • The right mode depends on your latency SLA and traffic pattern
  • Continuous batching (vLLM, TGI) keeps batch-level throughput while still admitting new requests as soon as a slot frees up

References