Day 2: Sync vs Async Inference
Learning Objectives
- Understand the three inference modes: sync, async, and streaming
- Know the latency/throughput/cost tradeoffs of each
- Build a simple request queuing proxy that batches inference
Theory (15 min)
Inference isn't one thing: there are three fundamentally different modes, each with different architectural implications.
Mode 1: Synchronous (Request-Response)
Client ──▶ Server ──▶ Model ──▶ Response ──▶ Client
(blocking, holding connection open)
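Pros: Simple, predictable, easy to debug. Cons: The client blocks until the full response is ready, every in-flight request holds a server connection open, and nothing comes back while the model works through prefill and decoding.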
When to use: Low-traffic internal tools, quick demos, latency-insensitive batch work.
On your VPS: curl -X POST localhost:8080/v1/completions is the call you normally make.
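The same blocking call can be made from Python. This is a minimal sketch using only the standard library. The endpoint comes from the curl command above; the payload fields (`prompt`, `max_tokens`) and the OpenAI-style response shape are assumptions about your server, and `build_request` and `complete_sync` are illustrative names.

```python
import json
import urllib.request

# Endpoint taken from the curl example; payload fields are assumptions.
LLM_URL = "http://localhost:8080/v1/completions"


def build_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        LLM_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete_sync(prompt: str, timeout: float = 30.0) -> str:
    """Send the request and block until the full completion comes back."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        data = json.load(resp)
    # Assumes an OpenAI-style body: {"choices": [{"text": "..."}]}
    return data["choices"][0]["text"]
```

Calling `complete_sync("Define batching.")` returns only after the whole completion has been generated, which is exactly the blocking behaviour described above.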
Mode 2: Asynchronous (Queue + Poll)
Client ──▶ Queue ──▶ Worker ──▶ Result Store ──▶ Client (polls)
Pros: Decouples client from server, handles load spikes via queue depth, workers can batch. Cons: Higher perceived latency (you don't get an answer immediately), needs polling or webhook.
When to use: Heavy batch processing, image gen, data pipelines, anything with variable processing time.
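The queue, worker, and result store in the diagram can be sketched in a few lines of asyncio. Everything here runs in memory: `fake_model` is a hypothetical stand-in for the real inference call, and a production setup would more likely use an external queue and store. The names `submit`, `worker`, and `poll` are illustrative.

```python
import asyncio
import uuid


async def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the real inference call."""
    await asyncio.sleep(0.05)
    return prompt.upper()


def submit(queue: asyncio.Queue, prompt: str) -> str:
    """Enqueue a job and return its id immediately (no waiting)."""
    job_id = uuid.uuid4().hex
    queue.put_nowait((job_id, prompt))
    return job_id


async def worker(queue: asyncio.Queue, store: dict) -> None:
    """Pull jobs off the queue and write outputs to the result store."""
    while True:
        job_id, prompt = await queue.get()
        store[job_id] = await fake_model(prompt)
        queue.task_done()


async def poll(store: dict, job_id: str, interval: float = 0.01,
               timeout: float = 2.0) -> str:
    """Check the result store until the job is done or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while job_id not in store:
        if loop.time() > deadline:
            raise TimeoutError(job_id)
        await asyncio.sleep(interval)
    return store.pop(job_id)


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    store: dict = {}
    task = asyncio.create_task(worker(queue, store))
    ids = [submit(queue, p) for p in ("first job", "second job")]
    outputs = [await poll(store, job_id) for job_id in ids]
    task.cancel()  # stop the background worker
    return outputs


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The client gets a job id back immediately and pays for that with the polling loop, which is the "higher perceived latency" cost listed above.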
Mode 3: Streaming (SSE / WebSocket)
Client ──▶ Server ──▶ tokens────tokens────tokens───▶ Client
(no wait for full response)
Pros: The first token arrives quickly, so time-to-first-token (TTFT) becomes the latency metric that matters and the app feels faster to users. Cons: Harder to implement correctly, consumes resources per connection, harder to cache.
When to use: Chat apps, code completion, any interactive use case.
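To see why TTFT and total latency are different numbers, here is a small sketch that measures both on a simulated token stream. `fake_token_stream` is a hypothetical stand-in for an SSE or WebSocket stream from a real server, and the 20 ms per-token delay is an arbitrary assumption.

```python
import asyncio
import time
from typing import AsyncIterator, List, Optional, Tuple


async def fake_token_stream(tokens: List[str], delay: float = 0.02) -> AsyncIterator[str]:
    """Hypothetical streaming model: yields one token every `delay` seconds."""
    for tok in tokens:
        await asyncio.sleep(delay)
        yield tok


async def consume(tokens: List[str]) -> Tuple[str, Optional[float], float]:
    """Read the stream and return (text, time to first token, total time)."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    parts = []
    async for tok in fake_token_stream(tokens):
        if ttft is None:
            ttft = time.perf_counter() - start  # first token has arrived
        parts.append(tok)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total


if __name__ == "__main__":
    text, ttft, total = asyncio.run(consume(["Hel", "lo", " wor", "ld"]))
    print(f"{text!r}: TTFT {ttft:.3f}s, total {total:.3f}s")
```

A sync client would only see `total`; a streaming client can start rendering after `ttft`.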
The Critical Tradeoff Matrix
| Mode | Latency | Throughput | Cost per req | Complexity |
|---|---|---|---|---|
| Sync | Low at light load (whole response at once) | Low | High at scale | Low |
| Async | High (queue wait) | High (batching) | Lowest | Medium |
| Streaming | Lowest (TTFT) | Medium | Medium | Medium-High |
Batching: The Superpower
The single most impactful optimisation: process multiple requests together.
Without batching: [Request1] [Request2] [Request3] = 3 forward passes
With batching: [Req1│Req2│Req3] = 1 forward pass
Modern inference servers (vLLM, TensorRT-LLM, llama.cpp batch mode) do continuous batching: they add new requests to the running batch as earlier ones finish generating.
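The difference between fixed batches and continuous batching can be illustrated with a toy step counter. This is a simplified model, not how any particular server schedules work: each request is assumed to need a known number of decode steps (at least 1), one step advances every request in the batch by one token, and prefill cost is ignored.

```python
from collections import deque
from typing import List


def static_batch_steps(lengths: List[int], batch_size: int) -> int:
    """Steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batch_steps(lengths: List[int], batch_size: int) -> int:
    """Steps when a finished request's slot is refilled right away."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining steps for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One decode step: each running request emits a token;
        # requests with no steps left drop out of the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 2, 2]  # one long request, five short ones
    print("no batching:", static_batch_steps(lengths, 1))
    print("static     :", static_batch_steps(lengths, 2))
    print("continuous :", continuous_batch_steps(lengths, 2))
```

With these lengths and a batch size of 2, the unbatched run takes 18 steps, static batching 12, and continuous batching 10, because short requests no longer wait for the long one in their batch to finish.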
Hands-on (15 min)
Build a Request Queuing Proxy
This creates a simple async queue that batches incoming requests before sending them to your inference server, then returns results.
#!/usr/bin/env python3
"""async-queue-proxy.py: batches inference requests for throughput."""
import asyncio
import json
import time
from collections import deque

import httpx

LLM_URL = "http://localhost:8080/v1/completions"
BATCH_SIZE = 4
FLUSH_INTERVAL = 2.0  # max wait before forcing a batch

pending = deque()  # queued requests waiting for the next flush
results = {}       # req_id -> completion text


async def batch_worker():
    """Continuously flush pending requests as batches."""
    while True:
        await asyncio.sleep(FLUSH_INTERVAL)
        if not pending:
            continue
        # Take at most BATCH_SIZE requests off the front of the queue.
        batch = []
        while pending and len(batch) < BATCH_SIZE:
            batch.append(pending.popleft())
        prompts = [item["prompt"] for item in batch]
        try:
            async with httpx.AsyncClient(timeout=30) as cli:
                resp = await cli.post(LLM_URL, json={
                    "prompt": f"<BATCH>{json.dumps(prompts)}</BATCH>\n",
                    "max_tokens": 50,
                    "temperature": 0.0,
                })
                data = resp.json()
            # naive: same response for all (real batching needs server support)
            result = data.get("choices", [{}])[0].get("text", "")
        except Exception as e:
            result = f"[error: {e}]"
        for item in batch:
            results[item["id"]] = result
            item["on_complete"]()  # wake the caller waiting on this request


async def handle_request(prompt: str, req_id: str) -> str:
    """Queue one prompt and wait until the worker has stored its result."""
    future = asyncio.get_running_loop().create_future()

    def on_complete():
        if not future.done():
            future.set_result(results.pop(req_id, ""))

    pending.append({"id": req_id, "prompt": prompt, "on_complete": on_complete})
    return await future


async def main():
    worker = asyncio.create_task(batch_worker())
    # Simulate concurrent requests
    test_prompts = [
        ("q1", "Explain caching in one sentence."),
        ("q2", "What is a load balancer?"),
        ("q3", "Define batching."),
        ("q4", "What is async processing?"),
        ("q5", "List three DB types."),
    ]
    tasks = [handle_request(p, i) for i, p in test_prompts]
    t0 = time.time()
    results_list = await asyncio.gather(*tasks)
    elapsed = time.time() - t0
    for (req_id, prompt), result in zip(test_prompts, results_list):
        print(f"[{req_id}] {prompt}")
        print(f"  → {result[:100]}\n")
    print(f"Processed {len(test_prompts)} requests in {elapsed:.2f}s")
    worker.cancel()  # stop the background loop before exiting


if __name__ == "__main__":
    asyncio.run(main())
Run it:
cd /tmp
python3 async-queue-proxy.py
Compare with sending 5 requests serially and note the throughput difference.
Question: What happens when BATCH_SIZE=1? When FLUSH_INTERVAL is very short?
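One way to start reasoning about the question is a back-of-the-envelope calculation. The sketch below assumes all requests are queued at t=0 and that the backend call itself takes no time, which is unrealistic but isolates the effect of the two settings. It mirrors the proxy's worker loop: sleep for the flush interval, then send at most one batch. `flush_plan` is an illustrative helper, not part of the script.

```python
import math
from typing import Tuple


def flush_plan(n_requests: int, batch_size: int, flush_interval: float) -> Tuple[int, float]:
    """Return (backend calls, seconds until the last batch is dispatched).

    Assumes every request is already queued at t=0 and ignores the time
    spent in the backend call itself.
    """
    calls = math.ceil(n_requests / batch_size)
    return calls, calls * flush_interval


if __name__ == "__main__":
    for batch_size, interval in [(4, 2.0), (1, 2.0), (4, 0.01)]:
        calls, wait = flush_plan(5, batch_size, interval)
        print(f"BATCH_SIZE={batch_size} FLUSH_INTERVAL={interval}: "
              f"{calls} calls, last batch sent after {wait:.2f}s")
```

Under these assumptions a batch size of 1 turns the proxy into a slow serial sender, while a very short interval cuts queue wait but gives requests that arrive over time less chance to share a batch.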
Key Takeaways
- Three inference modes: sync (simple), async (scalable), streaming (responsive)
- Batching is the single highest-leverage optimisation for throughput
- The right mode depends on your latency SLA and traffic pattern
- Continuous batching (vLLM, TGI) keeps batched throughput without making each request wait for a full batch to form