Day 17: Prefill vs Decode
Learning Objectives
- Understand the two-phase nature of transformer inference
- Learn why prefill is compute-bound and decode is memory-bound
- Profile your inference server to identify the bottleneck
Theory (15 min)
The Two Phases
Every transformer inference call has two distinct phases:
Phase 1: Prefill
Input: "The capital of France is"
↓
Compute attention for ALL input tokens in parallel
↓
Output: KV cache for all input tokens + first output token logits
- Compute-bound: Fully utilises compute units (matrix multiply on input sequence)
- Parallel: Processes all input tokens simultaneously
- Duration: Depends on input length (O(L²) for full attention)
Phase 2: Decode
KV cache already populated for input.
Generate: "Par" → compute attention for this ONE token → append to KV cache
Generate: "is" → one token at a time
Generate: "the" → ...
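- Memory-bound: Limited by memory bandwidth (every step streams the model weights and the whole KV cache from memory to the compute units, for only one token's worth of arithmetic)
- Sequential: One token at a time; each token depends on the previous one, so a single sequence cannot be parallelised across output tokens
- Duration: Depends on output length (attention cost per token is O(L) in the current context length, so per-token latency is roughly steady for short contexts and rises as the context grows)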
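The two phases can be illustrated with a toy single-head attention layer in NumPy. This is a minimal sketch, not how llama-server is implemented: the random weight matrices `w_q`, `w_k`, `w_v`, the embedding size `d`, and the helper names `prefill` and `decode_step` are all assumptions made for illustration. It shows prefill handling every prompt position in one matrix multiply, and a decode step handling one new position while reusing the cached keys and values.

```python
import numpy as np

def attention(q, k, v):
    """Causal scaled dot-product attention.

    q: (n_q, d); k and v: (n_kv, d). The n_q queries are assumed to be
    the LAST n_q positions of the n_kv-long sequence.
    """
    n_q, d = q.shape
    n_kv = k.shape[0]
    scores = q @ k.T / np.sqrt(d)
    # Query i sits at absolute position (n_kv - n_q + i) and may only
    # attend to keys at positions <= its own.
    q_pos = np.arange(n_kv - n_q, n_kv)[:, None]
    k_pos = np.arange(n_kv)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8  # toy embedding size (assumption)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

def prefill(x):
    """Process all prompt embeddings x (L, d) at once; return outputs and KV cache."""
    k, v = x @ w_k, x @ w_v
    return attention(x @ w_q, k, v), (k, v)

def decode_step(x_new, cache):
    """Process ONE new embedding x_new (d,), reusing and extending the cache."""
    k_cache, v_cache = cache
    k = np.vstack([k_cache, x_new @ w_k])
    v = np.vstack([v_cache, x_new @ w_v])
    out = attention((x_new @ w_q)[None, :], k, v)
    return out[0], (k, v)

prompt = rng.standard_normal((5, d))  # 5 stand-in "prompt tokens"
next_tok = rng.standard_normal(d)     # 1 stand-in "generated token"

_, cache = prefill(prompt)
step_out, cache = decode_step(next_tok, cache)

# Reference: recompute everything from scratch over all 6 positions.
full_out, _ = prefill(np.vstack([prompt, next_tok]))
print(np.allclose(step_out, full_out[-1]))
```

The decode step produces the same result as recomputing the full sequence, but it only projects one new token and reads the cache for the rest, which is why its cost is dominated by memory traffic rather than arithmetic.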
The Ratio
| Metric | Prefill | Decode |
|---|---|---|
| Compute utilisation | High (90%+) | Low (10-30%) |
| Memory bandwidth utilisation | Medium | High (95%+) |
| Latency per token | Low (amortised over all input tokens) | High (roughly steady, rising with context length) |
| Parallelism | Full | Sequential |
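The utilisation gap in the table can be reasoned about with a back-of-envelope roofline estimate. The sketch below assumes roughly 2 FLOPs per parameter per token and that the weights are read from memory once per forward pass; it ignores attention FLOPs and KV cache traffic. The hardware and model numbers (7e9 parameters, fp16, 100 TFLOP/s, 1 TB/s) are hypothetical placeholders, not measurements.

```python
def phase_lower_bounds(n_params, bytes_per_param, n_tokens, peak_flops, mem_bandwidth):
    """Return (compute_time, memory_time) in seconds for one forward pass.

    Assumptions: ~2 FLOPs per parameter per token; weights are read once
    per pass; attention and KV cache traffic are ignored.
    """
    compute_time = 2 * n_params * n_tokens / peak_flops
    memory_time = n_params * bytes_per_param / mem_bandwidth
    return compute_time, memory_time

def bound_by(compute_time, memory_time):
    """Name the resource that sets the lower bound on latency."""
    return "compute" if compute_time > memory_time else "memory"

# Hypothetical model and accelerator (placeholders, not measurements).
N_PARAMS = 7e9
BYTES_PER_PARAM = 2      # fp16
PEAK_FLOPS = 100e12      # FLOP/s
MEM_BANDWIDTH = 1e12     # bytes/s

prefill = phase_lower_bounds(N_PARAMS, BYTES_PER_PARAM, 2048, PEAK_FLOPS, MEM_BANDWIDTH)
decode = phase_lower_bounds(N_PARAMS, BYTES_PER_PARAM, 1, PEAK_FLOPS, MEM_BANDWIDTH)

# Number of tokens per pass at which compute time equals memory time.
crossover_tokens = PEAK_FLOPS * BYTES_PER_PARAM / (2 * MEM_BANDWIDTH)

print("prefill (2048 tokens):", bound_by(*prefill))
print("decode (1 token):", bound_by(*decode))
print("crossover tokens per pass:", crossover_tokens)
```

Under these assumptions a pass needs about 100 tokens before arithmetic takes longer than reading the weights, so a single-token decode step sits far on the memory-bound side while a long prefill sits on the compute-bound side.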
Key insight:
- Long input + short output = dominated by prefill
- Short input + long output = dominated by decode
- This determines your bottleneck and optimisation strategy
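The rule of thumb above can be turned into simple arithmetic once you have measured throughput for each phase. The sketch below is illustrative only: `phase_split` is a hypothetical helper, and the rates of 1000 tok/s for prefill and 25 tok/s for decode are made-up placeholders that you would replace with your own measurements.

```python
def phase_split(input_tokens, output_tokens, prefill_tok_per_s, decode_tok_per_s):
    """Estimate time spent in each phase from measured per-phase throughput."""
    prefill_s = input_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    total = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "prefill_share": prefill_s / total,
        "dominant": "prefill" if prefill_s > decode_s else "decode",
    }

# Placeholder rates (assumptions, not benchmarks).
PREFILL_RATE = 1000.0  # prompt tokens processed per second
DECODE_RATE = 25.0     # output tokens generated per second

rag = phase_split(8000, 100, PREFILL_RATE, DECODE_RATE)   # long input, short output
chat = phase_split(200, 500, PREFILL_RATE, DECODE_RATE)   # short input, long output

print("RAG-style request:", rag["dominant"], round(rag["prefill_share"], 2))
print("Chat-style request:", chat["dominant"], round(chat["prefill_share"], 2))
```

With these placeholder rates the long-context request spends about two thirds of its time in prefill, while the chat-style request spends almost all of its time in decode.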
Implications for Architecture
| If dominated by… | Optimise by… |
|---|---|
| Prefill (long RAG context) | KV cache reuse, prefix caching, input length limits |
| Decode (long generation) | Speculative decoding, smaller model, quantisation |
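To make the prefix caching row concrete, here is a deliberately naive sketch of the idea: keep the KV state for prompts you have already processed, and for a new prompt only prefill the tokens that come after the longest cached prefix. `PrefixCache` is a hypothetical class written for this lesson, the KV state is an opaque placeholder string, and the token ids are invented. Production servers use more refined data structures than this linear scan.

```python
class PrefixCache:
    """Toy prefix cache mapping token-id prefixes to an opaque KV state."""

    def __init__(self):
        self._store = {}  # tuple of token ids -> KV state (placeholder object)

    def put(self, tokens, kv_state):
        """Remember the KV state produced by prefilling exactly `tokens`."""
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (n_matched, kv_state) for the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            state = self._store.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

cache = PrefixCache()
system_prompt = [1, 2, 3, 4]               # invented token ids for a shared system prompt
cache.put(system_prompt, "kv-for-system")  # placeholder for real KV tensors

request = system_prompt + [9, 9]           # same system prompt plus a new user turn
matched, state = cache.longest_prefix(request)
to_prefill = request[matched:]             # only these tokens still need prefill

print(matched, state, to_prefill)
```

The saving scales with the length of the shared prefix, which is why this technique targets prefill-dominated workloads such as long shared system prompts or repeated RAG context.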
Hands-on (15 min)
Profile llama-server Phases
#!/usr/bin/env python3
"""profile-phases.py: measure prefill vs decode time."""
import time

import httpx

# Stub: Ayva will expand with:
# - Instrument llama-server with timing hooks (if available)
# - Vary input length systematically (128, 512, 2048 tokens)
# - Vary output length systematically (16, 64, 256 tokens)
# - Plot: prefill time vs input length, decode time vs output length
# - Compare CPU vs GPU (if GPU available)
# - Identify your system's bottleneck curve

LLM_URL = "http://localhost:8080/v1/completions"


def test_phase(input_tokens: int, output_tokens: int):
    """Measure total time for a given input/output length."""
    # Build a prompt of `input_tokens` words. Words are only a rough proxy
    # for tokens: the tokenizer may split one word into several tokens.
    words = ["architecture", "system", "design", "cache", "pipeline",
             "model", "inference", "latency", "batch", "throughput"]
    prompt = " ".join(words[i % len(words)] for i in range(input_tokens))

    # perf_counter is monotonic, so it is safer than time.time for intervals.
    start = time.perf_counter()
    resp = httpx.post(LLM_URL, json={
        "prompt": prompt,
        "max_tokens": output_tokens,
        "temperature": 0.0,
    }, timeout=120)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below

    data = resp.json()
    text = data["choices"][0]["text"]
    # Prefer the server-reported token count when the response carries an
    # OpenAI-style "usage" object; otherwise fall back to a word count.
    usage = data.get("usage") or {}
    actual_output = usage.get("completion_tokens", len(text.split()))
    # elapsed covers prefill AND decode, so this is end-to-end throughput,
    # not the pure decode rate.
    tokens_per_sec = actual_output / elapsed if elapsed > 0 else 0
    return {
        "input_tokens": input_tokens,
        "output_tokens": actual_output,
        "total_time": round(elapsed, 3),
        "tokens_per_sec": round(tokens_per_sec, 2),
    }

# Profile matrix
profile = []
for input_len in [10, 50, 100, 200]:
    for output_len in [20, 50, 100]:
        print(f"Testing input={input_len}, output={output_len}...")
        result = test_phase(input_len, output_len)
        profile.append(result)
        time.sleep(0.5)

print("\n📊 Phase Profile")
print(f"{'Input':>8} {'Output':>8} {'Time(s)':>10} {'tok/s':>10}")
print("-" * 40)
for r in profile:
    print(f"{r['input_tokens']:>8} {r['output_tokens']:>8} {r['total_time']:>10.3f} {r['tokens_per_sec']:>10.2f}")
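The script above only measures total request time, so it cannot separate the two phases. One common way to approximate the split is to stream the response: time to first token (TTFT) is roughly the prefill cost, and the spacing of later chunks gives the decode rate. The sketch below assumes the server accepts `"stream": true` on the same endpoint and emits OpenAI-style `data: ...` lines ending with `data: [DONE]`, and that each streamed chunk carries about one token. `summarise_stream` and `profile_streaming` are hypothetical helper names; verify the response format against your own server before relying on the numbers.

```python
import json
import time

def summarise_stream(start, chunk_times):
    """Split a streamed response into prefill and decode estimates.

    start: time the request was sent.
    chunk_times: arrival time of each text chunk, on the same clock.
    Assumes each chunk is roughly one token.
    """
    if not chunk_times:
        return None
    ttft = chunk_times[0] - start                  # approximates prefill time
    decode_s = chunk_times[-1] - chunk_times[0]    # time spent on later tokens
    n_intervals = len(chunk_times) - 1
    decode_tok_per_s = n_intervals / decode_s if decode_s > 0 else 0.0
    return {"ttft_s": ttft, "decode_s": decode_s, "decode_tok_per_s": decode_tok_per_s}

def profile_streaming(prompt, max_tokens, url="http://localhost:8080/v1/completions"):
    """Send one streaming request and return summarise_stream() for it."""
    import httpx  # imported here so the timing helper works without httpx installed

    chunk_times = []
    start = time.perf_counter()
    body = {"prompt": prompt, "max_tokens": max_tokens,
            "temperature": 0.0, "stream": True}
    with httpx.stream("POST", url, json=body, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue  # skip blank keep-alive lines between events
            payload = line[len("data: "):]
            if payload.strip() == "[DONE]":
                break
            if json.loads(payload)["choices"][0].get("text"):
                chunk_times.append(time.perf_counter())
    return summarise_stream(start, chunk_times)

# Offline example with made-up arrival times (seconds after the request).
example = summarise_stream(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(example)
```

To use it against a running server, call `profile_streaming(prompt, 64)` for several prompt lengths: TTFT should grow with input length while the decode rate stays comparatively flat.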
Questions for Ayva:
- How does context length affect the prefill/decode ratio for llama.cpp?
- What's the practical max context length before prefill dominates?
- How do flash attention and paged attention change the prefill/decode balance?
Key Takeaways
- Transformers have two fundamentally different phases: prefill (compute-bound) and decode (memory-bound)
- Prefill typically dominates for RAG-heavy workloads with long contexts; decode typically dominates for chat with short prompts and long replies
- Your bottleneck determines which optimisation to apply
- Profiling is essential — don't guess which phase is dominant