🧠 AI System Design

Day 17: Prefill vs Decode


Learning Objectives

  • Understand the two-phase nature of transformer inference
  • Learn why prefill is compute-bound and decode is memory-bound
  • Profile your inference server to identify the bottleneck

Theory (15 min)

The Two Phases

Every transformer inference call has two distinct phases:

Phase 1: Prefill

Input: "The capital of France is"
       ↓
Compute attention for ALL input tokens in parallel
       ↓
Output: KV cache for all input tokens + first output token logits
  • Compute-bound: Keeps the compute units busy (large matrix multiplies over the whole input sequence)
  • Parallel: Processes all input tokens simultaneously
  • Duration: Depends on input length L (attention cost grows as O(L²) for full attention)

Phase 2: Decode

KV cache already populated for input.
Generate: "Par"  → compute attention for this ONE token → append to KV cache
Generate: "is"   → one token at a time
Generate: "the"  → ...
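  • Memory-bound: Starved by memory bandwidth (every step streams the model weights and the KV cache from memory to the compute units, for only one token's worth of arithmetic)
  • Sequential: One token at a time; each token depends on the previous one, so a single sequence cannot be parallelised across output tokens
  • Duration: Depends on output length (each new token attends over the L tokens already cached, so per-token cost is O(L): roughly steady, creeping up as the context grows)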
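The two phases can be sketched with a toy single-head attention in plain Python. This is an illustration only, under loud assumptions: queries, keys and values are all the raw token vector (no learned projections), and the names `attend`, `prefill` and `decode_step` are hypothetical rather than taken from any library. It shows that prefill fills the KV cache for the whole prompt, and that one decode step over the cache gives the same result as recomputing the full sequence.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query vector over the cached keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def prefill(embeddings):
    """Process the whole prompt: build the KV cache and every token's output.

    The loop is written token by token for clarity; a real kernel computes
    all positions at once as batched matrix multiplies.
    """
    cache = {"k": [], "v": []}
    outputs = []
    for x in embeddings:
        # Toy assumption: key = value = query = the token vector itself.
        cache["k"].append(x)
        cache["v"].append(x)
        outputs.append(attend(x, cache["k"], cache["v"]))  # causal: sees 0..i
    return cache, outputs

def decode_step(cache, x):
    """Process ONE new token: append to the cache, attend over all of it."""
    cache["k"].append(x)
    cache["v"].append(x)
    return attend(x, cache["k"], cache["v"])

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache, _ = prefill(prompt)                 # phase 1: whole prompt
step_out = decode_step(cache, [0.5, -0.5])  # phase 2: one token
```

Note that `decode_step` touches every cached entry to produce a single output vector, which is why decode work is dominated by reading memory rather than by arithmetic.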

The Ratio

Metric                       | Prefill     | Decode
-----------------------------|-------------|--------------
Compute utilisation          | High (90%+) | Low (10-30%)
Memory bandwidth utilisation | Medium      | High (95%+)
Latency per token            | Low         | Fixed (high)
Parallelism                  | Full        | Sequential
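A back-of-envelope way to see where the utilisation gap comes from is arithmetic intensity: FLOPs performed per byte read from memory. The sketch below uses the common approximation of about 2 FLOPs per parameter per token and counts only weight traffic (KV cache and activation traffic are ignored). The hardware figures are hypothetical placeholders, not measurements of any real device, and the function names are illustrative.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights are read
    once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param

def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare against the roofline ridge point (FLOPs the chip can do per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory"

# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

prefill_ai = arithmetic_intensity(tokens_per_pass=512, bytes_per_param=2)  # fp16
decode_ai = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2)

print(prefill_ai, bound_by(prefill_ai, PEAK_FLOPS, MEM_BANDWIDTH))  # 512.0 compute
print(decode_ai, bound_by(decode_ai, PEAK_FLOPS, MEM_BANDWIDTH))    # 1.0 memory
```

Under these assumptions a 512-token prefill reuses each weight 512 times per read, while a single-sequence decode step uses each weight once. The same model also suggests why batching several sequences per decode step raises compute utilisation.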

Key insight:

  • Long input + short output = dominated by prefill
  • Short input + long output = dominated by decode
  • This determines your bottleneck and optimisation strategy
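The rule of thumb above can be turned into a rough estimate. The sketch below assumes a simple linear model (prefill time proportional to input tokens, decode time proportional to output tokens) and ignores the quadratic attention term at long contexts. The throughput numbers are hypothetical; replace them with rates measured on your own server. `phase_split` is an illustrative name.

```python
def phase_split(input_tokens: int, output_tokens: int,
                prefill_tok_per_s: float, decode_tok_per_s: float) -> dict:
    """Estimate time spent in each phase and which one dominates."""
    prefill_s = input_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    total = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "prefill_share": prefill_s / total if total > 0 else 0.0,
        "dominant": "prefill" if prefill_s > decode_s else "decode",
    }

# Hypothetical rates: 1000 tok/s prefill, 20 tok/s decode.
rag_like = phase_split(8000, 50, prefill_tok_per_s=1000, decode_tok_per_s=20)
chat_like = phase_split(100, 500, prefill_tok_per_s=1000, decode_tok_per_s=20)

print(rag_like["dominant"])   # prefill
print(chat_like["dominant"])  # decode
```

Because decode is so much slower per token, the crossover point sits at an input/output ratio equal to the ratio of the two rates (50:1 with the placeholder numbers here).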

Implications for Architecture

If dominated by…           | Optimise by…
---------------------------|----------------------------------------------------
Prefill (long RAG context) | KV cache reuse, prefix caching, input length limits
Decode (long generation)   | Speculative decoding, smaller model, quantisation
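To make prefix caching concrete, here is a minimal sketch of the bookkeeping involved: if a new prompt shares a leading run of tokens with one already processed, only the remainder needs prefill. The `PrefixCache` class is hypothetical and tracks token IDs only; a real server stores the KV tensors for those tokens and usually matches prefixes with a more efficient structure than a linear scan.

```python
def shared_prefix_len(a, b) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Toy prefix cache: remembers earlier prompts, reports prefill work saved."""

    def __init__(self):
        self._prompts = []

    def tokens_to_prefill(self, tokens) -> int:
        best = max((shared_prefix_len(tokens, p) for p in self._prompts),
                   default=0)
        # Keep at least the last token uncached: it still has to be
        # processed to produce logits for the first output token.
        reused = min(best, max(len(tokens) - 1, 0))
        self._prompts.append(list(tokens))
        return len(tokens) - reused

cache = PrefixCache()
system_prompt = list(range(100))              # stand-in for 100 shared token IDs
first = cache.tokens_to_prefill(system_prompt + [1000, 1001])   # 102: cold cache
second = cache.tokens_to_prefill(system_prompt + [2000])        # 1: prefix reused
```

This is why a long, stable system prompt or shared RAG preamble is cheap after the first request, provided it appears at the very start of the prompt.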

Hands-on (15 min)

Profile llama-server Phases

#!/usr/bin/env python3
"""profile-phases.py — measure prefill vs decode time."""
import time
import httpx

# Stub — Ayva will expand with:
# - Instrument llama-server with timing hooks (if available)
# - Vary input length systematically (128, 512, 2048 tokens)
# - Vary output length systematically (16, 64, 256 tokens)
# - Plot: prefill time vs input length, decode time vs output length
# - Compare CPU vs GPU (if GPU available)
# - Identify your system's bottleneck curve

LLM_URL = "http://localhost:8080/v1/completions"

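def test_phase(input_tokens: int, output_tokens: int):
    """Measure end-to-end time for a given input/output length.

    The total covers prefill and decode together; it does not separate them.
    """
    # Build a prompt of `input_tokens` words. Words only approximate tokens:
    # the tokenizer may split one word into several tokens.
    words = ["architecture", "system", "design", "cache", "pipeline",
             "model", "inference", "latency", "batch", "throughput"]
    prompt = " ".join(words[i % len(words)] for i in range(input_tokens))

    # perf_counter is monotonic, so it is safer for intervals than time.time()
    start = time.perf_counter()
    resp = httpx.post(LLM_URL, json={
        "prompt": prompt,
        "max_tokens": output_tokens,
        "temperature": 0.0,
    }, timeout=120)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    body = resp.json()
    text = body["choices"][0]["text"]

    # Prefer the server's own token counts if the response carries an
    # OpenAI-style `usage` object; otherwise fall back to word counts.
    usage = body.get("usage") or {}
    actual_input = usage.get("prompt_tokens", input_tokens)
    actual_output = usage.get("completion_tokens", len(text.split()))

    # End-to-end rate: includes prefill time, so it understates decode speed.
    tokens_per_sec = actual_output / elapsed if elapsed > 0 else 0
    return {
        "input_tokens": actual_input,
        "output_tokens": actual_output,
        "total_time": round(elapsed, 3),
        "tokens_per_sec": round(tokens_per_sec, 2),
    }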

# Profile matrix
profile = []
for input_len in [10, 50, 100, 200]:
    for output_len in [20, 50, 100]:
        print(f"Testing input={input_len}, output={output_len}...")
        result = test_phase(input_len, output_len)
        profile.append(result)
        time.sleep(0.5)

print("\n📊 Phase Profile")
print(f"{'Input':>8} {'Output':>8} {'Time(s)':>10} {'tok/s':>10}")
print("-" * 40)
for r in profile:
    print(f"{r['input_tokens']:>8} {r['output_tokens']:>8} {r['total_time']:>10.3f} {r['tokens_per_sec']:>10.2f}")
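The script above reports only total time, so it cannot say which phase is responsible. One way to approximate the split from the client side is to stream the response: time to first token (TTFT) is roughly prefill plus network and queueing overhead, and the gaps between later chunks approximate decode. The sketch below assumes the server accepts `"stream": True` on the same endpoint and replies with OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`), and that each text chunk is about one token. The function names are hypothetical.

```python
import json
import time

def split_stream_timing(start: float, arrivals: list) -> dict:
    """Split chunk arrival timestamps into an approximate prefill/decode view.

    start:    timestamp taken just before the request was sent
    arrivals: timestamp of each non-empty text chunk, in order
    """
    if not arrivals:
        raise ValueError("no chunks received")
    decode_s = arrivals[-1] - arrivals[0]
    steps = len(arrivals) - 1  # chunks produced after the first one
    return {
        "ttft_s": arrivals[0] - start,  # ~prefill (+ network/queueing)
        "decode_s": decode_s,
        "decode_tok_per_s": steps / decode_s if decode_s > 0 else 0.0,
    }

def stream_completion(url: str, prompt: str, max_tokens: int) -> dict:
    """Send one streaming request and time its chunks (needs a running server)."""
    import httpx  # imported here so the timing helper works without httpx

    arrivals = []
    start = time.perf_counter()
    request = {"prompt": prompt, "max_tokens": max_tokens,
               "temperature": 0.0, "stream": True}
    with httpx.stream("POST", url, json=request, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue  # skip blank keep-alive lines between events
            payload = line[len("data: "):].strip()
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            if chunk["choices"][0].get("text"):
                arrivals.append(time.perf_counter())
    return split_stream_timing(start, arrivals)
```

Calling `stream_completion(LLM_URL, prompt, 64)` for several prompt lengths and plotting `ttft_s` against input length gives a first prefill curve, while `decode_tok_per_s` should stay comparatively flat if decode is the steady, memory-bound phase described above.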

Questions for Ayva:

  • How does context length affect the prefill/decode ratio for llama.cpp?
  • What's the practical max context length before prefill dominates?
  • How do flash attention and paged attention change the prefill/decode balance?


Key Takeaways

  • Transformers have two fundamentally different phases: prefill (compute-bound) and decode (memory-bound)
  • Prefill dominates for RAG-heavy workloads; decode dominates for chat
  • Your bottleneck determines which optimisation to apply
  • Profiling is essential — don't guess which phase is dominant

References