🧠 AI System Design

Day 15: Inference Optimization

📂 Serving & Inference 📖 15 min read

Learning Objectives

  • Understand why quantization is the single most impactful inference optimization
  • Compare GGUF (CPU-focused) vs GPTQ/AWQ (GPU-focused) formats
  • Benchmark q4_K_M vs q8_0 on your own hardware

Theory (15 min)

Why Quantize?

Model weights are typically stored as fp16 or bf16 (16-bit floats), so each weight uses 2 bytes.

A 7B parameter model:

  • fp16: 7B × 2 bytes = 14 GB, which won't fit on a 12 GB GPU
  • q8 (8-bit): 7B × 1 byte = 7 GB, which fits on most GPUs
  • q4 (4-bit): 7B × 0.5 bytes = 3.5 GB, which fits on low-end GPUs and runs on CPU

Quantization: store weights at lower precision, accept small quality loss.
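
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and the result covers weights only (no KV cache or runtime overhead).

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts weights only; KV cache and runtime overhead are ignored.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Same 7B example as in the text.
    for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
        print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantized files are slightly larger than this estimate because each block of weights also stores scale metadata.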

Quantization Formats Compared

Format    Bits/weight   Quality (MMLU, vs fp16)   Speed (vs fp16)   Memory (7B)   Hardware
fp16      16            100% (baseline)           Reference         14GB          GPU only
q8_0      8             99.5%                     ~1.5x             7GB           GPU + CPU
q4_K_M    4.5           ~97%                      ~2-3x             4GB           CPU ideal
q3_K_S    3.5           ~95%                      ~3-4x             3GB           CPU, mobile
q2_K      2.5           ~88%                      ~4x               2GB           Extreme compression

q4_K_M is the sweet spot — 97% quality at 1/4 the memory.

How Quantization Works

Original fp16 weights:
[1.2345, -0.7891, 3.4567, ...]

q4 quantized:
[1.25, -0.75, 3.50, ...]  → rounded to nearest 0.25 grid point

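Loss: the rounding error. For most weights, this is negligible. Note that the fixed 0.25 grid is a simplification: a 4-bit value has only 16 levels, so real formats store a small integer per weight plus a shared scale per block of weights, and the grid spacing adapts to each block.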
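
To make the rounding step concrete, here is a minimal sketch of symmetric block-wise quantization in plain Python. It is an illustration under stated assumptions, not the GGUF algorithm: the function names are hypothetical, and real formats such as q4_K_M differ in block size, scale encoding, and how the integers are packed.

```python
def quantize_block(weights, bits=4):
    """Quantize one block of floats to signed integers sharing one scale.

    Returns (scale, ints). With bits=4 the integers lie in [-8, 7].
    """
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point, then clamp to the representable range.
    ints = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return scale, ints


def dequantize_block(scale, ints):
    """Recover approximate floats from the stored scale and integers."""
    return [scale * v for v in ints]


if __name__ == "__main__":
    block = [1.2345, -0.7891, 3.4567, 0.0123]
    scale, ints = quantize_block(block)
    restored = dequantize_block(scale, ints)
    print(f"scale = {scale:.4f}, ints = {ints}")
    for w, r in zip(block, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.4f})")
```

The rounding error per weight is at most half the scale, which is why blocks containing one large outlier lose the most precision.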

What You Lose (and Gain)

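  • Loss: roughly 2-3 percentage points on academic benchmarks (for example, MMLU dropping from 68% to 65%)
  • Gain: 4x memory reduction, 2-3x speedup (token generation is usually limited by memory bandwidth, so smaller weights mean fewer bytes to read per token, and more of the model fits in cache)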
  • Real-world impact: Often undetectable for chat/QA. Noticeable for maths/code.
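
The speedup can be estimated with a back-of-envelope calculation. The sketch below assumes single-stream generation where every weight is read from memory once per token, so memory bandwidth divided by model size gives a ceiling on tokens per second. The function name and the 50 GB/s bandwidth figure are hypothetical placeholders.

```python
def decode_ceiling_tokens_per_sec(weight_bytes: float,
                                  bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on tokens/sec if all weights are read once per token."""
    return bandwidth_bytes_per_sec / weight_bytes


if __name__ == "__main__":
    BANDWIDTH = 50e9  # hypothetical 50 GB/s; substitute your hardware's figure
    for name, size_gb in [("fp16", 14.0), ("q8", 7.0), ("q4", 3.5)]:
        ceiling = decode_ceiling_tokens_per_sec(size_gb * 1e9, BANDWIDTH)
        print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

The ceiling scales 4x from fp16 to q4, while the measured speedup quoted above is 2-3x; the gap is plausibly due to dequantization work and other per-token overhead that this estimate ignores.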

Hands-on (15 min)

Benchmark Quantization Levels

#!/usr/bin/env bash
# quant-benchmark.sh — compare q4_K_M vs q8_0 throughput
#
# Prerequisites: llama-server running with two model copies

echo "=== Quantization Benchmark ==="

# CLI throughput test: run each quantization level with the same prompt
if command -v llama-cli &> /dev/null; then
  echo "Model test: Qwen2.5-3B-Instruct"
  echo ""

  for quant in "q4_K_M" "q8_0"; do
    MODEL="/models/qwen2.5-3b-${quant}.gguf"
    if [ ! -f "$MODEL" ]; then
      echo "⚠️  $MODEL not found — skipping"
      continue
    fi

    echo "--- $quant ---"
    # Run benchmark with fixed prompt, measure tok/s
    llama-cli -m "$MODEL" -n 128 -p "Explain machine learning in 3 paragraphs." \
      --no-display-prompt 2>&1 | tail -5
    echo ""
  done
else
  echo "⚠️  llama-cli not found, skipping CLI benchmark"
fi

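# If llama-server is running, compare response times.
# Assumes one llama-server instance per quantization level, on ports 8080 and 8081.
echo "=== Server-side latency comparison ==="
for endpoint in "8080" "8081"; do
  total=0
  ok=1
  for i in 1 2 3; do
    start=$(date +%s%N)   # nanoseconds; requires GNU date
    # -f makes curl fail on HTTP errors so a dead server is not timed as "fast"
    if ! curl -sf "http://localhost:${endpoint}/v1/completions" \
      -H "Content-Type: application/json" \
      -d '{"prompt":"Write a short poem about coding.","max_tokens":50}' > /dev/null; then
      ok=0
      break
    fi
    elapsed=$((($(date +%s%N) - start) / 1000000))
    total=$((total + elapsed))
  done
  if [ "$ok" -eq 1 ]; then
    echo "  Port ${endpoint}: avg $((total / 3))ms"
  else
    echo "  Port ${endpoint}: request failed, skipping"
  fi
done

#!/usr/bin/env python3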
"""quant-benchmark.py — measure speed and quality at different quant levels."""
import time
import json
import httpx

# Stub — Ayva will expand with:
# - Actual quality comparison (same prompt, compare outputs)
# - Memory usage tracking (psutil or /proc/meminfo)
# - Multiple prompt lengths (short, medium, long)
# - Concurrent request throughput
# - Temperature 0.0 for deterministic comparison
# - Automated MMLU subset evaluation

LLM_URL = "http://localhost:8080/v1/completions"

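def benchmark(prompt: str, max_tokens: int = 128):
    """Send the same prompt 5 times and report latency and throughput."""
    latencies = []
    outputs = []
    generated = []  # tokens actually generated per request

    for _ in range(5):
        start = time.perf_counter()  # monotonic clock, better suited to timing
        resp = httpx.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }, timeout=60)
        elapsed = time.perf_counter() - start
        resp.raise_for_status()
        data = resp.json()
        latencies.append(elapsed)
        outputs.append(data["choices"][0]["text"])
        # The model may stop before max_tokens. Use the reported count when the
        # server includes an OpenAI-style "usage" field, else fall back.
        generated.append(data.get("usage", {}).get("completion_tokens", max_tokens))

    return {
        "avg_latency": sum(latencies) / len(latencies),
        "min_latency": min(latencies),
        "max_latency": max(latencies),
        "tokens_per_sec": sum(generated) / sum(latencies),
        "first_output_preview": outputs[0][:100],
    }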

if __name__ == "__main__":
    print("Benchmarking inference...")
    result = benchmark("What is the difference between TCP and UDP?", 128)
    print(json.dumps(result, indent=2))
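
The stub lists "same prompt, compare outputs" as a planned extension. One way this might look is sketched below, assuming (as in the shell script) that two servers listen on ports 8080 and 8081, each serving a different quantization of the same model. The function names are hypothetical, the sketch uses the standard library's `urllib` rather than `httpx` so it has no dependencies, and word-level overlap is only a rough proxy for quality, not a measure of correctness.

```python
import difflib
import json
import urllib.request

# Assumption: each port serves a different quantization of the same model.
URL_A = "http://localhost:8080/v1/completions"
URL_B = "http://localhost:8081/v1/completions"


def complete(url: str, prompt: str, max_tokens: int = 128) -> str:
    """Request a deterministic (temperature 0.0) completion and return its text."""
    payload = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def compare(prompt: str) -> float:
    """Fetch the same prompt from both servers and score how close the outputs are."""
    return similarity(complete(URL_A, prompt), complete(URL_B, prompt))
```

With both servers running, `compare("What is the difference between TCP and UDP?")` returns a score between 0 and 1. A low score signals that the two outputs diverged and are worth reading side by side; it does not say which one is better.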

Questions for Ayva:

  • What's the actual quality difference between q4_K_M and q8_0 on real-world prompts (not benchmarks)?
  • Are there specific tasks (maths, code generation) where quantization degrades more?
  • How does IQ4_NL (non-linear 4-bit) compare to q4_K_M?


Key Takeaways

  • Quantization is the highest-ROI optimization: a 4x memory reduction for ~3% quality loss
  • q4_K_M (GGUF) is the sweet spot for CPU inference on your VPS
  • Always benchmark on your own workload — benchmark scores don't always translate
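  • Lower-bit quantization pays off most when inference is memory-bound (limited by RAM or memory bandwidth); when it is compute-bound, a higher-precision format such as q8_0 costs little extra speed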
