Day 15: Inference Optimization

Learning Objectives

Understand why quantization is the single most impactful inference optimisation
Compare GGUF (CPU-focused) vs GPTQ/AWQ (GPU-focused) formats
Benchmark q4_K_M vs q8_0 on your own hardware

Theory (15 min)

Why Quantize?

Model weights are typically stored as fp16 or bf16 (16-bit floats), so each weight takes 2 bytes.

A 7B parameter model:
- fp16: 7B × 2 bytes = 14 GB, which won't fit on a 12 GB GPU
- q8 (8-bit): 7B × 1 byte = 7 GB, which fits on most GPUs
- q4 (4-bit): 7B × 0.5 bytes = 3.5 GB, which fits on low-end GPUs and runs on CPU
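The arithmetic above generalises to any parameter count and bit width. A minimal sketch follows; the function name `weight_memory_gb` is made up for illustration, and the estimate covers weights only (no KV cache, activations, or runtime overhead), using 1 GB = 10^9 bytes.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts the weights only: KV cache and runtime overhead are ignored.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Same 7B example as in the text.
    for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
        print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantized files come out somewhat larger than the pure 4-bit or 8-bit figure because each block of weights also stores scale metadata.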

Quantization: store weights at lower precision, accept small quality loss.

Quantization Formats Compared

Format	Bits per weight (approx.)	Quality (MMLU, relative to fp16)	Speed (relative to fp16)	Memory (7B model)	Hardware
fp16	16	100% (baseline)	Reference	14GB	GPU only
q8_0	8	99.5%	~1.5x	7GB	GPU + CPU
q4_K_M	4.5	~97%	~2-3x	4GB	CPU ideal
q3_K_S	3.5	~95%	~3-4x	3GB	CPU, mobile
q2_K	2.5	~88%	~4x	2GB	Extreme compression

q4_K_M is the sweet spot: about 97% of fp16 quality at a little over a quarter of the memory (4 GB vs 14 GB).
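One way to use the table is to pick the highest-precision format that fits a given memory budget. The sketch below is only an illustration: the bit widths are the approximate values from the table, while the helper name `largest_fitting_format` and the 1 GB `overhead_gb` allowance for KV cache and runtime are assumptions, not measured figures.

```python
# Approximate bits per weight, copied from the comparison table.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": 8.0,
    "q4_K_M": 4.5,
    "q3_K_S": 3.5,
    "q2_K": 2.5,
}


def largest_fitting_format(n_params: float, budget_gb: float, overhead_gb: float = 1.0):
    """Return the highest-precision format whose weights plus overhead fit, or None.

    overhead_gb is a hypothetical allowance for KV cache and runtime buffers.
    """
    by_precision = sorted(BITS_PER_WEIGHT.items(), key=lambda kv: kv[1], reverse=True)
    for name, bits in by_precision:
        weights_gb = n_params * bits / 8 / 1e9
        if weights_gb + overhead_gb <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for budget in (16, 8.5, 6, 2):
        print(f"{budget} GB budget -> {largest_fitting_format(7e9, budget)}")
```

Fitting in memory is only the first filter; the quality column still decides whether the chosen format is acceptable for the task.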

How Quantization Works

Original fp16 weights:
[1.2345, -0.7891, 3.4567, ...]

q4 quantized:
[1.25, -0.75, 3.50, ...]  → rounded to nearest 0.25 grid point

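Loss: the rounding error. For most weights, this is negligible. The fixed 0.25 grid is only illustrative: real formats split the weights into small blocks and store a scale factor per block, so the grid spacing adapts to the range of values in each block.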
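To make the rounding idea concrete, here is a minimal sketch of symmetric block quantization: one block of weights is mapped to small signed integers plus a single float scale, then reconstructed. This is a simplified scheme for illustration, not the actual q4_K_M algorithm (the GGUF k-quant formats are more elaborate); the function names are made up for this example.

```python
def quantize_block(weights, bits=4):
    """Quantize one block to signed integers in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point and clamp to the integer range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Reconstruct approximate weights from integers and the block scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    block = [1.2345, -0.7891, 3.4567, 0.0123]
    q, scale = quantize_block(block)
    restored = dequantize_block(q, scale)
    print("integers:", q, "scale:", round(scale, 4))
    for w, r in zip(block, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.4f})")
```

In this scheme the per-weight error is at most half the grid spacing (`scale / 2`), and the largest weight in the block sets that spacing, which is why a single outlier makes every other weight in its block coarser.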

What You Lose (and Gain)

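Loss: ~2-3 percentage points on academic benchmarks (e.g. MMLU dropping from 68% to 65%)
Gain: ~4x memory reduction and a 2-3x speedup (token generation is usually limited by memory bandwidth, so fewer bytes per weight means less data to read per token)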
Real-world impact: Often undetectable for chat/QA. Noticeable for maths/code.
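A common back-of-the-envelope way to reason about the speedup: single-stream token generation usually has to read every weight once per token, so memory bandwidth divided by model size gives a rough upper bound on tokens per second. The sketch below assumes that model; the 50 GB/s bandwidth figure is a hypothetical value for illustration, not a measurement of any particular machine.

```python
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on generation speed when memory bandwidth is the limit.

    Assumes each generated token reads all model weights exactly once.
    """
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    bandwidth = 50.0  # GB/s, hypothetical value for illustration
    for name, size_gb in [("fp16", 14.0), ("q8_0", 7.0), ("q4_K_M", 4.0)]:
        print(f"{name}: at most ~{max_tokens_per_sec(size_gb, bandwidth):.1f} tok/s")
```

Measured speedups tend to be smaller than this bound suggests, since dequantization and the rest of the compute also take time, which is consistent with the 2-3x figure above rather than the full memory ratio.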

Hands-on (15 min)

Benchmark Quantization Levels

#!/usr/bin/env bash
# quant-benchmark.sh — compare q4_K_M vs q8_0 throughput
#
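# Prerequisites: llama-cli on PATH and both GGUF files under /models for the first part;
# two llama-server instances on ports 8080 and 8081 (presumably one per quantization) for the second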

echo "=== Quantization Benchmark ==="

# Part 1: run the same prompt through each quantization with llama-cli
if command -v llama-cli &> /dev/null; then
  echo "Model test: Qwen2.5-3B-Instruct"
  echo ""

  for quant in "q4_K_M" "q8_0"; do
    MODEL="/models/qwen2.5-3b-${quant}.gguf"
    if [ ! -f "$MODEL" ]; then
      echo "⚠️  $MODEL not found — skipping"
      continue
    fi

    echo "--- $quant ---"
    # Run benchmark with fixed prompt, measure tok/s
    llama-cli -m "$MODEL" -n 128 -p "Explain machine learning in 3 paragraphs." \
      --no-display-prompt 2>&1 | tail -5
    echo ""
  done
else
  echo "⚠️  llama-cli not found: skipping the CLI benchmark"
fi

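# Part 2: if llama-server is running, compare average response times.
# Note: date +%s%N needs GNU date (Linux); BSD/macOS date does not support %N.
echo "=== Server-side latency comparison ==="
for port in 8080 8081; do
  total=0
  ok=1
  for i in 1 2 3; do
    start=$(date +%s%N)
    # -f makes curl exit non-zero on HTTP errors, so failures are not timed as successes
    if ! curl -sf "http://localhost:${port}/v1/completions" \
      -H "Content-Type: application/json" \
      -d '{"prompt":"Write a short poem about coding.","max_tokens":50}' > /dev/null; then
      ok=0
      break
    fi
    elapsed=$(( ($(date +%s%N) - start) / 1000000 ))  # nanoseconds to milliseconds
    total=$((total + elapsed))
  done
  if [ "$ok" -eq 1 ]; then
    echo "  Port ${port}: avg $((total / 3))ms"
  else
    echo "  Port ${port}: request failed, is llama-server running?"
  fi
done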

#!/usr/bin/env python3
"""quant-benchmark.py — measure speed and quality at different quant levels."""
import time
import json
import httpx

# Stub — Ayva will expand with:
# - Actual quality comparison (same prompt, compare outputs)
# - Memory usage tracking (psutil or /proc/meminfo)
# - Multiple prompt lengths (short, medium, long)
# - Concurrent request throughput
# - Temperature 0.0 for deterministic comparison
# - Automated MMLU subset evaluation

LLM_URL = "http://localhost:8080/v1/completions"

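def benchmark(prompt: str, max_tokens: int = 128, runs: int = 5):
    """Send the same prompt several times and summarise latency and throughput."""
    latencies = []
    token_counts = []
    outputs = []

    for _ in range(runs):
        # perf_counter is monotonic, so it is not affected by system clock changes
        start = time.perf_counter()
        resp = httpx.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.0,
        }, timeout=60)
        elapsed = time.perf_counter() - start
        resp.raise_for_status()  # fail loudly instead of timing an error response
        data = resp.json()
        latencies.append(elapsed)
        # The model may stop before max_tokens, so prefer the reported token count.
        # Falls back to max_tokens if the server does not include a usage block.
        token_counts.append(data.get("usage", {}).get("completion_tokens", max_tokens))
        outputs.append(data["choices"][0]["text"])

    return {
        "avg_latency": sum(latencies) / len(latencies),
        "min_latency": min(latencies),
        "max_latency": max(latencies),
        # End-to-end throughput: includes prompt processing and HTTP overhead
        "tokens_per_sec": sum(token_counts) / sum(latencies),
        "first_output_preview": outputs[0][:100],
    }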

if __name__ == "__main__":
    print("Benchmarking inference...")
    result = benchmark("What is the difference between TCP and UDP?", 128)
    print(json.dumps(result, indent=2))
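The stub's to-do list mentions comparing outputs for the same prompt across quantization levels. A minimal sketch of one way to start, using only the standard library: a word-level similarity score between paired outputs. The function names are hypothetical, and lexical overlap is a crude proxy that flags divergence rather than measuring quality, so it complements reading the outputs or running a task-specific evaluation rather than replacing them.

```python
import difflib


def output_similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def compare_outputs(outputs_a, outputs_b):
    """Score paired outputs (same prompt, different quantization) and average them."""
    scores = [output_similarity(a, b) for a, b in zip(outputs_a, outputs_b)]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"scores": scores, "mean": mean}


if __name__ == "__main__":
    # Placeholder strings; in practice these would come from the two servers.
    q8_outputs = ["TCP is connection-oriented and reliable."]
    q4_outputs = ["TCP is connection-oriented and mostly reliable."]
    print(compare_outputs(q8_outputs, q4_outputs))
```

With temperature 0.0, as in the benchmark above, differences between the two outputs are more likely to come from the quantization than from sampling noise.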

Questions for Ayva:
- What's the actual quality difference between q4_K_M and q8_0 on real-world prompts (not benchmarks)?
- Are there specific tasks (math, code generation) where quantisation degrades more?
- How does IQ4_NL (non-linear 4-bit) compare to q4_K_M?

Key Takeaways

Quantization is the highest-ROI optimisation: roughly 4x less memory for ~3% quality loss
q4_K_M (GGUF) is the sweet spot for CPU inference on your VPS
Always benchmark on your own workload — benchmark scores don't always translate
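Lower bit widths pay off most when inference is memory-bound (single-stream token generation); when it is compute-bound (large batches, long prompt processing), the speedup shrinks and higher precision costs less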
