Day 15: Inference Optimization
Learning Objectives
- Understand why quantization is the single most impactful inference optimisation
- Compare GGUF (CPU-focused) vs GPTQ/AWQ (GPU-focused) formats
- Benchmark q4_K_M vs q8_0 on your own hardware
Theory (15 min)
Why Quantize?
Model weights are stored as fp16 (16-bit floats). Each weight uses 2 bytes.
A 7B parameter model: - fp16: 7B × 2 bytes = 14 GB — won't fit on a 12GB GPU - q8 (8-bit): 7B × 1 byte = 7 GB — fits on most GPUs - q4 (4-bit): 7B × 0.5 bytes = 3.5 GB — fits on low-end GPUs, runs on CPU
Quantization: store weights at lower precision, accept small quality loss.
Quantization Formats Compared
| Format | Bit Width | Quality (MMLU) | Speed | Memory | Hardware |
|---|---|---|---|---|---|
| fp16 | 16 | 100% (baseline) | Reference | 14GB | GPU only |
| q8_0 | 8 | 99.5% | ~1.5x | 7GB | GPU + CPU |
| q4_K_M | 4.5 | ~97% | ~2-3x | 4GB | CPU ideal |
| q3_K_S | 3.5 | ~95% | ~3-4x | 3GB | CPU, mobile |
| q2_K | 2.5 | ~88% | ~4x | 2GB | Extreme compression |
q4_K_M is the sweet spot — 97% quality at 1/4 the memory.
How Quantization Works
Original fp16 weights:
[1.2345, -0.7891, 3.4567, ...]
q4 quantized:
[1.25, -0.75, 3.50, ...] → rounded to nearest 0.25 grid point
Loss: the rounding error. For most weights, this is negligible.
What You Lose (and Gain)
- Loss: ~2-3% on academic benchmarks (MMLU drop from 68% to 65%)
- Gain: 4x memory reduction, 2-3x speedup (more data fits in cache)
- Real-world impact: Often undetectable for chat/QA. Noticeable for maths/code.
Hands-on (15 min)
Benchmark Quantization Levels
#!/usr/bin/env bash
# quant-benchmark.sh — compare q4_K_M vs q8_0 throughput
#
# Prerequisites: llama-server running with two model copies
echo "=== Quantization Benchmark ==="
# Test with q4_K_M (your usual setup)
if command -v llama-cli &> /dev/null; then
echo "Model test: Qwen2.5-3B-Instruct"
echo ""
for quant in "q4_K_M" "q8_0"; do
MODEL="/models/qwen2.5-3b-${quant}.gguf"
if [ ! -f "$MODEL" ]; then
echo "⚠️ $MODEL not found — skipping"
continue
fi
echo "--- $quant ---"
# Run benchmark with fixed prompt, measure tok/s
llama-cli -m "$MODEL" -n 128 -p "Explain machine learning in 3 paragraphs." \
--no-display-prompt 2>&1 | tail -5
echo ""
done
else
echo "⚠️ llama-cli not found — run for a quick manual test"
fi
# If llama-server is running, compare response times
echo "=== Server-side latency comparison ==="
for endpoint in "8080" "8081"; do
total=0
for i in 1 2 3; do
start=$(date +%s%N)
curl -s "http://localhost:${endpoint}/v1/completions" \
-H "Content-Type: application/json" \
-d '{"prompt":"Write a short poem about coding.","max_tokens":50}' > /dev/null
elapsed=$((($(date +%s%N) - start) / 1000000))
total=$((total + elapsed))
done
avg=$((total / 3))
echo " Port ${endpoint}: avg ${avg}ms"
done
#!/usr/bin/env python3
"""quant-benchmark.py — measure speed and quality at different quant levels."""
import time
import json
import httpx
# Stub — Ayva will expand with:
# - Actual quality comparison (same prompt, compare outputs)
# - Memory usage tracking (psutil or /proc/meminfo)
# - Multiple prompt lengths (short, medium, long)
# - Concurrent request throughput
# - Temperature 0.0 for deterministic comparison
# - Automated MMLU subset evaluation
LLM_URL = "http://localhost:8080/v1/completions"
def benchmark(prompt: str, max_tokens: int = 128):
latencies = []
outputs = []
for _ in range(5):
start = time.time()
resp = httpx.post(LLM_URL, json={
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": 0.0,
}, timeout=60)
elapsed = time.time() - start
latencies.append(elapsed)
outputs.append(resp.json()["choices"][0]["text"])
return {
"avg_latency": sum(latencies) / len(latencies),
"min_latency": min(latencies),
"max_latency": max(latencies),
"tokens_per_sec": max_tokens / (sum(latencies) / len(latencies)),
"first_output_preview": outputs[0][:100],
}
if __name__ == "__main__":
print("Benchmarking inference...")
result = benchmark("What is the difference between TCP and UDP?", 128)
print(json.dumps(result, indent=2))
Questions for Ayva: - What's the actual quality difference between q4_K_M and q8_0 on real-world prompts (not benchmarks)? - Are there specific tasks (math, code generation) where quantisation degrades more? - How does IQ4_NL (importance-weighted 4-bit) compare to q4_K_M?
Key Takeaways
- Quantization is the highest-ROI optimisation — 4x memory for ~3% quality loss
- q4_K_M (GGUF) is the sweet spot for CPU inference on your VPS
- Always benchmark on your own workload — benchmark scores don't always translate
- Lower quantisation is better for memory-bound, higher for compute-bound