Day 18: GPU vs CPU Offloading
Learning Objectives
- Understand why layer placement matters (memory hierarchy, bandwidth)
- Learn how
-ngl Nin llama.cpp controls GPU offloading - Benchmark different offload levels to find the sweet spot on your hardware
Theory (15 min)
The Memory Hierarchy
CPU RAM (DDR5, 64 GB/s) āā Large, slow
ā
GPU VRAM (GDDR6, 1 TB/s) āā Small, fast
ā
GPU SRAM (Shared mem, 10+ TB/s) āā Tiny, extremely fast
Bandwidth is the bottleneck for LLM inference. Each token requires moving the full model weights from memory to compute. Higher bandwidth = faster inference.
What -ngl Does
-ngl N = offload the first N layers to the GPU.
-ngl 0: All layers on CPU (VRAM = 0)
Slowest inference, most memory-efficient
-ngl 20: 20 layers on GPU, rest on CPU
Medium speed, needs some VRAM
-ngl 99: All layers on GPU (or max possible)
Fastest inference, needs full VRAM
PCIe Bottleneck
When a layer is on GPU but its inputs come from CPU-resident layers, data crosses the PCIe bus.
CPU Layer āāā¶ PCIe (32 GB/s) āāā¶ GPU Layer āāā¶ GPU SRAM
PCIe Gen4 x16 = ~32 GB/s. This is 30x slower than GDDR6 bandwidth.
Implication: Offloading partial layers gives diminishing returns because each token crosses PCIe.
Finding Your Sweet Spot
For a 3B model (q4_K_M = ~1.8 GB):
-ngl |
VRAM Used | PCIe Crossings | Expected Speed |
|---|---|---|---|
| 0 | 0 GB | 0 | 1x (baseline) |
| 10 | ~1 GB | 2 per token | ~2x |
| 20 | ~1.8 GB | 0 (all GPU) | ~3-4x |
| 99 | ~1.8 GB | 0 | ~3-4x (saturated) |
The curve flattens ā once all layers are offloaded, more GPU doesn't help.
Hands-on (15 min)
Sweep -ngl Values and Measure
#!/usr/bin/env bash
# ngl-sweep.sh ā benchmark different GPU offload levels
# Requires: llama-cli, a GGUF model
MODEL="/models/qwen2.5-3b-q4_K_M.gguf"
PROMPT="Explain the concept of attention in transformer models in 3 paragraphs."
N_TOKENS=128
echo "=== GPU Offload (-ngl) Sweep ==="
echo "Model: $MODEL"
echo ""
for ngl in 0 5 10 15 20 25 30 99; do
echo "--- ngl = $ngl ---"
OUTPUT=$(llama-cli -m "$MODEL" -n $N_TOKENS -p "$PROMPT" \
--no-display-prompt -ngl $ngl 2>&1)
# Extract timing info (format varies by version)
TOK_S=$(echo "$OUTPUT" | grep -oP '(\d+\.?\d*) tokens per second' | head -1)
echo " Tokens/sec: $TOK_S"
echo ""
done
#!/usr/bin/env python3
"""ngl-sweep.py ā programmatic offload benchmark."""
import subprocess
import re
import json
import time
# Stub ā Ayva will expand with:
# - Multiple model sizes (1.5B, 3B, 7B)
# - Multiple quant levels (q4_K_M, q8_0)
# - Memory usage tracking (free -m, nvidia-smi)
# - Latency vs throughput chart
# - PCIe bandwidth utilisation measurement
# - CPU-only vs GPU-offload vs full-GPU comparison
MODEL = "/models/qwen2.5-3b-q4_K_M.gguf"
PROMPT = "Explain the concept of attention in transformer models."
NGL_VALUES = [0, 5, 10, 15, 20, 99]
results = []
for ngl in NGL_VALUES:
print(f"Testing ngl={ngl}...")
start = time.time()
result = subprocess.run(
["llama-cli", "-m", MODEL, "-n", "128", "-p", PROMPT,
"--no-display-prompt", "-ngl", str(ngl)],
capture_output=True, text=True, timeout=120,
)
elapsed = time.time() - start
output = result.stdout + result.stderr
# Try to extract tok/s
tok_s_match = re.search(r'(\d+\.?\d*)\s*tokens per second', output)
tok_s = float(tok_s_match.group(1)) if tok_s_match else 0
results.append({"ngl": ngl, "tokens_per_sec": tok_s, "total_time": round(elapsed, 2)})
print(f" ā {tok_s:.1f} tok/s ({elapsed:.1f}s total)")
time.sleep(2) # cooldown
print("\nš Results:")
print(f"{'ngl':>6} {'tok/s':>10} {'time(s)':>10}")
print("-" * 30)
for r in results:
print(f"{r['ngl']:>6} {r['tokens_per_sec']:>10.1f} {r['total_time']:>10.2f}")
Questions for Ayva:
- How does PCIe Gen4 vs Gen5 affect the optimal -ngl value?
- For CPU-only inference, what's the most impactful optimisation (RAM speed, thread count, cache size)?
- How does model size change the -ngl sweet spot curve?
Key Takeaways
- GPU offloading speeds inference, but with diminishing returns once all layers fit on GPU
- PCIe bandwidth is a major bottleneck for partial offloading
- Always benchmark your specific model and hardware ā the sweet spot varies
- For consumer GPUs with limited VRAM: offload as many layers as fit, keep rest on CPU