Day 18: GPU vs CPU Offloading

📂 Serving & Inference 📖 15 min read Needs expansion

Learning Objectives

Understand why layer placement matters (memory hierarchy, bandwidth)
Learn how -ngl N in llama.cpp controls GPU offloading
Benchmark different offload levels to find the sweet spot on your hardware

Theory (15 min)

The Memory Hierarchy

CPU RAM  (DDR5, 64 GB/s)  ── Large, slow
    │
GPU VRAM (GDDR6, 1 TB/s) ── Small, fast
    │
GPU SRAM (Shared mem, 10+ TB/s) ── Tiny, extremely fast

Bandwidth is the bottleneck for LLM inference. Each token requires moving the full model weights from memory to compute. Higher bandwidth = faster inference.

What `-ngl` Does

-ngl N = offload the first N layers to the GPU.

-ngl 0:  All layers on CPU (VRAM = 0)
         Slowest inference, most memory-efficient

-ngl 20: 20 layers on GPU, rest on CPU
         Medium speed, needs some VRAM

-ngl 99: All layers on GPU (or max possible)
         Fastest inference, needs full VRAM

PCIe Bottleneck

When a layer is on GPU but its inputs come from CPU-resident layers, data crosses the PCIe bus.

CPU Layer  ──▶ PCIe (32 GB/s) ──▶ GPU Layer ──▶ GPU SRAM

PCIe Gen4 x16 = ~32 GB/s. This is 30x slower than GDDR6 bandwidth.

Implication: Offloading partial layers gives diminishing returns because each token crosses PCIe.

Finding Your Sweet Spot

For a 3B model (q4_K_M = ~1.8 GB):

`-ngl`	VRAM Used	PCIe Crossings	Expected Speed
0	0 GB	0	1x (baseline)
10	~1 GB	2 per token	~2x
20	~1.8 GB	0 (all GPU)	~3-4x
99	~1.8 GB	0	~3-4x (saturated)

The curve flattens — once all layers are offloaded, more GPU doesn't help.

Hands-on (15 min)

Sweep `-ngl` Values and Measure

#!/usr/bin/env bash
# ngl-sweep.sh — benchmark different GPU offload levels
# Requires: llama-cli, a GGUF model

MODEL="/models/qwen2.5-3b-q4_K_M.gguf"
PROMPT="Explain the concept of attention in transformer models in 3 paragraphs."
N_TOKENS=128

echo "=== GPU Offload (-ngl) Sweep ==="
echo "Model: $MODEL"
echo ""

for ngl in 0 5 10 15 20 25 30 99; do
  echo "--- ngl = $ngl ---"
  OUTPUT=$(llama-cli -m "$MODEL" -n $N_TOKENS -p "$PROMPT" \
    --no-display-prompt -ngl $ngl 2>&1)

  # Extract timing info (format varies by version)
  TOK_S=$(echo "$OUTPUT" | grep -oP '(\d+\.?\d*) tokens per second' | head -1)
  echo "  Tokens/sec: $TOK_S"
  echo ""
done

#!/usr/bin/env python3
"""ngl-sweep.py — programmatic offload benchmark."""
import subprocess
import re
import json
import time

# Stub — Ayva will expand with:
# - Multiple model sizes (1.5B, 3B, 7B)
# - Multiple quant levels (q4_K_M, q8_0)
# - Memory usage tracking (free -m, nvidia-smi)
# - Latency vs throughput chart
# - PCIe bandwidth utilisation measurement
# - CPU-only vs GPU-offload vs full-GPU comparison

MODEL = "/models/qwen2.5-3b-q4_K_M.gguf"
PROMPT = "Explain the concept of attention in transformer models."
NGL_VALUES = [0, 5, 10, 15, 20, 99]

results = []
for ngl in NGL_VALUES:
    print(f"Testing ngl={ngl}...")
    start = time.time()

    result = subprocess.run(
        ["llama-cli", "-m", MODEL, "-n", "128", "-p", PROMPT,
         "--no-display-prompt", "-ngl", str(ngl)],
        capture_output=True, text=True, timeout=120,
    )

    elapsed = time.time() - start
    output = result.stdout + result.stderr

    # Try to extract tok/s
    tok_s_match = re.search(r'(\d+\.?\d*)\s*tokens per second', output)
    tok_s = float(tok_s_match.group(1)) if tok_s_match else 0

    results.append({"ngl": ngl, "tokens_per_sec": tok_s, "total_time": round(elapsed, 2)})
    print(f"  → {tok_s:.1f} tok/s ({elapsed:.1f}s total)")
    time.sleep(2)  # cooldown

print("\n📊 Results:")
print(f"{'ngl':>6} {'tok/s':>10} {'time(s)':>10}")
print("-" * 30)
for r in results:
    print(f"{r['ngl']:>6} {r['tokens_per_sec']:>10.1f} {r['total_time']:>10.2f}")

Questions for Ayva: - How does PCIe Gen4 vs Gen5 affect the optimal -ngl value? - For CPU-only inference, what's the most impactful optimisation (RAM speed, thread count, cache size)? - How does model size change the -ngl sweet spot curve?

Key Takeaways

GPU offloading speeds inference, but with diminishing returns once all layers fit on GPU
PCIe bandwidth is a major bottleneck for partial offloading
Always benchmark your specific model and hardware — the sweet spot varies
For consumer GPUs with limited VRAM: offload as many layers as fit, keep rest on CPU

🧠 AI System Design

Day 18: GPU vs CPU Offloading

Learning Objectives

Theory (15 min)

The Memory Hierarchy

What `-ngl` Does

PCIe Bottleneck

Finding Your Sweet Spot

Hands-on (15 min)

Sweep `-ngl` Values and Measure

Key Takeaways

References

Day 18: GPU vs CPU Offloading

Learning Objectives

Theory (15 min)

The Memory Hierarchy

What -ngl Does

PCIe Bottleneck

Finding Your Sweet Spot

Hands-on (15 min)

Sweep -ngl Values and Measure

Key Takeaways

References

What `-ngl` Does

Sweep `-ngl` Values and Measure