🧠 AI System Design

Day 18: GPU vs CPU Offloading

šŸ“‚ Serving & Inference šŸ“– 15 min read Needs expansion

Learning Objectives

  • Understand why layer placement matters (memory hierarchy, bandwidth)
  • Learn how -ngl N in llama.cpp controls GPU offloading
  • Benchmark different offload levels to find the sweet spot on your hardware

Theory (15 min)

The Memory Hierarchy

CPU RAM  (DDR5, 64 GB/s)  ── Large, slow
    │
GPU VRAM (GDDR6, 1 TB/s) ── Small, fast
    │
GPU SRAM (Shared mem, 10+ TB/s) ── Tiny, extremely fast

Bandwidth is the bottleneck for LLM inference. Each token requires moving the full model weights from memory to compute. Higher bandwidth = faster inference.

What -ngl Does

-ngl N = offload the first N layers to the GPU.

-ngl 0:  All layers on CPU (VRAM = 0)
         Slowest inference, most memory-efficient

-ngl 20: 20 layers on GPU, rest on CPU
         Medium speed, needs some VRAM

-ngl 99: All layers on GPU (or max possible)
         Fastest inference, needs full VRAM

PCIe Bottleneck

When a layer is on GPU but its inputs come from CPU-resident layers, data crosses the PCIe bus.

CPU Layer  ──▶ PCIe (32 GB/s) ──▶ GPU Layer ──▶ GPU SRAM

PCIe Gen4 x16 = ~32 GB/s. This is 30x slower than GDDR6 bandwidth.

Implication: Offloading partial layers gives diminishing returns because each token crosses PCIe.

Finding Your Sweet Spot

For a 3B model (q4_K_M = ~1.8 GB):

-ngl VRAM Used PCIe Crossings Expected Speed
0 0 GB 0 1x (baseline)
10 ~1 GB 2 per token ~2x
20 ~1.8 GB 0 (all GPU) ~3-4x
99 ~1.8 GB 0 ~3-4x (saturated)

The curve flattens — once all layers are offloaded, more GPU doesn't help.


Hands-on (15 min)

Sweep -ngl Values and Measure

#!/usr/bin/env bash
# ngl-sweep.sh — benchmark different GPU offload levels
# Requires: llama-cli, a GGUF model

MODEL="/models/qwen2.5-3b-q4_K_M.gguf"
PROMPT="Explain the concept of attention in transformer models in 3 paragraphs."
N_TOKENS=128

echo "=== GPU Offload (-ngl) Sweep ==="
echo "Model: $MODEL"
echo ""

for ngl in 0 5 10 15 20 25 30 99; do
  echo "--- ngl = $ngl ---"
  OUTPUT=$(llama-cli -m "$MODEL" -n $N_TOKENS -p "$PROMPT" \
    --no-display-prompt -ngl $ngl 2>&1)

  # Extract timing info (format varies by version)
  TOK_S=$(echo "$OUTPUT" | grep -oP '(\d+\.?\d*) tokens per second' | head -1)
  echo "  Tokens/sec: $TOK_S"
  echo ""
done
#!/usr/bin/env python3
"""ngl-sweep.py — programmatic offload benchmark."""
import subprocess
import re
import json
import time

# Stub — Ayva will expand with:
# - Multiple model sizes (1.5B, 3B, 7B)
# - Multiple quant levels (q4_K_M, q8_0)
# - Memory usage tracking (free -m, nvidia-smi)
# - Latency vs throughput chart
# - PCIe bandwidth utilisation measurement
# - CPU-only vs GPU-offload vs full-GPU comparison

MODEL = "/models/qwen2.5-3b-q4_K_M.gguf"
PROMPT = "Explain the concept of attention in transformer models."
NGL_VALUES = [0, 5, 10, 15, 20, 99]

results = []
for ngl in NGL_VALUES:
    print(f"Testing ngl={ngl}...")
    start = time.time()

    result = subprocess.run(
        ["llama-cli", "-m", MODEL, "-n", "128", "-p", PROMPT,
         "--no-display-prompt", "-ngl", str(ngl)],
        capture_output=True, text=True, timeout=120,
    )

    elapsed = time.time() - start
    output = result.stdout + result.stderr

    # Try to extract tok/s
    tok_s_match = re.search(r'(\d+\.?\d*)\s*tokens per second', output)
    tok_s = float(tok_s_match.group(1)) if tok_s_match else 0

    results.append({"ngl": ngl, "tokens_per_sec": tok_s, "total_time": round(elapsed, 2)})
    print(f"  → {tok_s:.1f} tok/s ({elapsed:.1f}s total)")
    time.sleep(2)  # cooldown

print("\nšŸ“Š Results:")
print(f"{'ngl':>6} {'tok/s':>10} {'time(s)':>10}")
print("-" * 30)
for r in results:
    print(f"{r['ngl']:>6} {r['tokens_per_sec']:>10.1f} {r['total_time']:>10.2f}")

Questions for Ayva: - How does PCIe Gen4 vs Gen5 affect the optimal -ngl value? - For CPU-only inference, what's the most impactful optimisation (RAM speed, thread count, cache size)? - How does model size change the -ngl sweet spot curve?


Key Takeaways

  • GPU offloading speeds inference, but with diminishing returns once all layers fit on GPU
  • PCIe bandwidth is a major bottleneck for partial offloading
  • Always benchmark your specific model and hardware — the sweet spot varies
  • For consumer GPUs with limited VRAM: offload as many layers as fit, keep rest on CPU

References