🧠 AI System Design

Day 29: Scaling Law Intuition

📂 Production & Case Studies 📖 15 min read Needs expansion

Learning Objectives

  • Develop an intuition for the economics of scaling AI systems
  • Understand when to throw hardware vs when to optimise software
  • Calculate: cost of upgrading VPS vs cost of optimising your serving layer

Theory (15 min)

The Scaling Laws (Chinchilla, Kaplan)

Key insight: model quality scales predictably with compute, data, and parameters.

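Loss ∝ N^(-α)   when parameters are the bottleneck
Loss ∝ D^(-β)   when data is the bottleneck
Loss ∝ C^(-γ)   when compute is the bottleneck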

Where:
N = model parameters
D = dataset size (tokens)
C = compute (FLOPs)
α, β, γ ≈ 0.05-0.1 (diminishing returns)

Practical meaning:

  • Doubling model parameters improves loss by about 5%
  • Doubling training data improves loss by about 5%
  • There is a sweet spot: about 20 training tokens per parameter (Chinchilla-optimal)
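To see where the "about 5% per doubling" figure comes from, here is a minimal sketch that assumes a pure power law (loss ∝ x^(-exponent)) with exponents taken from the 0.05-0.1 range above. The function names are illustrative, and the 20 tokens-per-parameter default is the Chinchilla rule of thumb rather than an exact constant.

```python
def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Loss after scaling a resource, relative to before, assuming loss ∝ x^(-exponent)."""
    return scale_factor ** (-exponent)


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb training-token budget for a model with n_params parameters."""
    return n_params * tokens_per_param


# Effect of doubling one resource for a few exponents in the quoted range
for exponent in (0.05, 0.075, 0.1):
    improvement = (1 - loss_ratio(2, exponent)) * 100
    print(f"exponent={exponent}: doubling lowers loss by {improvement:.1f}%")

# Token budget for a hypothetical 3B-parameter model
print(f"3B params -> ~{chinchilla_tokens(3e9) / 1e9:.0f}B training tokens")
```

Because the exponent is small, even an 8x increase in one resource only lowers loss by roughly 10-20% under this assumption.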

For Inference (Your World)

The scaling law for serving is different:

Total Cost = N_models × C_query × V_queries

Where:
N_models = number of model instances
C_query = cost per query (hardware + electricity)
V_queries = query volume

Key insight: serving cost scales linearly with query volume, but quality improves only as a small power of model size, so each doubling of the model buys a few percent at best.
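The contrast between linear cost and slowly improving quality can be sketched as follows. This is an illustration under stated assumptions, not a measurement: the per-query cost is a hypothetical figure, the exponent of 0.075 is picked from the 0.05-0.1 range quoted earlier, and per-query cost is assumed to grow in proportion to model size.

```python
def total_cost(n_models: int, cost_per_query: float, query_volume: int) -> float:
    """Total Cost = N_models x C_query x V_queries, as defined above."""
    return n_models * cost_per_query * query_volume


def relative_loss(size_multiplier: float, alpha: float = 0.075) -> float:
    """Loss relative to the base model, assuming loss ∝ N^(-alpha)."""
    return size_multiplier ** (-alpha)


COST_PER_QUERY = 0.0001  # hypothetical $ per query for the base model

print("Query volume vs cost (one instance, base model):")
for volume in (10_000, 20_000, 40_000):
    print(f"  {volume:>6,} queries -> ${total_cost(1, COST_PER_QUERY, volume):.2f}")

print("Model size vs cost and quality (10,000 queries):")
for mult in (1, 2, 4, 8):
    # Assumption: per-query cost grows in proportion to model size
    cost = total_cost(1, COST_PER_QUERY * mult, 10_000)
    gain = (1 - relative_loss(mult)) * 100
    print(f"  {mult}x params -> ${cost:.2f}, loss {gain:.1f}% lower")
```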

Hardware Upgrade vs Software Optimisation

You have two options:

A) Upgrade VPS: 8 vCPU → 16 vCPU ($15/month more)
   Expected speedup: ~1.5x (partial, not all ops parallelisable)

B) Optimise: q4 quantisation + caching + batch inference
   Cost: your time (~1 hour, no extra spend)
   Expected speedup: ~3-4x + token savings

At low scale, optimisation wins by a wide margin: roughly 3-4x for no extra spend versus ~1.5x for $15/month more.
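One way to put the two options on the same footing is to price your time and compare cumulative cost per unit of extra throughput. The sketch below uses the figures from options A and B; the hourly rate, the 12-month horizon, and the `Option` class are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    speedup: float        # throughput multiplier vs the current setup
    monthly_cost: float   # recurring extra spend in $/month
    one_off_cost: float   # up-front cost in $, e.g. your time

    def cost_over(self, months: int) -> float:
        """Cumulative extra cost after the given number of months."""
        return self.one_off_cost + self.monthly_cost * months

    def cost_per_extra_speedup(self, months: int) -> float:
        """Dollars paid per 1x of throughput gained beyond the baseline."""
        return self.cost_over(months) / (self.speedup - 1)


HOURLY_RATE = 50.0  # hypothetical value of one hour of your time, in $

upgrade = Option("A) Upgrade VPS", speedup=1.5, monthly_cost=15.0, one_off_cost=0.0)
optimise = Option("B) Optimise", speedup=3.5, monthly_cost=0.0, one_off_cost=1 * HOURLY_RATE)

for option in (upgrade, optimise):
    print(f"{option.name}: ${option.cost_over(12):.0f} over 12 months, "
          f"${option.cost_per_extra_speedup(12):.0f} per extra 1x of throughput")
```

Even with the hour priced in, the one-off cost of optimising is overtaken by the recurring upgrade fee within a few months under these assumptions.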

When Hardware Makes Sense

Scenario                 | Upgrade (hardware fix) | Don't Upgrade (software fix)
Serving latency critical | GPU upgrade            | CPU optimisation
Throughput bottleneck    | More RAM/cores         | Batch size tuning
Model doesn't fit        | More VRAM/RAM          | Quantisation
10+ concurrent users     | Multiple instances     | Single-instance tuning
Always at 100% CPU       | Upgrade needed         | Already optimised

Always optimise software first. Hardware is the last resort.


Hands-on (15 min)

Build a Cost Scaling Calculator

#!/usr/bin/env python3
"""scaling-calculator.py — compare hardware upgrade vs optimisation."""

# Stub — Ayva will expand with:
# - Real VPS pricing (Hetzner, AWS, DigitalOcean)
# - Real electricity cost calculations
# - GPU vs CPU cost comparison for inference
# - ROI timeline for hardware upgrades
# - "What if" scenarios (2x traffic, 10x traffic, 100x traffic)
# - Graph: cost vs throughput for different strategies

def calc_upgrade_scenario(increase_2x: bool = True):
    """Compare upgrading hardware vs optimising software."""
    print("=" * 60)
    print("📈 HARDWARE UPGRADE SCENARIO" if increase_2x else "📉 OPTIMISATION SCENARIO")
    print("=" * 60)

    # Baseline: Hetzner CAX21 (4 vCPU, 8GB) — your current
    baseline = {"name": "CAX21", "vCPU": 4, "ram_gb": 8, "cost_month": 7.5, "tok_s": 8}

    if increase_2x:
        target = {"name": "CAX31", "vCPU": 8, "ram_gb": 16, "cost_month": 15, "tok_s": 14}
        label = "Upgrade to CAX31"
    else:
        target = {"name": "Optimised", "vCPU": 4, "ram_gb": 8, "cost_month": 7.5, "tok_s": 28}
        label = "Optimise software"

    # Derive the multipliers from the figures above instead of hard-coding them
    speedup = target["tok_s"] / baseline["tok_s"]
    cost_mult = target["cost_month"] / baseline["cost_month"]

    print(f"\nBaseline: {baseline['name']} @ ${baseline['cost_month']}/mo")
    print(f"  Tok/s: {baseline['tok_s']}")
    print(f"  Capacity: {baseline['tok_s'] * 86400 / 1000:.0f}K tokens/day")

    print(f"\n{label}:")
    print(f"  Tok/s: {target['tok_s']} ({speedup:.2f}x baseline)")
    print(f"  Cost: ${target['cost_month']}/mo ({cost_mult:.1f}x baseline)")
    print(f"  Capacity: {target['tok_s'] * 86400 / 1000:.0f}K tokens/day\n")

    # Daily query capacity comparison
    avg_tokens_per_query = 2000
    baseline_queries = baseline['tok_s'] * 86400 // avg_tokens_per_query
    target_queries = target['tok_s'] * 86400 // avg_tokens_per_query

    print(f"📊 Daily Query Capacity:")
    print(f"  Baseline: {baseline_queries:,} queries/day")
    print(f"  {label}: {target_queries:,} queries/day")
    print(f"  Increase: {(target_queries/baseline_queries - 1)*100:.0f}%\n")

    # Cost per query
    baseline_cpq = baseline['cost_month'] / (baseline_queries * 30)
    target_cpq = target['cost_month'] / (target_queries * 30)
    print(f"💰 Cost Per Query:")
    print(f"  Baseline: ${baseline_cpq:.6f}")
    print(f"  {label}: ${target_cpq:.6f}")
    print(f"  Savings: {(1 - target_cpq/baseline_cpq)*100:.0f}%\n")

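    # Headroom check: compare expected demand with this configuration's capacity
    print("⏰ When to upgrade:")
    current_queries = 5000  # your approximate daily usage; replace with a measured figure
    utilization = current_queries / target_queries
    print(f"  Current usage: {current_queries:,} queries/day")
    print(f"  Capacity util: {utilization*100:.0f}%")
    if utilization > 1.0:
        # Demand is above 100% of capacity, so there is no headroom to report
        print("  ❌ Over capacity: demand exceeds what this configuration can serve")
    elif utilization > 0.8:
        print("  ⚠️  At 80%+ utilisation: consider a hardware upgrade")
    else:
        print(f"  ✅ This configuration is sufficient for now (headroom: {(1-utilization)*100:.0f}%)")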


calc_upgrade_scenario(increase_2x=True)
print()
calc_upgrade_scenario(increase_2x=False)

# Summary
print("\n" + "=" * 60)
print("💡 RECOMMENDATION")
print("=" * 60)
print("""
For your current scale (~5K queries/day on CAX21):

Software optimisation (quantisation, caching, batching) gives
~3-4x speedup at zero additional cost.

Hardware upgrade gives ~1.5-2x at 2x cost.

Optimise first. Only upgrade when:
  1. CPU is at 90%+ utilisation for sustained periods
  2. Query volume exceeds 15K/day
  3. You add GPU-dependent features
""")

Run it:

python3 /tmp/scaling-calculator.py

Questions for Ayva:

  • What's the actual token/s ceiling on your Hetzner CAX21 with qwen2.5-3b?
  • At what query volume does a hardware upgrade pay for itself?
  • What's the cost-per-million-tokens for your current setup?
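As a starting point for the last two questions, here is a rough sketch that reuses the placeholder figures from the calculator (monthly price, tok/s, 2,000 tokens per query). It assumes a 30-day month and a server that generates tokens continuously; the helper names are illustrative, and measured throughput should replace the placeholders.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # simplifying assumption


def cost_per_million_tokens(cost_month: float, tok_s: float, utilisation: float = 1.0) -> float:
    """Dollars per million generated tokens at a given average utilisation (0-1)."""
    tokens_per_month = tok_s * SECONDS_PER_DAY * DAYS_PER_MONTH * utilisation
    return cost_month / tokens_per_month * 1_000_000


def max_queries_per_day(tok_s: int, avg_tokens_per_query: int = 2000) -> int:
    """Upper bound on daily queries if the server generates tokens non-stop."""
    return tok_s * SECONDS_PER_DAY // avg_tokens_per_query


# Placeholder figures copied from the calculator: (label, $/month, tok/s)
configs = [
    ("CAX21 baseline", 7.5, 8),
    ("CAX31 upgrade", 15, 14),
    ("CAX21 optimised", 7.5, 28),
]

for label, cost_month, tok_s in configs:
    print(f"{label}: ${cost_per_million_tokens(cost_month, tok_s):.3f} per 1M tokens, "
          f"up to {max_queries_per_day(tok_s):,} queries/day")
```

With these placeholder numbers the upgrade costs more per token than the baseline at full load, so it only pays for itself once demand exceeds what the smaller server can serve.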


Key Takeaways

  • Diminishing returns: 2x model size ≠ 2x quality; 2x capacity ≠ 2x cost
  • Software optimisation (quantisation, caching, batching) beats hardware upgrades at small scale
  • Only upgrade hardware when software optimisation is exhausted
  • Cost-per-query is the metric that matters for economic decisions
