Day 29: Scaling Law Intuition
Learning Objectives
- Develop an intuition for the economics of scaling AI systems
- Understand when to throw hardware at a problem vs when to optimise software
- Calculate: cost of upgrading VPS vs cost of optimising your serving layer
Theory (15 min)
The Scaling Laws (Chinchilla, Kaplan)
Key insight: model quality scales predictably with compute, data, and parameters.
Loss ∝ N^(-α), Loss ∝ D^(-β), Loss ∝ C^(-γ) (each relation holds when the other two factors are not the bottleneck)
Where:
N = model parameters
D = dataset size (tokens)
C = compute (FLOPs)
α, β, γ ≈ 0.05-0.1 (diminishing returns)
Practical meaning:
- Doubling model params gives ~5% loss improvement
- Doubling training data gives ~5% loss improvement
- There's a sweet spot: roughly 20 training tokens per parameter (Chinchilla optimal)
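To make the "~5% per doubling" figure concrete, here is a minimal sketch that evaluates a single-variable power law for a doubling of one resource. It assumes loss ∝ resource^(-exponent) with nothing else acting as the bottleneck, and it uses exponents from the 0.05-0.1 range quoted above. The function names `loss_ratio` and `chinchilla_tokens` are hypothetical helpers, and the 20 tokens-per-parameter default is the rule of thumb from the list, not an exact constant.

```python
def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Relative loss after multiplying one resource by `scale_factor`.

    Assumes loss is proportional to resource ** -exponent and that the
    other resources are not limiting.
    """
    return scale_factor ** -exponent


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb training-token budget for a model with `n_params` parameters."""
    return n_params * tokens_per_param


# Doubling a resource at the low, middle, and high end of the quoted range.
for exponent in (0.05, 0.075, 0.1):
    improvement = (1 - loss_ratio(2, exponent)) * 100
    print(f"exponent={exponent}: doubling cuts loss by ~{improvement:.1f}%")

# Token budget for a 3B-parameter model under the 20:1 rule of thumb.
print(f"3B params -> ~{chinchilla_tokens(3e9) / 1e9:.0f}B training tokens")
```

The improvement per doubling stays in the low single digits across the whole exponent range, which is the diminishing-returns point in numbers.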
For Inference (Your World)
The scaling law for serving is different:
Total Cost = N_models × C_query × V_queries
Where:
N_models = number of model instances
C_query = cost per query (hardware + electricity)
V_queries = query volume
Key insight: the cost to serve scales linearly with query volume, but quality improves only as a slow power law of model size (about 5% lower loss per doubling of parameters).
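The contrast can be sketched in a few lines. The example below applies the Total Cost formula to growing traffic, then compares a larger model's cost against its expected loss. Everything numeric is an assumption for illustration: the per-query cost, the daily volume, the 0.075 exponent (mid-range of the 0.05-0.1 quoted earlier), and the simplification that per-query cost grows in proportion to parameter count.

```python
def total_cost(n_models: int, cost_per_query: float, query_volume: int) -> float:
    """Total Cost = N_models x C_query x V_queries."""
    return n_models * cost_per_query * query_volume


def relative_loss(param_multiplier: float, exponent: float = 0.075) -> float:
    """Loss relative to the base model, assuming loss ∝ params ** -exponent."""
    return param_multiplier ** -exponent


COST_PER_QUERY = 0.0005  # hypothetical, in dollars
DAILY_QUERIES = 5_000    # hypothetical daily volume

# Cost grows linearly with traffic.
for traffic in (1, 2, 10, 100):
    cost = total_cost(1, COST_PER_QUERY, DAILY_QUERIES * traffic)
    print(f"{traffic:>3}x traffic -> ${cost:,.2f}/day")

# Assumption: per-query cost is proportional to parameter count.
for size in (1, 2, 4, 8):
    cost = total_cost(1, COST_PER_QUERY * size, DAILY_QUERIES)
    print(f"{size}x params -> ${cost:,.2f}/day, loss x{relative_loss(size):.3f}")
```

Under these assumptions an 8x larger model costs 8x as much to serve while its loss drops by well under 20%.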
Hardware Upgrade vs Software Optimisation
You have two options:
A) Upgrade VPS: 8 vCPU → 16 vCPU ($15/month more)
Expected speedup: ~1.5x (partial, not all ops parallelisable)
B) Optimise: q4 quantisation + caching + batch inference
Cost: your time (~1 hour, no extra monthly spend)
Expected speedup: ~3-4x + token savings
At low scale, optimisation wins by a significant margin.
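One way to put a number on that margin is speedup per monthly dollar. The sketch below uses the figures from options A and B, with two assumptions that are not in the text: the current 8 vCPU server is taken to cost $15/month, and the 3-4x optimisation speedup is represented by its midpoint, 3.5x. The `Option` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    monthly_cost: float  # dollars per month
    speedup: float       # throughput relative to the current setup

    @property
    def speedup_per_dollar(self) -> float:
        return self.speedup / self.monthly_cost


BASELINE_COST = 15.0  # assumption: current 8 vCPU server price, $/month

options = [
    Option("baseline", BASELINE_COST, 1.0),
    Option("upgrade to 16 vCPU", BASELINE_COST + 15.0, 1.5),  # option A
    Option("optimise software", BASELINE_COST, 3.5),          # option B, midpoint of 3-4x
]

for opt in options:
    print(f"{opt.name:<20} ${opt.monthly_cost:>5.2f}/mo  "
          f"{opt.speedup:.1f}x  {opt.speedup_per_dollar:.3f} speedup/$")

best = max(options, key=lambda o: o.speedup_per_dollar)
print(f"Best value: {best.name}")
```

With these inputs the upgrade delivers less speedup per dollar than the baseline, while optimisation delivers more than three times as much. The ranking is sensitive to the assumed baseline price, so rerun it with your real invoice.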
When Hardware Makes Sense
| Scenario | Hardware option | Software option to try first |
|---|---|---|
| Serving latency critical | GPU upgrade | CPU optimisation |
| Throughput bottleneck | More RAM/cores | Batch size tuning |
| Model doesn't fit | More VRAM/RAM | Quantisation |
| 10+ concurrent users | Multiple instances | Single-instance tuning |
| Always at 100% CPU | Upgrade needed (if already optimised) | Optimise, if not done yet |
Always optimise software first. Hardware is the last resort.
Hands-on (15 min)
Build a Cost Scaling Calculator
#!/usr/bin/env python3
"""scaling-calculator.py: compare hardware upgrade vs optimisation."""

# Stub: Ayva will expand with:
# - Real VPS pricing (Hetzner, AWS, DigitalOcean)
# - Real electricity cost calculations
# - GPU vs CPU cost comparison for inference
# - ROI timeline for hardware upgrades
# - "What if" scenarios (2x traffic, 10x traffic, 100x traffic)
# - Graph: cost vs throughput for different strategies

SECONDS_PER_DAY = 86400


def calc_upgrade_scenario(increase_2x: bool = True):
    """Compare upgrading hardware vs optimising software."""
    print("=" * 60)
    print("📈 HARDWARE UPGRADE SCENARIO" if increase_2x else "📉 OPTIMISATION SCENARIO")
    print("=" * 60)

    # Baseline: Hetzner CAX21 (4 vCPU, 8GB), your current server.
    # Prices and tok/s figures are placeholders until measured.
    baseline = {"name": "CAX21", "vCPU": 4, "ram_gb": 8, "cost_month": 7.5, "tok_s": 8}
    if increase_2x:
        target = {"name": "CAX31", "vCPU": 8, "ram_gb": 16, "cost_month": 15, "tok_s": 14}
        label = "Upgrade to CAX31"
    else:
        target = {"name": "Optimised", "vCPU": 4, "ram_gb": 8, "cost_month": 7.5, "tok_s": 28}
        label = "Optimise software"

    # Derived from the dicts rather than hard-coded
    # (1.75x speed at 2.0x cost for the upgrade, 3.5x at 1.0x for optimisation).
    speedup = target["tok_s"] / baseline["tok_s"]
    cost_mult = target["cost_month"] / baseline["cost_month"]

    print(f"\nBaseline: {baseline['name']} @ ${baseline['cost_month']}/mo")
    print(f"  Tok/s: {baseline['tok_s']}")
    print(f"  Capacity: {baseline['tok_s'] * SECONDS_PER_DAY / 1000:.0f}K tokens/day")
    print(f"\n{label}:")
    print(f"  Tok/s: {target['tok_s']} ({speedup:.2f}x)")
    print(f"  Cost: ${target['cost_month']}/mo ({cost_mult:.1f}x)")
    print(f"  Capacity: {target['tok_s'] * SECONDS_PER_DAY / 1000:.0f}K tokens/day\n")

    # Daily query capacity comparison
    avg_tokens_per_query = 2000
    baseline_queries = baseline["tok_s"] * SECONDS_PER_DAY // avg_tokens_per_query
    target_queries = target["tok_s"] * SECONDS_PER_DAY // avg_tokens_per_query
    print("📊 Daily Query Capacity:")
    print(f"  Baseline: {baseline_queries:,} queries/day")
    print(f"  {label}: {target_queries:,} queries/day")
    print(f"  Increase: {(target_queries / baseline_queries - 1) * 100:.0f}%\n")

    # Cost per query, assuming the server runs at full capacity all month
    baseline_cpq = baseline["cost_month"] / (baseline_queries * 30)
    target_cpq = target["cost_month"] / (target_queries * 30)
    print("💰 Cost Per Query:")
    print(f"  Baseline: ${baseline_cpq:.6f}")
    print(f"  {label}: ${target_cpq:.6f}")
    print(f"  Savings: {(1 - target_cpq / baseline_cpq) * 100:.0f}%\n")  # negative = costs more

    # When optimisation saturates
    print("⏰ When to upgrade:")
    # NOTE: at 2,000 tokens/query the baseline tops out at 345 queries/day,
    # far below the 5,000/day assumed here. Replace both placeholders with
    # measured values before trusting the verdict below.
    current_queries = 5000  # your approximate daily usage
    utilization = current_queries / target_queries
    print(f"  Current usage: {current_queries:,} queries/day")
    print(f"  Capacity util: {utilization * 100:.0f}%")
    if utilization > 1.0:
        print("  ❌ Demand exceeds capacity: recheck the inputs or add capacity")
    elif utilization > 0.8:
        print("  ⚠️ At 80%+ utilisation: consider hardware upgrade")
    else:
        print(f"  ✅ Optimisation sufficient for now (headroom: {(1 - utilization) * 100:.0f}%)")


def main():
    calc_upgrade_scenario(increase_2x=True)
    print()
    calc_upgrade_scenario(increase_2x=False)

    # Summary
    print("\n" + "=" * 60)
    print("💡 RECOMMENDATION")
    print("=" * 60)
    print("""
For your current scale (~5K queries/day on CAX21):
Software optimisation (quantisation, caching, batching) gives
~3-4x speedup at zero additional cost.
Hardware upgrade gives ~1.5-2x at 2x cost.
Optimise first. Only upgrade when:
1. CPU is at 90%+ utilisation for sustained periods
2. Query volume exceeds 15K/day
3. You add GPU-dependent features
""")


if __name__ == "__main__":
    main()
Run it:
python3 /tmp/scaling-calculator.py
Questions for Ayva:
- What's the actual token/s ceiling on your Hetzner CAX21 with qwen2.5-3b?
- At what query volume does hardware upgrade pay for itself?
- What's the cost-per-million-tokens for your current setup?
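As a starting point for the last question, here is a minimal sketch of the cost-per-million-tokens arithmetic. It reuses the placeholder figures from the script ($7.5/month, 8 tok/s) and assumes a 30-day month. The `utilisation` parameter is a hypothetical addition that models a server that generates tokens only part of the time. Swap in measured values once you have them.

```python
SECONDS_PER_MONTH = 86_400 * 30  # assumes a 30-day month


def cost_per_million_tokens(monthly_cost: float,
                            tokens_per_second: float,
                            utilisation: float = 1.0) -> float:
    """Dollars per million generated tokens.

    `utilisation` is the fraction of the month the server spends generating
    (1.0 = flat out, 0.25 = busy a quarter of the time).
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilisation
    return monthly_cost / (tokens_per_month / 1_000_000)


# Placeholder figures from the calculator script, not measurements.
for util in (1.0, 0.5, 0.25):
    price = cost_per_million_tokens(7.5, 8, util)
    print(f"utilisation {util:.0%}: ${price:.2f} per million tokens")
```

At full utilisation the placeholder setup works out to about $0.36 per million tokens, and the figure rises in inverse proportion as utilisation drops. Idle time, not the server price, dominates the real cost at low volume.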
Key Takeaways
- Diminishing returns: 2x model size ≠ 2x quality; 2x hardware cost ≠ 2x capacity
- Software optimisation (quantisation, caching, batching) beats hardware upgrades at small scale
- Only upgrade hardware when software optimisation is exhausted
- Cost-per-query is the metric that matters for economic decisions