Day 28: Cost Engineering


Learning Objectives

- Understand the economics of inference (not just performance)
- Learn the Cost-Per-Query (CPQ) framework
- Audit your own Hermes token usage and find savings

Theory (15 min)

The Economics of Inference

For most deployed AI systems, inference rather than training accounts for the bulk of compute cost over the system's lifetime; a commonly quoted estimate is around 80%.

For your VPS, the costs are:
- VPS rental (fixed)
- Electricity (minimal)
- API calls, if using external providers (variable)
- Your time (the most expensive resource)

Cost-Per-Query Framework

CPQ = (tokens_per_query × cost_per_token) + infrastructure_overhead_per_query

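cost_per_token ≈ hardware_cost_per_hour ÷ (tokens_per_second × 3600)

On memory-bound hardware such as a CPU, generation throughput is roughly:

tokens_per_second ≈ memory_bandwidth_bytes_per_second ÷ (model_params × bytes_per_param)

So cost per token rises with model size and precision, and falls with faster memory or cheaper hardware.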

In rough numbers:
- A 3B model at q4_K_M on CPU: ~$0.00001/query (electricity only, excluding VPS rental)
- GPT-4o via API: ~$0.01/query
- That's roughly a 1000x difference per query
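The CPQ formula can be made concrete with a short sketch. The per-query figures come from the list above; the 2,000-token query size, the $10/month VPS rental and the 10,000 queries/month volume are illustrative assumptions, not measurements.

```python
def cost_per_query(tokens_per_query: int,
                   cost_per_token: float,
                   overhead_per_query: float = 0.0) -> float:
    """CPQ = (tokens_per_query * cost_per_token) + infrastructure_overhead_per_query."""
    return tokens_per_query * cost_per_token + overhead_per_query


TOKENS = 2000                 # assumed tokens per query
VPS_MONTHLY = 10.00           # hypothetical VPS rental, USD/month
QUERIES_PER_MONTH = 10_000    # assumed monthly volume

# Per-token prices back-derived from the per-query figures quoted above.
local_per_token = 0.00001 / TOKENS   # electricity only
api_per_token = 0.01 / TOKENS        # paid API

local_marginal = cost_per_query(TOKENS, local_per_token)
api_cpq = cost_per_query(TOKENS, api_per_token)

# Spreading the fixed VPS rental over every query fills in the overhead term.
overhead = VPS_MONTHLY / QUERIES_PER_MONTH
local_full = cost_per_query(TOKENS, local_per_token, overhead)

print(f"local (marginal):   ${local_marginal:.5f}/query")
print(f"local (with VPS):   ${local_full:.5f}/query")
print(f"API:                ${api_cpq:.5f}/query")
print(f"marginal ratio:     {api_cpq / local_marginal:.0f}x")
print(f"ratio with overhead: {api_cpq / local_full:.1f}x")
```

Under these assumptions the overhead term dominates the local cost: the 1000x marginal gap shrinks to about 10x once the rental is counted, which is why the fixed cost belongs in the formula.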

Where Tokens Go

# Typical query breakdown
prompt_tokens = 1500     # system prompt (500) + conversation (800) + RAG context (200)
completion_tokens = 500  # model response
total_tokens = prompt_tokens + completion_tokens  # 2000
cost = total_tokens * (0.15 / 1_000_000)  # $0.15 per 1M tokens, a cheap API
# = $0.0003 per query
# 10,000 queries/month = $3.00
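To see which part of a query is worth trimming, the same breakdown can be costed per component. This is a minimal sketch using the token counts and the $0.15/1M rate from the example above. The trimmed system prompt (500 to 250 tokens) is a hypothetical change, and a single price for input and output tokens is a simplification, since many APIs charge more for output.

```python
PRICE_PER_TOKEN = 0.15 / 1_000_000   # the "cheap API" rate from the example above
QUERIES_PER_MONTH = 10_000

breakdown = {
    "system_prompt": 500,
    "conversation": 800,
    "rag_context": 200,
    "completion": 500,
}


def monthly_cost(tokens_by_part: dict[str, int]) -> float:
    """Monthly cost in USD for a fixed per-query token breakdown."""
    return sum(tokens_by_part.values()) * PRICE_PER_TOKEN * QUERIES_PER_MONTH


baseline = monthly_cost(breakdown)
total = sum(breakdown.values())
for part, tokens in breakdown.items():
    print(f"{part:<14} {tokens:>5} tokens  {tokens / total:>4.0%}")

# Hypothetical optimisation: halve the system prompt.
trimmed = {**breakdown, "system_prompt": 250}
saving = 1 - monthly_cost(trimmed) / baseline
print(f"baseline ${baseline:.2f}/month, trimmed ${monthly_cost(trimmed):.2f}/month, "
      f"saving {saving:.1%}")
```

Halving the system prompt here saves 12.5% of the monthly bill, because the system prompt is a quarter of each query.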

Cost Optimisation Strategies

Strategy	Savings	Effort	Impact on Quality
Shorter system prompt	10-30%	Low	None (if well-written)
Cache frequent queries	20-50%	Medium	None (cached = same)
Semantic cache	30-60%	Medium	Minor (if threshold high)
Smaller model for simple tasks	40-80%	High	Major if done poorly
Batch processing	30-50%	Medium	None (same output)
Quantisation	0-10% (cost)	Already done	~3% quality loss
Rate limiting	Variable	Low	None (prevents abuse)

The biggest lever: sending simple queries to a smaller model.

Model Selection by Task Difficulty

Task Difficulty	Example	Model	Cost Multiplier
Simple	"What's 2+2?"	1.5B q4	0.2x
Medium	"Explain caching"	3B q4	1x (baseline)
Hard	"Write a distributed cache in Rust"	7B q4	3x
Expert	"Design a cache coherence protocol"	14B+ q4 / API	10x

A cost-aware router selects a model by task difficulty. Compared with sending every query to one large model, this can save 50-80%, depending on the query mix.
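A router of this kind can be sketched in a few lines. The tiers and cost multipliers come from the table above. The keyword classifier and the query mix are hypothetical placeholders: a real router would more likely use a small classifier model or measured traffic.

```python
COST_MULTIPLIER = {"simple": 0.2, "medium": 1.0, "hard": 3.0, "expert": 10.0}
MODEL_FOR = {
    "simple": "1.5B q4",
    "medium": "3B q4",
    "hard": "7B q4",
    "expert": "14B+ q4 / API",
}


def classify(query: str) -> str:
    """Toy keyword heuristic (hypothetical); checks the hardest tier first."""
    q = query.lower()
    if any(word in q for word in ("design", "protocol")):
        return "expert"
    if any(word in q for word in ("write", "implement")):
        return "hard"
    if any(word in q for word in ("explain", "summarise", "compare")):
        return "medium"
    return "simple" if len(q.split()) <= 6 else "medium"


def blended_multiplier(mix: dict[str, float]) -> float:
    """Average cost multiplier for a traffic mix whose shares sum to 1."""
    return sum(share * COST_MULTIPLIER[tier] for tier, share in mix.items())


for example in ["What's 2+2?", "Explain caching",
                "Write a distributed cache in Rust",
                "Design a cache coherence protocol"]:
    tier = classify(example)
    print(f"{example!r:<40} -> {tier:<7} ({MODEL_FOR[tier]})")

# Assumed traffic mix, for illustration only.
ASSUMED_MIX = {"simple": 0.50, "medium": 0.30, "hard": 0.15, "expert": 0.05}
routed = blended_multiplier(ASSUMED_MIX)
for tier in ("hard", "expert"):
    saving = 1 - routed / COST_MULTIPLIER[tier]
    print(f"vs always using the {tier} tier: {saving:.1%} saved")
```

With this assumed mix the blended multiplier is 1.35, a saving of 55% against always using the 7B tier and 86.5% against always using the expert tier. The result is sensitive to the mix, so measure real traffic before relying on it.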

Hands-on (15 min)

Audit Your Hermes Token Usage

#!/usr/bin/env python3
"""cost-audit.py — analyse token usage and find savings."""
import json
from collections import defaultdict

# Stub — Ayva will expand with:
# - Parse actual Hermes logs for token usage per command
# - Track usage per profile (hermy, codi, ayva, sira)
# - Calculate cost if using paid API vs local
# - Identify top 5 most token-expensive operations
# - Recommend specific optimisations with projected savings
# - Set up budget alerts (daily/weekly token limits)

# Simulated usage data (replace with real log parsing)
usage_data = [
    {"profile": "hermy", "task": "Daily briefing", "tokens": 12000, "count": 30},
    {"profile": "codi", "task": "Code review", "tokens": 8000, "count": 10},
    {"profile": "ayva", "task": "Research summary", "tokens": 15000, "count": 5},
    {"profile": "sira", "task": "Design review", "tokens": 10000, "count": 3},
    {"profile": "hermy", "task": "Cron jobs", "tokens": 5000, "count": 20},
    {"profile": "tesa", "task": "Quality checks", "tokens": 6000, "count": 8},
]

# Cost calculation (local CPU vs API), both in USD per 1K tokens
LOCAL_COST_PER_1K_TOKENS = 0.00001  # ~$0.01/1M tokens (electricity only, excludes VPS rental)
API_COST_PER_1K_TOKENS = 0.003      # DeepSeek API, assumed ~$0.003/1K; check current pricing

total_local = 0
total_api = 0

print("📊 Token Usage Audit\n")
print(f"{'Profile':<10} {'Task':<20} {'Tokens/mo':<12} {'Local $':<12} {'API $':<12}")
print("-" * 70)

for item in usage_data:
    # "tokens" is tokens per run; "count" is runs per month
    tokens_month = item["tokens"] * item["count"]
    local_cost = tokens_month / 1000 * LOCAL_COST_PER_1K_TOKENS
    api_cost = tokens_month / 1000 * API_COST_PER_1K_TOKENS
    total_local += local_cost
    total_api += api_cost
    # "$" plus an 11-wide number fills the 12-wide cost columns of the header
    print(f"{item['profile']:<10} {item['task']:<20} "
          f"{tokens_month:<12,} ${local_cost:<11.4f} ${api_cost:<11.4f}")

print("-" * 70)
print(f"{'TOTAL':<10} {'':20} {'':12} ${total_local:<11.4f} ${total_api:<11.4f}")

print(f"\n💡 Savings if local: ${total_api - total_local:.2f}/month")
print(f"   ({((total_api - total_local) / total_api * 100):.0f}% cheaper)\n")

# Savings recommendations
print("🔧 Optimisation Recommendations:")
recommendations = [
    ("Shorter system prompts", "10-20%", "Review each profile's system prompt length"),
    ("Semantic cache for briefings", "30-50%", "Daily briefings often repeated"),
    ("Small model for simple cron checks", "50%", "Use 1.5B model for status checks"),
    ("Batch overnight processing", "30%", "Queue non-urgent work to off-peak"),
]
print(f"{'Strategy':<35} {'Savings':<10} {'Notes'}")
print("-" * 70)
for rec in recommendations:
    print(f"{rec[0]:<35} {rec[1]:<10} {rec[2]}")

Questions for Ayva:
- How to parse Hermes logs for actual token usage per request?
- What's the cost-per-day of running qwen2.5-3b 24/7 on CPU (electricity)?
- What's the break-even point for local vs API inference on your VPS?
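The last two questions are simple arithmetic once the inputs are known. The sketch below shows the shape of the calculation only. The per-1K-token rates are the ones used in the audit script, and 693,000 tokens/month is the sum of tokens × count in its simulated data. The $10/month rental, 50 W power draw and $0.15/kWh tariff are hypothetical placeholders to replace with real values.

```python
def electricity_cost_per_day(watts: float, price_per_kwh: float) -> float:
    """Cost in USD of running a constant load for 24 hours."""
    return watts * 24 / 1000 * price_per_kwh


def break_even_tokens(fixed_monthly_cost: float,
                      api_cost_per_1k: float,
                      local_cost_per_1k: float) -> float:
    """Monthly token volume above which local inference beats the API."""
    margin_per_1k = api_cost_per_1k - local_cost_per_1k
    if margin_per_1k <= 0:
        raise ValueError("API is not more expensive per token; no break-even point")
    return fixed_monthly_cost / margin_per_1k * 1000


VPS_MONTHLY = 10.00               # hypothetical rental, USD/month
API_PER_1K = 0.003                # from the audit script
LOCAL_PER_1K = 0.00001            # from the audit script
AUDIT_TOKENS_PER_MONTH = 693_000  # sum of tokens * count in the simulated usage_data

daily_power = electricity_cost_per_day(watts=50, price_per_kwh=0.15)
break_even = break_even_tokens(VPS_MONTHLY, API_PER_1K, LOCAL_PER_1K)

print(f"electricity: ${daily_power:.2f}/day")
print(f"break-even:  {break_even:,.0f} tokens/month")
print(f"audit usage: {AUDIT_TOKENS_PER_MONTH:,} tokens/month "
      f"({AUDIT_TOKENS_PER_MONTH / break_even:.0%} of break-even)")
```

Under these placeholder numbers the simulated workload sits at about a fifth of the break-even volume, so a VPS rented only for inference would not pay for itself. If the VPS is already paid for other reasons, its rental is a sunk cost and only the marginal per-token costs matter.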

Key Takeaways

- Inference cost is often the largest operating expense for AI systems
- Model selection by task difficulty can save 50-80% vs one-size-fits-all, depending on the query mix
- Caching is the cheapest optimisation: a cache hit costs no inference at all
- Local inference (llama.cpp) vs a paid API (OpenAI) is roughly a 1000x per-query cost difference, counting electricity only
- Audit your actual usage before optimising; don't guess
