Day 28: Cost Engineering
Learning Objectives
- Understand the economics of inference (not just performance)
- Learn the Cost-Per-Query (CPQ) framework
- Audit your own Hermes token usage and find savings
Theory (15 min)
The Economics of Inference
For many deployed AI systems, inference, not training, accounts for most of the ongoing compute cost; a figure of around 80% is commonly quoted.
For your VPS, the costs are:
- VPS rental (fixed)
- Electricity (minimal)
- API calls if using external providers (variable)
- Your time: the most expensive resource
Cost-Per-Query Framework
CPQ = (tokens_per_query × cost_per_token) + infrastructure_overhead_per_query
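cost_per_token ≈ hardware_cost_per_hour ÷ (tokens_per_second × 3600)
Here tokens_per_second falls roughly in proportion to the model's size in memory (model_params_in_billions × bytes_per_param) and also drops as context_length grows, so a larger or less aggressively quantised model costs more per token on the same hardware.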
Simplified:
- A 3B model at q4_K_M on CPU: ~$0.00001/query (electricity only)
- GPT-4o via API: ~$0.01/query
- That's a 1000x difference
Where Tokens Go
# Typical query breakdown
prompt_tokens = 1500 # system prompt (500) + conversation (800) + RAG context (200)
completion_tokens = 500 # model response
total_tokens = 2000
cost = total_tokens * (0.15 / 1_000_000) # for a cheap API at $0.15 per 1M tokens
# = $0.0003 per query
# 10,000 queries/month = $3.00
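The breakdown above can be wrapped in a small helper so that different prices and query sizes are easy to compare. This is a minimal sketch: the function names `cost_per_query` and `monthly_cost` are illustrative, and the $0.15 per 1M tokens price is the same example rate used above, not a quote from any particular provider.

```python
def cost_per_query(tokens_per_query: int,
                   cost_per_million_tokens: float,
                   overhead_per_query: float = 0.0) -> float:
    """CPQ = tokens_per_query * cost_per_token + infrastructure overhead."""
    cost_per_token = cost_per_million_tokens / 1_000_000
    return tokens_per_query * cost_per_token + overhead_per_query


def monthly_cost(cpq: float, queries_per_month: int) -> float:
    """Scale a per-query cost up to a monthly figure."""
    return cpq * queries_per_month


# Same numbers as the breakdown: 1500 prompt + 500 completion tokens.
prompt_tokens = 1500
completion_tokens = 500
cpq = cost_per_query(prompt_tokens + completion_tokens, 0.15)

print(f"Cost per query: ${cpq:.4f}")
print(f"10,000 queries/month: ${monthly_cost(cpq, 10_000):.2f}")
```

Passing a non-zero `overhead_per_query` is one way to spread a fixed cost, such as VPS rental, across the expected number of queries.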
Cost Optimisation Strategies
| Strategy | Savings | Effort | Impact on Quality |
|---|---|---|---|
| Shorter system prompt | 10-30% | Low | None (if well-written) |
| Cache frequent queries | 20-50% | Medium | None (cached = same) |
| Semantic cache | 30-60% | Medium | Minor (if threshold high) |
| Smaller model for simple tasks | 40-80% | High | Major if done poorly |
| Batch processing | 30-50% | Medium | None (same output) |
| Quantisation | 0-10% (cost) | Already done | ~3% quality loss |
| Rate limiting | Variable | Low | None (prevents abuse) |
The biggest lever: sending simple queries to a smaller model.
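Of the strategies in the table, an exact-match cache is the simplest to sketch. The example below is an illustration under stated assumptions, not Hermes code: `QueryCache` and `fake_model` are hypothetical names, and `fake_model` stands in for a real inference call. Prompts are normalised (lower-cased, whitespace collapsed) before hashing, so trivially different spellings of the same question share one entry.

```python
import hashlib


class QueryCache:
    """Exact-match response cache keyed on a hash of the normalised prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lower-case and collapse whitespace so near-identical prompts match.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt, compute):
        """Return the cached response, or call compute(prompt) and store it."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1          # a hit spends no tokens
            return self._store[key]
        self.misses += 1            # a miss pays for one inference call
        response = compute(prompt)
        self._store[key] = response
        return response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real inference call."""
    return f"answer to: {prompt}"


cache = QueryCache()
for q in ["What is caching?", "what is  caching?", "Explain RAG", "What is caching?"]:
    cache.get_or_compute(q, fake_model)

print(f"hits={cache.hits} misses={cache.misses} hit_rate={cache.hit_rate:.0%}")
```

A semantic cache follows the same shape but replaces the hash lookup with a nearest-neighbour search over embeddings and a similarity threshold, which is where the "minor" quality impact in the table comes from.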
Model Selection by Task Difficulty
| Task Difficulty | Example | Model | Cost Multiplier |
|---|---|---|---|
| Simple | "What's 2+2?" | 1.5B q4 | 0.2x |
| Medium | "Explain caching" | 3B q4 | 1x (baseline) |
| Hard | "Write a distributed cache in Rust" | 7B q4 | 3x |
| Expert | "Design a cache coherence protocol" | 14B+ q4 / API | 10x |
A cost-aware router selects a model by task difficulty, saving 50-80% compared with sending every query to one large model.
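A router of this kind can be sketched in a few lines. The model tiers and cost multipliers below are taken from the table; everything else is an assumption made for illustration. In particular, `classify_difficulty` is a crude keyword heuristic invented for this example (a real router might use a small classifier model instead), and the baseline assumes every query would otherwise go to the expert tier.

```python
# Model tier and cost multiplier per difficulty level (values from the table).
MODEL_BY_DIFFICULTY = {
    "simple": ("1.5B q4", 0.2),
    "medium": ("3B q4", 1.0),
    "hard": ("7B q4", 3.0),
    "expert": ("14B+ q4 / API", 10.0),
}


def classify_difficulty(query: str) -> str:
    """Hypothetical keyword heuristic; checks the hardest tier first."""
    q = query.lower()
    if any(word in q for word in ("design", "protocol")):
        return "expert"
    if any(word in q for word in ("write", "implement", "refactor")):
        return "hard"
    if any(word in q for word in ("explain", "summarise", "compare")):
        return "medium"
    # No keyword matched: treat short queries as simple, longer ones as medium.
    return "simple" if len(q.split()) <= 5 else "medium"


def route(query: str) -> str:
    """Return the model name chosen for this query."""
    return MODEL_BY_DIFFICULTY[classify_difficulty(query)][0]


def routed_vs_baseline(queries, baseline_level="expert"):
    """Total cost multiplier with routing vs sending everything to one tier."""
    routed = sum(MODEL_BY_DIFFICULTY[classify_difficulty(q)][1] for q in queries)
    baseline = len(queries) * MODEL_BY_DIFFICULTY[baseline_level][1]
    return routed, baseline


EXAMPLES = [
    "What's 2+2?",
    "Explain caching",
    "Write a distributed cache in Rust",
    "Design a cache coherence protocol",
]

for query in EXAMPLES:
    print(f"{query!r:<40} -> {route(query)}")

routed, baseline = routed_vs_baseline(EXAMPLES)
print(f"Routed: {routed:.1f}x  Baseline: {baseline:.1f}x  "
      f"Saving: {1 - routed / baseline:.1%}")
```

With one query per tier, the routed total is 14.2x against a 40x baseline, a saving of roughly 64%. Real traffic is usually skewed towards simple queries, which pushes the saving higher; misclassifying a hard query as simple is the failure mode behind the "major if done poorly" entry in the strategy table.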
Hands-on (15 min)
Audit Your Hermes Token Usage
#!/usr/bin/env python3
"""cost-audit.py: analyse token usage and find savings."""
import json
from collections import defaultdict
# Stub: Ayva will expand with:
# - Parse actual Hermes logs for token usage per command
# - Track usage per profile (hermy, codi, ayva, sira, tesa)
# - Calculate cost if using paid API vs local
# - Identify top 5 most token-expensive operations
# - Recommend specific optimisations with projected savings
# - Set up budget alerts (daily/weekly token limits)
# Simulated usage data (replace with real log parsing)
usage_data = [
    {"profile": "hermy", "task": "Daily briefing", "tokens": 12000, "count": 30},
    {"profile": "codi", "task": "Code review", "tokens": 8000, "count": 10},
    {"profile": "ayva", "task": "Research summary", "tokens": 15000, "count": 5},
    {"profile": "sira", "task": "Design review", "tokens": 10000, "count": 3},
    {"profile": "hermy", "task": "Cron jobs", "tokens": 5000, "count": 20},
    {"profile": "tesa", "task": "Quality checks", "tokens": 6000, "count": 8},
]
# Cost calculation (local CPU vs API)
LOCAL_COST_PER_1K_TOKENS = 0.00001 # ~$0.01/1M tokens (electricity)
API_COST_PER_1K_TOKENS = 0.003 # DeepSeek API ~$0.003/1K
total_local = 0
total_api = 0
print("📊 Token Usage Audit\n")
print(f"{'Profile':<10} {'Task':<20} {'Tokens/mo':<12} {'Local $':<12} {'API $':<12}")
print("-" * 66)
for item in usage_data:
    # Monthly tokens = tokens per run * number of runs
    tokens_month = item["tokens"] * item["count"]
    local_cost = tokens_month / 1000 * LOCAL_COST_PER_1K_TOKENS
    api_cost = tokens_month / 1000 * API_COST_PER_1K_TOKENS
    total_local += local_cost
    total_api += api_cost
    # "$" plus an 11-wide number matches the 12-wide column headers
    print(f"{item['profile']:<10} {item['task']:<20} "
          f"{tokens_month:<12,} ${local_cost:<11.4f} ${api_cost:<11.4f}")
print("-" * 66)
print(f"{'TOTAL':<10} {'':20} {'':12} ${total_local:<11.4f} ${total_api:<11.4f}")
print(f"\n💡 Savings if local: ${total_api - total_local:.2f}/month")
print(f" ({((total_api - total_local) / total_api * 100):.0f}% cheaper)\n")
# Savings recommendations
print("🔧 Optimisation Recommendations:")
recommendations = [
("Shorter system prompts", "10-20%", "Review each profile's system prompt length"),
("Semantic cache for briefings", "30-50%", "Daily briefings often repeated"),
("Small model for simple cron checks", "50%", "Use 1.5B model for status checks"),
("Batch overnight processing", "30%", "Queue non-urgent work to off-peak"),
]
print(f"{'Strategy':<35} {'Savings':<10} {'Notes'}")
print("-" * 70)
for rec in recommendations:
    print(f"{rec[0]:<35} {rec[1]:<10} {rec[2]}")
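Two of the items in the stub's to-do list, ranking the most token-expensive operations and raising budget alerts, can be sketched on the same simulated data. This is only an illustration: the helper names (`top_operations`, `tokens_by_profile`, `over_budget`) and the 100,000-token monthly limit are assumptions, and the data is the simulated sample from the script, not real Hermes logs.

```python
from collections import defaultdict

# Same simulated records as cost-audit.py (replace with real log parsing).
usage_data = [
    {"profile": "hermy", "task": "Daily briefing", "tokens": 12000, "count": 30},
    {"profile": "codi", "task": "Code review", "tokens": 8000, "count": 10},
    {"profile": "ayva", "task": "Research summary", "tokens": 15000, "count": 5},
    {"profile": "sira", "task": "Design review", "tokens": 10000, "count": 3},
    {"profile": "hermy", "task": "Cron jobs", "tokens": 5000, "count": 20},
    {"profile": "tesa", "task": "Quality checks", "tokens": 6000, "count": 8},
]


def monthly_tokens(item) -> int:
    """Tokens per run multiplied by the number of runs."""
    return item["tokens"] * item["count"]


def top_operations(data, n=5):
    """Return the n most token-expensive (profile, task, tokens) tuples."""
    ranked = sorted(data, key=monthly_tokens, reverse=True)
    return [(i["profile"], i["task"], monthly_tokens(i)) for i in ranked[:n]]


def tokens_by_profile(data):
    """Sum monthly tokens for each profile."""
    totals = defaultdict(int)
    for item in data:
        totals[item["profile"]] += monthly_tokens(item)
    return dict(totals)


def over_budget(data, monthly_limit):
    """Return the profiles whose monthly tokens exceed the limit."""
    return {p: t for p, t in tokens_by_profile(data).items() if t > monthly_limit}


MONTHLY_LIMIT = 100_000  # hypothetical per-profile budget

print("Top operations by monthly tokens:")
for profile, task, tokens in top_operations(usage_data):
    print(f"  {profile:<8} {task:<20} {tokens:>10,}")

for profile, tokens in over_budget(usage_data, MONTHLY_LIMIT).items():
    print(f"ALERT: {profile} used {tokens:,} tokens (limit {MONTHLY_LIMIT:,})")
```

On the sample data only `hermy` exceeds the limit, because the daily briefing and cron jobs together account for about two thirds of all tokens. That makes those two tasks the natural first targets for the recommendations above.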
Questions for Ayva:
- How to parse Hermes logs for actual token usage per request?
- What's the cost-per-day of running qwen2.5-3b 24/7 on CPU (electricity)?
- What's the break-even point for local vs API inference on your VPS?
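The last two questions reduce to simple arithmetic once the inputs are known. The sketch below shows the shape of the calculation only: the 50 W average draw, $0.15/kWh electricity price and $10/month VPS rental are placeholder assumptions to be replaced with measured values, while the per-1K-token prices are the ones used in the audit script.

```python
def electricity_cost_per_day(avg_watts: float, price_per_kwh: float) -> float:
    """Cost of running a machine 24/7 for one day at a constant average draw."""
    kwh_per_day = avg_watts / 1000 * 24
    return kwh_per_day * price_per_kwh


def break_even_tokens_per_month(fixed_monthly_cost: float,
                                api_cost_per_1k: float,
                                local_cost_per_1k: float) -> float:
    """Monthly token volume above which local inference beats the API.

    Local total = fixed_monthly_cost + tokens / 1000 * local_cost_per_1k
    API total   = tokens / 1000 * api_cost_per_1k
    """
    saving_per_1k = api_cost_per_1k - local_cost_per_1k
    if saving_per_1k <= 0:
        raise ValueError("API is not more expensive per token; no break-even")
    return fixed_monthly_cost / saving_per_1k * 1000


# Placeholder inputs (assumptions, not measurements).
AVG_WATTS = 50.0
PRICE_PER_KWH = 0.15
VPS_MONTHLY_COST = 10.0

# Per-1K-token prices from the audit script.
LOCAL_COST_PER_1K_TOKENS = 0.00001
API_COST_PER_1K_TOKENS = 0.003

per_day = electricity_cost_per_day(AVG_WATTS, PRICE_PER_KWH)
break_even = break_even_tokens_per_month(
    VPS_MONTHLY_COST, API_COST_PER_1K_TOKENS, LOCAL_COST_PER_1K_TOKENS
)

print(f"Electricity: ${per_day:.2f}/day")
print(f"Break-even: {break_even:,.0f} tokens/month")
```

With these placeholder numbers the break-even sits at roughly 3.3 million tokens per month, well above the 693,000 tokens in the simulated audit. If the VPS were rented only for inference, the API would be cheaper at that volume; if the VPS is needed anyway, its rental is a sunk cost and only the marginal per-token cost matters.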
Key Takeaways
- Inference cost is often the largest operating expense for AI systems
- Model selection by task difficulty saves 50-80% vs one-size-fits-all
- Cache is the cheapest optimisation: zero cost on hits
- Local inference (llama.cpp) vs API (OpenAI) is roughly a 1000x cost difference per query (~$0.00001 vs ~$0.01 in the estimates above)
- Audit your actual usage before optimising; don't guess