🧠 AI System Design

Day 28: Cost Engineering

📂 Production & Case Studies · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand the economics of inference (not just performance)
  • Learn the Cost-Per-Query (CPQ) framework
  • Audit your own Hermes token usage and find savings

Theory (15 min)

The Economics of Inference

For most deployed AI systems, inference rather than training accounts for the bulk of running costs; around 80% is a commonly quoted rule of thumb.

For your VPS, the costs are:

  • VPS rental (fixed)
  • Electricity (minimal)
  • API calls, if using external providers (variable)
  • Your time, the most expensive resource

Cost-Per-Query Framework

CPQ = (tokens_per_query × cost_per_token) + infrastructure_overhead_per_query

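For self-hosted inference, cost_per_token ≈ hardware_cost_per_hour ÷ tokens_generated_per_hour. Throughput falls roughly in proportion to the model's size in memory (model_params_in_billions × bytes_per_param), so a larger or less aggressively quantised model costs more per token on the same hardware.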

Simplified:

  • A 3B model at q4_K_M on CPU: ~$0.00001/query (electricity only)
  • GPT-4o via API: ~$0.01/query

That's a 1000x difference.
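The CPQ formula can be turned into a few lines of Python. This is a minimal sketch, not a definitive cost model: the function names `cost_per_query` and `amortised_overhead` are hypothetical, and the $10/month VPS bill and 10,000 queries/month are assumed inputs chosen only for illustration.

```python
def amortised_overhead(fixed_monthly_cost: float, queries_per_month: int) -> float:
    """Spread a fixed monthly bill (e.g. VPS rental) across the month's queries."""
    if queries_per_month <= 0:
        raise ValueError("queries_per_month must be positive")
    return fixed_monthly_cost / queries_per_month


def cost_per_query(tokens_per_query: int,
                   price_per_million_tokens: float,
                   overhead_per_query: float = 0.0) -> float:
    """CPQ = (tokens_per_query * cost_per_token) + infrastructure overhead per query."""
    cost_per_token = price_per_million_tokens / 1_000_000
    return tokens_per_query * cost_per_token + overhead_per_query


if __name__ == "__main__":
    # Assumed inputs: 2,000 tokens/query at $0.15 per 1M tokens,
    # a $10/month VPS, 10,000 queries/month.
    overhead = amortised_overhead(10.0, 10_000)
    cpq = cost_per_query(2_000, 0.15, overhead)
    print(f"overhead/query: ${overhead:.4f}")
    print(f"CPQ:            ${cpq:.4f}")
```

With these assumed numbers the fixed overhead per query is larger than the token cost, which is why low-traffic systems are often dominated by infrastructure rather than tokens.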

Where Tokens Go

# Typical query breakdown
prompt_tokens = 1500    # system prompt (500) + conversation (800) + RAG context (200)
completion_tokens = 500 # model response
total_tokens = 2000
cost = total_tokens * (0.15 / 1_000_000)  # for a cheap API at $0.15 per 1M tokens
# = $0.0003 per query
# 10,000 queries/month = $3.00
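To see which part of a query is worth trimming, the breakdown can be split by component. The sketch below reuses the token counts and the $0.15 per 1M rate from the breakdown; the helper names (`query_cost`, `token_shares`, `saving_from_trim`) are hypothetical. It charges every token at one rate, as the breakdown does, although many APIs price input and output tokens differently.

```python
PRICE_PER_MILLION = 0.15  # USD per 1M tokens; the "cheap API" rate assumed in the breakdown

BASELINE = {
    "system_prompt": 500,
    "conversation": 800,
    "rag_context": 200,
    "completion": 500,
}


def query_cost(components: dict[str, int],
               price_per_million: float = PRICE_PER_MILLION) -> float:
    """Cost of one query in USD, charging every token at the same rate."""
    return sum(components.values()) * price_per_million / 1_000_000


def token_shares(components: dict[str, int]) -> dict[str, float]:
    """Fraction of the query's tokens used by each component."""
    total = sum(components.values())
    return {name: tokens / total for name, tokens in components.items()}


def saving_from_trim(components: dict[str, int], name: str, new_tokens: int) -> float:
    """Relative per-query saving if one component is cut to new_tokens."""
    trimmed = {**components, name: new_tokens}
    before = query_cost(components)
    return (before - query_cost(trimmed)) / before


if __name__ == "__main__":
    for name, share in token_shares(BASELINE).items():
        print(f"{name:<14} {share:.0%}")
    saving = saving_from_trim(BASELINE, "system_prompt", 250)
    print(f"Halving the system prompt saves {saving:.1%} per query")
```

Halving the 500-token system prompt removes 250 of 2,000 tokens, a 12.5% saving per query.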

Cost Optimisation Strategies

| Strategy | Savings | Effort | Impact on Quality |
|---|---|---|---|
| Shorter system prompt | 10-30% | Low | None (if well-written) |
| Cache frequent queries | 20-50% | Medium | None (cached = same) |
| Semantic cache | 30-60% | Medium | Minor (if threshold high) |
| Smaller model for simple tasks | 40-80% | High | Major if done poorly |
| Batch processing | 30-50% | Medium | None (same output) |
| Quantisation | 0-10% (cost) | Already done | ~3% quality loss |
| Rate limiting | Variable | Low | None (prevents abuse) |
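The "cache frequent queries" strategy can be sketched as an exact-match cache wrapped around the model call. This is a minimal illustration under stated assumptions: `QueryCache` and `fake_model` are hypothetical names, the normalisation (lower-casing and collapsing whitespace) is one simple choice among many, and a real deployment would also need eviction and expiry. A semantic cache would replace the hash lookup with an embedding similarity search, which is not shown here.

```python
import hashlib
from typing import Callable


class QueryCache:
    """Exact-match cache keyed on a normalised prompt."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate          # the expensive model call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lower-case and collapse whitespace so trivial variants share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1                 # no tokens spent on a hit
            return self._store[key]
        self.misses += 1
        answer = self._generate(prompt)
        self._store[key] = answer
        return answer

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a real inference call (hypothetical)."""
    return f"answer to: {prompt}"


if __name__ == "__main__":
    cache = QueryCache(fake_model)
    for p in ["What is caching?", "what is  caching?", "Explain RAG", "What is caching?"]:
        cache.ask(p)
    print(f"hit rate: {cache.hit_rate:.0%}")
```

The measured hit rate is an upper bound on the token saving from this cache, since every hit skips one model call.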

The biggest lever: sending simple queries to a smaller model.

Model Selection by Task Difficulty

| Task Difficulty | Example | Model | Cost Multiplier |
|---|---|---|---|
| Simple | "What's 2+2?" | 1.5B q4 | 0.2x |
| Medium | "Explain caching" | 3B q4 | 1x (baseline) |
| Hard | "Write a distributed cache in Rust" | 7B q4 | 3x |
| Expert | "Design a cache coherence protocol" | 14B+ q4 / API | 10x |

A cost-aware router selects the model by task difficulty, which can save 50-80% compared with sending every query to the largest model.
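A cost-aware router can be sketched in a few lines. The tiers and cost multipliers below come from the table; everything else is an assumption made for illustration: the keyword lists, the five-word threshold, and the names `classify`, `route` and `blended_multiplier` are hypothetical, and a production router would more likely use a trained classifier or a small model to judge difficulty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str
    cost_multiplier: float  # relative to the 3B q4 baseline


TIERS = {
    "simple": Tier("simple", "1.5B q4", 0.2),
    "medium": Tier("medium", "3B q4", 1.0),
    "hard": Tier("hard", "7B q4", 3.0),
    "expert": Tier("expert", "14B+ q4 / API", 10.0),
}

# Hypothetical keyword hints; crude substring matching, for illustration only.
EXPERT_HINTS = ("design", "protocol", "architecture")
HARD_HINTS = ("write", "implement", "refactor", "debug")
MEDIUM_HINTS = ("explain", "summarise", "summarize", "compare")


def classify(query: str) -> str:
    """Guess task difficulty from keywords, falling back to query length."""
    text = query.lower()
    if any(hint in text for hint in EXPERT_HINTS):
        return "expert"
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    if any(hint in text for hint in MEDIUM_HINTS):
        return "medium"
    # Short queries with no hints go to the smallest model; the rest to baseline.
    return "simple" if len(text.split()) <= 5 else "medium"


def route(query: str) -> Tier:
    return TIERS[classify(query)]


def blended_multiplier(queries: list[str]) -> float:
    """Average cost multiplier across a workload after routing."""
    return sum(route(q).cost_multiplier for q in queries) / len(queries)


if __name__ == "__main__":
    workload = [
        "What's 2+2?",
        "Explain caching",
        "Write a distributed cache in Rust",
        "Design a cache coherence protocol",
    ]
    for q in workload:
        print(f"{q!r:45} -> {route(q).model}")
    blended = blended_multiplier(workload)
    saving = 1 - blended / TIERS["expert"].cost_multiplier
    print(f"blended multiplier {blended:.2f}, saving vs always-expert {saving:.1%}")
```

On the four example queries from the table, the blended multiplier is 3.55 against 10 for sending everything to the expert tier, a saving of about 64%. The real figure depends entirely on the mix of queries in your workload.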


Hands-on (15 min)

Audit Your Hermes Token Usage

#!/usr/bin/env python3
"""cost-audit.py: analyse token usage and find savings."""
import json
from collections import defaultdict

# Stub. Ayva will expand with:
# - Parse actual Hermes logs for token usage per command
# - Track usage per profile (hermy, codi, ayva, sira)
# - Calculate cost if using paid API vs local
# - Identify top 5 most token-expensive operations
# - Recommend specific optimisations with projected savings
# - Set up budget alerts (daily/weekly token limits)

# Simulated usage data (replace with real log parsing)
usage_data = [
    {"profile": "hermy", "task": "Daily briefing", "tokens": 12000, "count": 30},
    {"profile": "codi", "task": "Code review", "tokens": 8000, "count": 10},
    {"profile": "ayva", "task": "Research summary", "tokens": 15000, "count": 5},
    {"profile": "sira", "task": "Design review", "tokens": 10000, "count": 3},
    {"profile": "hermy", "task": "Cron jobs", "tokens": 5000, "count": 20},
    {"profile": "tesa", "task": "Quality checks", "tokens": 6000, "count": 8},
]

# Cost calculation (local CPU vs API)
LOCAL_COST_PER_1K_TOKENS = 0.00001  # ~$0.01/1M tokens (electricity)
API_COST_PER_1K_TOKENS = 0.003      # DeepSeek API ~$0.003/1K

total_local = 0
total_api = 0

print("📊 Token Usage Audit\n")
print(f"{'Profile':<10} {'Task':<20} {'Tokens/mo':<12} {'Local $':<12} {'API $':<12}")
print("-" * 66)

for item in usage_data:
    tokens_month = item["tokens"] * item["count"]
    local_cost = tokens_month / 1000 * LOCAL_COST_PER_1K_TOKENS
    api_cost = tokens_month / 1000 * API_COST_PER_1K_TOKENS
    total_local += local_cost
    total_api += api_cost
    print(f"{item['profile']:<10} {item['task']:<20} "
          f"{tokens_month:<12,} ${local_cost:<10.4f} ${api_cost:<10.4f}")

print("-" * 66)
print(f"{'TOTAL':<10} {'':20} {'':12} ${total_local:<10.4f} ${total_api:<10.4f}")

print(f"\n💡 Savings if local: ${total_api - total_local:.2f}/month")
print(f"   ({((total_api - total_local) / total_api * 100):.0f}% cheaper)\n")

# Savings recommendations
print("🔧 Optimisation Recommendations:")
recommendations = [
    ("Shorter system prompts", "10-20%", "Review each profile's system prompt length"),
    ("Semantic cache for briefings", "30-50%", "Daily briefings often repeated"),
    ("Small model for simple cron checks", "50%", "Use 1.5B model for status checks"),
    ("Batch overnight processing", "30%", "Queue non-urgent work to off-peak"),
]
print(f"{'Strategy':<35} {'Savings':<10} {'Notes'}")
print("-" * 70)
for rec in recommendations:
    print(f"{rec[0]:<35} {rec[1]:<10} {rec[2]}")

Questions for Ayva:

  • How to parse Hermes logs for actual token usage per request?
  • What's the cost-per-day of running qwen2.5-3b 24/7 on CPU (electricity)?
  • What's the break-even point for local vs API inference on your VPS?
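The break-even question reduces to simple arithmetic once the inputs are known. The sketch below assumes local inference has a fixed monthly cost plus a small per-query cost, while the API has only a per-query cost; `break_even_queries` is a hypothetical helper, the per-query figures are the rough ones quoted in the theory section, and the $10/month VPS bill is an invented placeholder to replace with your own.

```python
import math
from typing import Optional


def break_even_queries(fixed_monthly_cost: float,
                       api_cost_per_query: float,
                       local_cost_per_query: float) -> Optional[int]:
    """Smallest monthly query count at which local is no more expensive than the API.

    Local total = fixed_monthly_cost + n * local_cost_per_query
    API total   = n * api_cost_per_query
    Returns None when the API is not more expensive per query,
    because local inference then never catches up.
    """
    margin = api_cost_per_query - local_cost_per_query
    if margin <= 0:
        return None
    return math.ceil(fixed_monthly_cost / margin)


if __name__ == "__main__":
    # Assumed: $10/month VPS, ~$0.01/query via API, ~$0.00001/query locally.
    n = break_even_queries(10.0, 0.01, 0.00001)
    print(f"Local inference breaks even at about {n} queries/month")
```

With these assumed inputs the break-even sits at roughly a thousand queries a month. Against a cheaper API the margin per query shrinks and the break-even volume rises accordingly.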


Key Takeaways

  • Inference cost is often the largest operating expense for AI systems
  • Model selection by task difficulty saves 50-80% vs one-size-fits-all
  • Caching is the cheapest optimisation: a cache hit costs no tokens
  • Local inference (llama.cpp) vs API (OpenAI) is roughly a 1000x difference in per-query cost, counting electricity only on the local side
  • Audit your actual usage before optimising โ€” don't guess

References