Day 1: ML System Blueprint

📂 Foundations 📖 15 min read

Learning Objectives

Understand the high-level architecture of any ML-powered system
Identify the three pillars: compute, storage, and orchestration
Map a real system (Hermes + llama.cpp) onto the blueprint

Theory (15 min)

Every AI system — from ChatGPT to a simple RAG pipeline — follows the same fundamental blueprint:

┌──────────┐    ┌──────────┐    ┌──────────┐
│  DATA    │───▶│  TRAIN   │───▶│  SERVE   │
│  PIPELINE│    │  (or     │    │  INFER   │
│          │    │   LOAD)  │    │          │
└──────────┘    └──────────┘    └──────────┘
     │               │               │
     ▼               ▼               ▼
┌─────────────────────────────────────────┐
│           MONITOR · OBSERVE             │
└─────────────────────────────────────────┘

The Three Pillars

1. Compute: the hardware that runs your models
   - CPUs: general-purpose, good for pre/post-processing
   - GPUs: parallel matrix ops, the engine of deep learning
   - NPUs/TPUs: specialised accelerators (Google TPU, Apple Neural Engine)
   - On your VPS: CPU-only with llama.cpp (which is surprisingly capable)

2. Storage: where data, models, and state live
   - Object store (S3/MinIO): models, datasets, checkpoints
   - Relational DB (Postgres): users, logs, metadata
   - Vector DB (Qdrant/Pinecone): embeddings for retrieval
   - Cache (Redis): conversation state, KV-cache, frequent queries
   - On your VPS: local SSD, possibly MinIO in Docker

3. Orchestration: how everything talks to everything else
   - API gateway: request routing, auth, rate limiting
   - Message queue: async job dispatch (RabbitMQ, Redis Streams)
   - Container orchestration: Kubernetes, Docker Compose, Nomad
   - On your VPS: Docker Compose + a simple API proxy

The Inference Loop (where you'll spend most of your time)

Client → [Gateway] → [Rate Limiter] → [Load Balancer] → [Inference Server]
                                                              │
                                                    ┌─────────▼────────┐
                                                    │  Tokeniser →     │
                                                    │  Model →         │
                                                    │  Detokeniser     │
                                                    └─────────┬────────┘
                                                              │
Client ← [Gateway] ← [Cache check] ←─────────────────────────┘

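Key insight: end-to-end latency is usually dominated by the model forward pass, which runs once per generated token in autoregressive decoding, not by the plumbing. Good architecture keeps the plumbing cheap by minimising round-trips between components, skips the model entirely where it can by caching, and parallelises independent work.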
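
One rough way to check this claim on your own setup is to split a single request's wall time using curl's built-in timers. The sketch below assumes an OpenAI-style completions endpoint like the one used in the hands-on section; the helper names `probe` and `breakdown` are made up for this example. Everything after the TCP connect is attributed to the server, which lumps HTTP handling and tokenisation together with the forward pass, so treat the share as an upper bound on model time.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: split request time into "connect" and "everything after".

# probe URL JSON_BODY -> prints "connect_seconds total_seconds"
probe() {
  curl -s -o /dev/null \
    -H "Content-Type: application/json" \
    -d "$2" \
    -w '%{time_connect} %{time_total}\n' \
    "$1"
}

# breakdown: reads "connect total" lines on stdin and prints where the time went
breakdown() {
  awk '$2 > 0 {
    after = $2 - $1
    printf "connect=%.3fs after_connect=%.3fs share=%.0f%%\n", $1, after, 100 * after / $2
  }'
}
```

Usage, assuming the server listens on port 8080: `probe http://localhost:8080/v1/completions '{"prompt":"Hello","max_tokens":10}' | breakdown`. If the share stays close to 100% even for short generations, the plumbing is not your problem yet.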

Architecture vs Infrastructure

Architecture (what you design)	Infrastructure (how you run it)
Request flow, data paths	Hardware, networking
Caching strategy	Redis cluster config
Model serving topology	Docker/K8s manifests
Error handling, fallbacks	Monitoring, alerting

This course focuses on architecture. You'll touch infrastructure enough to make it real.

Hands-on (15 min)

Map Your Own Stack

Open a terminal and inspect your running Hermes + llama.cpp setup so that you can draw its architecture.

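# 1. Check what's running (containers first)
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
# If Hermes runs as a bare process rather than in Docker:
ps aux | grep -E 'hermes|llama' | grep -v grep

# 2. Check which of the likely ports are listening
#    (-p shows the owning process; use sudo to see processes you don't own)
ss -tlnp | grep -E ':(3000|5000|8080|11434|11435)\b'

# 3. Check storage (adjust the path if your data lives elsewhere)
df -h /opt/data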
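
The three checks above cover processes, ports, and storage, but not the compute pillar. A minimal sketch for that follows, assuming a Linux host (it reads /proc/meminfo); the function name `compute_summary` is made up for this example.

```shell
#!/usr/bin/env bash
# 4. Check compute (hypothetical helper; assumes Linux for the memory figure)
compute_summary() {
  local cores mem_gib
  # Number of CPU cores available to this process
  cores=$(nproc)
  # Total RAM in GiB; MemTotal in /proc/meminfo is reported in kB
  if [ -r /proc/meminfo ]; then
    mem_gib=$(awk '/^MemTotal:/ { printf "%.1f", $2 / 1048576 }' /proc/meminfo)
  else
    mem_gib="unknown"
  fi
  echo "cores=${cores} mem_gib=${mem_gib}"
}

compute_summary
```

On a CPU-only host the model weights have to fit in RAM alongside everything else you run, so these two numbers roughly bound which models you can serve.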

Now draw (on paper or in a file) the architecture:

Client → your Telegram/terminal
Gateway → Hermes agent profile
Inference → llama.cpp server (or whatever provider)
Storage → filesystem, any DBs you run
Orchestration → Docker Compose / bare processes

Save it as a reference — you'll come back to it on Day 30 and see how much more you understand.

# Bonus: Quick latency check
time curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello","max_tokens":10}' > /dev/null
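
A single `time curl` run is noisy, and the first request may include one-off warm-up cost. A hedged sketch that repeats the request and summarises the timings is shown below; `bench` and `latency_summary` are illustrative names, and the endpoint and payload are the same ones assumed in the check above.

```shell
#!/usr/bin/env bash
# latency_summary: reads one duration in seconds per line, prints count/min/mean/max
latency_summary() {
  awk 'NR == 1 { min = max = $1 }
       { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
       END { if (NR > 0) printf "n=%d min=%.3f mean=%.3f max=%.3f\n", NR, min, sum / NR, max }'
}

# bench N URL JSON_BODY: send N identical requests, print one total time per line
bench() {
  local i
  for i in $(seq "$1"); do
    curl -s -o /dev/null \
      -H "Content-Type: application/json" \
      -d "$3" \
      -w '%{time_total}\n' \
      "$2"
  done
}
```

Usage: `bench 5 http://localhost:8080/v1/completions '{"prompt":"Hello","max_tokens":10}' | latency_summary`. Identical prompts may benefit from prompt caching on the server, so vary the prompt if you want cold numbers.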

Question to answer after this exercise: If traffic doubled, where would this system break first?
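
One way to start answering this is back-of-envelope arithmetic: if the server handles a fixed number of requests at once and each takes roughly the same time, capacity is concurrency divided by time per request. The sketch below uses that simplification; the function name `utilisation` and the numbers in the usage line are hypothetical, so substitute your own measurements.

```shell
#!/usr/bin/env bash
# utilisation SECS_PER_REQ CONCURRENCY REQS_PER_MIN
# Prints capacity (requests/min) and the fraction of it the given load uses.
# Simplifying assumption: every request takes SECS_PER_REQ and slots never sit idle.
utilisation() {
  awk -v s="$1" -v c="$2" -v r="$3" 'BEGIN {
    capacity = 60 * c / s
    printf "capacity=%.1f/min load=%.0f%%\n", capacity, 100 * r / capacity
  }'
}
```

With made-up numbers, `utilisation 2 1 15` prints `capacity=30.0/min load=50%`, and doubling the traffic with `utilisation 2 1 30` prints `capacity=30.0/min load=100%`. Once load approaches 100%, requests queue behind the model; if that happens before disk, memory, or the gateway saturates, the inference server is the first thing to break.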

Key Takeaways

All AI systems share the same core blueprint: Data → Train (or Load) → Serve, with monitoring across every stage
The three pillars (compute, storage, orchestration) constrain every architectural decision
Architecture is how the pieces connect — infrastructure is what runs them
Latency is dominated by the model, not the plumbing — so optimise the model path first
