Day 1: ML System Blueprint
Learning Objectives
- Understand the high-level architecture of any ML-powered system
- Identify the three pillars: compute, storage, and orchestration
- Map a real system (Hermes + llama.cpp) onto the blueprint
Theory (15 min)
Every AI system, from ChatGPT to a simple RAG pipeline, follows the same fundamental blueprint:
┌──────────┐    ┌──────────┐    ┌──────────┐
│   DATA   │───▶│  TRAIN   │───▶│  SERVE   │
│ PIPELINE │    │   (or    │    │  INFER   │
│          │    │  LOAD)   │    │          │
└──────────┘    └──────────┘    └──────────┘
     │               │               │
     ▼               ▼               ▼
┌──────────────────────────────────────────┐
│            MONITOR · OBSERVE             │
└──────────────────────────────────────────┘
The Three Pillars
1. Compute: the hardware that runs your models
   - CPUs: general-purpose, good for pre/post-processing
   - GPUs: parallel matrix ops, the engine of deep learning
   - NPUs/TPUs: specialised accelerators (Google TPU, Apple Neural Engine)
   - On your VPS: CPU-only with llama.cpp (which is surprisingly capable)
2. Storage: where data, models, and state live
   - Object store (S3/MinIO): models, datasets, checkpoints
   - Relational DB (Postgres): users, logs, metadata
   - Vector DB (Qdrant/Pinecone): embeddings for retrieval
   - Cache (Redis): conversation state, KV-cache, frequent queries
   - On your VPS: local SSD, possibly MinIO in Docker
3. Orchestration: how everything talks to everything else
   - API gateway: request routing, auth, rate limiting
   - Message queue: async job dispatch (RabbitMQ, Redis streams)
   - Container orchestration: Kubernetes, Docker Compose, Nomad
   - On your VPS: Docker Compose + a simple API proxy
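To make the compute pillar concrete, here is a minimal sketch that summarises what a host offers. The function name `compute_summary` is hypothetical, the memory check assumes Linux (`/proc/meminfo`), and finding `nvidia-smi` on the PATH is only a hint that an NVIDIA driver is installed, not proof that a GPU is usable.

```shell
# Hypothetical helper: summarise the compute available on this host.
compute_summary() {
  # CPU cores: nproc is GNU coreutils; fall back to getconf if it is missing.
  cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
  echo "cpu_cores=${cores}"

  # Total RAM in MiB, read from /proc/meminfo (Linux only).
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ {printf "mem_total_mib=%d\n", $2 / 1024}' /proc/meminfo
  fi

  # nvidia-smi on PATH suggests an NVIDIA driver is installed (a hint only).
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "gpu_tool=nvidia-smi"
  else
    echo "gpu_tool=none"
  fi
}

compute_summary
```

On a CPU-only VPS you would expect `gpu_tool=none`; the core count and memory figure are the numbers that bound which models llama.cpp can run comfortably.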
The Inference Loop (where you'll spend most of your time)
Client → [Gateway] → [Rate Limiter] → [Load Balancer] → [Inference Server]
                                                                │
                                                      ┌─────────▼────────┐
                                                      │  Tokeniser →     │
                                                      │  Model →         │
                                                      │  Detokeniser     │
                                                      └─────────┬────────┘
                                                                │
Client ← [Gateway] ← [Cache check] ←────────────────────────────┘
Key insight: latency is dominated by the model forward pass, not the plumbing. Good architectural design minimises unnecessary round-trips between components, caches aggressively, and parallelises where possible.
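The "cache aggressively" advice can be sketched as a response cache that sits in front of the model. This is an illustration under stated assumptions, not how Hermes or llama.cpp implement caching: `run_model`, `cached_completion`, and `CACHE_DIR` are hypothetical names, the cache is a plain directory keyed by a hash of the exact prompt, and `sha256sum` is assumed to be available (GNU coreutils).

```shell
# Hypothetical stand-in for the real inference call (e.g. a curl to the model server).
run_model() {
  echo "completion for: $1"
}

# Assumed cache location; override by exporting CACHE_DIR.
CACHE_DIR="${CACHE_DIR:-/tmp/prompt-cache}"

# Return a cached completion when one exists; otherwise call the model and store the result.
cached_completion() {
  prompt="$1"
  mkdir -p "$CACHE_DIR"
  # Key the cache on a hash of the exact prompt text.
  key=$(printf '%s' "$prompt" | sha256sum | cut -d' ' -f1)
  file="$CACHE_DIR/$key"
  if [ -f "$file" ]; then
    cat "$file"                        # cache hit: no forward pass needed
  else
    run_model "$prompt" | tee "$file"  # cache miss: run the model, save the output
  fi
}
```

Calling `cached_completion "Hello"` twice runs the model once; the second call is served from disk. Exact-match caching like this only pays off for repeated, deterministic prompts, and a real cache would also need expiry and handling for failed model calls.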
Architecture vs Infrastructure
| Architecture (what you design) | Infrastructure (how you run it) |
|---|---|
| Request flow, data paths | Hardware, networking |
| Caching strategy | Redis cluster config |
| Model serving topology | Docker/K8s manifests |
| Error handling, fallbacks | Monitoring, alerting |
This course focuses on architecture. You'll touch infrastructure enough to make it real.
Hands-on (15 min)
Map Your Own Stack
Open a terminal and inspect your running Hermes + llama.cpp setup so that you can draw its architecture.
# 1. Check what's running
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
# If Hermes is bare-metal:
ps aux | grep -E 'hermes|llama' | grep -v grep
# 2. Check what ports are open
ss -tlnp | grep -E ':(3000|5000|8080|11434|11435)'
# 3. Check storage
df -h /opt/data
Now draw (on paper or in a file) the architecture:
- Client: your Telegram/terminal
- Gateway: Hermes agent profile
- Inference: llama.cpp server (or whatever provider)
- Storage: filesystem, any DBs you run
- Orchestration: Docker Compose / bare processes
Save it as a reference; you'll come back to it on Day 30 and see how much more you understand.
# Bonus: Quick latency check
time curl -s http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt":"Hello","max_tokens":10}' > /dev/null
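To see how the total splits between plumbing and server-side work, curl can report its own timers through `-w`. The sketch below assumes the same endpoint as the command above; `probe_latency` and `summarise_timings` are hypothetical helper names. For a non-streamed response, almost everything after the connection is established is time spent waiting on the server, so the connect share is a rough proxy for network plumbing.

```shell
# Print connect time, time to first byte, and total time (seconds) for one request.
# The default URL is the endpoint used in the latency check above.
probe_latency() {
  url="${1:-http://localhost:8080/v1/completions}"
  curl -s -o /dev/null \
    -H "Content-Type: application/json" \
    -d '{"prompt":"Hello","max_tokens":10}' \
    -w '%{time_connect} %{time_starttransfer} %{time_total}\n' \
    "$url"
}

# Turn a "connect first_byte total" line into a labelled summary,
# including the share of total time spent establishing the connection.
summarise_timings() {
  awk '{ printf "connect=%.3fs first_byte=%.3fs total=%.3fs connect_share=%.1f%%\n",
         $1, $2, $3, ($3 > 0 ? 100 * $1 / $3 : 0) }'
}

# Usage (needs the inference server running):
#   probe_latency | summarise_timings
```

If the connect share is a small fraction of the total, the model path is where the time goes, which matches the key insight from the theory section.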
Question to answer after this exercise: If traffic doubled, where would this system break first?
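One rough way to explore that question is to send a small burst of simultaneous requests and compare per-request times with the single-request case. This is a sketch, not a load-testing tool: `one_request` and `burst` are hypothetical names, the endpoint is assumed to be the one used in the latency check, and the numbers you see depend on how many requests your server is configured to process concurrently.

```shell
# Send one completion request and print its total time in seconds.
# URL can be overridden through the environment.
one_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    -H "Content-Type: application/json" \
    -d '{"prompt":"Hello","max_tokens":10}' \
    "${URL:-http://localhost:8080/v1/completions}"
}

# Fire N requests in parallel and wait for all of them to finish.
burst() {
  n="$1"
  i=0
  while [ "$i" -lt "$n" ]; do
    one_request &
    i=$((i + 1))
  done
  wait
}

# Usage (needs the inference server running):
#   burst 1    # baseline: one line with the single-request time
#   burst 2    # doubled traffic: two lines, one per request
```

If the times in `burst 2` are roughly double the baseline, requests are probably queuing behind a single inference worker, which points at compute as the first bottleneck.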
Key Takeaways
- All AI systems share the same core blueprint: Data → Training → Serving → Monitoring
- The three pillars (compute, storage, orchestration) constrain every architectural decision
- Architecture is how the pieces connect; infrastructure is what runs them
- Latency is dominated by the model, not the plumbing, so optimise the model path first