🧠 AI System Design

Day 1: ML System Blueprint

📂 Foundations 📖 15 min read

Learning Objectives

  • Understand the high-level architecture of any ML-powered system
  • Identify the three pillars: compute, storage, and orchestration
  • Map a real system (Hermes + llama.cpp) onto the blueprint

Theory (15 min)

Every AI system, from ChatGPT to a simple RAG pipeline, follows the same fundamental blueprint:

┌──────────┐    ┌──────────┐    ┌──────────┐
│  DATA    │───▶│  TRAIN   │───▶│  SERVE   │
│  PIPELINE│    │  (or     │    │  INFER   │
│          │    │   LOAD)  │    │          │
└──────────┘    └──────────┘    └──────────┘
     │               │               │
     ▼               ▼               ▼
┌─────────────────────────────────────────┐
│           MONITOR · OBSERVE             │
└─────────────────────────────────────────┘

The Three Pillars

1. Compute: the hardware that runs your models

  • CPUs: general-purpose, good for pre/post-processing
  • GPUs: parallel matrix ops, the engine of deep learning
  • NPUs/TPUs: specialised accelerators (Google TPU, Apple Neural Engine)
  • On your VPS: CPU-only with llama.cpp (which is surprisingly capable)

2. Storage: where data, models, and state live

  • Object store (S3/MinIO): models, datasets, checkpoints
  • Relational DB (Postgres): users, logs, metadata
  • Vector DB (Qdrant/Pinecone): embeddings for retrieval
  • Cache (Redis): conversation state, KV-cache, frequent queries
  • On your VPS: local SSD, possibly MinIO in Docker

3. Orchestration: how everything talks to everything else

  • API gateway: request routing, auth, rate limiting
  • Message queue: async job dispatch (RabbitMQ, Redis Streams)
  • Container orchestration: Kubernetes, Docker Compose, Nomad
  • On your VPS: Docker Compose + a simple API proxy
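
On a CPU-only VPS, the compute and storage pillars meet at one practical question: does the model fit in memory? The sketch below is a rough heuristic, not a llama.cpp rule. The function names and the 20% default headroom (for the KV cache and buffers) are assumptions made for illustration, and it reads /proc/meminfo, so it is Linux-only.

```shell
# model_fits MODEL_BYTES AVAIL_BYTES [HEADROOM_PCT]
# Succeeds (exit 0) when MODEL_BYTES plus HEADROOM_PCT percent fits in AVAIL_BYTES.
# The 20% default headroom is an assumption, not a measured figure.
model_fits() {
  model_bytes=$1
  avail_bytes=$2
  headroom_pct=${3:-20}
  needed=$(( model_bytes * (100 + headroom_pct) / 100 ))
  [ "$needed" -le "$avail_bytes" ]
}

# available_ram_bytes: MemAvailable from /proc/meminfo, converted from kB to bytes.
available_ram_bytes() {
  kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  echo $(( kb * 1024 ))
}

# Example (the model path is hypothetical):
#   model_fits "$(stat -c %s /opt/data/models/model.gguf)" "$(available_ram_bytes)" \
#     && echo "model should fit" || echo "model is too large for available RAM"
```

If the check fails, the usual architectural responses are a smaller or more heavily quantised model, or moving inference to a different machine.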

The Inference Loop (where you'll spend most of your time)

Client → [Gateway] → [Rate Limiter] → [Load Balancer] → [Inference Server]
                                                              │
                                                    ┌─────────▼────────┐
                                                    │  Tokeniser →     │
                                                    │  Model →         │
                                                    │  Detokeniser     │
                                                    └─────────┬────────┘
                                                              │
Client ← [Gateway] ← [Cache check] ←──────────────────────────┘

Key insight: latency is dominated by the model forward pass, not the plumbing. Good architectural design minimises unnecessary round-trips between components, caches aggressively, and parallelises where possible.
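
You can check this claim on your own setup rather than take it on faith. The sketch below uses curl's --write-out timers (time_connect, time_starttransfer, time_total) to separate connection setup from everything the server does afterwards. The function names are hypothetical, and the split is only approximate: for a non-streaming response, time to first byte already includes most of the generation.

```shell
# timing_breakdown URL JSON_BODY
# Prints "connect_s first_byte_s total_s" for a single POST request.
timing_breakdown() {
  curl -s -o /dev/null \
    -H "Content-Type: application/json" \
    -d "$2" \
    -w '%{time_connect} %{time_starttransfer} %{time_total}\n' \
    "$1"
}

# plumbing_share CONNECT_S TOTAL_S
# Percentage of the total time spent on TCP connection setup.
plumbing_share() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t > 0) ? 100 * c / t : 0 }'
}

# Example (assumes a server listening on localhost:8080):
#   timing_breakdown http://localhost:8080/v1/completions '{"prompt":"Hello","max_tokens":10}'
#   plumbing_share 0.002 1.250
```

If the connection share is a fraction of a percent, tuning the gateway will not help much; the time is going into the model.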

Architecture vs Infrastructure

Architecture (what you design)  | Infrastructure (how you run it)
Request flow, data paths        | Hardware, networking
Caching strategy                | Redis cluster config
Model serving topology          | Docker/K8s manifests
Error handling, fallbacks       | Monitoring, alerting

This course focuses on architecture. You'll touch infrastructure enough to make it real.


Hands-on (15 min)

Map Your Own Stack

Open a terminal and inspect your running Hermes + llama.cpp setup so that you can draw its architecture.

# 1. Check what's running
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"
# If Hermes runs bare-metal (outside Docker):
ps aux | grep -E 'hermes|llama' | grep -v grep

# 2. Check which ports are listening (use sudo to see process names for sockets owned by other users)
ss -tlnp | grep -E ':(3000|5000|8080|11434|11435)'

# 3. Check storage
df -h /opt/data

Now draw (on paper or in a file) the architecture:

  1. Client → your Telegram/terminal
  2. Gateway → Hermes agent profile
  3. Inference → llama.cpp server (or whatever provider)
  4. Storage → filesystem, any DBs you run
  5. Orchestration → Docker Compose / bare processes
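
If you prefer a file to paper, a skeleton like the one below gives you the five layers to fill in. This is a minimal sketch: the function name and the file layout are illustrative assumptions, not a format the course requires.

```shell
# write_stack_map FILE
# Creates a plain-text skeleton with one line per layer of the blueprint.
write_stack_map() {
  cat > "$1" <<'EOF'
Client        -> (e.g. Telegram, terminal)
Gateway       -> (e.g. Hermes agent profile)
Inference     -> (e.g. llama.cpp server, which port?)
Storage       -> (filesystem paths, any DBs you run)
Orchestration -> (Docker Compose / bare processes)
EOF
}

# Example (the file name is hypothetical):
#   write_stack_map stack-map.txt
```

Replace each parenthesised hint with what the commands in the previous step actually showed.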

Save it as a reference: you'll come back to it on Day 30 and see how much more you understand.

# Bonus: Quick latency check
time curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello","max_tokens":10}' > /dev/null

Question to answer after this exercise: If traffic doubled, where would this system break first?
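
One way to start answering that question is to send several requests at once and compare the wall-clock time with a single request. The helper below is a sketch under assumptions: run_parallel is a hypothetical name, it relies on GNU date for sub-second timestamps, and it discards the responses, so it measures time only, not correctness.

```shell
# run_parallel N CMD [ARGS...]
# Runs CMD N times concurrently, waits for all copies, prints elapsed seconds.
run_parallel() {
  n=$1
  shift
  start=$(date +%s.%N)   # %N (nanoseconds) needs GNU date
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" > /dev/null 2>&1 &
    i=$(( i + 1 ))
  done
  wait
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", e - s }'
}

# Example (assumes the same local server as the latency check above):
#   run_parallel 1 curl -s http://localhost:8080/v1/completions \
#     -H "Content-Type: application/json" -d '{"prompt":"Hello","max_tokens":10}'
#   run_parallel 4 curl -s http://localhost:8080/v1/completions \
#     -H "Content-Type: application/json" -d '{"prompt":"Hello","max_tokens":10}'
```

If four concurrent requests take roughly four times as long as one, requests are effectively being processed one after another, which points at the inference server as the first bottleneck.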


Key Takeaways

  • All AI systems share the same core blueprint: Data → Training → Serving → Monitoring
  • The three pillars (compute, storage, orchestration) constrain every architectural decision
  • Architecture is how the pieces connect; infrastructure is what runs them
  • Latency is dominated by the model, not the plumbing, so optimise the model path first

References