Day 25: Case Study: ChatGPT
Learning Objectives
- Understand the scale and complexity of production AI serving
- Learn how architectural decisions are shaped by user expectations
- Map each ChatGPT component to something you could build on your VPS
Theory (15 min)
ChatGPT by the Numbers (as of 2024)
- 100M+ weekly active users
- <500ms median time-to-first-token
- Runs on roughly 30,000 GPUs by outside estimates (A100/H100 clusters in Azure)
- ~100B+ tokens generated daily
- Multiple models: GPT-4o (flagship), GPT-4o-mini (cheap/fast), o1 (reasoning)
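To get a feel for these figures, the arithmetic below spreads the daily token volume across the fleet. It is only a back-of-envelope sketch: it takes the rough numbers above at face value and assumes every GPU serves inference around the clock, which is certainly not true in practice. The function name `tokens_per_gpu_per_second` is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_gpu_per_second(tokens_per_day: float, gpu_count: int) -> float:
    """Average generation rate per GPU if load were spread perfectly evenly."""
    return tokens_per_day / SECONDS_PER_DAY / gpu_count


# Figures from the list above (approximate, not official).
fleet_rate = 100e9 / SECONDS_PER_DAY                  # tokens/s across the fleet
per_gpu = tokens_per_gpu_per_second(100e9, 30_000)    # tokens/s per GPU

print(f"fleet: {fleet_rate:,.0f} tokens/s, per GPU: {per_gpu:.1f} tokens/s")
# fleet: 1,157,407 tokens/s, per GPU: 38.6 tokens/s
```

The per-GPU average is modest. The scale comes from running tens of thousands of GPUs in parallel, not from any single device being extraordinarily fast.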
Architecture (What's Known Publicly)
User ──▶ Cloudflare ──▶ API Gateway ──▶ Model Router ──▶ Inference Cluster
              │              │                │                  │
            DDoS           Auth,          Classify       Hundreds of nodes
         protection     Rate Limit,       intent →       with GPU,
                         user tier      select model     load-balanced
                                              │
                                         ┌────▼────┐
                                         │  Model  │
                                         │ Router  │
                                         └────┬────┘
                                              │
                ┌─────────────────────────────┼─────────────────────────────┐
                ▼                             ▼                             ▼
          GPT-4o Cluster                 GPT-4o-mini                   o1 Cluster
                │                             │                             │
          ┌─────▼──────┐                ┌─────▼──────┐                ┌─────▼──────┐
          │128× H100   │                │256× A100   │                │64× H100    │
          │KV cache    │                │Continuous  │                │Long context│
          │sharding    │                │batching    │                │optimised   │
          └────────────┘                └────────────┘                └────────────┘
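The gateway stage in the diagram (auth, rate limit, user tier) can be sketched on a single server as an API-key lookup followed by a per-key token bucket. This is an illustrative sketch, not OpenAI's implementation: the tier names, limits, and the `Gateway` and `TokenBucket` classes are all assumptions made up for this example.

```python
import time

# Hypothetical tiers: (burst capacity, refill rate in requests per second).
TIER_LIMITS = {"free": (3, 0.5), "plus": (10, 2.0)}


class TokenBucket:
    """Allows short bursts up to `capacity`, then a steady `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Top up for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Gateway:
    def __init__(self, api_keys, clock=time.monotonic):
        self.api_keys = api_keys   # api_key -> tier; stands in for real auth
        self.clock = clock
        self.buckets = {}

    def handle(self, api_key):
        """Return an HTTP-style status code for one incoming request."""
        tier = self.api_keys.get(api_key)
        if tier is None:
            return 401             # unknown key: reject before any model work
        now = self.clock()
        if api_key not in self.buckets:
            capacity, rate = TIER_LIMITS[tier]
            self.buckets[api_key] = TokenBucket(capacity, rate, now)
        return 200 if self.buckets[api_key].allow(now) else 429


# Demo with a fake clock so the behaviour is deterministic.
fake_time = [0.0]
gw = Gateway({"key-free": "free"}, clock=lambda: fake_time[0])
print([gw.handle("key-free") for _ in range(4)])   # [200, 200, 200, 429]
fake_time[0] = 2.0                                 # 2 s later: one token refilled
print(gw.handle("key-free"), gw.handle("nope"))    # 200 401
```

Rejecting at the gateway matters more as inference gets more expensive: a 401 or 429 costs microseconds, while a request that reaches the model costs seconds of compute.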
Key Architectural Decisions
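1. Sliding window attention: the KV cache keeps only the most recent N tokens and drops older context. This saves memory but limits how much of a long conversation the model can see.
2. Continuous batching (vLLM-like): requests join and leave the batch at every decoding step, so GPUs are not left idle waiting for the longest request in a batch to finish.
3. Model parallelism: GPT-4 doesn't fit on one GPU, so inference uses tensor + pipeline parallelism across 8+ GPUs per inference node.
4. KV cache reuse (prefix caching): keep the KV state of frequently used prompt prefixes (system prompts, few-shot examples) to avoid recomputing them.
5. Speculative decoding: a fast draft model proposes tokens and the slow target model verifies them. Speedups of around 2x are possible without changing output quality, at the cost of running the extra draft model.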
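The benefit of continuous batching (item 2) is easiest to see by counting decode steps. The toy simulation below compares it with static batching for the same workload. It is a deliberately simplified model: it ignores prefill, memory limits, and the scheduling policies real engines such as vLLM use, and both function names are invented for this sketch.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued one immediately."""
    queue = deque(lengths)
    active = []   # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before this decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step: every active request emits one token;
        # requests that have no tokens left drop out of the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


# Output lengths (in tokens) for eight requests: two long, six short.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print("static:    ", static_batching_steps(lengths, batch_size=4))      # 20
print("continuous:", continuous_batching_steps(lengths, batch_size=4))  # 12
```

With static batching, the three short requests in each batch finish after 2 steps and their slots sit empty for the remaining 8. Continuous batching refills those slots straight away, which is where the saving comes from.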
What You Can Learn From This
| Component | ChatGPT Scale | Your Scale |
|---|---|---|
| Hardware | 30K GPUs | 1 CPU / 0 GPU |
| Model | GPT-4o (parameter count undisclosed) | Qwen 2.5 3B |
| Batching | Custom continuous batching | llama.cpp default |
| Caching | Multi-tier (KV cache, semantic, response) | LRU response cache |
| Routing | ML-based intent classifier | Keyword router |
The patterns are the same; only the numbers change.
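The "Your Scale" column can be made concrete in a few dozen lines. The sketch below combines a keyword router with an LRU response cache in front of a model call. Everything named here is an assumption for illustration: the route names, the keyword lists, and `fake_generate`, which stands in for a real call to something like llama-server.

```python
from collections import OrderedDict

# Hypothetical keyword lists and route names, for illustration only.
ROUTES = {
    "code": ("python", "bug", "function", "traceback"),
    "math": ("solve", "integral", "equation"),
}


def route(prompt: str) -> str:
    """Return the first route with a keyword among the prompt's words."""
    words = prompt.lower().split()
    for name, keywords in ROUTES.items():
        if any(k in words for k in keywords):
            return name
    return "chat"


class LRUCache:
    def __init__(self, max_items: int):
        self.max_items = max_items
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)   # evict the least recently used


def answer(prompt, cache, generate):
    key = " ".join(prompt.lower().split())   # normalise case and whitespace
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    reply = generate(route(prompt), prompt)
    cache.put(key, reply)
    return reply, "model"


calls = []


def fake_generate(route_name, prompt):       # stand-in for a real model call
    calls.append(route_name)
    return f"[{route_name}] reply"


cache = LRUCache(max_items=2)
print(answer("Fix this Python bug", cache, fake_generate))   # served by the model
print(answer("fix this  python BUG", cache, fake_generate))  # served from cache
```

Whole-word matching is crude (a prompt ending in "bug?" would miss the keyword), which is exactly the gap an ML-based intent classifier closes at larger scale.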
Hands-on (15 min)
Map ChatGPT's Components to Your Stack
#!/usr/bin/env python3
"""chatgpt-case-study.py: map ChatGPT's architecture to your VPS."""
# Stub: Ayva will expand with:
# - Detailed architectural diagrams (using diagrams-as-code like Mermaid)
# - Comparison of OpenAI's infrastructure vs open-source alternatives
# - Deep dive into specific systems:
#   - KV cache management at scale
#   - How model parallelism works across GPUs
#   - Inference cluster scheduling (Kubernetes + GPUs)
# - What parts of the stack are proprietary vs documented
# - Lessons for smaller-scale systems
chatgpt_stack = {
    "Frontend": {
        "component": "OpenAI web UI / mobile app / API",
        "your_stack": "Telegram bot (@aloyclaw_bot) / terminal",
        "key_lesson": "Multiple surfaces, same backend",
    },
    "DDoS Protection": {
        "component": "Cloudflare",
        "your_stack": "None needed at your scale",
        "key_lesson": "Rate limiting is enough for low traffic",
    },
    "API Gateway": {
        "component": "Custom gateway (auth, rate limit, tiering)",
        "your_stack": "Your AI Gateway from Day 7",
        "key_lesson": "Same pattern, different scale",
    },
    "Model Router": {
        "component": "ML classifier → model selection",
        "your_stack": "Semantic router from Day 4",
        "key_lesson": "Simple classifier can replace complex routing",
    },
    "Inference Cluster": {
        "component": "H100 GPUs, custom inference engine",
        "your_stack": "llama.cpp (llama-server)",
        "key_lesson": "Engine optimisation matters more than hardware",
    },
    "Caching": {
        "component": "KV cache prefix reuse + response cache",
        "your_stack": "LRU response cache from Day 3",
        "key_lesson": "Cache at every layer, from KV to response",
    },
    "Monitoring": {
        "component": "Custom observability (logs + metrics + traces)",
        "your_stack": "Structured JSON logs from Day 22",
        "key_lesson": "Start with logs, add metrics as you grow",
    },
    "Moderation": {
        "component": "OpenAI Moderation API (multilayer)",
        "your_stack": "Regex guardrails from Day 23",
        "key_lesson": "Layered safety, automate what you can",
    },
}
print("ChatGPT Architecture Map → Your VPS\n")
# Column widths fit the longest value in each column so the rows stay aligned.
print(f"{'Component':<20} {'ChatGPT':<48} {'Your Stack':<41} {'Lesson'}")
print("-" * 150)
for comp, details in chatgpt_stack.items():
    print(f"{comp:<20} {details['component']:<48} "
          f"{details['your_stack']:<41} {details['key_lesson']}")
Questions for Ayva:
- What open-source projects (vLLM, TGI, SGLang) are closest to OpenAI's internal stack?
- How does GPT-4o-mini achieve its price/performance ratio?
- What's the one architectural change that would most impact a single-server setup?
Key Takeaways
- ChatGPT's architecture follows the same patterns as your VPS stack, just at 10,000x scale
- The hard parts (model parallelism, custom inference engine) are solved by open-source (vLLM, llama.cpp)
- OpenAI's competitive advantage is hardware access, not architectural innovation
- Every component has an open-source equivalent you can run today