🧠 AI System Design

Day 25: Case Study: ChatGPT

📂 Production & Case Studies · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand the scale and complexity of production AI serving
  • Learn how architectural decisions are shaped by user expectations
  • Map each ChatGPT component to something you could build on your VPS

Theory (15 min)

ChatGPT by the Numbers (as of 2024)

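  • 100M+ weekly active users (OpenAI's own figure)
  • <500 ms median time-to-first-token (unofficial estimate)
  • Runs on roughly 30,000 GPUs (A100/H100 clusters in Azure; third-party estimate, not an OpenAI disclosure)
  • On the order of 100B tokens generated daily (estimate)
  • Multiple models: GPT-4o (flagship), GPT-4o-mini (cheap/fast), o1 (reasoning)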
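
A quick back-of-envelope calculation helps put these figures in proportion. The sketch below takes two round numbers from the list (about 100B tokens per day and about 30,000 GPUs) at face value, which is an assumption since neither is an official figure, and derives average throughput. The variable names are illustrative.

```python
# Back-of-envelope throughput from the (unofficial) figures above.
TOKENS_PER_DAY = 100e9      # ~100B tokens/day, taken at face value
GPU_COUNT = 30_000          # ~30,000 GPUs, taken at face value
SECONDS_PER_DAY = 24 * 60 * 60

fleet_tokens_per_sec = TOKENS_PER_DAY / SECONDS_PER_DAY
per_gpu_tokens_per_sec = fleet_tokens_per_sec / GPU_COUNT

print(f"Fleet average:   {fleet_tokens_per_sec:,.0f} tokens/s")   # 1,157,407 tokens/s
print(f"Per-GPU average: {per_gpu_tokens_per_sec:.1f} tokens/s")  # 38.6 tokens/s
```

If the inputs are roughly right, the average works out to a few dozen tokens per second per GPU. That suggests peak load, latency targets, and model size drive the fleet size more than average throughput does.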

Architecture (What's Known Publicly)

User ──▶ Cloudflare ──▶ API Gateway ──▶ Model Router ──▶ Inference Cluster
             │               │               │
         DDoS           Auth,           Classify         Hundreds of nodes
         protection     rate limit,     intent →         with GPUs,
                        user tier       select model     load-balanced
                                             │
                                       ┌─────▼─────┐
                                       │   Model   │
                                       │  Router   │
                                       └─────┬─────┘
                                             │
                     ┌───────────────────────┼───────────────────────┐
                     ▼                       ▼                       ▼
              GPT-4o Cluster            GPT-4o-mini             o1 Cluster
                     │                       │                       │
               ┌─────▼──────┐          ┌─────▼──────┐          ┌─────▼──────┐
               │128× H100   │          │256× A100   │          │64× H100    │
               │KV cache    │          │Continuous  │          │Long context│
               │sharding    │          │batching    │          │optimised   │
               └────────────┘          └────────────┘          └────────────┘

Cluster sizes in the diagram are illustrative; OpenAI has not published per-model GPU counts.
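
The gateway stage in the diagram (auth, rate limit, user tier) can be sketched as a per-user token bucket whose capacity depends on the tier. This is a minimal sketch under assumptions: the tier names, the limits, and the `TokenBucket` and `gateway_allow` names are hypothetical, not OpenAI's values or code.

```python
import time

# Hypothetical tiers: (burst capacity, refill rate in requests/second).
TIER_LIMITS = {"free": (3, 0.5), "plus": (10, 2.0)}

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, now=None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def gateway_allow(user_id, tier, now=None):
    """Return True if this user's request may proceed to the model router."""
    if user_id not in buckets:
        capacity, rate = TIER_LIMITS[tier]
        buckets[user_id] = TokenBucket(capacity, rate, now)
    return buckets[user_id].allow(now)

# A free-tier user sends 5 requests at the same instant: the first 3 pass.
print([gateway_allow("alice", "free", now=0.0) for _ in range(5)])
# → [True, True, True, False, False]
# Two seconds later one token has been refilled (2 s * 0.5 tokens/s).
print(gateway_allow("alice", "free", now=2.0))  # → True
```

The `now` parameter exists only to make the example deterministic; a real gateway would rely on the monotonic clock.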

Key Architectural Decisions

1. Sliding window attention: the KV cache keeps only the most recent N tokens and drops older context. This bounds memory per conversation but limits how far back the model can attend. Whether ChatGPT uses it is not public.

2. Continuous batching (vLLM-like): requests join and leave the running batch at every decoding step instead of waiting for the whole batch to finish, which keeps GPU idle time low.

3. Model parallelism: a model of GPT-4's reported size does not fit in a single GPU's memory, so inference is split with tensor and pipeline parallelism across 8+ GPUs per inference node.

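4. KV cache prefix reuse (prompt caching): keep the KV cache for frequently used prompt prefixes (system prompts, few-shot examples) so they are not recomputed on every request. This is distinct from KV cache offloading, which moves cache entries to CPU memory or disk.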

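5. Speculative decoding: a fast draft model proposes several tokens and the slower target model verifies them in one pass. Speedups of roughly 2x are reported when the draft is accepted often, but the gain is not free: the draft model costs extra compute and memory.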
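
Item 1 (sliding window attention) comes down to a bounded cache that evicts the oldest entry. A minimal sketch, assuming a hypothetical `SlidingWindowKVCache` class and string placeholders where a real engine would store per-layer key and value tensors:

```python
from collections import deque

class SlidingWindowKVCache:
    """Keep key/value entries for only the most recent `window` tokens."""

    def __init__(self, window):
        # A deque with maxlen evicts the oldest entry automatically.
        self.entries = deque(maxlen=window)

    def append(self, token, key, value):
        self.entries.append((token, key, value))

    def visible_tokens(self):
        return [token for token, _, _ in self.entries]

cache = SlidingWindowKVCache(window=4)
for position, token in enumerate(["The", "cat", "sat", "on", "the", "mat"]):
    # Real keys/values are tensors per layer and head; placeholders suffice here.
    cache.append(token, key=f"k{position}", value=f"v{position}")

print(cache.visible_tokens())  # ['sat', 'on', 'the', 'mat']
```

Memory stays constant however long the conversation runs, and the first two tokens are no longer available to attend to.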
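
Item 2 (continuous batching) is easiest to see by counting decoding steps. The simulation below is a sketch with made-up output lengths and hypothetical function names. It compares static batching, where a batch runs until its longest request finishes, with continuous batching, where a freed slot is refilled before the next step.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request at once."""
    queue = deque(lengths)
    active = []          # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill free slots before every decoding step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decoding step produces one token for every active request;
        # requests that just produced their last token leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps

output_lengths = [2, 8, 3, 8, 2, 3]   # tokens each request will generate (made up)
print(static_batching_steps(output_lengths, batch_size=2))      # 19
print(continuous_batching_steps(output_lengths, batch_size=2))  # 13
```

With 26 tokens to generate and two slots, 13 steps is the minimum possible, so in this toy case the continuous scheduler leaves no slot idle.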
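
Item 3 (model parallelism) follows from simple memory arithmetic. The sketch below estimates how many 80 GB GPUs are needed just to hold the weights at 16-bit precision. The model sizes and the 80% usable-memory fraction are assumptions chosen for illustration, since GPT-4's parameter count is not public.

```python
import math

def min_gpus_for_weights(params_billion, bytes_per_param, gpu_memory_gb,
                         usable_fraction=0.8):
    """Smallest GPU count whose combined usable memory holds the weights."""
    weights_gb = params_billion * bytes_per_param   # 1e9 params * bytes, in GB
    usable_gb = gpu_memory_gb * usable_fraction     # leave room for KV cache etc.
    return weights_gb, math.ceil(weights_gb / usable_gb)

# Hypothetical model sizes at 16-bit precision (2 bytes/param) on 80 GB GPUs.
for params_billion in (3, 70, 1000):
    weights_gb, gpus = min_gpus_for_weights(params_billion, bytes_per_param=2,
                                            gpu_memory_gb=80)
    print(f"{params_billion:>5}B params -> {weights_gb:>5.0f} GB of weights "
          f"-> at least {gpus} GPU(s)")
```

A 3B model fits on one device with room to spare, a 70B model needs about 3, and a hypothetical 1,000B model needs about 32, which is why large models are sharded across several GPUs per node.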
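
Item 4 (reusing cached prompt prefixes) can be illustrated by counting how many tokens must be processed per prompt. This is a toy sketch: the `PrefixCache` class is hypothetical, a set of token tuples stands in for stored KV tensors, and splitting on spaces stands in for a tokenizer. Production engines typically cache at block granularity instead of per token.

```python
class PrefixCache:
    """Reuse computed state for the longest previously seen prompt prefix."""

    def __init__(self):
        self.cached_prefixes = set()   # stands in for stored KV tensors

    def prefill(self, tokens):
        """Return how many tokens had to be computed for this prompt."""
        # Find the longest cached prefix of the prompt.
        reused = 0
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self.cached_prefixes:
                reused = end
                break
        # "Compute" the rest and remember every new prefix along the way.
        for end in range(reused + 1, len(tokens) + 1):
            self.cached_prefixes.add(tuple(tokens[:end]))
        return len(tokens) - reused

system_prompt = "You are a helpful assistant .".split()
cache = PrefixCache()
print(cache.prefill(system_prompt + "What is a GPU ?".split()))     # 11: nothing cached yet
print(cache.prefill(system_prompt + "Explain KV caching".split()))  # 3: system prompt reused
```

The second request only pays for its three new tokens because the six-token system prompt was already processed.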
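
Item 5 (speculative decoding) is sketched below in its simplest greedy form, with toy integer "models" standing in for real networks. The function names and toy models are hypothetical. The sketch also omits the bonus token a real implementation takes from the target pass when every drafted token is accepted, and the rejection-sampling rule used when sampling instead of decoding greedily.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: output matches target-only greedy decoding."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens, one cheap step at a time.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks all k positions; in a real engine this is
        #    a single batched forward pass, counted here as one pass.
        target_passes += 1
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the accepted prefix plus the target's token.
                tokens += proposal[:i] + [expected]
                break
        else:
            tokens += proposal  # every drafted token was accepted
    return tokens[:len(prompt) + max_new_tokens], target_passes

def target_next(tokens):
    return tokens[-1] + 1            # toy "large model": counts upward

def draft_next(tokens):
    # Toy "small model": agrees with the target except after multiples of 5.
    return tokens[-1] + (2 if tokens[-1] % 5 == 0 else 1)

output, passes = speculative_decode(target_next, draft_next,
                                    prompt=[1], max_new_tokens=12)
print(output)                                   # [1, 2, 3, ..., 13]
print(f"{passes} target passes instead of 12")  # 5 target passes instead of 12
```

The output is identical to what the target model alone would produce; the saving comes from needing fewer target passes, and it shrinks as the draft model's guesses get worse.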

What You Can Learn From This

Component  ChatGPT Scale                         Your Scale
Hardware   ~30K GPUs (estimated)                 1 CPU / 0 GPU
Model      GPT-4o (parameter count undisclosed)  Qwen 2.5 3B
Batching   Custom continuous batching            llama.cpp default
Caching    Multi-tier (KV, semantic, response)   LRU response cache
Routing    ML-based intent classifier            Keyword router

The patterns are the same; only the numbers change.
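
The "Your Scale" column is small enough to sketch in one file. The example below combines a keyword router with an LRU response cache. The model names, keywords, and the `fake_generate` stand-in for a call to llama-server are all hypothetical.

```python
from collections import OrderedDict

# Hypothetical routing table: the first matching keyword decides the model.
ROUTES = [
    (("code", "python", "bug"), "coder-model"),
    (("summarise", "summarize", "tl;dr"), "small-model"),
]
DEFAULT_MODEL = "general-model"

def route(prompt):
    text = prompt.lower()
    for keywords, model in ROUTES:
        if any(keyword in text for keyword in keywords):
            return model
    return DEFAULT_MODEL

class LRUResponseCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

def answer(prompt, cache, generate):
    """Serve from cache when possible, otherwise route and generate."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, "cache"
    model = route(prompt)
    response = generate(model, prompt)
    cache.put(prompt, response)
    return response, model

def fake_generate(model, prompt):
    return f"[{model}] reply"                # stand-in for a real model call

cache = LRUResponseCache(capacity=2)
print(answer("Fix this Python bug", cache, fake_generate))  # ('[coder-model] reply', 'coder-model')
print(answer("Fix this Python bug", cache, fake_generate))  # ('[coder-model] reply', 'cache')
```

Substring matching is crude (for example, "code" also matches "decode"), which is the kind of gap an ML-based intent classifier closes at larger scale.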


Hands-on (15 min)

Map ChatGPT's Components to Your Stack

#!/usr/bin/env python3
"""chatgpt-case-study.py: map ChatGPT's architecture to your VPS."""

# Stub: Ayva will expand with:
# - Detailed architectural diagrams (using diagrams-as-code like Mermaid)
# - Comparison of OpenAI's infrastructure vs open-source alternatives
# - Deep dive into specific systems:
#   - KV cache management at scale
#   - How model parallelism works across GPUs
#   - Inference cluster scheduling (Kubernetes + GPUs)
# - What parts of the stack are proprietary vs documented
# - Lessons for smaller-scale systems

chatgpt_stack = {
    "Frontend": {
        "component": "OpenAI web UI / mobile app / API",
        "your_stack": "Telegram bot (@aloyclaw_bot) / terminal",
        "key_lesson": "Multiple surfaces, same backend",
    },
    "DDoS Protection": {
        "component": "Cloudflare",
        "your_stack": "None needed at your scale",
        "key_lesson": "Rate limiting is enough for low traffic",
    },
    "API Gateway": {
        "component": "Custom gateway (auth, rate limit, tiering)",
        "your_stack": "Your AI Gateway from Day 7",
        "key_lesson": "Same pattern, different scale",
    },
    "Model Router": {
        "component": "ML classifier → model selection",
        "your_stack": "Semantic router from Day 4",
        "key_lesson": "Simple classifier can replace complex routing",
    },
    "Inference Cluster": {
        "component": "H100 GPUs, custom inference engine",
        "your_stack": "llama.cpp (llama-server)",
        "key_lesson": "Engine optimisation matters more than hardware",
    },
    "Caching": {
        "component": "KV cache prefix reuse + response cache",
        "your_stack": "LRU response cache from Day 3",
        "key_lesson": "Cache at every layer, from KV to response",
    },
    "Monitoring": {
        "component": "Custom observability (logs + metrics + traces)",
        "your_stack": "Structured JSON logs from Day 22",
        "key_lesson": "Start with logs, add metrics as you grow",
    },
    "Moderation": {
        "component": "OpenAI Moderation API (multilayer)",
        "your_stack": "Regex guardrails from Day 23",
        "key_lesson": "Layered safety, automate what you can",
    },
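}

print("🧠 ChatGPT Architecture Map → Your VPS\n")

rows = [("Component", "ChatGPT", "Your Stack", "Lesson")] + [
    (name, d["component"], d["your_stack"], d["key_lesson"])
    for name, d in chatgpt_stack.items()
]
# Size each column to its longest entry; the fixed widths used before were
# narrower than several entries, which pushed later columns out of line.
widths = [max(len(row[i]) for row in rows) for i in range(4)]

def fmt(row):
    """Left-align each cell to its column width, two spaces between columns."""
    return "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()

print(fmt(rows[0]))
print("-" * (sum(widths) + 2 * 3))
for row in rows[1:]:
    print(fmt(row))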

Questions for Ayva:

  • What open-source projects (vLLM, TGI, SGLang) are closest to OpenAI's internal stack?
  • How does GPT-4o-mini achieve its price/performance ratio?
  • What's the one architectural change that would most impact a single-server setup?


Key Takeaways

  • ChatGPT's architecture follows the same patterns as your VPS stack, just at a vastly larger scale (tens of thousands of GPUs versus a single CPU)
  • The hard parts (model parallelism, custom inference engines) have open-source implementations you can study and run (vLLM, llama.cpp)
  • At the serving layer, OpenAI's advantage lies more in hardware access and scale than in architectural patterns unavailable to everyone else
  • Every component in the map above has an open-source equivalent you can run today
