🧠 AI System Design

Day 11: Distributed Training 101

Learning Objectives

  • Understand the three parallelism strategies and when they apply
  • Know why distributed training is complex and often unnecessary
  • Simulate parallel vs sequential workloads on your VPS

Theory (15 min)

Why Distribute?

Training large models requires:

  1. More memory than one device can hold (the model doesn't fit)
  2. More compute than one device can deliver in acceptable time
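To make the memory point concrete, here is a rough back-of-the-envelope estimate. It is a sketch under a stated assumption, not a profiler: it assumes mixed-precision training with Adam at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum and variance) and ignores activations entirely. The function names and the 80 GB device size are illustrative.

```python
"""Rough training-memory estimate: a sketch, not a profiler."""

# ASSUMPTION: mixed-precision training with Adam.
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights, momentum
# and variance (4 + 4 + 4). Activation memory is NOT included.
BYTES_PER_PARAM_ADAM_MIXED = 2 + 2 + 12


def training_memory_gb(n_params: float,
                       bytes_per_param: int = BYTES_PER_PARAM_ADAM_MIXED) -> float:
    """Approximate memory for weights, gradients and optimizer state, in GB."""
    return n_params * bytes_per_param / 1e9


def fits_on_device(n_params: float, device_gb: float) -> bool:
    """True if the model state alone fits; activations still need headroom."""
    return training_memory_gb(n_params) <= device_gb


if __name__ == "__main__":
    for billions in (0.125, 1.3, 7, 13):
        n_params = billions * 1e9
        print(f"{billions:>6}B params: ~{training_memory_gb(n_params):6.1f} GB "
              f"of model state, fits on an 80 GB device: "
              f"{fits_on_device(n_params, 80)}")
```

Under these assumptions the model state grows linearly with parameter count, which is why memory, not only speed, can force a model onto several devices.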

The Three Paradigms

1. Data Parallelism

Batch ──┬─▶ GPU 0 ──▶ Gradients
         ├─▶ GPU 1 ──▶ Gradients  ──▶ All-Reduce ──▶ Update
         ├─▶ GPU 2 ──▶ Gradients
         └─▶ GPU 3 ──▶ Gradients

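Each GPU holds a full copy of the model. The batch is split across GPUs, and after each backward pass the gradients are averaged across GPUs (the all-reduce step) so that every copy applies the same update.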

Best for: Model fits on one GPU, but you want to train faster on more data.
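The averaging step can be illustrated without any GPUs. The sketch below simulates the workers as plain Python lists and uses a one-parameter model; all function names are hypothetical and `all_reduce_mean` is only a stand-in for a real collective operation. It assumes the batch divides evenly across workers, which is what makes the average of the shard gradients equal the full-batch gradient.

```python
"""Data parallelism in miniature: shard a batch, average per-worker gradients."""
import random


def gradient(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    return sum(values) / len(values)


def data_parallel_step(w: float, batch: list[tuple[float, float]],
                       n_workers: int, lr: float) -> float:
    """One SGD step where each simulated worker sees an equal shard."""
    shard = len(batch) // n_workers  # assumes len(batch) % n_workers == 0
    shards = [batch[i * shard:(i + 1) * shard] for i in range(n_workers)]
    local_grads = [gradient(w, s) for s in shards]  # would run in parallel
    return w - lr * all_reduce_mean(local_grads)    # same update on every copy


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-1, 1) for _ in range(64)]
    batch = [(x, 3.0 * x) for x in xs]  # synthetic data, true weight is 3.0
    w_single = w_parallel = 0.0
    for _ in range(200):
        w_single = w_single - 0.5 * gradient(w_single, batch)
        w_parallel = data_parallel_step(w_parallel, batch, n_workers=4, lr=0.5)
    print(f"single device: w = {w_single:.4f}")
    print(f"4 workers:     w = {w_parallel:.4f}")
```

Both runs should land on essentially the same weight: data parallelism changes where the gradient is computed, not what is computed.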

2. Model Parallelism

Layer 0-5 ──▶ GPU 0
Layer 6-11 ──▶ GPU 1    (sequential)
Layer 12-17 ─▶ GPU 2
Layer 18-23 ─▶ GPU 3

Each GPU holds a portion of the model. The forward pass travels through the GPUs sequentially, so in this naive form only one GPU is busy at any moment.

Best for: Model is too large for one GPU.
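The layer split in the diagram can be sketched as follows. This is a toy illustration: the "layers" are placeholder additions, the devices are just integers, and `partition_layers` and `forward` are hypothetical helper names. It shows the contiguous assignment of layers to devices and records which device is active at each step.

```python
"""Naive model (layer) parallelism: contiguous layer ranges per device."""


def partition_layers(n_layers: int, n_devices: int) -> list[range]:
    """Split layer indices into contiguous, near-equal ranges, one per device."""
    base, extra = divmod(n_layers, n_devices)
    ranges, start = [], 0
    for device in range(n_devices):
        size = base + (1 if device < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def forward(x: float, stages: list[range]) -> tuple[float, list[int]]:
    """Run a toy forward pass; return the output and the device used per layer."""
    timeline = []
    for device, layers in enumerate(stages):
        for _layer in layers:
            x = x + 1.0              # placeholder for a real layer's computation
            timeline.append(device)  # only this device is busy at this step
    return x, timeline


if __name__ == "__main__":
    stages = partition_layers(24, 4)
    for device, layers in enumerate(stages):
        print(f"GPU {device}: layers {layers.start}-{layers.stop - 1}")
    _, timeline = forward(0.0, stages)
    busy = timeline.count(0) / len(timeline)
    print(f"Each GPU is busy {busy:.0%} of the forward pass")
```

With 24 layers on 4 devices each device works for a quarter of the pass and waits for the rest, which is the idle time that pipeline parallelism tries to reduce.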

3. Pipeline Parallelism

GPU 0: [Batch1:L0-5] → [Batch2:L0-5] → ...
GPU 1:                [Batch1:L6-11] → [Batch2:L6-11] → ...

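Like model parallelism, but each batch is split into micro-batches so that a later stage can work on one micro-batch while an earlier stage processes the next. This keeps GPUs busy and reduces the time they spend idle.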

Best for: Large models where GPU utilisation matters.
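How much idle time micro-batching removes can be estimated with a simple schedule. The sketch below assumes forward passes only, stages of equal cost, and one time step per micro-batch per stage; real schedules also interleave backward passes. Function names are hypothetical.

```python
"""Pipeline schedule sketch: forward passes only, equal-cost stages assumed."""


def pipeline_schedule(n_stages: int, n_microbatches: int) -> list[list[str]]:
    """Return grid[stage][time_step] of micro-batch labels ('.' means idle)."""
    n_steps = n_stages + n_microbatches - 1
    grid = [["." for _ in range(n_steps)] for _ in range(n_stages)]
    for stage in range(n_stages):
        for mb in range(n_microbatches):
            # Stage s can start micro-batch mb only after s earlier stages ran it.
            grid[stage][stage + mb] = str(mb + 1)
    return grid


def idle_fraction(n_stages: int, n_microbatches: int) -> float:
    """Share of stage-time slots spent idle (the pipeline 'bubble')."""
    return (n_stages - 1) / (n_stages + n_microbatches - 1)


if __name__ == "__main__":
    for stage, row in enumerate(pipeline_schedule(n_stages=2, n_microbatches=4)):
        print(f"GPU {stage}: " + " ".join(row))
    for m in (1, 4, 16):
        print(f"4 stages, {m:>2} micro-batches: {idle_fraction(4, m):.0%} idle")
```

In this model a single micro-batch on 4 stages leaves each GPU idle 75% of the time, the same as naive model parallelism, and the idle share shrinks as the number of micro-batches grows.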

When NOT to Distribute

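  • Model is <7B params → a single GPU is usually enough (full training near the upper end of that range may still need memory-saving techniques)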
  • Dataset is <10GB → training is fast enough
  • You're doing fine-tuning → single GPU with LoRA/QLoRA
  • You only have CPU → keep to small models or run inference only; llama.cpp is primarily a CPU inference runtime, not a training framework

For 99% of the AI systems you'll design, single-device training is sufficient.


Hands-on (15 min)

Simulate Parallel vs Sequential Workloads

#!/usr/bin/env python3
"""parallel-simulation.py — compare sequential vs parallel execution."""
import time
import threading
import random

# Stub — Ayva will expand with:
# - Real distributed training concepts (All-Reduce, Ring topology)
# - Communication overhead modelling
# - Scaling efficiency curves (Amdahl's Law)
# - When to use FSDP/DeepSpeed
# - LoRA vs full fine-tuning tradeoffs

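def simulate_training_worker(worker_id: int, data_size: float) -> tuple[int, float]:
    """Simulate training on a chunk of data.

    time.sleep() releases the GIL, so the threads below overlap; real
    CPU-bound Python work would need processes (or GPUs) to run in parallel.
    """
    compute = data_size * random.uniform(1.0, 1.5)  # simulated cost with jitter
    time.sleep(compute * 0.1)  # scale down for demo
    return worker_id, compute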

def sequential(data_sizes):
    results = []
    for i, size in enumerate(data_sizes):
        results.append(simulate_training_worker(i, size))
    return results

def parallel(data_sizes):
    threads = []
    results = [None] * len(data_sizes)

    def worker(i, size):
        results[i] = simulate_training_worker(i, size)

    for i, size in enumerate(data_sizes):
        t = threading.Thread(target=worker, args=(i, size))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

# Simulate with different splits
data_sizes = [10, 12, 8, 15, 9, 11]

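random.seed(0)  # reseed so both runs draw from the same random sequence
t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
seq_results = sequential(data_sizes)
seq_time = time.perf_counter() - t0
print(f"Sequential: {seq_time:.2f}s")

random.seed(0)
t0 = time.perf_counter()
par_results = parallel(data_sizes)
par_time = time.perf_counter() - t0
print(f"Parallel:   {par_time:.2f}s")
print(f"Speedup:    {seq_time/par_time:.2f}x")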

# Amdahl's Law visualisation
def amdahl_speedup(p: float, n: int):
    """p = proportion that can be parallelised, n = processors."""
    return 1 / ((1 - p) + p / n)

print("\nAmdahl's Law — max speedup by parallel proportion:")
for p in [0.5, 0.8, 0.9, 0.95, 0.99]:
    print(f"   {p*100:.0f}% parallel: {amdahl_speedup(p, 4):.2f}x with 4 workers, "
          f"{amdahl_speedup(p, 8):.2f}x with 8 workers")

Questions for Ayva:

  • When does data parallelism communication overhead outweigh benefits?
  • How do LoRA adapters reduce the need for distributed training?
  • What's the practical scaling efficiency of FSDP vs DeepSpeed ZeRO?
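As a starting point for the first question, a toy cost model can show where communication starts to dominate. This is a sketch under strong assumptions: gradients are exchanged with a bandwidth-bound ring all-reduce (each worker transfers 2(n-1)/n of the gradient size), latency is ignored, and communication does not overlap with compute. The model size, step time and bandwidth figures are made up for illustration.

```python
"""Toy cost model for data-parallel scaling; all numbers are illustrative."""


def ring_allreduce_seconds(grad_bytes: float, n_workers: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Time for a bandwidth-bound ring all-reduce; latency is ignored."""
    if n_workers == 1:
        return 0.0
    # Each worker sends (and receives) 2 * (n - 1) / n of the gradient size.
    return 2 * (n_workers - 1) / n_workers * grad_bytes / bandwidth_bytes_per_s


def step_seconds(compute_s: float, grad_bytes: float, n_workers: int,
                 bandwidth_bytes_per_s: float) -> float:
    """Per-step time, ASSUMING no overlap of compute and communication."""
    comm = ring_allreduce_seconds(grad_bytes, n_workers, bandwidth_bytes_per_s)
    return compute_s / n_workers + comm


def speedup(compute_s: float, grad_bytes: float, n_workers: int,
            bandwidth_bytes_per_s: float) -> float:
    """Speedup over a single worker that needs compute_s per step."""
    return compute_s / step_seconds(compute_s, grad_bytes, n_workers,
                                    bandwidth_bytes_per_s)


if __name__ == "__main__":
    grad_bytes = 2e9   # hypothetical: 1B parameters with fp16 gradients
    compute_s = 1.0    # hypothetical single-device step time
    links = {"fast link (100 GB/s)": 100e9, "10 Gbit/s Ethernet": 1.25e9}
    for name, bandwidth in links.items():
        row = ", ".join(
            f"{n} workers: {speedup(compute_s, grad_bytes, n, bandwidth):.2f}x"
            for n in (2, 4, 8))
        print(f"{name}: {row}")
```

In this model the slow link makes two workers slower than one, because the all-reduce alone takes longer than the compute it saves. Real frameworks overlap communication with the backward pass, so treat the output as a pessimistic bound on the shape of the curve rather than a prediction.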


Key Takeaways

  • Three parallelism strategies: data (same model, more data), model (split layers), pipeline (split layers, fed with micro-batches)
  • Communication overhead (exchanging gradients and activations between devices) is often the main bottleneck in distributed training
  • Amdahl's Law shows diminishing returns: with 90% of the work parallelisable, 8 workers give only a 4.7x speedup
  • For 99% of real systems, single-device training with LoRA is sufficient

References