Day 11: Distributed Training 101
Learning Objectives
- Understand the three parallelism strategies and when they apply
- Know why distributed training is complex and often unnecessary
- Simulate parallel vs sequential workloads on your VPS
Theory (15 min)
Why Distribute?
Training a large model can outgrow a single device in two ways:
1. It needs more memory than one device can hold (the model doesn't fit)
2. It needs more compute than one device can deliver in an acceptable time
The Three Paradigms
1. Data Parallelism
Batch ──┬─▶ GPU 0 ──▶ Gradients
        ├─▶ GPU 1 ──▶ Gradients ──▶ All-Reduce ──▶ Update
        ├─▶ GPU 2 ──▶ Gradients
        └─▶ GPU 3 ──▶ Gradients
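Each GPU holds a full copy of the model. The batch is split across GPUs, each GPU computes gradients on its own shard, and the gradients are averaged (all-reduce) so that every copy applies the same update.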
Best for: Model fits on one GPU, but you want to train faster on more data.
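The averaging step can be checked on a toy problem. The sketch below is a minimal illustration, not a real multi-GPU setup: it assumes a one-parameter linear model with a mean-squared-error loss and equal-sized shards, and the function names (`grad_mse`, `data_parallel_grad`) are hypothetical. It shows that averaging per-shard gradients gives the same result as computing the gradient over the full batch.

```python
"""Data-parallel gradient averaging on a toy model (stdlib only)."""

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch into equal shards, take one gradient per shard, average them."""
    shard = len(xs) // num_workers  # assumes len(xs) is divisible by num_workers
    grads = []
    for k in range(num_workers):
        lo, hi = k * shard, (k + 1) * shard
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))  # work done by one "GPU"
    return sum(grads) / num_workers  # stands in for the all-reduce (mean) step

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)
sharded = data_parallel_grad(w, xs, ys, num_workers=4)
print(f"full-batch gradient:     {full:.4f}")
print(f"averaged shard gradient: {sharded:.4f}")
```

The two values agree only because the shards are the same size; with uneven shards the average would need to be weighted by shard size.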
2. Model Parallelism
Layer 0-5   ──▶ GPU 0
Layer 6-11  ──▶ GPU 1   (sequential)
Layer 12-17 ──▶ GPU 2
Layer 18-23 ──▶ GPU 3
Each GPU holds a portion of the model. The forward pass travels through the GPUs sequentially, so for any given batch only one GPU is busy at a time.
Best for: Model is too large for one GPU.
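A minimal sketch of the layer split, assuming toy "layers" that are simple affine functions on a single number and "devices" that are just Python lists. The helper names (`make_layer`, `partition`, `forward`) are hypothetical. It illustrates that partitioning changes where each layer lives, not what the model computes.

```python
"""Model parallelism sketch: contiguous layer blocks assigned to devices (stdlib only)."""

def make_layer(scale, shift):
    """A toy layer: an affine map on a single float."""
    return lambda x: scale * x + shift

layers = [make_layer(1.0 + 0.01 * i, 0.1) for i in range(24)]

def partition(layers, num_devices):
    """Assign a contiguous block of layers to each device."""
    per_device = len(layers) // num_devices  # assumes an even split
    return [layers[d * per_device:(d + 1) * per_device] for d in range(num_devices)]

def forward(stages, x):
    """Run the stages in order; only one stage is active at any moment."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
        # In a real system the activation x would be copied to the next device here.
    return x

stages = partition(layers, num_devices=4)
single_device = forward([layers], 1.0)
split = forward(stages, 1.0)
print("layers per device:", [len(s) for s in stages])
print("same output as single device:", single_device == split)
```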
3. Pipeline Parallelism
GPU 0: [Batch1:L0-5] → [Batch2:L0-5] → ...
GPU 1: [Batch1:L6-11] → [Batch2:L6-11] → ...
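Like model parallelism, but the batch is split into micro-batches so that a stage can start on the next micro-batch as soon as it hands the previous one to the following stage. This reduces the time later GPUs spend idle while earlier stages compute.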
Best for: Large models where GPU utilisation matters.
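The effect of micro-batches on idle time can be sketched with a simple schedule. The model below rests on simplifying assumptions: forward pass only, every stage takes one time unit per micro-batch, and transfers between stages are free. The function names are hypothetical. Under those assumptions a pipeline with S stages and M micro-batches takes S + M - 1 time steps.

```python
"""Pipeline utilisation under a unit-time, forward-only schedule (stdlib only)."""

def pipeline_schedule(num_stages, num_microbatches):
    """Return grid[stage][t]: the micro-batch index processed at step t, or None if idle."""
    total_steps = num_stages + num_microbatches - 1
    grid = [[None] * total_steps for _ in range(num_stages)]
    for s in range(num_stages):
        for m in range(num_microbatches):
            grid[s][s + m] = m  # stage s sees micro-batch m after s earlier stages
    return grid

def utilisation(grid):
    """Fraction of (stage, time step) slots in which a stage is busy."""
    busy = sum(cell is not None for row in grid for cell in row)
    return busy / (len(grid) * len(grid[0]))

for m in [1, 4, 16]:
    grid = pipeline_schedule(num_stages=4, num_microbatches=m)
    print(f"4 stages, {m:2d} micro-batches: utilisation {utilisation(grid):.0%}")
```

With a single micro-batch this reduces to plain model parallelism, where each of the four stages is busy a quarter of the time.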
When NOT to Distribute
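- Model is <7B params → a single GPU is usually enough, especially with parameter-efficient fine-tuning
- Dataset is <10GB → training is usually fast enough on one device
- You're doing fine-tuning → a single GPU with LoRA/QLoRA is usually enough
- You only have CPU → stick to small models; llama.cpp is primarily an inference engine, and its training support is limited
For the vast majority of the AI systems you'll design, single-device training is sufficient.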
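A back-of-the-envelope memory check helps decide whether a model fits on one device. The sketch below assumes full training with Adam in mixed precision, which is commonly counted as 16 bytes per parameter (fp16 weights and gradients, an fp32 copy of the weights, and two fp32 Adam moments). Activations are ignored, and the 80 GB device size is a hypothetical figure chosen for illustration. Parameter-efficient methods such as LoRA train only a small number of added parameters, so their gradient and optimizer state is far smaller than this estimate.

```python
"""Rough memory estimate for full training state (stdlib only)."""

def training_state_gb(num_params, bytes_per_param=16):
    """Weights + gradients + optimizer state in GB, excluding activations.

    bytes_per_param=16 is an assumption: mixed-precision training with Adam.
    """
    return num_params * bytes_per_param / 1e9

GPU_MEMORY_GB = 80  # hypothetical device size

for label, n in [("1B", 1e9), ("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
    need = training_state_gb(n)
    verdict = "fits" if need <= GPU_MEMORY_GB else "does not fit"
    print(f"{label:>3} params: ~{need:.0f} GB of training state, {verdict} in {GPU_MEMORY_GB} GB")
```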
Hands-on (15 min)
Simulate Parallel vs Sequential Workloads
#!/usr/bin/env python3
"""parallel-simulation.py — compare sequential vs parallel execution."""
import time
import threading
import random
# Stub — Ayva will expand with:
# - Real distributed training concepts (All-Reduce, Ring topology)
# - Communication overhead modelling
# - Scaling efficiency curves (Amdahl's Law)
# - When to use FSDP/DeepSpeed
# - LoRA vs full fine-tuning tradeoffs
def simulate_training_worker(worker_id: int, data_size: float):
    """Simulate training on a chunk of data."""
    compute = data_size * random.uniform(1.0, 1.5)
    time.sleep(compute * 0.1)  # scale down for demo
    return worker_id, compute
def sequential(data_sizes):
    results = []
    for i, size in enumerate(data_sizes):
        results.append(simulate_training_worker(i, size))
    return results
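def parallel(data_sizes):
    # Threads overlap here because time.sleep releases the GIL;
    # CPU-bound Python code would not speed up this way.
    threads = []
    results = [None] * len(data_sizes)

    def worker(i, size):
        results[i] = simulate_training_worker(i, size)

    for i, size in enumerate(data_sizes):
        t = threading.Thread(target=worker, args=(i, size))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading results
    return results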
# Simulate with different splits
data_sizes = [10, 12, 8, 15, 9, 11]
t0 = time.time()
seq_results = sequential(data_sizes)
seq_time = time.time() - t0
print(f"Sequential: {seq_time:.2f}s")
t0 = time.time()
par_results = parallel(data_sizes)
par_time = time.time() - t0
print(f"Parallel: {par_time:.2f}s")
print(f"Speedup: {seq_time/par_time:.2f}x")
# Amdahl's Law visualisation
def amdahl_speedup(p: float, n: int):
    """p = proportion that can be parallelised, n = processors."""
    return 1 / ((1 - p) + p / n)
print("\nAmdahl's Law — max speedup by parallel proportion:")
for p in [0.5, 0.8, 0.9, 0.95, 0.99]:
    print(f" {p*100:.0f}% parallel: {amdahl_speedup(p, 4):.2f}x with 4 workers, "
          f"{amdahl_speedup(p, 8):.2f}x with 8 workers")
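The thread simulation above has no communication cost, which is the part that usually limits data parallelism in practice. The sketch below adds a rough cost model under stated assumptions: a ring all-reduce in which each worker sends and receives about 2(n-1)/n times the gradient size, no overlap between compute and communication, zero latency, and a fixed global batch so compute time divides evenly by n. The gradient size, compute time, and bandwidth figures are hypothetical.

```python
"""Toy step-time model: compute shrinks with n, all-reduce cost does not (stdlib only)."""

def ring_allreduce_seconds(grad_bytes, n, bandwidth_bytes_per_s):
    """Approximate ring all-reduce time: each worker moves 2*(n-1)/n of the gradient."""
    if n == 1:
        return 0.0  # nothing to synchronise
    return 2 * (n - 1) / n * grad_bytes / bandwidth_bytes_per_s

def step_seconds(compute_s, grad_bytes, n, bandwidth_bytes_per_s):
    """One training step: compute split across n workers, then a blocking all-reduce."""
    return compute_s / n + ring_allreduce_seconds(grad_bytes, n, bandwidth_bytes_per_s)

COMPUTE_S = 1.0    # hypothetical single-device compute time per step
GRAD_BYTES = 4e9   # hypothetical: 1B parameters with fp32 gradients

for bandwidth in [10e9, 100e9]:  # hypothetical link speeds in bytes/s
    print(f"bandwidth {bandwidth / 1e9:.0f} GB/s")
    for n in [1, 2, 4, 8]:
        t = step_seconds(COMPUTE_S, GRAD_BYTES, n, bandwidth)
        speedup = COMPUTE_S / t
        print(f"  {n} workers: step {t:.3f}s, speedup {speedup:.2f}x, "
              f"efficiency {speedup / n:.0%}")
```

In this model the all-reduce term approaches a constant as n grows, so on a slow link adding workers stops helping once communication dominates the step.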
Questions for Ayva:
- When does the communication overhead of data parallelism outweigh its benefits?
- How do LoRA adapters reduce the need for distributed training?
- What's the practical scaling efficiency of FSDP vs DeepSpeed ZeRO?
Key Takeaways
- Three parallelism strategies: data (full model copy per GPU, batch split across GPUs), model (layers split across GPUs), pipeline (layers split, with micro-batches keeping the stages busy)
- Communication overhead is often the main bottleneck in distributed training
- Amdahl's Law shows diminishing returns: with 90% of the work parallelisable, 8 workers give only about a 4.7x speedup
- For the vast majority of real systems, single-device training with LoRA is sufficient