Day 11: Distributed Training 101

📂 Data & Training 📖 15 min read Needs expansion

Learning Objectives

Understand the three parallelism strategies and when they apply
Know why distributed training is complex and often unnecessary
Simulate parallel vs sequential workloads on your VPS

Theory (15 min)

Why Distribute?

Training large models requires: 1. More memory than one device can hold (model doesn't fit) 2. More compute than one device can deliver in acceptable time

The Three Paradigms

1. Data Parallelism

Batch ──┬─▶ GPU 0 ──▶ Gradients
         ├─▶ GPU 1 ──▶ Gradients  ──▶ All-Reduce ──▶ Update
         ├─▶ GPU 2 ──▶ Gradients
         └─▶ GPU 3 ──▶ Gradients

Each GPU holds a full copy of the model. The batch is split across GPUs. Gradients are averaged.

Best for: Model fits on one GPU, but you want to train faster on more data.

2. Model Parallelism

Layer 0-5 ──▶ GPU 0
Layer 6-11 ──▶ GPU 1    (sequential)
Layer 12-17 ─▶ GPU 2
Layer 18-23 ─▶ GPU 3

Each GPU holds a portion of the model. The forward pass travels through GPUs sequentially.

Best for: Model is too large for one GPU.

3. Pipeline Parallelism

GPU 0: [Batch1:L0-5] → [Batch2:L0-5] → ...
GPU 1:                [Batch1:L6-11] → [Batch2:L6-11] → ...

Like model parallelism, but with micro-batches to keep GPUs busy (reduce idle time when earlier stages compute).

Best for: Large models where GPU utilisation matters.

When NOT to Distribute

Model is <7B params → single GPU is fine
Dataset is <10GB → training is fast enough
You're doing fine-tuning → single GPU with LoRA/QLoRA
You only have CPU → llama.cpp does CPU training fine

For 99% of the AI systems you'll design, single-device training is sufficient.

Hands-on (15 min)

Simulate Parallel vs Sequential Workloads

#!/usr/bin/env python3
"""parallel-simulation.py — compare sequential vs parallel execution."""
import time
import threading
import random

# Stub — Ayva will expand with:
# - Real distributed training concepts (All-Reduce, Ring topology)
# - Communication overhead modelling
# - Scaling efficiency curves (Amdahl's Law)
# - When to use FSDP/DeepSpeed
# - LoRA vs full fine-tuning tradeoffs

def simulate_training_worker(worker_id: int, data_size: float):
    """Simulate training on a chunk of data."""
    compute = data_size * random.uniform(1.0, 1.5)
    time.sleep(compute * 0.1)  # scale down for demo
    return worker_id, compute

def sequential(data_sizes):
    results = []
    for i, size in enumerate(data_sizes):
        results.append(simulate_training_worker(i, size))
    return results

def parallel(data_sizes):
    threads = []
    results = [None] * len(data_sizes)

    def worker(i, size):
        results[i] = simulate_training_worker(i, size)

    for i, size in enumerate(data_sizes):
        t = threading.Thread(target=worker, args=(i, size))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

# Simulate with different splits
data_sizes = [10, 12, 8, 15, 9, 11]

t0 = time.time()
seq_results = sequential(data_sizes)
seq_time = time.time() - t0
print(f"Sequential: {seq_time:.2f}s")

t0 = time.time()
par_results = parallel(data_sizes)
par_time = time.time() - t0
print(f"Parallel:   {par_time:.2f}s")
print(f"Speedup:    {seq_time/par_time:.2f}x")

# Amdahl's Law visualisation
def amdahl_speedup(p: float, n: int):
    """p = proportion that can be parallelised, n = processors."""
    return 1 / ((1 - p) + p / n)

print("\nAmdahl's Law — max speedup by parallel proportion:")
for p in [0.5, 0.8, 0.9, 0.95, 0.99]:
    print(f"   {p*100:.0f}% parallel: {amdahl_speedup(p, 4):.2f}x with 4 workers, "
          f"{amdahl_speedup(p, 8):.2f}x with 8 workers")

Questions for Ayva: - When does data parallelism communication overhead outweigh benefits? - How do LoRA adapters reduce the need for distributed training? - What's the practical scaling efficiency of FSDP vs DeepSpeed ZeRO?

Key Takeaways

Three parallelism strategies: data (same model, more data), model (split layers), pipeline (split + pipeline)
Communication overhead is the main bottleneck in distributed training
Amdahl's Law shows diminishing returns — 8 workers with 90% parallelism is only 4.7x faster
For 99% of real systems, single-device training with LoRA is sufficient

🧠 AI System Design