Day 13: Experiment Tracking
Learning Objectives
- Understand why ad-hoc experiment tracking leads to irreproducible results
- Learn the MLflow workflow: log params, metrics, artifacts, model registry
- Integrate MLflow (SQLite backend) with a training script
Theory (15 min)
The Reproducibility Crisis
Most ML projects fail at reproduction:
- "Which hyperparameters produced that 92% accuracy run?"
- "What data was used for this model?"
- "Where is the training script version that generated this?"
A single source of truth for experiments fixes this.
What to Track
| Type | Examples | Why |
|---|---|---|
| Parameters | lr=3e-5, batch_size=16, model=bert-base | Reproduce training |
| Metrics | loss, accuracy, tokens/sec, GPU mem | Compare runs |
| Artifacts | model.pt, tokenizer, config.json | Deploy from any run |
| Source | git commit, diff, script hash | Audit trail |
| Environment | Python version, CUDA version, GPU type | Debug hardware issues |
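The Source and Environment rows can be gathered with the standard library alone. The sketch below is one possible approach, not an MLflow convention: the function names and dictionary keys are illustrative assumptions, and CUDA version and GPU type are left out because how to query them depends on the framework in use.

```python
import hashlib
import platform
import subprocess
from typing import Optional


def git_commit() -> str:
    """Return the current git commit hash, or "unknown" if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository.
        return "unknown"


def file_sha256(path: str) -> str:
    """Hash a file so the exact script version can be identified later."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def collect_run_context(script_path: Optional[str] = None) -> dict:
    """Gather source and environment facts as a flat dict of strings."""
    context = {
        "git_commit": git_commit(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
    if script_path is not None:
        context["script_sha256"] = file_sha256(script_path)
    return context
```

Inside an active run, `mlflow.set_tags(collect_run_context(__file__))` would attach these values to the run as tags.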
MLflow Components
MLflow Tracking ──▶ Log params, metrics, artifacts (runs)
MLflow Registry ──▶ Staging → Production transitions
MLflow Models ──▶ Standard model packaging format
Lightweight setup: MLflow with SQLite backend (no extra infrastructure).
Workflow
1. Run training → MLflow logs params + metrics + model weights
2. Compare runs in UI → pick best validation loss
3. Register model → promote to the "Production" stage
4. Deploy from registry → consistent artifact path
Hands-on (15 min)
Integrate MLflow with a Training Script
pip install mlflow
#!/usr/bin/env python3
"""mlflow-tracking.py: experiment tracking with MLflow."""
import json
import os
import random
import time

import mlflow

# Stub. Ayva will expand with:
# - Real model training logged to MLflow
# - Hyperparameter sweeps (GridSearch / Optuna)
# - Model registry: staging → production promotion
# - Artifact logging (model weights, tokenizer, config)
# - MLflow UI setup (mlflow server)
# - Compare runs and select best
# - Integration with the fault-tolerant training from Day 12

# Set tracking URI (SQLite)
mlflow.set_tracking_uri("sqlite:///mlruns.db")
mlflow.set_experiment("ai-system-design")


def train_with_tracking():
    with mlflow.start_run(run_name=f"run-{int(time.time())}"):
        # Log parameters
        params = {
            "learning_rate": 3e-5,
            "batch_size": 16,
            "epochs": 5,
            "model_name": "qwen2.5-3b",
            "lora_rank": 8,
            "lora_alpha": 16,
            "dataset": "code-alpaca-5k",
        }
        mlflow.log_params(params)
        print(f"Logged params: {params}")

        # Simulate training and log metrics (random values stand in for a real model)
        for epoch in range(params["epochs"]):
            train_loss = max(0.5, 2.0 / (epoch + 1) + random.uniform(-0.1, 0.1))
            val_loss = max(0.6, 2.2 / (epoch + 1) + random.uniform(-0.1, 0.1))
            accuracy = min(0.95, 0.5 + epoch * 0.09 + random.uniform(-0.02, 0.02))
            mlflow.log_metrics({
                "train_loss": train_loss,
                "val_loss": val_loss,
                "accuracy": accuracy,
            }, step=epoch)
            print(f"  epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}, acc={accuracy:.3f}")
            time.sleep(0.2)

        # Log a dummy artifact (real: model weights)
        artifact_dir = "./artifacts"
        os.makedirs(artifact_dir, exist_ok=True)
        with open(os.path.join(artifact_dir, "config.json"), "w") as f:
            json.dump(params, f)
        mlflow.log_artifacts(artifact_dir)  # uploads every file in the directory
        print(f"Logged artifacts from {artifact_dir}")

        # Capture the run ID while the run is still active
        run_id = mlflow.active_run().info.run_id

    print(f"\nRun complete! run_id: {run_id}")
    print("  View: mlflow ui --backend-store-uri sqlite:///mlruns.db")
    return run_id


if __name__ == "__main__":
    train_with_tracking()

    # View runs
    print("\nRecent runs:")
    runs = mlflow.search_runs(experiment_names=["ai-system-design"])
    for _, run in runs.iterrows():
        print(f"  {run['run_id'][:8]}... acc={run.get('metrics.accuracy', 'N/A')}")
View the MLflow UI:
mlflow ui --backend-store-uri sqlite:///mlruns.db --port 5002
Questions for Ayva:
- How to integrate MLflow with distributed training (autologging)?
- What's the best practice for model registry promotion workflow?
- When should you use MLflow vs W&B vs Neptune?
Key Takeaways
- Experiment tracking is non-negotiable for reproducible ML: log everything
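- MLflow with SQLite needs no extra infrastructure and is enough for local, single-machine work; a shared tracking server becomes worthwhile once several people log runs concurrently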
- Track: params, metrics, artifacts, source code, environment
- Model registry (staging → prod transitions) enables structured deployment