🧠 AI System Design

Day 10: RAG Architecture

📂 Data & Training 📖 15 min read Needs expansion

Learning Objectives

  • Understand the full RAG pipeline: retrieve → rerank → generate
  • Learn the critical design decisions (chunking, top-k, reranking)
  • Build a RAG pipeline over your Obsidian vault

Theory (15 min)

The RAG Pipeline

RAG = Retrieval-Augmented Generation. Instead of asking the LLM to know everything, give it relevant documents at query time.

Query ──▶ Embedder ──▶ Vector Search ──▶ Top-K chunks
                               │
                               ▼
                    ┌─────────────────────┐
                    │    Reranker         │── Reorder by relevance
                    └─────────────────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │  Prompt Builder     │── System + context + query
                    └─────────────────────┘
                               │
                               ▼
                           LLM ──▶ Answer
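
The diagram can also be read as a plain function composition. The sketch below wires the stages together through injected callables. Every name here (`rag_answer`, `fetch_k`, `keep_n`, and the `toy_*` stand-ins) is hypothetical and chosen for illustration; none of it is a real library's API, and the toy stages exist only so the flow runs end to end.

```python
from typing import Callable, List


def rag_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    build_prompt: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
    fetch_k: int = 20,
    keep_n: int = 5,
) -> str:
    """Run retrieve -> rerank -> prompt -> generate, mirroring the diagram."""
    candidates = retrieve(query, fetch_k)       # embed the query + vector search
    best = rerank(query, candidates)[:keep_n]   # reorder, keep the best few
    prompt = build_prompt(query, best)          # system + context + query
    return generate(prompt)                     # LLM call


# Toy stand-ins (assumptions for illustration only).
NOTES = ["Bananas are yellow.", "RAG retrieves documents at query time."]


def toy_retrieve(query: str, k: int) -> List[str]:
    return NOTES[:k]


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)


def toy_prompt(query: str, docs: List[str]) -> str:
    return "Context:\n" + "\n".join(docs) + f"\nQuestion: {query}"


def toy_llm(prompt: str) -> str:
    # A real model call would go here; this stub just echoes the top context line.
    return "stub answer using: " + prompt.splitlines()[1]


pipeline_answer = rag_answer(
    "what does RAG do at query time", toy_retrieve, toy_rerank, toy_prompt, toy_llm
)
```

Passing the stages in as callables keeps each knob swappable: a different retriever or reranker can be dropped in without touching the orchestration.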

The Critical Knobs

1. Chunking strategy

  • Fixed-size: simple, but may split concepts
  • Semantic: split at natural boundaries (paragraphs, sections, H2 headers)
  • Agentic: use an LLM to decide where to split (expensive but best quality)
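
The first two strategies can be sketched in a few lines. This is a minimal illustration, not a production splitter: `chunk_by_headers`, `split_fixed`, `max_chars`, and `overlap` are names invented here, and the header regex will also fire on `#` lines inside fenced code blocks, which a real implementation would need to skip.

```python
import re
from typing import List

# Matches a markdown ATX header line such as "## Title".
HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def chunk_by_headers(markdown: str) -> List[str]:
    """Semantic chunking: start a new chunk at every markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADER_RE.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def split_fixed(text: str, max_chars: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size chunking; the overlap repeats text across each boundary."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]


note = "# RAG\nIntro text.\n\n## Chunking\nSplit at headers.\n\n## Top-K\nPick k."
header_chunks = chunk_by_headers(note)
fixed_chunks = split_fixed("abcdefghij", max_chars=4, overlap=1)
```

A common combination is to split by headers first and fall back to `split_fixed` only for sections that are still too long for the embedding model.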

2. Top-K (how many chunks to retrieve)

  • Too few: missing information
  • Too many: context window overflow, distraction for the LLM
  • Sweet spot: 3-10 chunks depending on domain
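
To make the top-K knob concrete, here is a small sketch of similarity search with a configurable `k`. It assumes a toy bag-of-words "embedding" (`toy_embed`, a stand-in invented for this example) so it runs without any model; a real pipeline would call an embedding model and a vector index instead.

```python
import math
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts (a real system would call a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 5) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "chunking splits notes into pieces",
    "top-k controls how many chunks are retrieved",
    "the reranker reorders retrieved chunks",
]
hits = top_k("how many chunks are retrieved", chunks, k=2)
```

Raising `k` here would also return the unrelated first chunk with a score of zero, which is the "distraction" cost described above.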

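3. Reranking

The retriever uses cheap, fast embeddings: the query and each chunk are embedded separately and compared by vector similarity. The reranker uses a cross-encoder, which reads the query and a chunk together (more expensive but more accurate), to reorder the top results.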

Retriever finds 20 docs (fast, cheap embedding search)
Reranker reorders top 20 → picks best 5 (slow, accurate cross-encoder)
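
The two-stage idea can be sketched with the scorer passed in as a function. `rerank`, `score_pairs`, and `toy_score_pairs` are hypothetical names; the toy scorer uses word overlap purely so the example runs offline and is not a cross-encoder. With the sentence-transformers library, a `CrossEncoder` model's `predict` method accepts a list of (query, passage) pairs and could be passed as `score_pairs`, though that wiring is not shown or tested here.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> List[str]:
    """Score every (query, candidate) pair and keep the top_n candidates."""
    pairs = [(query, c) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [c for _, c in ranked[:top_n]]


def toy_score_pairs(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer: fraction of query words that appear in the candidate."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        overlap = q_words & set(doc.lower().split())
        scores.append(len(overlap) / max(len(q_words), 1))
    return scores


candidates = [
    "obsidian stores notes as markdown files",
    "a cross-encoder scores the query and document together",
    "the reranker uses a cross-encoder to reorder results",
]
best = rerank("which model does the reranker use", candidates, toy_score_pairs, top_n=2)
```

Because the scorer runs once per candidate, its cost grows with the number of retrieved chunks, which is why it is applied to the top 20 rather than the whole index.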

4. Prompt template

How you present retrieved context matters enormously:

Good: "Answer based on the following context. If the context doesn't contain
       the answer, say 'I don't know'. Here is the context: {context}
       Question: {query}"
Bad:  "Here's some stuff: {context}. Now answer: {query}"
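
A prompt builder along these lines might look like the sketch below. The template wording follows the "Good" example; the numbered `[1]`, `[2]` chunk labels and the names `PROMPT_TEMPLATE` and `build_prompt` are assumptions added for illustration.

```python
from typing import List

PROMPT_TEMPLATE = (
    "Answer based on the following context. If the context doesn't contain "
    "the answer, say 'I don't know'.\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}"
)


def build_prompt(query: str, chunks: List[str]) -> str:
    """Format retrieved chunks as a numbered context block, then the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, query=query)


prompt = build_prompt(
    "What is top-k?",
    ["Top-k is the number of chunks retrieved.", "Rerankers reorder chunks."],
)
```

Numbering the chunks keeps the boundaries between sources visible and gives the model something to cite if you later ask it to reference which chunk an answer came from.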

Common RAG Failure Modes

Failure             | Cause                          | Fix
Missing context     | Retrieval missed relevant docs | Increase top-K, hybrid search
Irrelevant context  | Noise in retrieval             | Reranker, better chunking
LLM ignores context | Strong model priors            | Better prompt engineering
Lost-in-the-middle  | Context in middle of prompt    | Put most relevant docs at start/end
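
The lost-in-the-middle fix can be sketched as a reordering step applied after reranking. This is one simple interleaving scheme, offered as an assumption rather than a standard algorithm; `reorder_for_long_context` is a hypothetical name, and the input is expected to be sorted best first.

```python
from typing import List


def reorder_for_long_context(chunks_best_first: List[str]) -> List[str]:
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = reorder_for_long_context(["r1", "r2", "r3", "r4", "r5"])
```

Here `ordered` is `["r1", "r3", "r5", "r4", "r2"]`: the best chunk opens the context, the second best closes it, and the weakest sits in the middle.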

Hands-on (15 min)

Build RAG Over Your Obsidian Vault

#!/usr/bin/env python3
"""obsidian-rag.py: RAG over your Obsidian vault."""
import sys
from collections import Counter
from pathlib import Path

# Stub. Ayva will expand with:
# - Walk Obsidian vault (/opt/obsidian-vault)
# - Parse markdown frontmatter + body
# - Chunk by headers (smart splitting)
# - Embed with local model (llama.cpp embeddings or sentence-transformers)
# - Index in Qdrant/Chroma
# - Retriever with configurable top-K
# - Cross-encoder reranker
# - Prompt builder that formats context nicely
# - Answer generation via LLM

VAULT_PATH = Path("/opt/obsidian-vault")  # default; override on the command line

def scan_vault(vault_path=VAULT_PATH):
    """List the vault's markdown files and print a per-folder count."""
    files = sorted(vault_path.rglob("*.md"))
    print(f"Found {len(files)} markdown files in vault")
    # Folder breakdown, relative to the vault root ("." is the root itself)
    folders = Counter(f.parent.relative_to(vault_path) for f in files)
    for folder, count in sorted(folders.items()):
        print(f"  📁 {folder}: {count} files")
    return files

if __name__ == "__main__":
    # An optional first argument overrides the default vault location
    vault_path = Path(sys.argv[1]) if len(sys.argv) > 1 else VAULT_PATH
    if vault_path.is_dir():
        scan_vault(vault_path)
    else:
        print(f"Vault at {vault_path} not found; pass your vault path as the first argument")
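
As a possible next step for the "parse markdown frontmatter + body" item in the stub, the sketch below separates a note's frontmatter from its body. It is deliberately simplified: `split_frontmatter` is a hypothetical helper that only understands flat `key: value` lines, so lists, nested keys, and other YAML features would need a real YAML parser.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a note into (frontmatter, body); only flat 'key: value' lines are parsed."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top of the file
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, text  # no closing delimiter: treat the whole file as body
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


meta, body = split_frontmatter("---\ntitle: RAG notes\ntags: ai\n---\n# RAG\nBody text.")
```

Keeping the frontmatter as metadata next to each chunk, rather than embedding it as text, makes it possible to filter retrieval by tag or title later.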

Questions for Ayva:

  • Best chunking strategy for Obsidian notes (headers, tags, frontmatter)?
  • How to handle file updates (incremental indexing vs full re-index)?
  • Optimal prompt template for answering from personal notes?
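
On the incremental-indexing question, one common approach is to keep a manifest of content hashes and re-embed only what changed. The sketch below shows just the planning step under that assumption; `content_hash`, `plan_reindex`, and the manifest layout are hypothetical, and the actual upsert and delete calls depend on the vector store you choose.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a note's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current: Dict[str, str],   # path -> current file text
    manifest: Dict[str, str],  # path -> hash stored at the last indexing run
) -> Tuple[List[str], List[str]]:
    """Return (paths to embed again, paths to delete from the index)."""
    to_upsert = sorted(
        path for path, text in current.items()
        if manifest.get(path) != content_hash(text)
    )
    to_delete = sorted(path for path in manifest if path not in current)
    return to_upsert, to_delete


manifest = {
    "a.md": content_hash("alpha"),
    "b.md": content_hash("beta"),
    "old.md": content_hash("gone"),
}
current = {"a.md": "alpha", "b.md": "beta, edited", "new.md": "fresh"}
to_upsert, to_delete = plan_reindex(current, manifest)
```

When a changed file is re-embedded, its old chunks should be removed from the index first, since an edit can shift chunk boundaries and leave stale chunks behind.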


Key Takeaways

  • RAG pipelines are the most practical way to ground LLM responses in your own data
  • The three stages (retrieve, rerank, generate) each have critical configuration knobs
  • Reranking is often one of the most cost-effective accuracy improvements: the expensive cross-encoder only scores the small candidate set the retriever returns
  • Prompt engineering is essential — the LLM needs clear instructions on how to use context

References