Day 10: RAG Architecture
Learning Objectives
- Understand the full RAG pipeline: retrieve → rerank → generate
- Learn the critical design decisions (chunking, top-k, reranking)
- Build a RAG pipeline over your Obsidian vault
Theory (15 min)
The RAG Pipeline
RAG = Retrieval-Augmented Generation. Instead of asking the LLM to know everything, give it relevant documents at query time.
Query ──▶ Embedder ──▶ Vector Search ──▶ Top-K chunks
                                              │
                                              ▼
                                   ┌─────────────────────┐
                                   │      Reranker       │── Reorder by relevance
                                   └─────────────────────┘
                                              │
                                              ▼
                                   ┌─────────────────────┐
                                   │   Prompt Builder    │── System + context + query
                                   └─────────────────────┘
                                              │
                                              ▼
                                        LLM ──▶ Answer
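The flow in the diagram can be sketched as one function that wires the stages together. This is a minimal illustration, not a reference implementation: `rag_answer`, the toy stand-ins, and the `NOTES` list are all hypothetical names, and the real retriever, reranker, and LLM call would be plugged in where the callables are passed.

```python
from typing import Callable, List


def rag_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str], str],
    top_k: int = 20,
    keep: int = 5,
) -> str:
    """Run retrieve -> rerank -> prompt build -> generate for one query."""
    candidates = retrieve(query, top_k)      # fast, approximate search
    best = rerank(query, candidates)[:keep]  # slower, more precise ordering
    context = "\n\n".join(best)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)


# Toy stand-ins (assumptions) so the sketch runs without a model or vector store.
NOTES = [
    "Sourdough needs a starter.",
    "Qdrant is a vector database.",
    "Chroma stores embeddings.",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    # A real retriever would embed the query and search an index.
    return NOTES[:k]


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    # Order by shared words; a real reranker would use a cross-encoder.
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(words & set(d.lower().rstrip(".").split())))


if __name__ == "__main__":
    # The "LLM" here just echoes the prompt so the assembled context is visible.
    print(rag_answer("vector database", toy_retrieve, toy_rerank, lambda p: p, keep=1))
```

Passing the stages as callables keeps each knob (retriever, reranker, generator) swappable without touching the orchestration.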
The Critical Knobs
1. Chunking strategy
   - Fixed-size: simple, but may split concepts
   - Semantic: split at natural boundaries (paragraphs, sections, H2 headers)
   - Agentic: use an LLM to decide where to split (expensive but best quality)
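To make the first two strategies concrete, here is a small sketch of fixed-size chunking with overlap and of header-based splitting for Markdown. The function names and default sizes are illustrative assumptions. The header splitter is deliberately naive: it would also split on `#` lines inside fenced code blocks.

```python
import re
from typing import List


def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size character windows; the overlap softens cuts through a concept."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def chunk_by_headers(markdown: str, max_level: int = 2) -> List[str]:
    """Split before each Markdown header of level <= max_level (H1/H2 by default)."""
    # Zero-width match at the start of any line that begins with 1..max_level '#' and a space.
    pattern = re.compile(rf"^(?=#{{1,{max_level}}} )", flags=re.MULTILINE)
    return [part.strip() for part in pattern.split(markdown) if part.strip()]


if __name__ == "__main__":
    note = "# Title\nintro\n## A\nalpha\n### A1\ndeep\n## B\nbeta\n"
    for chunk in chunk_by_headers(note):
        print(repr(chunk))
```

With `max_level=2`, the H3 section stays inside its parent H2 chunk, which tends to keep related material together.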
2. Top-K (how many chunks to retrieve)
   - Too few: missing information
   - Too many: context window overflow, distraction for the LLM
   - Sweet spot: 3-10 chunks depending on domain
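Top-K retrieval itself is just "score every chunk against the query, keep the K best". The sketch below uses brute-force cosine similarity over plain Python lists. A vector database would do the same job with an approximate index. The names `cosine` and `top_k` and the example vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.0], vectors, k=2))
```

Because `k` is a plain parameter, it is easy to sweep values in the 3-10 range and compare answer quality on a few test questions.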
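3. Reranking
   The retriever uses cheap, fast embeddings: the query and each document are embedded separately and compared by vector similarity. The reranker uses a cross-encoder, which scores each query-document pair jointly. It is more expensive but more accurate, so it is applied only to reorder the top results.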
Retriever finds 20 docs (fast, cheap embedding search)
Reranker reorders top 20 → picks best 5 (slow, accurate cross-encoder)
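The two-stage pattern above can be sketched with the scorer passed in as a function. In practice the scorer would be a cross-encoder that takes (query, passage) pairs. Here `overlap_scorer` is a hypothetical stand-in based on word overlap so the example runs without a model, and `rerank` and `PairScorer` are illustrative names.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, passage) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_pairs: PairScorer, keep: int = 5) -> List[str]:
    """Score every (query, candidate) pair jointly and keep the best `keep` candidates."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]


def overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (assumption): fraction of query words present in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(query.lower().split())
        p_words = set(passage.lower().split())
        scores.append(len(q_words & p_words) / max(len(q_words), 1))
    return scores


if __name__ == "__main__":
    docs = ["cats sleep a lot", "how to bake bread at home", "bread needs yeast"]
    print(rerank("bake bread", docs, overlap_scorer, keep=2))
```

The cost trade-off is visible in the shape of the code: the scorer runs once per candidate, so it only sees the retriever's shortlist, never the whole corpus.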
4. Prompt template
   How you present retrieved context matters enormously:
Good: "Answer based on the following context. If the context doesn't contain
the answer, say 'I don't know'. Here is the context: {context}"
Bad: "Here's some stuff: {context}. Now answer: {query}"
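A prompt builder along the lines of the "Good" template might look like the sketch below. The numbering of chunks and the exact section labels are illustrative assumptions, not a prescribed format. `SYSTEM` and `build_prompt` are hypothetical names.

```python
from typing import List

SYSTEM = (
    "Answer based only on the context below. "
    "If the context doesn't contain the answer, say 'I don't know'."
)


def build_prompt(query: str, chunks: List[str]) -> str:
    """Assemble instructions, numbered context chunks, and the question into one prompt."""
    numbered = "\n\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))
    return f"{SYSTEM}\n\nContext:\n{numbered}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("What is Qdrant?", ["Qdrant is a vector database.", " Chroma stores embeddings. "]))
```

Numbering the chunks makes it possible to ask the model to cite which chunk it used, which helps when debugging retrieval.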
Common RAG Failure Modes
| Failure | Cause | Fix |
|---|---|---|
| Missing context | Retrieval missed relevant docs | Increase top-K, hybrid search |
| Irrelevant context | Noise in retrieval | Reranker, better chunking |
| LLM ignores context | Strong model priors | Better prompt engineering |
| Lost-in-the-middle | Context in middle of prompt | Put most relevant docs at start/end |
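The lost-in-the-middle fix from the table can be sketched as a reordering step between the reranker and the prompt builder. The input is assumed to be sorted best-first. `reorder_for_edges` is a hypothetical helper name.

```python
from typing import List


def reorder_for_edges(ranked: List[str]) -> List[str]:
    """Place the strongest chunks at the start and end, the weakest in the middle.

    `ranked` must be ordered from most to least relevant.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    print(reorder_for_edges(["A", "B", "C", "D", "E"]))
```

For five chunks ranked A to E, the best chunk lands first, the second best lands last, and the least relevant one sits in the middle.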
Hands-on (15 min)
Build RAG Over Your Obsidian Vault
#!/usr/bin/env python3
"""obsidian-rag.py: RAG over your Obsidian vault."""
from pathlib import Path

# Stub. Ayva will expand with:
# - Walk Obsidian vault (/opt/obsidian-vault)
# - Parse markdown frontmatter + body
# - Chunk by headers (smart splitting)
# - Embed with local model (llama.cpp embeddings or sentence-transformers)
# - Index in Qdrant/Chroma
# - Retriever with configurable top-K
# - Cross-encoder reranker
# - Prompt builder that formats context nicely
# - Answer generation via LLM

VAULT_PATH = Path("/opt/obsidian-vault")


def scan_vault():
    """Find every markdown file in the vault and print a per-folder count."""
    files = list(VAULT_PATH.rglob("*.md"))
    print(f"Found {len(files)} markdown files in vault")
    # Show folder breakdown (the vault root shows up as ".")
    folders = {}
    for f in files:
        folder = f.parent.relative_to(VAULT_PATH)
        folders[folder] = folders.get(folder, 0) + 1
    for folder, count in sorted(folders.items()):
        print(f"  📁 {folder}: {count} files")
    return files


if __name__ == "__main__":
    if VAULT_PATH.exists():
        scan_vault()
    else:
        # The script takes no arguments, so the path has to be changed in the source.
        print(f"Vault at {VAULT_PATH} not found. Edit VAULT_PATH to point at your vault.")
Questions for Ayva:
- Best chunking strategy for Obsidian notes (headers, tags, frontmatter)?
- How to handle file updates (incremental indexing vs full re-index)?
- Optimal prompt template for answering from personal notes?
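As one possible starting point for the incremental-indexing question, the sketch below compares content hashes against a previously saved manifest to decide which notes need re-embedding. This is an assumption about how it could be done, not a settled design. `file_digest` and `plan_reindex` are hypothetical names, and persisting the manifest (for example as JSON) is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to detect edits regardless of timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(
    vault: Path, previous: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare the vault against the last manifest.

    Returns (new or changed files, removed files, current manifest).
    Keys are vault-relative POSIX paths; values are content hashes.
    """
    current = {
        p.relative_to(vault).as_posix(): file_digest(p) for p in vault.rglob("*.md")
    }
    changed = sorted(name for name, digest in current.items() if previous.get(name) != digest)
    removed = sorted(name for name in previous if name not in current)
    return changed, removed, current
```

Only the files in `changed` would need re-chunking and re-embedding, and the chunks belonging to `removed` files would be deleted from the index. A full re-index is the same call with an empty `previous` manifest.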
Key Takeaways
- RAG pipelines are the most practical way to ground LLM responses in your own data
- The three stages (retrieve, rerank, generate) each have critical configuration knobs
- Reranking is often one of the most cost-effective accuracy improvements, because the expensive cross-encoder runs only on the retriever's shortlist
- Prompt engineering is essential — the LLM needs clear instructions on how to use context