Day 14: Mini-Project - End-to-End RAG
Learning Objectives
- Integrate everything from Week 2 into a production-grade RAG pipeline
- Understand the end-to-end flow from document ingestion to answer generation
- Build and test a fully containerised RAG service
Theory (15 min)
The Full Stack
                       ┌──────────────────────────────────────────┐
                       │               RAG Service                │
                       │                                          │
              ┌──────┐ │  ┌────────┐    ┌──────┐                  │
Documents ──▶ │Import├─┼─▶│Retrieve├───▶│Rerank│                  │
              └──────┘ │  └────────┘    └──┬───┘                  │
                       │                   │                      │
                       │  ┌────────────────▼───┐                  │
                       │  │   Prompt Builder   │                  │
                       │  └────────────────┬───┘                  │
                       │                   │                      │
                       │  ┌────────────────▼───┐                  │
                       │  │   LLM Generator    ├──────────────────┼─▶ Answer
                       │  └────────────────────┘                  │
                       └──────────────────────────────────────────┘
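The flow in the diagram can be sketched as plain function composition. This is a minimal, illustrative sketch rather than the service's implementation: `Chunk`, `retrieve`, `rerank`, `build_prompt`, `generate`, and `answer` are hypothetical names, word overlap stands in for embedding search, Jaccard similarity stands in for a cross-encoder, and `generate` returns a placeholder instead of calling an LLM.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # provenance: file name, URL, or similar
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set; a toy stand-in for an embedding."""
    return set(text.lower().split())


def retrieve(query: str, index: list[Chunk], k: int) -> list[Chunk]:
    """Keep the k chunks that share the most words with the query."""
    q = tokens(query)
    ranked = sorted(index, key=lambda c: len(q & tokens(c.text)), reverse=True)
    return ranked[:k]


def rerank(query: str, chunks: list[Chunk], k: int) -> list[Chunk]:
    """Re-score candidates with Jaccard similarity (cross-encoder stand-in)."""
    q = tokens(query)

    def score(chunk: Chunk) -> float:
        t = tokens(chunk.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(chunks, key=score, reverse=True)[:k]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Number each chunk and keep its source so the answer can cite it."""
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, 1)
    )
    return f"Answer using only the numbered context.\n{context}\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for the LLM call (assumption: no model is wired in here)."""
    return "[LLM output would go here]"


def answer(query: str, index: list[Chunk], retrieve_k: int = 5, generate_k: int = 3) -> dict:
    candidates = retrieve(query, index, retrieve_k)   # wide net
    top = rerank(query, candidates, generate_k)       # narrow to the best few
    prompt = build_prompt(query, top)
    return {"answer": generate(prompt), "sources": [c.source for c in top], "prompt": prompt}


index = [
    Chunk("qdrant.md", "Qdrant stores vectors in collections"),
    Chunk("chroma.md", "Chroma is a simple embedded vector store"),
    Chunk("cooking.md", "Boil the pasta for ten minutes"),
]
result = answer("how does qdrant store vectors", index, retrieve_k=2, generate_k=1)
print(result["sources"])  # prints ['qdrant.md']
```

Retrieving more candidates than the generator will see, then narrowing with a second scorer, mirrors the Retrieve, Rerank, Prompt Builder order in the diagram.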
Architecture Decisions
| Component | Options | Selection criteria |
|---|---|---|
| Embedding | sentence-transformers, OpenAI, llama.cpp embeddings | Local-first, cost control |
| Vector DB | Qdrant, Chroma, pgvector | Scale + features (Qdrant), simplicity (Chroma) |
| Reranker | cross-encoder/ms-marco-MiniLM, Cohere | Accuracy > cost |
| LLM | llama.cpp, vLLM, OpenAI | Latency/cost balance |
| Framework | LangChain, LlamaIndex, custom | Custom (max control, minimal deps) |
Quality Metrics
Measure your RAG pipeline against these targets:
| Metric | How | Target |
|---|---|---|
| Retrieval recall | % of all relevant docs that appear in the top-K | >85% |
| Answer relevance | LLM self-evaluation | >80% |
| Precision | % of retrieved docs actually relevant | >70% |
| Latency p50 | Median time from query to answer | <3s |
| Latency p99 | 99th-percentile time from query to answer (near worst case) | <10s |
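The retrieval and latency rows of the table can be computed with a few lines of plain Python. This is a sketch under assumptions: `recall_at_k`, `precision_at_k`, and `percentile` are hypothetical helper names, document IDs are plain strings, relevance labels come from a hand-labelled set, and the percentile uses the nearest-rank method (other definitions interpolate and give slightly different values).

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


retrieved = ["a", "b", "c", "d"]          # ranked output of the retriever
relevant = {"a", "c", "e"}                # hand-labelled ground truth
latencies = [0.8, 1.2, 0.9, 4.0, 1.1]     # seconds per query (made-up numbers)

print(recall_at_k(retrieved, relevant, 3))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 3))  # 2 of 3 results relevant
print(percentile(latencies, 50), percentile(latencies, 99))
```

With only a handful of samples the p99 equals the slowest request, so a meaningful p99 needs at least a few hundred timed queries.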
Hands-on (15 min)
Build the Integrated RAG Service
#!/usr/bin/env python3
"""rag-service.py - end-to-end RAG pipeline service."""
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import os

# Stub - Ayva will expand with:
# - Full ingestion pipeline: file -> chunk -> embed -> index
# - Hybrid search (dense + BM25/keyword)
# - Cross-encoder reranker
# - Prompt template with provenance tracking
# - Generator with cited sources
# - REST API (POST /query, POST /ingest, GET /health)
# - Docker container with Qdrant sidecar
# - Comprehensive testing with Obsidian vault data

RERANK_TOP_K = int(os.getenv("RERANK_TOP_K", "5"))  # candidates fetched for reranking
GEN_TOP_K = int(os.getenv("GEN_TOP_K", "3"))  # chunks passed to the generator


class RAGHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_len = int(self.headers.get("Content-Length", 0))
        # Reject malformed or non-object bodies instead of crashing the handler.
        try:
            body = json.loads(self.rfile.read(content_len) or b"{}")
        except json.JSONDecodeError:
            self._respond(400, {"error": "request body must be valid JSON"})
            return
        if not isinstance(body, dict):
            self._respond(400, {"error": "request body must be a JSON object"})
            return
        if self.path == "/query":
            answer = self.handle_query(body.get("query", ""))
        elif self.path == "/ingest":
            answer = self.handle_ingest(body)
        else:
            self.send_error(404)
            return
        self._respond(200, answer)

    def handle_query(self, query: str) -> dict:
        # TODO:
        # 1. Embed query
        # 2. Search vector DB (top RERANK_TOP_K)
        # 3. Rerank with cross-encoder
        # 4. Build prompt with top GEN_TOP_K chunks
        # 5. Generate with LLM
        # 6. Return answer + sources
        return {
            "query": query,
            "answer": "[RAG pipeline placeholder - Ayva to implement]",
            "sources": [],
            "tokens_used": 0,
        }

    def handle_ingest(self, body: dict) -> dict:
        # TODO:
        # 1. Accept text/file URL/document
        # 2. Chunk by configured strategy
        # 3. Embed each chunk
        # 4. Upsert to vector DB
        return {"status": "ok", "chunks_indexed": 0}

    def _respond(self, status, data):
        payload = json.dumps(data).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Only start the server when run as a script, so the module stays importable in tests.
    PORT = int(os.getenv("RAG_PORT", "8010"))
    HTTPServer(("0.0.0.0", PORT), RAGHandler).serve_forever()
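Step 2 of `handle_ingest` (chunking) could look something like the sliding word window below. This is a minimal sketch, assuming whitespace-delimited words as the unit; `chunk_words`, `size`, and `overlap` are hypothetical names, and the default values are placeholders rather than recommended settings. A markdown-aware splitter that respects headings would be a different strategy.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk, at the cost of indexing some text twice.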
Questions for Ayva:
- What's the optimal chunk overlap for markdown documents?
- How should multi-turn RAG (follow-up questions) be handled?
- What evaluation methodology proves the pipeline works (NDCG, MRR, human eval)?
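For the evaluation question, the two ranking metrics named above can be computed as follows. This is a sketch, not a chosen methodology: `reciprocal_rank`, `mrr`, `dcg`, and `ndcg_at_k` are hypothetical helper names, relevance labels are assumed to be hand-assigned, and the DCG variant here uses linear gain (some tools use `2**gain - 1` instead).

```python
import math


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was returned."""
    for rank, doc in enumerate(ranked, 1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mrr(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, 1))


def ndcg_at_k(ranked: list[str], gain_by_doc: dict[str, float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    actual = [gain_by_doc.get(doc, 0.0) for doc in ranked[:k]]
    ideal = sorted(gain_by_doc.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


runs = [(["a", "x"], {"a"}), (["x", "a"], {"a"})]
print(mrr(runs))  # 0.75: first relevant hit at rank 1, then at rank 2

grades = {"a": 2.0, "b": 1.0}  # graded relevance labels (made-up)
print(ndcg_at_k(["a", "b"], grades, 2))  # 1.0 for the ideal ordering
print(ndcg_at_k(["b", "a"], grades, 2))  # below 1.0 when the order is swapped
```

MRR only looks at the first relevant hit, while NDCG rewards placing the most relevant documents highest, so the two can disagree on which retriever configuration is better.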
Key Takeaways
- A full RAG pipeline combines ingestion, retrieval, reranking, and generation
- Measure what matters: retrieval recall, answer relevance, and latency
- Custom implementation (no LangChain) gives full control and minimises dependencies
- Documented quality metrics let you tune each stage independently