🧠 AI System Design

Day 14: Mini-Project - End-to-End RAG

📂 Data & Training 📖 15 min read Needs expansion

Learning Objectives

  • Integrate everything from Week 2 into a production-grade RAG pipeline
  • Understand the end-to-end flow from document ingestion to answer generation
  • Build and test a fully containerised RAG service

Theory (15 min)

The Full Stack

                          ┌─────────────────────────────────┐
                          │           RAG Service           │
                          │                                 │
              ┌────────┐  │  ┌──────────┐    ┌────────┐     │
Documents ──▶ │ Import │──┼─▶│ Retrieve │───▶│ Rerank │     │
              └────────┘  │  └──────────┘    └───┬────┘     │
                          │                      │          │
                          │  ┌───────────────────▼──┐       │
                          │  │    Prompt Builder    │       │
                          │  └───────────────────┬──┘       │
                          │                      │          │
                          │  ┌───────────────────▼──┐       │
                          │  │    LLM Generator     │───────┼─▶ Answer
                          │  └──────────────────────┘       │
                          └─────────────────────────────────┘

Architecture Decisions

| Component | Options | Choose For |
|-----------|---------|------------|
| Embedding | sentence-transformers, OpenAI, llama.cpp embeddings | Local-first, cost control |
| Vector DB | Qdrant, Chroma, pgvector | Scale + features (Qdrant), simplicity (Chroma) |
| Reranker | cross-encoder/ms-marco-MiniLM, Cohere | Accuracy > cost |
| LLM | llama.cpp, vLLM, OpenAI | Latency/cost balance |
| Framework | LangChain, LlamaIndex, custom | Custom (max control, minimal deps) |
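
One way to keep the "custom" row practical is to put a small interface in front of each component, so that one option in the table can be swapped for another without touching the rest of the pipeline. The sketch below uses `typing.Protocol` for this. The names `Embedder`, `VectorStore`, `Reranker`, `Generator`, and `HashEmbedder` are illustrative assumptions, not part of any library listed above.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


@runtime_checkable
class VectorStore(Protocol):
    def upsert(self, ids: list[str], vectors: list[list[float]], texts: list[str]) -> None: ...
    def search(self, vector: list[float], top_k: int) -> list[tuple[str, float]]: ...


@runtime_checkable
class Reranker(Protocol):
    def rerank(self, query: str, texts: list[str], top_k: int) -> list[int]: ...


@runtime_checkable
class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


class HashEmbedder:
    """Toy stand-in for a real embedding model (assumption, for wiring tests only)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            # Count characters into buckets; this carries no semantic meaning.
            for ch in text.lower():
                vec[ord(ch) % self.dim] += 1.0
            vectors.append(vec)
        return vectors
```

Any class with a matching `embed` method satisfies `Embedder`, so a sentence-transformers or OpenAI wrapper could later replace `HashEmbedder` without changes to the calling code.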

Quality Metrics

Measure your RAG pipeline:

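| Metric | How | Target |
|--------|-----|--------|
| Retrieval recall | % of relevant docs that appear in the top-K results | >85% |
| Answer relevance | LLM self-evaluation | >80% |
| Precision | % of retrieved docs that are actually relevant | >70% |
| Latency p50 | Median time from query to answer | <3s |
| Latency p99 | 99th-percentile time from query to answer (near worst case) | <10s |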
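
The retrieval and latency rows can be computed from a small labelled query set and a list of timings. The following is a minimal sketch, assuming document IDs are strings and latencies are in seconds; the function names and the nearest-rank percentile method are choices made here for illustration, not something the lesson prescribes.

```python
import math


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical data for one query and five timed requests.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5"}
latencies = [0.8, 1.1, 1.4, 2.0, 9.5]

print(round(recall_at_k(retrieved, relevant, 5), 3))   # 0.667
print(precision_at_k(retrieved, relevant, 5))          # 0.4
print(percentile(latencies, 50), percentile(latencies, 99))  # 1.4 9.5
```

In practice these values would be averaged over many queries before being compared with the targets in the table.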

Hands-on (15 min)

Build the Integrated RAG Service

#!/usr/bin/env python3
"""rag-service.py: end-to-end RAG pipeline service."""
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import os

# Stub. Ayva will expand this with:
# - Full ingestion pipeline: file → chunk → embed → index
# - Hybrid search (dense + BM25/keyword)
# - Cross-encoder reranker
# - Prompt template with provenance tracking
# - Generator with cited sources
# - REST API (POST /query, POST /ingest, GET /health)
# - Docker container with Qdrant sidecar
# - Comprehensive testing with Obsidian vault data

RERANK_TOP_K = int(os.getenv("RERANK_TOP_K", "5"))
GEN_TOP_K = int(os.getenv("GEN_TOP_K", "3"))

class RAGHandler(BaseHTTPRequestHandler):
    def do_POST(self):
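        content_len = int(self.headers.get("Content-Length", 0))
        try:
            # Treat an empty body as an empty JSON object.
            body = json.loads(self.rfile.read(content_len) or b"{}")
        except ValueError:  # covers invalid JSON and invalid UTF-8
            body = None
        if not isinstance(body, dict):
            self.send_error(400, "Request body must be a JSON object")
            return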

        if self.path == "/query":
            answer = self.handle_query(body.get("query", ""))
        elif self.path == "/ingest":
            answer = self.handle_ingest(body)
        else:
            self.send_error(404)
            return

        self._respond(200, answer)

    def handle_query(self, query: str) -> dict:
        # TODO:
        # 1. Embed query
        # 2. Search vector DB (top RERANK_TOP_K)
        # 3. Rerank with cross-encoder
        # 4. Build prompt with top GEN_TOP_K chunks
        # 5. Generate with LLM
        # 6. Return answer + sources
        return {
            "query": query,
            "answer": "[RAG pipeline placeholder: Ayva to implement]",
            "sources": [],
            "tokens_used": 0,
        }

    def handle_ingest(self, body: dict) -> dict:
        # TODO:
        # 1. Accept text/file URL/document
        # 2. Chunk by configured strategy
        # 3. Embed each chunk
        # 4. Upsert to vector DB
        return {"status": "ok", "chunks_indexed": 0}

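    def _respond(self, status, data):
        # Serialise first so the Content-Length header can be set accurately.
        payload = json.dumps(data).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)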

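PORT = int(os.getenv("RAG_PORT", "8010"))

if __name__ == "__main__":
    # Bind to all interfaces so the service is reachable from outside its container.
    HTTPServer(("0.0.0.0", PORT), RAGHandler).serve_forever()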
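
The six TODO steps in `handle_query` can be prototyped without any external services before real components are wired in. The sketch below is one possible shape, under loud assumptions: a bag-of-words count vector stands in for the embedding model, query-term coverage stands in for the cross-encoder, an in-memory dict stands in for the vector DB, and the `generate` callable stands in for the LLM. All function names are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter

RERANK_TOP_K = 5  # candidates passed from retrieval to the reranker
GEN_TOP_K = 3     # chunks that reach the prompt


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    # ASSUMPTION: word counts stand in for a real embedding model.
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: dict[str, str], top_k: int) -> list[str]:
    # ASSUMPTION: a dict of chunk_id -> text stands in for the vector DB.
    q = embed(query)
    ranked = sorted(index, key=lambda cid: cosine(q, embed(index[cid])), reverse=True)
    return ranked[:top_k]


def rerank(query: str, index: dict[str, str], candidates: list[str], top_k: int) -> list[str]:
    # ASSUMPTION: query-term coverage stands in for a cross-encoder score.
    q_terms = set(tokenize(query))

    def score(cid: str) -> float:
        return len(q_terms & set(tokenize(index[cid]))) / max(len(q_terms), 1)

    return sorted(candidates, key=score, reverse=True)[:top_k]


def build_prompt(query: str, index: dict[str, str], chunk_ids: list[str]) -> str:
    # Number each chunk so the generator can cite it as [n].
    context = "\n".join(f"[{i}] ({cid}) {index[cid]}" for i, cid in enumerate(chunk_ids, 1))
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


def answer_query(query: str, index: dict[str, str], generate) -> dict:
    candidates = retrieve(query, index, RERANK_TOP_K)
    top = rerank(query, index, candidates, GEN_TOP_K)
    prompt = build_prompt(query, index, top)
    return {"query": query, "answer": generate(prompt), "sources": top}
```

Because each stage is a plain function, any one of them can be replaced with a real component and measured on its own while the others stay fixed.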

Questions for Ayva:

  • What's the optimal chunk overlap for markdown documents?
  • How to handle multi-turn RAG (follow-up questions)?
  • What evaluation methodology proves the pipeline works (NDCG, MRR, human eval)?
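
The chunk-overlap question is easier to explore with a concrete chunker to experiment on. The following is a minimal sketch, assuming word-based sliding windows; `chunk_words`, `size`, and `overlap` are hypothetical names, and the default values are placeholders to tune rather than recommendations.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten words, windows of four, one word shared between neighbours.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Running the retrieval metrics over the same query set at several `overlap` values is one way to answer the question empirically for a given corpus.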


Key Takeaways

  • A full RAG pipeline combines ingestion, retrieval, reranking, and generation
  • Measure what matters: retrieval recall, answer relevance, and latency
  • Custom implementation (no LangChain) gives full control and minimises dependencies
  • Documented quality metrics let you tune each stage independently
