Day 14: Mini-Project — End-to-End RAG

Learning Objectives

Integrate everything from Week 2 into a production-grade RAG pipeline
Understand the end-to-end flow from document ingestion to answer generation
Build and test a fully containerised RAG service

Theory (15 min)

The Full Stack

                        ┌─────────────────────────┐
                        │       RAG Service       │
                        │                         │
Documents ──▶ ┌──────┐  │  ┌────────┐  ┌──────┐   │
              │Import│──┼─▶│Retrieve│─▶│Rerank│   │
              └──────┘  │  └────────┘  └───┬──┘   │
                        │                  │      │
                        │  ┌───────────────▼──┐   │
                        │  │  Prompt Builder  │   │
                        │  └───────────────┬──┘   │
                        │                  │      │
                        │  ┌───────────────▼──┐   │
                        │  │  LLM Generator   │───┼─▶ Answer
                        │  └──────────────────┘   │
                        └─────────────────────────┘
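
The flow in the diagram can be sketched as a chain of plain functions. This is a minimal sketch, not the lesson's implementation: the bag-of-words `embed`, the term-overlap `rerank`, and the placeholder `generate` are stand-ins (assumptions) for a real embedding model, cross-encoder, and LLM, and every function name is hypothetical. The defaults of 5 and 3 mirror `RERANK_TOP_K` and `GEN_TOP_K` in the service stub.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """First stage: cheap similarity search over all chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(query: str, candidates: list[str], k: int) -> list[str]:
    """Second stage. Stand-in for a cross-encoder: fraction of query terms covered."""
    terms = set(embed(query))

    def score(chunk: str) -> float:
        return len(terms & set(embed(chunk))) / len(terms) if terms else 0.0

    return sorted(candidates, key=score, reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Number each chunk so the generator can cite its sources."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, 1))
    return f"Answer using only the context.\n\nContext:\n{numbered}\n\nQuestion: {query}\nAnswer:"


def generate(prompt: str) -> str:
    """Placeholder generator; a real one would send the prompt to an LLM."""
    return f"[placeholder answer for a {len(prompt)}-character prompt]"


def answer(query: str, chunks: list[str], retrieve_k: int = 5, gen_k: int = 3) -> dict:
    """Retrieve, rerank, build the prompt, generate, and return answer plus sources."""
    candidates = retrieve(query, chunks, retrieve_k)
    context = rerank(query, candidates, gen_k)
    return {"query": query, "answer": generate(build_prompt(query, context)), "sources": context}
```

Keeping each stage a separate function means any one of them can be swapped for a real component without touching the others.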

Architecture Decisions

Component	Options	Choose For
Embedding	sentence-transformers, OpenAI, llama.cpp embeddings	Local-first, cost control
Vector DB	Qdrant, Chroma, pgvector	Scale + features (Qdrant), simplicity (Chroma)
Reranker	cross-encoder/ms-marco-MiniLM, Cohere	Accuracy > cost
LLM	llama.cpp, vLLM, OpenAI	Latency/cost balance
Framework	LangChain, LlamaIndex, custom	Custom (max control, minimal deps)
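
One way to keep the choices in this table swappable in a custom framework is to code against small interfaces. The sketch below is an assumption about how that could look, not part of the lesson: `Embedder`, `VectorStore`, `KeywordEmbedder`, `InMemoryStore`, and `index_and_search` are hypothetical names, and the in-memory store only stands in for Qdrant, Chroma, or pgvector during local testing.

```python
from typing import Protocol


class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class VectorStore(Protocol):
    def upsert(self, ids: list[str], vectors: list[list[float]], payloads: list[dict]) -> None: ...
    def search(self, vector: list[float], k: int) -> list[dict]: ...


class KeywordEmbedder:
    """Toy embedder: counts occurrences of each vocabulary word (illustration only)."""

    def __init__(self, vocab: list[str]) -> None:
        self.vocab = vocab

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(t.lower().split().count(w)) for w in self.vocab] for t in texts]


class InMemoryStore:
    """Brute-force dot-product search; a stand-in for a real vector database."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float], dict]] = []

    def upsert(self, ids: list[str], vectors: list[list[float]], payloads: list[dict]) -> None:
        for doc_id, vec, payload in zip(ids, vectors, payloads):
            # Replace any existing row with the same id, then append the new one.
            self._rows = [r for r in self._rows if r[0] != doc_id]
            self._rows.append((doc_id, vec, payload))

    def search(self, vector: list[float], k: int) -> list[dict]:
        scored = [(sum(a * b for a, b in zip(vector, vec)), doc_id, payload)
                  for doc_id, vec, payload in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [{"id": doc_id, "score": score, **payload} for score, doc_id, payload in scored[:k]]


def index_and_search(embedder: Embedder, store: VectorStore,
                     docs: dict[str, str], query: str, k: int) -> list[dict]:
    """Index the documents, then return the top-k hits for the query."""
    ids = list(docs)
    store.upsert(ids, embedder.embed([docs[i] for i in ids]), [{"text": docs[i]} for i in ids])
    return store.search(embedder.embed([query])[0], k)
```

Because `index_and_search` depends only on the two protocols, replacing the toy classes with a sentence-transformers embedder or a Qdrant client wrapper would not change the calling code.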

Quality Metrics

Measure your RAG pipeline:

Metric	How	Target
Retrieval recall	% of relevant docs in top-K	>85%
Answer relevance	LLM self-evaluation	>80%
Precision	% of retrieved docs actually relevant	>70%
Latency p50	Median time from query to answer	<3s
Latency p99	99th-percentile time from query to answer	<10s
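
The retrieval and latency rows of the table can be computed with a few lines of standard-library Python. This is a sketch under assumptions: it presumes you have a labelled set of relevant document IDs per query and a list of measured latencies in seconds, and the function names are illustrative rather than taken from the lesson.

```python
import statistics


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are actually relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def latency_percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) of the latency samples.

    statistics.quantiles with n=100 yields 99 cut points; index pct - 1
    is the pct-th percentile. It needs at least two samples.
    """
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]
```

Averaging `recall_at_k` and `precision_at_k` over the whole query set gives the figures to compare against the targets; `latency_percentile(samples, 50)` and `latency_percentile(samples, 99)` give the two latency rows.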

Hands-on (15 min)

Build the Integrated RAG Service

#!/usr/bin/env python3
"""rag-service.py — end-to-end RAG pipeline service."""
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import os

# Stub — Ayva will expand with:
# - Full ingestion pipeline: file → chunk → embed → index
# - Hybrid search (dense + BM25/kwd)
# - Cross-encoder reranker
# - Prompt template with provenance tracking
# - Generator with cited sources
# - REST API (POST /query, POST /ingest, GET /health)
# - Docker container with Qdrant sidecar
# - Comprehensive testing with Obsidian vault data

RERANK_TOP_K = int(os.getenv("RERANK_TOP_K", "5"))
GEN_TOP_K = int(os.getenv("GEN_TOP_K", "3"))

class RAGHandler(BaseHTTPRequestHandler):
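    def do_POST(self):
        # Parse the JSON body; answer 400 instead of crashing on bad input.
        try:
            content_len = int(self.headers.get("Content-Length") or 0)
            body = json.loads(self.rfile.read(content_len) or b"{}")
        except ValueError:  # bad Content-Length, invalid UTF-8, or malformed JSON
            self._respond(400, {"error": "request body must be valid JSON"})
            return
        if not isinstance(body, dict):
            self._respond(400, {"error": "request body must be a JSON object"})
            return

        # Route to the matching pipeline stage.
        if self.path == "/query":
            answer = self.handle_query(body.get("query", ""))
        elif self.path == "/ingest":
            answer = self.handle_ingest(body)
        else:
            self.send_error(404)
            return

        self._respond(200, answer)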

    def handle_query(self, query: str) -> dict:
        # TODO:
        # 1. Embed query
        # 2. Search vector DB (top RERANK_TOP_K)
        # 3. Rerank with cross-encoder
        # 4. Build prompt with top GEN_TOP_K chunks
        # 5. Generate with LLM
        # 6. Return answer + sources
        return {
            "query": query,
            "answer": "[RAG pipeline placeholder — Ayva to implement]",
            "sources": [],
            "tokens_used": 0,
        }

    def handle_ingest(self, body: dict) -> dict:
        # TODO:
        # 1. Accept text/file URL/document
        # 2. Chunk by configured strategy
        # 3. Embed each chunk
        # 4. Upsert to vector DB
        return {"status": "ok", "chunks_indexed": 0}

    def _respond(self, status, data):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(data).encode())

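PORT = int(os.getenv("RAG_PORT", "8010"))

if __name__ == "__main__":
    # Bind to all interfaces so the service is reachable from outside its container.
    HTTPServer(("0.0.0.0", PORT), RAGHandler).serve_forever()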
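
To exercise the stub once it is running, a small client is enough. The sketch below uses only the standard library; `post_json` and `smoke_test` are hypothetical helpers, and the `"text"` key in the ingest payload is an assumption, since the stub does not yet define an ingest schema.

```python
import json
import urllib.request


def post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def smoke_test(base_url: str) -> dict:
    """Ingest one document, ask one question, and check the response shape."""
    # The "text" field is an assumed payload key, not defined by the stub.
    post_json(f"{base_url}/ingest", {"text": "Qdrant is a vector database."})
    result = post_json(f"{base_url}/query", {"query": "What is Qdrant?"})
    missing = {"query", "answer", "sources"} - result.keys()
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return result
```

With the service listening on its default port, `smoke_test("http://localhost:8010")` should return the placeholder answer until the pipeline stages are implemented.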

Questions for Ayva:
- What's the optimal chunk overlap for markdown documents?
- How to handle multi-turn RAG (follow-up questions)?
- What evaluation methodology proves the pipeline works (NDCG, MRR, human eval)?
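
The chunk-overlap question is easier to explore with overlap exposed as a parameter. The following is a minimal sketch, assuming word-based windows and a split on markdown headings; the function names and the defaults of 200 words with 40 words of overlap are illustrative starting points, not recommendations from the lesson.

```python
import re


def split_sections(markdown: str) -> list[str]:
    """Split before each ATX heading so a chunk never spans two sections."""
    parts = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    return [part.strip() for part in parts if part.strip()]


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Sliding window over words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_markdown(markdown: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Chunk each heading-delimited section independently."""
    return [chunk for section in split_sections(markdown)
            for chunk in chunk_words(section, size, overlap)]
```

Note that `text.split()` collapses line breaks, so list and table layout inside a chunk is lost. Running the retrieval metrics over a few `overlap` values is one way to pick a setting empirically.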

Key Takeaways

A full RAG pipeline combines ingestion, retrieval, reranking, and generation
Measure what matters: retrieval recall, answer relevance, and latency
Custom implementation (no LangChain) gives full control and minimises dependencies
Documented quality metrics let you tune each stage independently
