Day 26: Case Study: Perplexity


Learning Objectives

Understand the architecture for real-time web-augmented answers
Learn the tradeoffs: latency vs comprehensiveness in search
Replicate a mini Perplexity: search → extract → summarise pipeline

Theory (15 min)

Perplexity's Core Innovation

Traditional search: 10 blue links, user clicks, reads, synthesises.

Perplexity: search → extract → synthesise → answer with citations.

Query: "What are the latest GPUs from NVIDIA?"

1. Search (Bing API): Top 10 results
2. Extract: Retrieve content from each URL
3. Chunk + Rerank: Find relevant passages
4. Generate: Synthesise answer with [1][2][3] citations
   → "NVIDIA announced the RTX 5090 in January 2025 [1]. It features 32GB of VRAM [2]."

Architecture

Query ──▶ Search API ──▶ URL Fetcher ──▶ Content Chunker
              │                │                │
           Bing/           httpx/          Chunk by
           Google          Playwright      paragraph
              │                │                │
              ▼                ▼                ▼
         ┌────────────────────────────────────────┐
         │           Reranker (cross-encoder)     │── Top 5 passages
         └────────────────────────────────────────┘
                           │
                           ▼
         ┌────────────────────────────────────────┐
         │           Prompt Builder               │── Context + query
         └────────────────────────────────────────┘
                           │
                           ▼
         ┌────────────────────────────────────────┐
         │           LLM (GPT-4o / Sonnet)        │── Answer w/ citations
         └────────────────────────────────────────┘
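The chunking and reranking stages in the diagram can be sketched in a few lines. This is a minimal illustration, not Perplexity's implementation: `chunk_by_paragraph`, `overlap_score`, and `rerank` are hypothetical names, and the scorer is a plain term-overlap stand-in for the cross-encoder shown in the diagram, chosen only so the example runs without a model download.

```python
import re

def chunk_by_paragraph(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines and drop fragments too short to carry an answer."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if len(p.strip()) >= min_chars]

def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, passage: str) -> float:
    """Stand-in relevance score: fraction of query terms present in the passage.

    A real system would score (query, passage) pairs with a cross-encoder.
    """
    query_terms = _tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & _tokens(passage)) / len(query_terms)

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Return the top_k passages as (score, passage), best first."""
    scored = [(overlap_score(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

if __name__ == "__main__":
    page = (
        "NVIDIA announced new GPUs at its keynote this year.\n\n"
        "Short.\n\n"
        "The weather in Paris was mild throughout the spring season."
    )
    for score, passage in rerank("latest NVIDIA GPUs", chunk_by_paragraph(page)):
        print(f"{score:.2f}  {passage}")
```

Because `rerank` only depends on a scoring function, swapping the overlap score for a cross-encoder leaves the rest of the pipeline unchanged.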

Key Design Decisions

1. Latency budget:
   - Search: 200-500ms
   - URL fetch: 500-2000ms (parallel)
   - Rerank: 50-100ms
   - Generate: 1000-3000ms
   - Total: ~2-5s (users accept this for comprehensive answers)
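The budget can be checked with simple arithmetic. Summing the stage ranges gives 1750ms to 5600ms, which matches the rough ~2-5s total. The sketch below also shows why the fetch stage has to be parallel, under the assumption (not stated in the figures above) that each URL individually takes up to 2000ms; `total_budget` and `worst_case_sequential` are hypothetical helper names.

```python
# Stage ranges in milliseconds, copied from the latency budget.
STAGES_MS = {
    "search": (200, 500),
    "fetch": (500, 2000),      # one window for the whole parallel batch
    "rerank": (50, 100),
    "generate": (1000, 3000),
}

def total_budget(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case times of stages that run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst

def worst_case_sequential(stages: dict[str, tuple[int, int]], n_urls: int) -> int:
    """Worst case if the same URLs were fetched one after another instead.

    Assumes every URL can take up to the fetch stage's upper bound.
    """
    _, worst = total_budget(stages)
    _, fetch_high = stages["fetch"]
    # The parallel total already counts one fetch window; add the other n-1.
    return worst + (n_urls - 1) * fetch_high

if __name__ == "__main__":
    best, worst = total_budget(STAGES_MS)
    print(f"parallel fetch:   {best} to {worst} ms")
    print(f"sequential fetch: up to {worst_case_sequential(STAGES_MS, 10)} ms for 10 URLs")
```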

2. Citation strategy:
   - Every claim attributed to source [1], [2], etc.
   - Sources displayed inline, linking back to original
   - Builds trust and allows fact-checking
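One way to enforce this strategy is to check the generated answer after the fact: resolve each `[n]` marker to a source, flag markers that point at nothing, and list sentences that carry no citation at all. The sketch below is an illustrative post-processing step with hypothetical function names; the sentence splitter is deliberately naive and would need replacing for text with abbreviations or decimals.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def cited_indices(answer: str) -> list[int]:
    """Citation numbers in order of first appearance, without repeats."""
    seen: list[int] = []
    for match in CITATION.finditer(answer):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen

def check_citations(answer: str, sources: list[str]) -> dict:
    """Map valid markers to source URLs and collect markers with no source."""
    resolved, dangling = {}, []
    for number in cited_indices(answer):
        if 1 <= number <= len(sources):
            resolved[number] = sources[number - 1]  # markers are 1-based
        else:
            dangling.append(number)
    return {"resolved": resolved, "dangling": dangling}

def uncited_sentences(answer: str) -> list[str]:
    """Sentences that make a claim without any [n] marker (naive splitter)."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s and not CITATION.search(s)]

if __name__ == "__main__":
    answer = (
        "NVIDIA announced the RTX 5090 in January 2025 [1]. "
        "It features 32GB of VRAM [2]. It is fast [4]. Gamers like it."
    )
    sources = ["https://example.com/a", "https://example.com/b"]
    print(check_citations(answer, sources))
    print(uncited_sentences(answer))
```

A dangling marker or an uncited sentence is a reasonable trigger for regenerating the answer or dropping the sentence before display.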

3. Parallel URL fetch:
   - Fetch 5-10 URLs simultaneously
   - Responses arrive at different times; the slowest may time out
   - Gracefully handle partial results
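Partial-result handling can be sketched with `asyncio.wait` and a shared deadline: whatever has finished by the deadline is kept, and the stragglers are cancelled. To keep the example runnable offline, `fake_fetch` simulates network delay with `asyncio.sleep`; it and `fetch_with_deadline` are hypothetical names, and a real version would call an HTTP client instead.

```python
import asyncio

async def fake_fetch(url: str, delay: float) -> dict:
    """Stand-in for an HTTP GET: sleeps for `delay` seconds, then returns a page."""
    await asyncio.sleep(delay)
    return {"url": url, "content": f"body of {url}"}

async def fetch_with_deadline(
    jobs: dict[str, float], deadline: float
) -> tuple[list[dict], list[str]]:
    """Fetch all URLs concurrently; keep what finishes within `deadline` seconds.

    `jobs` maps url -> simulated delay and must not be empty.
    Returns (pages in input order, urls that missed the deadline).
    """
    tasks = {asyncio.create_task(fake_fetch(url, d)): url for url, d in jobs.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=deadline)

    # Cancel stragglers and wait for the cancellations to settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    order = list(jobs)
    pages = [t.result() for t in done if t.exception() is None]
    pages.sort(key=lambda page: order.index(page["url"]))
    skipped = sorted((tasks[t] for t in pending), key=order.index)
    return pages, skipped

if __name__ == "__main__":
    jobs = {"https://example.com/fast": 0.01, "https://example.com/slow": 5.0}
    pages, skipped = asyncio.run(fetch_with_deadline(jobs, deadline=0.2))
    print("fetched:", [p["url"] for p in pages])
    print("skipped:", skipped)
```

The deadline applies to the whole batch rather than to each request, so one slow site cannot push the fetch stage past its share of the latency budget.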

4. Caching:
   - Popular queries cached (response + sources)
   - Search results cached (Bing API costs money)
   - Generated answers deduped by semantic similarity
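A minimal sketch of the first two bullets is a time-limited cache keyed by a normalised query. The `TTLCache` class and `normalize` helper are hypothetical, and the matching here is exact after normalisation; deduplication by semantic similarity would need an embedding model and a nearest-neighbour lookup, which this example does not attempt.

```python
import time
from typing import Any, Callable, Optional

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())

class TTLCache:
    """In-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, Any]] = {}

    def put(self, query: str, value: Any) -> None:
        self._store[normalize(query)] = (self.clock() + self.ttl, value)

    def get(self, query: str) -> Optional[Any]:
        key = normalize(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop stale entries lazily on read
            return None
        return value

if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=60)
    cache.put("  Latest NVIDIA GPUs ", {"answer": "...", "sources": ["https://example.com/1"]})
    print(cache.get("latest nvidia gpus"))
```

News-like queries go stale quickly, so the TTL is a freshness tradeoff as much as a cost one.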

Hands-on (15 min)

Build a Mini Perplexity Clone

#!/usr/bin/env python3
"""mini-perplexity.py — search → extract → summarise pipeline."""
import asyncio
import httpx

# Stub — Ayva will expand with:
# - Real search API integration (Bing, SerpAPI, or SearXNG if available)
# - Parallel URL fetching with timeout handling
# - Content chunking by paragraph/section
# - Cross-encoder reranker for passage relevance scoring
# - Citation-aware prompt template
# - Answer formatting with [1][2][3] references
# - Caching layer (response + search results)

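# Not called by the stub below. DuckDuckGo's endpoint serves instant answers
# rather than ranked web results, so prefer your own SearXNG for real search.
SEARCH_API = "https://api.duckduckgo.com/?"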
LLM_URL = "http://localhost:8080/v1/completions"

async def search_web(query: str) -> list:
    """Placeholder — replace with real search API."""
    # In production: call Bing/Google/SearXNG API
    # Return: [{"url": "...", "title": "...", "snippet": "..."}]
    return [
        {"url": "https://example.com/1", "title": "Result 1",
         "snippet": "AI system design involves..."},
        {"url": "https://example.com/2", "title": "Result 2",
         "snippet": "Key considerations include..."},
    ]

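async def extract_content(urls: list[str]) -> list[dict]:
    """Fetch content from URLs in parallel; failures are returned, not raised."""
    async with httpx.AsyncClient(timeout=10, follow_redirects=True) as client:
        async def fetch(url: str) -> dict:
            try:
                resp = await client.get(url)
                resp.raise_for_status()  # treat 4xx/5xx as failures, not content
                # Raw response body, truncated; HTML tags are not stripped here.
                return {"url": url, "content": resp.text[:2000]}
            except Exception as e:
                # Best-effort fetch: record the failure and let the others proceed.
                return {"url": url, "content": f"[error: {e}]"}

        # gather() runs all fetches concurrently and preserves input order.
        tasks = [fetch(u) for u in urls]
        return list(await asyncio.gather(*tasks))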

async def generate_answer(query: str, context: str) -> str:
    """Generate answer with citations."""
    prompt = (
        "Answer the following question based on the context provided. "
        "Cite sources using [1], [2], etc. If the context doesn't contain "
        "the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": 300,
            "temperature": 0.3,
        })
        resp.raise_for_status()  # surface server errors instead of a KeyError below
        return resp.json()["choices"][0]["text"].strip()

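async def perplexity_pipeline(query: str) -> str:
    print(f"🔍 Query: {query}\n")

    # 1. Search
    print("  1. Searching web...")
    results = await search_web(query)

    # 2. Extract
    urls = [r["url"] for r in results]
    print(f"  2. Fetching {len(urls)} URLs...")
    contents = await extract_content(urls)

    # 3. Build context from pages that fetched successfully, so citation
    #    numbers only ever point at real content.
    usable = [c for c in contents if not c["content"].startswith("[error:")]
    print(f"  3. Building context from {len(usable)} pages...")
    context_parts = []
    for i, c in enumerate(usable, start=1):
        context_parts.append(f"[{i}] From {c['url']}:\n{c['content'][:300]}")
    context = "\n\n".join(context_parts)

    # 4. Generate
    print("  4. Generating answer...")
    answer = await generate_answer(query, context)

    print(f"\n📝 Answer:\n{answer[:500]}")
    # List the sources so each [n] in the answer can be traced to a URL.
    print("\nSources:")
    for i, c in enumerate(usable, start=1):
        print(f"  [{i}] {c['url']}")
    return answer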

if __name__ == "__main__":
    asyncio.run(perplexity_pipeline("What is the best way to learn AI system design?"))
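The script above passes the raw response body to the model, so for HTML pages most of the first 2000 characters are markup rather than content. A hedged sketch of a text-extraction step using only the standard library follows; `TextExtractor` and `html_to_text` are hypothetical names, and a production pipeline would more likely use a dedicated readability or boilerplate-removal library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Title</h1><p>Hello world.</p></body></html>"
    )
    print(html_to_text(page))
```

Applying `html_to_text` before truncating means the character budget per source is spent on prose the model can cite.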

Questions for Ayva:
- What's the best free search API for a local Perplexity clone?
- How should search depth be balanced against latency (how many URLs to fetch)?
- What prompt template produces the most citeable, trustworthy answers?

Key Takeaways

Perplexity's innovation is combining search, extraction, and generation into one UX
Parallel URL fetching is essential for acceptable latency
Citations build trust and allow user verification
The pipeline is modular: swap any component (search engine, reranker, LLM) independently
