🧠 AI System Design

Day 26: Case Study: Perplexity

📂 Production & Case Studies · 📖 15 min read · Needs expansion

Learning Objectives

  • Understand the architecture for real-time web-augmented answers
  • Learn the tradeoffs: latency vs comprehensiveness in search
  • Replicate a mini Perplexity: search → extract → summarise pipeline

Theory (15 min)

Perplexity's Core Innovation

Traditional search: 10 blue links, user clicks, reads, synthesises.

Perplexity: search → extract → synthesise → answer with citations.

Query: "What are the latest GPUs from NVIDIA?"

1. Search (Bing API): Top 10 results
2. Extract: Retrieve content from each URL
3. Chunk + Rerank: Find relevant passages
4. Generate: Synthesise answer with [1][2][3] citations
   → "NVIDIA announced the RTX 5090 in January 2025 [1]. It features 32GB of VRAM [2]."

Architecture

Query ──▶ Search API ──▶ URL Fetcher ──▶ Content Chunker
              │                │                │
           Bing/           httpx/          Chunk by
           Google          Playwright      paragraph
              │                │                │
              ▼                ▼                ▼
         ┌────────────────────────────────────────┐
         │           Reranker (cross-encoder)     │── Top 5 passages
         └────────────────────────────────────────┘
                           │
                           ▼
         ┌────────────────────────────────────────┐
         │           Prompt Builder               │── Context + query
         └────────────────────────────────────────┘
                           │
                           ▼
         ┌────────────────────────────────────────┐
         │           LLM (GPT-4o / Sonnet)        │── Answer w/ citations
         └────────────────────────────────────────┘
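The chunk and rerank stages in the diagram can be sketched with the standard library alone. This is a minimal sketch, not Perplexity's implementation: `chunk_by_paragraph`, `overlap_score`, and `rerank` are hypothetical names, and the term-overlap scorer is a deliberately simple stand-in for a cross-encoder, which would score each (query, passage) pair with a trained model instead.

```python
import re

def chunk_by_paragraph(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines and drop fragments too short to be useful."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric terms, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, passage: str) -> float:
    """Fraction of query terms found in the passage (stand-in scorer)."""
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(passage)) / len(q)

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Score every passage against the query and keep the top_k best."""
    scored = [(overlap_score(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return scored[:top_k]
```

Swapping in a real cross-encoder would only change `overlap_score`; the chunking and top-k selection stay the same, which is what makes the stage easy to replace.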

Key Design Decisions

1. Latency budget:
   - Search: 200-500 ms
   - URL fetch: 500-2000 ms (parallel)
   - Rerank: 50-100 ms
   - Generate: 1000-3000 ms
   - Total: ~2-5 s (users accept this for comprehensive answers)
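The total can be checked by summing the per-stage ranges. The sketch below uses only the figures quoted in the budget; `STAGES_MS`, `total_budget`, and `sequential_fetch_worst_case` are hypothetical names. It assumes the stages run one after another and that the parallel fetch stage costs as much as its slowest URL, not the sum of all URLs.

```python
# Per-stage latency ranges in milliseconds, taken from the budget above.
STAGES_MS = {
    "search": (200, 500),
    "url_fetch": (500, 2000),  # parallel: cost of the slowest fetch, not the sum
    "rerank": (50, 100),
    "generate": (1000, 3000),
}

def total_budget(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Best-case and worst-case end-to-end latency for sequential stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

def sequential_fetch_worst_case(n_urls: int, per_url_ms: int) -> int:
    """What the fetch stage would cost if URLs were fetched one by one."""
    return n_urls * per_url_ms

print(total_budget(STAGES_MS))                # (1750, 5600)
print(sequential_fetch_worst_case(10, 2000))  # 20000
```

The summed range of 1.75-5.6 s is consistent with the "~2-5 s" figure, and the 20 s sequential worst case shows why the fetch stage has to run in parallel.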

2. Citation strategy:
   - Every claim attributed to source [1], [2], etc.
   - Sources displayed inline, linking back to original
   - Builds trust and allows fact-checking
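One way to keep citations trustworthy is to verify, after generation, that every [n] marker points at a source that was actually supplied. The sketch below is an assumption about how such a check could look, not a documented Perplexity component; `check_citations` and the example.com URLs are hypothetical.

```python
import re

def check_citations(answer: str, sources: list[str]) -> dict:
    """Map [n] markers in the answer to sources and flag problems.

    Sources are numbered from 1, matching the [1], [2] style in the prompt.
    """
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= len(sources)}
    return {
        # Markers that resolve to a real source URL.
        "cited": {n: sources[n - 1] for n in sorted(valid)},
        # Markers the model invented (no such source was provided).
        "dangling": sorted(cited - valid),
        # Sources that were provided but never referenced.
        "unused": [n for n in range(1, len(sources) + 1) if n not in cited],
    }

sources = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
answer = "NVIDIA announced the RTX 5090 in January 2025 [1]. It features 32GB of VRAM [2]. More soon [4]."
report = check_citations(answer, sources)
```

A non-empty `dangling` list is a signal to regenerate or strip the marker before showing the answer; `unused` sources can simply be left out of the displayed source list.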

3. Parallel URL fetch:
   - Fetch 5-10 URLs simultaneously
   - The fastest results arrive early; the slowest may time out
   - Handle partial results gracefully: answer from whatever arrived in time
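Partial-result handling can be sketched with `asyncio.wait` and a single overall deadline. To keep the example self-contained it uses a simulated fetch (`fake_fetch`, which just sleeps) in place of a real HTTP call; `fetch_with_deadline` and the URLs are likewise hypothetical.

```python
import asyncio

async def fake_fetch(url: str, delay_s: float) -> dict:
    """Stand-in for an HTTP GET: sleeps for delay_s, then returns a page."""
    await asyncio.sleep(delay_s)
    return {"url": url, "content": f"content of {url}"}

async def fetch_with_deadline(
    jobs: dict[str, float], deadline_s: float
) -> tuple[list[dict], list[str]]:
    """Start all fetches at once; keep whatever finishes before the deadline."""
    if not jobs:
        return [], []
    tasks = {asyncio.create_task(fake_fetch(u, d)): u for u, d in jobs.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=deadline_s)
    for task in pending:
        task.cancel()  # stop slow fetches instead of letting them linger
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    results = [t.result() for t in done if t.exception() is None]
    results.sort(key=lambda r: r["url"])  # `done` is a set, so impose an order
    timed_out = sorted(tasks[t] for t in pending)
    return results, timed_out

results, timed_out = asyncio.run(
    fetch_with_deadline(
        {"https://example.com/fast": 0.01, "https://example.com/slow": 5.0},
        deadline_s=0.2,
    )
)
```

A single deadline for the whole batch bounds the stage's latency regardless of how many URLs are requested, whereas a per-request timeout alone only bounds each individual fetch.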

4. Caching:
   - Popular queries cached (response + sources)
   - Search results cached (Bing API costs money)
   - Generated answers deduped by semantic similarity
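A query cache with expiry can be sketched as a small in-memory TTL cache. The assumptions here: `TTLCache` and `normalize` are hypothetical names, keys are matched after lowercasing and whitespace normalisation only, and the semantic-similarity deduplication mentioned above is out of scope because it would need an embedding model.

```python
import time

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries match."""
    return " ".join(query.lower().split())

class TTLCache:
    """In-memory cache whose entries expire ttl_s seconds after being stored."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, query: str):
        key = normalize(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # evict stale entries lazily on read
            return None
        return value

    def put(self, query: str, value) -> None:
        self._store[normalize(query)] = (self.clock() + self.ttl_s, value)
```

For web-augmented answers the TTL is the main tuning knob: news-like queries go stale within minutes, while reference-style queries can safely be cached far longer.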


Hands-on (15 min)

Build a Mini Perplexity Clone

#!/usr/bin/env python3
"""mini-perplexity.py: search → extract → summarise pipeline."""
import asyncio
import httpx

# Stub: Ayva will expand with:
# - Real search API integration (Bing, SerpAPI, or SearXNG if available)
# - Parallel URL fetching with timeout handling
# - Content chunking by paragraph/section
# - Cross-encoder reranker for passage relevance scoring
# - Citation-aware prompt template
# - Answer formatting with [1][2][3] references
# - Caching layer (response + search results)

SEARCH_API = "https://api.duckduckgo.com/?"  # or use your own SearXNG
LLM_URL = "http://localhost:8080/v1/completions"

async def search_web(query: str) -> list[dict]:
    """Placeholder: returns canned results; replace with a real search API."""
    # In production: call Bing/Google/SearXNG API
    # Return: [{"url": "...", "title": "...", "snippet": "..."}]
    return [
        {"url": "https://example.com/1", "title": "Result 1",
         "snippet": "AI system design involves..."},
        {"url": "https://example.com/2", "title": "Result 2",
         "snippet": "Key considerations include..."},
    ]

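async def extract_content(urls: list[str]) -> list[dict]:
    """Fetch content from URLs in parallel; failed fetches yield empty content."""
    async with httpx.AsyncClient(timeout=10) as client:
        async def fetch(url: str) -> dict:
            try:
                resp = await client.get(url, follow_redirects=True)
                resp.raise_for_status()  # treat 4xx/5xx responses as failures
                # Raw HTML, truncated; a fuller pipeline would extract text first
                return {"url": url, "content": resp.text[:2000]}
            except httpx.HTTPError as e:
                # Keep the slot so citation numbers stay aligned with the results,
                # but do not feed the error message to the LLM as if it were content
                return {"url": url, "content": "", "error": str(e)}

        # gather preserves input order, so result i corresponds to urls[i]
        return await asyncio.gather(*(fetch(u) for u in urls))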

async def generate_answer(query: str, context: str) -> str:
    """Generate answer with citations."""
    prompt = (
        "Answer the following question based on the context provided. "
        "Cite sources using [1], [2], etc. If the context doesn't contain "
        "the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

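    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": 300,
            "temperature": 0.3,
        })
        resp.raise_for_status()  # fail loudly if the LLM server returns an error
        # Assumes an OpenAI-style completions response body
        return resp.json()["choices"][0]["text"].strip()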

async def perplexity_pipeline(query: str):
    print(f"🔍 Query: {query}\n")

    # 1. Search
    print("  1. Searching web...")
    results = await search_web(query)

    # 2. Extract
    urls = [r["url"] for r in results]
    print(f"  2. Fetching {len(urls)} URLs...")
    contents = await extract_content(urls)

    # 3. Build context
    context_parts = []
    for i, c in enumerate(contents):
        context_parts.append(f"[{i+1}] From {c['url']}:\n{c['content'][:300]}")
    context = "\n\n".join(context_parts)

    # 4. Generate
    print("  3. Generating answer...")
    answer = await generate_answer(query, context)

    print(f"\n📝 Answer:\n{answer[:500]}")
    return answer

if __name__ == "__main__":
    asyncio.run(perplexity_pipeline("What is the best way to learn AI system design?"))
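The `extract_content` step above passes the first 2,000 characters of the raw response to the model, which for most pages is markup, scripts, and styles, not readable text. A rough text-extraction pass can be sketched with the standard library's `html.parser`; `TextExtractor` and `html_to_text` are hypothetical helpers, and dedicated extraction libraries handle far more edge cases than this.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)

def html_to_text(html: str) -> str:
    """Return the page's text fragments joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Applying `html_to_text(resp.text)` before truncating would spend the character budget on content the model can actually cite; spacing around inline tags is only approximate in this sketch.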

Questions for Ayva:
  - What's the best free search API for a local Perplexity clone?
  - How should search depth be balanced against latency (how many URLs to fetch)?
  - What prompt template produces the most citable, trustworthy answers?


Key Takeaways

  • Perplexity's innovation is combining search, extraction, and generation into one UX
  • Parallel URL fetching is essential for acceptable latency
  • Citations build trust and allow user verification
  • The pipeline is modular: swap any component (search engine, reranker, LLM) independently
