Day 26: Case Study: Perplexity
Learning Objectives
- Understand the architecture for real-time web-augmented answers
- Learn the tradeoffs: latency vs comprehensiveness in search
- Replicate a mini Perplexity: search → extract → summarise pipeline
Theory (15 min)
Perplexity's Core Innovation
Traditional search: 10 blue links, user clicks, reads, synthesises.
Perplexity: search → extract → synthesise → answer with citations.
Query: "What are the latest GPUs from NVIDIA?"
1. Search (Bing API): Top 10 results
2. Extract: Retrieve content from each URL
3. Chunk + Rerank: Find relevant passages
4. Generate: Synthesise answer with [1][2][3] citations
→ "NVIDIA announced the RTX 5090 in January 2025 [1]. It features 32GB of VRAM [2]."
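The numbered markers in an answer like this are only useful if each one resolves to a real source. A minimal sketch of that check, assuming the sources are kept in the same order they were numbered in the prompt; `resolve_citations` and the example URLs are hypothetical names used for illustration:

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def resolve_citations(answer: str, sources: list[str]) -> dict[int, str]:
    """Map each [n] marker found in the answer to its source URL.

    Markers that point outside the source list are skipped, since they
    do not correspond to anything that was actually retrieved.
    """
    resolved = {}
    for match in CITATION_RE.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(sources):
            resolved[n] = sources[n - 1]
    return resolved


answer = (
    "NVIDIA announced the RTX 5090 in January 2025 [1]. "
    "It features 32GB of VRAM [2]."
)
# Placeholder URLs, not real search results.
sources = ["https://example.com/announcement", "https://example.com/specs"]
citations = resolve_citations(answer, sources)
for n, url in sorted(citations.items()):
    print(f"[{n}] {url}")
```

A marker such as [7] with only two sources would simply be left out of the result, which gives the UI a cheap way to flag or drop unsupported citations.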
Architecture
Query ──▶ Search API ──▶ URL Fetcher ──▶ Content Chunker
              │               │                 │
            Bing/          httpx/           Chunk by
            Google       Playwright         paragraph
              │               │                 │
              ▼               ▼                 ▼
┌────────────────────────────────────────┐
│ Reranker (cross-encoder)               │── Top 5 passages
└────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ Prompt Builder                         │── Context + query
└────────────────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────┐
│ LLM (GPT-4o / Sonnet)                  │── Answer w/ citations
└────────────────────────────────────────┘
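The chunking and reranking stages in the diagram can be sketched in a few lines. This is an illustration only: a real reranker would score each (query, passage) pair with a trained cross-encoder, whereas the hypothetical `overlap_score` below uses plain token overlap as a stand-in so the example runs without any model. The function names and the `min_chars` threshold are assumptions, not part of Perplexity's design.

```python
import re


def chunk_by_paragraph(text: str, min_chars: int = 40) -> list[str]:
    """Split text on blank lines and drop very short fragments."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if len(p) >= min_chars]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query tokens that also appear in the passage.

    Stand-in for a cross-encoder relevance score.
    """
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(passage)) / len(q)


def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k passages, highest score first."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_k]


page = (
    "Subscribe to our newsletter for weekly hardware news and reviews.\n\n"
    "NVIDIA announced the RTX 5090 with 32GB of VRAM, its latest GPU.\n\n"
    "Cookie settings"
)
chunks = chunk_by_paragraph(page)
top = rerank("latest GPUs from NVIDIA", chunks, top_k=1)
print(top[0])
```

Because `rerank` only depends on a scoring function, swapping the overlap heuristic for a model-based scorer would not change the rest of the pipeline.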
Key Design Decisions
1. Latency budget:
   - Search: 200-500ms
   - URL fetch: 500-2000ms (parallel)
   - Rerank: 50-100ms
   - Generate: 1000-3000ms
   - Total: ~2-5s (users accept this for comprehensive answers)
2. Citation strategy:
   - Every claim is attributed to a source: [1], [2], etc.
   - Sources are displayed inline and link back to the original
   - This builds trust and allows fact-checking
3. Parallel URL fetch:
   - Fetch 5-10 URLs simultaneously
   - The fastest result arrives first; the slowest may time out
   - Handle partial results gracefully
4. Caching:
   - Popular queries are cached (response + sources)
   - Search results are cached (Bing API calls cost money)
   - Generated answers are deduplicated by semantic similarity
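The parallel-fetch decision above (start everything at once, keep what arrives within the budget, drop the stragglers) can be sketched with `asyncio.wait` and a timeout. To stay self-contained, the sketch simulates network calls with `asyncio.sleep`; `fake_fetch`, `fetch_with_deadline`, the URLs, and the delays are all hypothetical.

```python
import asyncio


async def fake_fetch(url: str, delay: float) -> dict:
    """Simulated fetch: sleeps instead of touching the network."""
    await asyncio.sleep(delay)
    return {"url": url, "content": f"content of {url}"}


async def fetch_with_deadline(jobs: dict[str, float], deadline: float) -> list[dict]:
    """Run all fetches concurrently and keep whatever finishes in time."""
    tasks = [asyncio.create_task(fake_fetch(url, delay)) for url, delay in jobs.items()]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:
        task.cancel()  # give up on stragglers instead of delaying the answer
    # Let the cancellations finish so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    # Preserve request order among the tasks that completed successfully.
    return [t.result() for t in tasks if t in done and t.exception() is None]


# Hypothetical URLs with simulated response times in seconds.
jobs = {
    "https://example.com/fast": 0.01,
    "https://example.com/medium": 0.05,
    "https://example.com/slow": 5.0,
}
results = asyncio.run(fetch_with_deadline(jobs, deadline=0.3))
print([r["url"] for r in results])
```

With a single overall deadline, total fetch time is bounded by the budget rather than by the slowest site, at the cost of sometimes answering from fewer sources.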
Hands-on (15 min)
Build a Mini Perplexity Clone
#!/usr/bin/env python3
"""mini-perplexity.py: search → extract → summarise pipeline."""
import json
import asyncio
import httpx
# Stub: Ayva will expand with:
# - Real search API integration (Bing, SerpAPI, or SearXNG if available)
# - Parallel URL fetching with timeout handling
# - Content chunking by paragraph/section
# - Cross-encoder reranker for passage relevance scoring
# - Citation-aware prompt template
# - Answer formatting with [1][2][3] references
# - Caching layer (response + search results)
SEARCH_API = "https://api.duckduckgo.com/?" # or use your own SearXNG
LLM_URL = "http://localhost:8080/v1/completions"
async def search_web(query: str) -> list[dict]:
    """Placeholder: replace with a real search API."""
    # In production: call Bing/Google/SearXNG API
    # Return: [{"url": "...", "title": "...", "snippet": "..."}]
    return [
        {"url": "https://example.com/1", "title": "Result 1",
         "snippet": "AI system design involves..."},
        {"url": "https://example.com/2", "title": "Result 2",
         "snippet": "Key considerations include..."},
    ]
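async def extract_content(urls: list[str]) -> list[dict]:
    """Fetch content from URLs in parallel."""
    async with httpx.AsyncClient(timeout=10) as client:
        async def fetch(url: str) -> dict:
            try:
                resp = await client.get(url, follow_redirects=True)
                resp.raise_for_status()  # treat 4xx/5xx responses as failures
                # Keep only the first 2000 characters of the raw response body.
                return {"url": url, "content": resp.text[:2000]}
            except Exception as e:
                # Keep the failed URL in the list so source numbering stays aligned.
                return {"url": url, "content": f"[error: {e}]"}
        tasks = [fetch(u) for u in urls]
        # gather preserves input order, so results line up with urls.
        return await asyncio.gather(*tasks)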
async def generate_answer(query: str, context: str) -> str:
    """Generate answer with citations."""
    prompt = (
        "Answer the following question based on the context provided. "
        "Cite sources using [1], [2], etc. If the context doesn't contain "
        "the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(LLM_URL, json={
            "prompt": prompt,
            "max_tokens": 300,
            "temperature": 0.3,
        })
        resp.raise_for_status()  # fail loudly if the LLM server returns an error
        return resp.json()["choices"][0]["text"]
async def perplexity_pipeline(query: str) -> str:
    print(f"Query: {query}\n")
    # 1. Search
    print(" 1. Searching web...")
    results = await search_web(query)
    # 2. Extract
    urls = [r["url"] for r in results]
    print(f" 2. Fetching {len(urls)} URLs...")
    contents = await extract_content(urls)
    # 3. Build context: number each source so the model can cite it as [n]
    context_parts = []
    for i, c in enumerate(contents):
        context_parts.append(f"[{i+1}] From {c['url']}:\n{c['content'][:300]}")
    context = "\n\n".join(context_parts)
    # 4. Generate
    print(" 3. Generating answer...")
    answer = await generate_answer(query, context)
    print(f"\nAnswer:\n{answer[:500]}")
    return answer
if __name__ == "__main__":
    asyncio.run(perplexity_pipeline("What is the best way to learn AI system design?"))
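The stub's to-do list mentions a caching layer for responses and search results. One possible starting point, shown here as a sketch rather than a finished design, is an in-memory cache with a time-to-live. The `TTLCache` class, its `normalise` helper, and the 300-second TTL are all hypothetical, and the cache matches queries exactly after normalisation, which is much weaker than deduplication by semantic similarity.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, object]] = {}

    @staticmethod
    def normalise(query: str) -> str:
        """Collapse case and whitespace so trivial variants share an entry."""
        return " ".join(query.lower().split())

    def get(self, query: str):
        """Return the cached value, or None if missing or expired."""
        key = self.normalise(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # entry is too old, drop it
            return None
        return value

    def put(self, query: str, value) -> None:
        """Store a value under the normalised query with the current time."""
        self._store[self.normalise(query)] = (self.clock(), value)


cache = TTLCache(ttl=300)
cache.put("Latest NVIDIA GPUs", [{"url": "https://example.com/1"}])
hit = cache.get("  latest   nvidia gpus ")
miss = cache.get("latest AMD GPUs")
print(hit, miss)
```

In the pipeline, `search_web` could consult such a cache before calling the search API and store the results afterwards, which avoids paying for repeated identical queries.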
Questions for Ayva:
- What's the best free search API for a local Perplexity clone?
- How should search depth be balanced against latency (how many URLs to fetch)?
- What prompt template produces the most citable, trustworthy answers?
Key Takeaways
- Perplexity's innovation is combining search, extraction, and generation into one UX
- Parallel URL fetching is essential for acceptable latency
- Citations build trust and allow user verification
- The pipeline is modular: swap any component (search engine, reranker, LLM) independently