🧠 AI System Design

Day 9: Vector Databases

📂 Data & Training 📖 15 min read Needs expansion

Learning Objectives

  • Understand what vector databases do that regular databases can't
  • Learn HNSW vs IVF index types and their tradeoffs
  • Deploy Qdrant, index 1k+ documents, benchmark search quality and speed

Theory (15 min)

Why a Vector Database?

Regular databases such as Postgres store structured data and answer exact-match and range queries. Vector databases store embeddings: high-dimensional vectors that represent meaning.

A vector DB answers: "Find me the N vectors most similar to this query vector."

How Vector Search Works

Brute force (exact kNN): compare the query to every stored vector, which costs O(N) distance computations per query. Fine for 10k vectors, terrible for 10M.
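To make the linear cost concrete, here is a minimal, dependency-free sketch of exact search. The helper names (`cosine`, `brute_force_knn`) and the tiny 8-dimensional random data are illustrative assumptions, not part of any library.

```python
import heapq
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Return the k (score, index) pairs most similar to query.

    Every stored vector is scored once, so the work grows linearly
    with len(vectors).
    """
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)


random.seed(0)
DIM = 8  # tiny dimension, for illustration only
vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
query = vectors[42]  # a stored vector should be its own nearest neighbour

top = brute_force_knn(query, vectors, k=3)
print([index for _, index in top])
```

Doubling the number of stored vectors doubles the number of `cosine` calls per query, which is exactly the cost that ANN indexes try to avoid.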

Approximate Nearest Neighbour (ANN) sacrifices some accuracy for speed:

Brute force: 100% recall, ~100ms for 100k vectors
ANN:         ~95% recall, ~1ms for 100k vectors
(illustrative figures; real numbers depend on dimension, hardware and index settings)
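"Recall" here usually means recall@k: the share of the true top-k neighbours (from brute force) that the approximate search also returned. A minimal sketch, with a hypothetical helper name and made-up ID lists:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours found by the approximate search."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / len(truth)


# Made-up result lists: the ANN search missed one true neighbour (ID 0).
exact = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
approx = [7, 3, 9, 1, 4, 8, 2, 6, 5, 11]

print(recall_at_k(approx, exact, k=10))
```

In a real benchmark, the value is averaged over many queries rather than read from a single one.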

Index Types

Index                   | How It Works                      | Build Time | Search Speed | Recall | Memory
Flat (brute force)      | Compare all                       | None       | Slowest      | 100%   | Low
IVF (Inverted File)     | Cluster → search nearest clusters | Medium     | Fast         | 90-95% | Low
HNSW (Hierarchical NSW) | Multi-layer graph navigation      | Slow       | Fastest      | 95-99% | High
IVF+HNSW                | Hybrid                            | Slow       | Fast         | 95-99% | Medium

HNSW is the default for most use cases (best accuracy/speed tradeoff). Use IVF for memory-constrained or billion-scale scenarios.
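The "cluster → search nearest clusters" idea behind IVF can be sketched in a few lines of plain Python. This is a toy under stated assumptions (a handful of k-means rounds, squared Euclidean distance, hypothetical function names), not how a production library implements it. The `nprobe` parameter is the number of clusters scanned per query.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Put each vector index into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def build_ivf(vectors, n_lists, n_iters=5, seed=0):
    """Run a few k-means rounds and return (centroids, inverted lists)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(n_iters):
        lists = assign(vectors, centroids)
        for c, members in enumerate(lists):
            if members:  # keep the old centroid if a cluster went empty
                columns = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in columns]
    return centroids, assign(vectors, centroids)


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


random.seed(1)
data = [[random.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids, lists = build_ivf(data, n_lists=10)
query = [random.gauss(0, 1) for _ in range(8)]

exact = sorted(range(len(data)), key=lambda i: sq_dist(query, data[i]))[:5]
approx = ivf_search(query, data, centroids, lists, k=5, nprobe=2)
full = ivf_search(query, data, centroids, lists, k=5, nprobe=10)

print("recall@5 with nprobe=2:", len(set(exact) & set(approx)) / 5)
```

Raising `nprobe` scans more clusters, which trades speed for recall. With `nprobe` equal to the number of lists, the search degenerates into brute force and returns the exact result.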

The Distance Function Matters

Metric         | Use Case                                    | Example
Cosine         | Semantic similarity (normalised vectors)    | Default for embeddings
Euclidean (L2) | Absolute distance in vector space           | Clustering
Dot product    | Magnitude-sensitive similarity              | Some trained models
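The three metrics are easy to compare by hand. The sketch below uses hypothetical helper names and two-dimensional toy vectors. It also checks a standard identity: for unit-length vectors, cosine equals the dot product, and squared L2 distance equals 2 - 2 · cosine, so all three produce the same ranking.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalise(a):
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print("cosine:   ", cosine(a, b))     # looks at direction only
print("dot:      ", dot(a, b))        # grows with vector length
print("euclidean:", euclidean(a, b))  # absolute gap between the points

# On unit-length vectors the metrics carry the same information.
u = normalise([1.0, 2.0])
v = normalise([2.0, 1.0])
print(abs(cosine(u, v) - dot(u, v)) < 1e-12)
print(abs(euclidean(u, v) ** 2 - (2 - 2 * dot(u, v))) < 1e-12)
```

This is why normalising embeddings before indexing makes the choice between cosine and dot product mostly a matter of convenience, while unnormalised vectors can rank differently under each metric.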

Hybrid Search (Dense + Sparse)

Best practice: combine dense embeddings (semantic) with sparse keywords (exact match):

Score = α * dense_similarity + (1-α) * keyword_score

This catches cases that pure semantic search misses (proper names, IDs, exact terms).
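A small sketch of the weighted formula follows. One assumption is added here that the formula leaves open: dense similarities and keyword scores (for example BM25) live on different scales, so both are min-max normalised to [0, 1] before mixing. The function names and the scores are made up for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense, keyword, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * keyword.

    A document missing from one result list contributes 0 for that part.
    """
    dense_n, keyword_n = min_max(dense), min_max(keyword)
    docs = set(dense_n) | set(keyword_n)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * keyword_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: doc_a is the closest embedding but has no keyword match,
# doc_b is nearly as close and contains the exact query term.
dense = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.40}
keyword = {"doc_b": 12.0, "doc_c": 3.0}

for doc, score in hybrid_rank(dense, keyword, alpha=0.5):
    print(f"{doc}: {score:.3f}")
```

With `alpha=1.0` the ranking falls back to pure dense search and `doc_a` wins; at `alpha=0.5` the exact keyword match lifts `doc_b` to the top. Other fusion schemes, such as reciprocal rank fusion, avoid score normalisation altogether.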


Hands-on (15 min)

Deploy Qdrant + Index Documents

# Run Qdrant in Docker
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Install client
pip install qdrant-client sentence-transformers

#!/usr/bin/env python3
"""vector-db-bench.py — index and search documents in Qdrant."""
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
import time

# Stub — Ayva will research and expand with:
# - Multiple index type comparisons (Flat, HNSW, IVF)
# - Recall@k benchmarks
# - Memory usage tracking
# - Hybrid search (full-text + vector)
# - Collection configuration best practices

client = QdrantClient("localhost", port=6333)

# Create collection
COLLECTION = "test_benchmark"
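# Drop the collection from any previous run, then create it fresh
# (recreate_collection is deprecated in recent qdrant-client releases)
if client.collection_exists(COLLECTION):
    client.delete_collection(COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)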

# Generate synthetic vectors (simulate real embeddings)
N_VECTORS = 10000
DIM = 384
vectors = np.random.randn(N_VECTORS, DIM).astype(np.float32)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

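# Index in batches so that each request stays small
BATCH_SIZE = 1000
start = time.time()
for offset in range(0, N_VECTORS, BATCH_SIZE):
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=offset + i, vector=v.tolist(), payload={"text": f"doc_{offset + i}"})
            for i, v in enumerate(vectors[offset:offset + BATCH_SIZE])
        ],
    )
print(f"Indexed {N_VECTORS} vectors in {time.time()-start:.2f}s")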

# Search
query = np.random.randn(DIM).astype(np.float32)
query = query / np.linalg.norm(query)

start = time.time()
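# query_points replaces the deprecated client.search in recent qdrant-client releases
results = client.query_points(
    collection_name=COLLECTION,
    query=query.tolist(),
    limit=10,
).points
elapsed_ms = (time.time() - start) * 1000  # time.time() returns seconds
print(f"Search returned {len(results)} results in {elapsed_ms:.1f}ms")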

for r in results:
    print(f"  [{r.score:.4f}] {r.payload['text']}")

Questions for Ayva:

  • What's the recall/performance curve for HNSW with different ef_construct/m values?
  • How does IVF with different nprobe settings compare?
  • When should you use Qdrant vs Chroma vs Pinecone vs pgvector?


Key Takeaways

  • Vector databases enable semantic search at scale by using ANN indexes
  • HNSW is the best general-purpose index (fast, accurate, but memory-hungry)
  • IVF trades some accuracy for lower memory usage
  • Hybrid search (dense + sparse/BM25) often outperforms pure vector search, especially on queries with proper names, IDs, or exact terms
  • The distance function should match how your embedding model was trained

References