Day 9: Vector Databases


Learning Objectives

Understand what vector databases do that regular databases can't
Learn HNSW vs IVF index types and their tradeoffs
Deploy Qdrant, index 1k+ documents, benchmark search quality and speed

Theory (15 min)

Why a Vector Database?

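Regular databases such as Postgres store structured data and answer exact-match or range queries. Vector databases store embeddings: high-dimensional vectors that represent meaning, so items with similar meaning sit close together in the vector space.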

A vector DB answers: "Find me the N vectors most similar to this query vector."

How Vector Search Works

Brute force (exact kNN): compare the query to every stored vector, which costs O(N) distance computations per query. Fine for 10k vectors, too slow for 10M.
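As a rough illustration of the brute-force approach, the sketch below scores a query against every stored vector and keeps the best k. It uses only the standard library; the function names `cosine_similarity` and `brute_force_knn` are illustrative and do not come from any vector database API.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Score the query against every vector and keep the k most similar.

    Returns a list of (index, similarity) pairs, best match first.
    The cost is one similarity computation per stored vector, i.e. O(N).
    """
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
    query = [rng.gauss(0, 1) for _ in range(8)]
    for idx, score in brute_force_knn(query, corpus, k=3):
        print(idx, round(score, 4))
```

Every query touches all N vectors, which is why this approach stops being practical as the collection grows.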

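Approximate Nearest Neighbour (ANN) search sacrifices some accuracy for speed. The figures below are illustrative; real numbers depend on hardware, vector dimension and index settings: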

Brute force: 100% recall, 100ms for 100k vectors
ANN:         95% recall, 1ms for 100k vectors
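Recall here means the share of the true nearest neighbours that the approximate search also returned. A minimal sketch of that measurement, assuming you already have the exact top-k ids (for example from a brute-force scan) and the ids an ANN index returned; `recall_at_k` is an illustrative helper name:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the ANN result also contains."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


if __name__ == "__main__":
    exact = [7, 3, 9, 1, 4]    # ground truth from an exact search
    approx = [7, 9, 3, 4, 8]   # ids returned by an ANN index
    print(recall_at_k(approx, exact, k=5))
```

Order within the top k does not matter for this metric, only membership. Averaging it over many queries gives a more stable number than a single query.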

Index Types

Index	How It Works	Build Time	Search Speed	Recall	Memory
Flat (brute force)	Compare all	None	Slowest	100%	Low
IVF (Inverted File)	Cluster → search nearest clusters	Medium	Fast	90-95%	Low
HNSW (Hierarchical Navigable Small World)	Multi-layer graph navigation	Slow	Fastest	95-99%	High
IVF+HNSW	Hybrid	Slow	Fast	95-99%	Medium

HNSW is the default for most use cases (best accuracy/speed tradeoff). Use IVF for memory-constrained or billion-scale scenarios.
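To make the "cluster, then search nearest clusters" idea behind IVF concrete, here is a toy sketch in plain Python. It is a simplification under stated assumptions: centroids are sampled from the data rather than trained with k-means as real IVF implementations typically do, and the names `build_ivf`, `ivf_search`, `n_lists` and `nprobe` are illustrative rather than any library's API.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, n_lists, rng):
    """Assign every vector to its nearest centroid.

    Assumption: centroids are simply sampled from the data; production
    systems usually train them with k-means instead.
    """
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    candidates.sort(key=lambda idx: l2(query, vectors[idx]))
    return candidates[:k]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
    centroids, lists = build_ivf(data, n_lists=16, rng=rng)
    query = [rng.gauss(0, 1) for _ in range(4)]
    print(ivf_search(query, data, centroids, lists, k=5, nprobe=2))
```

Raising `nprobe` scans more lists, which improves recall at the cost of speed; with `nprobe` equal to the number of lists the search degenerates to an exact scan.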

The Distance Function Matters

Metric	Use Case	Example
Cosine	Semantic similarity (normalised vectors)	Default for embeddings
Euclidean (L2)	Absolute distance in vector space	Clustering
Dot product	Magnitude-sensitive similarity	Some trained models
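The three metrics can disagree when vector magnitudes differ, and they agree on ranking once vectors are normalised. The sketch below illustrates both points with hand-picked 2-D vectors; the helper names are illustrative.

```python
import math


def dot(a, b):
    """Dot product: grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: depends on the angle only, not on magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean (L2) distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


if __name__ == "__main__":
    query = [1.0, 0.0]
    same_direction_longer = [3.0, 0.0]
    slightly_rotated = [0.8, 0.6]
    for name, vec in [("longer", same_direction_longer), ("rotated", slightly_rotated)]:
        print(name, cosine(query, vec), dot(query, vec), euclidean(query, vec))
```

Cosine ranks the longer vector first because it points the same way as the query, while Euclidean distance ranks the rotated vector first because it is closer in absolute terms. For unit-length vectors the squared Euclidean distance equals 2 - 2 * cosine, so all three metrics produce the same ordering.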

Hybrid Search (Dense + Sparse)

Best practice: combine dense embeddings (semantic) with sparse keywords (exact match):

Score = α * dense_similarity + (1-α) * keyword_score

This catches cases that pure semantic search misses (proper names, IDs, exact terms).
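A minimal sketch of the weighted score above, assuming each retriever returns a mapping from document id to raw score. Dense similarities and keyword scores (for example BM25) live on different scales, so this sketch min-max normalises each set before mixing; that normalisation step and the names `min_max` and `hybrid_rank` are assumptions for illustration, not a prescribed method.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense, keyword, alpha=0.5):
    """Fuse two score sets as alpha * dense + (1 - alpha) * keyword.

    Documents missing from one retriever contribute 0 for that component.
    Returns (doc_id, fused_score) pairs, best first.
    """
    dense_n, keyword_n = min_max(dense), min_max(keyword)
    docs = set(dense_n) | set(keyword_n)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * keyword_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    dense_scores = {"a": 0.9, "b": 0.8, "c": 0.1}       # cosine similarities
    keyword_scores = {"b": 12.0, "c": 3.0, "d": 0.0}    # keyword scores
    for doc, score in hybrid_rank(dense_scores, keyword_scores, alpha=0.5):
        print(doc, round(score, 3))
```

Setting `alpha=1.0` reduces this to pure dense ranking and `alpha=0.0` to pure keyword ranking, which makes it easy to compare the three on the same queries.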

Hands-on (15 min)

Deploy Qdrant + Index Documents

# Run Qdrant in Docker
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Install client
pip install qdrant-client sentence-transformers

#!/usr/bin/env python3
"""vector-db-bench.py — index and search documents in Qdrant."""
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
import time

# Stub — Ayva will research and expand with:
# - Multiple index type comparisons (Flat, HNSW, IVF)
# - Recall@k benchmarks
# - Memory usage tracking
# - Hybrid search (full-text + vector)
# - Collection configuration best practices

client = QdrantClient("localhost", port=6333)

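# Create collection (drop any previous copy so each run starts clean)
COLLECTION = "test_benchmark"
if client.collection_exists(COLLECTION):
    client.delete_collection(COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)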

# Generate synthetic vectors (simulate real embeddings)
N_VECTORS = 10000
DIM = 384
vectors = np.random.randn(N_VECTORS, DIM).astype(np.float32)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

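# Index in batches so that no single request grows too large
BATCH_SIZE = 1000
start = time.perf_counter()
for offset in range(0, N_VECTORS, BATCH_SIZE):
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=offset + i, vector=v.tolist(), payload={"text": f"doc_{offset + i}"})
            for i, v in enumerate(vectors[offset:offset + BATCH_SIZE])
        ],
    )
print(f"Indexed {N_VECTORS} vectors in {time.perf_counter()-start:.2f}s")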

# Search
query = np.random.randn(DIM).astype(np.float32)
query = query / np.linalg.norm(query)

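# Note: newer qdrant-client releases deprecate search() in favour of query_points()
start = time.perf_counter()
results = client.search(
    collection_name=COLLECTION,
    query_vector=query.tolist(),
    limit=10,
)
elapsed_ms = (time.perf_counter() - start) * 1000  # convert seconds to milliseconds
print(f"Search returned {len(results)} results in {elapsed_ms:.1f}ms")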

for r in results:
    print(f"  [{r.score:.4f}] {r.payload['text']}")
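A single timed query is noisy, since the first call often pays for connection setup and cold caches. One possible way to get steadier numbers is to time many queries and report percentiles. The helper below is a hypothetical sketch using only the standard library; `latency_ms` and its `warmup` parameter are illustrative names, and `search_fn` stands for any callable that runs one search.

```python
import math
import statistics
import time


def latency_ms(search_fn, queries, warmup=3):
    """Time search_fn(query) for each query; return latency stats in milliseconds."""
    if not queries:
        raise ValueError("queries must not be empty")
    for q in queries[:warmup]:
        search_fn(q)  # warm caches and connections before timing
    samples = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real search call.
    stats = latency_ms(lambda q: sum(i * q for i in range(10_000)), list(range(50)))
    print(stats)
```

With the script above, `search_fn` could wrap the existing call, for example `lambda q: client.search(collection_name=COLLECTION, query_vector=q, limit=10)`, with `queries` being a list of query vectors converted via `.tolist()`.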

Questions for Ayva:
- What's the recall/performance curve for HNSW with different ef_construct/m values?
- How does IVF with different nprobe settings compare?
- When should you use Qdrant vs Chroma vs Pinecone vs pgvector?

Key Takeaways

Vector databases enable semantic search at scale by using ANN indexes
HNSW is the best general-purpose index (fast, accurate, but memory-hungry)
IVF trades some accuracy for lower memory usage
Hybrid search (dense + sparse/BM25) often outperforms pure vector search, especially on queries containing proper names, IDs or exact terms
The distance function should match how your embedding model was trained
