Day 9: Vector Databases
Learning Objectives
- Understand what vector databases do that regular databases can't
- Learn HNSW vs IVF index types and their tradeoffs
- Deploy Qdrant, index 1k+ documents, benchmark search quality and speed
Theory (15 min)
Why a Vector Database?
Regular databases (e.g. Postgres) store structured data and answer exact-match or range queries. Vector databases store embeddings: high-dimensional vectors that represent meaning.
A vector DB answers: "Find me the N vectors most similar to this query vector."
How Vector Search Works
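Brute force (exact kNN): compare the query to every stored vector, which costs O(N) distance computations per query. Fine for 10k vectors, too slow for 10M.
Approximate Nearest Neighbour (ANN) sacrifices some accuracy for speed. Illustrative orders of magnitude (actual figures depend on hardware, dimensionality, and index settings):
Brute force: 100% recall, ~100 ms for 100k vectors
ANN: ~95% recall, ~1 ms for 100k vectors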
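The brute-force baseline can be sketched in a few lines. This is a minimal illustration in plain Python, not how a vector database implements it; the helper names `cosine_similarity` and `brute_force_knn` and the random 8-dimensional corpus are assumptions made up for the example.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Score every stored vector against the query and keep the top k.

    Returns (index, similarity) pairs, best first. The cost is one
    similarity computation per stored vector, i.e. O(N) per query.
    """
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy corpus: 1,000 random 8-dimensional vectors.
random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]

# Querying with a stored vector: that vector should come back first.
query = corpus[42]
top = brute_force_knn(query, corpus, k=3)
for idx, score in top:
    print(f"id={idx} similarity={score:.4f}")
```

Every query touches all N vectors, which is why the cost grows linearly with collection size and why ANN indexes exist.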
Index Types
| Index | How It Works | Build Time | Search Speed | Recall | Memory |
|---|---|---|---|---|---|
| Flat (brute force) | Compare all | None | Slowest | 100% | Low |
| IVF (Inverted File) | Cluster → search nearest clusters | Medium | Fast | 90-95% | Low |
| HNSW (Hierarchical NSW) | Multi-layer graph navigation | Slow | Fastest | 95-99% | High |
| IVF+HNSW | Hybrid: IVF clusters, with an HNSW graph used to find the nearest clusters | Slow | Fast | 95-99% | Medium |
HNSW is the default for most use cases (best accuracy/speed tradeoff). Use IVF for memory-constrained or billion-scale scenarios.
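To make the IVF row concrete, here is a toy sketch of the "cluster, then search nearest clusters" idea in plain Python. It is an assumption-laden illustration, not any library's implementation: `build_ivf`, `ivf_search`, the few rounds of k-means, and the random data are all made up for the example. `nprobe` follows the common naming for "number of clusters to scan".

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def assign(vectors, centroids):
    """Inverted lists: for each centroid, the ids of the vectors nearest to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(i)
    return lists


def build_ivf(vectors, n_clusters, n_iters=5, seed=0):
    """Run a few k-means rounds and return (centroids, inverted lists)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(n_iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k]


rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids, lists = build_ivf(data, n_clusters=10)
q = [rng.gauss(0, 1) for _ in range(8)]

# Exact top-5 by brute force, used as ground truth for recall.
exact = sorted(range(len(data)), key=lambda i: sq_dist(q, data[i]))[:5]
for nprobe in (1, 3, 10):
    approx = ivf_search(q, data, centroids, lists, k=5, nprobe=nprobe)
    recall = len(set(approx) & set(exact)) / len(exact)
    print(f"nprobe={nprobe:2d} recall@5={recall:.2f}")
```

Scanning fewer clusters does less work but may miss true neighbours that fall just across a cluster boundary; with `nprobe` equal to the number of clusters the search degenerates to brute force and recall is 100%.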
The Distance Function Matters
| Metric | Use Case | Example |
|---|---|---|
| Cosine | Semantic similarity (normalised vectors) | Default for embeddings |
| Euclidean (L2) | Absolute distance in vector space | Clustering |
| Dot product | Magnitude-sensitive similarity | Some trained models |
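A small worked example may help show why the metric matters. The sketch below uses hand-picked 2-D vectors (hypothetical, chosen only to make the effect visible) to show that cosine and dot product can rank the same candidates differently, and that on unit-length vectors L2 distance and cosine carry the same ranking information.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalise(v):
    """Scale v to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


query = [1.0, 0.0]
short_aligned = [0.5, 0.0]  # same direction as the query, small magnitude
long_off_axis = [3.0, 3.0]  # 45 degrees off the query, large magnitude

for name, v in [("short_aligned", short_aligned), ("long_off_axis", long_off_axis)]:
    print(
        f"{name:14s} cosine={cosine(query, v):.3f} "
        f"dot={dot(query, v):.3f} l2={l2_distance(query, v):.3f}"
    )

# Cosine ignores magnitude, so short_aligned ranks first.
# Dot product rewards magnitude, so long_off_axis ranks first.

# On unit vectors, squared L2 distance equals 2 - 2 * cosine, so
# L2 and cosine produce the same ordering.
qn, vn = normalise(query), normalise(long_off_axis)
print(l2_distance(qn, vn) ** 2, 2 - 2 * cosine(query, long_off_axis))
```

This is why normalising embeddings before indexing makes the choice between cosine, L2, and dot product largely a matter of convenience, while on unnormalised vectors the choice changes the results.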
Hybrid Search (Dense + Sparse)
Best practice: combine dense embeddings (semantic) with sparse keywords (exact match):
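Score = α * dense_similarity + (1-α) * keyword_score
Here α in [0, 1] sets the weight of the dense score, and the two scores should be normalised to a comparable range before mixing. This catches cases that pure semantic search misses (proper names, IDs, exact terms).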
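The weighted-sum formula above can be sketched as follows. This is one possible reading, under assumptions the text does not spell out: scores are min-max normalised per retriever before mixing, and a document missing from one retriever's results scores 0 there. The function names and the example scores are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so dense and keyword scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking signal, so contribute nothing.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense, keyword, alpha=0.5):
    """Fuse two {doc_id: score} dicts as alpha * dense + (1 - alpha) * keyword.

    Returns (doc_id, fused_score) pairs, best first; ties break by doc_id.
    """
    dense_n, keyword_n = min_max(dense), min_max(keyword)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * keyword_n.get(doc, 0.0)
        for doc in set(dense) | set(keyword)
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores: cosine similarities and BM25-style keyword scores.
dense = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.40}
keyword = {"doc_b": 7.1, "doc_c": 2.0, "doc_d": 9.3}

for doc, score in hybrid_rank(dense, keyword, alpha=0.5):
    print(f"{doc}: {score:.3f}")
```

With these numbers `doc_b` ranks first because it scores well on both signals, while `doc_a` (dense only) and `doc_d` (keyword only) fall behind. Setting `alpha` to 1.0 or 0.0 recovers pure dense or pure keyword ranking.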
Hands-on (15 min)
Deploy Qdrant + Index Documents
# Run Qdrant in Docker
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant
# Install client
pip install qdrant-client sentence-transformers
#!/usr/bin/env python3
"""vector-db-bench.py — index and search documents in Qdrant."""
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
import time
# Stub — Ayva will research and expand with:
# - Multiple index type comparisons (Flat, HNSW, IVF)
# - Recall@k benchmarks
# - Memory usage tracking
# - Hybrid search (full-text + vector)
# - Collection configuration best practices
client = QdrantClient("localhost", port=6333)
# Create collection
COLLECTION = "test_benchmark"
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
# Generate synthetic vectors (simulate real embeddings)
N_VECTORS = 10000
DIM = 384
vectors = np.random.randn(N_VECTORS, DIM).astype(np.float32)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# Index
start = time.time()
client.upsert(
    collection_name=COLLECTION,
    points=[
        PointStruct(id=i, vector=v.tolist(), payload={"text": f"doc_{i}"})
        for i, v in enumerate(vectors)
    ],
)
print(f"Indexed {N_VECTORS} vectors in {time.time()-start:.2f}s")
# Search
query = np.random.randn(DIM).astype(np.float32)
query = query / np.linalg.norm(query)
start = time.time()
results = client.search(
    collection_name=COLLECTION,
    query_vector=query.tolist(),
    limit=10,
)
elapsed_ms = (time.time() - start) * 1000  # time.time() returns seconds
print(f"Search returned {len(results)} results in {elapsed_ms:.1f}ms")
for r in results:
    print(f"  [{r.score:.4f}] {r.payload['text']}")
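The script above measures speed but not search quality. A recall@k helper along the following lines could cover the quality side of the benchmark. It is a sketch in plain Python and does not talk to Qdrant: the function names and the id lists are hypothetical. One way to obtain ground truth is to run each query twice, once normally and once as an exact search. qdrant-client exposes `SearchParams(exact=True)` for this, though the exact call depends on the client version, so treat the commented lines as an assumption.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall_at_k(approx_runs, exact_runs, k):
    """Average recall@k over several queries."""
    pairs = list(zip(approx_runs, exact_runs))
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)


# Assumed way to collect ids from Qdrant (not executed here):
#   approx_ids = [r.id for r in client.search(COLLECTION, query_vector=q, limit=k)]
#   exact_ids = [r.id for r in client.search(
#       COLLECTION, query_vector=q, limit=k,
#       search_params=SearchParams(exact=True))]

# Hypothetical result ids for two queries.
approx_runs = [[1, 2, 3, 9], [5, 6, 7, 8]]
exact_runs = [[1, 2, 3, 4], [5, 6, 7, 8]]

print(mean_recall_at_k(approx_runs, exact_runs, k=4))  # 0.875
```

Averaging over many queries matters because recall for a single query is coarse: with k=10 it can only take values in steps of 0.1.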
Questions for Ayva:
- What's the recall/performance curve for HNSW with different ef_construct/m values?
- How does IVF with different nprobe settings compare?
- When should you use Qdrant vs Chroma vs Pinecone vs pgvector?
Key Takeaways
- Vector databases enable semantic search at scale by using ANN indexes
- HNSW is the best general-purpose index (fast, accurate, but memory-hungry)
- IVF trades some accuracy for lower memory usage
- Hybrid search (dense + sparse/BM25) often outperforms pure vector search, especially on queries containing proper names, IDs, or exact terms
- The distance function should match how your embedding model was trained