Vector Database Scaling

Retrieval pressure grows with every conversation. Watch an exact nearest-neighbour search slow as embeddings pile up, then scale it with approximate search, namespaces, shards and replicas.

Every conversation adds embeddings to your agent's memory. Grow the index and watch exact search crawl — then switch to approximate search and add a namespace filter to keep retrieval fast.

🧠 Embeddings in the index: 1,000,000 (searching 1,000,000 vectors)

Query latency: 500 ms · Recall: 100% (every match found) · Search space: full

Embedding space plot: ⭐ = query, green = nearest neighbours returned

Exact search scans every vector — retrieval is now too slow for a live agent.
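To make the cost of a full scan concrete, here is a minimal sketch of exact search in plain Python. The function names, the ids and the 2-d vectors are hypothetical and chosen only for illustration; a real index would hold high-dimensional embeddings and use optimized numeric code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(index, query, k):
    """Score every stored vector against the query and keep the k best.

    The work is proportional to len(index): doubling the index doubles the scan.
    """
    scored = [(cosine_similarity(vec, query), vec_id) for vec_id, vec in index.items()]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]

# Tiny hand-written index (hypothetical ids and vectors).
index = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top2 = exact_search(index, [1.0, 0.0], k=2)
print(top2)  # ['a', 'b']
```

Every query touches every vector, which is why recall is 100% and why latency climbs in step with the index size.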

Scaling the index

🏷️ Namespaces / filters

Search only the relevant slice, which shrinks N before the search even runs.

🕸️ ANN index (HNSW)

Graph search instead of a full scan: latency stays nearly flat as N grows, at a small cost in recall.

🗄️ Shards

Split vectors across nodes for storage and write scale.

📄 Replicas

Copies of the index that add query throughput.
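The namespace and shard options can be sketched together. This is an illustrative toy, not any particular vector database's API: `Shard`, `ShardedIndex`, the CRC-based routing and the sample ids are all assumptions made for the example. Each shard filters by namespace before scoring, and the query fans out to all shards and merges their local top-k results.

```python
import heapq
import zlib

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class Shard:
    """One node's slice of the index: records of (id, namespace, vector)."""

    def __init__(self):
        self.records = []

    def add(self, vec_id, namespace, vector):
        self.records.append((vec_id, namespace, vector))

    def search(self, query, k, namespace=None):
        # The namespace filter runs first, so only the matching slice is scored.
        candidates = [r for r in self.records if namespace is None or r[1] == namespace]
        scored = [(dot(vec, query), vec_id) for vec_id, _, vec in candidates]
        return heapq.nlargest(k, scored)

class ShardedIndex:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, vec_id, namespace, vector):
        # Stable routing: the same id always lands on the same shard.
        shard_no = zlib.crc32(vec_id.encode("utf-8")) % len(self.shards)
        self.shards[shard_no].add(vec_id, namespace, vector)

    def search(self, query, k, namespace=None):
        # Scatter: ask every shard for its local top-k.
        partial = []
        for shard in self.shards:
            partial.extend(shard.search(query, k, namespace))
        # Gather: merge the partial results into a global top-k.
        return [vec_id for _, vec_id in heapq.nlargest(k, partial)]

index = ShardedIndex(num_shards=2)
index.add("u1-note1", "user-1", [1.0, 0.0])
index.add("u1-note2", "user-1", [0.6, 0.8])
index.add("u2-note1", "user-2", [0.9, 0.1])
index.add("u2-note2", "user-2", [0.0, 1.0])

everything = index.search([1.0, 0.0], k=2)
only_user_1 = index.search([1.0, 0.0], k=2, namespace="user-1")
print(everything)   # ['u1-note1', 'u2-note1']
print(only_user_1)  # ['u1-note1', 'u1-note2']
```

Replicas are not modelled here; in this picture a replica would simply be a second copy of each `Shard` that a load balancer could send reads to.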

Scaling the agent — stateless workers + checkpoints

Scaling out means running many agent workers behind a load balancer — which only works if a worker holds no critical state in local memory. Put the review's progress in a checkpoint store and any worker can resume it. Run a review partway, then kill the worker.

Review steps: intake → analyze → retrieve → finance → risk → respond
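The kill-and-resume behaviour can be sketched in a few lines. Everything here is hypothetical scaffolding: `CheckpointStore` is an in-memory stand-in for a shared durable store, `run_review` and `WorkerKilled` are invented names, and the `die_after` parameter exists only to simulate a crash.

```python
STEPS = ["intake", "analyze", "retrieve", "finance", "risk", "respond"]

class CheckpointStore:
    """Stand-in for a shared durable store (a database or key-value service in practice)."""

    def __init__(self):
        self._data = {}

    def load(self, review_id):
        # Return a copy so workers never share mutable state through the store.
        return list(self._data.get(review_id, []))

    def save(self, review_id, completed_steps):
        self._data[review_id] = list(completed_steps)

class WorkerKilled(Exception):
    """Raised to simulate a worker dying partway through a review."""

def run_review(worker_name, review_id, store, die_after=None):
    """Run the remaining steps; all progress lives in the store, none in the worker."""
    completed = store.load(review_id)
    for step in STEPS[len(completed):]:
        if die_after is not None and len(completed) >= die_after:
            raise WorkerKilled(f"{worker_name} killed before {step}")
        # The real work for `step` would happen here.
        completed.append(step)
        store.save(review_id, completed)  # checkpoint after every step
    return completed

store = CheckpointStore()
try:
    run_review("worker-a", "review-42", store, die_after=3)
except WorkerKilled as exc:
    print(exc)  # worker-a killed before finance

saved = store.load("review-42")
final = run_review("worker-b", "review-42", store)
print(saved)  # ['intake', 'analyze', 'retrieve']
print(final)  # all six steps, the last three run by worker-b
```

Because `run_review` reads its starting point from the store, it does not matter which worker picks the review up next.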

What just happened

▹ A vector database answers 'find the most similar embeddings.' Exact nearest-neighbour search compares the query against every vector, so latency grows linearly with the number of vectors and an AI app's retrieval slows as its memory grows.
▹ Approximate nearest-neighbour (ANN, e.g. HNSW) searches a graph instead of scanning everything: latency stays near-flat as the index grows, at the cost of occasionally missing a true match (slightly lower recall).
▹ Beyond the algorithm, you scale the same way as any database: namespaces/metadata filters shrink the search space, shards spread vectors across nodes for write/storage scale, and replicas add query throughput.
▹ To scale the agent itself, the workers must be stateless: their progress lives in a checkpoint store, not local memory. Then if a worker dies mid-review, another resumes from the last checkpoint instead of losing the work.
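The trade-off in the approximate-search point can be illustrated with a greedy walk over a neighbour graph. This is a deliberately simplified, single-layer sketch and not HNSW itself (HNSW adds a hierarchy of layers and a more careful candidate list); the function names, graph degree and random data are assumptions for the example. It requires Python 3.8+ for `math.dist`.

```python
import math
import random

def build_knn_graph(vectors, m):
    """Link each vector to its m nearest neighbours (brute force, build time only)."""
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(v, vectors[j]),
        )
        graph[i] = others[:m]
    return graph

def greedy_search(vectors, graph, query, entry=0):
    """Hop to the neighbour closest to the query until no neighbour improves.

    Returns (best_id, number_of_vectors_compared). The walk can stop in a
    local minimum, which is how a true nearest neighbour is occasionally missed.
    """
    seen = {entry: math.dist(vectors[entry], query)}
    current = entry
    while True:
        for nb in graph[current]:
            if nb not in seen:
                seen[nb] = math.dist(vectors[nb], query)
        best = min(graph[current] + [current], key=lambda i: seen[i])
        if seen[best] >= seen[current]:
            return current, len(seen)
        current = best

random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(500)]
graph = build_knn_graph(vectors, m=8)
query = [0.5, 0.5]

found, compared = greedy_search(vectors, graph, query)
exact = min(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
print(f"graph search compared {compared} of {len(vectors)} vectors")
print(f"found id {found}, exact nearest id {exact}")
```

The graph walk only compares the query against the vectors it passes near, so it usually does far less work than the full scan, and the result may or may not match the exact answer.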