All labs
Lab 30
Scaling AI Systems

Vector Database Scaling

Retrieval pressure grows with every conversation. Watch an exact nearest-neighbour search slow as embeddings pile up, then scale it with approximate search, namespaces, shards and replicas.

Every conversation adds embeddings to your agent's memory. Grow the index and watch exact search crawl β€” then switch to approximate search and add a namespace filter to keep retrieval fast.
1,000,000
searching 1,000,000 vectors
Query latency
500 ms
Recall
100%
every match found
Search space
full
Embedding space β€” ⭐ query Β· green = nearest neighbours returned
β˜…
Exact search scans every vector β€” retrieval is now too slow for a live agent.
Scaling the index
🏷️Namespaces / filters
Search only the relevant slice β€” shrinks N before the search even runs.
πŸ•ΈοΈANN index (HNSW)
Graph search instead of full scan β€” latency stays flat as N grows.
πŸ—„οΈShards
Split vectors across nodes for storage & write scale.
πŸ“„Replicas
Copies of the index to add query throughput.
Scaling the agent β€” stateless workers + checkpoints

Scaling out means running many agent workers behind a load balancer β€” which only works if a worker holds no critical state in local memory. Put the review's progress in a checkpoint store and any worker can resume it. Run a review partway, then kill the worker.

intake
analyze
retrieve
finance
risk
respond
What just happened