Retrieval pressure grows with every conversation. Watch an exact nearest-neighbour search slow as embeddings pile up, then scale it with approximate search, namespaces, shards and replicas.
Every conversation adds embeddings to your agent's memory. Grow the index and watch exact search crawl, then switch to approximate search and add a namespace filter to keep retrieval fast.
Index size: 1,000,000 vectors
Query latency: 500 ms
Recall: 100% (every match found)
Search space: full
Embedding space. Legend: marked point = query · green = nearest neighbours returned
Exact search scans every vector, so retrieval is now too slow for a live agent.
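To make the full scan concrete, here is a minimal sketch of exact nearest-neighbour search in plain Python. The function name `exact_search`, the Euclidean distance metric, and the toy data sizes are assumptions chosen for illustration; they are not taken from the demo itself.

```python
import heapq
import math
import random


def exact_search(query, vectors, k=3):
    """Return the k nearest (distance, index) pairs by scanning every vector."""
    scored = ((math.dist(query, vec), i) for i, vec in enumerate(vectors))
    # nsmallest has to consume the whole generator: one distance per stored vector.
    return heapq.nsmallest(k, scored)


random.seed(0)  # fixed seed so the toy data is reproducible
dim = 8         # hypothetical embedding size, far smaller than real embeddings
index = [[random.random() for _ in range(dim)] for _ in range(10_000)]
query = [random.random() for _ in range(dim)]

for dist, i in exact_search(query, index):
    print(f"vector {i} at distance {dist:.3f}")
```

Because every call computes one distance per stored vector, doubling the index roughly doubles the work per query.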
Scaling the index
Namespaces / filters
Search only the relevant slice: this shrinks N before the search even runs.
ANN index (HNSW)
Graph search instead of a full scan: latency stays nearly flat as N grows.
Shards
Split vectors across nodes for storage and write scale.
Replicas
Copies of the index that add query throughput.
Scaling the agent: stateless workers + checkpoints
Scaling out means running many agent workers behind a load balancer, which only works if no worker holds critical state in local memory. Put the review's progress in a checkpoint store and any worker can resume it. Run a review partway, then kill the worker.
intake → analyze → retrieve → finance → risk → respond
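The kill-and-resume behaviour can be sketched in a few lines. The stage names come from the pipeline above; everything else is an assumption made for illustration, including the in-memory dict standing in for an external checkpoint store, the `run_review` function, and the `crash_before` switch used to simulate a dead worker.

```python
STAGES = ["intake", "analyze", "retrieve", "finance", "risk", "respond"]

# Stand-in for an external checkpoint store (a database or key-value service in practice).
checkpoint_store = {}


class WorkerCrashed(Exception):
    """Raised to simulate a worker being killed mid-review."""


def run_review(review_id, worker_name, crash_before=None):
    """Run the remaining stages of a review, checkpointing after each one."""
    # The worker keeps nothing between calls: progress is read from the store.
    state = checkpoint_store.get(review_id, {"done": [], "log": []})
    for stage in STAGES[len(state["done"]):]:
        if stage == crash_before:
            raise WorkerCrashed(f"{worker_name} died before '{stage}'")
        state = {
            "done": state["done"] + [stage],
            "log": state["log"] + [f"{stage} by {worker_name}"],
        }
        checkpoint_store[review_id] = state  # persist before moving on
    return state


try:
    run_review("review-1", "worker-a", crash_before="finance")
except WorkerCrashed as err:
    print(err)

final = run_review("review-1", "worker-b")  # a different worker resumes
print(final["log"])
```

The second worker never sees the first worker's memory. It only reads the last checkpoint, so the stages that already finished are not repeated.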
What just happened
ℹ A vector database answers 'find the most similar embeddings.' Exact nearest-neighbour search compares the query against every vector, so latency grows linearly with index size and an AI app's retrieval slows as its memory grows.
ℹ Approximate nearest-neighbour search (ANN, e.g. HNSW) walks a graph instead of scanning everything: latency stays near-flat as the index grows, at the cost of occasionally missing a true match (slightly lower recall).
ℹ Beyond the algorithm, you scale the same way as any database: namespaces/metadata filters shrink the search space, shards spread vectors across nodes for write/storage scale, and replicas add query throughput.
ℹ To scale the agent itself, the workers must be stateless: their progress lives in a checkpoint store, not local memory. Then if a worker dies mid-review, another resumes from the last checkpoint instead of losing the work.
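The graph-search idea can be illustrated with a deliberately simplified sketch. This is not HNSW itself: it uses a single-layer nearest-neighbour graph built by brute force, whereas HNSW builds a hierarchy of layers incrementally. The names `build_knn_graph` and `graph_search`, the parameters `m` and `ef`, and the 2-D toy data are assumptions for illustration only.

```python
import heapq
import math
import random


def build_knn_graph(points, m=8):
    """Link every point to its m nearest neighbours (brute force, for the demo only)."""
    graph = []
    for i, p in enumerate(points):
        scored = ((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph.append([j for _, j in heapq.nsmallest(m, scored)])
    return graph


def graph_search(query, points, graph, entry=0, k=3, ef=16):
    """Best-first search over the graph; returns (top-k, number of distances computed)."""
    d0 = math.dist(query, points[entry])
    visited = {entry}
    frontier = [(d0, entry)]  # min-heap of nodes still to expand
    best = [(-d0, entry)]     # max-heap (negated distances) of the ef closest nodes seen
    while frontier:
        dist, node = heapq.heappop(frontier)
        if len(best) >= ef and dist > -best[0][0]:
            break  # nothing left in the frontier can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = math.dist(query, points[nb])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(frontier, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst candidate
    top = sorted((-neg, node) for neg, node in best)[:k]
    return top, len(visited)


random.seed(1)  # fixed seed so the toy data is reproducible
points = [[random.random(), random.random()] for _ in range(500)]
graph = build_knn_graph(points)
query = [0.5, 0.5]

approx, evaluated = graph_search(query, points, graph)
exact = heapq.nsmallest(3, ((math.dist(query, p), i) for i, p in enumerate(points)))
recall = len({i for _, i in approx} & {i for _, i in exact}) / 3
print(f"evaluated {evaluated} of {len(points)} vectors, recall {recall:.0%}")
```

The search only computes distances for nodes it reaches through graph links, which is where the latency saving comes from. It can also stop in a region that does not contain the true nearest neighbours, which is the recall cost described above. Raising `ef` trades more distance computations for better recall.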