Production Systems · 8 min read

Why Our Vector Search Was Slow: A Postmortem

What we missed, what broke, and what actually fixed it

Sam Odongo · Apr 2025

The first thing I checked was the embedding model. p95 query latency had climbed to 4.3 seconds and I was staring at a dashboard that three days earlier had shown sub-600ms. Nothing in the deployment had changed, or so I thought. In fact the corpus had grown, we had pushed a config update to increase top-k from 8 to 20, and somewhere in that the system had quietly become unusable. I assumed the embedding step was the problem because that is usually the expensive part. It was not the problem.

This is a postmortem of about two weeks of debugging a vector search system that was performing far worse in production than it had in any test we had run. The short version is that it was not one thing. It was about four things that compounded, each individually defensible, together creating a system we had effectively never tested.

When we built the retrieval layer, the mental model was: embed the query, search the index, get back the top-k most similar chunks, pass them to the model. We expected query latency in the 200 to 400 millisecond range, most of which we assumed would be embedding time. The vector search itself was supposed to be fast. That is kind of the whole selling point.

Production disagreed. On a corpus of around 280,000 chunks, p50 latency was 1.1 seconds and p95 was 4.3 seconds. Embedding was taking 55 milliseconds. The gap between 55ms and 4.3 seconds was entirely in the vector search and the post-processing around it. We had assumed the vector DB would handle scale gracefully by default because that is what the documentation implied and, frankly, what we wanted to believe.

The inconsistency was the first signal that something was structurally wrong. Some queries came back in 300ms. Others, with no obvious difference in complexity or length, took 5 seconds. That variance is almost always a sign of something in the index traversal or the candidate selection, not network jitter or embedding variance. I did not recognize it for what it was immediately.

The first two hours of debugging were spent on things that were not the problem.

I checked the embedding service first. Profiling showed it was consistently fast, 50 to 60 milliseconds regardless of query length within our typical range. Not the issue. I checked the network path to the vector database: about 18 milliseconds round-trip. Also not the issue. I checked whether the database server was memory-constrained, found that it was sitting at about 60% memory utilization, added more, redeployed. Latency did not move.

At some point I looked at the slow queries to see if there was a pattern. There was, kind of. The slow queries were longer. That seemed intuitive: longer queries mean more tokens to embed, and more tokens take more time. Except embedding time was consistent, and the embedding vector is the same size no matter how long the query is. The length correlation was real but the mechanism I assumed was wrong.

What I should have done earlier was look at the HNSW configuration we had deployed with. I did not because it had not changed and the documentation suggested the defaults were reasonable. Both of those things were true and both were irrelevant.

HNSW, the index structure most vector databases use under the hood, has a few parameters that matter a lot for the latency vs recall tradeoff. Three are the most consequential.

M controls how many bidirectional links each node maintains in the graph. Higher M means a denser graph: better recall because there are more paths to explore, but more memory and slower traversal because each step processes more neighbors. We had set M to 32. For most recommendation or search use cases, M between 12 and 16 is standard. We had doubled it chasing recall quality without benchmarking what it cost at query time.

ef_construction controls how many candidates are considered during index building when adding each node. Higher values produce a better quality index. We had set this to 200, which was fine. The index was well-constructed. This was not the problem.

ef_search is the parameter that determines how many candidates are explored during a query. This is the main latency knob at query time. Higher ef_search means better recall but slower queries. We had set it to 100. On a corpus of 30,000 chunks during testing, exploring 100 candidates was fast. On 280,000 chunks, the same ef_search value caused the traversal to work through a much larger portion of the graph to collect those 100 candidates, because the graph structure at that scale creates longer effective traversal paths from the query vector to the nearest neighbors.
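A minimal sketch of how the three parameters divide up, assuming the usual HNSW arrangement where M and ef_construction are baked in when the index is built and ef_search can be changed per query. The `HnswParams` class and both helper functions are hypothetical names for illustration, not part of any vector database's API; the values are the ones described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HnswParams:
    m: int                # links per node; fixed when the index is built
    ef_construction: int  # candidate list size during build; fixed at build time
    ef_search: int        # candidate list size per query; adjustable at query time


def requires_rebuild(old: HnswParams, new: HnswParams) -> bool:
    """True if the change touches a build-time parameter (M or ef_construction)."""
    return (old.m, old.ef_construction) != (new.m, new.ef_construction)


def check_query_settings(params: HnswParams, top_k: int) -> None:
    """A search that tracks fewer candidates than top_k cannot fill top_k results."""
    if params.ef_search < top_k:
        raise ValueError(
            f"ef_search={params.ef_search} is smaller than top_k={top_k}"
        )


# The configuration described in this post.
deployed = HnswParams(m=32, ef_construction=200, ef_search=100)

# Hypothetical variants, only to show which kind of change forces a rebuild.
query_time_only = HnswParams(m=32, ef_construction=200, ef_search=50)
sparser_graph = HnswParams(m=16, ef_construction=200, ef_search=100)

print(requires_rebuild(deployed, query_time_only))  # ef_search only
print(requires_rebuild(deployed, sparser_graph))    # M changed
check_query_settings(deployed, top_k=20)            # passes: 100 >= 20
```

The practical point of the split is cost: an ef_search change can be tried against live traffic in minutes, while an M change means re-indexing the whole corpus.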

The mismatch was scale. The same absolute parameters that performed well on a 30k corpus became expensive on a 280k corpus. We had not re-benchmarked after the index grew. The configuration looked the same so we assumed the behavior would be the same.
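Re-benchmarking after growth does not need much machinery. The sketch below is one way to do it, under stated assumptions: `search_fn` is a hypothetical stand-in for whatever call queries your index, ground truth comes from a brute-force cosine scan, and percentiles use the nearest-rank method. It is not the harness we ran, just the shape of one.

```python
import math
import random
import statistics
import time
from typing import Callable, Sequence

Vector = Sequence[float]
SearchFn = Callable[[Vector, int], list[int]]  # hypothetical index query signature


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: Vector, corpus: Sequence[Vector], k: int) -> list[int]:
    """Brute-force ground truth: ids of the k most similar corpus vectors."""
    ranked = sorted(
        range(len(corpus)), key=lambda i: cosine(query, corpus[i]), reverse=True
    )
    return ranked[:k]


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def benchmark(
    search_fn: SearchFn, queries: Sequence[Vector], corpus: Sequence[Vector], k: int
) -> dict[str, float]:
    """Time only the search call; score recall against the exact answer."""
    latencies, recalls = [], []
    for query in queries:
        start = time.perf_counter()
        found = search_fn(query, k)
        latencies.append(time.perf_counter() - start)
        truth = set(exact_top_k(query, corpus, k))
        recalls.append(len(truth & set(found)) / len(truth))
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "recall": statistics.mean(recalls),
    }


# Demo on random data, with the exact search standing in for a real index.
rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
report = benchmark(lambda q, k: exact_top_k(q, corpus, k), queries, corpus, k=8)
print(report)
```

Running the same function once per candidate ef_search value, at the real corpus size, gives the latency and recall curve that a small test corpus cannot show.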

Dropping M from 32 to 16 and ef_search from 100 to 50 and rebuilding the index brought p95 latency from 4.3 seconds to 1.1 seconds. Better. Still not where it needed to be. And the recall drop from reducing ef_search was measurable in the output quality, which told us the chunking was also part of the problem.

We had chunked the corpus at 1,000 tokens with 200-token overlap. This was not an arbitrary choice: we had read that larger chunks preserve more context, which is true, and we were worried about splitting relevant content across chunk boundaries. We had tested retrieval quality using a small set of hand-crafted queries and the chunks looked fine.

The problem was what 1,000-token chunks do to embedding specificity. A chunk that contains two or three distinct topics has an embedding that is a kind of average over those topics. It ends up being moderately similar to queries about any of those topics rather than highly similar to queries about the specific one it contains. This drags down the precision of the top results.
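A toy illustration of the averaging effect. The assumption, and it is a simplification, is that a multi-topic chunk embeds as the normalized mean of its topic vectors; real embedding models are not literally averaging, and the three-dimensional vectors here are made up for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def normalized_mean(vectors):
    """Average the vectors component-wise, then scale to unit length."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


# Three unrelated topics, modelled as orthogonal unit vectors (toy assumption).
topic_a = [1.0, 0.0, 0.0]
topic_b = [0.0, 1.0, 0.0]
topic_c = [0.0, 0.0, 1.0]

focused_chunk = topic_a                                     # one topic per chunk
mixed_chunk = normalized_mean([topic_a, topic_b, topic_c])  # three topics in one chunk
query_about_a = topic_a

focused_score = cosine(query_about_a, focused_chunk)  # 1.0
mixed_score = cosine(query_about_a, mixed_chunk)      # 1/sqrt(3), about 0.577

print(f"focused chunk: {focused_score:.3f}, mixed chunk: {mixed_score:.3f}")
```

In this toy setup the mixed chunk scores the same 0.577 against a query about any of its three topics, so it is never the clear winner for one of them.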

Low precision means you need higher top-k to find the actually relevant chunks. We had noticed retrieval quality was mediocre and responded by increasing top-k from 8 to 20, which was when latency spiked. The top-k increase was treating a symptom while making the underlying cause worse: more candidates required higher ef_search to find reliably, which pushed latency up further.

When we rechunked at 400 tokens with sentence boundary detection, a few things happened. Each chunk became more semantically specific. Embedding precision improved: the right chunk started scoring much higher than the competing chunks for a given query. We could reduce top-k from 20 back to 8 and retrieval quality was better than it had been at top-k 20 with the large chunks. Smaller top-k meant the search needed to find fewer candidates, which meant we could work with a lower ef_search and still get good recall. The whole latency budget opened up.
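For reference, sentence-aware chunking can be sketched in a few lines. This is a simplified version under loud assumptions: sentences are split with a naive regex on terminal punctuation, and whitespace-separated words stand in for model tokens. A real pipeline would use the embedding model's tokenizer and a proper sentence splitter.

```python
import re

# Naive boundary: whitespace that follows ., ! or ? (assumption; misses abbreviations).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is roughly one token.
    return len(text.split())


def chunk_by_sentences(text: str, max_tokens: int = 400) -> list[str]:
    """Pack whole sentences into chunks without exceeding max_tokens.

    A single sentence longer than max_tokens becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
print(chunk_by_sentences(sample, max_tokens=6))
```

With a budget of 6 the sample splits into two chunks, each ending on a sentence boundary, which is the property that matters for keeping a chunk about one thing.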

The query pattern problem was the one that took longest to recognize because it required looking at real user behavior rather than system metrics.

Our test queries were clean. Things like "what is the repayment schedule for a six-month loan" or "how do I update my account details." Short, grammatically complete, specific. Production queries were different. Users typed things like "if i paid already but they say i haven't whats next," or "loan repayment i made last Tuesday still showing pending my account," or just "charges." Longer, noisier, often mid-thought rather than complete sentences.

There were two effects. The immediate one was latency: embedding longer strings takes slightly more time, and some of the query preprocessing we had written for test queries was not handling the noisier real-world format cleanly, occasionally producing malformed inputs that triggered retries. The more significant effect was retrieval quality.

Long, ambiguous queries embed into a different region of the vector space than short, specific queries about the same topic. The HNSW graph had been built from chunks with specific, topic-focused embeddings. The noisy production queries were often landing far enough from those embeddings that the nearest neighbors were adjacent-topic chunks rather than directly relevant ones. The system had been benchmarked on queries that mapped neatly to the chunk embeddings. Real users were not generating those queries.

The fix here was partly technical and partly procedural. We added a query cleaning step that normalized common patterns in user input before embedding. We also sampled 100 real queries from the first week of production traffic and used those as the evaluation set going forward. The benchmark we had been using was not measuring the system we had actually deployed.
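A sketch of what those two pieces might look like. The specific rules are assumptions made up for this example (the `REPLACEMENTS` table in particular is hypothetical), not the normalization we shipped; the right rules come from reading your own query logs.

```python
import random
import re

# Hypothetical shorthand expansions; build the real table from observed traffic.
REPLACEMENTS = {"whats": "what is", "pls": "please", "acct": "account"}


def normalize_query(raw: str) -> str:
    """Lowercase, drop punctuation except apostrophes, expand known shorthand."""
    text = raw.strip().lower()
    text = re.sub(r"[^\w\s']", " ", text)
    words = [REPLACEMENTS.get(word, word) for word in text.split()]
    return " ".join(words)


def sample_eval_queries(logged: list[str], n: int = 100, seed: int = 0) -> list[str]:
    """Draw up to n distinct, non-empty raw queries for the evaluation set."""
    unique = sorted({q for q in logged if q.strip()})
    return random.Random(seed).sample(unique, min(n, len(unique)))


print(normalize_query("  If i paid already but they say i haven't whats next,  "))
print(sample_eval_queries(["charges", "charges", "", "loan repayment pending"]))
```

The evaluation set keeps the raw strings on purpose: the benchmark should exercise the cleaning step, not skip past it.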

Pulling the timeline together after the fact: four decisions, each looking reasonable at the time, plus a test set that hid all of them, stacked into a system that was slow in production and had never been tested under production conditions.

M=32 was chosen for recall quality without a latency cost benchmark. The ef_search of 100 was copied from documentation examples without validating it against our corpus size. The 1,000-token chunks were chosen based on the intuition that more context is better, without measuring the effect on embedding precision or the downstream top-k requirement. The top-k increase to 20 was a response to poor retrieval quality that made the latency problem acute. And the test queries were clean in a way that real user queries are not, so none of this surfaced before deployment.

After the fixes: M=16 (index rebuilt), ef_search=50, 400-token chunks with sentence boundaries, top-k=8, query normalization, evaluation set built from real traffic. p50 latency went from 1.1 seconds to 180 milliseconds. p95 went from 4.3 seconds to 420 milliseconds. Retrieval quality, measured against the real query set, improved compared to the original high-top-k configuration.

A few things I would do differently on the next system.

Test with the target corpus size before deployment, not a representative sample. The behavior of HNSW is not linear with corpus size. A corpus that is ten times larger is not ten times slower: it can be substantially worse if the graph traversal paths grow in a way your parameters do not account for. The only way to know is to test at scale.

Build a latency breakdown from the start. Not just end-to-end latency but time spent in embedding, time spent in index traversal, time spent in post-processing. When everything is lumped into a single latency metric, wrong assumptions about which component is slow go unchallenged for a long time. In this case, the embedding assumption cost about two hours of investigation that roughly 10 lines of timing instrumentation would have resolved in minutes.
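The instrumentation really is about that small. A minimal sketch, assuming a pipeline with three stages; `embed`, `search`, and `postprocess` are hypothetical callables standing in for your own functions, and in a real service the timings would go to your metrics system rather than a module-level dict.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# stage name -> list of durations in seconds, one entry per call
stage_seconds: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(stage: str):
    """Record wall-clock time for the enclosed block, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage].append(time.perf_counter() - start)


def handle_query(query, embed, search, postprocess, top_k=8):
    with timed("embed"):
        vector = embed(query)
    with timed("search"):
        hits = search(vector, top_k)
    with timed("postprocess"):
        return postprocess(hits)


# Usage with trivial stand-in stages.
result = handle_query(
    "charges",
    embed=lambda q: [float(len(q))],
    search=lambda vec, k: list(range(k)),
    postprocess=lambda hits: hits[:3],
)
for stage, durations in stage_seconds.items():
    print(f"{stage}: {durations[-1] * 1000:.3f} ms")
```

With per-stage numbers on the dashboard, "embedding is 55ms and search is the rest" is visible on day one instead of hour two of an incident.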

Collect real user queries before finalizing the evaluation set. It sounds obvious. In practice, you are building the evaluation harness before users exist, so you use the queries you can generate. What you can generate is always cleaner than what users will type. The evaluation set should be updated with real traffic as soon as you have it, and any benchmark numbers from synthetic queries should be treated as optimistic until real traffic confirms them.

The HNSW parameter defaults in most vector databases are documented as reasonable starting points for general use. They are not pre-tuned for your corpus size, your query distribution, or your latency budget. Treat them as a starting point that requires benchmarking, not a configuration you can deploy and move past.

The broader lesson is the one that shows up in most retrieval system postmortems: vector search performance is a function of the index configuration, the chunking strategy, and the query distribution, all interacting. Getting one right while the others are misconfigured still produces a bad system. We had a well-built embedding pipeline, a carefully chosen embedding model, and a reasonable vector database. None of that mattered because the index was too dense for the corpus, the chunks were too large for precision, and we had optimized for queries that users were not sending.

None of the individual decisions were obviously wrong. M=32 is defensible. Large chunks are defensible. An ef_search of 100 is defensible. What is not defensible is deploying a combination of those choices to a production corpus size without benchmarking any of them at that scale, against real queries, with latency instrumentation in place. That is the mistake. The rest is just what it looked like.

