
Building a RAG System That Survives Production

Sam Odongo, 2025

The first version of our RAG system looked fine in testing. Precision was decent on the queries we had written ourselves. The demo ran cleanly and the LLM synthesized coherent answers. Then real users arrived with real questions, and most of those questions were too short, poorly specified, or phrased in ways that matched nothing in the index. Retrieval recall dropped. The LLM started hallucinating to fill the gaps. The system that looked capable in a controlled environment turned out to be retrieval theater.

That gap, between demo quality and production quality, is entirely an engineering problem. The model is rarely the issue. The retrieval layer is.

Why chunking is the first thing that breaks

Naive chunking means splitting documents into fixed-size token windows, maybe with some overlap. It is what most tutorials show because it is trivial to implement. It works acceptably on clean, well-structured documents where a single concept fits neatly into a 512-token window. Real documents are not like that.

Legal contracts have dense clause dependencies that span pages. Technical documentation has code blocks that should not be split mid-function. Research papers have abstracts that summarize concepts explained in detail five sections later. Internal knowledge bases have content where the answer to a question appears two paragraphs above the section heading that names the topic. Fixed-size chunking treats all of this identically.

The failure mode is insidious because it looks like a retrieval quality problem rather than a chunking problem. A query about contract termination retrieves a chunk containing the word "termination" but the chunk is actually about employment termination in a benefits section three pages away from the clause the user needs. Cosine similarity between the query embedding and that chunk is high enough to place it in the top-k. The chunk gets retrieved. The LLM uses it. Nobody notices until a user acts on the wrong information.

What helps is chunking with awareness of document structure. For well-structured documents, section-level chunking with paragraph-level fallback. For code, always respect function and class boundaries. For conversational or threaded data, keep context together rather than splitting by line count.
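
A minimal sketch of section-level chunking with paragraph-level fallback, under stated assumptions: headings are lines that start with "#", size is counted in whitespace-separated words as a rough stand-in for tokens, and the names chunk_by_structure and Chunk are hypothetical. A real pipeline would use its own heading detection and tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str  # heading the text belongs to ("" for any preamble)
    text: str


def chunk_by_structure(doc: str, max_words: int = 200) -> list[Chunk]:
    """Split on headings first; fall back to paragraphs for oversized sections."""
    # Assumption: headings look like "# Title". Adapt the pattern to your format.
    parts = re.split(r"(?m)^(#+ .*)$", doc)
    # parts is [preamble, heading1, body1, heading2, body2, ...]
    sections = [("", parts[0])] + list(zip(parts[1::2], parts[2::2]))

    chunks: list[Chunk] = []
    for heading, body in sections:
        title = heading.lstrip("# ").strip()
        body = body.strip()
        if not body:
            continue
        if len(body.split()) <= max_words:
            chunks.append(Chunk(title, body))  # whole section fits in one chunk
            continue
        # Paragraph-level fallback: pack whole paragraphs up to the word budget.
        # A single paragraph larger than the budget is kept intact, not split.
        current: list[str] = []
        count = 0
        for para in re.split(r"\n\s*\n", body):
            n = len(para.split())
            if current and count + n > max_words:
                chunks.append(Chunk(title, "\n\n".join(current)))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append(Chunk(title, "\n\n".join(current)))
    return chunks
```

Every chunk keeps its section title, so a fallback chunk from the middle of a long section still carries the context that names its topic.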

Chunk size is a parameter you tune against your query distribution, not a default you set once. Short, specific queries benefit from smaller chunks with higher information density. Multi-hop questions sometimes need larger context windows to avoid fragmenting the reasoning chain. You learn the right range by analyzing retrieval failures, not by following a convention.

One change that improved our retrieval more than anything in the chunking layer was attaching document-level metadata to each chunk: section title, document type, source date. Filtering on metadata before vector search is dramatically cheaper than relying on embedding similarity alone to select the right corpus segment. It also catches cases where semantic similarity is high but the document is the wrong type entirely.
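
Filtering before similarity ranking can be sketched in a few lines. This is an in-memory illustration, not how any particular vector database implements it; IndexedChunk, search, and the doc_type field are hypothetical names, and the embeddings are plain lists of floats.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    text: str
    embedding: list[float]
    metadata: dict[str, str]  # e.g. section title, document type, source date


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: list[IndexedChunk], query_embedding: list[float],
           filters: dict[str, str], k: int = 5) -> list[IndexedChunk]:
    """Apply exact-match metadata filters first, then rank survivors by cosine."""
    candidates = [
        c for c in index
        if all(c.metadata.get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_embedding, c.embedding),
                    reverse=True)
    return ranked[:k]
```

With an empty filter the function falls back to plain similarity ranking, which makes it easy to compare filtered and unfiltered results on the same queries.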

Overlap is often recommended as the fix for boundary problems. It helps at the margins but creates its own issues: duplicate or near-duplicate chunks inflate index size and can bias retrieval toward content that appears near chunk boundaries. Use overlap deliberately and measure whether it actually changes retrieval outcomes on your data before treating it as a universal solution.

Semantic search is not enough on its own

Embedding-based retrieval is appealing because it handles synonyms, paraphrases, and conceptual similarity without explicit rules. In theory you can ask "how do I swap the rows and columns of a matrix" and retrieve a document about transposition even if that word never appeared in the query. The model handles the semantic bridge.

In practice, dense retrieval has a well-documented weakness: it fails on specific, rare, or domain-specific terms. An embedding model trained on general text will not reliably separate "BM25" from "TF-IDF" in a technical corpus about information retrieval. Clinical drug names, internal product codenames, version strings, uncommon acronyms: all of these defeat embedding similarity in ways that keyword search handles trivially.

The argument for keyword search is usually framed as old technology versus new. That framing is wrong. BM25 and its variants encode something semantically meaningful: how often a term appears in a document, weighted by how rare that term is across the corpus. A term that appears rarely in the corpus but appears in both the query and a specific document is a strong relevance signal. Dense retrieval does not capture this directly.
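
To make the rare-term signal concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It uses the common defaults k1 = 1.5 and b = 0.75 and the non-negative IDF variant; the function name bm25_scores is hypothetical, and a production system would use a search engine's implementation rather than this loop.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query. Assumes a non-empty corpus."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a large IDF, ubiquitous terms a small one.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query term that occurs in one document out of thousands contributes far more to the score than a term that occurs everywhere, which is why version strings and codenames are easy for BM25.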

The failure mode of pure semantic search is confident but wrong retrieval. The model returns chunks that are topically adjacent to the query but miss the specific entity or term the user actually asked about. For RAG, this is often worse than returning nothing, because the LLM uses the wrong context to construct a plausible-sounding but incorrect answer.

Keyword retrieval has its own failure mode: exact matches that are not semantically relevant. Surface-level term overlap without the grounding to know whether the match actually matters. Neither approach is sufficient alone. Running into production with only one of them is a choice you will revisit.

Hybrid retrieval in practice

Hybrid retrieval runs both dense and sparse retrievers and merges the results before passing context to the LLM. The standard merge approach is Reciprocal Rank Fusion: take the ranked results from each retriever and compute a combined score based on rank position rather than raw score. The reason for rank-based fusion rather than score-based fusion is that dense and sparse scores are not comparable. BM25 scores are term-frequency-based and unbounded, and they scale with the number of matching query terms. Cosine similarity is bounded between -1 and 1. Normalizing them requires assumptions that tend not to hold across query types.
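
Reciprocal Rank Fusion is short enough to sketch directly. Each document gets 1 / (k + rank) from every ranking it appears in, and the contributions are summed; k = 60 is the commonly used constant. The optional per-retriever weights are an extension added here for illustration, not part of the basic formulation, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(rankings: list[list[str]],
                           weights: Optional[list[float]] = None,
                           k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs using rank positions only."""
    if weights is None:
        weights = [1.0] * len(rankings)  # unweighted RRF
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]    # hypothetical dense retriever output, best first
sparse = ["b", "d", "a"]   # hypothetical BM25 output, best first
fused = reciprocal_rank_fusion([dense, sparse])
```

Raw scores never enter the computation, so the incomparable scales of BM25 and cosine similarity stop mattering.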

The tuning question is the weighting between retrievers. How much do you trust semantic similarity versus keyword overlap for your query distribution? This is not a question you can answer from first principles. You need a labeled evaluation set that reflects real user queries, and you adjust the weighting to maximize recall at your target k. A balanced default with the ability to shift based on query characteristics is more robust than committing to a fixed ratio before you have data.
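
The evaluation loop behind that tuning can be sketched as follows. It assumes a labeled set of (query, relevant document IDs) pairs and a retrieve callable that returns ranked IDs; both names are hypothetical stand-ins for your own data and retriever.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(eval_set: list[tuple[str, set[str]]],
                     retrieve: Callable[[str], list[str]],
                     k: int) -> float:
    """Average recall@k over a labeled set of (query, relevant IDs) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k)
              for query, relevant in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```

To pick a weighting, run mean_recall_at_k once per candidate setting of the retriever and keep the one that scores highest at your target k.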

There is a cost worth naming. Hybrid retrieval roughly doubles retrieval work compared to single-retriever approaches because you are running two retrieval pipelines before re-ranking. Run sequentially, that roughly doubles index query latency; run in parallel, latency is set by the slower of the two plus the fusion step. Two indexes to maintain. Two systems to monitor. Two failure modes to handle. Whether the accuracy improvement justifies the complexity depends on how badly pure semantic or pure keyword retrieval was failing on your data.

In most general-domain deployments over heterogeneous document types, hybrid retrieval wins. On narrow, homogeneous corpora where query patterns are well understood, a well-tuned single retriever is often sufficient and cheaper to operate. Know which situation you are in before adding complexity.

Latency budgets and the re-ranking tradeoff

A RAG request has three significant latency contributors: embedding the query, retrieving context, and generating the response. Generation dominates. A typical frontier model generates at somewhere between 20 and 50 tokens per second for a streamed response, which means several seconds before a user sees a complete answer regardless of how fast retrieval is.

Given that generation takes several seconds, the pressure on retrieval latency is more about consistency than absolute numbers. A p50 retrieval latency of 80ms is acceptable. A p99 of 4 seconds means one request in a hundred waits at least 4 seconds before generation even starts. Those tail latencies are usually driven by large index scans, cache misses, or downstream operations that perform differently under load than they do in testing.
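
For readers who have not computed tail latencies by hand, here is a minimal nearest-rank percentile sketch. The function name and the sample data are hypothetical; monitoring systems usually compute this from histograms rather than raw samples.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical retrieval latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [80.0] * 98 + [4000.0] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

In this toy sample the median is 80ms while the p99 is 4000ms, which is exactly the pattern an average hides.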

Where this gets complicated is in multi-step pipelines. Query decomposition, reformulation, parallel sub-queries to support complex questions: the retrieval cost multiplies. A single hybrid retrieval at 80ms is fine. Four parallel retrievals with re-ranking can push past 500ms before the LLM starts. That is a meaningful difference in perceived responsiveness and needs to be planned for, not discovered after deployment.

Cross-encoder re-ranking deserves its own mention. Cross-encoders take the query and a candidate chunk together as input and produce a more accurate relevance score than the bi-encoder similarity used for initial retrieval. The tradeoff is that cross-encoder scores cannot be precomputed and indexed, because each score depends on the query. The model has to run at query time on each of the top-k candidates. For k of 20 and 50ms latency per pair, scored sequentially, you are adding about a second to retrieval latency. Whether that accuracy improvement is worth the cost depends on how much wrong context is getting through your bi-encoder.
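
The shape of the re-ranking step, and its latency arithmetic, can be sketched without committing to a model. In this sketch score_pair stands in for a real cross-encoder call; overlap_score is a toy word-overlap scorer used only so the example runs, and all names are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int) -> list[str]:
    """Score every (query, candidate) pair at query time and keep the best top_n."""
    scored = [(score_pair(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]


def sequential_rerank_latency_ms(k: int, per_pair_ms: float) -> float:
    """Added latency when pairs are scored one after another (no batching)."""
    return k * per_pair_ms


def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words found in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


candidates = [
    "employment termination benefits",
    "contract termination requires written notice",
    "holiday schedule",
]
best = rerank("contract termination notice", candidates, overlap_score, top_n=2)
added_ms = sequential_rerank_latency_ms(20, 50)
```

Batching pairs on an accelerator usually brings the added latency well below the sequential figure, so measure it on your own hardware before ruling re-ranking in or out.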

We ran cross-encoder re-ranking for three weeks before disabling it. Offline evaluation showed clear accuracy improvements on our test set. Online metrics showed no statistically significant change in user behavior at our query volume. The latency cost was real. The benefit was not measurable. We kept the infrastructure staged and moved on.

The decision about where to embed queries, server-side with a network round-trip or client-side requiring you to ship the model, depends on your latency budget and infrastructure constraints. If your users are on unreliable connections, adding a network round-trip for embedding before retrieval can meaningfully hurt p95 latency in ways that benchmarks on a stable connection never reveal.

Embedding model choice matters more than vector database choice

A significant amount of engineering attention goes to the wrong place here. Teams spend weeks evaluating Pinecone versus Weaviate versus Qdrant versus pgvector. The vector database is an index. Its job is to store vectors and return approximate nearest neighbors quickly. Beyond correctness and latency at your scale, the differences between mature vector databases are marginal for most deployments.

The embedding model is a different story. It determines the geometry of your vector space, which directly determines what "similar" means for your queries and documents. Two chunks can be textually different but semantically equivalent according to your embedding model. Two chunks can share vocabulary and embed far apart. The model controls both, and you cannot fix it by switching databases.

General-purpose embedding models are trained to capture broad semantic similarity across general English text. They work well when your corpus is general prose with no specialized vocabulary. As soon as your domain diverges, whether that is medical records, legal documents, internal technical documentation, or code, general embeddings lose fidelity. Terms that are semantically distinct in your domain end up close together in the embedding space because the model has no training signal to separate them.

Domain-specific fine-tuning is the answer, but it requires labeled query-document pairs for contrastive training. If you do not have those, you are choosing between a general-purpose model with known limitations and investing in data collection to build something better. That is a real engineering and business tradeoff, not something you can default your way out of.

We switched embedding models once, mid-project, after retrieval failures on technical terminology became systematic. Re-indexing four million chunks took 18 hours and cost more than our monthly inference bill. The improvement in retrieval quality was significant and worth it. But it would have been far less disruptive if we had evaluated embedding quality on domain-specific queries before building the first index.

The switching cost is the part most teams underestimate. If you build on embeddings from one model and later move to a better one, every existing vector becomes incompatible. You re-index everything. For a large corpus in an environment with cost constraints, that is not a trivial operation. Treat the embedding model as a long-term architectural decision, not a configuration parameter.
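
One cheap safeguard is to record which model built an index and refuse to query it with anything else. This is a sketch of that idea, not a feature of any particular vector database; EmbeddingSpec, IncompatibleIndexError, and the model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_name: str  # hypothetical identifier, e.g. "model-a"
    dimension: int


class IncompatibleIndexError(Exception):
    """Raised when query vectors come from a different model than the index."""


def check_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    # Vectors from different models live in different spaces, even when the
    # dimensions happen to match, so comparing dimensions alone is not enough.
    if index_spec != query_spec:
        raise IncompatibleIndexError(
            f"index built with {index_spec}, query embedded with {query_spec}; "
            "re-index before querying"
        )
```

Storing the spec next to the index turns a silent quality collapse into a loud failure at startup.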

The operational pieces that do not make the tutorials

Index staleness is a real production problem that most RAG write-ups skip entirely. Documents get updated, deprecated, or deleted. If your vector index does not reflect those changes, users retrieve outdated context with no indication that it is outdated. The LLM presents it confidently. The longer you ignore index freshness, the more your production system diverges from the actual corpus.

Full re-index on a schedule is simple and correct but expensive for large corpora. Incremental updates keyed by document ID are cheaper but require a reliable change detection mechanism and careful handling of deletions, which most vector databases support less gracefully than insertions. The right approach depends on corpus size, update frequency, and cost ceiling. There is no universal answer.
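
A minimal sketch of change detection for incremental updates, assuming you keep a content hash per document ID alongside the index. The function name plan_index_update is hypothetical, and the actual upsert and delete calls depend on your vector database.

```python
import hashlib


def plan_index_update(indexed_hashes: dict[str, str],
                      current_docs: dict[str, str]
                      ) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare stored content hashes with the current corpus.

    Returns (IDs to re-embed and upsert, IDs to delete, new hash table).
    """
    current_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    # New or modified documents: hash is missing or differs from what was indexed.
    to_upsert = sorted(doc_id for doc_id, digest in current_hashes.items()
                       if indexed_hashes.get(doc_id) != digest)
    # Deleted documents: indexed earlier, no longer in the corpus.
    to_delete = sorted(doc_id for doc_id in indexed_hashes
                       if doc_id not in current_hashes)
    return to_upsert, to_delete, current_hashes
```

When a document changes, remember that all of its old chunks need to be removed before the new ones are inserted, because re-chunking can change how many chunks it produces.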

Query logging is not optional at any serious deployment scale. You cannot evaluate or improve a retrieval system without knowing what queries it is actually receiving. Log the raw query, the retrieved chunks with their similarity scores, and the generated response. Without this data, tuning chunking or retrieval parameters is guesswork. With it, you can identify systematic failure modes: queries that consistently retrieve irrelevant context, high-frequency queries the index handles poorly, chunks that surface in the top-k for nearly every query regardless of relevance.
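
As one example of what the logs make possible, here is a sketch that finds chunks surfacing in the top-k for nearly every query. The QueryLogEntry record and the 0.8 threshold are assumptions for illustration; use whatever schema and cutoff fit your logging pipeline.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QueryLogEntry:
    query: str                           # raw user query
    retrieved: list[tuple[str, float]]   # (chunk ID, similarity score) pairs
    response: str                        # generated answer


def ubiquitous_chunks(log: list[QueryLogEntry], threshold: float = 0.8) -> list[str]:
    """Chunk IDs retrieved for at least `threshold` of all logged queries."""
    counts: Counter = Counter()
    for entry in log:
        # Count each chunk once per query, even if it was retrieved twice.
        counts.update({chunk_id for chunk_id, _ in entry.retrieved})
    total = len(log)
    return sorted(chunk_id for chunk_id, n in counts.items()
                  if total and n / total >= threshold)
```

Chunks flagged this way are often boilerplate such as navigation text or legal footers, and removing them from the index frees top-k slots for relevant content.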

Prompts are part of the system. How you instruct the LLM to use retrieved context, what to do when context does not contain a relevant answer, how to handle conflicting information across chunks: all of this materially affects output quality and is not something you can set once and leave. Prompt behavior drifts as the corpus changes. Treating the prompt as a static artifact is a reliable way to accumulate quiet quality regressions.
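
A sketch of a prompt builder that covers those three cases explicitly. The wording of the instructions, the PROMPT_VERSION tag, and the function name are all hypothetical; the point is that the instructions live in versioned code you can test and change, not in a string nobody revisits.

```python
PROMPT_VERSION = "v1"  # hypothetical tag; log it with each query to trace regressions


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded-answer prompt from the retrieved chunks."""
    # Number the passages so the model can cite them and flag conflicts.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered context passages below.\n"
        "If the context does not contain the answer, say that you do not know.\n"
        "If passages conflict, point out the conflict and cite the passage numbers.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Logging PROMPT_VERSION next to each response lets you tie a quality change to a specific prompt revision when you review the query logs.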

None of this is exciting work. Index refresh pipelines, query log analysis, chunk boundary audits, prompt iteration. The system that holds up in production is built on these, not on the model choice or the database benchmark. The retrieval architecture you can observe, measure, and incrementally improve will outperform the theoretically superior one you cannot see inside.