AI / ML · 15 min read

When RL Meets RAG: Systems That Know What They Don't Know

Building AI systems that adapt, defer, and avoid confident nonsense

Sam Odongo · Apr 2025

The M-Pesa support assistant had been live for about three months when a customer tried to reverse a payment she had sent to the wrong number. She asked the bot. The bot told her, confidently, that she could initiate a reversal within 24 hours by dialing a specific USSD code and following a prompt sequence. This had once been accurate information: it described a flow that had been deprecated six months earlier. The current process required a different sequence and an explicit confirmation step the bot never mentioned. She tried the old flow, it did not work, she tried again, and by the time she reached a human agent she had passed the actual reversal window.

The bot was not guessing. It had retrieved a real document. The document was just outdated, and the retrieval system had no way to know that, and the model had no way to know it should flag uncertainty about temporal validity. It answered because answering is what it does.

That is the expensive version of a problem that shows up in cheaper ways constantly. The answer was plausible, the source was legitimate, the confidence was unjustified, and the user paid for it. When we pulled the query logs afterward, that same outdated flow had been retrieved and used in responses at least 40 times in the preceding two weeks. Nobody had escalated most of those interactions. The problem was invisible until a user with a tighter deadline and a more auditable outcome happened to hit it.

RAG was supposed to fix this. The premise is sound: ground the model's output in retrieved documents, constrain the answer to current and specific context, reduce the model's dependence on its training distribution. And it does reduce hallucination meaningfully. It does not eliminate it.

The failure modes that survive RAG are subtler. The retrieval returns something, but it is the wrong version of the right document. Or a document that addresses a related but distinct question. Or two documents that contradict each other, with no signal to the model that the contradiction matters. The model reads what you give it and produces something consistent with that input. If the input is subtly wrong, the output is plausibly wrong, which is often worse than obviously wrong because it does not trigger skepticism in the user.

The deeper issue is that a standard RAG system does not adapt. The retrieval strategy, the document corpus, the chunking, the prompt: all static. A system with a deprecated document in the index will retrieve that document forever until a human removes it. A retrieval strategy that consistently underserves a specific query type will keep underserving it, with no mechanism to notice the pattern. The system runs the same playbook on every query, whether or not that playbook has been working.

Retrieval-augmented generation reduces the rate at which a model generates from thin air. It does not teach the model when its retrieved context is insufficient, outdated, or contradictory. That requires something the base architecture does not provide: a feedback loop.

Where RL enters, practically

Not reinforcement learning in the full academic sense. Not policy gradients over long horizons, not elaborate reward modeling infrastructure that requires a dedicated ML team to maintain. The idea that matters in a deployed system is simpler: the system should be able to update its behavior based on evidence about whether its current behavior is working. That is a feedback loop, and feedback loops are what RL contributes here.

The signals are not hard to find. A user who rates a response negatively. A user who retries the same question with different phrasing immediately after receiving an answer, which is a strong implicit signal that the first answer was not useful. A user who abandons the session after a transactional query, which in a support context usually means they gave up rather than succeeded. A human agent whose escalation notes contain the correct answer to a question the bot handled badly two hours earlier.

These are all feedback. They carry information about which responses were reliable and which were not. The question is what you do with them.
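As a rough illustration of treating these signals as data, here is a minimal sketch that records each one as an event and rolls events up into a per-response penalty. The `Signal` names, the `FeedbackEvent` type, and every number in `ASSUMED_WEIGHT` are placeholders invented for this example, not values from the deployed system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Signal(Enum):
    THUMBS_DOWN = "thumbs_down"        # explicit negative rating
    RETRY = "retry"                    # same question rephrased right after an answer
    ABANDONED = "abandoned"            # session dropped after a transactional query
    ESCALATION_MISMATCH = "escalation" # agent notes contradict what the bot said


# Placeholder weights (assumptions): how strongly each signal counts against a response.
ASSUMED_WEIGHT = {
    Signal.THUMBS_DOWN: 0.5,
    Signal.RETRY: 0.6,
    Signal.ABANDONED: 0.3,
    Signal.ESCALATION_MISMATCH: 1.0,
}


@dataclass(frozen=True)
class FeedbackEvent:
    response_id: str
    signal: Signal


def penalty_by_response(events: Iterable[FeedbackEvent]) -> Dict[str, float]:
    """Sum the assumed penalty for each response, capped at 1.0."""
    totals: Dict[str, float] = {}
    for event in events:
        running = totals.get(event.response_id, 0.0) + ASSUMED_WEIGHT[event.signal]
        totals[event.response_id] = min(1.0, running)
    return totals
```

The cap keeps one badly received response from dominating the aggregate. Whether signals should add up at all, rather than be modeled separately, is a design choice this sketch does not settle.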

The gap between collecting feedback and doing something useful with it is where most practical implementations stall. An incorrect answer could have come from bad retrieval, a stale document, poor chunking, a misleading prompt, or a genuine reasoning failure in the model. Knowing the answer was wrong does not tell you which of these to fix. The useful frame is not "what was wrong with this answer" but "what decision in the pipeline was wrong, and how do we make better decisions next time."

RL applied to a RAG system is primarily about decision-making in the pipeline: which documents to retrieve, how many to retrieve, whether to ask a clarifying question before answering, whether to answer at all. Not about the content of responses directly. The system gets smarter about its own reliability, not about the domain it covers.
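One simple way to frame "decisions in the pipeline" as something learnable is a per-category bandit over a small action set. The sketch below uses epsilon-greedy selection, a technique swapped in here for illustration and not necessarily what the production system used. `Action`, `PipelinePolicy`, and the reward scale are all assumptions.

```python
import random
from collections import defaultdict
from enum import Enum


class Action(Enum):
    ANSWER = "answer"    # generate a response from the retrieved context
    CLARIFY = "clarify"  # ask a clarifying question first
    DEFER = "defer"      # decline and hand off to a human


class PipelinePolicy:
    """Epsilon-greedy choice among pipeline actions, tracked per query category."""

    def __init__(self, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.count = defaultdict(int)
        self.mean_reward = defaultdict(float)

    def choose(self, category):
        # Explore occasionally; otherwise pick the action with the best average so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(Action))
        return max(Action, key=lambda a: self.mean_reward[(category, a)])

    def update(self, category, action, reward):
        # Incremental running mean of observed rewards for this (category, action).
        key = (category, action)
        self.count[key] += 1
        self.mean_reward[key] += (reward - self.mean_reward[key]) / self.count[key]
```

Note what the policy never sees: the content of any answer. It only learns which kind of move tends to work for which kind of query.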

What it means to know you don't know

Most language models are not designed to express calibrated uncertainty. They are optimized to produce fluent, coherent completions. Uncertainty in training data often appears as soft hedging language: "it may depend on," "in some cases," "typically." The model learns to produce these phrases in situations where a confident answer would have been equally acceptable. It does not learn to produce them specifically when they are warranted. Asking a model to express uncertainty, and expecting it to be calibrated about when to do so, is asking for something its training did not provide.

What you can do is build uncertainty signals into the retrieval layer, where they are more tractable. This turned out to be where most of the real leverage was.

Retrieval confidence is the most direct signal. The top-k retrieved documents come with similarity scores. A best match at cosine similarity 0.76 means the system found something genuinely relevant. A best match at 0.41 means the system found the least-bad thing in the index, which is a materially different situation. Treating both the same way in the response generation step is a design choice, and not a good one.
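A minimal sketch of treating those two situations differently: gate generation on the best similarity score. The function name and the 0.5 floor are assumptions for illustration. A sensible floor depends on the embedding model and the corpus.

```python
def confidence_gate(similarities, floor=0.5):
    """Return 'answer' if the best match clears the floor, else 'clarify'.

    similarities: cosine similarity scores of the top-k retrieved documents.
    floor: assumed cutoff; tune per embedding model and query category.
    """
    if not similarities:
        return "clarify"  # nothing retrieved at all
    return "answer" if max(similarities) >= floor else "clarify"
```

With the numbers above, a best match of 0.76 goes on to generation, while a best match of 0.41 triggers a clarifying question instead.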

Retrieval agreement is the second signal. If the top five documents all support the same answer, that is different from a situation where they pull in different directions or where two explicitly contradict each other. A system that detects retrieval disagreement and surfaces it, rather than silently synthesizing a blend, produces fewer confident wrong answers. It also produces responses that are occasionally more useful, because the conflict itself is sometimes the answer to what the user was asking.
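How agreement is measured depends on what can be extracted from each document. The sketch below assumes an upstream step, not shown and purely hypothetical, that reduces each retrieved document to a short "stance" label for the question at hand. Agreement is then the share of documents backing the most common stance. The 0.8 threshold is a placeholder.

```python
from collections import Counter


def retrieval_agreement(stances):
    """Fraction of retrieved documents backing the most common stance."""
    if not stances:
        return 0.0
    top_count = Counter(stances).most_common(1)[0][1]
    return top_count / len(stances)


def should_surface_conflict(stances, min_agreement=0.8):
    """True when the documents disagree enough that blending them is unsafe."""
    return retrieval_agreement(stances) < min_agreement
```

When `should_surface_conflict` returns true, the response would present the competing sources side by side instead of synthesizing one answer.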

Document recency matters for any system where information expires. Financial policies, product specs, regulatory requirements, operational procedures: all of these change. The publication date of a retrieved document is a relevance signal. A policy document from 18 months ago is not as trustworthy as one from last month when the query is about current procedure. Incorporating document age into the retrieval scoring, or at minimum into the confidence calculation, would have caught the M-Pesa reversal case before it ever reached a user. That was a computable signal the system was ignoring entirely.
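One way to fold document age into retrieval scoring is a multiplicative recency weight. The sketch below leaves recent documents untouched and decays older ones exponentially. The grace period, the half-life, and the function names are all assumptions chosen for illustration, not the formula the production system used.

```python
def recency_weight(age_months, grace_months=12.0, half_life_months=9.0):
    """Weight in (0, 1]: 1.0 inside the grace period, halving every half-life after it."""
    if age_months <= grace_months:
        return 1.0
    return 0.5 ** ((age_months - grace_months) / half_life_months)


def effective_score(similarity, age_months):
    """Similarity discounted by document age (assumed multiplicative form)."""
    return similarity * recency_weight(age_months)
```

Under these placeholder parameters, an 18-month-old document with similarity 0.71 ends up with an effective score below 0.5, while documents under a year old keep their raw similarity.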

None of these are sophisticated ideas. They are retrievable, measurable signals that most deployed RAG systems ignore because they were not needed to make the system produce an answer. Producing an answer was the original design objective. Whether the answer should be produced was a second objective nobody had articulated.

How the RL layer actually operates

The system accumulates a history of responses with associated feedback. Each response carries attributes from the retrieval step: what documents were used, what the similarity scores were, whether the documents agreed or conflicted, how old the retrieved material was, what category the query fell into. The feedback, delayed and noisy as it is, tells you whether that response worked.
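A record along these lines is enough to support the analysis that follows. The field names are illustrative assumptions. What matters is that retrieval attributes are captured at response time and the outcome is filled in later, when feedback arrives.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponseRecord:
    response_id: str
    query_category: str
    doc_ids: List[str]
    similarities: List[float]
    docs_agree: bool
    oldest_doc_age_months: float
    worked: Optional[bool] = None  # None until delayed feedback arrives

    @property
    def top_similarity(self) -> float:
        return max(self.similarities, default=0.0)


def attach_outcome(record: ResponseRecord, worked: bool) -> ResponseRecord:
    """Fill in the outcome once feedback for this response is known."""
    record.worked = worked
    return record
```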

Patterns emerge faster than expected. Responses generated when retrieval confidence was below 0.5 had a significantly higher rate of negative feedback. Responses drawing on documents older than 12 months for policy questions escalated at roughly twice the rate of responses using recent documents. These were not subtle correlations requiring careful statistical analysis. They were clear enough to be visible after a few hundred interactions per query category.
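The kind of comparison described here needs nothing more than grouped rates. A minimal sketch, assuming history is available as pairs of top similarity and whether the feedback was negative, with the 0.5 split taken from the paragraph above:

```python
from collections import defaultdict


def negative_rate_by_bucket(history, floor=0.5):
    """Negative-feedback rate for low- and high-confidence responses.

    history: iterable of (top_similarity, was_negative) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [negatives, total]
    for similarity, was_negative in history:
        bucket = "low" if similarity < floor else "high"
        counts[bucket][0] += int(was_negative)
        counts[bucket][1] += 1
    return {bucket: neg / total for bucket, (neg, total) in counts.items()}
```

Running the same grouping per query category, and again on document age instead of similarity, gives the two comparisons mentioned above.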

What you do with this is not retrain the underlying model. What you do is adjust decision thresholds: the confidence floor below which the system asks a clarifying question instead of generating an answer, the document-age cutoff beyond which temporal uncertainty gets flagged, the retrieval disagreement threshold above which the system declines to synthesize and instead presents the conflicting sources to the user.

These are policy decisions. The RL framing is useful because it makes explicit that these thresholds should be learned from data, not set by intuition once and forgotten. The system has a history. It can identify which query types consistently produce bad outcomes and tighten its constraints for those types, while relaxing them for the query patterns where it has a strong track record. This is different from getting smarter about content. It is the system getting smarter about itself.
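"Learned from data" can be as plain as a search over candidate thresholds. The sketch below picks the lowest confidence floor at which answered responses meet a target negative-feedback rate. The target rate, the minimum sample size, and the candidate grid are assumptions, and this is one simple approach, not necessarily the one deployed.

```python
def learn_confidence_floor(history, target_negative_rate=0.1,
                           min_samples=5, candidates=None):
    """Lowest floor whose answered responses meet the target negative rate.

    history: list of (top_similarity, was_negative) pairs for one query category.
    Returns None when no candidate qualifies, so the caller can keep a default.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    for floor in sorted(candidates):
        answered = [neg for sim, neg in history if sim >= floor]
        if len(answered) < min_samples:
            break  # too little data above this floor to trust a rate
        if sum(answered) / len(answered) <= target_negative_rate:
            return floor
    return None
```

Run once per query category, this tightens the floor where outcomes have been poor and leaves it low where the track record is strong. Categories with thin history return `None`, which is the honest answer.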

Building the feedback loop: messier than it sounds

The practical implementation was messier than the description implies, because feedback signals are noisy in ways that are hard to anticipate before you are collecting them at scale.

Explicit ratings were the most direct signal and the least reliable. Users who were very unhappy or very satisfied rated responses. The large middle, users who got a mediocre answer and moved on, left nothing. More importantly, a thumbs-down did not distinguish between "wrong information," "right information in the wrong format," and "I already knew this, I needed something else." Those are three different failure modes requiring three different fixes. Treating them as a single signal produced a reward structure that favored confident responses, because confident responses correlated with satisfied users. That correlation held on average and failed in the tail cases where confidence was the specific problem.

The retry signal was cleaner but required inference. If a user asked about M-Pesa reversals, got a response, and immediately followed up with "what if I already tried that and it didn't work," that is almost certainly a signal the first answer was wrong or incomplete. Identifying these patterns required a windowed session model that could link queries within the same user session, which was a non-trivial piece of infrastructure at the data volumes we had. It also required careful handling of sessions that spanned network drops and reconnections, which happened more often than they should have in the environments we were operating in.
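A stripped-down version of the windowed check might look like the sketch below. It assumes each turn already carries a stable `session_id` that survives reconnects and a `topic` label from some upstream classifier. Both are the hard parts and both are hypothetical here, as is the 120-second window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    session_id: str   # assumed stable across network drops
    timestamp: float  # seconds since session start
    role: str         # "user" or "bot"
    topic: str        # assumed upstream topic label


def find_retries(turns, window_seconds=120.0):
    """Return bot turns followed by a same-topic user turn within the window."""
    by_session = {}
    for turn in sorted(turns, key=lambda t: (t.session_id, t.timestamp)):
        by_session.setdefault(turn.session_id, []).append(turn)

    flagged = []
    for session in by_session.values():
        # Compare each turn with the one immediately after it in the same session.
        for prev, nxt in zip(session, session[1:]):
            if (prev.role == "bot" and nxt.role == "user"
                    and nxt.topic == prev.topic
                    and nxt.timestamp - prev.timestamp <= window_seconds):
                flagged.append(prev)
    return flagged
```

A same-topic follow-up is only a proxy for dissatisfaction. Telling a genuine next question apart from a retry would need more than a topic match.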

Human escalation was the most reliable ground truth and the most delayed. When a user ended up with a human agent, we could often trace it back to a specific bot interaction. The agent's resolution notes contained the actual correct answer, which could be compared to what the bot had said. Those cases were the highest-quality training signal we had. But by the time that signal arrived, hours had passed, and the feedback loop had to account for the delay without over-indexing on stale information.
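Tracing an escalation back to a bot interaction can be sketched as a lookup: find the most recent bot response to the same user inside a lookback window. The tuple layout, the hours-based timestamps, and the six-hour window are assumptions for illustration.

```python
def attribute_escalation(escalation_time, user_id, bot_responses, lookback_hours=6.0):
    """Return the id of the latest bot response to this user before the escalation.

    bot_responses: list of (response_id, user_id, timestamp_hours) tuples.
    Returns None when nothing falls inside the lookback window.
    """
    candidates = [
        (timestamp, response_id)
        for response_id, uid, timestamp in bot_responses
        if uid == user_id and 0 <= escalation_time - timestamp <= lookback_hours
    ]
    # max() on (timestamp, id) tuples selects the most recent qualifying response.
    return max(candidates)[1] if candidates else None
```

The lookback window is where the staleness trade-off lives: too short and real causes are missed, too long and unrelated responses get blamed.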

It took about eight weeks to get the feedback pipeline to a state we trusted enough to act on. The early version rewarded confidence because users reported satisfaction more often after confident responses, which was statistically true and directionally misleading. Correcting for this required deliberately adjusting the reward to penalize high-confidence responses that later correlated with escalations, even if the user did not provide negative feedback at the time of the interaction.
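The correction can be expressed as reward shaping. In the sketch below, a later escalation subtracts from the reward even when the user gave no rating, and subtracts more when the response was high-confidence. The function name, the 0.7 confidence cutoff, and the penalty size are placeholders, not the values we used.

```python
def shaped_reward(user_rating, confidence, escalated_later,
                  high_conf=0.7, escalation_penalty=1.0):
    """Reward for one response, penalizing confident answers that later escalated.

    user_rating: +1 (positive), -1 (negative), or 0 (no rating given).
    confidence: retrieval confidence in [0, 1] at response time.
    """
    reward = float(user_rating)
    if escalated_later:
        reward -= escalation_penalty
        if confidence >= high_conf:
            # Extra penalty scaled by confidence: confident misses cost the most.
            reward -= escalation_penalty * confidence
    return reward
```

With this shape, a confident answer that drew a thumbs-up but later ended in an escalation nets out negative, which is the behavior the early reward was missing.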

What the evolved system looks like in practice

Walk through the M-Pesa reversal query in the system after these changes.

A user sends: "I sent money to the wrong number last night, how do I reverse it."

The retrieval runs a hybrid query, semantic embedding search combined with keyword matching for "reversal" and "wrong number." Three documents surface. Document A is a current policy document dated two months ago, similarity 0.76. Document B is an older FAQ dated 18 months ago describing the deprecated USSD flow, similarity 0.71. Document C is a troubleshooting article from six months ago, similarity 0.58.

In the original system, all three enter the context window. Document B has specific, step-by-step language that tends to dominate synthesis. The output describes the deprecated steps with confidence. The user follows them and fails.

In the evolved system, the recency weight applies to Document B. At 18 months old for a current-policy query, its effective retrieval score drops to 0.44 and it falls below the threshold. Documents A and C are the context the model receives. The system also detects that A and C agree on the reversal process but that A contains a specific confirmation step C does not mention. Rather than blending the two silently, the response highlights the confirmation step as important and cites the policy document directly.
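The arithmetic implied by those numbers, assuming the recency weight acts as a simple multiplier on the raw similarity (the exact formula is not given here, so the multiplicative form is an assumption):

```latex
s_{\text{eff}} = w(\text{age}) \cdot s
\quad\Rightarrow\quad
w(18\ \text{months}) = \frac{s_{\text{eff}}}{s} = \frac{0.44}{0.71} \approx 0.62
```

Read that way, an 18-month-old document on a current-policy query loses a little over a third of its retrieval score, enough to move it from second place to below the cutoff.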

The response is slightly less fluid than what the original system would have produced. It is not shorter. It has a citation. But a user following these steps will reach the correct outcome. That is what reliability looks like when you are not optimizing for impressiveness in a demo.

What actually improved, and what did not

The high-confidence wrong answer rate dropped significantly. H-Score, the hallucination metric we tracked, fell in the policy query category from around 22% to under 8% over four months. Escalation rates for policy questions dropped. The specific category of outdated policy errors essentially disappeared once recency weighting was in place, because almost all of those cases involved documents that were objectively old, which turned out to be a simple and checkable condition nobody had thought to check.

Query types where the system learned to ask clarifying questions before answering showed better outcomes, not just safer responses. When the system asks "are you trying to reverse a payment you sent, or one you received?" before generating an answer, the answer it produces is correct for the specific situation. The extra conversational turn had a real cost: some users found it annoying, and latency-sensitive interactions on slow connections felt noticeably slower. But outcome quality was higher and escalation rate was lower. That tradeoff was worth it for the high-stakes query categories where we applied it. Not for all queries, and making that distinction correctly was itself a thing the system had to learn.

Sparse data queries did not improve. When a user asks about a product feature that launched recently and the knowledge base has one document about it, there is not enough retrieval signal to establish confidence reliably. The system either answers with low confidence and happens to be correct, or refuses and is being overcautious. There is no pattern to learn from because there is not enough volume. The RL approach requires history. For query categories with thin history, it has nothing to work with.

Feedback noise in stressed-user categories was persistent. For queries about account access and security issues, users rated responses negatively at higher rates regardless of whether the information was correct. The correlation between feedback and answer quality was weaker in these categories because user emotional state was contaminating the signal. We could not reliably separate "the bot gave wrong advice" from "the user was stressed about a security problem and rated everything negatively." We still cannot.

The adversarial long tail remains. Users who phrase questions unusually, combine multiple topics in a single query, or are trying to work around policies rather than understand them: these continue to produce failures at a higher rate than the core distribution. The system has learned what it knows and what it does not know within the query patterns it has seen. It has not learned to recognize when a query falls outside the distribution entirely, which is a harder problem and one that retrieval confidence alone does not fully capture.

Why this matters beyond the SACCO or the fintech

Enterprise systems fail in expensive ways when they are confidently wrong. This is not specific to M-Pesa. A logistics assistant that confidently quotes the wrong delivery window creates customer expectations that the operations team then has to manage. An HR assistant that misquotes leave policy creates compliance exposure. A procurement tool that invents a supplier condition that does not exist creates downstream contract problems. The category of failure is the same: a system optimized for fluency over correctness, deployed in a context where the cost of a wrong answer is not borne by the system.

A system that says "I am not certain about this, here is what I found, you may want to verify with a policy document or a human agent" is less impressive in a demo. It is often a better system in production, because the escalation rate is lower, the error rate is lower, and the human agents who are still in the loop are handling genuinely novel cases rather than cleaning up after confident machine errors.

The RL-plus-RAG architecture is not something you install. It is a commitment to building feedback infrastructure alongside the core system, and to being willing to let the system learn from its own history rather than tuning it from intuition. That takes longer to build and longer to show results than tuning prompts or switching models. The payoff is a system whose reliability improves over time rather than staying fixed at wherever it happened to land at deployment.

The thing that surprised me most was how much of the improvement came from retrieval decisions rather than generation. Better document selection, recency weighting, confidence thresholds on the retrieval side: these changed system behavior more than any prompt adjustment we made. The model was not the bottleneck. The model had never been the bottleneck. What the model could not do was select its own context, evaluate its own retrieval quality, or decide when not to answer. That is the problem the RL layer solved. The model was fine all along. It just needed a better system around it.

