AI / ML · 13 min read

Hallucination as an Engineering Problem: H-Score

Most teams treat hallucination as model quality. I treat it as a measurable system failure.

Sam Odongo · Apr 2025

About six weeks after the SACCO support assistant went live, one of their members called to dispute a deduction. She had asked the bot about early loan repayment and it had told her, precisely and confidently, that a 3% processing fee applied to any settlement made before the 18-month mark. She paid it expecting the deduction. The fee did not exist. The SACCO had no such policy. The assistant had invented it, and it had done so with the same tone it used for everything else: clear, specific, final.

That is what makes this failure mode expensive. Not the frequency. Not the dramatic cases. The damage accumulates in the answers that are structurally plausible, contextually appropriate, and specifically wrong. Users do not know to be skeptical. They act on what the system tells them. By the time the error surfaces, it has already had consequences.

When I traced the failure, the retrieval system had returned a fee schedule document that was partially relevant. The chunk boundaries had split the document at the wrong place, separating the conditions under which the fee applied from the fee figure itself. The model received the number without the conditionality, inferred the rest from its training distribution on policy language, and produced something that sounded exactly right. No one had a metric to show this was happening at any meaningful rate. The team knew, abstractly, that sometimes the model made things up. That was all they had.

The word "hallucination" is doing a lot of work in most engineering conversations, and very little of it is useful. When someone says the model hallucinated, what they mean is: the output was wrong in a specific way that I cannot attribute to a clear cause. It is a description of a symptom collapsed into a single term that implies the cause lives inside the model, which is usually where engineers have the least leverage.

Teams respond to this in recognizable patterns. They upgrade the model. They switch providers. They lower the temperature. Sometimes one of these helps. More often it shifts the failure distribution without actually reducing it. Nobody can tell whether it helped because there was no number before and no number after. There is just a vague sense that things improved or did not. That is not engineering. That is managing intuitions about a system you cannot see inside.

The other thing the word does is distribute blame in an unhelpful direction. "The model hallucinated" sounds like a weather event: unfortunate, somewhat random, not really attributable to any specific decision. It removes the system designer from the picture. But in most cases I have worked through, the hallucination was not random. It was a predictable output of specific design choices made upstream. Retrieval that returned loosely relevant documents. Chunks that stripped necessary context. Prompts that rewarded confident completion even in the absence of grounding. Latency constraints that foreclosed a verification step that would have caught the problem.

The system produced a wrong answer because the system was designed in a way that made wrong answers likely in that situation. The model was just the last component in the chain.

Hallucination is a system property

This is not a semantic distinction. It changes what you measure, what you fix, and whether you expect to make progress.

If hallucination is a model property, your primary lever is the model itself, and you are largely waiting on what frontier labs ship. If it is a system property, the failure rate is a function of your retrieval architecture, your chunking strategy, your prompt design, your latency budget, and your confidence handling. Most of those things you can change.

Think about where plausible-but-wrong answers actually come from. The model reads what you give it and produces something consistent with that input. When the retrieved context is incomplete, the model fills in the gap. When a chunk cuts off the sentence that would have qualified a claim, the model does not know that qualification existed. When your prompt says "answer the user's question based on the following documents" and the documents do not actually answer the question, the model still answers. It was built to complete. You told it to answer. It answered.

In a constrained operating environment this pressure is higher. When API calls are expensive, you are not doing multiple retrieval rounds or running a verification pass. When the knowledge base was built from inconsistent sources, partially translated documents, and FAQs written by different teams over several years, the retrieval layer is working with noisier inputs than any benchmark assumes. When users ask questions in colloquial Swahili mixed with English, your embedding model degrades in ways you have probably not measured. When network latency is unpredictable, any design that adds round-trips adds variance to the failure rate.

All of this is invisible if you are evaluating on a curated test set with clean documents and well-formed questions. Which is what most teams are doing.

Defining H-Score

H-Score is the proportion of responses in a sample window that are plausible and specific but unsupported by the retrieved context or any verifiable source.

The word "specific" is load-bearing. A vague answer is not a hallucination even if it is unhelpful. A refusal, "I could not find information about this," is not a hallucination even if the information existed somewhere in the knowledge base. H-Score targets responses that assert a specific fact, figure, policy, procedure, or condition that cannot be traced to the retrieved documents or confirmed against a ground-truth source.

Computing it is partly automated, partly not. The automated component is grounding coverage: for each substantive claim in a response, can the supporting text be located in the retrieved chunks? You run a lightweight verification step that checks whether the key entities, numbers, and assertions in the response are traceable to the retrieved context. Responses where the verification cannot find the grounding are flagged as hallucination candidates.
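A minimal sketch of what the automated grounding check can look like, restricted to numeric claims. It is illustrative, not the code from the SACCO deployment: the function names and the regular expression are assumptions, and a fuller check would also cover entities and paraphrased assertions.

```python
import re

# Assumption: numeric claims are integers, decimals, or percentages.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")


def extract_numbers(text: str) -> set[str]:
    """Return the numeric tokens (including percentages) found in text."""
    return set(NUMBER_RE.findall(text))


def ungrounded_numbers(response: str, chunks: list[str]) -> set[str]:
    """Numbers asserted in the response that appear in no retrieved chunk."""
    context_numbers: set[str] = set()
    for chunk in chunks:
        context_numbers |= extract_numbers(chunk)
    return extract_numbers(response) - context_numbers


def is_hallucination_candidate(response: str, chunks: list[str]) -> bool:
    """Flag the response for human review if any number lacks grounding."""
    return bool(ungrounded_numbers(response, chunks))
```

Exact string matching is deliberately crude: "3%" in a response will not match "3 percent" in a chunk, so a check like this over-flags. That is tolerable when every flag goes to a human reviewer anyway.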

The manual component is the review pass. Flagged responses get checked by a human against the source documents. This is slower but necessary because the grounding check has both false positives (responses that are ungrounded in the retrieved context but were correct from the model's training knowledge) and false negatives (responses where the retrieved text was wrong and the model faithfully reproduced the error). H-Score is not measuring absolute factual accuracy. It is measuring how often the system produces specific assertions not anchored to the evidence it was given. Those are related problems but not the same one.

For the SACCO deployment, we sampled 200 queries per week across query categories: policy questions, eligibility checks, repayment calculations, complaint handling. A reviewer checked flagged responses against the source documents. In week one, H-Score was 18%: nearly one in five sampled responses asserted something specific that the evidence did not support, all in the same confident tone. That was a number worth having.
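The arithmetic behind the weekly number is simple enough to sketch. The record layout below (`ReviewedResponse` and its fields) is a hypothetical structure for illustration, not the schema we used; the only thing it takes from the definition is that a response counts toward H-Score when it was flagged and a reviewer confirmed it as specific and unsupported.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReviewedResponse:
    category: str    # e.g. "policy", "eligibility", "repayment", "complaint"
    flagged: bool    # the automated grounding check flagged it
    confirmed: bool  # a reviewer confirmed it as specific and unsupported


def is_hallucination(r: ReviewedResponse) -> bool:
    """Only flagged responses are reviewed, so both conditions must hold."""
    return r.flagged and r.confirmed


def h_score(sample: list[ReviewedResponse]) -> float:
    """Proportion of sampled responses confirmed as hallucinations."""
    if not sample:
        raise ValueError("cannot compute H-Score on an empty sample")
    return sum(is_hallucination(r) for r in sample) / len(sample)


def h_score_by_category(sample: list[ReviewedResponse]) -> dict[str, float]:
    """Per-category H-Score, to show where the failures concentrate."""
    totals = Counter(r.category for r in sample)
    hits = Counter(r.category for r in sample if is_hallucination(r))
    return {category: hits[category] / totals[category] for category in totals}
```

The per-category breakdown matters more than the headline figure, because it is what points at retrieval, chunking, or the prompt as the place to look.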

Tracking H-Score weekly gives you something that accuracy metrics alone do not: a number that moves when you change the retrieval architecture, the chunking strategy, or the prompt design. And that tells you whether what you changed actually helped.

The moment I realized I needed it

It was not the call about the phantom fee. That was the consequence. The moment I needed H-Score was about two weeks earlier, when I was looking at the standard evaluation results and they looked fine.

Answer relevance was around 0.82 on the internal test set. Coherence was high. The model was not producing garbage. Support escalations were slightly elevated but not alarmingly so. By the metrics we were tracking, the system was working. We had evaluated it on conditions it was well-suited for and concluded, incorrectly, that it was well-suited for conditions in general.

The test set was curated. The questions were clear and unambiguous. The source documents had been pre-cleaned and chunked by someone who understood their structure. Of course the numbers looked good. We had designed the evaluation to succeed in exactly the conditions where the system was strongest. Real user queries are a different distribution: shorter, less structured, more likely to be about edge cases and exceptions, more likely to land in the parts of the knowledge base that were built inconsistently.

The 0.82 relevance score was measuring something adjacent to helpfulness on a subset of the distribution. It was not measuring how often the system produced specific wrong information. H-Score measured the gap between what our evaluation was telling us and what users were experiencing. Once we had it, we had a number to move.

The tradeoffs that do not resolve cleanly

Improving grounding enforcement in the prompt was the first intervention. Instead of asking the model to answer based on the documents, we required it to cite specific text and to explicitly state when a question could not be answered from the available context. H-Score dropped in the policy question category. It also introduced a new problem: the model started over-citing, quoting long chunks when a concise answer was possible, and occasionally refusing questions it could have handled with light inference. Some users found the citations useful. Others found the responses verbose. Both types of feedback arrived simultaneously.

Tightening the retrieval similarity threshold reduced H-Score in categories where loose retrieval had been the main driver. It also reduced coverage in long-tail queries. The system was now returning "I could not find relevant information" more often, which is honest but is not a satisfying response for a user with a real question who is calling from a slow connection after working hours.

The latency tradeoff was the most constrained. Adding a self-verification pass, where the system checks its own answer against the retrieved documents before returning, does reduce H-Score. In a well-connected environment with predictable API latency, this is viable. In this deployment, API response times varied enough that the verification pass occasionally added 4 to 5 seconds to an interaction that users expected to be fast. We implemented it selectively, for any response containing a number, a fee, a deadline, or an eligibility condition. For general informational queries we accepted a higher H-Score in exchange for response speed.
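The gating rule for selective verification can be sketched as a cheap pattern check on the draft response. The keyword list and the function name here are assumptions chosen for illustration; the real trigger set would be tuned against the deployment's own vocabulary.

```python
import re

# Assumption: a digit or any of these terms marks a response as high-stakes.
HIGH_STAKES_RE = re.compile(
    r"\d"  # any digit: amounts, rates, dates, durations
    r"|\b(?:fee|fees|charge|charges|penalty|deadline|eligible|eligibility)\b",
    re.IGNORECASE,
)


def needs_verification(response: str) -> bool:
    """True if the draft response should go through the verification pass."""
    return HIGH_STAKES_RE.search(response) is not None
```

A rule like this errs toward verifying too much, which costs latency, and it still misses amounts written out in words.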

That selective verification decision is still imperfect. The categories are not cleanly separable. Users embed fee questions inside conversational phrasing. The category classifier is not perfect. We catch most of the high-stakes cases, not all of them. The imperfection is documented and accepted rather than ignored.

This is where the reframe from "model problem" to "system problem" pays off in a way that is not just philosophical. When you know hallucination rate is a system property, you can make engineering tradeoffs against it: accuracy versus latency, coverage versus grounding, cost versus verification. When it is a model property, the only tradeoff is which model to use, which gives you far fewer levers.

What actually moved the number

Hybrid retrieval had the largest single impact. The knowledge base contained both structured policy documents and conversational FAQs. Pure semantic retrieval found topically adjacent content but missed exact matches for specific policy names, product codes, and fee categories. BM25 alone missed paraphrased versions of the same questions. Combining both with reciprocal rank fusion brought H-Score down by around seven percentage points in the categories where retrieval quality was the primary driver. Not everywhere. The improvement was concentrated where the retrieval failure had been concentrated.
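For readers who have not used it, reciprocal rank fusion is only a few lines. This is a generic sketch rather than our production code: each document scores the sum of 1 / (k + rank) over the rankings it appears in, and k = 60 is the conventional default from the literature, not a value tuned for this deployment. The document IDs in the usage lines are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) retriever and a dense retriever.
bm25_hits = ["fee_schedule", "loan_terms", "faq_12"]
dense_hits = ["faq_12", "fee_schedule", "early_repay"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because the fusion uses ranks rather than raw scores, BM25 scores and embedding similarities never need to be put on a common scale.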

Chunk boundary correction was slower to implement and harder to isolate in the metrics, but it mattered in a specific and traceable way. The original chunking used a rough token limit with a small overlap. Policy documents have clauses that reference each other: the fee schedule references the conditions; the conditions reference the product type. A chunk containing the fee figure without the conditions under which it applied was actively harmful context, because the model treated the fee as unconditional. Restructuring the chunker to respect section boundaries and keep related clauses together changed the hallucination pattern in the fee and policy categories noticeably. The model started producing qualified answers because it was receiving qualified context.
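A sketch of section-aware chunking, assuming policy documents mark top-level sections with headings like "2. Early settlement" and sub-clauses like "2.1". Both the heading convention and the word-based size limit are assumptions for illustration; the actual documents and tokenizer will differ.

```python
import re

# Assumption: "2. Title" starts a section; "2.1 ..." is a sub-clause inside it.
TOP_LEVEL_HEADING = re.compile(r"^\d+\.\s+\S")


def split_sections(text: str) -> list[str]:
    """Split a document at top-level headings, keeping sub-clauses attached."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if TOP_LEVEL_HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_by_section(text: str, max_words: int = 300) -> list[str]:
    """Pack whole sections into chunks; a section is never split.

    A section longer than max_words becomes its own oversized chunk, which
    trades chunk size for keeping a figure next to its conditions.
    """
    chunks: list[str] = []
    buffer: list[str] = []
    size = 0
    for section in split_sections(text):
        words = len(section.split())
        if buffer and size + words > max_words:
            chunks.append("\n\n".join(buffer))
            buffer, size = [], 0
        buffer.append(section)
        size += words
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks
```

The point of the design is the invariant, not the packing: whatever else happens, a fee figure and the clause that conditions it land in the same chunk.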

Confidence-based refusal, implemented as a threshold below which the system returns a soft refusal rather than attempting an answer, helped in the long-tail queries where the retrieved context was thin. It did not eliminate hallucination there but reduced the rate at which the system produced confident, specific, wrong answers when it had almost nothing relevant to work with.
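One way to sketch the refusal gate, under the assumption that "confidence" means the best retrieval similarity score for the query. That choice, the threshold value of 0.55, the refusal wording, and the `generate` callable are all placeholders for illustration.

```python
from typing import Callable

SOFT_REFUSAL = (
    "I could not find information about this. "
    "Would you like me to connect you with the support team?"
)


def answer_or_refuse(
    query: str,
    retrieved: list[tuple[str, float]],  # (chunk text, similarity score)
    generate: Callable[[str, list[str]], str],
    threshold: float = 0.55,  # placeholder; tune per deployment
) -> str:
    """Return a soft refusal when the best retrieved chunk is too weak."""
    best_score = max((score for _, score in retrieved), default=0.0)
    if best_score < threshold:
        return SOFT_REFUSAL
    return generate(query, [text for text, _ in retrieved])
```

The gate runs before any generation call, so a refusal costs no model latency.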

These three changes, in combination, brought H-Score from 18% to around 6% over about two months. The remaining 6% is not random noise. It clusters in specific query patterns, which is useful because it tells you exactly where to look next.

What still breaks

Calculation-heavy queries are a persistent problem. The retrieval returns the right policy document. The model understands the policy. Then it performs arithmetic incorrectly on user-supplied numbers and presents the result with the same confidence it uses for everything else. That is not a retrieval failure or a chunking failure. It is a reasoning failure, and it sits in a different part of the problem space. H-Score catches it, but no prompt adjustment has reliably fixed it. For high-stakes calculations we have moved to extracting the formula and the inputs separately and computing deterministically, then wrapping the result in a natural-language response. It works, but it requires explicit tooling for each calculation type the system needs to handle.
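A sketch of the deterministic path, using simple interest as a stand-in calculation type. The formula, the registry, and the function names are illustrative assumptions, not the SACCO's actual products or our tooling; the pattern is that the model only identifies the calculation type and extracts the inputs, and ordinary code does the arithmetic.

```python
from decimal import Decimal, ROUND_HALF_UP


def simple_interest(principal: Decimal, annual_rate_pct: Decimal, months: int) -> Decimal:
    """Placeholder formula: principal * rate * (months / 12), to 2 decimals."""
    interest = principal * annual_rate_pct / Decimal(100) * Decimal(months) / Decimal(12)
    return interest.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# One entry per calculation type the system supports.
CALCULATORS = {"simple_interest": simple_interest}


def compute(calc_type: str, **inputs) -> Decimal:
    """Dispatch to a registered calculator; unknown types fail loudly."""
    if calc_type not in CALCULATORS:
        raise KeyError(f"no deterministic calculator for {calc_type!r}")
    return CALCULATORS[calc_type](**inputs)


def render(result: Decimal) -> str:
    """Wrap the computed figure in a fixed natural-language template."""
    return f"Based on the figures you gave, the interest comes to {result}."
```

`Decimal` is used instead of floats so that currency amounts round predictably, and the unknown-type error gives the caller a clear signal to route the query to a human instead of letting the model improvise.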

Comparison queries are the other persistent weak point. When a user asks to compare two products or two loan options, the model has a tendency to conflate attributes across documents, so the answer ends up partially correct about each product but fully accurate about neither. The retrieved context for both sides often exists. The failure is in synthesis across multiple documents, not in retrieval. We have not found a prompt strategy that handles this reliably at scale without also producing responses that are so hedged they stop being useful.

Then there is the long tail. No knowledge base covers everything, and users find the gaps faster than you expect. When a question falls outside what the system was designed to handle and the refusal threshold is not tuned well enough to catch it, the system reaches. The reaching still happens at a low rate. But the queries that trigger it get stranger over time as the user base grows and interacts with the system in ways you did not design for.

H-Score goes up when the query distribution shifts, which happens when the product changes, when a new user cohort starts using the system, or when an external event makes a previously rare question suddenly common. We have had the fee dispute type of failure reappear in a different form twice since the initial fix, triggered by policy changes that were not reflected in the knowledge base quickly enough. The fix in those cases was not a retrieval or chunking improvement. It was a content freshness problem.

H-Score is lower now than when we started. It is not zero and it will not be. The useful property is not the absolute number. It is that the number moves in response to engineering decisions, and it tells you which direction you moved in. That sounds obvious but it is not where most teams are. Most teams are managing a system they cannot see inside, making changes and waiting to hear whether things got better from the users who notice when they do not.

When H-Score goes up in the weekly sample, we look at the query categories driving the increase, trace the failure pattern, and decide whether the root cause is in retrieval, chunking, the prompt, the confidence threshold, or the knowledge base itself. Sometimes the fix is clear. Sometimes we put the category into a known-limitations list and route those queries to the human support team. That is also an engineering decision, not a failure to solve the problem.

The reframe matters because it changes what you think you can do. Hallucination is not something you wait on the model providers to fix. It is something you measure, locate, and engineer around, with real tradeoffs and imperfect solutions, like every other reliability problem in a production system.