MLOps

Three Ways Our ML Model Broke in Deployment

What testing did not catch, and production exposed immediately.

Feb 2025·10 min read·MLOps

The client noticed before we did. That is probably the worst way to find out something is wrong.

It was a Thursday morning call. The stakeholder, politely but clearly, asked why the risk scores from the previous day looked unusual. Several customers the model had flagged as high-risk were obviously fine. A batch of customers the model had cleared were showing activity patterns that, to any human reviewing the data, looked like they warranted attention. The model was confident. The model was wrong. And we had no alert that told us so.

We went back and looked at the pipeline. Everything was running. Jobs were completing. Records were being written to the output table. The monitoring we had in place, which mostly tracked job success, failure, and record counts, was showing green. The model was producing predictions. We had no way of knowing those predictions had deteriorated until a person who understood the business logic looked at the outputs and noticed they did not make sense.

That call started three weeks of investigation that eventually surfaced three separate issues, each of which had been quietly present for longer than we would have liked to admit.

Before I get into what broke, here is what we thought we had built.

The model was a binary classifier trained to identify customers at elevated risk of a specific type of financial irregularity. Not dramatic scale. A few hundred thousand records, batch predictions run nightly, results written to a table the client's risk team reviewed each morning. The feature engineering was reasonably involved: rolling aggregations over 7, 14, and 30-day windows, a handful of ratio features, some categorical encodings. We used a gradient boosting model trained on about 18 months of historical data, validated on a held-out time period, and tuned until the metrics on the validation set looked solid.

The serving pipeline pulled features from a feature store, ran inference, and wrote outputs. We had tested the end-to-end pipeline in staging using a representative sample of production data. The metrics looked fine. We deployed with reasonable confidence.

Six weeks later, that Thursday call.

The first thing we investigated was the most obvious hypothesis: maybe the model was receiving bad data. We ran spot checks on the input features going into inference. They looked plausible. Not obviously wrong. We moved on to other hypotheses, which turned out to be a mistake, because the data hypothesis was actually correct. We had just not checked deeply enough.

What we should have noticed was that the distribution of several key input features had shifted materially from the distribution the model had been trained on. The client had made product changes about three weeks after we deployed. New transaction categories were introduced. Some existing categories were renamed or consolidated. The mapping table we used to encode categorical features had not been updated to account for the new categories, which meant anything falling into a new category was being encoded as a null and then imputed with a default value set during preprocessing.

This is data drift in the specific, frustrating form it usually takes in practice: not a clean distributional shift that monitoring would immediately flag, but a localized change in a categorical feature that cascades into bad encodings that feed a model trained on a different encoding scheme. The model was not hallucinating. It was doing exactly what it had been trained to do. It had just been trained on a world that no longer existed.

We had not set up any monitoring on the input feature distributions. We were monitoring pipeline health, not data health. Those are different things, and we had conflated them. Pipeline health tells you the system is running. Data health tells you whether what the system is running on still resembles the data it was designed for. We had the former and assumed it implied the latter.

The second issue took longer to find because it was genuinely quiet.

During the investigation, someone pulled a sample of raw inference requests and compared them to what the feature store was actually returning. We found that about 12% of records in any given batch had at least one feature with a missing value. The preprocessing code, which had been written to be robust, was catching these on a feature-by-feature basis and substituting each missing value with the mean of that feature from the training set, without logging that it had done so.

This sounds like reasonable defensive programming. In a controlled environment, with infrequent missing values caused by the usual upstream data quality issues, it probably is. In our case the missing values were not random. They were concentrated in the new transaction categories that did not have encodings. So we were not imputing noise. We were systematically replacing an informative signal with noise, then feeding that noise into a model and recording its confident outputs.

The pipeline never raised an error. No alert fired. The imputation was happening silently, at scale, on a meaningful fraction of the input data, and the only trace it left was in a log file that nobody was reading. We found it by accident while looking for something else entirely.

A system that crashes is annoying. A system that silently does the wrong thing, at production scale, for weeks, while looking healthy from the outside, is a different category of problem. The silence was the issue. We had designed the system to be resilient against upstream data quality problems, and what we had actually designed was a system that was resilient against being caught.

The third issue was the feature store.

When we built the serving pipeline, we specified feature aggregation windows that we thought matched what the training pipeline had used. 7-day rolling sums, 14-day rolling averages, that kind of thing. What we had not checked carefully enough was the boundary condition: how the window was calculated relative to the current date, and specifically whether "7 days" meant the 7 calendar days before the prediction date or the 7 days before midnight of the prediction date, and how each pipeline handled prediction dates that fell on weekends when transaction volumes were lower.

The training pipeline and the serving pipeline had been written separately, at different times, by people who shared the same intent but had slightly different implementations. The difference was not large. A matter of a few hours at the boundary, consistently applied. But it was systematic, and it meant the model was being served features that were not quite the features it had been trained on.

In retrospect the training-serving skew was detectable. If we had logged feature values from both pipelines and compared distributions for the same customers on the same dates, we would have seen it. We did not do that. We tested that the serving pipeline produced outputs of the right shape and the right data types. We did not test that the values were numerically consistent with what training would have produced for the same inputs.

The impact was harder to quantify than the other two failures, which made it harder to motivate fixing quickly. The predictions were not wildly wrong, just subtly wrong in ways that accumulated across the customer population and degraded the precision of the risk scores enough to matter to the business but not enough to show up clearly in aggregate metrics. That particular combination, consequential but not obvious, is its own failure mode.

Debugging all three of these simultaneously, without knowing there were three things to debug, is genuinely disorienting.

The first few days we were convinced the problem was the model itself. We re-ran evaluation on the original validation set and the metrics were fine, which led us briefly to conclude the model was okay and something downstream was broken. That conclusion was half right in a way that delayed us. Then we found the drift issue and thought we had found the problem. We updated the encoding mapping, redeployed, and the predictions improved somewhat but not fully. That partial improvement was misleading. It made us think we had fixed things when we had actually only fixed one of three things. The remaining degradation was now easier to attribute to noise or to the model adjusting to the data changes, and we almost let ourselves believe that.

The silent imputation took another week to find, mostly because nobody thought to look at what was happening inside the preprocessing code at inference time. We had reviewed that code before deployment. We had not instrumented it. There is a difference between code you have read and code you have watched run at scale. We had done the former and assumed it was sufficient.

The feature store skew was last. By the time we found it we had been debugging for about two weeks, the team was tired, and the investigation had that particular quality where you start suspecting things that are increasingly unlikely because the likely things have already been ruled out. We found it because one engineer, probably out of stubbornness more than method, decided to log every feature value at serving time for a week and compare it numerically to what the training pipeline would have produced for the same customers on the same dates. The mismatch was there. Small, consistent, and everywhere.

What testing did not catch, across all three failures, was not a gap in test coverage in the conventional sense. The tests we had were passing. The staging environment looked healthy. The gap was in what we were not watching.

We tested that the system produced outputs. We did not test whether the outputs were meaningful relative to the inputs. We tested that the pipeline ran. We did not test whether the data flowing through it still resembled the data the model had learned from. We tested that features were being computed. We did not test whether they were being computed the same way in serving as in training. In each case the test was real and it passed. In each case it was testing the wrong thing.

The staging environment made this worse in a subtle way. Staging used a representative sample of production data from a specific point in time. It was representative of what production looked like when we built it. Three weeks after deployment, when the client made product changes, staging no longer represented production. But we had no signal telling us that, because staging is not production. It does not change when production changes. It just sits there, looking fine, while production drifts.

What changed afterward was mostly about observability, because that was the through-line across all three failures. We had not been watching the system at the right level of detail.

We added distribution monitoring on the input features, not just record counts. If any feature's distribution shifts beyond a threshold from its training distribution, a warning is filed before the model runs, not after the outputs have been written to the reporting table. Thresholds are hard to set without generating noise. The first version generated a lot of noise. We tuned it. It is not perfect but it means a categorical encoding failure that affects 12% of records is visible before it shapes a week of decisions.

We added explicit logging in the preprocessing pipeline for every imputation event, including which feature, which record, and what value was substituted. That log feeds a dashboard. If the imputation rate for any feature exceeds 2% in a batch, someone looks at it before the batch finalizes. Not after.

We added a feature consistency check that runs as part of the deployment process. Before any model update goes to serving, a sample of records runs through both the training feature pipeline and the serving feature pipeline and the outputs are compared numerically. If they diverge beyond a defined tolerance, the deployment is blocked. This is the check that should have existed from the beginning, and the reason it did not is that it felt like over-engineering at the time. It does not feel that way anymore.

What still makes me uneasy: the monitoring we have now would catch the failures we encountered. It would not necessarily catch failures we have not yet encountered. We added checks for the specific things that broke. A different kind of drift, or a different kind of silent failure, or a serving pipeline bug that produces values within the acceptable tolerance but systematically in the wrong direction, would still get through. The observability is better. It is not comprehensive. I am not sure comprehensive is achievable, which is a different kind of discomfort than simply not having tried.

The feature store is also still a manual process in important ways. We have the consistency check at deployment time, but there is no automated reconciliation between the training and serving pipelines when either one changes independently. If someone updates the training pipeline and forgets to update the corresponding serving logic, the check will catch it at the next deployment. But there is a window. During that window the system is quietly in a state we have been in before, and we know how that tends to end.

The part I think about most is the silent imputation. We fixed it by logging. But the fix assumes someone is reading the logs. Someone is, right now. What happens when the team rotates, when the dashboard gets stale, when the alert threshold gets relaxed because it was generating too much noise during a data migration? The system is only as reliable as the attention paid to it. Attention, unlike code, does not persist.

Related Research

Reliability Framework for AI-Driven Financial Systems

The formal model for uncertainty quantification, hallucination detection, and confidence-gated escalation in production AI systems -- the architectural response to the silent failure modes documented here.

Read →