Production Systems · 10 min read

Inside a Real-Time Fraud Detection System

Why catching fraud in milliseconds is harder than it sounds

Sam Odongo · Mar 2025

A transaction hits the scoring service at 11:47 PM. Cross-border payment. The card is US-issued. The user's last dozen transactions were in Austin. This one is originating from a merchant in Lagos, $2,400. The system has roughly 200 milliseconds to pull user features, score the transaction, apply business rules, and return a decision before the payment gateway times out. Network latency to the feature store is already eating 40 of those milliseconds.

The system blocks it. The user is a consultant who just landed for a three-week project. The transaction is completely legitimate. He calls the bank. The support agent can see the block but cannot reverse it through the fraud system in real time. The transaction is never retried. He uses a different card. The fraud system logs a false positive that will arrive in the training pipeline in approximately 72 hours, after passing through a manual review queue.

This is not a description of a broken system. This is a functioning fraud detection system operating at the edge of its confidence, doing exactly what it was configured to do, and still getting it wrong in a way that costs the business a customer and costs the user an hour of frustration. Understanding why requires going into the architecture.

Fraud detection has to work in real time because the decision point is the transaction authorization. By the time a user sees the payment screen, the window to respond is already open. A payment gateway waiting for a fraud score has its own timeout expectations, and that timeout is set by the payment network infrastructure, not by your engineering team. If the fraud system is not fast enough, the gateway either approves without a score, which is a different kind of risk, or declines by default, which produces false positives at scale. Neither is acceptable.

The latency budget is also smaller than the headline number suggests. The fraud scoring service is one component in a chain that includes the payment gateway, the card network authorization, the issuing bank, and often a third-party fraud network. Each component consumes a slice of the available time. In practice, by the time a scoring request reaches the detection service, the remaining budget can be under 150 milliseconds. Everything has to fit inside that budget: feature retrieval, model inference, rule evaluation, and response serialization.
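To make the arithmetic concrete, here is a minimal sketch of per-request budget accounting. The 150 ms total and the 40 ms feature store round-trip come from the text above; the other stage costs and the `LatencyBudget` class itself are hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyBudget:
    """Tracks how much of a fixed per-request budget remains, in milliseconds."""
    total_ms: float
    spent: dict = field(default_factory=dict)

    def charge(self, stage: str, ms: float) -> None:
        # Accumulate time spent per stage so a slow stage is attributable later.
        self.spent[stage] = self.spent.get(stage, 0.0) + ms

    @property
    def remaining_ms(self) -> float:
        return self.total_ms - sum(self.spent.values())

    def can_afford(self, ms: float) -> bool:
        # True if an optional step of the given cost still fits in the budget.
        return ms <= self.remaining_ms


budget = LatencyBudget(total_ms=150.0)
budget.charge("feature_retrieval", 40.0)  # network round-trip to the feature store
budget.charge("model_inference", 5.0)     # hypothetical stage costs from here on
budget.charge("rule_evaluation", 3.0)
budget.charge("serialization", 2.0)
```

A check like `budget.can_afford(...)` is one way to decide at request time whether an optional step, such as a second feature lookup, fits or has to be skipped.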

This is categorically different from most ML problems. Most ML problems give you time to be thoughtful. You can compute expensive features, run multiple model passes, retry on failure, and tune in a test environment before anything reaches a user. Real-time fraud detection does not offer any of that. The latency constraint is set by infrastructure you do not control, and everything else has to fit inside it.

The event streaming layer is where transaction data gets into the system fast enough to matter. In the systems I've worked on, this is typically Kafka or Kinesis at the ingestion layer: the payment gateway publishes transaction events to a stream as they arrive, and the fraud detection infrastructure consumes from that stream in real time. The choice between them is mostly operational; both provide durable, ordered, partitioned streams with the throughput characteristics real-time payment systems need.

Partitioning strategy matters more than it is usually given credit for. Fraud detection often needs to make decisions in the context of other recent transactions from the same account. If two transactions from the same account land in different partitions, they may be processed by different consumer instances with no shared state. This is fine if you are looking up features from a shared store, but it creates ordering ambiguity when the velocity features themselves depend on processing order. Partitioning by account identifier rather than transaction ID is the standard answer, though it brings its own tradeoffs around hot partitions for high-activity accounts.
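The idea of keying by account can be sketched in a few lines. This is an illustration only: `partition_for` is a hypothetical helper, and the hash shown is a stand-in. Kafka and Kinesis each apply their own hashing to the record key, so in practice you set the key to the account identifier and let the broker assign the partition.

```python
import hashlib


def partition_for(account_id: str, num_partitions: int) -> int:
    """Map an account to a partition deterministically (illustrative hash only)."""
    digest = hashlib.md5(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical events: two from the same account, one from another.
events = [
    {"txn_id": "t1", "account_id": "acct-42"},
    {"txn_id": "t2", "account_id": "acct-42"},
    {"txn_id": "t3", "account_id": "acct-7"},
]

# Keying by account_id keeps every event for one account on one partition,
# so a single consumer sees that account's events in order.
assignments = {e["txn_id"]: partition_for(e["account_id"], 8) for e in events}
```

The same determinism is what produces hot partitions: one very active account sends all of its traffic to a single consumer.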

On a project involving cross-border payment flows for a global fintech client through Turing, we had transaction events originating from several regional gateways, all landing in a central Kinesis stream for scoring. The naive implementation treated it as a single undifferentiated stream. During a US gateway maintenance window, traffic shifted to a backup, partition assignments scrambled on the consumer side, and we spent the better part of an hour diagnosing why some scoring consumers were processing events in unexpected order. The tests had used a single regional source. The production failure needed multiple regions under failover conditions to surface. It always does.

Features are the real engineering problem in fraud detection, and where most systems accumulate the most technical debt.

The signals that actually distinguish fraud from legitimate behavior are not the fields on the transaction record. Amount, merchant category, time of day: useful context but not sufficient. What separates fraud is behavioral context: how does this transaction compare to what this user normally does, what happened in this account in the last fifteen minutes, how many failed authentication attempts preceded this, what is the velocity of charges across this card in the last hour. Those features require knowing history. And computing history from raw event data inside a 150-millisecond window is not feasible.

The answer is a feature store: a low-latency store that maintains precomputed and continuously updated features for each entity, readable with a single lookup. Instead of computing "average transaction amount over last 90 days" at scoring time, you maintain that number incrementally as transactions arrive and read it directly. A single lookup on a well-provisioned feature store takes 5 to 15 milliseconds. Recomputing the same value from a data warehouse on every transaction does not come close to that.
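As a rough sketch of what "maintain that number incrementally" can mean, the class below keeps a trailing 90-day average as daily buckets, so a read is constant-time and old days fall out of the window. The class name, the bucket layout, and the in-memory storage are assumptions for illustration; a real feature store would persist this state per entity.

```python
from datetime import date, timedelta
from typing import Optional


class RollingAverageAmount:
    """Average transaction amount over the trailing `window_days`, kept as daily buckets."""

    def __init__(self, window_days: int = 90):
        self.window_days = window_days
        self.buckets = {}   # day -> [sum of amounts, number of transactions]
        self.total = 0.0
        self.count = 0

    def _evict(self, today: date) -> None:
        # Drop whole days that have fallen out of the trailing window.
        cutoff = today - timedelta(days=self.window_days)
        for day in [d for d in self.buckets if d <= cutoff]:
            day_sum, day_count = self.buckets.pop(day)
            self.total -= day_sum
            self.count -= day_count

    def update(self, day: date, amount: float) -> None:
        # Called once per arriving transaction; no history scan needed.
        bucket = self.buckets.setdefault(day, [0.0, 0])
        bucket[0] += amount
        bucket[1] += 1
        self.total += amount
        self.count += 1

    def read(self, today: date) -> Optional[float]:
        # The scoring-time lookup: evict stale days, then return the average.
        self._evict(today)
        return self.total / self.count if self.count else None


feature = RollingAverageAmount(window_days=90)
feature.update(date(2025, 1, 1), 100.0)
feature.update(date(2025, 1, 2), 200.0)
avg_in_window = feature.read(date(2025, 1, 3))      # both days inside the window
avg_after_expiry = feature.read(date(2025, 4, 1))   # Jan 1 has aged out
```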

The tradeoff is freshness. A feature store is telling you about the state of the world as of the last update. For features that change slowly, like a user's average monthly spend, an hourly update cadence is usually fine. For velocity features, the ones that capture rapid behavioral changes that are the signature of account takeover, an hourly cadence is too slow. A fraud ring that compromises an account and begins probing with small transactions will build a detectable velocity pattern in the event stream well before the feature store reflects it.

In practice, you end up running two parallel pipelines. An offline or near-real-time pipeline that maintains slower-moving behavioral features, updated every few minutes or hourly. A streaming pipeline that computes velocity features over short windows, updated continuously from the event stream. At scoring time you merge both. When the two pipelines are temporarily inconsistent, the model sees a picture of the user that is correct in parts and stale in others, and the score reflects that ambiguity whether or not you intended it to.
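The merge at scoring time might look like the sketch below, assuming event timestamps in seconds and in-order arrival. `VelocityCounter`, `merge_features`, and the feature names are hypothetical. Exposing the age of the batch features alongside their values is one possible design choice, not something the text prescribes; it lets the model or a rule see how stale the slow-moving side is.

```python
from collections import deque


class VelocityCounter:
    """Counts events in a sliding window; assumes timestamps arrive in order."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.timestamps = deque()

    def add(self, ts: float) -> None:
        self.timestamps.append(ts)

    def count(self, now: float) -> int:
        # Discard events that have slid out of the window (now - window, now].
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)


def merge_features(batch_features: dict, batch_updated_at: float,
                   velocity: VelocityCounter, now: float) -> dict:
    """Combine slow-moving batch features with a fresh streaming count."""
    merged = dict(batch_features)
    merged["txn_count_1h"] = velocity.count(now)
    merged["batch_feature_age_s"] = now - batch_updated_at  # staleness made explicit
    return merged


velocity = VelocityCounter(window_seconds=3600)
for ts in (0, 1800, 3500):
    velocity.add(ts)

features = merge_features(
    batch_features={"avg_monthly_spend": 1250.0},  # hypothetical value
    batch_updated_at=600,
    velocity=velocity,
    now=3600,
)
```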

The fraud ring exploit I mentioned earlier worked precisely because of this window. The attackers would compromise an account, make a series of small transactions to establish a low-risk velocity pattern, then execute a large transfer before the offline pipeline updated the account profile. The streaming velocity features detected the rapid-fire small transactions. The stale offline features made the account look established and low-risk. The model saw both and scored below the block threshold. The fix was moving several slow-moving features to a shorter update cadence: more compute, smaller exploitable window.

Model serving in this context is where teams consistently over-engineer in the wrong direction. The instinct after building a good feature pipeline is to serve the most accurate model possible. The constraint that is easy to forget is that model inference is not free, and the budget is shared with everything else.

A gradient boosted tree at reasonable depth scores a transaction in a few milliseconds. A deep learning model with several layers might take 30 to 50 milliseconds for inference alone, before network round-trips and serialization. In a 150-millisecond total budget, that is a significant difference. The model architecture decision is not just an accuracy decision. It is an accuracy-under-latency-constraint decision, which changes the option space considerably. In most production fraud detection systems I have worked with or studied, tree-based models serve the latency-critical path and deeper models are reserved for asynchronous review queues where latency does not bind.

The serving layer wraps the model in a microservice that handles feature retrieval, inference, and post-inference rule application. The rules layer matters and is frequently underappreciated. Raw model scores are not decisions. A score of 0.73 needs to become "block," "challenge," or "approve," and that mapping involves thresholds that encode business logic: acceptable fraud loss rate, tolerable false positive rate, cost of manual review. These are not model parameters and they are not determined by the data science team alone. They are business decisions that live in the serving layer and can be adjusted without retraining.
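A minimal sketch of that mapping follows. The threshold values are hypothetical; the point is that they live in configuration next to the model rather than inside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Business-owned cutoffs; changing them does not require a retrain."""
    challenge: float
    block: float


def decide(score: float, thresholds: Thresholds) -> str:
    """Turn a raw model score into one of three decisions."""
    if score >= thresholds.block:
        return "block"
    if score >= thresholds.challenge:
        return "challenge"
    return "approve"


current = Thresholds(challenge=0.60, block=0.85)   # hypothetical values
tightened = Thresholds(challenge=0.50, block=0.70)  # same model, stricter policy

decision_now = decide(0.73, current)
decision_tightened = decide(0.73, tightened)
```

The same score of 0.73 yields a challenge under the first configuration and a block under the second, with no change to the model.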

This separation is practically useful. When fraud patterns shift, threshold adjustments can happen faster than model retrains. When a specific merchant category or corridor needs different treatment, a rule handles it without touching the model. The hybrid approach, a model for core scoring combined with a rule layer for specific patterns and overrides, is more common in real production systems than pure model-driven approaches because different parts of the system need to change at different speeds and different teams need to own different parts.
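One way to picture the rule layer is as an ordered list of predicates evaluated after the model has produced its decision. The `Rule` structure, the rule name, and the corridor and amount fields below are all hypothetical, and the first-match-wins policy is an assumption; real systems often need explicit priorities and audit logging.

```python
from typing import Callable, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    name: str
    applies: Callable[[dict], bool]
    decision: str


def apply_rules(txn: dict, model_decision: str,
                rules: list) -> Tuple[str, Optional[str]]:
    """Return (decision, rule_name). The first matching rule overrides the model."""
    for rule in rules:
        if rule.applies(txn):
            return rule.decision, rule.name
    return model_decision, None


RULES = [
    Rule(
        name="high_value_cross_border",
        applies=lambda t: t["corridor"] != "domestic" and t["amount_usd"] >= 2000,
        decision="challenge",
    ),
]

override = apply_rules({"corridor": "US-NG", "amount_usd": 2400}, "block", RULES)
passthrough = apply_rules({"corridor": "domestic", "amount_usd": 40}, "approve", RULES)
```

Returning the rule name with the decision keeps it clear afterwards whether the model or a rule was responsible for an outcome.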

The false positive problem is the one that creates the most friction between the fraud team and everyone else in the business, and it is genuinely hard in a way that pure technical solutions do not fully resolve.

Every blocked legitimate transaction is a real user having a bad experience. A meaningful fraction calls support, which costs money. A smaller fraction does not bother to retry and takes their business elsewhere. The fraud system protected against a risk that did not exist and created a cost that is immediate and specific. The false negative problem, fraud that gets through, also costs money but arrives differently: as a monthly chargeback report with a dollar amount and a liability determination, absorbed by a defined fraud budget.

This asymmetry in how the costs arrive creates a predictable incentive problem. Fraud loss is visible, attributable, and reported upward. False positive cost is diffuse: spread across customer satisfaction metrics, support ticket volume, churn numbers, owned by different teams on different dashboards. Connecting a user's cancellation six weeks after a frustrating blocked transaction back to the fraud configuration requires cross-functional data work that many organizations do not do consistently.

What I have seen in practice is that false positive rates are tracked, periodically discussed, and then not reduced quickly enough because the threshold adjustment feels riskier than it is. Lowering the sensitivity has a known cost: a calculable number of additional fraudulent transactions. Keeping the threshold and losing some legitimate users to friction has a diffuse and harder-to-attribute cost. The incentive structure does not naturally favor aggressive false positive reduction, even when the aggregate cost of false positives exceeds the aggregate cost of the marginal fraud.

The international dimension adds a layer of complexity that does not show up in single-region systems or in academic treatments of the problem.

Cross-border payment corridors have different fraud profiles from domestic transactions. The features that predict fraud on a domestic debit card transaction are not the same features that predict fraud on a cross-border wire. Training a single model on a mixed corpus produces a model that is adequate at both and excellent at neither. Building corridor-specific models requires enough labeled fraud data per corridor to train reliably, which is a data volume problem that new or low-volume corridors may not meet for months.

Data inconsistency across regions is a subtler problem. Features computed from US transaction history do not have the same distribution as the same features computed from European or African transaction history, even for the same underlying user. Currency normalization, merchant category code mappings that differ between regional networks, transaction amount distributions that vary by cost of living: all of these introduce noise that looks like signal to a model trained naively on merged data. On consulting projects working with globally distributed payment data, normalizing these differences before feature computation was consistently one of the higher-value and lower-glamour pieces of the data engineering work.

Network latency across regions is a physical constraint that affects which architecture decisions are feasible. Serving real-time scoring from a single US-region feature store to transactions originating in Southeast Asia adds round-trip latency that the budget may not accommodate. Multi-region feature store replication, with its associated consistency tradeoffs, is often the answer. You are now managing the consistency of feature data across regions as an operational concern, and a replication lag that is acceptable for most features becomes a problem for the velocity features that need to be fresh.

The production failures that are hardest to prevent are the ones that look fine in testing and require a specific combination of conditions to surface.

Late events are the most common. In a cross-regional system, events from different sources arrive with variable lag depending on regional infrastructure health and the quality of the integration with each partner. When events arrive significantly late, the velocity features computed from the stream do not reflect the actual sequence of events in the real world. A transaction that occurred before a series of failed authentication attempts may arrive in the stream after them, making the account look safer than it was at the time of the transaction.
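One common mitigation on the feature-update path is to hold events briefly and release them in event-time order once a watermark has passed. The sketch below assumes numeric event times and a fixed allowed lateness; `ReorderBuffer` is a hypothetical name. It trades freshness for ordering, so it fits the pipeline that maintains features, not the synchronous scoring call, and an event that arrives later than the allowed lateness is still released out of order.

```python
import heapq


class ReorderBuffer:
    """Releases buffered events in event-time order once the watermark passes them."""

    def __init__(self, allowed_lateness: float):
        self.allowed_lateness = allowed_lateness
        self.heap = []
        self.max_event_time = float("-inf")
        self._seq = 0  # tie-breaker so payloads are never compared

    def push(self, event_time: float, payload: str) -> list:
        self.max_event_time = max(self.max_event_time, event_time)
        heapq.heappush(self.heap, (event_time, self._seq, payload))
        self._seq += 1
        return self._drain()

    def _drain(self) -> list:
        # Watermark: no event older than this is expected to arrive any more.
        watermark = self.max_event_time - self.allowed_lateness
        released = []
        while self.heap and self.heap[0][0] <= watermark:
            event_time, _, payload = heapq.heappop(self.heap)
            released.append((event_time, payload))
        return released


buf = ReorderBuffer(allowed_lateness=5)
first = buf.push(10, "auth_failure")   # arrives first, happened second
second = buf.push(8, "transaction")    # arrives late, happened first
third = buf.push(16, "auth_failure")   # advances the watermark to 11
```

After the third push, the transaction at time 8 is released ahead of the authentication failure at time 10, which matches the real-world sequence rather than the arrival sequence.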

Duplicate events have different but related consequences. In the fraud scoring pipeline, duplicate transaction events can inflate velocity counts, making a normal user look like they are transacting at an unusual rate. A retry storm on the event bus during a failover, not an uncommon occurrence, can produce a burst of duplicates that fires false positive blocks across a large number of legitimate transactions simultaneously. This has the useful property of being very visible very quickly. It has the less useful property of being visible to customers at the same time, which is not how you want to learn about event deduplication failures.

The deduplication logic that prevents this needs to be stateful, maintaining a window of seen event IDs to reject duplicates, and it needs to be correct under the specific failure conditions that cause duplicates to appear in the first place, which are precisely the conditions under which stateful stream processing is most likely to have consistency issues. Testing this correctly requires injecting duplicate events in a test environment under simulated failure conditions, which is the kind of test that tends not to get written until after the first production incident.
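The core of a windowed deduplicator is small; the hard part described above is keeping its state correct across failures. The sketch below is in-memory only and assumes `now` never goes backwards, so it would lose its window on a consumer restart, which is exactly the condition that needs testing. `Deduplicator` and the event IDs are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Rejects event IDs already seen within the last `ttl_seconds`."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen = OrderedDict()  # event_id -> first-seen time, oldest first

    def is_new(self, event_id: str, now: float) -> bool:
        # Evict IDs whose window has expired so memory stays bounded.
        while self.seen:
            _, first_seen = next(iter(self.seen.items()))
            if now - first_seen < self.ttl:
                break
            self.seen.popitem(last=False)
        if event_id in self.seen:
            return False  # duplicate inside the window: do not count it again
        self.seen[event_id] = now
        return True


dedup = Deduplicator(ttl_seconds=60)
results = [
    dedup.is_new("evt-1", now=0),    # first sighting
    dedup.is_new("evt-1", now=10),   # retry inside the window
    dedup.is_new("evt-2", now=20),   # different event
    dedup.is_new("evt-1", now=61),   # window expired, treated as new
]
```

The last call shows the tradeoff in the window size: a duplicate that arrives after the window has expired is counted again, so the window has to be longer than the longest plausible retry delay.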

Observability in a fraud detection system needs to be more specific than standard service monitoring to be useful. Track latency at p50, p95, and p99, broken down by component, so you know whether a slowdown is in feature retrieval, model inference, or the rules layer. Track queue depth on the event stream consumers as well, because a consumer that falls behind adds latency to decisions that have not been made yet, before any service metric shows a problem.
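A per-component percentile summary can be sketched as follows, using the nearest-rank definition of a percentile. The sample data is synthetic and the function names are hypothetical; in production this usually comes from a metrics system's histograms and not from raw samples.

```python
import math


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_by_component: dict) -> dict:
    """p50/p95/p99 per component, so a slowdown can be attributed to one stage."""
    return {
        component: {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
        for component, samples in latencies_by_component.items()
    }


# Synthetic latencies in milliseconds, for illustration only.
latencies = {
    "feature_retrieval": list(range(1, 101)),
    "model_inference": [2, 2, 3, 3, 3, 4, 4, 5, 9, 30],
}
summary = summarize(latencies)
```

In the second component the median is 3 ms while p95 and p99 are both 30 ms, the kind of tail that an average would hide.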

The false positive rate broken down by segment matters more than the aggregate. An overall false positive rate of 2% can hide an 8% rate on cross-border transactions and 0.5% on domestic ones. The remediation for those two situations is different, and the aggregate number obscures which direction to look. Segment-level false positive monitoring requires cross-functional data work to connect fraud decisions to downstream user behavior, but it is the version of the metric that is actually actionable.
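The arithmetic behind that example is worth making explicit. The counts below are synthetic, chosen so that the segment rates match the 8% and 0.5% figures above, and the rate is taken here to mean blocked legitimate transactions divided by all legitimate transactions in the segment.

```python
def fp_rate(blocked_legit: int, total_legit: int) -> float:
    """Share of legitimate transactions that were blocked."""
    return blocked_legit / total_legit


# segment -> (legitimate transactions blocked, legitimate transactions total)
segments = {
    "cross_border": (160, 2_000),
    "domestic": (40, 8_000),
}

by_segment = {name: fp_rate(b, n) for name, (b, n) in segments.items()}
overall = fp_rate(
    sum(b for b, _ in segments.values()),
    sum(n for _, n in segments.values()),
)
```

With cross-border traffic at 20% of volume, the aggregate comes out at 2% even though one segment is running at four times that rate.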

Label latency is the metric that is most often under-tracked and that carries the most downstream consequences. Fraud labels, the ground truth the model is trained and evaluated against, arrive with delay. Chargebacks take weeks. Manual reviews take days. Support tickets close without a formal fraud determination. The model is optimized against a delayed and incomplete view of reality, and if that delay is not measured and accounted for, evaluation metrics overstate actual system performance in ways that are difficult to detect until a model fails to catch an emerging fraud pattern it should have recognized.
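Measuring label latency can start as simply as the sketch below: the delay between a decision and its label, grouped by where the label came from, plus the share of decisions that never received one. The records, field names, and day-number timestamps are hypothetical.

```python
from statistics import median

# Hypothetical decision log; days are counted from an arbitrary start.
decisions = [
    {"txn_id": "t1", "decided_day": 0, "label_day": 3, "source": "manual_review"},
    {"txn_id": "t2", "decided_day": 0, "label_day": 45, "source": "chargeback"},
    {"txn_id": "t3", "decided_day": 1, "label_day": 5, "source": "manual_review"},
    {"txn_id": "t4", "decided_day": 2, "label_day": None, "source": None},
]


def label_latency_report(rows: list) -> dict:
    """Median label delay per source, and the share of decisions with no label yet."""
    delays = {}
    unlabeled = 0
    for row in rows:
        if row["label_day"] is None:
            unlabeled += 1
            continue
        delays.setdefault(row["source"], []).append(row["label_day"] - row["decided_day"])
    return {
        "median_delay_days": {src: median(d) for src, d in delays.items()},
        "unlabeled_share": unlabeled / len(rows),
    }


report = label_latency_report(decisions)
```

The unlabeled share matters as much as the delay: any evaluation run over recent decisions is silently computed on the labeled subset only.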

What actually moves these systems forward over time is less often model architecture changes and more often data infrastructure improvements. Better feature pipelines that reduce stale data. Tighter event deduplication that removes noise from velocity counts. Shorter label latency that makes training and evaluation more representative of production conditions. Faster feedback from manual review decisions into the feature store, so that patterns identified by human reviewers inform the model before they are exploited at scale.

The feedback loop from manual review is particularly valuable and often missing from the architecture in practice. When a human reviewer labels a blocked transaction as a false positive, that label is ground truth the model has not seen. If it arrives in the training pipeline in near-real time rather than in a weekly batch export, the model can adapt to emerging false positive patterns faster. Building that pipeline is data engineering work. It has more impact on system quality than most model changes, and it is consistently underinvested relative to the model work.

The model is downstream of the data. In fraud detection more than almost any other applied ML domain, the quality of the data pipeline determines the ceiling on model quality. A better model served stale features from a poorly maintained feature store will underperform a simpler model served fresh, consistent, well-deduped features. Every team that has spent enough time in this space learns this. Most of them learned it from a production incident rather than from a design review.