Production Systems · 12 min read

How M-Pesa Processes 10 Million Transactions a Day

Queues, validation, and settlement behind East Africa's most critical financial rail

Sam Odongo · Mar 2025

The first of every month in Kenya is not a normal traffic day. Employers batch-process salaries overnight. Landlords start collecting rent. School fees come due. Families remit money upcountry. The pattern does not taper in gradually: it hits. Between roughly eight and ten in the morning, M-Pesa transaction volumes spike to multiples of the daily average, sustained over hours, not a momentary peak. Everyone sends money at roughly the same time because everyone gets paid at roughly the same time. Every one of those transactions is expected to resolve quickly, correctly, and with a confirmation message the sender reads immediately.

Ten million transactions a day works out to about 116 per second on average. On a salary morning, that figure is plausibly five to ten times higher for an extended window. The system handling all of this has to maintain financial correctness throughout: no double charges, no debits without corresponding credits, no lost funds, every transaction resolving to a known state even when the mobile network between the user's handset and the processing layer is doing what Kenyan mobile networks sometimes do.
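The arithmetic behind those figures is quick to check. The snippet below is only a back-of-envelope calculation; the five and ten times multipliers are the estimate from the paragraph above, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_transactions = 10_000_000
average_tps = daily_transactions / SECONDS_PER_DAY

# Assumed peak multipliers, taken from the "five to ten times" estimate.
peak_low_tps = average_tps * 5
peak_high_tps = average_tps * 10

print(f"average: {average_tps:.1f} tx/s")
print(f"peak:    {peak_low_tps:.0f} to {peak_high_tps:.0f} tx/s")
```

The average alone hides the real design target: the system has to be sized for the sustained peak, not for the daily mean.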

That combination, high concurrency with strong correctness guarantees, is what makes this an interesting engineering problem. Most high-throughput systems can afford some approximation. A recommendation feed that drops a few items is fine. A search result that is a second stale is acceptable. A financial system cannot afford to be approximately right. Every transaction either happened or it did not. The amounts are exact. The record of what happened has to be auditable years later. Strong correctness and high throughput are in natural tension, and the architecture of a system like M-Pesa is largely the story of how that tension gets managed under real load.

Most M-Pesa transactions start over USSD. It is worth being precise about what that means, because USSD shapes the architecture in ways that are not immediately obvious. USSD is not a data protocol in the modern sense. It is a session-based signaling channel in the GSM stack, designed for simple menu-driven interactions, running on the same infrastructure that handles voice calls. When a user dials the M-Pesa shortcode, they are not making an HTTP request. They are opening a signaling session through the mobile network, and that session is time-bounded. It has a timeout. If a response does not arrive within a few seconds, the session drops.

This timeout is a hard architectural constraint. You cannot tell a USSD user to hold while you run a distributed transaction involving balance checks, fraud signals, and ledger writes across multiple systems. The end-to-end latency of that full processing chain is too high and too variable to fit inside a USSD response window, especially under load. Which means the gateway receiving the USSD session cannot be the same thing that completes the transaction.

The answer to this is queuing, and it is not a workaround. It is the core architectural decision. The USSD gateway receives the transaction request, does the minimum validation needed to confirm the session is legitimate and the request is well-formed, and places the transaction into a processing queue. The user receives a prompt quickly enough to keep the USSD session alive. Processing happens downstream, asynchronously, at whatever rate the processing layer can sustain without being overwhelmed.

The queue is a buffer between the rate at which requests arrive and the rate at which the system can safely process them. On a quiet Tuesday afternoon, the buffer is nearly empty and the experience is effectively synchronous. On a salary morning, the buffer absorbs the spike and the processing layer works through it at a rate the downstream systems can handle. The user sees a slightly longer wait. The alternative is worse.
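The accept-then-process split can be sketched in a few lines. This is a toy model under stated assumptions, not M-Pesa's implementation: the names `run_burst`, `accept`, and `worker` are hypothetical, the queue is an in-process `queue.Queue` rather than a durable broker, and the `time.sleep` call stands in for the real validation and ledger work.

```python
import queue
import threading
import time

def run_burst(n_requests, processing_seconds=0.001):
    """Accept a burst immediately, then let one worker drain it at its own pace."""
    pending = queue.Queue()          # buffer between arrival rate and processing rate
    results = {}                     # request_id -> final status

    def accept(request_id, payload):
        # Gateway side: only cheap well-formedness checks, then enqueue.
        amount = payload.get("amount")
        if not isinstance(amount, int) or amount <= 0:
            return "REJECTED_MALFORMED"
        pending.put((request_id, payload))
        return "ACCEPTED"

    def worker():
        # Processing side: runs at whatever rate downstream can sustain.
        while True:
            item = pending.get()
            if item is None:         # sentinel: stop the worker
                break
            request_id, _payload = item
            time.sleep(processing_seconds)   # stand-in for validation and ledger work
            results[request_id] = "COMPLETED"
            pending.task_done()

    thread = threading.Thread(target=worker)
    thread.start()
    acks = [accept(i, {"amount": 100}) for i in range(n_requests)]
    pending.join()                   # wait until the backlog is drained
    pending.put(None)
    thread.join()
    return acks, results
```

Calling `run_burst(20)` returns twenty immediate acknowledgements and, once the worker has caught up, twenty completed results. The acknowledgements do not wait on the processing time, which is the property the USSD timeout demands.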

Direct synchronous processing at peak load does not degrade gracefully. It causes timeouts, which cause retries, which increase load, which cause more timeouts. This failure pattern, a retry storm feeding back into an already overloaded system, is one of the most reliably destructive things that can happen to high-throughput infrastructure. The queue-based architecture prevents it by decoupling acceptance from processing. You can accept requests at the arrival rate without being forced to process them at that rate simultaneously. This is standard event-driven architecture thinking, applied to a very specific pressure problem.

Before a transaction enters the processing queue, it goes through validation, and validation is where most of the correctness guarantees live. The checks involved are deceptively simple to list: does the sending account have sufficient balance, is the PIN correct, is the transaction within the limits set for the account, is the recipient in a valid state. In practice, each of these has failure modes that require careful handling.

Balance checking is a concurrency problem. If two transactions against the same account arrive simultaneously and both check the balance before either updates it, both might see a balance sufficient for their individual amount, but not for both together. The classic read-modify-write race condition. At the transaction volumes a system like this handles, this is not theoretical. High-activity accounts, merchants, aggregators, users who receive and send frequently, will see concurrent transaction attempts regularly. The validation layer has to handle this correctly, which typically means some form of locking or optimistic concurrency control at the account level, with appropriate handling for the case where the check-and-update fails because another transaction modified the balance first.
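One common shape for the optimistic variant is a version number on each account row and a compare-and-set write. The sketch below assumes a hypothetical in-memory `AccountStore`; in a real system the compare-and-set would be a conditional update in the database, and I am not claiming this is how M-Pesa does it.

```python
import threading

class AccountStore:
    """In-memory stand-in for a store with per-account compare-and-set."""

    def __init__(self):
        self._rows = {}                  # account_id -> (balance, version)
        self._lock = threading.Lock()    # models the store's own atomic write

    def create(self, account_id, balance):
        with self._lock:
            self._rows[account_id] = (balance, 0)

    def read(self, account_id):
        with self._lock:
            return self._rows[account_id]

    def compare_and_set(self, account_id, expected_version, new_balance):
        with self._lock:
            _balance, version = self._rows[account_id]
            if version != expected_version:
                return False             # another transaction wrote first
            self._rows[account_id] = (new_balance, version + 1)
            return True


def debit(store, account_id, amount, max_attempts=5):
    """Check-and-update that cannot be defeated by a concurrent debit."""
    for _ in range(max_attempts):
        balance, version = store.read(account_id)
        if balance < amount:
            return "INSUFFICIENT_FUNDS"
        if store.compare_and_set(account_id, version, balance - amount):
            return "OK"
        # Version moved under us: re-read and re-check instead of overwriting.
    return "CONFLICT"
```

If ten concurrent debits of 30 hit an account holding 100, exactly three succeed and the rest see insufficient funds, because every write is conditional on the balance that was actually checked.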

PIN verification crosses the boundary between the telecom layer and the financial layer in a way that has security implications. Failed PIN attempts have to be tracked and rate-limited: unlimited retries turn an account takeover from a cryptographic problem into a brute-force problem, and a short numeric PIN offers little resistance to brute force. The rate limiting has to work correctly even when requests arrive through different routes or when the same user generates multiple failed attempts from frustration rather than from an attack. Getting this wrong in either direction, too loose or too aggressive, has real user impact.
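A minimal sketch of attempt tracking, assuming a simple lockout policy. The class name, the limit of three failures, and the five-minute lockout are all illustrative values I chose, not M-Pesa's actual policy. The counter is keyed on the account rather than on the channel, so attempts arriving through different routes share one budget.

```python
import time

class PinAttemptLimiter:
    def __init__(self, max_failures=3, lockout_seconds=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.lockout_seconds = lockout_seconds
        self._clock = clock
        self._failures = {}      # account_id -> consecutive failed attempts
        self._locked_until = {}  # account_id -> time at which the lock expires

    def is_locked(self, account_id):
        until = self._locked_until.get(account_id)
        if until is None:
            return False
        if self._clock() >= until:
            # Lock expired: clear state so the user gets a fresh budget.
            del self._locked_until[account_id]
            self._failures.pop(account_id, None)
            return False
        return True

    def record_attempt(self, account_id, pin_matched):
        """pin_matched is the result of the real PIN check, done elsewhere."""
        if self.is_locked(account_id):
            return "LOCKED"      # do not even evaluate the PIN while locked
        if pin_matched:
            self._failures.pop(account_id, None)
            return "OK"
        count = self._failures.get(account_id, 0) + 1
        self._failures[account_id] = count
        if count >= self.max_failures:
            self._locked_until[account_id] = self._clock() + self.lockout_seconds
            return "LOCKED"
        return "WRONG_PIN"
```

The two tuning knobs are exactly the "too loose or too aggressive" tradeoff: a higher failure limit is kinder to a frustrated user and kinder to an attacker in equal measure.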

Fraud signal evaluation adds another layer. In a system used for salary payments, school fees, and daily purchases, the transaction patterns are rich and consistent enough to establish behavioral baselines. Sudden large transfers to new recipients, transactions at unusual hours, sequences that match known fraud patterns: these generate signals that the validation layer likely incorporates, either as hard rejections or as triggers for additional verification steps. The specifics of how M-Pesa implements this are not public, but in any financial system at this scale, fraud detection is not optional and it is not something that runs as a batch job after the fact.

Once validation passes, the actual transaction processing is what distributed systems engineers would recognize as an atomicity problem. You need to debit the sender and credit the recipient such that either both happen or neither happens, and this guarantee has to hold even if the processing system crashes partway through. The naive implementation, debit the sender and then credit the recipient as two sequential operations, has an obvious failure mode: the debit succeeds and then the process dies before the credit. The sender's balance has decreased. The recipient has received nothing. Real money is missing from the ledger.

The standard approach in financial systems is a transaction journal or write-ahead log: before executing any ledger operation, write a record of the intended transaction to durable storage. If the process fails mid-execution, the recovery process reads the journal, identifies incomplete transactions, and either completes or reverses them deterministically. The journal is the source of truth; the ledger state is derived from it. This is not a novel idea. It is how databases have handled atomicity for forty years. Applied to a distributed system where sender and recipient may be on different account partitions, the coordination is more complex, but the principle is the same.
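To make the journal idea concrete, here is a deliberately small sketch in which balances are never stored directly but derived by replaying the journal. Everything in it is an assumption for illustration: the `JournaledLedger` class, the `INTENT`/`COMMIT`/`ABORT` record types, the in-memory list standing in for durable storage, and the recovery policy of aborting unfinished transfers (a real system might roll some of them forward instead).

```python
class JournaledLedger:
    def __init__(self, opening_balances):
        self._opening = dict(opening_balances)
        self.journal = []            # append-only; stands in for durable storage

    def transfer(self, tx_id, src, dst, amount, crash_before_commit=False):
        # 1. Record the intent before any balance can change.
        self.journal.append({"type": "INTENT", "tx": tx_id,
                             "src": src, "dst": dst, "amount": amount})
        if crash_before_commit:
            raise RuntimeError("simulated crash after INTENT")
        # 2. A single COMMIT record makes debit and credit visible together.
        self.journal.append({"type": "COMMIT", "tx": tx_id})

    def recover(self):
        """Resolve every INTENT that has no outcome. Policy here: abort."""
        resolved = {r["tx"] for r in self.journal if r["type"] in ("COMMIT", "ABORT")}
        dangling = [r["tx"] for r in self.journal
                    if r["type"] == "INTENT" and r["tx"] not in resolved]
        for tx_id in dangling:
            self.journal.append({"type": "ABORT", "tx": tx_id})
        return dangling

    def balances(self):
        """Derive balances by replaying committed transfers only."""
        committed = {r["tx"] for r in self.journal if r["type"] == "COMMIT"}
        result = dict(self._opening)
        for r in self.journal:
            if r["type"] == "INTENT" and r["tx"] in committed:
                result[r["src"]] -= r["amount"]
                result[r["dst"]] += r["amount"]
        return result
```

Because a transfer only counts once its `COMMIT` record exists, there is no state in which the sender has been debited and the recipient has not been credited. A crash leaves an unresolved `INTENT`, which `recover()` finds and closes out deterministically.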

The confirmation message the user receives is not just a notification. It is the output of this entire chain executing correctly. "You have sent KES 5,000 to John. New M-Pesa balance: KES 2,340." Every number in that message is an assertion about the current state of the ledger, and it needs to be accurate. Confirming before the transaction is durably committed is worse than not confirming at all, because the user immediately acts on that information. In a financial system, the confirmation is a promise.

Below the real-time transaction processing is a layer most users never think about: settlement. M-Pesa electronic value is backed by float, real money held in trust accounts at commercial banks. When a transaction happens, the M-Pesa ledger updates in real time. The underlying cash does not move in real time. It moves through interbank settlement processes that run on their own schedule, typically end-of-day or at defined settlement windows.

This means two ledgers are running simultaneously: the M-Pesa ledger tracking real-time electronic balances, and the underlying trust account positions that settle on a slower cadence. These two ledgers have to agree. Reconciliation processes run to verify that they do, and when discrepancies appear, which they do, they have to be investigated and resolved. This is the unglamorous but essential part of financial operations at scale: the daily reconciliation that catches anything the real-time processing miscounted, missed, or double-recorded.
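At its core, reconciliation is a keyed comparison between two sets of totals. The sketch below assumes both sides can be reduced to a mapping from some shared key to an amount in minor units; the function name, the key scheme, and the discrepancy labels are hypothetical, and real reconciliation also has to deal with timing cut-offs that this ignores.

```python
def reconcile(ledger_totals, trust_positions):
    """Compare two {key: amount} mappings and list every disagreement.

    Amounts are integers in minor units to avoid floating-point drift.
    """
    discrepancies = []
    for key in sorted(set(ledger_totals) | set(trust_positions)):
        ledger_amount = ledger_totals.get(key)
        trust_amount = trust_positions.get(key)
        if ledger_amount is None:
            discrepancies.append(("MISSING_IN_LEDGER", key, trust_amount))
        elif trust_amount is None:
            discrepancies.append(("MISSING_IN_TRUST", key, ledger_amount))
        elif ledger_amount != trust_amount:
            # Positive difference: the ledger shows more value than the bank holds.
            discrepancies.append(("MISMATCH", key, ledger_amount - trust_amount))
    return discrepancies
```

An empty result means the two views agree. Anything else is a work item for a human or a downstream investigation process, which is where most of the real effort goes.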

For cross-system transfers, M-Pesa to bank account, M-Pesa to another mobile money system, the settlement complexity increases. These operations cross organizational and technical boundaries, involve more parties in the settlement chain, and introduce delays that pure M-Pesa transactions do not have. The user-visible difference is in timing and sometimes in the confirmation flow, because these transfers pass through more coordination layers.

Failure handling is where system design gets honest about what it expects to actually happen. Networks drop. USSD sessions time out mid-transaction. Servers crash between operations. Storage writes fail. At ten million transactions per day, all of these are regular events, not edge cases. The system is built around this expectation, not in spite of it.

A network failure during a USSD session creates a specific and well-known problem. The user initiates a transaction. It enters processing. The USSD session drops before the confirmation arrives. The user does not know whether the transaction executed. They retry. Now the system potentially has two requests for the same operation, and it must process exactly one. This is the idempotency problem, and it is solved by assigning each transaction a unique identifier, ensuring that a retry of the same operation carries or maps to that same identifier, and making the processing layer check for it before executing. If a transaction with that ID already completed, return the existing result. Do not execute it again.

This requires the transaction store to be consulted on every incoming request, which is an additional read on every transaction. That is a real cost at this scale. It is also not optional: the alternative, processing duplicates, means users get double-charged or funds get double-credited, which is worse than the latency cost of an extra read.
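The check-before-execute pattern looks roughly like this. It is a sketch under assumptions: `IdempotentProcessor` is a hypothetical name, the dictionary stands in for the transaction store, and the single lock serializes all processing, which a real system would replace with something like a unique-key insert per transaction ID. It also assumes the retry arrives with the same identifier as the original request.

```python
import threading

class IdempotentProcessor:
    def __init__(self):
        self._results = {}               # tx_id -> stored result of the first execution
        self._lock = threading.Lock()    # simplification: one lock for everything
        self.executions = 0              # how many times real work actually ran

    def process(self, tx_id, execute):
        """Run execute() at most once per tx_id; replay the stored result after that."""
        with self._lock:
            if tx_id in self._results:
                # Duplicate or retry: return what already happened, do nothing new.
                return self._results[tx_id]
            result = execute()
            self.executions += 1
            self._results[tx_id] = result
            return result
```

Submitting `"tx-001"` twice runs the underlying operation once and hands back the same result both times. The lookup at the top of `process` is the extra read described above.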

Retry storms are a distinct failure mode that requires explicit mitigation. When the processing layer slows down or becomes temporarily unavailable, clients retry. If all clients retry on the same interval, they create a synchronized wave of requests hitting the system simultaneously. If the system is already under pressure, that wave makes the situation worse and can turn a temporary slowdown into a full outage. The mitigation is exponential backoff with jitter: clients wait progressively longer between retries, with randomness added to desynchronize them. Whether the USSD stack implements this precisely in these terms is something I cannot say from the outside, but the systems that do not implement some version of this learn the lesson the hard way.
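For readers who have not written one, a retry loop with exponential backoff and jitter is short. This sketch uses the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing ceiling. The `TransientError` type, the function name, and the default values are illustrative assumptions, and nothing here describes what the USSD stack actually does.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def call_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                       # out of attempts: surface the error
            # Ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to the cap.
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter: wait anywhere in [0, ceiling] to desynchronize clients.
            sleep(rng.uniform(0, ceiling))
```

The randomness matters as much as the exponent. Without it, clients that failed together retry together, and the backoff merely spaces out the waves instead of flattening them.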

I spent about four months at Safaricom in early 2013, in the Network Operations Center, as an intern. I was not working on M-Pesa architecture. I was monitoring 2G and 3G network performance: watching live network metrics, observing how fault management systems surfaced problems, learning what real-time system pressure looks like when the underlying infrastructure is a live mobile network.

What that experience gave me that is relevant here is a concrete sense of how infrastructure failures actually present themselves. They rarely announce themselves cleanly. They arrive as signals: a degrading metric on one cell site, increased error rates on a particular channel, latency trending upward on a backhaul link. The work of the NOC was largely about catching those early signals before they became customer-visible incidents. That required the right monitoring, the right thresholds, and the right human response procedures. The engineers watching those dashboards had calibrated intuitions about which metrics represented normal variance and which represented the early stages of something that would compound if ignored.

The same principle applies to any large-scale production system. At the transaction volumes M-Pesa operates at, something is almost always slightly wrong somewhere: a slow database replica, a queue that is backing up, a server running hotter than expected. The question is not whether problems exist but whether they are visible, whether the thresholds are set to catch them before they cascade, and whether the operational response is fast enough to intervene before degradation becomes failure. Reliability at this scale is not a property you achieve once at launch. It is something you actively observe and defend, continuously, with a monitoring stack that makes problems findable before users find them first.

One observation from the NOC that stuck with me: the systems that looked fine on the summary dashboards were not always the systems that were fine. The detail mattered. A cell site that was technically operational but running at elevated error rates was not the same as one running cleanly, even though both showed as "up." At financial transaction scale, the equivalent distinction, technically processing versus processing correctly at expected quality, is the difference between a system you can trust and one that is quietly accumulating problems.

There is a tradeoff that gets made so routinely in systems like this that it barely registers as a decision: under load, the system slows down rather than fails. This is intentional. When the processing queue fills up and confirmations take a few more seconds than usual, that is the system behaving correctly. It is accepting load at the incoming rate, processing at the maximum sustainable rate, and distributing the delay across users rather than crashing and giving everyone an error.

From the user's perspective, a five-second wait for a confirmation is annoying. A failed transaction is worse. A failed transaction on salary day, when someone is trying to pay rent or send fees and the window to do it is closing, is the kind of experience that damages trust in ways that take a long time to rebuild. The engineering decision to prefer graceful degradation over hard failure reflects the reality that the cost of a slow response is much lower than the cost of a wrong one, and much lower than the cost of a failed one in a system whose primary asset is trust.

Clean rejection is part of this too. When the queue reaches capacity, new requests need to be rejected explicitly, with a clear error that tells the client to try again later, rather than accepted into a backlog the system cannot drain. An explicit rejection is information. It tells the client what happened. It allows the client to retry intelligently. Accepting a request into an unbounded queue and then timing out hours later leaves the user with no information and no recourse. Graceful degradation includes graceful refusal.
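Graceful refusal is easy to sketch with a bounded queue. The class name, the status strings, and the `retry_after_seconds` hint are assumptions for illustration; the point is only that a full queue produces an immediate, explicit answer instead of a silent backlog.

```python
import queue

class BoundedIntake:
    def __init__(self, capacity, retry_after_seconds=5):
        if capacity < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the purpose.
            raise ValueError("capacity must be at least 1")
        self._queue = queue.Queue(maxsize=capacity)
        self._retry_after_seconds = retry_after_seconds

    def submit(self, request):
        try:
            self._queue.put_nowait(request)   # never block the caller
        except queue.Full:
            # Explicit refusal: tell the client what happened and when to retry.
            return {"status": "REJECTED_BUSY",
                    "retry_after_seconds": self._retry_after_seconds}
        return {"status": "ACCEPTED"}

    def take(self):
        """Called by the processing layer to pull the next request."""
        return self._queue.get_nowait()
```

With a capacity of two, the third submission is refused at once with a retry hint, and a slot opens again as soon as the processing layer takes an item.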

The most common misconception about a system like M-Pesa is that it is simple: money goes in, money comes out, there is a record. This view badly underestimates the coordination problem.

Moving electronic value between accounts is not the hard part. The hard parts are what surrounds it. Ensuring exactly one debit when a user initiates one transaction, even when the network creates ambiguity about whether the first request arrived. Ensuring balance checks are not defeated by concurrent transactions against the same account. Ensuring failures in any individual component can be detected and recovered from without corrupting the ledger. Ensuring the system continues to function under degraded performance when parts of the infrastructure are unavailable, which on a mobile network is a routine condition rather than a disaster scenario. Ensuring every transaction is eventually auditable, because financial records do not expire and disputes arrive years after the fact.

These are distributed systems problems with well-understood names: idempotency, optimistic concurrency control, write-ahead logging, reconciliation, circuit breaking, graceful degradation. The interesting engineering in a system like M-Pesa is not the invention of new solutions but the precise application of these principles at this specific scale, against a mobile network infrastructure that varies in ways that datacenter-to-datacenter distributed systems do not, with financial correctness requirements that most high-throughput systems do not face.

I cannot tell you how M-Pesa's internal architecture works in detail. What I can say, with reasonable confidence, is that the problems it is solving are recognizable, the principles for addressing them are well-established, and the difficulty is in the execution: building, operating, and continuously improving a system at this scale, on this infrastructure, serving this many people for whom it is not a convenience but the primary way money moves. That is what makes it worth understanding.
