Designing for Intermittent Connectivity
What breaks when the network disappears, and how to design around it.
The dashboard looked fine. That was the first problem.
The pipeline had been running for six weeks without incident, pulling transaction records from a regional API, enriching them with customer metadata, and writing aggregates to a reporting database. The team had tested it. It passed. The SLA was four-nines uptime and they were hitting it comfortably. Then a submarine cable issue on the East African coast caused intermittent packet loss for about four hours one Tuesday afternoon. Not a full blackout. Just enough degradation that TCP connections were timing out unpredictably, retries were silently failing, and the pipeline's error handling, which had been designed for "API returns 500," had no idea what to do with "API eventually returns nothing."
The dashboard stayed green. The pipeline stayed running. And for four hours it was writing stale data, filling gaps with the last successful value it had seen, because that is what the fallback code path did. Nobody had read that code path in months.
The team discovered the problem three days later when a finance analyst noticed the daily transaction counts did not match the bank reconciliation. By then the data was already in the warehouse, already in reports, already trusted.
That is what intermittent connectivity actually looks like in production. Not a dramatic outage that pages everyone at 3am. A quiet, partial degradation that the system absorbs in the wrong way, and that you find later, when it is more expensive to fix.
Most cloud-native systems are designed with an assumption baked so deep it is almost invisible: the network is there. Not necessarily fast, not necessarily cheap, but there. AWS SDKs have default timeouts, sure, but they are designed for flaky microservices within the same region, not for a client sitting behind a mobile network during evening peak hours when half the city's traffic is routing through the same congested uplink.
The cloud architecture playbook treats connectivity as a managed resource. You pick your region, you get your SLA, and you design around the assumption that your instances can reach each other and the internet at predictable latencies. That assumption holds well for systems that live entirely in the cloud. It falls apart when any part of the chain touches the real world: a field sensor, an on-premise database, an office network, a mobile device that is also the only connectivity a small business has.
I have built pipelines that span both. One for a logistics client where the source data came from a fleet of trucks running Android tablets over cellular, syncing trip records to an AWS Kinesis stream whenever they had signal. Another for a client whose data warehouse was on-premise, behind an internet connection that was excellent 95% of the time and absent the other 5%, in no predictable pattern. The cloud tooling was identical in both cases. The operational behavior was entirely different.
When connectivity disappears, the first thing you discover is that most systems do not actually know it is gone. They think they are waiting. TCP is patient. An application that opens a socket and does not get a response will often sit there, holding the connection open, consuming a thread or a file descriptor, waiting for something that is not coming. Connection pools get exhausted. Thread pools get exhausted. Eventually the system starts rejecting new work, not because it knows it is offline, but because it has run out of internal resources waiting for a network that is not responding.
What does not happen, usually, is a clean failure. The clean failure, where your system says "I am offline, I will stop accepting work, I will buffer what I have, I will resume when connectivity returns," requires explicit design. Nobody gets it for free.
During that four-hour window, each record would attempt to enrich itself by calling an external customer metadata API. The API call would time out after 30 seconds. The pipeline retried with exponential backoff, and the backoff reached 120 seconds before giving up. Each record was therefore taking between 30 and 150 seconds to fail, compared to the 200 milliseconds it took in normal operation. Throughput dropped from thousands of records per minute to dozens. The backlog grew. Eventually the pipeline started dropping records from the head of the queue to keep up with incoming data. Not logging them. Dropping them.
All of this because the failure mode was slow rather than fast. Slow failures are the dangerous ones. A system that fails fast is a system you can design around. A system that hangs takes everything with it.
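One way to turn a slow failure into a fast one is to give each record a total time budget, instead of letting per-attempt timeouts and backoff delays stack up. The sketch below is a minimal illustration under stated assumptions, not the pipeline's actual code: `call_with_budget`, `DeadlineExceeded`, and the default delays are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeadlineExceeded(Exception):
    """Raised when the total time budget for one unit of work is used up."""


def call_with_budget(
    operation: Callable[[float], T],
    budget_seconds: float,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff inside a fixed time budget.

    `operation` receives the remaining time and is expected to use it as its
    own timeout, so one hung call cannot outlive the budget.
    """
    deadline = clock() + budget_seconds
    delay = base_delay
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded(f"gave up after {budget_seconds}s")
        try:
            return operation(remaining)
        except (TimeoutError, ConnectionError):
            # Back off, but never sleep past the deadline.
            sleep(min(delay, max(0.0, deadline - clock())))
            delay = min(delay * 2, max_delay)
```

With a budget of a few seconds per record, a dead dependency costs a bounded, predictable amount of time, and the caller can route the record to a buffer or a dead-letter path instead of stalling the queue behind it.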
The real engineering problem is not "how do I retry a request." That is the entry point, but it is not where the complexity lives. The real problems are: how do I preserve data integrity across an outage of unknown duration, how do I ensure that when I come back online the data I process is consistent with what I missed while I was offline, and how do I do this without creating duplicates.
Idempotency is the word that comes up in every distributed systems textbook, but it is worth being concrete about what it actually requires. An idempotent operation can be applied multiple times and produce the same result. That sounds straightforward until you think about what it demands from your data model.
Consider a simple case: you are writing a record to a database when connectivity drops. Your write fails. You retry. Connectivity is back. You write again. If your database does not know this is a retry, you now have two records. If you are counting transactions, you have counted one transaction twice. If you are calculating a running total, you have added the same value twice.
The fix is either a unique key that the database enforces, so the second write becomes an upsert, or a client-generated idempotency token that the server remembers. Both require forethought. The first requires schema design. The second requires the server to keep state about what it has already processed, and that state needs storage, needs a retention policy, and is one more thing that can fail. In practice, for most Kafka-based pipelines, exactly-once semantics is what you want, but transactional APIs and careful consumer group management are what you need to actually get there. And even then, what you usually achieve is at-least-once delivery with idempotent consumers, which is close enough for most use cases but is not the same thing.
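The unique-key approach can be sketched with SQLite from the Python standard library. The table name, column names, and the `record_transaction` helper are assumptions made for illustration. Note that this variant ignores the retried write rather than overwriting the row, which is enough when a retry carries identical data.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store whose primary key makes duplicate writes harmless."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS transactions (
            txn_id       TEXT PRIMARY KEY,  -- generated by the client, once per transaction
            amount_cents INTEGER NOT NULL
        )
        """
    )
    return conn


def record_transaction(conn: sqlite3.Connection, txn_id: str, amount_cents: int) -> bool:
    """Write a transaction at most once.

    Returns True if a new row was written, False if this txn_id was already
    stored (in other words, the call was a retry).
    """
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "INSERT OR IGNORE INTO transactions (txn_id, amount_cents) VALUES (?, ?)",
            (txn_id, amount_cents),
        )
    return cursor.rowcount == 1
```

The important design choice is that `txn_id` is assigned where the transaction originates, before the first write attempt. A key generated at write time would differ on every retry and would protect nothing.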
Local queuing and durable buffering are what make offline behavior recoverable. The pattern is not complicated in principle: when you cannot send data, you store it somewhere durable. When you can send it, you flush the buffer. The devil is in the "somewhere durable" part.
Memory does not count. A buffer in memory is lost when the process dies, which is exactly what happens when the power blinks and the server restarts. I have seen this failure more times than I would like. The system is designed to buffer during outages, but the buffer lives in the application's heap, and the first time there is a power event the buffer disappears along with everything in it.
Disk is better but not free. You need to decide how much disk to allocate, what happens when it fills up, and how you handle corrupted writes if the disk write itself is interrupted by a power event. Write-ahead logging, the pattern that databases use internally, is the right answer here. You write to a log file sequentially, you flush before acknowledging, and you replay the log on startup. It is not glamorous, but it survives power events in a way that in-memory buffers do not.
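The write, flush, replay cycle can be sketched in a few lines. This is a deliberately small illustration, assuming one JSON record per line; `AppendLog` and `replay` are hypothetical names, and a production log would also want checksums, file rotation, and a disk-full policy.

```python
import json
import os
from typing import Iterator


class AppendLog:
    """Append-only log that is flushed to disk before each write is acknowledged."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._fh = open(path, "ab")

    def append(self, record: dict) -> None:
        line = json.dumps(record, separators=(",", ":")).encode("utf-8") + b"\n"
        self._fh.write(line)
        self._fh.flush()             # push Python's buffer to the OS
        os.fsync(self._fh.fileno())  # ask the OS to push it to the device

    def close(self) -> None:
        self._fh.close()


def replay(path: str) -> Iterator[dict]:
    """Yield records in write order, stopping at a torn or corrupt tail."""
    if not os.path.exists(path):
        return
    with open(path, "rb") as fh:
        for raw in fh:
            if not raw.endswith(b"\n"):
                break  # final write was interrupted mid-line
            try:
                yield json.loads(raw)
            except ValueError:
                break  # corrupt line: treat everything after it as lost
```

Only after `append` returns should the caller acknowledge the record upstream. On startup, a real implementation would also truncate the file back to the last complete line before appending again, so a torn tail does not fuse with the next record.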
For data pipelines specifically, the cleanest solution I have worked with is running a local Kafka instance or equivalent as a durable queue, and treating it as the primary write target regardless of whether the downstream systems are available. The pipeline writes to local Kafka. A separate process reads from local Kafka and forwards to the cloud. When connectivity drops, the forwarding process stalls, the local queue grows, and nothing is lost. When connectivity returns, the forwarding process catches up. The pipeline never knows the cloud was unavailable.
This architecture adds latency and adds infrastructure, but it changes the system's relationship to connectivity from dependent to tolerant. That tradeoff is almost always worth it when the alternative is silent data loss.
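The same split between a local write target and a separate forwarder can be shown without Kafka. The sketch below swaps in a SQLite table as the durable local queue purely to keep the example self-contained; `open_outbox`, `enqueue`, `forward_pending`, and the `send` callback are hypothetical names, and `send` stands in for whatever cloud client the forwarder would really use.

```python
import sqlite3
from typing import Callable


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local queue. Use a file path in practice so it survives restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> None:
    """The pipeline's only write target: always local, always available."""
    with conn:
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def forward_pending(
    conn: sqlite3.Connection,
    send: Callable[[str], None],
    batch_size: int = 100,
) -> int:
    """Forward queued records in order and return how many were delivered.

    A record is deleted only after `send` returns, and the loop stops at the
    first network failure so ordering is preserved for the next attempt.
    """
    rows = conn.execute(
        "SELECT seq, payload FROM outbox ORDER BY seq LIMIT ?", (batch_size,)
    ).fetchall()
    forwarded = 0
    for seq, payload in rows:
        try:
            send(payload)
        except (ConnectionError, TimeoutError):
            break  # offline: leave this record and everything after it queued
        with conn:
            conn.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
        forwarded += 1
    return forwarded
```

If the process dies between `send` and the delete, the record is sent again on restart. That makes this at-least-once delivery, so the receiving side still needs to be idempotent.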
Reconnection is where things get hard in a different way.
When a system comes back online after a multi-hour outage, it has work to catch up on. In a well-designed system that buffered everything, it now has hours of buffered data waiting to be processed. It will try to process all of it as fast as it can. And the downstream systems, which have been receiving nothing for four hours, are about to receive four hours of data in whatever time the catch-up takes.
Downstream systems are usually not designed for this. A database that normally receives 1,000 writes per minute is not going to handle 10,000 writes per minute gracefully, especially if it has foreign key constraints and index updates and triggers running on each write. An API that normally sees 100 calls per minute will start rate-limiting at ten times that load. A dashboard that was rendering normally will start querying the database for data that is being written concurrently and will return inconsistent snapshots.
Catch-up processing needs to be rate-limited. This sounds obvious and is almost never implemented before the first outage teaches the lesson. The forwarding process should have a configurable rate limit. When it detects it is in catch-up mode because lag is large, it should process at a rate the downstream can absorb, not at the maximum rate possible. This extends the catch-up time but prevents the downstream from being hammered. It also means your system is in a state of partial consistency during catch-up: the source has data that the downstream has not processed yet. Any query during this window will return data that is behind the source. That is fine, and it is the right behavior, but it needs to be understood by the people reading dashboards.
Downstream inconsistency during catch-up is not a bug. Downstream inconsistency that is invisible and that nobody knows about is a bug.
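A rate-limited catch-up loop does not need much machinery. The following is a minimal pacing sketch, assuming the forwarder can iterate over its backlog and call a `send` function per record; `drain_paced` and its parameters are hypothetical names for illustration.

```python
import time
from typing import Callable, Iterable


def drain_paced(
    records: Iterable[str],
    send: Callable[[str], None],
    max_per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Forward a backlog no faster than `max_per_second`; return the count sent."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    interval = 1.0 / max_per_second
    next_slot = clock()
    sent = 0
    for record in records:
        wait = next_slot - clock()
        if wait > 0:
            sleep(wait)
        send(record)
        sent += 1
        # Schedule the next send one interval later. If sending was slow and we
        # are already past that slot, restart from now instead of bursting.
        next_slot = max(next_slot + interval, clock())
    return sent
```

The catch-up time becomes easy to reason about: a backlog of N records at a limit of R per second takes roughly N / R seconds, which is a number you can put on a dashboard while the system converges.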
Eventual consistency gets thrown around loosely. In practice, for systems designed around intermittent connectivity, it means: there will be windows where different parts of your system disagree about the current state of the world, and those windows can last longer than you expect, and that is acceptable as long as they converge on the right answer eventually.
The "eventually" is doing a lot of work in that sentence. You need to decide what it means for your use case. If you are building a real-time fraud detection system, eventually consistent probably means seconds or minutes. If you are building a daily analytics pipeline, it might mean hours. If you are aggregating field sensor data for a weekly report, it might mean days. These are different systems with different architectures, but the underlying constraint is the same.
What changes is how aggressively you signal the inconsistency. A real-time system should expose staleness directly in its interface: "last updated 4 minutes ago." An analytics system should watermark its data. A batch system should document its lag assumptions. What you should not do is present data as current when it may not be. The silent stale data problem is not a UX problem. It is a trust problem.
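Exposing staleness can be as simple as computing the age of the newest record and putting it next to the number. A small sketch, where `staleness_label` and the one-minute freshness threshold are assumptions chosen for the example:

```python
from datetime import datetime, timedelta


def _plural(count: int, unit: str) -> str:
    return f"{count} {unit}" + ("" if count == 1 else "s")


def staleness_label(
    last_updated: datetime,
    now: datetime,
    fresh_within: timedelta = timedelta(minutes=1),
) -> str:
    """Describe how old the displayed data is, in words a reader can act on."""
    age = now - last_updated
    if age < timedelta(0):
        raise ValueError("last_updated is in the future; check clock sources")
    if age <= fresh_within:
        return "up to date"
    minutes = int(age.total_seconds() // 60)
    if minutes < 60:
        return f"last updated {_plural(minutes, 'minute')} ago"
    return f"last updated {_plural(minutes // 60, 'hour')} ago"
```

The timestamp that matters is the time of the newest record that actually arrived from the source, not the time the pipeline last ran. A pipeline that runs on schedule and writes fallback values would otherwise report itself as fresh.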
The hybrid architectures that work well in environments with real connectivity variability tend to follow a consistent pattern: local-first processing with asynchronous cloud sync. Do as much as possible locally or at the edge. Write to durable local storage first. Sync to the cloud when you can, and design the sync to be resumable.
For data pipelines, this often looks like: a local process that ingests, validates, and transforms incoming data and writes clean records to a local queue; a sync process that reads from the local queue and pushes to cloud storage or a managed streaming service; and a monitoring process that tracks the lag between the two and alerts when it grows beyond a threshold. The three processes are independent. A failure in the sync process does not bring down the ingestion process.
Batch fallback is underrated. I have worked on streaming pipelines that had sophisticated real-time processing but no graceful degradation when the stream stalled. When connectivity dropped, the stream stalled, the downstream got no data, and the reporting was broken until the stream recovered. Adding a batch fallback, a daily job that reads from the source directly and writes aggregates to the same downstream destination, meant that even during extended outages the data was never more than 24 hours stale. That is not as good as real-time. It is far better than blank dashboards.
If I were designing one of these systems today, from scratch, the biggest change I would make is in how I test for connectivity failure. Most teams test the happy path thoroughly and the error path superficially. They test "what happens when the API returns 500" but not "what happens when the API just stops responding for six hours." They test "what happens when the network is slow" but not "what happens when the network goes away entirely and then comes back."
Chaos engineering exists for this. But you do not need a sophisticated tool to start. You can simulate a network outage with a firewall rule. You can simulate a power event by killing processes abruptly. You can simulate reconnection by removing the firewall rule after a defined interval and watching what happens. What you will find, almost certainly, is that reconnection is messier than you expected, that your buffering is less durable than you thought, and that your downstream systems handle catch-up load worse than you hoped. All of those findings are cheaper to address in testing than in production.
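The "API just stops responding" case can also be reproduced inside a unit test, without touching firewall rules. The sketch below starts a local server that accepts connections and then says nothing, which is roughly what a client sees during heavy packet loss. `start_black_hole` and `fetch_with_timeout` are hypothetical helpers, and the request bytes are a placeholder.

```python
import socket
import threading
from typing import Tuple


def start_black_hole() -> Tuple[socket.socket, int]:
    """Listen on localhost, accept every connection, and never send a byte."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen()
    held = []  # keep accepted sockets referenced so they stay open and silent

    def accept_forever() -> None:
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # server socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]


def fetch_with_timeout(port: int, timeout_seconds: float) -> bytes:
    """Client under test: must give up after timeout_seconds, not hang."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_seconds) as client:
        client.sendall(b"GET /health HTTP/1.0\r\n\r\n")
        return client.recv(1024)  # raises socket.timeout if nothing arrives
```

A test then points the real client at the returned port and asserts that it raises a timeout within its configured limit. A client that was only ever tested against HTTP 500 responses will often hang here instead.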
The other investment worth making is observability specifically around connectivity state. Not just "is the pipeline running" but "how large is the local buffer," "what is the lag between local writes and cloud sync," "how long has it been since we successfully forwarded a record." These metrics tell you the health of the connectivity layer, not just the health of the application. When connectivity degrades, the application metrics often look fine right up until they do not. The connectivity metrics will show you the degradation in real time, before the dashboard goes stale and before the finance team starts asking questions.
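Those three questions map directly onto three numbers and three thresholds. A minimal sketch, where `ConnectivitySnapshot`, `connectivity_alerts`, and the default thresholds are illustrative assumptions rather than recommended values:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConnectivitySnapshot:
    buffer_depth: int                 # records waiting in the local buffer
    newest_local_write_at: float      # epoch seconds of the newest buffered record
    newest_forwarded_write_at: float  # epoch seconds of the newest record already synced
    last_forward_success_at: float    # epoch seconds of the last successful forward

    def sync_lag_seconds(self) -> float:
        """How far the cloud copy trails the local copy."""
        return max(0.0, self.newest_local_write_at - self.newest_forwarded_write_at)


def connectivity_alerts(
    snapshot: ConnectivitySnapshot,
    now: float,
    max_buffer_depth: int = 10_000,
    max_sync_lag_seconds: float = 300.0,
    max_silence_seconds: float = 120.0,
) -> List[str]:
    """Return a human-readable alert for each connectivity threshold crossed."""
    alerts: List[str] = []
    if snapshot.buffer_depth > max_buffer_depth:
        alerts.append("local buffer depth above threshold")
    if snapshot.sync_lag_seconds() > max_sync_lag_seconds:
        alerts.append("sync lag above threshold")
    if now - snapshot.last_forward_success_at > max_silence_seconds:
        alerts.append("no successful forward within the allowed window")
    return alerts
```

The third check is the one that catches a quiet degradation earliest: the buffer may still be small and the lag still modest, but a forwarder that has not succeeded in two minutes is already telling you the link is gone.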
Building for intermittent connectivity is not exotic engineering. It is what systems need to be when they operate in the real world, where cables get cut, where power blinks, where mobile networks congest during peak hours, and where the gap between "the network is available" and "the network is reliably available at the latency and bandwidth you assumed" is where most production incidents actually live. Designing for that gap is not pessimism. It is just engineering.