Most data pipelines do not fail dramatically. They fail slowly. A backlog grows quietly in a queue nobody is watching. A schema change upstream silently corrupts three downstream tables. A batch job that used to run in 40 minutes now takes four hours and nobody knows exactly when that started happening. The system keeps running. The data keeps flowing. The trust erodes.
Building a pipeline that survives production is a different problem than building one that passes testing. This is about the architectural decisions that determine which category yours ends up in.
Every data pipeline is, at its core, a decision about when to move data. You can move it in large periodic chunks, or you can move it continuously as it is produced. That choice has cascading consequences for everything else: your infrastructure, your failure modes, your operational complexity, and your latency guarantees.
Batch processing is the older model and still the right answer for many workloads. You accumulate data over some window, typically hours or a day, then process it all at once. The mental model is simple. The failure modes are contained. If a job fails, you rerun it. Idempotency is achievable because you control exactly which data you are processing at any given time. The tooling is mature and the cost profile is predictable.
The tradeoff is latency. In a pure batch system, the freshest your data can ever be is one batch interval old. For a daily job, that means your analytics are always looking at yesterday. For an hourly job, you are an hour behind. Whether that matters depends entirely on what the data is for. A billing reconciliation job that runs nightly does not need sub-minute freshness. A fraud detection system that flags transactions after the customer has already left the building is a different story.
Streaming processing moves data continuously, event by event or in micro-batches of seconds. The latency ceiling is much lower, sometimes single-digit seconds from event production to query availability. The cost is complexity. Streaming systems require you to think carefully about things that batch systems handle implicitly: out-of-order events, late-arriving data, exactly-once semantics, state management across time windows.
The important thing to understand is that streaming does not eliminate batch processing. It changes where the batch boundary sits. Even in a streaming system, you are making windowing decisions: tumbling windows, sliding windows, session windows. The question shifts from "how often do we run the job" to "how do we define the time boundaries for aggregations." The cognitive load is similar, just distributed differently.
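The windowing decision can be made concrete with a small sketch. This is a minimal, illustrative tumbling-window aggregation, not any particular streaming engine's API: each event is assigned to a fixed, non-overlapping time bucket, which is exactly the batch-boundary decision the text describes.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign each (timestamp, value) event to a fixed, non-overlapping
    window and count events per window. A minimal illustration of the
    windowing decision a streaming engine makes for you."""
    counts = defaultdict(int)
    for ts, _value in events:
        # Floor the timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (5, "b"), (59, "c"), (61, "d"), (125, "e")]
print(tumbling_window_counts(events, 60))
# {0: 3, 60: 1, 120: 1}
```

A sliding or session window changes only the assignment rule, but the underlying question is the same: where do the time boundaries for aggregation sit.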
Most mature data platforms end up running both. Streaming for the low-latency paths where freshness matters, batch for the heavy historical reprocessing and cost-sensitive analytical workloads.
Going all-in on streaming before you have a genuine latency requirement is a common way to accumulate complexity without the corresponding benefit.
The medallion pattern, organizing data into bronze, silver, and gold layers, has become something close to an industry standard. Like most patterns that achieve wide adoption, it works because it solves a real problem rather than because it is theoretically elegant.
The problem it solves is data trust. When raw data lands in a storage layer alongside processed, aggregated, and joined data, debugging data quality issues becomes archaeology. You cannot tell whether a number is wrong because the source system sent bad data, because a transformation introduced a bug, or because the aggregation logic changed. The layers enforce separation between what you received, what you cleaned, and what you computed.
Bronze is raw ingestion. You store data exactly as it arrived, with minimal transformation. Schema-on-read, append-only, never modified after landing. The bronze layer is your audit trail. When something goes wrong downstream, you can always reprocess from bronze back to a known good state. This property sounds unremarkable until the first time you need it at 3am.
Silver is where conformance happens. You parse bronze data into typed schemas, apply deduplication, validate against contracts, and join related entities. The silver layer should be usable by analysts who understand the domain but should not require them to understand the quirks of the source systems. Nulls are handled consistently. Timestamps are in a known timezone. Field names mean the same thing across tables.
Gold is purpose-built for consumption. Aggregated metrics, wide denormalized tables, views optimized for specific reporting or machine learning use cases. The gold layer trades flexibility for query performance. A gold table might precompute a metric that requires joining five silver tables, because that join runs thousands of times a day and the cost of precomputation is justified.
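The three layers can be sketched end to end. This is a toy illustration under assumed field names (`orderId`, `amount`, `ts` are hypothetical), not a canonical model: bronze stays as the raw string, silver conforms it into typed fields with a normalized timezone, and gold precomputes one consumption-ready metric.

```python
import json
from datetime import datetime, timezone

def to_silver(bronze_record: str) -> dict:
    """Parse a raw bronze JSON string into a typed, conformed record.
    Field names and rules are illustrative only."""
    raw = json.loads(bronze_record)
    return {
        "order_id": str(raw["orderId"]),
        # Store money as integer cents to avoid float drift downstream.
        "amount_cents": int(round(float(raw["amount"]) * 100)),
        # Normalize timestamps to UTC so every silver table agrees.
        "created_at": datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
    }

def to_gold(silver_records: list) -> dict:
    """Precompute one purpose-built metric for consumers."""
    return {"total_revenue_cents": sum(r["amount_cents"] for r in silver_records)}

bronze = ['{"orderId": 1, "amount": "19.99", "ts": 1700000000}',
          '{"orderId": 2, "amount": "5.00", "ts": 1700000300}']
silver = [to_silver(r) for r in bronze]
print(to_gold(silver))  # {'total_revenue_cents': 2499}
```

The bronze strings are never mutated; if the silver parsing logic turns out to be buggy, you rerun `to_silver` over the preserved raw records.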
The failure mode of the medallion pattern is over-engineering the silver layer. Silver is not your canonical data model. It is a cleaned, typed version of what you received. Scope it accordingly.
No architectural diagram ever shows the schema change that silently broke four downstream tables on a Tuesday afternoon. But it happens constantly, and it is responsible for more pipeline failures than any infrastructure issue.
Source systems change. An upstream team adds a column, renames a field, changes a type from integer to string, or starts sending null where they used to send an empty string. In a batch system with a weekly deployment cycle, you might not discover this until the next run. In a streaming system, you might discover it immediately but find that you have already ingested an hour of malformed data.
Schema evolution strategy should be a first-class architectural decision, not an afterthought. The two ends of the spectrum are schema-on-write (you enforce a schema at ingestion time, rejecting data that does not conform) and schema-on-read (you store raw bytes or semi-structured data and parse at query time). Neither is universally correct.
Schema-on-write gives you early failure. Bad data gets caught at the border, not discovered six transformations later when the business notices a dashboard looks wrong. The cost is brittleness to legitimate upstream changes. Adding a new required field to the schema means coordinating with every producer simultaneously.
Schema-on-read gives you flexibility. You can ingest data that does not match any predefined structure and figure out what to do with it later. The cost is that "later" often means "when someone notices a number is wrong," and by then the bad data has propagated.
A practical middle path is schema registries with forward and backward compatibility rules. Producers register schemas and can only evolve them in compatible ways: adding optional fields, not removing required ones, not changing types incompatibly. This lets schemas evolve without coordination while still providing guardrails against destructive changes.
Whatever approach you choose, the bronze layer should preserve raw data before any schema enforcement. This gives you the ability to replay data through a new schema parser when the schema does change, rather than having lost the original events permanently.
Two properties matter more than most engineers give them credit for when building pipelines that survive production failures: idempotency and exactly-once delivery.
Idempotency means that running the same operation multiple times produces the same result as running it once. In a batch pipeline, this means you can safely rerun a failed job without worrying that it will double-count events or insert duplicate rows. In a streaming pipeline, it means a message that gets redelivered due to a broker restart does not corrupt your state.
Achieving idempotency usually requires being intentional about your write operations. Upserts instead of inserts. Deterministic deduplication keys. Write-ahead staging to a temporary location before atomic promotion. The specific mechanism depends on your storage layer, but the principle is the same: design your writes so that replaying them is safe.
Exactly-once semantics is the stronger guarantee: each event is processed exactly once, not at-least-once or at-most-once. It sounds like what you obviously want, but it comes with a real cost. True exactly-once in a distributed system requires coordination between the message broker and the sink, transactional semantics across both, and often meaningful throughput overhead.
At-least-once delivery is cheaper and sufficient for most workloads if your processing is idempotent. The practical question is not "do I want exactly-once" but "what is the cost of processing a message twice, and how do I make that safe."
The failure mode to watch for is systems that claim exactly-once semantics but deliver it only within a single processing node. If your consumer crashes and restarts, or if you are running multiple consumer instances, the guarantee may not hold in the way you expect. Read the fine print.
The ability to reprocess historical data is one of the most underrated properties of a well-designed pipeline. You will need it. A bug in your transformation logic, a schema change you did not handle correctly, a new feature that requires a column that did not exist in the original processing, a regulatory requirement to recompute a metric differently. Reprocessing is not an edge case. It is a routine operation.
This has direct implications for pipeline design. Pipelines that write to mutable storage without any record of what version of the logic produced a given row are difficult to reprocess correctly. You do not know which rows were produced by the buggy logic and which were produced after the fix.
Immutable append-only storage at the bronze and silver layers makes reprocessing tractable. You can always identify rows by the timestamp at which they were processed and rerun from that point. Table formats that support time travel, allowing you to query data as it existed at a prior point in time, make this even cleaner.
Partitioning strategy matters enormously for backfill performance. A table partitioned by ingestion date can be reprocessed one partition at a time, with parallelism limited only by your compute capacity. A table with no partitioning forces full table scans for every backfill run.
Cost is the hidden variable in backfill design. Reprocessing two years of data in a cloud environment can be expensive if your pipeline is not designed for it. Knowing roughly what a full historical reprocess would cost, before you need to do one, is useful information to have.
A pipeline you cannot observe is not a production pipeline. It is a pipeline that will fail silently until a stakeholder notices a dashboard looks wrong and files a ticket.
The metrics that matter are not the ones that are easy to collect. Job success and failure rates are table stakes. What actually tells you whether your pipeline is healthy is data freshness, row count anomalies, schema drift detection, and end-to-end latency from event production to query availability.
Data freshness means knowing, for each table or partition, when the most recent data was produced and whether that matches expectations. A table that should be updated hourly but has not received new data in three hours is a problem. Without automated freshness monitoring, you find out about this when someone asks why this morning's numbers look flat.
Row count anomalies catch a class of bugs that do not surface as job failures. A transformation that filters incorrectly may run successfully to completion while silently dropping 40% of events. Comparing row counts across pipeline stages, and alerting when they deviate significantly from historical patterns, is one of the highest-value monitoring investments you can make.
Data contracts, formal agreements between producers and consumers about what data will look like, are increasingly treated as infrastructure rather than documentation. When a producer violates a contract, the consuming pipeline should fail loudly at ingestion rather than quietly producing wrong results. The operational overhead of maintaining contracts is real, but it is smaller than the operational overhead of debugging silent data corruption across a complex pipeline graph.
There is no single architecture that is correct for all workloads, which is the unsatisfying answer that also happens to be true. What survives production is an architecture that was designed with its failure modes in mind, not just its happy path.
That means bronze storage that is append-only, partitioned sensibly, and preserved long enough to reprocess. It means transformation layers that are idempotent by design. It means schema management that catches breaking changes at the border rather than three tables downstream. It means observability that tells you about data quality problems before your users do.
The choice between streaming and batch, the specific storage formats, the orchestration tool, the compute engine: these matter, but they matter less than the structural properties. A batch pipeline built on mature fundamentals will outperform a streaming pipeline built without idempotency or observability, regardless of what the architecture diagram looks like.
The pipelines that survive are the ones where, when something goes wrong, you know what happened, you know where it happened, and you can reprocess cleanly from a known good state. Everything else is a consequence of designing for that outcome from the start.