
Schema evolution without breaking downstream jobs

In this article, let us see how to handle schema evolution in a data pipeline without breaking all the jobs that depend on it. This becomes important when your source system adds a column, renames one, changes a datatype, or starts sending nested fields that were not there earlier. In a small demo pipeline, we might just update the code and rerun it. In a real project, a schema change can break ingestion jobs, transformation models, dashboards, data quality checks, and downstream exports.

For our use case, let us assume we have an orders dataset landing daily in cloud storage, and we load it into a bronze table first, then transform it into silver and gold tables. The challenge is that the source team can change the file structure without coordinating every release with the data team. So we need a design that is flexible enough to accept safe changes and strict enough to prevent silent data issues.

What schema evolution usually looks like

In practice, schema evolution is not one single thing. It can happen in different ways:

  • a new nullable column gets added
  • a column order changes in CSV or JSON payloads
  • a datatype changes from integer to string
  • a nested field gets added inside a struct
  • a column is removed or renamed
  • the meaning of a field changes even if the name stays the same

Some of these are easy to handle. Some of them are dangerous. Adding a nullable column is usually manageable. Renaming customer_id to client_id without warning can be painful if ten downstream models depend on the old name.

This is why I prefer to think about schema evolution in terms of compatibility rather than just change. The question is not whether the schema changed. The question is whether downstream consumers can still work safely.

A simple pattern that works well

For most batch pipelines, a practical pattern is this:

  1. Load source data as-is into a raw or bronze table
  2. Validate the incoming schema against expected rules
  3. Allow safe additive changes automatically
  4. Block or quarantine breaking changes
  5. Keep a stable contract in silver and gold layers

This gives some flexibility at the ingestion layer while protecting the tables that analysts and applications actually use.

Bronze can be flexible, silver should be stable

If we are loading JSON or Parquet files, bronze is usually the best place to absorb schema changes. For example, in Spark we can enable schema merge for Parquet when needed:

df = spark.read.option("mergeSchema", "true").parquet("s3://demo/orders/")

If the source adds a nullable column called discount_code, the bronze table can accept it without much trouble. But I would still avoid exposing bronze directly to reports or business users. Bronze is where we preserve source truth, not where we promise stable semantics.
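
If the bronze table happens to be stored as Delta, a similar option exists on the write side. A minimal sketch, appending the DataFrame from above to a Delta-backed bronze table at a placeholder path:

# Append the incoming batch to a Delta bronze table, letting new
# nullable columns such as discount_code be added automatically
(df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("s3://demo/bronze/orders/"))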

Then in silver, we explicitly select the fields we support:

select
  order_id,
  customer_id,
  order_timestamp,
  total_amount,
  coalesce(discount_code, 'NONE') as discount_code
from bronze_orders

This is boring, but boring is good here. Downstream jobs see a stable structure, and we decide how new columns should be handled instead of letting them flow everywhere automatically.

Additive changes versus breaking changes

A simple comparison helps when deciding what to automate.

Change type                        | Usually safe? | What I would do
New nullable column                | Yes           | Accept in bronze, review for silver
Column reorder                     | Usually yes   | Ignore if format supports names
New nested field                   | Often yes     | Accept in bronze, test consumers
Type widening, like int to bigint  | Maybe         | Validate carefully before rollout
Type change, like int to string    | No            | Quarantine and investigate
Column rename or removal           | No            | Treat as breaking change
Semantic meaning change            | No            | Version the contract and communicate

The main mistake I have seen is treating all schema changes the same way. If we blindly allow everything, we get silent bad data. If we block everything, operations become painful and every harmless column addition becomes a fire drill.

Keep an expected schema contract

One practical approach is to store an expected schema in code. This does not have to be fancy. It can be a JSON file, YAML, or a table in your metadata store.

For example:

{
  "table": "orders",
  "columns": [
    {"name": "order_id", "type": "string", "nullable": false},
    {"name": "customer_id", "type": "string", "nullable": false},
    {"name": "order_timestamp", "type": "timestamp", "nullable": false},
    {"name": "total_amount", "type": "decimal(12,2)", "nullable": true}
  ]
}
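
One way to wire this up is to load the contract and capture the schema Spark infers for the incoming batch. A small sketch, where the contract path and bucket are placeholders:

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expected contract kept in version control next to the pipeline code
with open("contracts/orders.json") as f:
    contract = json.load(f)
expected_schema = {c["name"]: c["type"] for c in contract["columns"]}

# Schema Spark infers for the incoming batch; nullability is ignored here
incoming_df = spark.read.parquet("s3://demo/orders/")
incoming_schema = {f.name: f.dataType.simpleString() for f in incoming_df.schema.fields}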

Then before loading or promoting data, compare the actual incoming schema with the expected one. A simplified check could look like this:

def validate_schema(incoming_schema, expected_schema):
    # Both arguments are dicts of column name -> type string
    additive, breaking = [], []

    # Columns that only appear in the source are additive changes
    for name in incoming_schema:
        if name not in expected_schema:
            additive.append(name)

    # Columns in the contract must still be present in the source
    for name in expected_schema:
        if name not in incoming_schema:
            breaking.append(f"missing column: {name}")

    # Shared columns must keep a compatible type; plain equality here,
    # a real check might allow safe widening such as int to bigint
    for name in expected_schema.keys() & incoming_schema.keys():
        if incoming_schema[name] != expected_schema[name]:
            breaking.append(f"type change on {name}")

    return additive, breaking

This can feed an alert, a Slack message, or a failed orchestration step depending on how strict your platform needs to be.
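For example, continuing from the snippets above, a minimal way to wire the result into an orchestration step (the exact failure behaviour is just one option):

additive, breaking = validate_schema(incoming_schema, expected_schema)

if breaking:
    # Failing the task keeps the breaking change from reaching silver;
    # this is also the place to send an alert or Slack message
    raise RuntimeError(f"Breaking schema change in orders: {breaking}")

if additive:
    print(f"Additive columns accepted in bronze, review for silver: {additive}")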

A staging view can reduce breakage

If downstream jobs read directly from physical bronze tables, even harmless schema changes can cause surprises. I prefer using a staging view or model that normalizes the columns we want.

For example:

create or replace view stg_orders as
select
  cast(order_id as string) as order_id,
  cast(customer_id as string) as customer_id,
  cast(order_timestamp as timestamp) as order_timestamp,
  cast(total_amount as numeric) as total_amount
from bronze_orders

This is useful when one source sends total_amount as integer cents for a week and then changes to decimal currency values. The bronze table might store both awkwardly, but the staging layer can enforce a single contract.

Be careful with CSV sources

CSV files are where schema evolution gets more annoying. Since CSV has weak type information, you rely on headers and parsing rules. A reordered column list is fine only if you read by header names. If your job depends on column positions, one small source change can shift values into the wrong fields without an obvious error.

For CSV pipelines, I usually do the following:

  • validate that required headers exist
  • reject duplicate column names
  • trim whitespace from headers
  • map source names to canonical names where needed
  • cast everything explicitly in the transformation layer

This adds a bit more code, but it avoids weird bugs later.
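
A rough sketch of those header checks, assuming pandas, a local orders.csv file, and a hypothetical source-to-canonical rename map:

import pandas as pd

REQUIRED = {"order_id", "customer_id", "order_timestamp", "total_amount"}
RENAMES = {"client_id": "customer_id"}  # hypothetical mapping

# Read by header names, never by position, and keep everything as text;
# explicit casts happen later in the transformation layer
df = pd.read_csv("orders.csv", dtype=str)

headers = [h.strip() for h in df.columns]
if len(headers) != len(set(headers)):
    raise ValueError("duplicate column names in CSV header")
df.columns = [RENAMES.get(h, h) for h in headers]

missing = REQUIRED - set(df.columns)
if missing:
    raise ValueError(f"missing required headers: {missing}")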

What changes in production

In a demo, we might just compare schemas and fail the job. In production, I would usually add a few more controls:

  • schema registry or metadata table for approved versions
  • alerting integrated with orchestration
  • quarantined landing area for bad files
  • backfill strategy after fixing the contract
  • consumer impact assessment before promoting schema updates

Another thing I like in production is versioning important datasets. If a breaking change is unavoidable, creating orders_v2 for a migration period is often safer than changing the existing contract overnight. It creates a little extra maintenance, but it prevents downstream jobs from all failing at once.

One example with downstream job protection

Suppose a daily aggregation job calculates revenue by day:

select
  date(order_timestamp) as order_date,
  sum(total_amount) as revenue
from stg_orders
group by 1

If the source changes total_amount from decimal to string, this gold job should not be the first place where the failure shows up. The ingestion or staging validation should catch it earlier. That is the real goal of schema evolution handling. We are not trying to make every change invisible. We are trying to detect incompatible changes close to the source and protect downstream jobs from surprise failures.

Limitations and caveats

There are still a few things to be careful about:

  1. Some engines support schema merge, but that does not mean your business logic is automatically correct.
  2. Nested schema evolution can behave differently across Spark, BigQuery, Iceberg, and Delta Lake, so test on your actual platform.
  3. A column can remain technically compatible while its meaning changes. This is harder to catch with pure schema validation.
  4. Backfills become tricky if older partitions do not have the newer fields. You may need defaults or conditional logic, as sketched below.
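
As a sketch, assuming older partitions are read back with Spark and may predate the discount_code column from the earlier example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Older partitions may not have the column, so add it conditionally
# before applying the same default the silver layer uses
df = spark.read.parquet("s3://demo/orders/")  # placeholder path

if "discount_code" not in df.columns:
    df = df.withColumn("discount_code", F.lit(None).cast("string"))

df = df.withColumn("discount_code", F.coalesce(F.col("discount_code"), F.lit("NONE")))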

So schema evolution support in the storage layer is helpful, but it is not the full solution.

Conclusion

Schema changes are normal in data engineering, but downstream breakage should not be normal. If we keep bronze flexible, silver stable, and validate changes with a simple contract, we can handle most source evolution without making every downstream team nervous. The exact tooling can differ, but the principle stays the same: accept safe changes, stop breaking ones early, and keep the consumer-facing layers predictable.

This post is licensed under CC BY 4.0 by the author.