
Building reliable backfills in data pipelines

In this article, let us look at how to build reliable backfills in data pipelines, why we need them, and what usually breaks when we run them in a hurry. Backfills sound simple at first: we missed some data, so let us rerun the pipeline for older dates. But in practice, a careless backfill can create duplicate rows, overload downstream systems, and confuse business users because numbers keep changing without warning.

For our use case, think of a daily orders pipeline that loads raw files into a warehouse and then builds a reporting table. The day-to-day schedule works fine, but one source system had an outage for five days. We now need to load those five days again and rebuild the affected tables in a controlled way.

What a backfill really means

A backfill is just reprocessing historical data for a known period. But there are different types of backfills, and the design depends on which one we are doing.

| Backfill type           | Typical reason                 | Main risk                   |
|-------------------------|--------------------------------|-----------------------------|
| Missing partitions      | Upstream outage or job failure | Gaps in data                |
| Logic change            | Bug fix in transformation      | Old reports change suddenly |
| Full historical rebuild | New model or new source        | Very expensive run          |

If we do not distinguish these cases, we end up using the same script for everything, and that usually becomes painful later. A small partition replay should not need the same process as a six-month rebuild.

First design rule: process by a clear unit

The easiest way to make backfills reliable is to process data by a clear unit such as business_date, event_date, or batch_id. If your pipeline only knows how to run “everything after the latest timestamp”, then backfills become awkward.

For example, let us say our source files land in folders like the ones below.

s3://company-raw/orders/dt=2025-01-10/
s3://company-raw/orders/dt=2025-01-11/
s3://company-raw/orders/dt=2025-01-12/

This makes it much easier because our pipeline can run per partition. We can say “rerun 2025-01-10 to 2025-01-12” instead of trying to infer what changed.

A simple config for a backfill job could look like this.

job: orders_curated
mode: backfill
start_date: 2025-01-10
end_date: 2025-01-12
write_disposition: overwrite_partition

The main point is that the pipeline should accept explicit boundaries. If your code cannot take these parameters, that is usually the first thing to fix.
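
As a rough sketch of what that can look like, here is a minimal command-line entrypoint that forces the caller to state the boundaries explicitly. The argument names and the final step are illustrative assumptions, not a specific framework's interface.

import argparse
from datetime import date

def parse_args():
    # Explicit boundaries: the caller must state exactly which slice to reprocess.
    parser = argparse.ArgumentParser(description="Run the orders_curated pipeline")
    parser.add_argument("--mode", choices=["scheduled", "backfill"], default="scheduled")
    parser.add_argument("--start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--end-date", type=date.fromisoformat, required=True)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    if args.end_date < args.start_date:
        raise SystemExit("end-date must not be before start-date")
    # hand the explicit range to the pipeline, e.g. the run_backfill flow shown later
    print(f"{args.mode} run for {args.start_date} to {args.end_date}")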

Second design rule: make writes idempotent

In many projects, the actual problem is not rerunning the read logic, but rerunning the write logic safely. If the target table only supports append, then every retry can create duplicates.

For partitioned warehouse tables, I prefer one of these two patterns:

  1. Delete and reload only the target partition (a minimal sketch follows after this list)
  2. Merge into the target table using a stable business key
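
For the first pattern, a minimal sketch using the BigQuery Python client could look like the below. The table names match our example, but the helper itself is an illustration, not a library API.

from google.cloud import bigquery

def reload_partition(client: bigquery.Client, dt: str) -> None:
    # Delete the partition we are about to rewrite, then reinsert it from staging.
    # Rerunning this for the same date gives the same end state.
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("dt", "DATE", dt)]
    )
    client.query(
        "DELETE FROM analytics.orders WHERE order_date = @dt",
        job_config=job_config,
    ).result()
    client.query(
        "INSERT INTO analytics.orders "
        "SELECT * FROM staging.orders_backfill WHERE order_date = @dt",
        job_config=job_config,
    ).result()

In practice, the delete and the insert are better wrapped in a single multi-statement transaction so a failure between the two does not leave the partition empty; the MERGE pattern avoids that gap entirely.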

For the second pattern, a BigQuery MERGE example would look like this.

MERGE analytics.orders t
USING staging.orders_backfill s
ON t.order_id = s.order_id
AND t.order_date = s.order_date
WHEN MATCHED THEN
  UPDATE SET
    customer_id = s.customer_id,
    amount = s.amount,
    status = s.status,
    updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (order_id, order_date, customer_id, amount, status, updated_at)
  VALUES (s.order_id, s.order_date, s.customer_id, s.amount, s.status, CURRENT_TIMESTAMP());

This is slower than a blind append, but it is much safer. For our use case, I would rather pay a bit more compute cost than explain duplicate revenue in a dashboard.

If the destination is a file-based data lake, the same idea still applies. Rewrite only the affected partition paths, not the whole table, unless the change really requires a full rebuild.
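
As an illustration, with Spark the dynamic partition overwrite mode rewrites only the partitions present in the incoming data. The lake output path here is an assumption for the example.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders_backfill")
    # overwrite only the partitions present in the incoming DataFrame,
    # not the whole table directory
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# basePath keeps the dt partition column when reading specific partition folders
backfill_df = (
    spark.read
    .option("basePath", "s3://company-raw/orders/")
    .parquet(
        "s3://company-raw/orders/dt=2025-01-10/",
        "s3://company-raw/orders/dt=2025-01-11/",
        "s3://company-raw/orders/dt=2025-01-12/",
    )
)

(
    backfill_df.write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("s3://company-lake/orders_curated/")
)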

Keep a backfill control table

One thing that helps a lot is maintaining a small control table that tracks each backfill request. This is useful for visibility and also to prevent running the same range again by mistake.

A simple schema could be:

CREATE TABLE ops.backfill_runs (
  pipeline_name STRING,
  start_date DATE,
  end_date DATE,
  reason STRING,
  requested_by STRING,
  status STRING,
  created_at TIMESTAMP,
  completed_at TIMESTAMP
);

Before triggering the pipeline, the orchestration layer writes a row into this table. When the run finishes, it updates the status. This does not need to be fancy. Even a simple table gives the team a shared view of what happened.
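
A minimal sketch of those two steps, assuming a run_query helper that executes parameterised SQL against the warehouse (the helper and its parameter style are assumptions):

from datetime import datetime, timezone

def record_backfill_start(run_query, pipeline, start_date, end_date, reason, requested_by):
    # one row per backfill request, visible to the whole team
    run_query(
        "INSERT INTO ops.backfill_runs "
        "(pipeline_name, start_date, end_date, reason, requested_by, status, created_at) "
        "VALUES (:pipeline, :start, :end, :reason, :requested_by, 'RUNNING', :now)",
        {"pipeline": pipeline, "start": start_date, "end": end_date,
         "reason": reason, "requested_by": requested_by,
         "now": datetime.now(timezone.utc)},
    )

def record_backfill_done(run_query, pipeline, start_date, end_date, status="SUCCESS"):
    # close the request once every partition has been processed
    run_query(
        "UPDATE ops.backfill_runs SET status = :status, completed_at = :now "
        "WHERE pipeline_name = :pipeline AND start_date = :start AND end_date = :end",
        {"status": status, "now": datetime.now(timezone.utc),
         "pipeline": pipeline, "start": start_date, "end": end_date},
    )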

Example orchestration flow

Let us take an Airflow-style pseudo-code example.

def run_backfill(start_date, end_date):
    # expand the requested range into individual partition dates
    dates = list_dates(start_date, end_date)

    for dt in dates:
        extract_raw_partition(dt)        # pull raw files for this date only
        validate_raw_partition(dt)       # fail fast before writing anything
        rebuild_staging_partition(dt)
        merge_into_curated(dt)           # idempotent write, as discussed above
        run_data_quality_checks(dt)
        mark_partition_complete(dt)      # per-partition completion marker

Notice that the flow marks each partition independently. This matters because a backfill over 30 days should not fail completely at day 29 and leave us guessing what was already done.

A checkpoint table or completion marker per partition helps us resume safely. If 20 days already succeeded, the next rerun should only continue from day 21.
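
A rough sketch of that resume behaviour, reusing the helpers from the pseudo-code above plus a hypothetical is_partition_complete lookup against the checkpoint table:

def run_backfill_resumable(start_date, end_date):
    for dt in list_dates(start_date, end_date):
        # skip partitions that already have a completion marker,
        # so a rerun continues from where the previous attempt stopped
        if is_partition_complete("orders_curated", dt):
            continue
        extract_raw_partition(dt)
        validate_raw_partition(dt)
        rebuild_staging_partition(dt)
        merge_into_curated(dt)
        run_data_quality_checks(dt)
        mark_partition_complete(dt)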

Data quality checks are more important in backfills

For a daily scheduled run, small issues sometimes go unnoticed. During a backfill, those same issues get multiplied across many partitions. So I think backfill runs deserve stronger checks than normal pipeline runs.

At minimum, I would check the following for every processed date:

  • source file count or source row count
  • destination row count
  • duplicate business keys
  • null checks on important columns
  • date boundary validation

For example:

SELECT order_date, COUNT(*)
FROM analytics.orders
WHERE order_date BETWEEN '2025-01-10' AND '2025-01-12'
GROUP BY order_date
ORDER BY order_date;

And for duplicate detection:

SELECT order_id, order_date, COUNT(*)
FROM analytics.orders
WHERE order_date BETWEEN '2025-01-10' AND '2025-01-12'
GROUP BY order_id, order_date
HAVING COUNT(*) > 1;

These are simple checks, but they catch many common problems. In a production use case, I would also compare aggregates against the source system if that is possible.
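
A small sketch of such a comparison, assuming query_source and query_warehouse helpers that each return a row count per date:

def compare_daily_counts(query_source, query_warehouse, start_date, end_date, tolerance=0):
    # flag any date where the warehouse count drifts from the source count
    src = query_source(start_date, end_date)      # e.g. {"2025-01-10": 120431, ...}
    dst = query_warehouse(start_date, end_date)
    mismatches = {}
    for dt in sorted(set(src) | set(dst)):
        if abs(src.get(dt, 0) - dst.get(dt, 0)) > tolerance:
            mismatches[dt] = {"source": src.get(dt, 0), "warehouse": dst.get(dt, 0)}
    return mismatches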

Things to be careful about

There are a few practical issues that teams usually underestimate.

1. Late arriving dimensions

If fact tables depend on dimension tables, backfilling facts alone may still produce wrong joins. We might need to backfill reference data first.
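
One simple way to enforce that ordering is to refresh the affected dimensions before replaying the facts. The rebuild_dimension_partition helper here is hypothetical:

def run_backfill_with_dimensions(start_date, end_date):
    # bring reference data up to date first so the fact joins resolve correctly
    for dt in list_dates(start_date, end_date):
        rebuild_dimension_partition("customers", dt)
        rebuild_dimension_partition("products", dt)
    # then replay the facts using the flow shown earlier
    run_backfill(start_date, end_date)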

2. Changing business logic

If the code today is different from the code that originally produced the historical data, then a backfill is not just filling a gap; it is rewriting history. Sometimes that is correct, but business users should know.

3. Load on source systems

A big backfill can hit APIs, databases, or warehouses much harder than a regular schedule. Use rate limits, chunking, or off-hours windows where possible.
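
One way to spread the load is to split the range into small chunks and pause between them. The chunk size and sleep interval below are illustrative:

import time
from datetime import date, timedelta

def chunked_dates(start: date, end: date, chunk_days: int = 3):
    # yield small (chunk_start, chunk_end) windows covering the full range
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=chunk_days - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)

def run_backfill_gently(start: date, end: date, pause_seconds: int = 60):
    for chunk_start, chunk_end in chunked_dates(start, end):
        run_backfill(chunk_start, chunk_end)   # the flow from the pseudo-code above
        time.sleep(pause_seconds)              # give the source system room to breathe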

4. Downstream expectations

If dashboards, ML features, or finance extracts are reading the data during the backfill, numbers may look unstable until the run completes. It is better to communicate that upfront.

What changes in production

For a simple demo, we can run a script with start and end dates and then manually validate the results. In production, I would add a few more guardrails:

  • approval step for large date ranges
  • concurrency limits so normal schedules are not starved
  • metrics and alerts per partition
  • audit logging for who requested the backfill
  • automatic resume from the last successful checkpoint

I also prefer having a separate compute pool or queue for backfills if the platform supports it. Otherwise a heavy historical run can slow down the regular ingestion jobs.

A simple mental model

If I need to explain reliable backfills in one line, it would be this: process a well-defined slice, write idempotently, validate the result, and keep enough metadata to resume safely. Most failures happen when one of these four is missing.

Backfills are not the most exciting part of data engineering, but they matter a lot because every real pipeline eventually needs them. If we design for them early, recovery becomes routine instead of stressful. That is usually the difference between a pipeline that only works on good days and one that is actually reliable.

This post is licensed under CC BY 4.0 by the author.