Handling retries and idempotency in ETL jobs

In this article, let us look at how to handle retries and idempotency in ETL jobs, and why this matters when a pipeline fails halfway and we need to run it again without creating bad data. Most teams start with a simple pipeline that works when everything is green, but the real trouble starts when the network times out, the warehouse load fails, or the orchestrator retries a task automatically. If we do not design for that situation, a retry can duplicate rows, overwrite correct data, or leave one part of the pipeline updated while another part is stale.

For a small demo pipeline, retries may look like an operational problem. In production, retries are part of the design itself. We should assume jobs will fail sometimes and build them in a way that a second run gives the same final result as the first successful run. That is the main idea behind idempotency.

What retry-safe really means

A retry-safe ETL job is not just a job that can be run again. It is a job that can be run again and still keep the destination in a correct state.

In simple terms:

| Situation | Bad result | Better result |
| --- | --- | --- |
| Load step retries after partial insert | Duplicate rows | Merge or replace only the intended batch |
| File already processed once | Same file loaded twice | Detect and skip, or upsert deterministically |
| Job fails after writing temp data | Broken checkpoint | Commit only after full success |
| API call is repeated | Duplicate downstream action | Use request keys or state tracking |

A lot of retry issues happen because we think only at the task level. The actual data correctness depends on the whole flow, including extract, stage, transform, load, and metadata updates.

A simple pipeline where retries go wrong

Assume we ingest order data every hour from cloud storage into a staging table, then transform it into a final analytics table.

A first version could look like this:

def run_job(file_path):
    rows = read_csv(file_path)
    # plain append: a retry appends the same rows a second time
    write_to_table("stg_orders", rows, mode="append")
    execute_sql(f"""
        insert into mart_orders
        select * from stg_orders
        where load_file = '{file_path}'
    """)

This looks fine until the insert into mart_orders succeeds but the orchestrator still marks the task as failed because of a timeout or connection drop. Then the scheduler retries the whole job. The staging append happens again, the final insert happens again, and now we have duplicates in both places.

This is why an "append everywhere" approach is risky unless the data itself has a reliable deduplication key and we actually use it.

Pattern 1: Track a stable batch key

Every run should have a stable identifier. This can be a source file name, source system extract ID, business date, or a generated batch ID that is stored with the data. The important part is that retries of the same logical batch reuse the same key.

For example:

{
  "batch_id": "orders_2025_02_18_10",
  "source_file": "orders-2025-02-18-10.csv"
}

If the retry uses a new batch ID every time, then we have already made deduplication harder. I prefer to derive the batch key from the source event when possible, since it makes debugging much easier.
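
As a small illustration, here is one way to derive that key. This is only a sketch that assumes the hourly file naming convention shown above, and the derive_batch_id function is mine, not part of any library.

import re

def derive_batch_id(source_file: str) -> str:
    # Derive a stable batch key from the source file name, e.g.
    # "orders-2025-02-18-10.csv" -> "orders_2025_02_18_10".
    # A retry of the same file always yields the same batch_id.
    match = re.match(r"orders-(\d{4})-(\d{2})-(\d{2})-(\d{2})\.csv$", source_file)
    if not match:
        raise ValueError(f"unexpected source file name: {source_file}")
    year, month, day, hour = match.groups()
    return f"orders_{year}_{month}_{day}_{hour}"

Because the key is a pure function of the source file, a retry cannot accidentally mint a new batch.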

Pattern 2: Load into staging, then merge

Instead of blindly inserting into the final table, use a deterministic MERGE or equivalent upsert logic. This is usually the safest default when records have a natural business key.

Example in SQL:

merge into mart_orders t
using (
    select
        order_id,
        customer_id,
        order_status,
        order_total,
        updated_at,
        batch_id
    from stg_orders
    where batch_id = 'orders_2025_02_18_10'
) s
on t.order_id = s.order_id
when matched and s.updated_at >= t.updated_at then
    update set
        customer_id = s.customer_id,
        order_status = s.order_status,
        order_total = s.order_total,
        updated_at = s.updated_at,
        batch_id = s.batch_id
when not matched then
    insert (order_id, customer_id, order_status, order_total, updated_at, batch_id)
    values (s.order_id, s.customer_id, s.order_status, s.order_total, s.updated_at, s.batch_id);

Now if the same batch is retried, the same rows are matched again and the target ends up in the same final state. That is much better than repeated plain inserts.
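
When the data has no reliable business key to merge on, the other option from the table above is to replace only the intended batch: delete whatever the target already holds for that batch_id and re-insert it from staging, inside one transaction. A rough sketch, reusing the execute_sql helper from the first example and assuming it can run a multi-statement script in one go:

def replace_batch(batch_id: str) -> None:
    # Delete-then-insert scoped to one batch: a retry rewrites exactly
    # the same slice of the target and touches nothing else.
    execute_sql(f"""
        begin;

        delete from mart_orders
        where batch_id = '{batch_id}';

        insert into mart_orders
        select order_id, customer_id, order_status, order_total, updated_at, batch_id
        from stg_orders
        where batch_id = '{batch_id}';

        commit;
    """)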

Pattern 3: Maintain a processed batch table

Another useful technique is to store job metadata in a control table. This table can track which batches started, which completed, row counts, hashes, and timestamps.

Example schema:

create table etl_batch_log (
    batch_id string,
    pipeline_name string,
    source_name string,
    status string,
    row_count integer,
    started_at timestamp,
    completed_at timestamp
);

Then the pipeline flow becomes:

  1. Check if batch_id already completed
  2. If yes, skip
  3. If not, mark it RUNNING
  4. Load and transform data
  5. Mark it COMPLETED only after the final commit succeeds

Pseudo-code:

if is_completed(batch_id):
    return

mark_running(batch_id)
try:
    load_stage(batch_id)
    merge_target(batch_id)
    mark_completed(batch_id)
except Exception:
    mark_failed(batch_id)
    raise

This is not enough by itself, but it gives the orchestrator and operators a clear state model. Also, if the job is rerun manually, we can see whether the batch really needs reprocessing.
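
To make that concrete, here is a rough sketch of what the helpers could look like against etl_batch_log. It assumes a connection object that exposes execute() and commit() directly in the sqlite3 style (most warehouse clients need the equivalent cursor calls), and mark_failed would mirror mark_completed with a different status.

from datetime import datetime, timezone

def is_completed(conn, batch_id: str, pipeline_name: str) -> bool:
    # True if this batch already finished successfully on an earlier run.
    row = conn.execute(
        "select 1 from etl_batch_log "
        "where batch_id = ? and pipeline_name = ? and status = 'COMPLETED'",
        (batch_id, pipeline_name),
    ).fetchone()
    return row is not None

def mark_running(conn, batch_id: str, pipeline_name: str) -> None:
    # Record that this batch has started; the log row is created here.
    conn.execute(
        "insert into etl_batch_log (batch_id, pipeline_name, status, started_at) "
        "values (?, ?, 'RUNNING', ?)",
        (batch_id, pipeline_name, datetime.now(timezone.utc)),
    )
    conn.commit()

def mark_completed(conn, batch_id: str, row_count: int) -> None:
    # Only called after the final write to the target has committed.
    conn.execute(
        "update etl_batch_log "
        "set status = 'COMPLETED', row_count = ?, completed_at = ? "
        "where batch_id = ?",
        (row_count, datetime.now(timezone.utc), batch_id),
    )
    conn.commit()

In practice we would also want a uniqueness guarantee on batch_id, or key the log by batch and attempt, so a retried batch does not leave multiple RUNNING rows behind.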

Pattern 4: Make staging cleanup deliberate

One mistake I have seen is deleting staging data too early. If staging rows are removed before the final load is confirmed, then a retry may not have the exact same input anymore. For that reason I prefer either:

  • partitioned staging tables with batch-level retention, or
  • temporary tables that live for the entire transaction or run

For our use case, a good compromise is to keep staging data for a few days and clean it with a separate housekeeping job. It costs a bit more storage, but it makes recovery easier.
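
A sketch of that housekeeping job, assuming the same connection style as above and Postgres-style interval syntax (adjust for your warehouse):

def clean_stage(conn, retention_days: int = 3) -> None:
    # Runs outside the load's critical path. Only batches that completed
    # more than `retention_days` ago lose their staging rows, so anything
    # that might still need a retry keeps its exact input available.
    conn.execute(
        f"""
        delete from stg_orders
        where batch_id in (
            select batch_id
            from etl_batch_log
            where status = 'COMPLETED'
              and completed_at < current_timestamp - interval '{retention_days} days'
        )
        """
    )
    conn.commit()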

Pattern 5: Design retries at the orchestration layer too

Even when SQL is idempotent, the orchestration setup still matters. Airflow, Glue workflows, Step Functions, and similar tools may retry tasks because of infrastructure issues, not business logic errors.

A few things to be careful about:

  • Do not retry forever on bad input data
  • Use shorter retries for transient network errors
  • Pass the same batch key on retry
  • Separate extract retry rules from load retry rules
  • Alert when a job keeps failing with the same batch

For example, if an API extract is not idempotent and each call returns moving data, then retrying the extract step may produce a different input set from the original attempt. In that case we may need to land the first response as a raw file and retry only from that stored snapshot.
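
One way to do that, as a sketch: key the raw snapshot by the batch and only call the API when no snapshot exists yet. The api_client.fetch_orders call and the raw/orders path are placeholders for whatever extract call and landing zone you actually use.

import json
from pathlib import Path

def extract_orders(api_client, batch_id: str, raw_dir: str = "raw/orders") -> list:
    # Land the first successful response as a raw snapshot keyed by batch_id.
    # A retry reads the stored snapshot instead of calling the API again,
    # so the input set cannot drift between attempts.
    snapshot = Path(raw_dir) / f"{batch_id}.json"
    if snapshot.exists():
        return json.loads(snapshot.read_text())

    records = api_client.fetch_orders(batch_id)
    snapshot.parent.mkdir(parents=True, exist_ok=True)
    snapshot.write_text(json.dumps(records))
    return records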

Production versus simple demo

In a demo, we can often get away with one staging table and one merge statement. In production, I would usually add a few more controls:

  • source file checksum or manifest validation
  • audit columns like ingested_at, batch_id, and run_id
  • dead-letter handling for bad records
  • stronger locking or concurrency control if two runs may overlap
  • data quality checks before marking the batch complete

Concurrency is a big one. A retry-safe pipeline can still break if two runs for the same batch execute at the same time and both think they own the work. Some platforms handle this with job-level concurrency limits. In other cases we need an application lock or a transactional control table.
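
With a transactional control table, the claim can be a single conditional update: only one run wins, and the loser skips the work. A rough sketch, assuming the same connection style as the earlier helpers, that etl_batch_log also carries the run_id audit column mentioned above, and that the batch row already exists:

def claim_batch(conn, batch_id: str, run_id: str) -> bool:
    # Atomically claim the batch. The update succeeds only if no other
    # run has already moved it to RUNNING (or finished it), so two
    # overlapping runs cannot both think they own the work.
    cursor = conn.execute(
        "update etl_batch_log "
        "set status = 'RUNNING', run_id = ? "
        "where batch_id = ? and status not in ('RUNNING', 'COMPLETED')",
        (run_id, batch_id),
    )
    conn.commit()
    return cursor.rowcount == 1

If claim_batch returns False, the run should stop rather than retry, and an operator can decide whether the other run is genuinely still working.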

A practical default approach

If I had to choose a simple pattern that works in many data projects, I would do this:

  1. Land source data unchanged
  2. Assign a stable batch key
  3. Record batch state in a control table
  4. Transform from the landed batch only
  5. Use MERGE into the target table
  6. Mark success only after the target write completes
  7. Clean old stage data later, not in the critical path

This is usually enough to make retries boring, and boring is what we want in ETL operations.

Retries are going to happen whether we plan for them or not. If the pipeline is idempotent, a retry becomes routine. If it is not, every rerun becomes a small incident. That is why it is worth building the batch key, merge logic, and control tables from the beginning instead of trying to patch duplicates later.
