How to Design Idempotent ETL Jobs in AWS

If your pipeline cannot be rerun safely, it is not production-ready.

Idempotency means:

Running the same job with the same input multiple times produces the same final output.

This is one of the most important reliability properties in ETL systems.

Why idempotency matters

Without idempotency:

  • retries can duplicate records
  • backfills can corrupt outputs
  • incident recovery becomes manual and risky

With idempotency:

  • retries are safe
  • replay is predictable
  • operations become calmer

Core design rules

  1. deterministic keys for business entities
  2. partition-aware write strategy
  3. clear upsert/merge logic
  4. run metadata and checkpoint tracking
  5. publish only after successful checks
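Rule 1 can be sketched in a few lines of Python: derive the surrogate key by hashing the business fields instead of minting a random UUID during the transform, so a rerun produces the same key for the same entity. (The function and field values here are illustrative.)

```python
import hashlib

def deterministic_key(*business_fields: str) -> str:
    """Derive a stable surrogate key from business fields.

    The same input always yields the same key, so reruns
    produce identical records instead of new duplicates.
    """
    raw = "|".join(business_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Same input, same key -- on every rerun.
k1 = deterministic_key("ORD-1001", "2024-06-01")
k2 = deterministic_key("ORD-1001", "2024-06-01")
assert k1 == k2
```

Contrast this with anti-pattern 3 below: `uuid4()` in the transform stage guarantees that every rerun creates "new" records.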

Write strategy options

Option A: overwrite partition

Good for daily snapshot partitions.

  • rewrite the specific partition atomically
  • avoid appending duplicates

Option B: merge/upsert

Good for CDC/incremental entities.

  • use business key + latest timestamp
  • resolve conflicts deterministically
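A minimal in-memory sketch of that conflict rule, assuming `order_id` as the business key and `updated_at` as the ordering column (the same columns the dedupe SQL later in this post uses):

```python
def merge_latest(target: dict, increments: list[dict]) -> dict:
    """Upsert incremental rows into target, keyed by order_id.

    Conflicts resolve deterministically: the row with the newest
    updated_at wins; on a tie, the existing row is kept, so
    replaying the same batch never changes the result.
    """
    for row in increments:
        key = row["order_id"]
        current = target.get(key)
        if current is None or row["updated_at"] > current["updated_at"]:
            target[key] = row
    return target
```

Applying the same increment batch twice leaves `target` unchanged, which is exactly the rerun safety the pattern is after.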

Example dedupe pattern (SQL)

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM staging.orders_incremental
)
SELECT *
FROM ranked
WHERE rn = 1;

Running this on every rerun converges on the same final record per order_id, so the published output stays stable.

Run metadata tracking

Maintain a pipeline_runs table with:

  • run_id
  • dataset
  • partition
  • status
  • started_at
  • completed_at
  • input checksum/version

This helps prevent accidental double-publish and supports replay audits.
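A sketch of that table and the double-publish check, using stdlib sqlite3 as a stand-in for whatever store you actually run (DynamoDB, Aurora, etc.); the schema mirrors the fields above, and the helper names are illustrative:

```python
import sqlite3

def init_runs(conn: sqlite3.Connection) -> None:
    # One row per run attempt; (dataset, partition) identifies the output.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pipeline_runs (
            run_id         TEXT PRIMARY KEY,
            dataset        TEXT NOT NULL,
            "partition"    TEXT NOT NULL,
            status         TEXT NOT NULL,
            started_at     TEXT,
            completed_at   TEXT,
            input_checksum TEXT
        )
    """)

def already_published(conn: sqlite3.Connection, dataset: str,
                      partition: str, checksum: str) -> bool:
    """True if this partition already succeeded from the same input."""
    row = conn.execute(
        'SELECT 1 FROM pipeline_runs '
        'WHERE dataset = ? AND "partition" = ? '
        "AND input_checksum = ? AND status = 'success'",
        (dataset, partition, checksum),
    ).fetchone()
    return row is not None
```

The input checksum is what makes the check precise: a rerun over identical input is a safe no-op, while changed input is detected and reprocessed.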

Idempotency + orchestration

Step Functions should:

  • check if partition already successfully published
  • skip if already complete and unchanged
  • rerun only when forced backfill is requested

This avoids accidental reprocessing loops.
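That guard reduces to a small pure function. In this sketch an in-memory dict stands in for a pipeline_runs lookup, and `should_run` and its parameters are illustrative names, not a Step Functions API:

```python
def should_run(published: dict, dataset: str, partition: str,
               checksum: str, force_backfill: bool = False) -> bool:
    """Decide whether the orchestrator should launch the job.

    Skip when the partition was already published from the same
    input; rerun only when the input changed or a backfill is forced.
    """
    if force_backfill:
        return True
    # published maps (dataset, partition) -> checksum of last success.
    return published.get((dataset, partition)) != checksum
```

In Step Functions this typically lands as a Choice state after a small lookup task, so the skip path costs almost nothing.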

Common anti-patterns

  1. append-only writes for mutable entities
  2. no unique business keys
  3. random UUIDs created during transform stage
  4. reruns with no run metadata tracking

Final take

Idempotency is not an advanced optimization.

It is a baseline reliability requirement for any serious data platform.

If your jobs can’t safely retry and replay, incident handling will always remain expensive.

This post is licensed under CC BY 4.0 by the author.