How to Design Idempotent ETL Jobs in AWS

If your pipeline cannot be rerun safely, it is not production-ready.

Idempotency means:

Running the same job with the same input multiple times produces the same final output.

This is one of the most important reliability properties in ETL systems.

Why idempotency matters

Without idempotency:

  • retries can duplicate records
  • backfills can corrupt outputs
  • incident recovery becomes manual and risky

With idempotency:

  • retries are safe
  • replay is predictable
  • operations become calmer

Core design rules

  1. deterministic keys for business entities
  2. partition-aware write strategy
  3. clear upsert/merge logic
  4. run metadata and checkpoint tracking
  5. publish only after successful checks
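Rule 1 can be sketched in a few lines of Python: derive the surrogate key by hashing the business fields instead of minting a random UUID during the transform, so a rerun produces the same key for the same entity. (The function and field values here are illustrative.)

```python
import hashlib

def deterministic_key(*business_fields: str) -> str:
    """Derive a stable surrogate key from business fields.

    The same input always yields the same key, so reruns
    produce identical records instead of new duplicates.
    """
    raw = "|".join(business_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Same input, same key -- on every rerun.
k1 = deterministic_key("ORD-1001", "2024-06-01")
k2 = deterministic_key("ORD-1001", "2024-06-01")
assert k1 == k2
```

Contrast this with anti-pattern 3 below: `uuid4()` in the transform stage guarantees that every rerun creates "new" records.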

Write strategy options

Option A: overwrite partition

Good for daily snapshot partitions.

  • rewrite the specific partition atomically
  • avoid appending duplicates

Option B: merge/upsert

Good for CDC/incremental entities.

  • use business key + latest timestamp
  • resolve conflicts deterministically
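A minimal in-memory sketch of that conflict rule, assuming `order_id` as the business key and `updated_at` as the ordering column (the same columns the dedupe SQL later in this post uses):

```python
def merge_latest(target: dict, increments: list[dict]) -> dict:
    """Upsert incremental rows into target, keyed by order_id.

    Conflicts resolve deterministically: the row with the newest
    updated_at wins; on a tie, the existing row is kept, so
    replaying the same batch never changes the result.
    """
    for row in increments:
        key = row["order_id"]
        current = target.get(key)
        if current is None or row["updated_at"] > current["updated_at"]:
            target[key] = row
    return target
```

Applying the same increment batch twice leaves `target` unchanged, which is exactly the rerun safety the pattern is after.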

Example dedupe pattern (SQL)

WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM staging.orders_incremental
)
SELECT *
FROM ranked
WHERE rn = 1;

Running this on every rerun converges on the same final record per order_id, so the published output stays stable.

Run metadata tracking

Maintain a pipeline_runs table with:

  • run_id
  • dataset
  • partition
  • status
  • started_at
  • completed_at
  • input checksum/version

This helps prevent accidental double-publish and supports replay audits.
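A sketch of that table and the double-publish check, using stdlib sqlite3 as a stand-in for whatever store you actually run (DynamoDB, Aurora, etc.); the schema mirrors the fields above, and the helper names are illustrative:

```python
import sqlite3

def init_runs(conn: sqlite3.Connection) -> None:
    # One row per run attempt; (dataset, partition) identifies the output.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pipeline_runs (
            run_id         TEXT PRIMARY KEY,
            dataset        TEXT NOT NULL,
            "partition"    TEXT NOT NULL,
            status         TEXT NOT NULL,
            started_at     TEXT,
            completed_at   TEXT,
            input_checksum TEXT
        )
    """)

def already_published(conn: sqlite3.Connection, dataset: str,
                      partition: str, checksum: str) -> bool:
    """True if this partition already succeeded from the same input."""
    row = conn.execute(
        'SELECT 1 FROM pipeline_runs '
        'WHERE dataset = ? AND "partition" = ? '
        "AND input_checksum = ? AND status = 'success'",
        (dataset, partition, checksum),
    ).fetchone()
    return row is not None
```

The input checksum is what makes the check precise: a rerun over identical input is a safe no-op, while changed input is detected and reprocessed.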

Idempotency + orchestration

Step Functions should:

  • check if partition already successfully published
  • skip if already complete and unchanged
  • rerun only when forced backfill is requested

This avoids accidental reprocessing loops.
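That guard reduces to a small pure function. In this sketch an in-memory dict stands in for a pipeline_runs lookup, and `should_run` and its parameters are illustrative names, not a Step Functions API:

```python
def should_run(published: dict, dataset: str, partition: str,
               checksum: str, force_backfill: bool = False) -> bool:
    """Decide whether the orchestrator should launch the job.

    Skip when the partition was already published from the same
    input; rerun only when the input changed or a backfill is forced.
    """
    if force_backfill:
        return True
    # published maps (dataset, partition) -> checksum of last success.
    return published.get((dataset, partition)) != checksum
```

In Step Functions this typically lands as a Choice state after a small lookup task, so the skip path costs almost nothing.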

Common anti-patterns

  1. append-only writes for mutable entities
  2. no unique business keys
  3. random UUIDs created during transform stage
  4. reruns with no run metadata tracking

Final take

Idempotency is not an advanced optimization.

It is a baseline reliability requirement for any serious data platform.

If your jobs can’t safely retry and replay, incident handling will always remain expensive.

This post is licensed under CC BY 4.0 by the author.