Step Functions + Glue: A Practical Reference Architecture for Reliable Pipelines

A lot of Glue-based pipelines start simple and become fragile quickly.

Typical symptoms:

  • one failed job breaks the full daily load
  • retry logic is unclear
  • quality checks are inconsistent
  • backfills are painful

This is where Step Functions becomes useful. Not because it is fashionable, but because it makes failure behavior explicit.

In this post, I’ll share a practical architecture for teams running AWS data pipelines at beginner/intermediate maturity.

Why Step Functions + Glue works well

Glue gives managed Spark transforms. Step Functions gives stateful orchestration with visibility.

Together, you get:

  • deterministic execution flow
  • explicit retries and timeout behavior
  • clean failure paths
  • easier observability and replay

Core architecture

At a high level:

  1. Scheduler triggers state machine
  2. State machine validates input availability
  3. Glue job runs raw -> clean transform
  4. Quality checks run
  5. If pass, publish curated layer
  6. If fail, alert + send to replay path

The same flow as a Mermaid diagram:

flowchart TD
    A[EventBridge Schedule] --> B[Step Functions]
    B --> C[Check Input Availability]
    C --> D[Run Glue Job]
    D --> E[Run Quality Checks]
    E -->|Pass| F[Publish Curated Layer]
    E -->|Fail| G[Alert + Quarantine]
    G --> H[Replay Queue]

State machine design principles

Keep these principles:

  1. Small states, clear intent
  2. No hidden retry loops inside scripts
  3. Each failure path must end in an action
  4. Carry run metadata through all states

Useful metadata to pass:

  • dataset name
  • partition date
  • run_id
  • source location
  • quality threshold profile
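As a concrete sketch, this metadata can be assembled once at trigger time and passed as the execution input, so every downstream state reads the same payload instead of recomputing values. Field names here are illustrative, not a fixed schema:

```python
import json

def build_run_input(dataset: str, partition_date: str, source_uri: str,
                    quality_profile: str = "default") -> dict:
    """Assemble the execution input for the state machine.

    Downstream states read these fields instead of recomputing them,
    which keeps retries and replays deterministic.
    """
    return {
        "dataset": dataset,
        "partition_date": partition_date,
        "run_id": f"{partition_date}-{dataset}",  # stable, human-readable key
        "source_location": source_uri,
        "quality_profile": quality_profile,
    }

payload = build_run_input("orders", "2024-09-08", "s3://raw/orders/dt=2024-09-08/")
print(json.dumps(payload, indent=2))
```

Deriving `run_id` from the dataset and partition date (rather than a random UUID) also gives you idempotency for free: a second trigger of the same window produces the same key.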

Retry strategy that avoids chaos

Not all failures are equal.

  • Transient failures (network throttling, temporary service issue)
    • use retries with backoff
  • Deterministic failures (schema mismatch, bad input)
    • fail fast, quarantine, alert

Example Step Functions task retry pattern:

"RunGlueJob": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Retry": [
    {
      "ErrorEquals": ["Glue.ConcurrentRunsExceededException", "States.Timeout"],
      "IntervalSeconds": 30,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "NotifyFailure"
    }
  ],
  "Next": "RunQualityChecks"
}

The quality stage should be mandatory


Do not treat quality checks as optional post-processing.

A simple pattern:

  • run SQL checks in Athena or Lambda
  • return PASS / FAIL
  • block publish on fail
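A minimal sketch of that gate as a Lambda-style check function. The check names and row schema are illustrative; a real implementation would typically run SQL in Athena rather than over in-memory rows:

```python
def run_quality_checks(rows: list[dict], run_id: str) -> dict:
    """Run simple checks and return the payload the state machine
    branches on. A FAIL status blocks the publish step."""
    failed_checks = []

    # Uniqueness check: no duplicate order IDs in the partition.
    order_ids = [row["order_id"] for row in rows]
    if len(order_ids) != len(set(order_ids)):
        failed_checks.append("duplicate_order_id")

    # Freshness check: the partition must actually contain rows.
    if not rows:
        failed_checks.append("freshness_sla")

    return {
        "status": "FAIL" if failed_checks else "PASS",
        "failed_checks": failed_checks,
        "run_id": run_id,
    }

result = run_quality_checks(
    [{"order_id": 1}, {"order_id": 1}], run_id="2024-09-08-orders-daily"
)
print(result["status"], result["failed_checks"])  # → FAIL ['duplicate_order_id']
```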

Example quality output payload:

{
  "status": "FAIL",
  "failed_checks": ["duplicate_order_id", "freshness_sla"],
  "run_id": "2024-09-08-orders-daily"
}

This makes incident response faster than scanning generic logs.

Replay design (the most underrated part)

Most pipelines support retries, but not replay.

Replay-ready design means:

  • keep raw files immutable
  • partition outputs by run window
  • isolate failed partitions
  • replay only failed window, not full history
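The replay selection itself can be a small pure function: given run history, pick only the failed windows and map them back to their partition paths. The `dt=` bucket layout here is a hypothetical example:

```python
def partition_uri(base_uri: str, dataset: str, window: str) -> str:
    """Partition outputs by run window so one window can be
    rebuilt in isolation."""
    return f"{base_uri}/{dataset}/dt={window}/"

def windows_to_replay(run_history: list[dict]) -> list[str]:
    """Select only failed windows; passed history is never reprocessed."""
    return [run["window"] for run in run_history if run["status"] == "FAIL"]

history = [
    {"window": "2024-09-06", "status": "PASS"},
    {"window": "2024-09-07", "status": "FAIL"},
    {"window": "2024-09-08", "status": "PASS"},
]
for window in windows_to_replay(history):
    print(partition_uri("s3://clean", "orders", window))
# → s3://clean/orders/dt=2024-09-07/
```

Because raw files are immutable and outputs are partitioned by window, re-running the state machine with just these windows as input is safe and cheap.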

This reduces cost and shortens recovery time.

Monitoring checklist

Track these metrics for each pipeline:

  • success/failure count by day
  • median + p95 run duration
  • quality fail rate
  • replay frequency
  • data freshness lag
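Most of these can be derived from plain run records with a few lines. The record schema (`status`, `duration_s`) is an assumption for illustration:

```python
import math
from statistics import median

def pipeline_metrics(runs: list[dict]) -> dict:
    """Summarize per-pipeline health from a list of run records."""
    durations = sorted(run["duration_s"] for run in runs)
    failures = sum(1 for run in runs if run["status"] == "FAIL")
    # Nearest-rank p95: with small samples this rounds up toward
    # the slowest observed run.
    p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
    return {
        "runs": len(runs),
        "quality_fail_rate": failures / len(runs),
        "median_duration_s": median(durations),
        "p95_duration_s": durations[p95_index],
    }
```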

Alert thresholds should be business-facing, not infrastructure-facing.

For example:

  • “orders dataset not available by 7:30 AM”

is better than:

  • “task X failed with code 137”

Cost notes

Step Functions adds orchestration cost, but usually saves money by:

  • reducing failed reruns
  • minimizing full reprocessing
  • shortening investigation time

Glue costs can be controlled by:

  • right-sizing worker count
  • compacting files upstream
  • limiting unnecessary repartitions

Minimal reference repo structure

repo/
  orchestrations/
    orders_daily_state_machine.json
  glue/
    jobs/
      orders_raw_to_clean.py
      orders_clean_to_curated.py
  quality/
    checks/
      orders_quality.sql
  infra/
    step_functions.tf
    glue_jobs.tf
    iam.tf

This separation keeps ownership clear and reviews simpler.

How this helps AI-ready data platforms

If you plan to serve AI use cases later, this architecture helps because:

  • curated data quality is enforced
  • lineage is easier to track
  • replay patterns support feature recomputation
  • run metadata helps model/feature auditability

In short, reliable orchestration is a data engineering capability that directly supports AI engineering maturity.

Final take

If your Glue pipelines are growing beyond one simple daily job, Step Functions is usually the right orchestration upgrade.

The key is not to over-engineer the first version. Start with:

  • one state machine
  • one quality gate
  • one replay path

Then evolve from usage and incidents.

In the next post, I’ll cover a practical migration checklist for engineers moving from GCP Dataflow orchestration patterns to AWS-native workflow patterns.

This post is licensed under CC BY 4.0 by the author.