Step Functions + Glue: A Practical Reference Architecture for Reliable Pipelines

A lot of Glue-based pipelines start simple and become fragile quickly.

Typical symptoms:

  • one failed job breaks the full daily load
  • retry logic is unclear
  • quality checks are inconsistent
  • backfills are painful

This is where Step Functions becomes useful. Not because it is fashionable, but because it makes failure behavior explicit.

In this post, I’ll share a practical architecture for teams running AWS data pipelines at beginner/intermediate maturity.

Why Step Functions + Glue works well

Glue gives managed Spark transforms. Step Functions gives stateful orchestration with visibility.

Together, you get:

  • deterministic execution flow
  • explicit retries and timeout behavior
  • clean failure paths
  • easier observability and replay

Core architecture

At a high level:

  1. Scheduler triggers state machine
  2. State machine validates input availability
  3. Glue job runs raw -> clean transform
  4. Quality checks run
  5. If pass, publish curated layer
  6. If fail, alert + send to replay path

The same flow as a Mermaid diagram:

flowchart TD
    A[EventBridge Schedule] --> B[Step Functions]
    B --> C[Check Input Availability]
    C --> D[Run Glue Job]
    D --> E[Run Quality Checks]
    E -->|Pass| F[Publish Curated Layer]
    E -->|Fail| G[Alert + Quarantine]
    G --> H[Replay Queue]

State machine design principles

Keep these principles:

  1. Small states, clear intent
  2. No hidden retry loops inside scripts
  3. Each failure path must end in an action
  4. Carry run metadata through all states

Useful metadata to pass:

  • dataset name
  • partition date
  • run_id
  • source location
  • quality threshold profile
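As a concrete sketch, this metadata can be assembled once at trigger time and passed as the execution input, so every downstream state reads the same payload instead of recomputing values. Field names here are illustrative, not a fixed schema:

```python
import json

def build_run_input(dataset: str, partition_date: str, source_uri: str,
                    quality_profile: str = "default") -> dict:
    """Assemble the execution input for the state machine.

    Downstream states read these fields instead of recomputing them,
    which keeps retries and replays deterministic.
    """
    return {
        "dataset": dataset,
        "partition_date": partition_date,
        "run_id": f"{partition_date}-{dataset}",  # stable, human-readable key
        "source_location": source_uri,
        "quality_profile": quality_profile,
    }

payload = build_run_input("orders", "2024-09-08", "s3://raw/orders/dt=2024-09-08/")
print(json.dumps(payload, indent=2))
```

Deriving `run_id` from the dataset and partition date (rather than a random UUID) also gives you idempotency for free: a second trigger of the same window produces the same key.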

Retry strategy that avoids chaos

Not all failures are equal.

  • Transient failures (network throttling, temporary service issue)
    • use retries with backoff
  • Deterministic failures (schema mismatch, bad input)
    • fail fast, quarantine, alert

Example Step Functions task retry pattern:

"RunGlueJob": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Retry": [
    {
      "ErrorEquals": ["Glue.ConcurrentRunsExceededException", "States.Timeout"],
      "IntervalSeconds": 30,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "NotifyFailure"
    }
  ],
  "Next": "RunQualityChecks"
}

The quality stage should be mandatory


Do not treat quality checks as optional post-processing.

A simple pattern:

  • run SQL checks in Athena or Lambda
  • return PASS / FAIL
  • block publish on fail
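A minimal sketch of that gate as a Lambda-style check function. The check names and row schema are illustrative; a real implementation would typically run SQL in Athena rather than over in-memory rows:

```python
def run_quality_checks(rows: list[dict], run_id: str) -> dict:
    """Run simple checks and return the payload the state machine
    branches on. A FAIL status blocks the publish step."""
    failed_checks = []

    # Uniqueness check: no duplicate order IDs in the partition.
    order_ids = [row["order_id"] for row in rows]
    if len(order_ids) != len(set(order_ids)):
        failed_checks.append("duplicate_order_id")

    # Freshness check: the partition must actually contain rows.
    if not rows:
        failed_checks.append("freshness_sla")

    return {
        "status": "FAIL" if failed_checks else "PASS",
        "failed_checks": failed_checks,
        "run_id": run_id,
    }

result = run_quality_checks(
    [{"order_id": 1}, {"order_id": 1}], run_id="2024-09-08-orders-daily"
)
print(result["status"], result["failed_checks"])  # → FAIL ['duplicate_order_id']
```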

Example quality output payload:

{
  "status": "FAIL",
  "failed_checks": ["duplicate_order_id", "freshness_sla"],
  "run_id": "2024-09-08-orders-daily"
}

This makes incident response faster than scanning generic logs.

Replay design (the most underrated part)

Most pipelines support retries, but not replay.

Replay-ready design means:

  • keep raw files immutable
  • partition outputs by run window
  • isolate failed partitions
  • replay only failed window, not full history
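The replay selection itself can be a small pure function: given run history, pick only the failed windows and map them back to their partition paths. The `dt=` bucket layout here is a hypothetical example:

```python
def partition_uri(base_uri: str, dataset: str, window: str) -> str:
    """Partition outputs by run window so one window can be
    rebuilt in isolation."""
    return f"{base_uri}/{dataset}/dt={window}/"

def windows_to_replay(run_history: list[dict]) -> list[str]:
    """Select only failed windows; passed history is never reprocessed."""
    return [run["window"] for run in run_history if run["status"] == "FAIL"]

history = [
    {"window": "2024-09-06", "status": "PASS"},
    {"window": "2024-09-07", "status": "FAIL"},
    {"window": "2024-09-08", "status": "PASS"},
]
for window in windows_to_replay(history):
    print(partition_uri("s3://clean", "orders", window))
# → s3://clean/orders/dt=2024-09-07/
```

Because raw files are immutable and outputs are partitioned by window, re-running the state machine with just these windows as input is safe and cheap.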

This reduces cost and shortens recovery time.

Monitoring checklist

Track these metrics for each pipeline:

  • success/failure count by day
  • median + p95 run duration
  • quality fail rate
  • replay frequency
  • data freshness lag
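Most of these can be derived from plain run records with a few lines. The record schema (`status`, `duration_s`) is an assumption for illustration:

```python
import math
from statistics import median

def pipeline_metrics(runs: list[dict]) -> dict:
    """Summarize per-pipeline health from a list of run records."""
    durations = sorted(run["duration_s"] for run in runs)
    failures = sum(1 for run in runs if run["status"] == "FAIL")
    # Nearest-rank p95: with small samples this rounds up toward
    # the slowest observed run.
    p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
    return {
        "runs": len(runs),
        "quality_fail_rate": failures / len(runs),
        "median_duration_s": median(durations),
        "p95_duration_s": durations[p95_index],
    }
```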

Alert thresholds should be business-facing, not infrastructure-facing.

For example:

  • “orders dataset not available by 7:30 AM”

is better than:

  • “task X failed with code 137”

Cost notes

Step Functions adds orchestration cost, but usually saves money by:

  • reducing failed reruns
  • minimizing full reprocessing
  • shortening investigation time

Glue costs can be controlled by:

  • right-sizing worker count
  • compacting files upstream
  • limiting unnecessary repartitions

Minimal reference repo structure

repo/
  orchestrations/
    orders_daily_state_machine.json
  glue/
    jobs/
      orders_raw_to_clean.py
      orders_clean_to_curated.py
  quality/
    checks/
      orders_quality.sql
  infra/
    step_functions.tf
    glue_jobs.tf
    iam.tf

This separation keeps ownership clear and reviews simpler.

How this helps AI-ready data platforms

If you plan to serve AI use cases later, this architecture helps because:

  • curated data quality is enforced
  • lineage is easier to track
  • replay patterns support feature recomputation
  • run metadata helps model/feature auditability

In short, reliable orchestration is a data engineering capability that directly supports AI engineering maturity.

Final take

If your Glue pipelines are growing beyond one simple daily job, Step Functions is usually the right orchestration upgrade.

The key is not to over-engineer the first version. Start with:

  • one state machine
  • one quality gate
  • one replay path

Then evolve from usage and incidents.

In the next post, I’ll cover a practical migration checklist for engineers moving from GCP Dataflow orchestration patterns to AWS-native workflow patterns.

This post is licensed under CC BY 4.0 by the author.