Step Functions + Glue: A Practical Reference Architecture for Reliable Pipelines
Many Glue-based pipelines start simple and become fragile quickly.
Typical symptoms:
- one failed job breaks the full daily load
- retry logic is unclear
- quality checks are inconsistent
- backfills are painful
This is where Step Functions becomes useful. Not because it is fashionable, but because it makes failure behavior explicit.
In this post, I’ll share a practical architecture for teams running AWS data pipelines at beginner/intermediate maturity.
Why Step Functions + Glue works well
Glue gives managed Spark transforms. Step Functions gives stateful orchestration with visibility.
Together, you get:
- deterministic execution flow
- explicit retries and timeout behavior
- clean failure paths
- easier observability and replay
Core architecture
At a high level:
- Scheduler triggers state machine
- State machine validates input availability
- Glue job runs raw -> clean transform
- Quality checks run
- If pass, publish curated layer
- If fail, alert + send to replay path
```mermaid
flowchart TD
  A[EventBridge Schedule] --> B[Step Functions]
  B --> C[Check Input Availability]
  C --> D[Run Glue Job]
  D --> E[Run Quality Checks]
  E -->|Pass| F[Publish Curated Layer]
  E -->|Fail| G[Alert + Quarantine]
  G --> H[Replay Queue]
```
State machine design principles
Keep these principles:
- Small states, clear intent
- No hidden retry loops inside scripts
- Each failure path must end in an action
- Carry run metadata through all states
Useful metadata to pass:
- dataset name
- partition date
- run_id
- source location
- quality threshold profile
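One way to carry this metadata is to assemble it once at launch time and pass it as the Step Functions execution input, so every state sees the same context. A minimal sketch (field names and the `run_id` convention are illustrative, not a fixed schema):

```python
import json
from datetime import date

def build_execution_input(dataset: str, partition_date: date,
                          source: str, profile: str = "default") -> str:
    """Assemble run metadata as a Step Functions execution input (JSON string)."""
    payload = {
        "dataset": dataset,
        "partition_date": partition_date.isoformat(),
        # One deterministic run_id per dataset/day keeps replays idempotent.
        "run_id": f"{partition_date.isoformat()}-{dataset}",
        "source_location": source,
        "quality_profile": profile,
    }
    return json.dumps(payload)

# Example: input for the daily orders load
event = build_execution_input("orders-daily", date(2024, 9, 8),
                              "s3://raw-bucket/orders/")
```

The string returned here is what you would hand to `start_execution`; because `run_id` is derived from the dataset and partition date, re-running the same window produces the same identifier.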
Retry strategy that avoids chaos
Not all failures are equal.
- Transient failures (network throttling, temporary service issues)
  - use retries with backoff
- Deterministic failures (schema mismatch, bad input)
  - fail fast, quarantine, alert
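The same split can be encoded as a small routing function, for example in a Lambda that decides the next state. The error names below are assumptions for illustration; adjust them to the exceptions your jobs actually raise:

```python
# Route errors to retry-with-backoff vs fail-fast paths.
TRANSIENT = {
    "Glue.ConcurrentRunsExceededException",
    "States.Timeout",
    "ThrottlingException",
}
DETERMINISTIC = {"SchemaMismatchError", "BadInputError"}

def failure_action(error_name: str) -> str:
    """Decide what to do with a failed run based on the error class."""
    if error_name in TRANSIENT:
        return "retry_with_backoff"
    # Deterministic and unknown errors both fail fast: retrying a schema
    # mismatch just burns money without changing the outcome.
    return "quarantine_and_alert"
```

Treating unknown errors as deterministic by default is a deliberate choice: it is cheaper to alert once than to retry a failure that can never succeed.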
Example Step Functions task retry pattern:
"RunGlueJob": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Retry": [
{
"ErrorEquals": ["Glue.ConcurrentRunsExceededException", "States.Timeout"],
"IntervalSeconds": 30,
"MaxAttempts": 3,
"BackoffRate": 2.0
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "NotifyFailure"
}
],
"Next": "RunQualityChecks"
}
Quality stage should be mandatory
Do not treat quality checks as optional post-processing.
A simple pattern:
- run SQL checks in Athena or Lambda
- return PASS/FAIL
- block publish on fail
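A Lambda-style quality gate can be a plain function that runs its checks and returns the verdict the publish step branches on. The two checks below (duplicate keys, empty input as a freshness proxy) are illustrative, not a complete suite:

```python
def run_quality_checks(rows: list, run_id: str) -> dict:
    """Run simple checks over a batch and return a PASS/FAIL payload."""
    failed = []

    # Check 1: no duplicate order IDs in the batch.
    order_ids = [r["order_id"] for r in rows]
    if len(order_ids) != len(set(order_ids)):
        failed.append("duplicate_order_id")

    # Check 2: an empty batch means the source missed its window.
    if not rows:
        failed.append("freshness_sla")

    return {
        "status": "FAIL" if failed else "PASS",
        "failed_checks": failed,
        "run_id": run_id,
    }

result = run_quality_checks(
    [{"order_id": 1}, {"order_id": 1}], "2024-09-08-orders-daily")
```

The state machine then routes on `status`: `PASS` goes to publish, anything else goes to alert and quarantine.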
Example quality output payload:
```json
{
  "status": "FAIL",
  "failed_checks": ["duplicate_order_id", "freshness_sla"],
  "run_id": "2024-09-08-orders-daily"
}
```
This makes incident response faster than scanning generic logs.
Replay design (the most underrated part)
Most pipelines support retries, but not replay.
Replay-ready design means:
- keep raw files immutable
- partition outputs by run window
- isolate failed partitions
- replay only failed window, not full history
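The last point is where the savings come from: the replay step should derive its work list from the failed windows, never from full history. A minimal sketch, assuming Hive-style `dt=` partition names:

```python
def partitions_to_replay(all_partitions: list, failed_partitions: list) -> list:
    """Return only the failed windows, in the original chronological order."""
    failed = set(failed_partitions)
    return [p for p in all_partitions if p in failed]

replay = partitions_to_replay(
    ["dt=2024-09-06", "dt=2024-09-07", "dt=2024-09-08"],
    ["dt=2024-09-07"],
)
```

Because raw files are immutable and outputs are partitioned by run window, re-running just `dt=2024-09-07` overwrites exactly one partition and leaves the rest of history untouched.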
This reduces cost and shortens recovery time.
Monitoring checklist
Track these metrics for each pipeline:
- success/failure count by day
- median + p95 run duration
- quality fail rate
- replay frequency
- data freshness lag
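Median and p95 duration are cheap to compute from per-run durations you already log. A sketch using a nearest-rank p95 (one of several valid percentile definitions):

```python
import statistics

def duration_stats(durations_sec: list) -> dict:
    """Median and p95 run duration from a list of durations in seconds."""
    ordered = sorted(durations_sec)
    # Nearest-rank p95: the value at ceil(0.95 * n), 1-indexed.
    idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return {"median": statistics.median(ordered), "p95": ordered[idx]}

stats = duration_stats([120, 130, 125, 140, 600])
```

The gap between the two numbers is the point: a median of ~130s with a p95 of 600s tells you most runs are healthy but something occasionally stalls, which a single average would hide.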
Alert thresholds should be business-facing, not infrastructure-facing.
For example:
- “orders dataset not available by 7:30 AM”
is better than:
- “task X failed with code 137”
Cost notes
Step Functions adds orchestration cost, but usually saves money by:
- reducing failed reruns
- minimizing full reprocessing
- shortening investigation time
Glue costs can be controlled by:
- right-sizing worker count
- compacting files upstream
- limiting unnecessary repartitions
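Right-sizing is easier when you can put a number on a configuration. A rough cost sketch; the per-DPU-hour rate below is an assumption for illustration and varies by region and worker type, so substitute your own:

```python
def glue_job_cost(workers: int, dpu_per_worker: float, hours: float,
                  rate_per_dpu_hour: float = 0.44) -> float:
    """Rough Glue run cost: total DPU-hours times an assumed hourly rate."""
    return workers * dpu_per_worker * hours * rate_per_dpu_hour

# Illustrative: 10 workers at 1 DPU each for a 30-minute run
cost = glue_job_cost(workers=10, dpu_per_worker=1, hours=0.5)
```

Even a rough model like this makes trade-offs concrete: halving runtime by doubling workers is roughly cost-neutral, while compacting inputs so the same job finishes faster on the same workers is pure savings.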
Minimal reference repo structure
```text
repo/
  orchestrations/
    orders_daily_state_machine.json
  glue/
    jobs/
      orders_raw_to_clean.py
      orders_clean_to_curated.py
  quality/
    checks/
      orders_quality.sql
  infra/
    step_functions.tf
    glue_jobs.tf
    iam.tf
```
This separation keeps ownership clear and reviews simpler.
How this helps AI-ready data platforms
If you plan to serve AI use cases later, this architecture helps because:
- curated data quality is enforced
- lineage is easier to track
- replay patterns support feature recomputation
- run metadata helps model/feature auditability
In short, reliable orchestration is a data engineering capability that directly supports AI engineering maturity.
Final take
If your Glue pipelines are growing beyond one simple daily job, Step Functions is usually the right orchestration upgrade.
The key is not to over-engineer the first version. Start with:
- one state machine
- one quality gate
- one replay path
Then evolve from usage and incidents.
In the next post, I’ll cover a practical migration checklist for engineers moving from GCP Dataflow orchestration patterns to AWS-native workflow patterns.