Using Step Functions with AWS Glue for simple orchestration
In this article, we will look at how to use AWS Step Functions together with AWS Glue for simple orchestration, and why this is often a good choice when you do not want to run a full scheduler or a heavy workflow platform. If your pipeline is mostly a few sequential steps, such as validating input, running one or two Glue jobs, and marking success or failure, this approach is usually enough and easy to reason about.
Many teams start with a single Glue job and trigger it manually or on a schedule. That works for the first version. The problem comes when the flow needs a bit more logic. Maybe you want one job to run only if a file is present. Maybe you want to run a second job only if the first job succeeds. Maybe you want clean retries and one place to see where the execution failed. That is where Step Functions fits nicely.
The key point is that Step Functions does not replace Glue. Glue still does the actual ETL work; Step Functions provides the control flow around it. Think of it like this:
- Glue does the data processing
- Step Functions decides what runs next
- CloudWatch logs help with debugging
- IAM controls what each piece can do
A simple use case
Let us assume we receive a CSV file in S3 every day. We want to:
- Start a raw-to-bronze Glue job
- If that succeeds, start a bronze-to-curated Glue job
- If anything fails, stop the workflow and surface the error
For a simple pipeline, this is enough orchestration. We do not need Airflow just for two dependent jobs.
When this approach makes sense
A quick comparison may help.
| Option | Best for | Limitation |
|---|---|---|
| Glue trigger only | Very simple job chains inside Glue | Limited visibility and branching |
| EventBridge schedule + Glue | One scheduled job | No real workflow logic |
| Step Functions + Glue | Small to medium workflows with retries and branching | State machine definitions add some setup |
| Airflow or MWAA | Bigger platforms with many dependencies | More operational overhead |
For our use case, Step Functions sits in a nice middle ground. It is more capable than a plain schedule, but much simpler than running a workflow platform.
The architecture
A basic setup can look like this:
S3 file arrives
-> Step Functions execution starts
-> Glue job: ingest_raw_orders
-> Glue job: build_curated_orders
-> Success notification or end state
You can start the Step Functions execution in different ways:
- on a schedule using EventBridge
- from a Lambda after an S3 event
- manually for testing
- from another application or API
For a demo, I usually start it manually first because it is easier to debug one layer at a time. In production, I would connect it to an EventBridge schedule or an event-driven trigger, depending on how the source files arrive.
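Starting it manually does not have to mean clicking through the console either. Here is a minimal boto3 sketch for kicking off an execution with some runtime input; the state machine ARN and execution name are placeholders, not values from a real setup:

import json

import boto3

sfn = boto3.client("stepfunctions")

# Start one execution of the state machine with the runtime values
# the Glue jobs expect. The ARN below is only a placeholder.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:orders-pipeline",
    name="orders-2025-01-26",  # optional, but must be unique per execution
    input=json.dumps({
        "process_date": "2025-01-26",
        "input_path": "s3://demo-raw/orders/2025-01-26/orders.csv"
    }),
)

print(response["executionArn"])

The same call works from a Lambda behind an S3 event, which is one way to make the trigger event-driven later without changing the state machine itself.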
A sample Step Functions definition
Below is a simple state machine definition using the Glue integration.
{
  "StartAt": "RunRawJob",
  "States": {
    "RunRawJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "ingest_raw_orders",
        "Arguments": {
          "--input_path": "s3://demo-raw/orders/2025-01-26/orders.csv",
          "--output_path": "s3://demo-bronze/orders/"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 30,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Next": "RunCuratedJob"
    },
    "RunCuratedJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "build_curated_orders",
        "Arguments": {
          "--input_path": "s3://demo-bronze/orders/",
          "--output_path": "s3://demo-curated/orders/"
        }
      },
      "End": true
    }
  }
}
The important thing here is the .sync integration. That makes Step Functions wait for the Glue job to complete instead of just firing it and moving on immediately. For simple orchestration, that is usually what we want.
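For contrast, this is roughly the polling loop you would otherwise have to write yourself if you started the job asynchronously, for example from a script or a Lambda. A minimal sketch, with the wait interval as an arbitrary choice:

import time

import boto3

glue = boto3.client("glue")

# Start the job asynchronously and poll until it reaches a final state.
run = glue.start_job_run(JobName="ingest_raw_orders")
run_id = run["JobRunId"]

while True:
    state = glue.get_job_run(JobName="ingest_raw_orders", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(state)

With the .sync integration, Step Functions does this waiting for you and surfaces a failed job run as a failed state in the execution history.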
A simple Glue job script
The Glue job itself can stay very plain. For example, a PySpark job may look like this:
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Resolve the arguments passed in by the Step Functions task
args = getResolvedOptions(sys.argv, ["input_path", "output_path"])
spark = SparkSession.builder.getOrCreate()

# Read the raw CSV, drop duplicate orders, and write Parquet to the bronze layer
df = spark.read.option("header", True).csv(args["input_path"])
clean_df = df.dropDuplicates(["order_id"])
clean_df.write.mode("overwrite").parquet(args["output_path"])
What I like here is that orchestration logic stays outside the ETL code. The Glue script focuses on reading, transforming, and writing data. Step Functions handles the job order and retry behavior. That separation makes debugging easier.
Passing dynamic values between steps
One useful pattern is to pass runtime values like a business date, S3 path, or table name into the state machine input. Then each Glue job receives those values as arguments.
For example, the execution input could be:
{
  "process_date": "2025-01-26",
  "input_path": "s3://demo-raw/orders/2025-01-26/orders.csv"
}
And inside the Step Functions task:
"Arguments": {
"--process_date.$": "$.process_date",
"--input_path.$": "$.input_path",
"--output_path": "s3://demo-bronze/orders/"
}
This is much cleaner than hardcoding file names or dates into the job definition.
Things to be careful about
This setup is simple, but there are still a few practical caveats.
1. Glue startup time
Glue jobs are not always instant. For small jobs, the startup time can feel longer than the processing itself. If the transformation is tiny and latency matters, Lambda or ECS may be a better fit.
2. Retries can duplicate work
If you retry a failed Glue job blindly, make sure the job is idempotent or writes to a safe staging location. Otherwise, you may duplicate data or partially overwrite outputs.
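One simple way to make the raw-to-bronze job safe to retry is to scope each write to the partition for the current process_date, so a second attempt overwrites the same prefix instead of adding rows elsewhere. A rough sketch, reusing the process_date argument from the previous section; the exact path layout is just one option:

import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ["process_date", "input_path"])
spark = SparkSession.builder.getOrCreate()

# Each run writes only under its own process_date prefix, so a retry
# simply rewrites that prefix instead of duplicating data in other dates.
output_path = f"s3://demo-bronze/orders/process_date={args['process_date']}/"

df = spark.read.option("header", True).csv(args["input_path"])
df.dropDuplicates(["order_id"]).write.mode("overwrite").parquet(output_path)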
3. State machine errors are only part of the picture
If a Glue job fails, Step Functions shows you the failed state, which is useful, but you still need CloudWatch logs and Glue run logs to understand the root cause. Step Functions improves visibility, but it does not replace job-level logging.
4. IAM permissions need to be set up correctly
Your Step Functions role must be allowed to start Glue jobs, and the Glue role must be allowed to read and write the right S3 locations. Many early failures are just permission issues.
What I would change in production
For a small demo, two sequential tasks are enough. In a production setup, I would usually add a few more things:
- input validation before running the first Glue job
- failure notification through SNS, Slack, or email (a sketch of one option follows below)
- explicit timeout handling
- separate dev and prod state machines
- infrastructure as code using Terraform or CloudFormation
- better partitioning and data quality checks inside the Glue jobs
If multiple datasets all follow the same pattern, I would also think about creating a reusable state machine template rather than copying JSON for every pipeline.
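For the notification point above, one common pattern is a small Lambda that listens for the "Step Functions Execution Status Change" event from EventBridge and publishes failures to SNS. This is only a sketch; the EventBridge rule filtering on FAILED executions and the SNS topic are assumptions you would set up separately:

import json
import os

import boto3

sns = boto3.client("sns")

def handler(event, context):
    # EventBridge delivers the execution details under "detail".
    detail = event.get("detail", {})
    message = {
        "executionArn": detail.get("executionArn"),
        "status": detail.get("status"),
        "stopDate": detail.get("stopDate"),
    }
    # The topic ARN is assumed to be set as an environment variable on the Lambda.
    sns.publish(
        TopicArn=os.environ["ALERT_TOPIC_ARN"],
        Subject="Step Functions execution failed",
        Message=json.dumps(message, default=str),
    )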
A simple SQL check after the run
If the final data lands in a queryable table, I like to add a quick validation step outside the workflow. For example, in Athena:
SELECT process_date, COUNT(*) AS row_count
FROM curated_orders
WHERE process_date = DATE '2025-01-26'
GROUP BY process_date;
This is not orchestration by itself, but it gives a fast sanity check that the downstream table looks reasonable after the workflow finishes.
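If you want that check to run without someone opening the Athena console, the same query can be fired from a small script. A sketch with boto3, where the database name and the query results location are placeholders:

import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT process_date, COUNT(*) AS row_count
FROM curated_orders
WHERE process_date = DATE '2025-01-26'
GROUP BY process_date
"""

# Database and output location are placeholders for this sketch.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "demo_curated"},
    ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a final state, then read the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Rows[0] is the header row; anything after it is the actual count per date.
    print(rows[1:])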
Conclusion
If your pipeline is more than one Glue job but still not large enough to justify a full workflow platform, Step Functions plus Glue is a very practical combination. You get retries, sequencing, and better visibility without adding too much operational weight. For simple orchestration, this is one of the cleaner AWS-native patterns to start with, and it is easy to grow from here once the pipeline becomes more serious.
