Using Step Functions with AWS Glue for simple orchestration
In this article, we will look at how to use AWS Step Functions together with AWS Glue for simple orchestration, and why this is often a good choice when you do not want to run a full scheduler or a heavy workflow platform. If your pipeline is mostly a few sequential steps, such as validating input, running one or two Glue jobs, and marking success or failure, this approach is usually enough and easy to reason about.
Many teams start with a single Glue job and trigger it manually or on a schedule. That works for the first version. The problem comes when the flow needs a bit more logic. Maybe you want one job to run only if a file is present. Maybe you want to run a second job only if the first job succeeds. Maybe you want clean retries and one place to see where the execution failed. That is where Step Functions fits nicely.
The key point is that Step Functions does not replace Glue. Glue still does the actual ETL work; Step Functions provides the control flow around it. Think of it like this:
- Glue does the data processing
- Step Functions decides what runs next
- CloudWatch logs help with debugging
- IAM controls what each piece can do
A simple use case
Let us assume we receive a CSV file in S3 every day. We want to:
- Start a raw-to-bronze Glue job
- If that succeeds, start a bronze-to-curated Glue job
- If anything fails, stop the workflow and surface the error
For a simple pipeline, this is enough orchestration. We do not need Airflow just for two dependent jobs.
When this approach makes sense
A quick comparison may help.
| Option | Best for | Limitation |
|---|---|---|
| Glue trigger only | Very simple job chains inside Glue | Limited visibility and branching |
| EventBridge schedule + Glue | One scheduled job | No real workflow logic |
| Step Functions + Glue | Small to medium workflows with retries and branching | State machine definitions add some setup |
| Airflow or MWAA | Bigger platforms with many dependencies | More operational overhead |
For our use case, Step Functions sits in a nice middle ground. It is more capable than a plain schedule, but much simpler than running a workflow platform.
The architecture
A basic setup can look like this:
S3 file arrives
-> Step Functions execution starts
-> Glue job: ingest_raw_orders
-> Glue job: build_curated_orders
-> Success notification or end state
You can start the Step Functions execution in different ways:
- on a schedule using EventBridge
- from a Lambda after an S3 event
- manually for testing
- from another application or API
For a demo, I usually start it manually first because it is easier to debug one layer at a time. In production, I would connect it to an EventBridge schedule or an event-driven trigger, depending on how the source files arrive.
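Starting it manually does not have to mean clicking through the console either. Here is a minimal boto3 sketch for kicking off an execution with some runtime input; the state machine ARN and execution name are placeholders, not values from a real setup:

import json

import boto3

sfn = boto3.client("stepfunctions")

# Start one execution of the state machine with the runtime values
# the Glue jobs expect. The ARN below is only a placeholder.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:orders-pipeline",
    name="orders-2025-01-26",  # optional, but must be unique per execution
    input=json.dumps({
        "process_date": "2025-01-26",
        "input_path": "s3://demo-raw/orders/2025-01-26/orders.csv"
    }),
)

print(response["executionArn"])

The same call works from a Lambda behind an S3 event, which is one way to make the trigger event-driven later without changing the state machine itself.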
A sample Step Functions definition
Below is a simple state machine definition using the Glue integration.
{
  "StartAt": "RunRawJob",
  "States": {
    "RunRawJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "ingest_raw_orders",
        "Arguments": {
          "--input_path": "s3://demo-raw/orders/2025-01-26/orders.csv",
          "--output_path": "s3://demo-bronze/orders/"
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 30,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Next": "RunCuratedJob"
    },
    "RunCuratedJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "build_curated_orders",
        "Arguments": {
          "--input_path": "s3://demo-bronze/orders/",
          "--output_path": "s3://demo-curated/orders/"
        }
      },
      "End": true
    }
  }
}
The important thing here is the .sync integration. That makes Step Functions wait for the Glue job to complete instead of just firing it and moving on immediately. For simple orchestration, that is usually what we want.
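For contrast, this is roughly the polling loop you would otherwise have to write yourself if you started the job asynchronously, for example from a script or a Lambda. A minimal sketch, with the wait interval as an arbitrary choice:

import time

import boto3

glue = boto3.client("glue")

# Start the job asynchronously and poll until it reaches a final state.
run = glue.start_job_run(JobName="ingest_raw_orders")
run_id = run["JobRunId"]

while True:
    state = glue.get_job_run(JobName="ingest_raw_orders", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(state)

With the .sync integration, Step Functions does this waiting for you and surfaces a failed job run as a failed state in the execution history.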
A simple Glue job script
The Glue job itself can stay very plain. For example, a PySpark job may look like this:
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Resolve the arguments passed in by the Step Functions task
args = getResolvedOptions(sys.argv, ["input_path", "output_path"])
spark = SparkSession.builder.getOrCreate()

# Read the raw CSV, drop duplicate orders, and write Parquet to the bronze layer
df = spark.read.option("header", True).csv(args["input_path"])
clean_df = df.dropDuplicates(["order_id"])
clean_df.write.mode("overwrite").parquet(args["output_path"])
What I like here is that orchestration logic stays outside the ETL code. The Glue script focuses on reading, transforming, and writing data. Step Functions handles the job order and retry behavior. That separation makes debugging easier.
Passing dynamic values between steps
One useful pattern is to pass runtime values like a business date, S3 path, or table name into the state machine input. Then each Glue job receives those values as arguments.
For example, the execution input could be:
{
  "process_date": "2025-01-26",
  "input_path": "s3://demo-raw/orders/2025-01-26/orders.csv"
}
And inside the Step Functions task:
"Arguments": {
"--process_date.$": "$.process_date",
"--input_path.$": "$.input_path",
"--output_path": "s3://demo-bronze/orders/"
}
This is much cleaner than hardcoding file names or dates into the job definition.
Things to be careful about
This setup is simple, but there are still a few practical caveats.
1. Glue startup time
Glue jobs are not always instant. For small jobs, the startup time can feel longer than the processing itself. If the transformation is tiny and latency matters, Lambda or ECS may be a better fit.
2. Retries can duplicate work
If you retry a failed Glue job blindly, make sure the job is idempotent or writes to a safe staging location. Otherwise, you may duplicate data or partially overwrite outputs.
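One simple way to make the raw-to-bronze job safe to retry is to scope each write to the partition for the current process_date, so a second attempt overwrites the same prefix instead of adding rows elsewhere. A rough sketch, reusing the process_date argument from the previous section; the exact path layout is just one option:

import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ["process_date", "input_path"])
spark = SparkSession.builder.getOrCreate()

# Each run writes only under its own process_date prefix, so a retry
# simply rewrites that prefix instead of duplicating data in other dates.
output_path = f"s3://demo-bronze/orders/process_date={args['process_date']}/"

df = spark.read.option("header", True).csv(args["input_path"])
df.dropDuplicates(["order_id"]).write.mode("overwrite").parquet(output_path)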
3. State machine errors are only part of the picture
If a Glue job fails, Step Functions shows you the failed state, which is useful, but you still need CloudWatch logs and Glue run logs to understand the root cause. Step Functions improves visibility, but it does not replace job-level logging.
4. IAM permissions need to be set up correctly
Your Step Functions role must be allowed to start Glue jobs, and the Glue role must be allowed to read and write the right S3 locations. Many early failures are just permission issues.
What I would change in production
For a small demo, two sequential tasks are enough. In a production setup, I would usually add a few more things:
- input validation before running the first Glue job
- failure notification through SNS, Slack, or email (a sketch of one option follows below)
- explicit timeout handling
- separate dev and prod state machines
- infrastructure as code using Terraform or CloudFormation
- better partitioning and data quality checks inside the Glue jobs
If multiple datasets all follow the same pattern, I would also think about creating a reusable state machine template rather than copying JSON for every pipeline.
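For the notification point above, one common pattern is a small Lambda that listens for the "Step Functions Execution Status Change" event from EventBridge and publishes failures to SNS. This is only a sketch; the EventBridge rule filtering on FAILED executions and the SNS topic are assumptions you would set up separately:

import json
import os

import boto3

sns = boto3.client("sns")

def handler(event, context):
    # EventBridge delivers the execution details under "detail".
    detail = event.get("detail", {})
    message = {
        "executionArn": detail.get("executionArn"),
        "status": detail.get("status"),
        "stopDate": detail.get("stopDate"),
    }
    # The topic ARN is assumed to be set as an environment variable on the Lambda.
    sns.publish(
        TopicArn=os.environ["ALERT_TOPIC_ARN"],
        Subject="Step Functions execution failed",
        Message=json.dumps(message, default=str),
    )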
A simple SQL check after the run
If the final data lands in a queryable table, I like to add a quick validation step outside the workflow. For example, in Athena:
SELECT process_date, COUNT(*) AS row_count
FROM curated_orders
WHERE process_date = DATE '2025-01-26'
GROUP BY process_date;
This is not orchestration by itself, but it gives a fast sanity check that the downstream table looks reasonable after the workflow finishes.
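If you want that check to run without someone opening the Athena console, the same query can be fired from a small script. A sketch with boto3, where the database name and the query results location are placeholders:

import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT process_date, COUNT(*) AS row_count
FROM curated_orders
WHERE process_date = DATE '2025-01-26'
GROUP BY process_date
"""

# Database and output location are placeholders for this sketch.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "demo_curated"},
    ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a final state, then read the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Rows[0] is the header row; anything after it is the actual count per date.
    print(rows[1:])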
Conclusion
If your pipeline is more than one Glue job but still not large enough to justify a full workflow platform, Step Functions plus Glue is a very practical combination. You get retries, sequencing, and better visibility without adding too much operational weight. For simple orchestration, this is one of the cleaner AWS-native patterns to start with, and it is easy to grow from here once the pipeline becomes more serious.
