AWS Step Functions for Data Pipeline Orchestration: A Practical Guide

Posted May 13, 2025

By Ashok KS

In this article, let us look at AWS Step Functions and how you can use it to orchestrate your data pipelines. If you have been building data pipelines for a while, you have probably used tools like Apache Airflow or maybe even cron jobs running on an EC2 instance. Step Functions sits somewhere in between — it is serverless, deeply integrated with AWS services, and you do not need to manage any infrastructure for the orchestrator itself.

I have used Airflow extensively in the past (we even covered it in a previous article), and while Airflow is powerful, it is also a bit heavy for simple workflows. If your entire pipeline runs on AWS and you do not need the flexibility of Python DAGs with custom operators, Step Functions can be a much simpler option.

When Should You Use Step Functions?

Before we jump into code, let me give you my honest view on where Step Functions fits.

Tool	Best For	Not Great For
Step Functions	AWS-native pipelines, event-driven workflows, simple branching	Complex Python logic inside the orchestrator, hybrid cloud setups
Airflow	Multi-cloud pipelines, custom Python operators, complex scheduling	Simple AWS-only workflows (overkill)
GitHub Actions	CI/CD for data pipelines, deployment triggers	Long-running data workflows with complex retry logic
Cron + Lambda	Simple periodic tasks	Multi-step workflows with dependencies and error handling

If your pipeline looks like “Run Glue job A → On success, run Glue job B → On failure, send SNS notification,” Step Functions is a great fit. If your pipeline involves calling external APIs that are not on AWS, doing complex data transformations inside the orchestrator, or running on a schedule that changes dynamically, you might want to look at Airflow.

Setting Up Your First Step Function

Let us build a simple pipeline that does the following:

Runs a Glue ETL job to process raw data
On success, triggers a Lambda function to validate the output
On validation failure, sends an alert via SNS

Step 1: The IAM Role

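First, we need a role that Step Functions can assume. This role needs permissions to start Glue jobs, invoke Lambda functions, and publish to SNS. Because the state machine in Step 2 uses the .sync Glue integration, the role also needs to read and stop job runs, so that Step Functions can poll the job until it finishes and cancel it if the execution is aborted.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "glue:GetJobRun",
        "glue:GetJobRuns",
        "glue:BatchStopJobRun",
        "lambda:InvokeFunction",
        "sns:Publish"
      ],
      "Resource": "*"
    }
  ]
}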

Create this role from the IAM console or using Terraform. Make sure you add a trust policy that lets states.amazonaws.com assume the role.
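The trust policy mentioned above is short. Here is a minimal sketch in Python that builds it as a dictionary and prints the JSON you would attach to the role. The variable name trust_policy is just illustrative, and condition keys such as aws:SourceAccount, which you may want in production, are left out.

```python
import json

# Trust policy that lets the Step Functions service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "states.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```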

Step 2: Define the State Machine

The state machine is where you define your workflow. You can write this in the Amazon States Language (ASL), which is a JSON-based DSL. It is not pretty, but it is functional.

  
{
  "Comment": "Data pipeline orchestration example",
  "StartAt": "Run Glue Job",
  "States": {
    "Run Glue Job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "raw-data-processing-job"
      },
      "Next": "Validate Output",
      "Retry": [
        {
          "ErrorEquals": ["Glue.AWSGlueException"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ]
    },
    "Validate Output": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "validate-output-lambda",
        "Payload.$": "$"
      },
      "Next": "Check Validation",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 30,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ]
    },
    "Check Validation": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Payload.status",
          "StringEquals": "VALID",
          "Next": "Pipeline Success"
        }
      ],
      "Default": "Send Alert"
    },
    "Send Alert": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts",
        "Subject": "Pipeline Validation Failed",
        "Message.$": "$.Payload.errorMessage"
      },
      "End": true
    },
    "Pipeline Success": {
      "Type": "Succeed"
    }
  }
}

A few things to notice here. The .sync suffix on the Glue job resource makes Step Functions wait for the job to complete before moving forward. Without it, Step Functions would fire the job and move on immediately, which is almost never what you want in a data pipeline.

Also notice the Payload.$ syntax. The .$ suffix on the key tells Step Functions that the value is a JSONPath expression rather than a literal string, and the path "$" selects the entire state input, so the Lambda function receives everything the previous state returned. If you wanted only a subset, you would use something like "Payload": { "jobRunId.$": "$.JobRunId" }. This path-based filtering is useful but can get confusing when your state gets large.
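To make the .$ convention concrete, here is a toy model in Python of how a Parameters block is resolved against the state input. It is a simplified sketch, not the real Step Functions engine: resolve_path and resolve_parameters are hypothetical helpers, only plain dotted paths such as $.a.b are supported, and the sample input is made up.

```python
def resolve_path(path, state_input):
    """Resolve a dotted path such as "$" or "$.a.b" (no filters or wildcards)."""
    if path == "$":
        return state_input
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = state_input
    for key in path[2:].split("."):
        value = value[key]
    return value


def resolve_parameters(parameters, state_input):
    """Keys ending in ".$" are treated as paths; all other values are literals."""
    resolved = {}
    for key, value in parameters.items():
        if key.endswith(".$"):
            resolved[key[:-2]] = resolve_path(value, state_input)
        elif isinstance(value, dict):
            resolved[key] = resolve_parameters(value, state_input)
        else:
            resolved[key] = value
    return resolved


# Made-up state input standing in for the output of the previous state.
state_input = {"JobRunId": "jr_123", "JobName": "raw-data-processing-job"}

whole = resolve_parameters(
    {"FunctionName": "validate-output-lambda", "Payload.$": "$"}, state_input
)
subset = resolve_parameters(
    {
        "FunctionName": "validate-output-lambda",
        "Payload": {"jobRunId.$": "$.JobRunId"},
    },
    state_input,
)
print(whole["Payload"])   # the full input
print(subset["Payload"])  # only the selected field, under its new name
```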

Step 3: Triggering the State Machine

You can trigger a Step Function in a few ways:

EventBridge (CloudWatch Events): Schedule-based triggers, like “run every day at 2 AM”
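S3 Events: When a file lands in a bucket (routed through an EventBridge rule or a small Lambda, since S3 cannot start an execution directly)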
API Gateway: HTTP endpoint for on-demand triggers
Manual: From the console or AWS CLI

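For scheduled data pipelines, EventBridge is the most common approach. Here is what a CloudFormation snippet looks like for a daily trigger. Keep in mind that EventBridge rule schedules are evaluated in UTC, so this fires at 2 AM UTC, and that the role passed as RoleArn must allow states:StartExecution on the state machine: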

  
EventRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 2 * * ? *)"
    Targets:
      - Arn: !Ref StateMachineArn
        Id: "DailyPipelineTrigger"
        RoleArn: !Ref EventBridgeRoleArn
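
For the manual or on-demand case, starting an execution from Python is a single API call. The sketch below assumes you pass in a boto3 Step Functions client (boto3.client("stepfunctions")). The start_pipeline helper, the run_date field, and the ARN in the usage comment are placeholders made up for illustration.

```python
import json
import uuid


def start_pipeline(sfn_client, state_machine_arn, payload):
    """Start one execution and return its ARN.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # A random suffix keeps execution names unique per state machine.
        name=f"manual-{uuid.uuid4()}",
        # The API expects the input as a JSON string, not a dict.
        input=json.dumps(payload),
    )
    return response["executionArn"]


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
# import boto3
# arn = start_pipeline(
#     boto3.client("stepfunctions"),
#     "arn:aws:states:us-east-1:123456789012:stateMachine:data-pipeline",
#     {"run_date": "2025-05-13"},
# )
```

Passing the client in, rather than creating it inside the function, also makes the helper easy to test with a stub.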

Error Handling That Actually Works

This is where Step Functions does better than most people give it credit for. The retry configuration we saw earlier is per-state, which means you can have different retry strategies for different steps.

In the state machine above, the Glue job gets 2 retries starting with a 60-second wait, while the Lambda call gets 3 retries starting with a 30-second wait. With a BackoffRate of 2, each wait is double the previous one. You define these separately for each state.
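
To see what a Retry block means in wall-clock terms, the small Python sketch below computes the wait before each retry, using the rule that the first wait is IntervalSeconds and each later wait is multiplied by BackoffRate. The retry_waits helper is hypothetical, the inputs are the Retry values from the state machine definition earlier, and optional fields such as MaxDelaySeconds and JitterStrategy are ignored.

```python
def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Return the wait in seconds before each retry attempt."""
    return [
        interval_seconds * backoff_rate ** attempt
        for attempt in range(max_attempts)
    ]


glue_waits = retry_waits(interval_seconds=60, max_attempts=2, backoff_rate=2)
lambda_waits = retry_waits(interval_seconds=30, max_attempts=3, backoff_rate=2)

print(glue_waits)    # [60, 120]
print(lambda_waits)  # [30, 60, 120]
```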

Step Functions also has a Catch field that lets you define what happens when a state fails after all retries are exhausted. You can route the failure to a cleanup step or a notification. Without Catch, the entire execution fails with no graceful handling.

  
"Catch": [
  {
    "ErrorEquals": ["States.ALL"],
    "Next": "Cleanup and Notify"
  }
]

One thing I learned the hard way: a Catch on States.ALL only handles failures that are actually reported to Step Functions as errors. If your Glue job crashes with an out-of-memory error, the job run fails, the task state fails, and the Catch fires. But if the job completes successfully and the output is just wrong, there is no error to catch. You need the validation Lambda (or equivalent) for logical errors.
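
A validation Lambda for this pipeline only needs to return the two fields the state machine reads: status, which the Choice state checks, and errorMessage, which the SNS state publishes. Here is a minimal sketch. The rowCount and minRows input fields are assumptions for illustration, and a real check would more likely inspect the job output in S3 or run a query.

```python
def validate_output(row_count, min_rows=1):
    """Return the payload shape the Choice and SNS states expect."""
    if row_count >= min_rows:
        return {"status": "VALID"}
    return {
        "status": "INVALID",
        "errorMessage": f"Expected at least {min_rows} rows, found {row_count}",
    }


def lambda_handler(event, context):
    # Hypothetical input fields; replace with a real check against the job output.
    row_count = int(event.get("rowCount", 0))
    min_rows = int(event.get("minRows", 1))
    return validate_output(row_count, min_rows)
```

Returning a result instead of raising keeps a bad output on the Choice path. Raising an exception would surface as a task error for Retry and Catch to handle instead.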

Things to Watch Out For

State Machine Size Limits

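The ASL JSON definition has a size limit of 1 MB. This sounds like a lot, but generated definitions with hundreds of states, or large static payloads embedded in Parameters, can get there. Keep the definition lean: reference Lambda functions by name or ARN and keep bulky configuration in S3 or Parameter Store. A related limit you are more likely to hit is the 256 KB cap on the data passed between states, so pass S3 paths around rather than the data itself.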
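
A cheap guard is to measure the serialized definition in your deployment script before you ship it. The sketch below assumes the 1 MB limit means 1,048,576 bytes of UTF-8 JSON, which is an interpretation rather than something confirmed here. The definition_size_bytes helper and the tiny sample definition are hypothetical.

```python
import json

MAX_DEFINITION_BYTES = 1024 * 1024  # assumed interpretation of the 1 MB limit


def definition_size_bytes(definition):
    """Size of the state machine definition once serialized as compact JSON."""
    return len(json.dumps(definition, separators=(",", ":")).encode("utf-8"))


definition = {
    "StartAt": "Pipeline Success",
    "States": {"Pipeline Success": {"Type": "Succeed"}},
}

size = definition_size_bytes(definition)
print(f"{size} bytes, {size / MAX_DEFINITION_BYTES:.4%} of the limit")
assert size < MAX_DEFINITION_BYTES, "definition is too large to deploy"
```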

Execution History Limit

Step Functions keeps execution history for up to 90 days after an execution closes. After that, it is gone. If you need long-term audit trails, send your execution logs to CloudWatch Logs or archive them to S3.

Maximum Execution Time

A single Standard Workflow execution can run for up to 1 year, which is plenty. But if you are using Express Workflows (the cheaper, high-throughput option), the limit is 5 minutes, and Express Workflows do not support the .sync integration pattern that the Glue state above relies on. Choose carefully based on your pipeline duration.

Cost at Scale

Standard Workflows are priced per state transition. At $0.025 per 1,000 transitions, it is cheap for most use cases. But if you have a workflow with 50 states running 10,000 times a day, you are looking at $12.50 per day just in state transitions. Not breaking the bank, but worth calculating before you go all-in.
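
The arithmetic behind that figure is easy to rerun with your own numbers. This Python sketch uses the price quoted above and assumes every state runs exactly once per execution, with no retries and no free tier; daily_transition_cost is a hypothetical helper.

```python
PRICE_PER_1000_TRANSITIONS = 0.025  # USD, Standard Workflows, figure quoted above


def daily_transition_cost(states_per_run, runs_per_day):
    """Cost in USD of one day's state transitions."""
    transitions = states_per_run * runs_per_day
    return transitions / 1000 * PRICE_PER_1000_TRANSITIONS


cost = daily_transition_cost(states_per_run=50, runs_per_day=10_000)
print(f"${cost:.2f} per day, about ${cost * 30:.2f} per 30 days")
```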

Input and Output Path Confusion

The $., $$., $[0], $..field syntax for filtering input and output is powerful but not intuitive. I have spent more time than I would like debugging state machines where I accidentally passed the wrong fields between states. Use the data flow simulator in the AWS console to test your path expressions before deploying.

Production Considerations

If you are deploying this to production, here is what changes from the demo above:

Infrastructure as Code: Do not create the state machine from the console. Use Terraform or CloudFormation. You want the state machine definition, IAM roles, and triggering rules all version-controlled.
Environment-Specific Configurations: Use SSM Parameter Store or the execution input instead of hardcoding job names and ARNs. Your Glue job name and SNS topic ARN will be different between dev and prod.
Logging and Monitoring: Enable CloudWatch Logs for your state machine executions. Set up alarms on failed executions using CloudWatch metrics.
Least-Privilege Permissions: The IAM role I showed above uses "Resource": "*" for simplicity. In production, scope it down to the specific Glue job, Lambda function, and SNS topic.
Dead Letter Queues: For event-driven triggers like S3 events, set up a DLQ so you do not lose events if the state machine fails to start.
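
One way to handle the environment-specific values in the list above is to generate the definition from a small per-environment config at deploy time. The sketch below is an assumption about how you might structure that: ENVIRONMENTS, build_definition, and the dev resource names are all made up, and only the Glue and SNS states are shown.

```python
import json

# Hypothetical per-environment settings; in practice these might come from
# SSM Parameter Store or Terraform variables.
ENVIRONMENTS = {
    "dev": {
        "glue_job": "raw-data-processing-job-dev",
        "alert_topic": "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts-dev",
    },
    "prod": {
        "glue_job": "raw-data-processing-job",
        "alert_topic": "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts",
    },
}


def build_definition(env):
    """Return a trimmed state machine definition for the given environment."""
    config = ENVIRONMENTS[env]
    return {
        "StartAt": "Run Glue Job",
        "States": {
            "Run Glue Job": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": config["glue_job"]},
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Send Alert"}],
                "End": True,
            },
            "Send Alert": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": config["alert_topic"],
                    "Message": "Pipeline failed",
                },
                "End": True,
            },
        },
    }


print(json.dumps(build_definition("dev"), indent=2))
```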

Wrapping Up

AWS Step Functions is not a replacement for Airflow, and it was never meant to be. It is a simpler, AWS-native orchestrator that works well when your pipeline is made up of AWS services and you want to avoid managing infrastructure for your orchestrator.

In my experience, the sweet spot is pipelines with 5–15 steps where each step is an AWS service call and you need solid error handling. For anything more complex, you will likely find yourself fighting ASL more than you are building your pipeline.

If you have been running Airflow for a handful of Glue jobs and a Lambda, give Step Functions a try. You might find that the reduced operational overhead is worth the trade-off in flexibility.

Data Engineering, AWS

This post is licensed under CC BY 4.0 by the author.