How to Choose the Right AWS Data Pipeline Architecture for a New Data Product
When we build a new data pipeline, we usually jump straight to tools:
- “Should I use Glue?”
- “Can Athena do this?”
- “Do I need Step Functions or Airflow?”
These are valid questions, but in my experience they are asked too early.
A better order is:
- What is the product requirement?
- What reliability is needed?
- What freshness is expected?
- What cost boundary can we operate within?
- Then pick tools.
In this guide, I will walk through a practical framework for data engineers (beginner to intermediate) who are designing pipelines in AWS. I will also show where these choices naturally connect to AI engineering work.
Start with product shape, not tool preference
Before selecting Glue/Athena/Step Functions/dbt, write one page that answers:
- Consumers: Who uses this data? BI team, product app, ML team, external partner?
- Freshness: hourly, daily, near-real-time?
- Data correctness: how costly is bad data for the business?
- Latency tolerance: can users wait 15 minutes or 8 hours?
- Change rate: does source schema change frequently?
- Scale expectation: 1 GB/day, 100 GB/day, 10 TB/day?
Without this, architecture decisions become “tool fandom” decisions.
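One lightweight way to make that one-pager concrete is to capture it as a structured record that lives next to the pipeline code. The field names and the `orders_pipeline` example below are illustrative assumptions, not a standard template:

```python
from dataclasses import dataclass

# Hypothetical one-page requirements record; all field names are
# illustrative, not tied to any AWS service or framework.
@dataclass
class ProductRequirements:
    consumers: list[str]            # e.g. ["BI team", "ML team"]
    freshness: str                  # "hourly" | "daily" | "near-real-time"
    correctness_cost: str           # business impact of bad data
    latency_tolerance_minutes: int  # how long users can wait
    schema_change_rate: str         # "rare" | "frequent"
    daily_volume_gb: float          # expected scale

# Example: a daily revenue-reporting pipeline
orders_pipeline = ProductRequirements(
    consumers=["BI team"],
    freshness="daily",
    correctness_cost="high: feeds revenue reporting",
    latency_tolerance_minutes=480,
    schema_change_rate="rare",
    daily_volume_gb=5.0,
)
```

Reviewing this record in a pull request forces the freshness and scale conversation before anyone argues about tools.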
A simple architecture decision map
Use this as your default thought process:
```mermaid
flowchart TD
    A[New Data Product Request] --> B{Freshness needed?}
    B -->|Batch Hourly/Daily| C[S3 + Glue + Athena/dbt]
    B -->|Near Real Time| D[Kinesis + Stream Processing + S3/Iceberg]
    C --> E{Complex orchestration?}
    E -->|Yes| F[Step Functions]
    E -->|No| G[EventBridge Scheduler]
    C --> H{Transformation style?}
    H -->|SQL-first analytics| I[dbt + Athena/warehouse]
    H -->|Complex Python transforms| J[Glue Spark jobs]
    D --> K{Serving target?}
    K -->|BI + ad hoc| L[Athena/Redshift]
    K -->|ML/GenAI features| M[Feature-ready curated layers]
```
Notice the sequence: requirement -> pattern -> tooling.
Recommended baseline for most teams
For many teams starting a new analytics product in AWS, this is a strong baseline:
- Landing: S3 raw zone (immutable files)
- Transform compute: Glue jobs for heavy transforms
- SQL modeling: dbt for semantic/analytics modeling
- Query: Athena for ad hoc + lightweight reporting
- Orchestration: Step Functions for multi-step, retriable workflows
- Metadata/governance: Glue Data Catalog + clear naming/versioning
Why this works well:
- scales from small to medium teams
- low platform overhead vs running full Kubernetes/Spark clusters
- clear separation between raw/clean/curated layers
- easy to integrate with downstream BI and AI feature generation
When to choose Glue vs dbt vs Athena
Choose Glue when:
- you need Spark-level transforms on large datasets
- you need Python logic beyond SQL readability
- you need robust file-level ETL with partition handling
Choose dbt when:
- business logic is mostly SQL-transformable
- you want testable models and lineage
- analytics engineers and data engineers collaborate
Choose Athena directly when:
- simple transformations are enough
- pipeline is small and cost-sensitive
- you need fast exploration without additional runtime layers
A practical combo many teams use:
- Glue for heavy normalization/enrichment
- dbt for curated marts/metrics
- Athena for queries
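As a minimal illustration of the "Glue for heavy normalization" step, here is a plain-Python sketch of a dedupe-and-normalize pass. A real Glue job would express this in Spark over far larger data; the `order_id` and `updated_at` fields are assumptions for the example:

```python
def raw_to_clean(records):
    """Normalize keys and drop duplicate order_ids, keeping the latest row.

    Plain-Python stand-in for a Glue/Spark raw->clean transform;
    field names are illustrative.
    """
    latest = {}
    for rec in records:
        # Normalize the business key (type + whitespace)
        rec = {**rec, "order_id": str(rec["order_id"]).strip()}
        key = rec["order_id"]
        # Keep only the most recently updated row per key
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"order_id": " 1 ", "updated_at": "2024-01-01", "amount": 10},
    {"order_id": "1",   "updated_at": "2024-01-02", "amount": 12},
    {"order_id": "2",   "updated_at": "2024-01-01", "amount": 7},
]
clean = raw_to_clean(raw)  # two rows: latest "1" and "2"
```

The same keep-latest-per-key rule maps directly onto a Spark window function or a dbt incremental model once volumes grow.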
Orchestration: when Step Functions is worth it
If your flow is more than “run one job nightly”, Step Functions is usually worth it.
Good fit signals:
- multiple dependent jobs
- conditional branching (if file exists, if quality checks pass, etc.)
- retries and failure handling required
- need for operational visibility and replay
Example state-machine sequence:
- Check source file availability
- Run Glue raw->clean transform
- Run quality checks
- If pass, run dbt models
- Publish completion event
- Alert on failure path
This gives predictable operations and easier incident debugging.
Data quality: treat it as architecture, not an afterthought
Quality should be a pipeline stage, not a dashboard complaint.
Minimum quality controls for new pipelines:
- schema checks on ingestion
- null/duplicate checks in curated layers
- row count drift thresholds
- freshness checks (expected partition exists)
You can implement quick checks in Spark/SQL and evolve into a stronger data contract approach later.
Sample SQL quality check (Athena compatible pattern):
```sql
WITH latest AS (
    SELECT *
    FROM curated.orders
    WHERE dt = current_date - interval '1' day
)
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
    COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_order_id
FROM latest;
```
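The query above returns counts, not a decision. A small evaluator, for example in a Lambda invoked by Step Functions, can turn those counts into a pass/fail gate. The threshold values below are illustrative defaults, not a standard:

```python
def evaluate_quality(metrics, min_rows=1, max_null_pct=0.0, max_dup_pct=0.01):
    """Turn quality-check counts into a pass/fail decision.

    `metrics` mirrors the columns of the SQL check above:
    total_rows, null_order_id, duplicate_order_id.
    Threshold defaults are illustrative assumptions.
    """
    total = metrics["total_rows"]
    failures = []
    if total < min_rows:
        failures.append("no rows for expected partition")
    else:
        if metrics["null_order_id"] / total > max_null_pct:
            failures.append("null order_id above threshold")
        if metrics["duplicate_order_id"] / total > max_dup_pct:
            failures.append("duplicate order_id above threshold")
    return {"passed": not failures, "failures": failures}

result = evaluate_quality(
    {"total_rows": 1000, "null_order_id": 0, "duplicate_order_id": 3}
)
```

Returning a structured result (rather than raising immediately) lets the orchestrator branch on `passed` and attach the failure list to the alert.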
Cost optimization from day one
A lot of teams “optimize later” and carry expensive design debt.
Make these decisions upfront:
- Partition strategy
  - date partitioning for large fact tables
  - avoid over-partitioning tiny datasets
- File format
  - use Parquet (or Iceberg where needed)
  - avoid long-term CSV in analytics layers
- Small files problem
  - compact files in clean/curated zones
- Query boundaries
  - enforce date filters in analytical queries
- Retry policy discipline
  - retries are good, unlimited retries are expensive
A single poor partitioning decision can cost more than all your orchestration tooling.
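To make the small-files point concrete, here is a tiny helper for sizing a compaction job's output. The 128 MB default is a commonly used Parquet-friendly target; treat it as a tunable assumption, not an AWS requirement:

```python
import math

def target_file_count(total_bytes, target_file_mb=128):
    """Number of output files to aim for when compacting a partition.

    128 MB is a common Parquet target size; it is an assumption here,
    not an AWS default.
    """
    target_bytes = target_file_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))

# A 10 GB partition compacts to 80 files of ~128 MB each
files = target_file_count(10 * 1024 ** 3)
```

Repartitioning to a number like this (instead of whatever the upstream job happened to emit) is often the single cheapest Athena scan-cost win.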
Lakehouse layering pattern (practical version)
Use a simple medallion-style approach:
- Raw: source-as-is, immutable, auditable
- Clean/Silver: normalized schema, standard datatypes, dedupe rules
- Curated/Gold: domain-ready models for BI, product, AI use cases
This layering is not just for BI. It is also what makes AI feature development easier, because feature teams consume stable curated entities instead of random source tables.
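One way to keep those layer boundaries explicit is a shared path convention that every job imports. The bucket name and key layout below are illustrative assumptions, not an AWS convention:

```python
def layer_path(layer, domain, table, dt, bucket="my-lake"):
    """Build an S3 prefix for a medallion layer.

    The raw/clean/curated + date-partition layout is a convention
    sketch; bucket and structure are assumptions.
    """
    allowed = {"raw", "clean", "curated"}
    if layer not in allowed:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3://{bucket}/{layer}/{domain}/{table}/dt={dt}/"

path = layer_path("curated", "sales", "orders", "2024-01-01")
```

Centralizing this in one function means a layer rename or bucket migration touches one place instead of every Glue job and dbt source definition.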
Where AI engineering starts for data engineers
If you are a data engineer trying to move toward AI engineering, your biggest leverage is not prompt engineering. It is reliable data product design.
AI systems fail most often due to:
- stale data
- inconsistent entity definitions
- weak lineage/traceability
- missing observability
All of these are core data engineering strengths.
A good transition path:
- Build pipeline quality and lineage discipline
- Create feature-ready curated entities
- Add serving patterns for AI workloads (batch + near-real-time)
- Learn vector/search retrieval patterns on top of trusted data
- Add model monitoring as an extension of pipeline monitoring
So the path is not “leave data engineering.” It is “extend platform engineering into AI workloads.”
Reference implementation sketch
A simple IaC + SQL + orchestration split might look like this:
```text
repo/
  infra/
    step_functions/
    glue_jobs/
    iam/
    catalog/
  pipelines/
    glue/
      raw_to_clean.py
      clean_to_curated.py
    transformations/
      dbt/
        models/
        tests/
  observability/
    quality_checks/
    alerts/
```
And a lightweight Step Functions task definition pattern:
```json
{
  "Comment": "Daily pipeline",
  "StartAt": "RunRawToClean",
  "States": {
    "RunRawToClean": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Next": "RunQualityChecks"
    },
    "RunQualityChecks": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Next": "RunCuratedModels"
    },
    "RunCuratedModels": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "End": true
    }
  }
}
```
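The sketch above deliberately omits error handling. Since retries and alerts were called out as good-fit signals, a hedged Retry/Catch variant of the Glue task might look like this; the interval and attempt values are illustrative, and `NotifyFailure` is an assumed alerting state (for example, an SNS publish):

```json
"RunRawToClean": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "NotifyFailure"
    }
  ],
  "Next": "RunQualityChecks"
}
```

Bounded `MaxAttempts` matters here: it is exactly the "retries are good, unlimited retries are expensive" discipline from the cost section.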
Common anti-patterns to avoid
- choosing tools first and requirements second
- mixing ingestion, transformation, and serving logic into one giant Glue job
- no clear raw/clean/curated boundaries
- orchestration without retries/alerts
- quality checks done manually after business reports break
Final checklist before you commit architecture
Before finalizing your design, answer yes/no:
- Do we know the exact freshness SLA?
- Is failure behavior explicitly designed?
- Are quality checks first-class stages?
- Is partitioning/file strategy documented?
- Can this architecture support future AI feature pipelines?
If you can answer yes to these, your tool choices (Glue, Athena, Step Functions, dbt) usually fall into place naturally.
If you are currently moving from GCP-based pipelines to AWS, this framework helps avoid the most common mistake: translating service names 1:1 instead of redesigning for product requirements.
In the next article, I will break down a concrete decision matrix: Glue vs EMR vs Athena SQL pipelines with cost, complexity, and team-skill trade-offs.