How to Choose the Right AWS Data Pipeline Architecture for a New Data Product
When we build a new data pipeline, we usually jump straight to tools:
- “Should I use Glue?”
- “Can Athena do this?”
- “Do I need Step Functions or Airflow?”
These are valid questions, but in my experience they are asked too early.
A better order is:
- What is the product requirement?
- What reliability is needed?
- What freshness is expected?
- What cost boundary can we operate within?
- Then pick tools.
In this guide, I will walk through a practical framework for data engineers (beginner to intermediate) who are designing pipelines in AWS. I will also show where these choices naturally connect to AI engineering work.
Start with product shape, not tool preference
Before selecting Glue/Athena/Step Functions/dbt, write one page that answers:
- Consumers: Who uses this data? BI team, product app, ML team, external partner?
- Freshness: hourly, daily, near-real-time?
- Data correctness: how costly is bad data for the business?
- Latency tolerance: can users wait 15 minutes or 8 hours?
- Change rate: does source schema change frequently?
- Scale expectation: 1 GB/day, 100 GB/day, 10 TB/day?
Without this, architecture decisions become “tool fandom” decisions.
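One lightweight way to make that one-pager concrete is to capture it as a structured record that lives next to the pipeline code. The field names and the `orders_pipeline` example below are illustrative assumptions, not a standard template:

```python
from dataclasses import dataclass

# Hypothetical one-page requirements record; all field names are
# illustrative, not tied to any AWS service or framework.
@dataclass
class ProductRequirements:
    consumers: list[str]            # e.g. ["BI team", "ML team"]
    freshness: str                  # "hourly" | "daily" | "near-real-time"
    correctness_cost: str           # business impact of bad data
    latency_tolerance_minutes: int  # how long users can wait
    schema_change_rate: str         # "rare" | "frequent"
    daily_volume_gb: float          # expected scale

# Example: a daily revenue-reporting pipeline
orders_pipeline = ProductRequirements(
    consumers=["BI team"],
    freshness="daily",
    correctness_cost="high: feeds revenue reporting",
    latency_tolerance_minutes=480,
    schema_change_rate="rare",
    daily_volume_gb=5.0,
)
```

Reviewing this record in a pull request forces the freshness and scale conversation before anyone argues about tools.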
A simple architecture decision map
Use this as your default thought process:
```mermaid
flowchart TD
    A[New Data Product Request] --> B{Freshness needed?}
    B -->|Batch Hourly/Daily| C[S3 + Glue + Athena/dbt]
    B -->|Near Real Time| D[Kinesis + Stream Processing + S3/Iceberg]
    C --> E{Complex orchestration?}
    E -->|Yes| F[Step Functions]
    E -->|No| G[EventBridge Scheduler]
    C --> H{Transformation style?}
    H -->|SQL-first analytics| I[dbt + Athena/warehouse]
    H -->|Complex Python transforms| J[Glue Spark jobs]
    D --> K{Serving target?}
    K -->|BI + ad hoc| L[Athena/Redshift]
    K -->|ML/GenAI features| M[Feature-ready curated layers]
```
Notice the sequence: requirement -> pattern -> tooling.
Recommended baseline for most teams
For many teams starting a new analytics product in AWS, this is a strong baseline:
- Landing: S3 raw zone (immutable files)
- Transform compute: Glue jobs for heavy transforms
- SQL modeling: dbt for semantic/analytics modeling
- Query: Athena for ad hoc + lightweight reporting
- Orchestration: Step Functions for multi-step, retriable workflows
- Metadata/governance: Glue Data Catalog + clear naming/versioning
Why this works well:
- scales from small to medium teams
- low platform overhead vs running full Kubernetes/Spark clusters
- clear separation between raw/clean/curated layers
- easy to integrate with downstream BI and AI feature generation
When to choose Glue vs dbt vs Athena
Choose Glue when:
- you need Spark-level transforms on large datasets
- you need Python logic beyond SQL readability
- you need robust file-level ETL with partition handling
Choose dbt when:
- business logic is mostly SQL-transformable
- you want testable models and lineage
- analytics engineers and data engineers collaborate
Choose Athena directly when:
- simple transformations are enough
- pipeline is small and cost-sensitive
- you need fast exploration without additional runtime layers
A practical combo many teams use:
- Glue for heavy normalization/enrichment
- dbt for curated marts/metrics
- Athena for queries
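As a minimal illustration of the "Glue for heavy normalization" step, here is a plain-Python sketch of a dedupe-and-normalize pass. A real Glue job would express this in Spark over far larger data; the `order_id` and `updated_at` fields are assumptions for the example:

```python
def raw_to_clean(records):
    """Normalize keys and drop duplicate order_ids, keeping the latest row.

    Plain-Python stand-in for a Glue/Spark raw->clean transform;
    field names are illustrative.
    """
    latest = {}
    for rec in records:
        # Normalize the business key (type + whitespace)
        rec = {**rec, "order_id": str(rec["order_id"]).strip()}
        key = rec["order_id"]
        # Keep only the most recently updated row per key
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())

raw = [
    {"order_id": " 1 ", "updated_at": "2024-01-01", "amount": 10},
    {"order_id": "1",   "updated_at": "2024-01-02", "amount": 12},
    {"order_id": "2",   "updated_at": "2024-01-01", "amount": 7},
]
clean = raw_to_clean(raw)  # two rows: latest "1" and "2"
```

The same keep-latest-per-key rule maps directly onto a Spark window function or a dbt incremental model once volumes grow.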
Orchestration: when Step Functions is worth it
If your flow is more than “run one job nightly”, Step Functions is usually worth it.
Good fit signals:
- multiple dependent jobs
- conditional branching (if file exists, if quality checks pass, etc.)
- retries and failure handling required
- need for operational visibility and replay
Example state-machine sequence:
- Check source file availability
- Run Glue raw->clean transform
- Run quality checks
- If pass, run dbt models
- Publish completion event
- Alert on failure path
This gives predictable operations and easier incident debugging.
Data quality: treat it as architecture, not an afterthought
Quality should be a pipeline stage, not a dashboard complaint.
Minimum quality controls for new pipelines:
- schema checks on ingestion
- null/duplicate checks in curated layers
- row count drift thresholds
- freshness checks (expected partition exists)
You can implement quick checks in Spark/SQL and evolve into a stronger data contract approach later.
Sample SQL quality check (Athena compatible pattern):
```sql
WITH latest AS (
    SELECT *
    FROM curated.orders
    WHERE dt = current_date - interval '1' day
)
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
    COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_order_id
FROM latest;
```
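The query above returns counts, not a decision. A small evaluator, for example in a Lambda invoked by Step Functions, can turn those counts into a pass/fail gate. The threshold values below are illustrative defaults, not a standard:

```python
def evaluate_quality(metrics, min_rows=1, max_null_pct=0.0, max_dup_pct=0.01):
    """Turn quality-check counts into a pass/fail decision.

    `metrics` mirrors the columns of the SQL check above:
    total_rows, null_order_id, duplicate_order_id.
    Threshold defaults are illustrative assumptions.
    """
    total = metrics["total_rows"]
    failures = []
    if total < min_rows:
        failures.append("no rows for expected partition")
    else:
        if metrics["null_order_id"] / total > max_null_pct:
            failures.append("null order_id above threshold")
        if metrics["duplicate_order_id"] / total > max_dup_pct:
            failures.append("duplicate order_id above threshold")
    return {"passed": not failures, "failures": failures}

result = evaluate_quality(
    {"total_rows": 1000, "null_order_id": 0, "duplicate_order_id": 3}
)
```

Returning a structured result (rather than raising immediately) lets the orchestrator branch on `passed` and attach the failure list to the alert.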
Cost optimization from day one
A lot of teams “optimize later” and carry expensive design debt.
Make these decisions upfront:
- Partition strategy
  - date partitioning for large fact tables
  - avoid over-partitioning tiny datasets
- File format
  - use Parquet (or Iceberg where needed)
  - avoid long-term CSV in analytics layers
- Small files problem
  - compact files in clean/curated zones
- Query boundaries
  - enforce date filters in analytical queries
- Retry policy discipline
  - retries are good, unlimited retries are expensive
A single poor partitioning decision can cost more than all your orchestration tooling.
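To make the small-files point concrete, here is a tiny helper for sizing a compaction job's output. The 128 MB default is a commonly used Parquet-friendly target; treat it as a tunable assumption, not an AWS requirement:

```python
import math

def target_file_count(total_bytes, target_file_mb=128):
    """Number of output files to aim for when compacting a partition.

    128 MB is a common Parquet target size; it is an assumption here,
    not an AWS default.
    """
    target_bytes = target_file_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))

# A 10 GB partition compacts to 80 files of ~128 MB each
files = target_file_count(10 * 1024 ** 3)
```

Repartitioning to a number like this (instead of whatever the upstream job happened to emit) is often the single cheapest Athena scan-cost win.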
Lakehouse layering pattern (practical version)
Use a simple medallion-style approach:
- Raw: source-as-is, immutable, auditable
- Clean/Silver: normalized schema, standard datatypes, dedupe rules
- Curated/Gold: domain-ready models for BI, product, AI use cases
This layering is not just for BI. It is also what makes AI feature development easier, because feature teams consume stable curated entities instead of random source tables.
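One way to keep those layer boundaries explicit is a shared path convention that every job imports. The bucket name and key layout below are illustrative assumptions, not an AWS convention:

```python
def layer_path(layer, domain, table, dt, bucket="my-lake"):
    """Build an S3 prefix for a medallion layer.

    The raw/clean/curated + date-partition layout is a convention
    sketch; bucket and structure are assumptions.
    """
    allowed = {"raw", "clean", "curated"}
    if layer not in allowed:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3://{bucket}/{layer}/{domain}/{table}/dt={dt}/"

path = layer_path("curated", "sales", "orders", "2024-01-01")
```

Centralizing this in one function means a layer rename or bucket migration touches one place instead of every Glue job and dbt source definition.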
Where AI engineering starts for data engineers
If you are a data engineer trying to move toward AI engineering, your biggest leverage is not prompt engineering. It is reliable data product design.
AI systems fail most often due to:
- stale data
- inconsistent entity definitions
- weak lineage/traceability
- missing observability
All of these are core data engineering strengths.
A good transition path:
- Build pipeline quality and lineage discipline
- Create feature-ready curated entities
- Add serving patterns for AI workloads (batch + near-real-time)
- Learn vector/search retrieval patterns on top of trusted data
- Add model monitoring as an extension of pipeline monitoring
So the path is not “leave data engineering.” It is “extend platform engineering into AI workloads.”
Reference implementation sketch
A simple IaC + SQL + orchestration split might look like this:
```text
repo/
  infra/
    step_functions/
    glue_jobs/
    iam/
    catalog/
  pipelines/
    glue/
      raw_to_clean.py
      clean_to_curated.py
    transformations/
      dbt/
        models/
        tests/
  observability/
    quality_checks/
    alerts/
```
And a lightweight Step Functions task definition pattern:
```json
{
  "Comment": "Daily pipeline",
  "StartAt": "RunRawToClean",
  "States": {
    "RunRawToClean": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Next": "RunQualityChecks"
    },
    "RunQualityChecks": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Next": "RunCuratedModels"
    },
    "RunCuratedModels": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "End": true
    }
  }
}
```
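The sketch above deliberately omits error handling. Since retries and alerts were called out as good-fit signals, a hedged Retry/Catch variant of the Glue task might look like this; the interval and attempt values are illustrative, and `NotifyFailure` is an assumed alerting state (for example, an SNS publish):

```json
"RunRawToClean": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "NotifyFailure"
    }
  ],
  "Next": "RunQualityChecks"
}
```

Bounded `MaxAttempts` matters here: it is exactly the "retries are good, unlimited retries are expensive" discipline from the cost section.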
Common anti-patterns to avoid
- choosing tools first and requirements second
- mixing ingestion, transformation, and serving logic into one giant Glue job
- no clear raw/clean/curated boundaries
- orchestration without retries/alerts
- quality checks done manually after business reports break
Final checklist before you commit architecture
Before finalizing your design, answer yes/no:
- Do we know the exact freshness SLA?
- Is failure behavior explicitly designed?
- Are quality checks first-class stages?
- Is partitioning/file strategy documented?
- Can this architecture support future AI feature pipelines?
If you can answer yes to these, your tool choices (Glue, Athena, Step Functions, dbt) usually fall into place naturally.
If you are currently moving from GCP-based pipelines to AWS, this framework helps avoid the most common mistake: translating service names 1:1 instead of redesigning for product requirements.
In the next article, I will break down a concrete decision matrix: Glue vs EMR vs Athena SQL pipelines with cost, complexity, and team-skill trade-offs.