Data Quality Checks That Actually Catch Production Issues
Most teams say they have data quality checks, but in many projects quality issues are still discovered via a stakeholder message like:
“Why does this dashboard look wrong today?”
That usually means checks were present, but not designed to catch the failures that matter.
In this article, I’ll walk through a practical quality strategy for beginner and intermediate data engineers working with Glue, Athena, and dbt.
Why quality checks fail in real systems
The common pattern I see:
- one or two null checks
- one row count check
- no freshness guardrails
- no explicit data contracts
- no fail-fast stage in orchestration
So the checks pass, but the business still gets wrong data.
A better quality model: 5 gates
Treat quality as a set of gates in the pipeline:
- Schema gate — did structure change unexpectedly?
- Completeness gate — are required records present?
- Uniqueness gate — did dedupe assumptions break?
- Freshness gate — did data arrive on time?
- Drift gate — are distributions still within expected range?
If you are using Step Functions, these should be explicit states, not “nice-to-have” scripts.
flowchart LR
A[Ingestion] --> B[Schema Checks]
B --> C[Core Transform]
C --> D[Completeness + Uniqueness]
D --> E[Freshness + Drift]
E --> F[Publish Curated Data]
B --> X[Fail + Alert]
D --> X
E --> X
Minimum checks that deliver high value
If you’re early in maturity, start with these 8 checks:
- Required columns exist
- Column type compatibility for critical fields
- Primary key null check
- Primary key duplicate check
- Partition/date freshness check
- Minimum expected row threshold
- Whitelist check for major categorical values
- Referential integrity check for important joins
This gives you strong coverage without building a full data quality platform on day one.
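A few of these starter checks can be sketched in plain Python. This is a minimal illustration, not a Glue or dbt API: the rows-as-dicts shape, `REQUIRED_COLUMNS`, and `check_batch` are all illustrative names.

```python
# Minimal sketch of three of the checks above: required columns,
# primary-key nulls, and primary-key duplicates, over an in-memory batch.
from collections import Counter

REQUIRED_COLUMNS = {"customer_id", "event_time", "amount"}  # illustrative

def check_batch(rows, pk="customer_id"):
    """Return a dict of failure counts; all zeros means the batch passes."""
    failures = {"missing_columns": 0, "null_pk": 0, "duplicate_pk": 0}
    for row in rows:
        if not REQUIRED_COLUMNS.issubset(row):
            failures["missing_columns"] += 1
        if row.get(pk) is None:
            failures["null_pk"] += 1
    # Count how many extra copies of each non-null key exist.
    pk_counts = Counter(r[pk] for r in rows if r.get(pk) is not None)
    failures["duplicate_pk"] = sum(c - 1 for c in pk_counts.values() if c > 1)
    return failures
```

In a Glue job the same logic would run over a DataFrame instead of a list, but the shape of the result (named failure counts you can alert on) stays the same.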
Example SQL checks (Athena style)
Duplicate + null guard
SELECT
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_customer_id
FROM silver.customers
WHERE dt = current_date - interval '1' day;
Freshness check
SELECT
MAX(event_time) AS latest_event_time
FROM bronze.events;
In orchestration, compare latest_event_time against your expected SLA threshold.
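That comparison can be a few lines of Python in the orchestration layer. This is a sketch with an assumed 2-hour SLA; the function name and window are illustrative, not a recommendation.

```python
# Sketch of the SLA comparison: given the MAX(event_time) returned by the
# freshness query, decide whether the dataset is stale. The 2-hour SLA is
# an illustrative default.
from datetime import datetime, timedelta, timezone

def is_fresh(latest_event_time, sla=timedelta(hours=2), now=None):
    """True if the newest event arrived within the SLA window."""
    now = now or datetime.now(timezone.utc)
    return (now - latest_event_time) <= sla
```

Passing `now` explicitly keeps the check deterministic in tests; in production you would omit it and let the current UTC time be used.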
Volume anomaly check
WITH latest AS (
SELECT COUNT(*) AS c
FROM silver.orders
WHERE dt = current_date - interval '1' day
), baseline AS (
SELECT AVG(cnt) AS avg_cnt
FROM (
SELECT dt, COUNT(*) AS cnt
FROM silver.orders
WHERE dt BETWEEN current_date - interval '14' day AND current_date - interval '2' day
GROUP BY dt
)
)
SELECT latest.c, baseline.avg_cnt,
CASE WHEN latest.c < baseline.avg_cnt * 0.6 THEN 'ALERT' ELSE 'OK' END AS status
FROM latest CROSS JOIN baseline;
Where to run checks
A practical placement:
- Ingestion checks in Glue/Lambda before writing to the clean layer
- Model checks as dbt tests at the curated layer
- SLA checks in Step Functions after transform completion
This gives layered protection.
Quality + orchestration = real protection
Checks only matter if a failure actually stops the publish step.
In Step Functions:
- if quality status = fail → stop downstream publish
- send alert with context (dataset, partition, failing rule)
- keep failed batch quarantined for replay
Without this, quality checks become logging noise.
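The fail-fast behavior above can be sketched as a Lambda-style task: raising an exception fails the state, which lets the state machine's Catch branch alert and quarantine instead of publishing. The event shape and the `QualityGateFailure` name are illustrative assumptions, not a Step Functions API.

```python
# Sketch of a fail-fast quality gate as it might run inside a Lambda task
# in Step Functions. Raising fails the state; a Catch in the state machine
# routes to the alert + quarantine branch instead of the publish step.
class QualityGateFailure(Exception):
    pass

def quality_gate_handler(event, context=None):
    failing = [r["rule"] for r in event.get("results", []) if r["status"] != "pass"]
    if failing:
        # Include dataset/partition context so the alert is actionable.
        raise QualityGateFailure(
            f"dataset={event['dataset']} partition={event['partition']} "
            f"failing_rules={failing}"
        )
    return {"status": "pass", "dataset": event["dataset"]}
```

The key design choice is that the happy path returns and the failure path raises; nothing downstream has to remember to inspect a status flag.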
Data contracts for growing teams
As your org scales, quality logic should move from ad-hoc rules to data contracts.
A simple contract file can define:
- required fields
- valid ranges
- freshness window
- uniqueness constraints
- owner/team metadata
This helps onboarding and prevents hidden assumptions.
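A contract like that can start as a plain data structure plus one validator. This is a sketch: in practice the contract would live in a YAML/JSON file next to the dataset, and the fields and dataset name shown here are illustrative.

```python
# Sketch of a minimal data contract and a validator for it.
CONTRACT = {
    "dataset": "silver.customers",       # illustrative dataset
    "owner": "data-platform-team",       # owner/team metadata
    "required_fields": ["customer_id", "signup_date"],
    "unique_key": "customer_id",
    "valid_ranges": {"lifetime_value": (0, 1_000_000)},
    "freshness_hours": 24,
}

def violations(rows, contract=CONTRACT):
    """Return a list of human-readable contract violations for a batch."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        for field in contract["required_fields"]:
            if row.get(field) is None:
                problems.append(f"row {i}: missing required field {field}")
        key = row.get(contract["unique_key"])
        if key in seen:
            problems.append(f"row {i}: duplicate {contract['unique_key']}={key}")
        seen.add(key)
        for field, (lo, hi) in contract["valid_ranges"].items():
            value = row.get(field)
            if value is not None and not (lo <= value <= hi):
                problems.append(f"row {i}: {field}={value} outside [{lo}, {hi}]")
    return problems
```

Because the contract is data rather than code, the same file can drive the ingestion gate, the dbt test generation, and the onboarding docs.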
Why this matters for AI engineering too
If you are transitioning into AI engineering, remember:
- model quality depends on feature quality
- feature quality depends on data pipeline quality
The same checks that protect dashboards also protect feature stores, embeddings, and retrieval indexes.
A weak quality pipeline becomes an AI reliability problem later.
A practical rollout plan (4 weeks)
Week 1
- implement null + duplicate + freshness checks for top 3 datasets
Week 2
- add row-volume anomaly checks and alerting
Week 3
- enforce fail-fast publish in orchestration
Week 4
- define simple contract files for critical datasets
This phased model is realistic and creates visible impact fast.
Final take
Data quality should not be a reporting-side cleanup activity.
It should be a first-class architecture layer in your pipeline design.
If your checks do not block bad data from getting published, they are observability, not quality control.
In the next post, I’ll share a reference Step Functions + Glue design that wires these quality gates in a maintainable way.