Data Quality Checks That Actually Catch Production Issues

Most teams say they have data quality checks. Yet in many projects, quality issues are still discovered by a stakeholder message like:

“Why does this dashboard look wrong today?”

That usually means checks were present, but not designed to catch the failures that matter.

In this article, I’ll walk through a practical quality strategy for beginner and intermediate data engineers working with Glue, Athena, and dbt.

Why quality checks fail in real systems

The common pattern I see:

  • one or two null checks
  • one row count check
  • no freshness guardrails
  • no explicit data contracts
  • no fail-fast stage in orchestration

So the checks pass, but the business still gets wrong data.

A better quality model: 5 gates

Treat quality as a set of gates in the pipeline:

  1. Schema gate — did structure change unexpectedly?
  2. Completeness gate — are required records present?
  3. Uniqueness gate — did dedupe assumptions break?
  4. Freshness gate — did data arrive on time?
  5. Drift gate — are distributions still within expected range?

If you are using Step Functions, these should be explicit states, not “nice-to-have” scripts.

flowchart LR
    A[Ingestion] --> B[Schema Checks]
    B --> C[Core Transform]
    C --> D[Completeness + Uniqueness]
    D --> E[Freshness + Drift]
    E --> F[Publish Curated Data]
    B --> X[Fail + Alert]
    D --> X
    E --> X
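In Amazon States Language, such a gate is just a Choice state. A minimal sketch — the state names and the $.quality.status path are illustrative, not a fixed convention:

```json
"QualityGate": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.quality.status", "StringEquals": "FAIL", "Next": "QuarantineAndAlert" }
  ],
  "Default": "PublishCurated"
}
```

Because the gate is a first-class state, a failed check visibly halts the execution graph instead of being buried in a log line.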

Minimum checks that deliver high value

If you’re early in maturity, start with these 8 checks:

  1. Required columns exist
  2. Column type compatibility for critical fields
  3. Primary key null check
  4. Primary key duplicate check
  5. Partition/date freshness check
  6. Minimum expected row threshold
  7. Major categorical value whitelist check
  8. Referential integrity check for important joins

This gives you strong coverage without building a full data quality platform on day one.
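Checks 3 and 4 map directly onto dbt's built-in not_null and unique tests. A minimal schema.yml sketch, assuming a customers model keyed on customer_id (names are illustrative):

```yaml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - not_null   # check 3: primary key null check
          - unique     # check 4: primary key duplicate check
```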

Example SQL checks (Athena style)

Duplicate + null guard

SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_customer_id
FROM silver.customers
WHERE dt = current_date - interval '1' day;

Freshness check

SELECT
  MAX(event_time) AS latest_event_time
FROM bronze.events;

In orchestration, compare latest_event_time against the expected SLA threshold and fail the run if the lag is too large.
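That comparison can be sketched in plain Python, for example inside a Lambda invoked by Step Functions. The two-hour SLA and the freshness_check name are my own assumptions, not an AWS API:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative SLA: data must be no more than 2 hours old (assumption, tune per dataset).
FRESHNESS_SLA = timedelta(hours=2)

def freshness_check(latest_event_time: datetime, now: Optional[datetime] = None) -> str:
    """Return 'OK' if the newest event is inside the SLA window, else 'ALERT'."""
    now = now or datetime.now(timezone.utc)
    lag = now - latest_event_time
    return "OK" if lag <= FRESHNESS_SLA else "ALERT"

# A run at noon sees data 3 hours old -> SLA breach.
run_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_check(run_time - timedelta(hours=3), run_time))     # ALERT
print(freshness_check(run_time - timedelta(minutes=30), run_time))  # OK
```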

Volume anomaly check

WITH latest AS (
  SELECT COUNT(*) AS c
  FROM silver.orders
  WHERE dt = current_date - interval '1' day
), baseline AS (
  SELECT AVG(cnt) AS avg_cnt
  FROM (
    SELECT dt, COUNT(*) AS cnt
    FROM silver.orders
    WHERE dt BETWEEN current_date - interval '14' day AND current_date - interval '2' day
    GROUP BY dt
  ) AS daily
)
SELECT latest.c, baseline.avg_cnt,
       CASE WHEN latest.c < baseline.avg_cnt * 0.6 THEN 'ALERT' ELSE 'OK' END AS status
FROM latest CROSS JOIN baseline;

Where to run checks

A practical placement:

  • Ingestion checks in Glue/Lambda before writing clean layer
  • Model checks in dbt tests at curated layer
  • SLA checks in Step Functions after transform completion

This gives layered protection: each layer catches failures the others would miss.

Quality + orchestration = real protection

Checks only matter if failure stops publish.

In Step Functions:

  • if quality status = fail → stop downstream publish
  • send alert with context (dataset, partition, failing rule)
  • keep failed batch quarantined for replay

Without this, quality checks become logging noise.
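The fail-fast decision itself is simple. A minimal sketch, assuming check results arrive as a list of rule outcomes — the dict shape and the gate_decision name are illustrative, not Step Functions API:

```python
def gate_decision(dataset: str, partition: str, results: list) -> dict:
    """Fail fast: if any rule failed, block publish and return alert context."""
    failing = [r["rule"] for r in results if not r["passed"]]
    if failing:
        return {
            "action": "QUARANTINE",    # keep the failed batch for replay
            "dataset": dataset,
            "partition": partition,
            "failing_rules": failing,  # context for the alert
        }
    return {"action": "PUBLISH", "dataset": dataset, "partition": partition}

decision = gate_decision(
    "silver.orders",
    "dt=2024-01-01",
    [{"rule": "pk_not_null", "passed": True},
     {"rule": "row_volume", "passed": False}],
)
print(decision["action"])  # QUARANTINE
```

The returned dict carries exactly the alert context listed above: dataset, partition, and the failing rules.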

Data contracts for growing teams

As your org scales, quality logic should move from ad-hoc rules to data contracts.

A simple contract file can define:

  • required fields
  • valid ranges
  • freshness window
  • uniqueness constraints
  • owner/team metadata

This helps onboarding and prevents hidden assumptions.
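One lightweight way to start, before adopting a contract framework: keep the contract as a plain, versioned data structure and validate batches against it. The field names below are illustrative assumptions:

```python
# Illustrative contract for one dataset (field names are assumptions).
CUSTOMERS_CONTRACT = {
    "required_fields": ["customer_id", "email", "created_at"],
    "freshness_hours": 24,
    "unique_keys": ["customer_id"],
    "owner": "data-platform-team",
}

def missing_required_fields(contract: dict, observed_columns: list) -> list:
    """Return contract fields absent from the observed schema."""
    return [f for f in contract["required_fields"] if f not in observed_columns]

# A batch arriving without created_at violates the contract.
print(missing_required_fields(CUSTOMERS_CONTRACT, ["customer_id", "email"]))
# ['created_at']
```

The same structure can later be serialized to YAML and enforced centrally, so rules and ownership live next to the dataset instead of inside individual jobs.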

Why this matters for AI engineering too

If you are transitioning into AI engineering, remember:

  • model quality depends on feature quality
  • feature quality depends on data pipeline quality

The same checks that protect dashboards also protect feature stores, embeddings, and retrieval indexes.

A weak quality pipeline becomes an AI reliability problem later.

A practical rollout plan (4 weeks)

Week 1

  • implement null + duplicate + freshness checks for top 3 datasets

Week 2

  • add row-volume anomaly checks and alerting

Week 3

  • enforce fail-fast publish in orchestration

Week 4

  • define simple contract files for critical datasets

This phased model is realistic and creates visible impact fast.

Final take

Data quality should not be a reporting-side cleanup activity.

It should be a first-class architecture layer in your pipeline design.

If your checks do not block bad data from getting published, they are observability, not quality control.

In the next post, I’ll share a reference Step Functions + Glue design that wires these quality gates in a maintainable way.

This post is licensed under CC BY 4.0 by the author.