Data Quality Checks That Actually Catch Production Issues

Most teams say they have data quality checks. Yet in many projects, quality issues are still discovered by a stakeholder message like:

“Why does this dashboard look wrong today?”

That usually means checks were present, but not designed to catch the failures that matter.

In this article, I’ll walk through a practical quality strategy for beginner and intermediate data engineers working with Glue, Athena, and dbt.

Why quality checks fail in real systems

The common pattern I see:

  • one or two null checks
  • one row count check
  • no freshness guardrails
  • no explicit data contracts
  • no fail-fast stage in orchestration

So the checks pass, but the business still gets wrong data.

A better quality model: 5 gates

Treat quality as a set of gates in the pipeline:

  1. Schema gate — did structure change unexpectedly?
  2. Completeness gate — are required records present?
  3. Uniqueness gate — did dedupe assumptions break?
  4. Freshness gate — did data arrive on time?
  5. Drift gate — are distributions still within expected range?

If you are using Step Functions, these should be explicit states, not “nice-to-have” scripts.

flowchart LR
    A[Ingestion] --> B[Schema Checks]
    B --> C[Core Transform]
    C --> D[Completeness + Uniqueness]
    D --> E[Freshness + Drift]
    E --> F[Publish Curated Data]
    B --> X[Fail + Alert]
    D --> X
    E --> X
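In Amazon States Language, such a gate is just a Choice state. A minimal sketch — the state names and the $.quality.status path are illustrative, not a fixed convention:

```json
"QualityGate": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.quality.status", "StringEquals": "FAIL", "Next": "QuarantineAndAlert" }
  ],
  "Default": "PublishCurated"
}
```

Because the gate is a first-class state, a failed check visibly halts the execution graph instead of being buried in a log line.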

Minimum checks that deliver high value

If you’re early in maturity, start with these 8 checks:

  1. Required columns exist
  2. Column type compatibility for critical fields
  3. Primary key null check
  4. Primary key duplicate check
  5. Partition/date freshness check
  6. Minimum expected row threshold
  7. Major categorical value whitelist check
  8. Referential integrity check for important joins

This gives you strong coverage without building a full data quality platform on day one.
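Checks 3 and 4 map directly onto dbt's built-in not_null and unique tests. A minimal schema.yml sketch, assuming a customers model keyed on customer_id (names are illustrative):

```yaml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - not_null   # check 3: primary key null check
          - unique     # check 4: primary key duplicate check
```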

Example SQL checks (Athena style)

Duplicate + null guard

SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_customer_id
FROM silver.customers
WHERE dt = current_date - interval '1' day;

Freshness check

SELECT
  MAX(event_time) AS latest_event_time
FROM bronze.events;

In orchestration, compare latest_event_time against the expected SLA threshold and fail the run if the lag is too large.
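That comparison can be sketched in plain Python, for example inside a Lambda invoked by Step Functions. The two-hour SLA and the freshness_check name are my own assumptions, not an AWS API:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative SLA: data must be no more than 2 hours old (assumption, tune per dataset).
FRESHNESS_SLA = timedelta(hours=2)

def freshness_check(latest_event_time: datetime, now: Optional[datetime] = None) -> str:
    """Return 'OK' if the newest event is inside the SLA window, else 'ALERT'."""
    now = now or datetime.now(timezone.utc)
    lag = now - latest_event_time
    return "OK" if lag <= FRESHNESS_SLA else "ALERT"

# A run at noon sees data 3 hours old -> SLA breach.
run_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_check(run_time - timedelta(hours=3), run_time))     # ALERT
print(freshness_check(run_time - timedelta(minutes=30), run_time))  # OK
```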

Volume anomaly check

WITH latest AS (
  SELECT COUNT(*) AS c
  FROM silver.orders
  WHERE dt = current_date - interval '1' day
), baseline AS (
  SELECT AVG(cnt) AS avg_cnt
  FROM (
    SELECT dt, COUNT(*) AS cnt
    FROM silver.orders
    WHERE dt BETWEEN current_date - interval '14' day AND current_date - interval '2' day
    GROUP BY dt
  ) AS daily
)
SELECT latest.c, baseline.avg_cnt,
       CASE WHEN latest.c < baseline.avg_cnt * 0.6 THEN 'ALERT' ELSE 'OK' END AS status
FROM latest CROSS JOIN baseline;

Where to run checks

A practical placement:

  • Ingestion checks in Glue/Lambda before writing clean layer
  • Model checks in dbt tests at curated layer
  • SLA checks in Step Functions after transform completion

This gives layered protection: each layer catches failures the others would miss.

Quality + orchestration = real protection

Checks only matter if failure stops publish.

In Step Functions:

  • if quality status = fail → stop downstream publish
  • send alert with context (dataset, partition, failing rule)
  • keep failed batch quarantined for replay

Without this, quality checks become logging noise.
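The fail-fast decision itself is simple. A minimal sketch, assuming check results arrive as a list of rule outcomes — the dict shape and the gate_decision name are illustrative, not Step Functions API:

```python
def gate_decision(dataset: str, partition: str, results: list) -> dict:
    """Fail fast: if any rule failed, block publish and return alert context."""
    failing = [r["rule"] for r in results if not r["passed"]]
    if failing:
        return {
            "action": "QUARANTINE",    # keep the failed batch for replay
            "dataset": dataset,
            "partition": partition,
            "failing_rules": failing,  # context for the alert
        }
    return {"action": "PUBLISH", "dataset": dataset, "partition": partition}

decision = gate_decision(
    "silver.orders",
    "dt=2024-01-01",
    [{"rule": "pk_not_null", "passed": True},
     {"rule": "row_volume", "passed": False}],
)
print(decision["action"])  # QUARANTINE
```

The returned dict carries exactly the alert context listed above: dataset, partition, and the failing rules.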

Data contracts for growing teams

As your org scales, quality logic should move from ad-hoc rules to data contracts.

A simple contract file can define:

  • required fields
  • valid ranges
  • freshness window
  • uniqueness constraints
  • owner/team metadata

This helps onboarding and prevents hidden assumptions.
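One lightweight way to start, before adopting a contract framework: keep the contract as a plain, versioned data structure and validate batches against it. The field names below are illustrative assumptions:

```python
# Illustrative contract for one dataset (field names are assumptions).
CUSTOMERS_CONTRACT = {
    "required_fields": ["customer_id", "email", "created_at"],
    "freshness_hours": 24,
    "unique_keys": ["customer_id"],
    "owner": "data-platform-team",
}

def missing_required_fields(contract: dict, observed_columns: list) -> list:
    """Return contract fields absent from the observed schema."""
    return [f for f in contract["required_fields"] if f not in observed_columns]

# A batch arriving without created_at violates the contract.
print(missing_required_fields(CUSTOMERS_CONTRACT, ["customer_id", "email"]))
# ['created_at']
```

The same structure can later be serialized to YAML and enforced centrally, so rules and ownership live next to the dataset instead of inside individual jobs.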

Why this matters for AI engineering too

If you are transitioning into AI engineering, remember:

  • model quality depends on feature quality
  • feature quality depends on data pipeline quality

The same checks that protect dashboards also protect feature stores, embeddings, and retrieval indexes.

A weak quality pipeline becomes an AI reliability problem later.

A practical rollout plan (4 weeks)

Week 1

  • implement null + duplicate + freshness checks for top 3 datasets

Week 2

  • add row-volume anomaly checks and alerting

Week 3

  • enforce fail-fast publish in orchestration

Week 4

  • define simple contract files for critical datasets

This phased model is realistic and creates visible impact fast.

Final take

Data quality should not be a reporting-side cleanup activity.

It should be a first-class architecture layer in your pipeline design.

If your checks do not block bad data from getting published, they are observability, not quality control.

In the next post, I’ll share a reference Step Functions + Glue design that wires these quality gates in a maintainable way.

This post is licensed under CC BY 4.0 by the author.