
Data quality checks every beginner team should add

In this article, let us look at a few data quality checks that every beginner team should add early in their pipeline. This matters because most data problems are not fancy platform problems. Usually the issue is that a file is missing, a key column became null, a table has duplicate records, or a transformation loaded half the rows we expected. If we add a few simple checks at the right places, we can catch these issues before a dashboard or a downstream job starts showing wrong numbers.

When a team is just starting, it is easy to focus only on making the pipeline run. I have done that too. The job runs, the table gets created, and everyone feels happy for a few days. Then one source file changes format, someone sends duplicate rows, or a filter in the transformation removes more records than expected. At that point we realize that a successful pipeline run is not the same as a trustworthy dataset.

The good thing is that we do not need a complicated framework on day one. A handful of basic checks already gives a lot of value.

Where to add the checks

I prefer thinking about checks in three places:

  1. before loading the source file
  2. after loading the raw table
  3. after building the cleaned or reporting table

This gives us coverage across the full pipeline. If we only check the final table, we may detect issues too late. If we only check the input file, we may miss transformation bugs.

The main checks I would add first

A simple overview looks like this:

| Check | Why it matters | Example |
| --- | --- | --- |
| File presence | Detect missing delivery | Daily file did not arrive by 8 AM |
| Row count | Detect empty or unexpectedly small loads | Yesterday had 1M rows, today has 20 |
| Null checks | Protect key fields | order_id or customer_id becomes null |
| Duplicate checks | Prevent double counting | Same order appears twice |
| Accepted values | Catch invalid codes | status should be NEW, PAID, or CANCELLED |
| Freshness | Ensure the table is updated recently | max event time is two days old |

These are not the only checks possible, but they are the ones I would start with for almost every batch pipeline.

1. Check that the source file actually arrived

This sounds basic, but it prevents many incidents. If a daily file is missing, there is no point running the rest of the pipeline unless it is explicitly designed to handle late or missing deliveries.

Pseudo-code for a storage check could look like this:

files = list_files("s3://demo-landing/orders/2025-01-20/")
if len(files) == 0:
    raise Exception("Source file missing for orders feed")

If the source can send more than one file, then the check can be based on a naming pattern or minimum expected file count. In production, I would also log the exact file names so troubleshooting is easier.
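
If the landing zone is object storage like S3, a concrete version of that pseudo-code could look like the sketch below. This is only a sketch: the bucket name and prefix are placeholders taken from the example path above, not a real feed.

import boto3

# List today's landing prefix; the bucket and prefix are placeholder values.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="demo-landing", Prefix="orders/2025-01-20/")
files = [obj["Key"] for obj in response.get("Contents", [])]

if not files:
    raise RuntimeError("Source file missing for orders feed")

# Logging the exact file names makes troubleshooting easier later.
print(f"Found {len(files)} file(s): {files}")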

2. Check the row count after load

Once the file is loaded into a raw table, I like to compare the row count with a simple threshold. We do not always need exact equality with yesterday, because business volumes change. But a big drop or spike is usually worth checking.

For example:

select count(*) as row_count
from raw_orders
where load_date = '2025-01-20';

Then compare that with the recent average:

select avg(cnt) as avg_row_count
from (
  select count(*) as cnt
  from raw_orders
  where load_date between '2025-01-13' and '2025-01-19'
  group by load_date
) t;

If today's count is 80 percent lower than normal, I would fail the pipeline or at least raise an alert. For a beginner team, even a simple rule like “today must be at least 50 percent of the 7-day average” is already useful.
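
A hedged sketch of that rule in Python could look like this. It assumes today_count and avg_count come from the two queries above, and the 50 percent threshold is just a starting point that each team should tune.

# Minimal sketch of the "at least 50 percent of the 7-day average" rule.
# today_count and avg_count are assumed to come from the two queries above.
def check_row_count(today_count: int, avg_count: float, min_ratio: float = 0.5) -> None:
    if avg_count > 0 and today_count < min_ratio * avg_count:
        raise ValueError(
            f"Row count {today_count} is below {min_ratio:.0%} of "
            f"the 7-day average ({avg_count:.0f})"
        )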

3. Add not-null checks on important columns

Not every column needs a null check. I usually apply it to business keys, timestamps, partition columns, and fields required by downstream reporting.

Example:

select count(*) as bad_rows
from raw_orders
where order_id is null
   or order_timestamp is null;

If order_id is null, deduplication becomes hard. If order_timestamp is null, time-based reporting can break. This is one of the easiest checks to write and one of the most useful.

4. Check duplicates on business keys

Duplicate records create a lot of confusion because the pipeline may still succeed technically. The problem only appears later when revenue looks too high or customer counts are inflated.

A common duplicate check is:

select order_id, count(*) as cnt
from raw_orders
where load_date = '2025-01-20'
group by order_id
having count(*) > 1;

If the source is append-only and each order is expected once per day, duplicates should be treated as a failure. But if the source sends updates for the same key, then the rule changes. In that case I would deduplicate with row_number() and check whether the duplicate rate is unusually high.
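
For the duplicate-rate variant, a small Python sketch could sit on top of the query above. The inputs here are assumptions: duplicate_rows would be the count of extra rows beyond one per order_id, and the 1 percent threshold is an arbitrary starting point, not a standard.

# Sketch of an "unusually high duplicate rate" rule. duplicate_rows and total_rows
# are assumed to come from queries like the one above; the threshold is arbitrary.
def check_duplicate_rate(duplicate_rows: int, total_rows: int, max_rate: float = 0.01) -> None:
    if total_rows == 0:
        raise ValueError("No rows loaded for this date, cannot compute duplicate rate")
    rate = duplicate_rows / total_rows
    if rate > max_rate:
        raise ValueError(f"Duplicate rate {rate:.2%} is above the threshold of {max_rate:.2%}")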

5. Validate accepted values

Some columns should only contain a known set of values. For example, order status may only allow NEW, PAID, SHIPPED, and CANCELLED. If we suddenly get PENDNG because of a typo upstream, we want to catch that early.

select distinct status
from raw_orders
where status not in ('NEW', 'PAID', 'SHIPPED', 'CANCELLED');

This check is especially helpful for dimensions, reference codes, and flags. It is simple, but it protects dashboards and downstream business logic from unexpected category drift.

6. Check freshness on the final table

Sometimes the raw load is fine, but the transformation job reads an old partition or fails silently halfway through a chain of tasks. A freshness check on the final table helps catch that.

select max(order_timestamp) as latest_order_ts
from analytics.orders;

If the latest timestamp is far behind the expected business time, something is wrong. For our use case, we could alert if the table is more than one day stale.
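
A small sketch of that staleness rule could look like this, assuming latest_order_ts is the result of the query above parsed into a timezone-aware datetime.

from datetime import datetime, timedelta, timezone

# Sketch of the "no more than one day stale" rule. latest_order_ts is assumed to be
# the max(order_timestamp) value from the query above, as a timezone-aware datetime.
def check_freshness(latest_order_ts: datetime, max_staleness: timedelta = timedelta(days=1)) -> None:
    now = datetime.now(timezone.utc)
    if now - latest_order_ts > max_staleness:
        raise ValueError(
            f"analytics.orders looks stale: latest order timestamp {latest_order_ts} "
            f"is more than {max_staleness} behind {now}"
        )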

What a simple validation flow can look like

A basic orchestration sequence could be:

1. Check source file exists
2. Load file into raw table
3. Run row count, null, and duplicate checks
4. Build cleaned table
5. Run freshness and business rule checks
6. Publish only if all checks pass

This is enough for many beginner teams. We do not need a huge data observability platform to get started.
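
As a rough illustration, the same sequence could be wired together in plain Python. The step functions here are hypothetical placeholders, not a real framework; the only idea is that each check raises an exception, so the publish step is never reached when something fails.

# Hypothetical orchestration sketch: each step function would be implemented
# elsewhere, and every check raises on failure so publishing never happens silently.
def run_orders_pipeline(load_date: str) -> None:
    check_source_file_exists(load_date)   # 1. fail fast if the daily file is missing
    load_file_into_raw_table(load_date)   # 2. load into the raw table
    run_raw_table_checks(load_date)       # 3. row count, null, and duplicate checks
    build_cleaned_table(load_date)        # 4. transformations
    run_final_table_checks(load_date)     # 5. freshness and business rule checks
    publish_cleaned_table(load_date)      # 6. only reached if all checks passed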

What changes in production

In a small demo, we might just fail the job and inspect the SQL manually. In production, I would add a few more things:

  • store validation results in a table for trend analysis (see the sketch after this list)
  • alert through orchestration or messaging when a check fails
  • separate warning-level checks from failure-level checks
  • quarantine bad files instead of loading them blindly
  • document the expected rules for each important dataset
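
For the first point in this list, a tiny helper that appends each check result to a table is usually enough to start. This is only a sketch: the table name, columns, and the DB-API style connection with "?" placeholders are assumptions, not any specific platform's API.

from datetime import datetime, timezone

# Hypothetical helper: append one check result to a validation_results table so
# failures and trends can be reviewed later. Table and column names are assumed.
def record_check_result(conn, dataset: str, check_name: str, passed: bool, details: str = "") -> None:
    conn.execute(
        "insert into validation_results (checked_at, dataset, check_name, passed, details) "
        "values (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), dataset, check_name, int(passed), details),
    )
    conn.commit()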

Another production difference is ownership. Somebody should know whether a failed check is because of a source issue, pipeline bug, or real business event. Without that, even good checks become noise after a while.

Limitations and things to be careful about

There are a few caveats here.

  1. A row count check is useful, but promotions or seasonal traffic can make the numbers look unusual even when the data is correct.
  2. Duplicate checks depend on understanding the business key properly. If the key logic is wrong, the check will mislead us.
  3. Accepted-value rules need maintenance when the source system adds a legitimate new status.
  4. Passing all checks does not prove the data is perfect. It only gives us better guardrails.

So these checks are not a replacement for understanding the data. They are just practical controls that catch the most common problems early.

Conclusion

If a beginner team adds only a few data quality checks, I would start with file presence, row count, not-null, duplicate, accepted-value, and freshness checks. They are simple to implement, easy to explain, and they prevent many painful downstream surprises. Over time we can add richer validation, but these basics already make a pipeline much more reliable.

This post is licensed under CC BY 4.0 by the author.