
BigQuery partitioning and clustering basics for faster and cheaper queries

In this article, let us see what BigQuery partitioning and clustering are, why we would use them, and how to get started with them using a simple example. If you are working with BigQuery tables that keep growing every day, these two features can make a very big difference in query cost and runtime.

When we start with BigQuery, it is tempting to put all data into a table and just query it however we want. That works for demos and small datasets. But once the table becomes large, we might end up scanning far more data than needed. That is where partitioning and clustering help.

Partitioning is mainly about breaking a table into logical chunks, usually by date or timestamp. Clustering is about organizing the data inside those partitions based on selected columns. In practice, many teams use both together.

What is partitioning in BigQuery

Partitioning means BigQuery stores the data in separate partitions based on a partition column or ingestion time. When our query includes a filter on the partition column, BigQuery can skip unrelated partitions and scan less data.

For example, if we have an events table with one year of data and we only want yesterday’s events, partitioning allows BigQuery to read just that partition instead of the whole table.

The most common partitioning choices are:

  1. Ingestion time partitioning
  2. Partitioning by a DATE column
  3. Partitioning by a TIMESTAMP or DATETIME column
  4. Integer range partitioning for some special use cases

For most data engineering use cases, partitioning by a business date column or event timestamp is easier to reason about than ingestion time partitioning.
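To make the options above concrete, here is a minimal sketch of ingestion-time and integer range partitioning DDL. The table names are illustrative, not from the article's example:

```sql
-- Ingestion-time partitioning: BigQuery partitions rows by load time and
-- exposes the pseudo-columns _PARTITIONTIME / _PARTITIONDATE for filtering.
-- (demo.analytics.raw_events is a hypothetical table name.)
CREATE TABLE `demo.analytics.raw_events` (
  event_id STRING,
  payload STRING
)
PARTITION BY _PARTITIONDATE;

-- Integer range partitioning: bucket rows by a numeric column, here in
-- steps of 10,000 up to 1,000,000.
CREATE TABLE `demo.analytics.customer_buckets` (
  customer_num INT64,
  payload STRING
)
PARTITION BY RANGE_BUCKET(customer_num, GENERATE_ARRAY(0, 1000000, 10000));
```

Column-based partitioning, shown later in this article, is usually the better default when the data carries its own business date.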

What is clustering in BigQuery

Clustering sorts and organizes the data in a table based on up to four columns. It does not create separate physical partitions like partitioning does, but it helps BigQuery prune blocks of data more efficiently when we filter on clustered columns.

If a table is partitioned by event_date and clustered by customer_id and country, then queries filtering by date and customer_id usually perform better than they would on a non-clustered table.

Clustering is useful when:

  • we often filter or group by a few repeated columns
  • partitioning alone is too broad
  • we want better performance without creating too many partitions
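Note that clustering does not require partitioning at all. As a sketch, a table with no natural time column can still be clustered on its most-filtered dimensions (the table name below is illustrative):

```sql
-- A clustered-only table: no partitions, but BigQuery still organizes
-- storage blocks by these columns, which helps prune data when queries
-- filter on country or status.
CREATE TABLE `demo.analytics.customers` (
  customer_id STRING,
  country STRING,
  status STRING,
  signup_ts TIMESTAMP
)
CLUSTER BY country, status;
```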

Partitioning vs clustering

| Feature | Partitioning | Clustering |
| --- | --- | --- |
| Main purpose | Reduce scanned data by skipping partitions | Improve pruning within a table or partition |
| Works best with | Date, timestamp, ingestion time, integer ranges | Frequently filtered dimensions |
| Required query pattern | Filter on partition column | Filter on clustered columns |
| Limits to watch | Too many partitions can be a problem | Only up to 4 clustering columns |
| Typical usage | event_date, created_at | customer_id, region, status |

For our use case, if we have daily event data, a good starting point is to partition by event_date and cluster by customer_id or event_type, depending on the common query pattern.

Create a sample partitioned and clustered table

Let us create a simple table in BigQuery. Assume we are storing application events.

CREATE TABLE `demo.analytics.app_events` (
  event_id STRING,
  customer_id STRING,
  event_type STRING,
  country STRING,
  event_timestamp TIMESTAMP,
  event_date DATE,
  revenue NUMERIC
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type;

Here:

  • event_date is the partition column
  • customer_id and event_type are clustering columns

If your source already has a timestamp column, it is common to derive the date during ingestion.

INSERT INTO `demo.analytics.app_events`
SELECT
  event_id,
  customer_id,
  event_type,
  country,
  event_timestamp,
  DATE(event_timestamp) AS event_date,
  revenue
FROM `demo.staging.raw_app_events`;

How query cost improves

Now let us see why this matters. Suppose we want purchases for one customer in the last seven days.

SELECT
  customer_id,
  event_type,
  SUM(revenue) AS total_revenue
FROM `demo.analytics.app_events`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
  AND customer_id = 'CUST_1001'
  AND event_type = 'purchase'
GROUP BY 1, 2;

This query benefits in two ways:

  1. The event_date filter reduces the number of partitions scanned
  2. The customer_id and event_type filters help BigQuery prune clustered storage blocks inside those partitions

Without partitioning and clustering, BigQuery may need to read much more data even though the result is small.
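One way to verify the improvement is to check how many bytes recent jobs actually processed. A rough sketch using the INFORMATION_SCHEMA jobs view (the region qualifier and project depend on your setup):

```sql
-- List recent jobs in this project with their bytes processed.
-- The view is partitioned on creation_time, so always filter on it.
SELECT
  job_id,
  creation_time,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC
LIMIT 10;
```

Running the same query with and without the event_date filter and comparing total_bytes_processed makes the pruning effect visible.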

Require partition filters

One thing I prefer in production is to require partition filters. This prevents people from accidentally scanning the full table.

ALTER TABLE `demo.analytics.app_events`
SET OPTIONS (
  require_partition_filter = TRUE
);

This is a very useful setting for large fact tables. Otherwise, one missing WHERE clause can become an expensive mistake.
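To see the guardrail in action, compare the two queries below. With require_partition_filter set, the first is rejected with an error along the lines of "Cannot query over table ... without a filter over column(s) 'event_date'", while the second runs normally:

```sql
-- Rejected: no filter on the partition column.
SELECT COUNT(*) FROM `demo.analytics.app_events`;

-- Accepted: the partition filter allows partition elimination.
SELECT COUNT(*)
FROM `demo.analytics.app_events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```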

When partitioning helps the most

Partitioning helps a lot when queries naturally work on a time window, for example:

  • yesterday’s orders
  • last 30 days of clickstream
  • monthly billing data
  • daily incremental loads

If users rarely filter by date, partitioning might not help much. In that case, clustering or even a different table design may matter more.

Also, avoid over-partitioning without a reason. In many cases, daily partitioning is enough. Some teams try to partition too aggressively and then the design becomes harder to maintain.
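If daily partitions would be too fine-grained, BigQuery also supports coarser granularities on date columns. A sketch for monthly billing data (table name is illustrative):

```sql
-- Monthly partitioning: DATE_TRUNC collapses each billing_date into its
-- month, producing 12 partitions per year instead of ~365.
CREATE TABLE `demo.analytics.billing_monthly` (
  invoice_id STRING,
  billing_date DATE,
  amount NUMERIC
)
PARTITION BY DATE_TRUNC(billing_date, MONTH);
```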

Choosing clustering columns

The clustering columns should come from real query patterns, not guesses. Good candidates are columns that:

  • appear often in WHERE clauses
  • have enough repetition to help pruning
  • are commonly used together with the partition filter

For example, these can be reasonable choices depending on the workload:

  • customer_id
  • region
  • event_type
  • order_status

I would not just pick four columns because BigQuery allows four. Too many low-value clustering columns can make the design less useful. Start with one or two important ones and monitor query behavior.
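One way to ground the choice in real query patterns is to mine recent job history for queries that touch the table. A rough sketch, assuming the app_events table from earlier and a US-region project:

```sql
-- Find the most expensive recent queries against app_events, then read
-- their WHERE clauses to see which columns are actually filtered on.
SELECT
  query,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (
    SELECT 1 FROM UNNEST(referenced_tables) AS t
    WHERE t.table_id = 'app_events'
  )
ORDER BY total_bytes_processed DESC
LIMIT 20;
```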

Things to be careful about

There are a few practical limitations and caveats.

1. Partitioning does not fix bad queries

If the query does not filter on the partition column, the benefits disappear. I have seen correctly partitioned tables where users still scanned everything because their queries did not include the date filter.

2. Clustering benefits depend on table size

Clustering is more noticeable on larger tables. On small tables, the difference may be minimal.

3. Pick the right partition column

If we partition by ingestion date but all reporting is based on business event date, users may get confused and performance may not match expectations. Usually the partitioning column should match how the data is queried.

4. Streaming and late arriving data

If your data arrives late, then partitioning by event date can still work, but your ingestion logic needs to handle backfills properly. In production, that means thinking about late arriving events, reprocessing, and partition overwrite strategy.
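One common backfill pattern is a MERGE from a staging table into the partitioned table, keyed on the event id. A sketch using the app_events schema from earlier (the staging table name is hypothetical):

```sql
-- Insert late-arriving events that are not already in the target table.
-- Matching on event_date as well keeps the merge partition-prunable.
MERGE `demo.analytics.app_events` AS target
USING `demo.staging.late_app_events` AS source
ON target.event_id = source.event_id
   AND target.event_date = source.event_date
WHEN NOT MATCHED THEN
  INSERT (event_id, customer_id, event_type, country,
          event_timestamp, event_date, revenue)
  VALUES (source.event_id, source.customer_id, source.event_type, source.country,
          source.event_timestamp, source.event_date, source.revenue);
```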

5. Too many tiny partitions are not useful

If each partition has very little data, then the complexity may not bring much value. Partition only when there is enough data and a clear access pattern.
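To check whether partitions are ending up tiny, the partition metadata view gives per-partition row and byte counts. A sketch for the app_events table (demo is the project, analytics the dataset):

```sql
-- Inspect the most recent partitions of app_events to confirm they hold
-- enough data to justify the partitioning scheme.
SELECT
  partition_id,
  total_rows,
  total_logical_bytes
FROM `demo.analytics.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'app_events'
ORDER BY partition_id DESC
LIMIT 7;
```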

What I would change in production

For a simple demo, creating the table manually with SQL is enough. In production, I would usually do a few more things:

  1. Manage table creation with Terraform or another IaC tool
  2. Enforce partition filters on large tables
  3. Add data quality checks for partition columns
  4. Monitor query cost and bytes scanned
  5. Review clustering columns every so often based on actual usage

A production design also needs us to think about retention. Some tables only need 90 days of data, while others need years. BigQuery table and partition expiration settings can help here.
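As a sketch, partition expiration for the app_events table could look like this; partitions older than the threshold are dropped automatically:

```sql
-- Automatically delete partitions more than 90 days old.
ALTER TABLE `demo.analytics.app_events`
SET OPTIONS (partition_expiration_days = 90);
```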

Simple rule of thumb

A simple way to think about it is this:

  • Use partitioning to narrow by time
  • Use clustering to narrow further by business dimensions

If your analysts mostly query recent data for a subset of customers, both features together are usually a good fit.

Conclusion

BigQuery partitioning and clustering are not difficult features, but they become very important once your tables start growing. Partitioning reduces the amount of data scanned, and clustering helps BigQuery organize the data for common filters. If you pick the columns based on real query patterns and not just theory, you can save both time and cost. For most event-style datasets, partitioning by date and adding one or two useful clustering columns is a very good place to start.

This post is licensed under CC BY 4.0 by the author.