BigQuery partitioning and clustering basics for faster and cheaper queries
In this article, let us see what BigQuery partitioning and clustering are, why we would use them, and how to get started with them using a simple example. If you are working with BigQuery tables that keep growing every day, these two features can make a very big difference in query cost and runtime.
When we start with BigQuery, it is tempting to put all data into a table and just query it however we want. That works for demos and small datasets. But once the table becomes large, we might end up scanning far more data than needed. That is where partitioning and clustering help.
Partitioning is mainly about breaking a table into logical chunks, usually by date or timestamp. Clustering is about organizing the data inside those partitions based on selected columns. In practice, many teams use both together.
What is partitioning in BigQuery
Partitioning means BigQuery stores the data in separate partitions based on a partition column or ingestion time. When our query includes a filter on the partition column, BigQuery can skip unrelated partitions and scan less data.
For example, if we have an events table with one year of data and we only want yesterday’s events, partitioning allows BigQuery to read just that partition instead of the whole table.
The most common partitioning choices are:
- Ingestion time partitioning
- Partitioning by a DATE column
- Partitioning by a TIMESTAMP or DATETIME column
- Integer range partitioning for some special use cases
For most data engineering use cases, partitioning by a business date column or event timestamp is easier to reason about than ingestion time partitioning.
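To make the difference concrete, here is a minimal sketch of both styles. The dataset and table names are hypothetical; `_PARTITIONDATE` is the pseudo column BigQuery exposes for ingestion-time partitioned tables.

```sql
-- Ingestion-time partitioning: BigQuery partitions by load time,
-- exposed through the _PARTITIONDATE pseudo column.
CREATE TABLE `demo.analytics.clicks_by_load_time` (
  click_id STRING,
  clicked_at TIMESTAMP
)
PARTITION BY _PARTITIONDATE;

-- Column partitioning: partition on the DATE of a TIMESTAMP column,
-- which usually matches how the data is actually queried.
CREATE TABLE `demo.analytics.clicks_by_event_time` (
  click_id STRING,
  clicked_at TIMESTAMP
)
PARTITION BY DATE(clicked_at);
```

The second form is usually the one analysts expect, because the partition boundary follows the business event time rather than when the row happened to arrive.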
What is clustering in BigQuery
Clustering sorts and organizes the data in a table based on up to four columns. It does not create separate physical partitions like partitioning does, but it helps BigQuery prune blocks of data more efficiently when we filter on clustered columns.
If a table is partitioned by event_date and clustered by customer_id and country, then queries that filter by event_date and customer_id usually perform better than they would on a table without clustering.
Clustering is useful when:
- we often filter or group by a few repeated columns
- partitioning alone is too broad
- we want better performance without creating too many partitions
Partitioning vs clustering
| Feature | Partitioning | Clustering |
|---|---|---|
| Main purpose | Reduce scanned data by skipping partitions | Improve pruning within a table or partition |
| Works best with | Date, timestamp, ingestion time, integer ranges | Frequently filtered dimensions |
| Required query pattern | Filter on partition column | Filter on clustered columns |
| Limits to watch | Too many partitions can be a problem | Only up to 4 clustering columns |
| Typical usage | event_date, created_at | customer_id, region, status |
For our use case, if we have daily event data, a good starting point is partition by event_date and cluster by customer_id or event_type depending on the common query pattern.
Create a sample partitioned and clustered table
Let us create a simple table in BigQuery. Assume we are storing application events.
```sql
CREATE TABLE `demo.analytics.app_events` (
  event_id STRING,
  customer_id STRING,
  event_type STRING,
  country STRING,
  event_timestamp TIMESTAMP,
  event_date DATE,
  revenue NUMERIC
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type;
```
Here:
- event_date is the partition column
- customer_id and event_type are the clustering columns
If your source already has a timestamp column, it is common to derive the date during ingestion.
```sql
INSERT INTO `demo.analytics.app_events`
SELECT
  event_id,
  customer_id,
  event_type,
  country,
  event_timestamp,
  DATE(event_timestamp) AS event_date,
  revenue
FROM `demo.staging.raw_app_events`;
```
How query cost improves
Now let us see why this matters. Suppose we want purchases for one customer in the last seven days.
```sql
SELECT
  customer_id,
  event_type,
  SUM(revenue) AS total_revenue
FROM `demo.analytics.app_events`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
  AND customer_id = 'CUST_1001'
  AND event_type = 'purchase'
GROUP BY 1, 2;
```
This query benefits in two ways:
- The event_date filter reduces the number of partitions scanned
- The customer_id and event_type filters help BigQuery prune clustered storage blocks inside those partitions
Without partitioning and clustering, BigQuery may need to read much more data even though the result is small.
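A simple way to verify this is to look at bytes processed per query. One option is the INFORMATION_SCHEMA jobs view; the sketch below assumes the `region-us` qualifier, so adjust it to your project's region.

```sql
-- Inspect the most expensive recent queries in this project.
SELECT
  user_email,
  query,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 20;
```

Running the same query before and after adding partitioning is a quick way to see the improvement in scanned bytes.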
Require partition filters
One thing I prefer in production is to require partition filters. This prevents people from accidentally scanning the full table.
```sql
ALTER TABLE `demo.analytics.app_events`
SET OPTIONS (
  require_partition_filter = TRUE
);
```
This is a very useful setting for large fact tables. Otherwise, one missing WHERE clause can become an expensive mistake.
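If we know up front that a table will be large, we can also set this option at creation time instead of altering the table later. A minimal sketch, with a hypothetical table name and a reduced column list:

```sql
-- Enforce the partition filter from day one.
CREATE TABLE `demo.analytics.app_events_v2` (
  event_id STRING,
  event_date DATE
)
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE);
```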
When partitioning helps the most
Partitioning helps a lot when queries naturally work on a time window, for example:
- yesterday’s orders
- last 30 days of clickstream
- monthly billing data
- daily incremental loads
If users rarely filter by date, partitioning might not help much. In that case, clustering or even a different table design may matter more.
Also, avoid over-partitioning without a reason. In many cases, daily partitioning is enough. Some teams try to partition too aggressively and then the design becomes harder to maintain.
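If daily partitions would be too small, one option is to go coarser instead of finer. The sketch below uses DATE_TRUNC to create monthly partitions; the table name and columns are hypothetical.

```sql
-- Monthly partitions: useful when daily partitions would be tiny.
CREATE TABLE `demo.analytics.billing_monthly` (
  invoice_id STRING,
  billing_date DATE,
  amount NUMERIC
)
PARTITION BY DATE_TRUNC(billing_date, MONTH);
```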
Choosing clustering columns
The clustering columns should come from real query patterns, not guesses. Good candidates are columns that:
- appear often in WHERE clauses
- have enough repetition to help pruning
- are commonly used together with the partition filter
For example, these can be reasonable choices depending on the workload:
- customer_id
- region
- event_type
- order_status
I would not just pick four columns because BigQuery allows four. Too many low-value clustering columns can make the design less useful. Start with one or two important ones and monitor query behavior.
Things to be careful about
There are a few practical limitations and caveats.
1. Partitioning does not fix bad queries
If the query does not filter on the partition column, the benefits can disappear. I have seen tables correctly partitioned, but users still scan everything because the query was written without the date filter.
2. Clustering benefits depend on table size
Clustering is more noticeable on larger tables. On small tables, the difference may be minimal.
3. Pick the right partition column
If we partition by ingestion date but all reporting is based on business event date, users may get confused and performance may not match expectations. Usually the partitioning column should match how the data is queried.
4. Streaming and late arriving data
If your data arrives late, then partitioning by event date can still work, but your ingestion logic needs to handle backfills properly. In production, that means thinking about late arriving events, reprocessing, and partition overwrite strategy.
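One way to handle late arrivals is an idempotent merge that reprocesses a small trailing window. The sketch below reuses the tables from earlier; the three-day lookback window is an assumption and should match how late your data can realistically arrive.

```sql
-- Idempotent backfill: re-read the last few days from staging and
-- insert only events we have not seen yet. Matching on event_date in
-- the ON clause lets BigQuery prune target partitions.
MERGE `demo.analytics.app_events` AS t
USING (
  SELECT
    event_id, customer_id, event_type, country,
    event_timestamp, DATE(event_timestamp) AS event_date, revenue
  FROM `demo.staging.raw_app_events`
  WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
) AS s
ON t.event_id = s.event_id AND t.event_date = s.event_date
WHEN NOT MATCHED THEN
  INSERT ROW;
```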
5. Too many tiny partitions are not useful
If each partition has very little data, then the complexity may not bring much value. Partition only when there is enough data and a clear access pattern.
What I would change in production
For a simple demo, creating the table manually with SQL is enough. In production, I would usually do a few more things:
- Manage table creation with Terraform or another IaC tool
- Enforce partition filters on large tables
- Add data quality checks for partition columns
- Monitor query cost and bytes scanned
- Review clustering columns every so often based on actual usage
A production design also needs us to think about retention. Some tables only need 90 days of data, while others need years. BigQuery table and partition expiration settings can help here.
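For example, a 90-day retention policy can be set directly on the table; the value here is illustrative.

```sql
-- Automatically drop partitions older than 90 days.
ALTER TABLE `demo.analytics.app_events`
SET OPTIONS (partition_expiration_days = 90);
```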
Simple rule of thumb
A simple way to think about it is this:
- Use partitioning to narrow by time
- Use clustering to narrow further by business dimensions
If your analysts mostly query recent data for a subset of customers, both features together are usually a good fit.
Conclusion
BigQuery partitioning and clustering are not difficult features, but they become very important once your tables start growing. Partitioning reduces the amount of data scanned, and clustering helps BigQuery organize the data for common filters. If you pick the columns based on real query patterns rather than theory, you can save both time and cost. For most event-style datasets, partitioning by date and adding one or two useful clustering columns is a very good place to start.
