BigQuery Partitioning and Clustering: A Practical Guide for Data Engineers
If you have worked with BigQuery for any reasonable amount of time, you have probably stared at a query that scanned a few hundred gigabytes when it really only needed a few megabytes. That is usually the moment you start looking into partitioning and clustering.
In this article, let us walk through what partitioning and clustering actually do in BigQuery, when to use each one (or both together), and the things that tripped me up when I first started using them. We will go through actual SQL examples, not just definitions, so you can take these patterns and use them directly.
Why This Matters
BigQuery charges you based on the amount of data scanned by your queries, not the amount of data stored. A SELECT * FROM large_table WHERE date = '2025-01-01' on an unpartitioned table will scan the entire table. On a table with a few terabytes of data and a few years of history, that query alone can cost you a few dollars. Multiply that by a few hundred queries a day from dashboards, scheduled jobs, and analysts poking around, and it adds up fast.
Partitioning and clustering are the two main tools BigQuery gives you to avoid scanning data you do not need.
Partitioning: The Big Hammer
Partitioning splits your table into segments based on a column. When your query filters on that column, BigQuery only scans the matching partitions and skips the rest. It is called partition pruning, and it is the single most effective thing you can do to reduce costs.
Types of Partitioning
BigQuery gives you three options:
| Partition Type | What It Does | Good Fit For |
|---|---|---|
| Time-unit column | Partitions by DATE, TIMESTAMP, or DATETIME column (daily, hourly, monthly, yearly) | Tables with a natural event time like created_at, order_date, event_timestamp |
| Ingestion time | Partitions by the time data was loaded into BigQuery, using a pseudo-column _PARTITIONTIME | Streaming tables or when you do not want to rely on an actual date column in your data |
| Integer range | Partitions by ranges of an integer column | Tables with sequential IDs, user IDs bucketed into ranges |
For most use cases in data engineering, you will reach for time-unit column partitioning first. It is the most intuitive and lines up with how most people query data — by date range.
Creating a Partitioned Table
Let us create a table partitioned by a transaction_date column:
1
2
3
4
CREATE TABLE sales.partitioned_transactions
PARTITION BY DATE(transaction_date)
AS
SELECT * FROM sales.raw_transactions;
That is it. Now when you run:
1
2
SELECT COUNT(*) FROM sales.partitioned_transactions
WHERE transaction_date = '2025-05-01';
BigQuery will only scan the partition for May 1st 2025 instead of the whole table. On a large table, the difference in bytes scanned can easily go from terabytes down to megabytes.
Partition by Ingestion Time
If you do not have a clean date column or you just want to partition by when the data was loaded, use ingestion-time partitioning:
1
2
3
4
CREATE TABLE sales.ingestion_partitioned_transactions
PARTITION BY DATE(_PARTITIONTIME)
AS
SELECT * FROM sales.raw_transactions;
Now your queries need to filter on _PARTITIONTIME instead:
1
2
SELECT COUNT(*) FROM sales.ingestion_partitioned_transactions
WHERE _PARTITIONTIME = '2025-05-01';
One thing to watch out for: BigQuery uses UTC for ingestion-time partitioning. If your data lands from systems in a different timezone, make sure you are aware of the boundary cutoffs. I have seen queries miss data because someone ran a query at 11 PM local time expecting the partition to match, but the UTC date had already rolled over.
Partition Pruning and What Breaks It
Partition pruning only works when you filter on the partition column directly. If you wrap it in a function, BigQuery gets confused and scans everything:
1
2
3
4
5
-- This WILL prune
WHERE transaction_date = '2025-05-01'
-- This will NOT prune (scans all partitions)
WHERE FORMAT_DATE('%Y-%m', transaction_date) = '2025-05'
This is probably the most common mistake I see. If you need to filter by month, add a separate condition on the date range instead:
1
2
WHERE transaction_date >= '2025-05-01'
AND transaction_date < '2025-06-01'
Clustering: The Fine-Tuning
Partitioning is good for broad filtering, but what if you have a few terabytes within a single partition? That is where clustering comes in.
Clustering sorts your data within each partition based on one or more columns. When you filter or group by those columns, BigQuery can skip blocks of data that do not match. Unlike partitioning, clustering does not put data in separate physical segments — it just keeps related rows physically close together so the query engine can skip more efficiently.
Creating a Clustered Table
You can combine clustering with partitioning:
1
2
3
4
5
CREATE TABLE sales.partitioned_and_clustered
PARTITION BY DATE(transaction_date)
CLUSTER BY customer_id, product_category
AS
SELECT * FROM sales.raw_transactions;
Now when you run:
1
2
3
SELECT SUM(amount) FROM sales.partitioned_and_clustered
WHERE transaction_date = '2025-05-01'
AND customer_id = 12345;
BigQuery first prunes to the May 1st partition, then uses clustering to skip blocks of data that do not contain customer_id 12345. The combination is where the real savings happen.
Picking Clustering Columns
The order of columns in the CLUSTER BY clause matters. BigQuery sorts by the first column, then the second, and so on. If most of your queries filter on customer_id and occasionally on product_category, put customer_id first. Clustering on a low-cardinality column first (like country with 5 values) is not very useful because almost every block will contain all values, and BigQuery cannot skip much.
A good rule of thumb: cluster on columns that appear frequently in WHERE clauses and have high enough cardinality to make the clustering worthwhile. Columns with under a few hundred distinct values are usually not great clustering candidates.
Clustering Limitations You Should Know
A few things that are not obvious from the documentation:
- Clustering works best on larger tables. On tables under 1 GB, BigQuery may decide the overhead is not worth it and skip clustering entirely. You will not get an error — it just will not help.
- New data does not get perfectly sorted on arrival. BigQuery does automatic re-clustering in the background, but recently inserted or streamed data might not be optimally sorted for a while. If you run a query right after a large load, you might not see the clustering benefit immediately.
- You cannot change the clustering columns after table creation. You would need to recreate the table with a
CREATE TABLE ... AS SELECTor useALTER TABLE ... SET OPTIONS— but that only works on existing data, not new data going forward. In practice, just recreate the table. - Clustering is free (no extra storage cost), unlike partitioning where each partition has a minimum billing duration. This makes clustering an easy win if your query patterns fit.
Partitioning vs Clustering: When to Use What
Here is a quick decision guide based on what I have seen work in practice:
| Scenario | Recommendation |
|---|---|
| Table has a clear date column and most queries filter by date | Partition by that date column |
| Table is small (under a few GB) | Probably do not need either, but clustering does not hurt |
| Queries always filter by customer_id or user_id but no date | Cluster on the ID column alone |
| Mix of date filtering and high-cardinality column filtering | Partition by date and cluster by the other column(s) |
| Streaming data with no reliable event timestamp | Use ingestion-time partitioning |
| Need to limit partitions (cost or management overhead) | Use partition expiration (partition_expiration_days) |
Setting Partition Expiration
Partitions are not free — each partition counts as a minimum of 10 MB for billing purposes, even if it is empty. If you have daily partitions going back five years, that is 1,825 partitions. At 10 MB minimum billing each, you are billed for at least 18 GB of storage even if the partitions are mostly empty. This is usually negligible, but worth knowing for very long retention tables with tiny daily data.
To keep things tidy, set a partition expiration:
1
2
3
4
CREATE TABLE sales.transactions
PARTITION BY DATE(transaction_date)
OPTIONS (partition_expiration_days = 365)
AS SELECT * FROM sales.raw_transactions;
This automatically drops partitions older than a year. No more manual cleanup scripts.
Production Considerations
In a production pipeline, here is what I would add beyond the simple demo setup above:
- Partition filter requirement. Set
require_partition_filter = trueon your tables so that anyone running a query without a partition filter gets an error instead of a surprise bill. This one setting has probably saved my teams more money than any other.
1
2
3
4
CREATE TABLE sales.transactions
PARTITION BY DATE(transaction_date)
OPTIONS (require_partition_filter = true)
AS SELECT * FROM sales.raw_transactions;
Use table clones or snapshots for testing. Before rolling out a new partitioning scheme, clone your production table and test your queries against it. BigQuery table clones are cheap and let you verify that your partition filters actually prune as expected.
Monitor with INFORMATION_SCHEMA. BigQuery has a
PARTITIONSview that tells you the size and row count of each partition. Check it periodically to make sure your data is not skewed — if one partition is 100x the size of the others, your queries targeting that partition will still be slow.
1
2
3
4
5
SELECT partition_id, total_rows, total_logical_bytes
FROM sales.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'transactions'
ORDER BY partition_id DESC
LIMIT 10;
- Apply the same partitioning logic in dbt or Terraform. If you are managing tables through dbt, add the partition and cluster config in your model. If you are using Terraform, define it in the
google_bigquery_tableresource. Avoid creating tables manually in the console — someone will forget to apply the settings, and you will not notice until the bill comes in.
Summary
Partitioning and clustering are not complicated features, but getting them right makes a huge difference to both query speed and your monthly GCP bill. Start with partitioning on a date column, add clustering on the columns your queries actually filter by, and turn on require_partition_filter before anyone on your team runs their first ad-hoc query. The INFORMATION_SCHEMA views will tell you if your setup is actually working.
If you are coming from a traditional data warehouse background, BigQuery’s approach to partitioning might feel a bit different — there are no indexes to manage and no manual partition maintenance. That is both the appeal and the trap. It is easy to set up, but if you ignore it entirely, the costs will let you know eventually.
