Cost optimization basics for BigQuery workloads
In this article, let us go through some practical basics for reducing BigQuery cost before it turns into a painful surprise in your monthly bill. If you are using BigQuery for analytics, reporting, or data engineering workloads, it is very easy to start with a simple setup and then slowly end up with expensive queries, because the tables keep growing while the query patterns stay the same.
For our use case, we will focus on the common things a beginner or mid-sized team can actually control, like partitioning, clustering, avoiding full table scans, and keeping an eye on query habits. This is not an advanced cost governance setup, but it is enough to save a lot of money if you apply it early.
Why BigQuery cost goes up quickly
BigQuery is convenient because you do not need to manage servers, indexes, or infrastructure in the traditional sense. But that convenience also makes it easy for teams to query large datasets casually. A dashboard refresh, a badly written join, or a simple SELECT * on a big table can scan a lot more data than expected.
In practice, I have noticed that BigQuery costs usually grow for one of the following reasons:
- Large tables are queried without partition filters
- Teams use SELECT * even when only a few columns are needed
- Staging or intermediate data is kept forever
- The same expensive transformation is run repeatedly
- Nobody checks query volume until the monthly bill arrives
Let us look at the basic controls one by one.
1. Partition large tables early
If a table is going to keep growing every day, partition it early instead of waiting for it to become a problem. Partitioning allows BigQuery to scan only the partitions needed for a query rather than reading the full table.
For example, if we have an events table, we can partition by event date.
CREATE TABLE analytics.events
(
event_id STRING,
user_id STRING,
event_type STRING,
event_timestamp TIMESTAMP,
event_date DATE
)
PARTITION BY event_date;
Once the table is partitioned, make sure the queries also use the partition column.
SELECT user_id, event_type
FROM analytics.events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-07';
If your team creates partitioned tables but keeps filtering on DATE(event_timestamp) instead of the partition column, BigQuery cannot prune partitions and the query still scans the full table. So the table design and the query pattern need to match.
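As a quick illustration with the events table above, the first query below filters on an expression over event_timestamp and cannot use partition pruning, while the second filters directly on the partition column (the dates are placeholder values):
-- No pruning: the filter is on DATE(event_timestamp), not on the partition column
SELECT user_id
FROM analytics.events
WHERE DATE(event_timestamp) = '2025-01-05';
-- Pruning works: the filter is on event_date, the partition column
SELECT user_id
FROM analytics.events
WHERE event_date = '2025-01-05';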
2. Use clustering when filters are predictable
Partitioning helps reduce the amount of data scanned across time. Clustering helps within the partition when queries commonly filter by the same columns.
For example, if most queries filter by customer_id or event_type, clustering sorts the rows within each partition by those columns, so BigQuery can skip blocks that do not match the filter.
CREATE TABLE analytics.events
(
event_id STRING,
user_id STRING,
customer_id STRING,
event_type STRING,
event_timestamp TIMESTAMP,
event_date DATE
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type;
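A query that actually benefits from this layout filters on the clustered columns within a partition range, something like the sketch below (the customer_id and event_type values are made up):
-- Clustering helps here because the filters match the clustered columns
SELECT event_id, event_timestamp
FROM analytics.events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-07'
  AND customer_id = 'C123'
  AND event_type = 'purchase';
One thing to keep in mind is that the bytes estimate shown before running is an upper bound for clustered tables; the clustering benefit shows up in the bytes actually billed after the query runs.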
I would not cluster every table blindly. It helps most when the access pattern is consistent. If nobody filters by those fields, clustering may not give much benefit.
3. Stop using SELECT * on large tables
This is one of the simplest fixes and also one of the most ignored ones. If a table has 80 columns and the report only needs 6, do not read all 80.
Bad query:
SELECT *
FROM analytics.orders
WHERE order_date = '2025-01-21';
Better query:
SELECT order_id, customer_id, order_status, total_amount, created_at
FROM analytics.orders
WHERE order_date = '2025-01-21';
This looks like a small change, but on wide tables it makes a real difference. It also makes queries easier to review and maintain.
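If the report genuinely needs most of a wide table, BigQuery's SELECT * EXCEPT syntax is a reasonable middle ground: you still pay for every column you keep, but you can drop a few heavy ones without listing everything. The excluded column names below are hypothetical:
SELECT * EXCEPT (raw_payload, debug_info)  -- hypothetical wide columns the report does not need
FROM analytics.orders
WHERE order_date = '2025-01-21';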
4. Materialize expensive repeated transformations
Sometimes the cost issue is not a single bad query, but the same expensive transformation being executed again and again by dashboards, ad hoc analysis, and downstream jobs.
If you have a transformation that is reused a lot, consider materializing it into a table or scheduled aggregate. For example:
CREATE OR REPLACE TABLE analytics.daily_sales_summary AS
SELECT
order_date,
customer_region,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_sales
FROM analytics.orders
GROUP BY order_date, customer_region;
Then the dashboards can read from daily_sales_summary instead of scanning the raw orders table each time.
For a simple demo or smaller project, a scheduled query may be enough. In a production use case, I would think more carefully about refresh frequency, late arriving data, and whether incremental logic is needed.
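As one hedged sketch of what that incremental logic could look like, the summary can be rebuilt only for the most recent days in a scheduled script; the 3-day window is an assumption and should match how late your orders data can arrive:
-- Rebuild only the last few days of the summary instead of the whole table
DECLARE refresh_from DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);
-- Remove the days that are about to be recomputed
DELETE FROM analytics.daily_sales_summary
WHERE order_date >= refresh_from;
-- Recompute those days from the raw orders table
INSERT INTO analytics.daily_sales_summary
SELECT
  order_date,
  customer_region,
  COUNT(*) AS total_orders,
  SUM(total_amount) AS total_sales
FROM analytics.orders
WHERE order_date >= refresh_from
GROUP BY order_date, customer_region;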
5. Keep staging tables under control
Many BigQuery environments become expensive not because of user-facing analytics, but because old staging and temporary tables are never cleaned up.
If your pipeline creates intermediate tables, decide upfront:
- Should this table live only for a few days?
- Can it be replaced by a view?
- Should it have a table expiration policy?
For example:
CREATE TABLE analytics.temp_session_data
(
session_id STRING,
user_id STRING,
created_at TIMESTAMP
)
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY));
This is a very simple control, but it prevents a lot of forgotten data from sitting around forever.
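If most of these intermediate tables live in one dedicated staging dataset, a default expiration at the dataset level achieves the same thing without remembering to set it on every table. The dataset name analytics_staging below is just a placeholder:
ALTER SCHEMA analytics_staging
SET OPTIONS (default_table_expiration_days = 3);
New tables created in that dataset then inherit the expiration unless they override it.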
6. Compare before and after using dry runs and job history
Before changing a major query, it is useful to compare how much data it scans. BigQuery dry runs are useful here because you can estimate bytes processed without actually running the query.
If you are using the CLI, you can do something like this:
bq query --use_legacy_sql=false --dry_run '
SELECT customer_id, SUM(total_amount)
FROM analytics.orders
WHERE order_date BETWEEN "2025-01-01" AND "2025-01-07"
GROUP BY customer_id'
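If you want a hard limit rather than just an estimate, bq query also accepts a --maximum_bytes_billed flag, and the job fails instead of running when it would exceed the limit. The 1 GB value below is an arbitrary example:
bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000 '
SELECT customer_id, SUM(total_amount)
FROM analytics.orders
WHERE order_date BETWEEN "2025-01-01" AND "2025-01-07"
GROUP BY customer_id'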
You can also review query history from the BigQuery console and identify the users, jobs, or scheduled queries consuming the most bytes. That usually gives better value than trying to optimize everything blindly.
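One way to do that review with SQL instead of clicking through the console is to query the jobs metadata views. This sketch assumes your jobs run in the US multi-region, so adjust the region qualifier if they do not, and note that depending on your permissions you may only see your own jobs:
-- Top query jobs by bytes processed over the last 7 days
SELECT
  user_email,
  job_id,
  total_bytes_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 20;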
Quick comparison
| Technique | Helps with | Best use case | Caveat |
|---|---|---|---|
| Partitioning | Reducing scanned data by date or range | Large growing fact tables | Queries must use the partition filter |
| Clustering | Better pruning inside partitions | Repeated filters on known columns | Not useful if query patterns are random |
| Selecting fewer columns | Lower scan cost on wide tables | Reports and ad hoc analysis | Needs better query discipline |
| Materialized summary tables | Repeated expensive aggregations | Dashboards and common business metrics | Need refresh strategy |
| Expiration for temp tables | Lower storage waste | Staging and intermediate data | Must not expire data still needed |
7. Add guardrails for the team
Cost optimization should not depend only on one engineer remembering best practices. It helps to build a few simple guardrails into the team process.
A few practical examples are:
- Add SQL review comments for missing partition filters (a table-level option for this is shown after the list)
- Create reusable views or summary tables for common reporting needs
- Monitor high-cost queries weekly
- Separate raw, staging, and curated datasets clearly
- Document which tables are safe for ad hoc use and which are not
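For the partition filter point in the list above, BigQuery also has a table-level guardrail: when require_partition_filter is set, queries that do not filter on the partition column are rejected instead of silently scanning everything. For example, on the events table from earlier:
ALTER TABLE analytics.events
SET OPTIONS (require_partition_filter = TRUE);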
If the team is growing, I would also consider setting custom budgets and alerts in GCP. That does not reduce cost by itself, but it makes sure the surprise comes earlier.
Things to be careful about
There are a few places where teams over-optimize or optimize the wrong thing.
First, not every table needs clustering and partitioning. If the dataset is small, the extra design effort may not be worth it.
Second, summary tables can reduce query cost but increase pipeline complexity. If your source data arrives or changes late, the aggregates may become stale unless you rebuild them carefully.
Third, cost should not be reduced at the expense of making the data unusable. For example, deleting staging data too aggressively can make debugging much harder.
So for me the right approach is usually simple: optimize the largest repeated scans first, put basic controls in place, and then improve further only when the usage pattern justifies it.
Conclusion
BigQuery cost optimization does not need to start with a large governance project. In many cases, the basics give the biggest result: partition large tables, cluster only where it helps, avoid SELECT *, materialize repeated heavy transformations, and clean up temporary data. If you do these things early, BigQuery stays easy to use without becoming unnecessarily expensive.
