Cost optimization basics for BigQuery workloads

In this article, let us go through some practical basics for reducing BigQuery costs before they turn into a painful surprise on your monthly bill. If you are using BigQuery for analytics, reporting, or data engineering workloads, it is very easy to start with a simple setup and slowly end up with expensive queries, because the tables keep growing while the query patterns stay the same.

For our use case, we will focus on the common things a beginner or mid-sized team can actually control, like partitioning, clustering, avoiding full table scans, and keeping an eye on query habits. This is not an advanced cost governance setup, but it is enough to save a lot of money if you apply it early.

Why BigQuery cost goes up quickly

BigQuery is convenient because you do not need to manage servers, indexes, or infrastructure in the traditional sense. But that convenience also makes it easy for teams to query large datasets casually. A dashboard refresh, a badly written join, or a simple SELECT * on a big table can scan a lot more data than expected.

In practice, I have noticed that BigQuery costs usually grow for one of the following reasons:

  1. Large tables are queried without partition filters
  2. Teams use SELECT * even when only a few columns are needed
  3. Staging or intermediate data is kept forever
  4. The same expensive transformation is run repeatedly
  5. Nobody checks query volume until the monthly bill arrives
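To build intuition for why these habits matter, it helps to translate bytes scanned into money. Here is a rough sketch in Python; the $6.25 per TiB figure is an assumption based on typical on-demand pricing, which varies by region and edition, so check the current rate for your project:

```python
# Rough cost estimate for on-demand BigQuery queries.
# The per-TiB rate is an assumption; verify current pricing for your region.
PRICE_PER_TIB_USD = 6.25
TIB = 1024 ** 4  # bytes in one tebibyte

def estimate_query_cost(bytes_processed: int) -> float:
    """Return the approximate on-demand cost in USD for one query."""
    return bytes_processed / TIB * PRICE_PER_TIB_USD

# A dashboard that scans a 2 TiB table on every refresh, 50 times a day:
per_query = estimate_query_cost(2 * TIB)
print(f"per query: ${per_query:.2f}")            # per query: $12.50
print(f"per month: ${per_query * 50 * 30:.2f}")  # per month: $18750.00
```

The exact numbers are hypothetical, but the multiplication is the point: a query that looks cheap in isolation becomes expensive once it runs on a schedule.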

Let us look at the basic controls one by one.

1. Partition large tables early

If a table is going to keep growing every day, partition it early instead of waiting for it to become a problem. Partitioning allows BigQuery to scan only the partitions needed for a query rather than reading the full table.

For example, if we have an events table, we can partition by event date.

CREATE TABLE analytics.events
(
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_timestamp TIMESTAMP,
  event_date DATE
)
PARTITION BY event_date;

Once the table is partitioned, make sure the queries also use the partition column.

SELECT user_id, event_type
FROM analytics.events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-07';

If your team creates partitioned tables but keeps filtering on DATE(event_timestamp) instead of the partition column, BigQuery cannot prune the partitions and the queries still scan the full table. So the table design and the query patterns need to match.

2. Use clustering when filters are predictable

Partitioning helps reduce the amount of data scanned across time. Clustering helps within the partition when queries commonly filter by the same columns.

For example, if most queries filter by customer_id or event_type, clustering can help organize the storage better.

CREATE TABLE analytics.events
(
  event_id STRING,
  user_id STRING,
  customer_id STRING,
  event_type STRING,
  event_timestamp TIMESTAMP,
  event_date DATE
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type;

I would not cluster every table blindly. It helps most when the access pattern is consistent. If nobody filters by those fields, clustering may not give much benefit.

3. Stop using SELECT * on large tables

This is one of the simplest fixes and also one of the most ignored ones. If a table has 80 columns and the report only needs 6, do not read all 80.

Bad query:

SELECT *
FROM analytics.orders
WHERE order_date = '2025-01-21';

Better query:

SELECT order_id, customer_id, order_status, total_amount, created_at
FROM analytics.orders
WHERE order_date = '2025-01-21';

This looks like a small change, but on wide tables it makes a real difference. It also makes queries easier to review and maintain.
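If you want something more mechanical than review discipline, a tiny lint check can flag the pattern before a query ships. This is a naive sketch in Python, not a real SQL parser, so it only catches the simple shapes (it will miss things like SELECT t.* and correctly ignores COUNT(*)):

```python
import re

def flags_select_star(sql: str) -> bool:
    """Naive check: does the query start a SELECT with a bare *?

    A sketch, not a SQL parser; it only matches `SELECT *` and
    `SELECT DISTINCT *` immediately after the SELECT keyword.
    """
    return re.search(r"\bSELECT\s+(DISTINCT\s+)?\*", sql, re.IGNORECASE) is not None

print(flags_select_star("SELECT * FROM analytics.orders"))         # True
print(flags_select_star("SELECT order_id FROM analytics.orders"))  # False
```

A check like this could run in CI or as a pre-submit hook on your SQL files; even a crude version catches the most common cases.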

4. Materialize expensive repeated transformations

Sometimes the cost issue is not a single bad query, but the same expensive transformation being executed again and again by dashboards, ad hoc analysis, and downstream jobs.

If you have a transformation that is reused a lot, consider materializing it into a table or scheduled aggregate. For example:

CREATE OR REPLACE TABLE analytics.daily_sales_summary AS
SELECT
  order_date,
  customer_region,
  COUNT(*) AS total_orders,
  SUM(total_amount) AS total_sales
FROM analytics.orders
GROUP BY order_date, customer_region;

Then the dashboards can read from daily_sales_summary instead of scanning the raw orders table each time.

For a simple demo or smaller project, a scheduled query may be enough. In a production use case, I would think more carefully about refresh frequency, late arriving data, and whether incremental logic is needed.
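The incremental idea can be sketched outside SQL as well: instead of rebuilding the whole summary, recompute only the dates whose source rows changed. A simplified Python sketch follows; in a real pipeline this would be a MERGE statement or an incremental model in a tool like dbt, and the data shapes here are hypothetical:

```python
from collections import defaultdict

def refresh_summary(summary, orders, changed_dates):
    """Rebuild daily totals only for dates with new or late-arriving rows."""
    for d in changed_dates:
        summary.pop(d, None)  # drop the stale aggregate for that date
    totals = defaultdict(float)
    for order in orders:
        if order["order_date"] in changed_dates:
            totals[order["order_date"]] += order["total_amount"]
    summary.update(totals)
    return summary

summary = {"2025-01-01": 100.0}
orders = [
    {"order_date": "2025-01-01", "total_amount": 40.0},  # late-arriving rows
    {"order_date": "2025-01-01", "total_amount": 70.0},
    {"order_date": "2025-01-02", "total_amount": 55.0},
]
print(refresh_summary(summary, orders, {"2025-01-01", "2025-01-02"}))
# {'2025-01-01': 110.0, '2025-01-02': 55.0}
```

The key design choice is the same in SQL: scan only the changed dates from the raw table rather than re-aggregating everything on every run.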

5. Keep staging tables under control

Many BigQuery environments become expensive not because of user-facing analytics, but because old staging and temporary tables are never cleaned up.

If your pipeline creates intermediate tables, decide upfront:

  • Should this table live only for a few days?
  • Can it be replaced by a view?
  • Should it have a table expiration policy?

For example:

CREATE TABLE analytics.temp_session_data
(
  session_id STRING,
  user_id STRING,
  created_at TIMESTAMP
)
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY));

This is a very simple control, but it prevents a lot of forgotten data from sitting around forever.

6. Compare before and after using dry runs and job history

Before changing a major query, it is useful to compare how much data it scans. BigQuery dry runs are useful here because you can estimate bytes processed without actually running the query.

If you are using the CLI, you can do something like this:

bq query --use_legacy_sql=false --dry_run '
SELECT customer_id, SUM(total_amount)
FROM analytics.orders
WHERE order_date BETWEEN "2025-01-01" AND "2025-01-07"
GROUP BY customer_id'

You can also review query history from the BigQuery console and identify the users, jobs, or scheduled queries consuming the most bytes. That usually gives better value than trying to optimize everything blindly.

Quick comparison

| Technique | Helps with | Best use case | Caveat |
| --- | --- | --- | --- |
| Partitioning | Reducing scanned data by date or range | Large growing fact tables | Queries must use the partition filter |
| Clustering | Better pruning inside partitions | Repeated filters on known columns | Not useful if query patterns are random |
| Selecting fewer columns | Lower scan cost on wide tables | Reports and ad hoc analysis | Needs better query discipline |
| Materialized summary tables | Repeated expensive aggregations | Dashboards and common business metrics | Need refresh strategy |
| Expiration for temp tables | Lower storage waste | Staging and intermediate data | Must not expire data still needed |

7. Add guardrails for the team

Cost optimization should not depend only on one engineer remembering best practices. It helps to build a few simple guardrails into the team process.

A few practical examples are:

  • Add SQL review comments for missing partition filters
  • Create reusable views or summary tables for common reporting needs
  • Monitor high-cost queries weekly
  • Separate raw, staging, and curated datasets clearly
  • Document which tables are safe for ad hoc use and which are not
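The first guardrail in that list can even be partially automated. A naive sketch, assuming your team's convention is that queries touching analytics.events must filter on the event_date partition column; this is string matching only, not a parser, so treat it as a starting point:

```python
def missing_partition_filter(sql: str,
                             table: str = "analytics.events",
                             partition_col: str = "event_date") -> bool:
    """Flag queries that read the partitioned table without its filter column.

    Naive sketch: checks whether the partition column appears anywhere in a
    query that touches the table. A real check would parse the SQL.
    """
    sql_lower = sql.lower()
    return table in sql_lower and partition_col not in sql_lower

good = "SELECT user_id FROM analytics.events WHERE event_date = '2025-01-21'"
bad = "SELECT user_id FROM analytics.events WHERE user_id = 'u1'"
print(missing_partition_filter(good))  # False: filter present
print(missing_partition_filter(bad))   # True: full-scan risk
```

Even a crude check like this, wired into SQL review or CI, shifts the conversation from "who forgot the filter last month" to "the filter is missing right now."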

If the team is growing, I would also consider setting custom budgets and alerts in GCP. That does not reduce cost by itself, but it makes sure the surprise arrives early enough to act on.

Things to be careful about

There are a few places where teams over-optimize or optimize the wrong thing.

First, not every table needs clustering and partitioning. If the dataset is small, the extra design effort may not be worth it.

Second, summary tables can reduce query cost but increase pipeline complexity. If your source data changes late, the aggregates may become stale unless you rebuild them carefully.

Third, cost should not be reduced at the expense of making the data unusable. For example, deleting staging data too aggressively can make debugging much harder.

So for me the right approach is usually simple: optimize the largest repeated scans first, put basic controls in place, and then improve further only when the usage pattern justifies it.

Conclusion

BigQuery cost optimization does not need to start with a large governance project. In many cases, the basics give the biggest result: partition large tables, cluster only where it helps, avoid SELECT *, materialize repeated heavy transformations, and clean up temporary data. If you do these things early, BigQuery stays easy to use without becoming unnecessarily expensive.

This post is licensed under CC BY 4.0 by the author.