
Creating external tables in BigQuery on GCS data

In this article, let us see how to create external tables in BigQuery on top of files stored in GCS, why you might choose this approach, and what limitations you should keep in mind before using it in a real project. External tables are useful when you want to query files quickly without first loading them into native BigQuery storage. For small explorations, proofs of concept, and some low-frequency reporting use cases, this can save time and reduce the amount of pipeline work you need to set up.

At the same time, external tables are not a replacement for all ingestion patterns. They are convenient, but there are trade-offs around performance, features, and long-term manageability. So in this article we will create one, run a few queries, and discuss where it fits well.

What is an external table in BigQuery

An external table in BigQuery lets us query data that stays in an external storage system instead of being copied into a native BigQuery table. In our case, the external storage is GCS. BigQuery reads the files directly when we run a query.

This means:

  • We do not need a separate load job to bring the data into BigQuery first
  • The source of truth remains in GCS
  • We can get started quickly if the files are already available

For our use case, let us assume that CSV files are landing in a GCS bucket from another system and we want analysts to query them quickly.

When should you use this

From my experience, external tables are a good choice in the following situations:

  • You are validating a new dataset and do not want to build a full pipeline yet
  • The data changes infrequently and query performance is not extremely critical
  • You want a quick way to expose files in GCS to SQL users
  • You are dealing with a staging or raw layer before proper modeling

But if you need best query performance, partitioning strategies fully under your control, or frequent transformations, loading the data into a native BigQuery table is usually better.

External table vs native table

  Aspect             | External table on GCS                  | Native BigQuery table
  Data location      | Remains in GCS                         | Stored in BigQuery
  Setup speed        | Very fast                              | Requires a load or ingestion job
  Query performance  | Usually lower                          | Usually better
  Features           | More limited                           | Full BigQuery feature set
  Best for           | Exploration, staging, light reporting  | Production analytics, high-performance workloads

This is not to say external tables are bad. They are just better for a different job.

Sample source data

Let us say we have files stored like this:

gs://demo-raw-sales/orders_2025_01_01.csv
gs://demo-raw-sales/orders_2025_01_02.csv
gs://demo-raw-sales/orders_2025_01_03.csv

And the CSV looks like this:

order_id,customer_id,order_amount,order_date,status
1001,C101,125.50,2025-01-01,SHIPPED
1002,C102,89.00,2025-01-01,PENDING
1003,C103,240.75,2025-01-02,SHIPPED

Before creating the table, make sure the files are consistent. If one file has a different column order or a different delimiter, then you will run into problems while querying.
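As a rough sanity check before pointing BigQuery at a bucket, you could compare the header row of every file locally. Here is a minimal sketch of that idea; the temp-file demo data mimics the sample files above, and the function and file names are my own, not part of any BigQuery tooling:

```python
import csv
import tempfile
from pathlib import Path

def find_header_mismatches(csv_paths):
    """Compare each CSV's header row against the first file's header.

    Returns {path: header} for files whose header differs."""
    headers = {}
    for path in csv_paths:
        with open(path, newline="") as f:
            headers[path] = next(csv.reader(f))
    if not headers:
        return {}
    expected = headers[next(iter(headers))]
    return {p: h for p, h in headers.items() if h != expected}

# Demo: two consistent files and one whose first two columns are swapped.
tmp = Path(tempfile.mkdtemp())
(tmp / "orders_2025_01_01.csv").write_text(
    "order_id,customer_id,order_amount,order_date,status\n"
    "1001,C101,125.50,2025-01-01,SHIPPED\n")
(tmp / "orders_2025_01_02.csv").write_text(
    "order_id,customer_id,order_amount,order_date,status\n"
    "1002,C102,89.00,2025-01-01,PENDING\n")
(tmp / "orders_2025_01_03.csv").write_text(
    "customer_id,order_id,order_amount,order_date,status\n"
    "1003,C103,240.75,2025-01-02,SHIPPED\n")

mismatches = find_header_mismatches(sorted(tmp.glob("orders_*.csv")))
print(sorted(p.name for p in mismatches))  # the reordered file shows up here
```

In practice you would run something like this against local copies (or a sampled download) of the objects before creating the external table, so the wildcard does not silently mix incompatible files.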

Create an external table using SQL

The simplest way is to create it using a CREATE EXTERNAL TABLE statement.

CREATE OR REPLACE EXTERNAL TABLE `demo_project.raw_ext.orders_gcs_ext`
(
  order_id INT64,
  customer_id STRING,
  order_amount NUMERIC,
  order_date DATE,
  status STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://demo-raw-sales/orders_2025_01_*.csv'],
  skip_leading_rows = 1
);

A few things to note here:

  • We are defining the schema explicitly instead of relying on autodetect
  • We are using a wildcard in the GCS path
  • We are skipping the header row from each CSV file

I generally prefer explicit schema in production because autodetect can sometimes infer types in a way that creates confusion later.
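To make the autodetect concern concrete: if a column of zero-padded identifiers happens to look numeric, inferring it as an integer silently drops the leading zeros. This tiny local illustration only mimics that failure mode; it is not BigQuery's actual inference code:

```python
raw_ids = ["00042", "00117", "20500"]

# If a type-inference step decides this column is numeric, round-tripping
# the values through int loses the zero padding that made them meaningful.
as_integers = [str(int(v)) for v in raw_ids]

print(raw_ids)      # ['00042', '00117', '20500']
print(as_integers)  # ['42', '117', '20500']
```

Declaring the column as STRING in the explicit schema avoids this class of surprise entirely.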

Query the external table

Once the table is created, we can query it like a normal BigQuery table.

SELECT
  order_date,
  status,
  COUNT(*) AS order_count,
  SUM(order_amount) AS total_amount
FROM `demo_project.raw_ext.orders_gcs_ext`
GROUP BY order_date, status
ORDER BY order_date, status;

This is one reason the pattern is attractive. Users can keep writing SQL without caring too much that the data is still sitting in GCS.

If you want to expose only a cleaner subset, you could create a view on top of the external table.

CREATE OR REPLACE VIEW `demo_project.analytics.orders_clean` AS
SELECT
  order_id,
  customer_id,
  order_amount,
  order_date
FROM `demo_project.raw_ext.orders_gcs_ext`
WHERE status = 'SHIPPED';

That way analysts query the view, and your raw external table remains a lower-level object.

Create it from the BigQuery UI

You can also create the external table from the BigQuery UI. The steps are simple:

  1. Open BigQuery in the GCP Console
  2. Choose the dataset where you want the table
  3. Click Create Table
  4. For the source, choose Google Cloud Storage
  5. Enter the GCS URI or wildcard path
  6. Choose the file format like CSV, JSON, Parquet, or Avro
  7. Define the schema manually or use autodetect
  8. For the table type, choose External table

For quick testing this is fine. But if you need repeatable environments, I would still prefer creating the table through SQL or Terraform.

Practical limitations and caveats

This is the part that matters most. External tables are easy to create, but there are some things to be careful about.

1. Query performance can be slower

Since BigQuery is reading the files externally, performance is often not as good as querying native tables. On a small dataset this may not be noticeable, but on larger datasets it usually becomes important.

2. File consistency matters a lot

If your files have schema drift, bad rows, or inconsistent formatting, queries can fail or produce confusing results. For a demo this may be acceptable. For production, you should validate files before exposing them.
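A lightweight way to catch bad rows before exposing the files is to validate each data row against the declared schema: column count, numeric amount, ISO date. This is a sketch of that idea using only the standard library; the column layout matches the sample CSV above, but the function names and rules are my own:

```python
import csv
import io
from datetime import date
from decimal import Decimal, InvalidOperation

EXPECTED_COLUMNS = 5  # order_id, customer_id, order_amount, order_date, status

def validate_row(row):
    """Return a list of problems for one data row; an empty list means the row is fine."""
    problems = []
    if len(row) != EXPECTED_COLUMNS:
        problems.append(f"expected {EXPECTED_COLUMNS} columns, got {len(row)}")
        return problems
    order_id, _, amount, order_date, _ = row
    if not order_id.isdigit():
        problems.append(f"order_id is not an integer: {order_id!r}")
    try:
        Decimal(amount)
    except InvalidOperation:
        problems.append(f"order_amount is not numeric: {amount!r}")
    try:
        date.fromisoformat(order_date)
    except ValueError:
        problems.append(f"order_date is not YYYY-MM-DD: {order_date!r}")
    return problems

# Demo: one clean row and one with a bad amount and an impossible month.
sample = io.StringIO(
    "order_id,customer_id,order_amount,order_date,status\n"
    "1001,C101,125.50,2025-01-01,SHIPPED\n"
    "1002,C102,not-a-number,2025-13-01,PENDING\n"
)
reader = csv.reader(sample)
next(reader)  # skip header
report = {line: validate_row(row) for line, row in enumerate(reader, start=2)}
bad_rows = {line: p for line, p in report.items() if p}
print(bad_rows)
```

Running a check like this in the landing step means the external table only ever sees files that queries can actually read.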

3. Feature support is more limited

Depending on the file format and external table type, some optimizations or BigQuery features will not behave the same way as native tables. This is another reason why I see external tables more as a raw access layer rather than the final modeled layer.

4. Cost and scan patterns still need attention

People sometimes think external means cheaper by default. That is not always true. If users keep scanning many files repeatedly, loading the data once into a structured native table may become more efficient operationally.
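One way to reason about this is simple arithmetic: with on-demand pricing you pay per byte scanned, and every query against a CSV external table reads all matching files, while a partitioned native table can scan far less per query. The sketch below is a back-of-envelope comparison; the price, data volume, query count, and the 1/30th partition assumption are all illustrative placeholders, not real pricing:

```python
# Back-of-envelope: repeated full scans of external files vs a one-time load.
# All figures are illustrative placeholders, not actual BigQuery pricing.
PRICE_PER_TIB = 6.25           # assumed on-demand price per TiB scanned
TIB = 1024 ** 4

def scan_cost(bytes_scanned, queries):
    """Total cost of `queries` queries that each scan `bytes_scanned` bytes."""
    return bytes_scanned / TIB * PRICE_PER_TIB * queries

file_bytes = 200 * 1024 ** 3   # 200 GiB of CSV sitting in the bucket
monthly_queries = 50 * 30      # 50 queries a day for a month

# External table: every query reads all matching files.
external_monthly = scan_cost(file_bytes, monthly_queries)

# Native table partitioned by date: suppose each query touches ~1/30th.
native_monthly = scan_cost(file_bytes / 30, monthly_queries)

print(f"external ~${external_monthly:,.0f}/month vs partitioned native ~${native_monthly:,.0f}/month")
```

The exact numbers do not matter; the point is that the gap grows linearly with query frequency, which is why frequently queried data tends to end up loaded.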

5. Permissions need to be set correctly

Your BigQuery users typically also need read access to the underlying GCS objects (for example, the Storage Object Viewer role on the bucket) in addition to their BigQuery permissions. If that access is not configured, the table definition exists but queries against it fail.

A simple production-minded approach

For a simple demo, external tables directly on raw CSV files are enough. But in a production use case, I would usually improve the pattern a bit:

  • Prefer Parquet or Avro over CSV when possible
  • Keep files organized in clear folder structures
  • Validate schema before files land in the shared bucket
  • Put a view on top of the external table for consumers
  • Load frequently queried data into native BigQuery tables later

A lightweight pattern could look like this:

Source system -> GCS raw bucket -> External table -> Validation view -> Scheduled load to native table

This gives us a nice balance. We get quick access to the data as soon as files arrive, but we do not force all users to depend forever on external-file query behavior.

Example with Parquet

If your files are in Parquet, the setup is usually cleaner because the schema is embedded and the format is more analytics-friendly.

CREATE OR REPLACE EXTERNAL TABLE `demo_project.raw_ext.orders_parquet_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://demo-raw-sales-parquet/orders/*.parquet']
);

If I had a choice in a new project, I would usually pick Parquet over CSV for this kind of integration unless a source system forces CSV.

What I would avoid

I would avoid using external tables as the permanent answer for a heavily used analytics layer. If business dashboards, multiple joins, and many daily users depend on the dataset, native BigQuery tables are usually a better design. External tables are great for speed of setup, not always for long-term serving performance.

Conclusion

External tables in BigQuery on GCS data are a very practical option when you want to query files quickly without building a full ingestion pipeline first. They work well for raw access, exploration, and some lightweight reporting use cases. But they do come with trade-offs around performance, consistency, and feature flexibility. For our use case, this is a good tool to have when we want fast access to GCS data, as long as we are clear about where it fits and where a native BigQuery table would be the better next step.

This post is licensed under CC BY 4.0 by the author.