Creating external tables in BigQuery on GCS data
In this article, let us see how to create external tables in BigQuery on top of files stored in GCS, why you might choose this approach, and what limitations you should keep in mind before using it in a real project. This is useful when you want to query files quickly without first loading them into native BigQuery storage. For small explorations, proofs of concept, and some low-frequency reporting use cases, this can save time and reduce the amount of pipeline work you need to set up.
At the same time, external tables are not a replacement for all ingestion patterns. They are convenient, but there are trade-offs around performance, features, and long-term manageability. So in this article we will create one, run a few queries, and discuss where it fits well.
What is an external table in BigQuery
An external table in BigQuery lets us query data that stays in an external storage system instead of being copied into a native BigQuery table. In our case, the external storage is GCS. BigQuery reads the files directly when we run a query.
This means:
- We do not need a separate load job to bring the data into BigQuery first
- The source of truth remains in GCS
- We can get started quickly if the files are already available
For our use case, let us assume that CSV files are landing in a GCS bucket from another system and we want analysts to query them quickly.
When should you use this
From my experience, external tables are a good choice in the following situations:
- You are validating a new dataset and do not want to build a full pipeline yet
- The data changes infrequently and query performance is not extremely critical
- You want a quick way to expose files in GCS to SQL users
- You are dealing with a staging or raw layer before proper modeling
But if you need the best query performance, partitioning strategies fully under your control, or frequent transformations, loading the data into a native BigQuery table is usually better.
External table vs native table
| Aspect | External table on GCS | Native BigQuery table |
|---|---|---|
| Data location | Remains in GCS | Stored in BigQuery |
| Setup speed | Very fast | Requires load or ingestion job |
| Query performance | Usually lower | Usually better |
| Features | More limited | Full BigQuery features |
| Best for | Exploration, staging, light reporting | Production analytics, high-performance workloads |
This is not to say external tables are bad. They are just better for a different job.
Sample source data
Let us say we have files stored like this:
```
gs://demo-raw-sales/orders_2025_01_01.csv
gs://demo-raw-sales/orders_2025_01_02.csv
gs://demo-raw-sales/orders_2025_01_03.csv
```
And the CSV looks like this:
```
order_id,customer_id,order_amount,order_date,status
1001,C101,125.50,2025-01-01,SHIPPED
1002,C102,89.00,2025-01-01,PENDING
1003,C103,240.75,2025-01-02,SHIPPED
```
Before creating the table, make sure the files are consistent. If one file has a different column order or a different delimiter, then you will run into problems while querying.
Create an external table using SQL
The simplest way is to create it using a CREATE EXTERNAL TABLE statement.
```sql
CREATE OR REPLACE EXTERNAL TABLE `demo_project.raw_ext.orders_gcs_ext`
(
  order_id INT64,
  customer_id STRING,
  order_amount NUMERIC,
  order_date DATE,
  status STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://demo-raw-sales/orders_2025_01_*.csv'],
  skip_leading_rows = 1
);
```
A few things to note here:
- We are defining the schema explicitly instead of relying on autodetect
- We are using a wildcard in the GCS path
- We are skipping the header row from each CSV file
I generally prefer an explicit schema in production because autodetect can sometimes infer types in a way that creates confusion later.
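For comparison, here is a minimal sketch of the autodetect variant: for CSV, omitting the column list tells BigQuery to infer the schema by sampling the files. The table name here is illustrative.

```sql
-- Autodetect variant (illustrative table name): no column list, so
-- BigQuery samples the CSV files and infers column names and types.
CREATE OR REPLACE EXTERNAL TABLE `demo_project.raw_ext.orders_gcs_autodetect`
OPTIONS (
  format = 'CSV',
  uris = ['gs://demo-raw-sales/orders_2025_01_*.csv'],
  skip_leading_rows = 1
);
```

This is convenient for a first look, but the inferred types are only as good as the sampled rows, which is exactly the kind of confusion I try to avoid in production.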
Query the external table
Once the table is created, we can query it like a normal BigQuery table.
```sql
SELECT
  order_date,
  status,
  COUNT(*) AS order_count,
  SUM(order_amount) AS total_amount
FROM `demo_project.raw_ext.orders_gcs_ext`
GROUP BY order_date, status
ORDER BY order_date, status;
```
This is one reason the pattern is attractive. Users can keep writing SQL without caring too much that the data is still sitting in GCS.
If you want to expose only a cleaner subset, you could create a view on top of the external table.
```sql
CREATE OR REPLACE VIEW `demo_project.analytics.orders_clean` AS
SELECT
  order_id,
  customer_id,
  order_amount,
  order_date
FROM `demo_project.raw_ext.orders_gcs_ext`
WHERE status = 'SHIPPED';
```
That way analysts query the view, and your raw external table remains a lower-level object.
Create it from the BigQuery UI
You can also create the external table from the BigQuery UI. The steps are simple:
- Open BigQuery in the GCP Console
- Choose the dataset where you want the table
- Click Create Table
- For the source, choose Google Cloud Storage
- Enter the GCS URI or wildcard path
- Choose the file format like CSV, JSON, Parquet, or Avro
- Define the schema manually or use autodetect
- For the table type, choose External table
For quick testing this is fine. But if the setup needs to be repeatable across environments, I would still prefer creating it through SQL or Terraform.
Practical limitations and caveats
This is the part that matters most. External tables are easy to create, but there are some things to be careful about.
1. Query performance can be slower
Since BigQuery has to read the files from GCS at query time, performance is often not as good as querying native tables. On a small dataset this may not be noticeable, but on larger datasets it usually becomes important.
2. File consistency matters a lot
If your files have schema drift, bad rows, or inconsistent formatting, queries can fail or produce confusing results. For a demo this may be acceptable. For production, you should validate files before exposing them.
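One small thing that helps here: GCS-backed external tables expose a `_FILE_NAME` pseudo-column, so you can at least see which file a suspicious row came from. A minimal sketch:

```sql
-- Row counts per source file, using the _FILE_NAME pseudo-column
-- available on GCS-backed external tables. Useful for spotting a
-- file that is empty, duplicated, or unexpectedly large.
SELECT
  _FILE_NAME AS source_file,
  COUNT(*) AS row_count
FROM `demo_project.raw_ext.orders_gcs_ext`
GROUP BY source_file
ORDER BY source_file;
```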
3. Feature support is more limited
Depending on the file format and external table type, some optimizations or BigQuery features will not behave the same way as native tables. This is another reason why I see external tables more as a raw access layer rather than the final modeled layer.
4. Cost and scan patterns still need attention
People sometimes think external means cheaper by default. That is not always true. Queries against external tables are still billed on the data they read, so if users keep scanning many files repeatedly, loading the data once into a structured native table may become more efficient operationally.
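If you do reach that point, the one-time load itself is a single statement. Here is a minimal sketch using `LOAD DATA`, reusing the same files and schema; the target table name is illustrative.

```sql
-- One-time (or scheduled) load of the same CSV files into a native
-- table. The target dataset and table name are illustrative.
LOAD DATA OVERWRITE `demo_project.raw.orders_loaded`
(
  order_id INT64,
  customer_id STRING,
  order_amount NUMERIC,
  order_date DATE,
  status STRING
)
FROM FILES (
  format = 'CSV',
  uris = ['gs://demo-raw-sales/orders_2025_01_*.csv'],
  skip_leading_rows = 1
);
```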
5. Permissions need to be set correctly
Your BigQuery users also need to be able to read the underlying data in GCS, not just the BigQuery table itself. If access is not configured correctly, the table might exist but users still cannot query it successfully.
A simple production-minded approach
For a simple demo, external tables directly on raw CSV files are enough. But in a production use case, I would usually improve the pattern a bit:
- Prefer Parquet or Avro over CSV when possible
- Keep files organized in clear folder structures
- Validate schema before files land in the shared bucket
- Put a view on top of the external table for consumers
- Load frequently queried data into native BigQuery tables later
A lightweight pattern could look like this:
```
Source system -> GCS raw bucket -> External table -> Validation view -> Scheduled load to native table
```
This gives us a nice balance. We get quick access to the data as soon as files arrive, but we do not force all users to depend forever on external-file query behavior.
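The last hop in that pattern can be as simple as a scheduled query that materializes the validation view into a native table. A minimal sketch, with an illustrative target table name:

```sql
-- Materialize the cleaned view into a native table; in practice this
-- would run as a BigQuery scheduled query or from an orchestrator.
CREATE OR REPLACE TABLE `demo_project.analytics.orders_native` AS
SELECT *
FROM `demo_project.analytics.orders_clean`;
```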
Example with Parquet
If your files are in Parquet, the setup is usually cleaner because the schema is embedded and the format is more analytics-friendly.
```sql
CREATE OR REPLACE EXTERNAL TABLE `demo_project.raw_ext.orders_parquet_ext`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://demo-raw-sales-parquet/orders/*.parquet']
);
```
If I had a choice in a new project, I would usually pick Parquet over CSV for this kind of integration unless a source system forces CSV.
What I would avoid
I would avoid using external tables as the permanent answer for a heavily used analytics layer. If business dashboards, multiple joins, and many daily users depend on the dataset, native BigQuery tables are usually a better design. External tables are great for speed of setup, not always for long-term serving performance.
Conclusion
External tables in BigQuery on GCS data are a very practical option when you want to query files quickly without building a full ingestion pipeline first. They work well for raw access, exploration, and some lightweight reporting use cases. But they do come with trade-offs around performance, consistency, and feature flexibility. For our use case, this is a good tool to have when we want fast access to GCS data, as long as we are clear about where it fits and where a native BigQuery table would be the better next step.
