Small team data platform architecture on GCP
In this article, let us look at a simple data platform architecture on GCP that works well for a small team. This approach is useful when you need to ingest files or events, do some light to medium transformations, and serve clean tables for reporting, but you do not want to bring too many tools into the picture from day one.
When people talk about data platforms, the diagrams can quickly become bigger than what most teams really need. For our use case, I am assuming a small team with maybe one or two data engineers, a few analysts, and a limited budget. In such cases, the goal is usually not to build the perfect platform. The goal is to build something that is easy to run, easy to debug, and easy to expand later.
What I would keep in the first version
A practical first version on GCP could look like this:
- Cloud Storage for landing raw files
- Pub/Sub for event-driven notifications when needed
- Dataflow or Cloud Run jobs for ingestion and transformation
- BigQuery for the warehouse layer
- Cloud Composer or Workflows only if orchestration starts getting messy
- Terraform for infrastructure
- GitHub Actions for CI/CD
If I describe it in plain words, the flow is something like this:
- Source system drops files in GCS or pushes events
- Pub/Sub or a scheduler triggers processing
- Processing job validates and standardizes data
- BigQuery stores raw, cleaned, and curated tables
- BI tools query the curated layer
This is not the only valid setup, but for a small team it is a good balance between managed services and operational simplicity.
A simple comparison
| Need | GCP service | Why I would choose it |
|---|---|---|
| Raw file landing | Cloud Storage | Cheap, simple, and works with many ingestion patterns |
| Analytics warehouse | BigQuery | Serverless and fast enough for most small teams |
| Event messaging | Pub/Sub | Good when pipelines need decoupling |
| Batch or stream processing | Dataflow | Strong option when transformation logic grows |
| Lightweight job execution | Cloud Run Jobs | Easier than Dataflow for smaller Python workloads |
| Orchestration | Workflows or Composer | Add only when cron plus scripts becomes painful |
One thing I have noticed in small teams is that they often adopt Composer too early because it feels like a proper data platform tool. But Airflow-based orchestration brings real maintenance overhead of its own. If the team only has three or four pipelines, scheduled Cloud Run jobs or Workflows can be much easier.
Layering the data
I still prefer to organize the warehouse in raw, cleaned, and curated zones even when using BigQuery. Different teams call this bronze, silver, and gold. The naming matters less than being consistent.
A possible dataset structure is below:
```
demo
  raw_sales
  clean_sales
  marts
  monitoring
```
The idea is simple:
- raw keeps the source close to how it arrived
- clean applies data types, deduplication, and basic quality checks
- marts contains business-ready tables for reporting
For example, if a CSV file lands in GCS under gs://demo-landing/sales/orders/, a first load can place it into raw_sales.orders_ext (an external table over the files) or raw_sales.orders_raw (a native table). Then a SQL transformation can produce the clean table.
```sql
create or replace table `demo.clean_sales.orders` as
select
  cast(order_id as string) as order_id,
  cast(customer_id as string) as customer_id,
  cast(order_timestamp as timestamp) as order_ts,
  cast(total_amount as numeric) as total_amount,
  current_timestamp() as processed_at
from `demo.raw_sales.orders_raw`
where order_id is not null;
```
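One gap worth pointing out: the statement above casts and filters, but it does not actually deduplicate, even though that is one of the jobs of the clean layer. A minimal sketch of folding deduplication into the same step, assuming the row with the latest order_timestamp should win (that tie-breaking rule is my assumption, not something the architecture dictates):

```sql
create or replace table `demo.clean_sales.orders` as
select
  cast(order_id as string) as order_id,
  cast(customer_id as string) as customer_id,
  cast(order_timestamp as timestamp) as order_ts,
  cast(total_amount as numeric) as total_amount,
  current_timestamp() as processed_at
from `demo.raw_sales.orders_raw`
where order_id is not null
-- keep one row per order_id, preferring the most recent order_timestamp
qualify row_number() over (
  partition by order_id
  order by order_timestamp desc
) = 1;
```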
Later, the mart layer can aggregate this into business-friendly tables:
```sql
create or replace table `demo.marts.daily_sales` as
select
  date(order_ts) as order_date,
  sum(total_amount) as revenue,
  count(distinct order_id) as order_count
from `demo.clean_sales.orders`
group by 1;
```
Ingestion choices
For small teams, I usually see two common ingestion patterns.
1. File-based ingestion
This is the easier pattern. Upstream systems place JSON, CSV, or Parquet files in Cloud Storage. A Cloud Storage event can publish to Pub/Sub, and a consumer job can process the file.
Pseudo-flow:
```
GCS upload -> Pub/Sub message -> Cloud Run job or Dataflow pipeline -> BigQuery raw table
```
If transformation is light, Cloud Run is often enough. If you are handling many files, larger volumes, or streaming data, Dataflow starts to make more sense.
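For the first hop into BigQuery, the consumer job does not even need custom parsing code. A minimal sketch using BigQuery's LOAD DATA statement, reusing the landing path and raw table from the earlier example (the CSV options and the wildcard are assumptions about the file layout):

```sql
-- append everything under the landing prefix into the raw table
load data into `demo.raw_sales.orders_raw`
from files (
  format = 'CSV',
  skip_leading_rows = 1,  -- assumes each file has a header row
  uris = ['gs://demo-landing/sales/orders/*.csv']
);
```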
2. Scheduled extraction
Some systems cannot push files. In that case, use Cloud Scheduler to call a Cloud Run job or a small extraction service. That job fetches data from an API and writes to GCS or directly to BigQuery staging tables.
This is still a valid architecture. Not every pipeline needs streaming. I think small teams sometimes overestimate how much real-time processing they actually need.
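If the extraction job writes to a staging table first, promoting rows into the raw layer with a MERGE keeps re-runs idempotent, which also helps with the duplicate-ingestion caveat discussed later. A sketch, assuming a hypothetical staging table demo.raw_sales.orders_stage with the same columns as the raw table:

```sql
-- upsert from staging so that re-running the extraction is safe
-- (orders_stage is a hypothetical staging table, not from the setup above)
merge `demo.raw_sales.orders_raw` t
using `demo.raw_sales.orders_stage` s
on t.order_id = s.order_id
when matched then update set
  customer_id     = s.customer_id,
  order_timestamp = s.order_timestamp,
  total_amount    = s.total_amount
when not matched then insert
  (order_id, customer_id, order_timestamp, total_amount)
values
  (s.order_id, s.customer_id, s.order_timestamp, s.total_amount);
```

Note that BigQuery rejects a MERGE when more than one source row matches the same target row, so the staging table itself should be deduplicated first.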
Infrastructure as code
I would definitely manage this using Terraform from the beginning. Even if the first version is small, it avoids many manual configuration surprises later. A very small example could be:
```hcl
resource "google_storage_bucket" "landing" {
  name     = "demo-landing"
  location = "australia-southeast1"
}

resource "google_bigquery_dataset" "raw_sales" {
  dataset_id = "raw_sales"
  location   = "australia-southeast1"
}

resource "google_pubsub_topic" "file_events" {
  name = "file-events"
}
```
In production, I would also add separate service accounts, least-privilege IAM, remote Terraform state, and environment separation such as dev and prod projects. For a demo, people skip these things. For a real team, these are worth doing early.
Orchestration without making it too heavy
A lot of small platforms can start with nothing more than scheduled jobs and event triggers. Once dependencies grow, then introduce orchestration. On GCP, I would consider the below order:
- Start with Cloud Scheduler + Cloud Run Jobs
- Move to Workflows if you need retries and multi-step control
- Move to Composer only when you truly need DAG-level orchestration across many pipelines
That progression is usually easier to operate than starting with Composer on day one. Composer is powerful, but it is not free from overhead. Someone still needs to own it.
Monitoring and caveats
This architecture is simple, but there are still things to be careful about:
- Schema drift - source files change quietly, and raw loads start failing or, worse, load bad data
- Duplicate ingestion - event-driven pipelines should be idempotent
- BigQuery cost surprises - unpartitioned large tables can become expensive
- IAM sprawl - too many broad permissions accumulate quickly
- Missing observability - if you do not log row counts and failures, debugging becomes slow
For BigQuery tables that grow, I would partition by ingestion date or event date and cluster on commonly filtered columns. For example, an orders table might be partitioned by date(order_ts) and clustered by customer_id.
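As an illustration, a partitioned and clustered version of the clean orders table could be declared like this (a sketch; the column types follow the earlier clean-layer query):

```sql
create table `demo.clean_sales.orders`
(
  order_id     string,
  customer_id  string,
  order_ts     timestamp,
  total_amount numeric,
  processed_at timestamp
)
partition by date(order_ts)  -- one partition per order date
cluster by customer_id;      -- cluster on a commonly filtered column
```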
For monitoring, even a simple audit table helps a lot:
```sql
insert into `demo.monitoring.pipeline_runs`
  (run_id, pipeline_name, status, row_count, loaded_at)
values
  ('abc-123', 'orders_ingestion', 'SUCCESS', 15420, current_timestamp());
```
This does not replace full observability, but it gives the team something concrete to query when a stakeholder says the dashboard looks wrong.
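On top of that table, even a trivial query answers most first questions, for example what ran in the last week and whether anything failed:

```sql
select pipeline_name, status, row_count, loaded_at
from `demo.monitoring.pipeline_runs`
where loaded_at >= timestamp_sub(current_timestamp(), interval 7 day)
order by loaded_at desc;
```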
What changes in production
For a production-grade version, I would make a few changes:
- separate projects for dev and prod
- CI/CD with plan and apply controls for Terraform
- stricter IAM and dedicated service accounts per workload
- data quality checks before publishing to marts (a small sketch follows this list)
- dead-letter handling for failed events
- better metadata and lineage tracking
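For the data quality step, BigQuery's ASSERT statement is often enough to fail a run before bad data reaches the marts. A small sketch with two checks against the clean orders table (the specific rules are illustrative, not prescriptive):

```sql
-- abort the run if the clean layer contains duplicate orders
assert (
  select count(*) = count(distinct order_id)
  from `demo.clean_sales.orders`
) as 'duplicate order_id values in clean_sales.orders';

-- abort the run if any order has a negative amount
assert not exists (
  select 1
  from `demo.clean_sales.orders`
  where total_amount < 0
) as 'negative total_amount values in clean_sales.orders';
```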
I would also think carefully before choosing Dataflow everywhere. It is excellent, but if most transformations are SQL-based, BigQuery scheduled queries or dbt can keep the platform much simpler. The best architecture for a small team is often the one that removes moving parts, not the one that showcases every managed service in GCP.
Conclusion
As we have seen, a small team data platform on GCP does not need to be complicated to be useful. Cloud Storage, BigQuery, a simple processing layer, and a modest amount of orchestration can cover a lot of real use cases. Start with the simplest setup that your team can reliably run, and only add more services when the current shape truly becomes a bottleneck.
