Small team data platform architecture on GCP
In this article, let us look at a simple data platform architecture on GCP that works well for a small team. This approach is useful when you need to ingest files or events, do some light to medium transformations, and serve clean tables for reporting, but you do not want to bring too many tools into the picture from day one.
When people talk about data platforms, the diagrams can quickly become bigger than what most teams really need. For our use case, I am assuming a small team with maybe one or two data engineers, a few analysts, and a limited budget. In such cases, the goal is usually not to build the perfect platform. The goal is to build something that is easy to run, easy to debug, and easy to expand later.
What I would keep in the first version
A practical first version on GCP could look like this:
- Cloud Storage for landing raw files
- Pub/Sub for event-driven notifications when needed
- Dataflow or Cloud Run jobs for ingestion and transformation
- BigQuery for the warehouse layer
- Cloud Composer or Workflows only if orchestration starts getting messy
- Terraform for infrastructure
- GitHub Actions for CI/CD
If I describe it in plain words, the flow is something like this:
- Source system drops files in GCS or pushes events
- Pub/Sub or a scheduler triggers processing
- Processing job validates and standardizes data
- BigQuery stores raw, cleaned, and curated tables
- BI tools query the curated layer
This is not the only valid setup, but for a small team it is a good balance between managed services and operational simplicity.
A simple comparison
| Need | GCP service | Why I would choose it |
|---|---|---|
| Raw file landing | Cloud Storage | Cheap, simple, and works with many ingestion patterns |
| Analytics warehouse | BigQuery | Serverless and fast enough for most small teams |
| Event messaging | Pub/Sub | Good when pipelines need decoupling |
| Batch or stream processing | Dataflow | Strong option when transformation logic grows |
| Lightweight job execution | Cloud Run Jobs | Easier than Dataflow for smaller Python workloads |
| Orchestration | Workflows or Composer | Add only when cron plus scripts becomes painful |
One thing I have noticed in small teams is that they often adopt Composer too early because it feels like a proper data platform tool. But Airflow-based orchestration brings real maintenance overhead of its own. If the team only has three or four pipelines, scheduled Cloud Run jobs or Workflows can be much easier.
Layering the data
I still prefer to organize the warehouse in raw, cleaned, and curated zones even when using BigQuery. Different teams call this bronze, silver, and gold. The naming matters less than being consistent.
A possible dataset structure is below:
```
demo
  raw_sales
  clean_sales
  marts
  monitoring
```
The idea is simple:
- raw keeps the source close to how it arrived
- clean applies data types, deduplication, and basic quality checks
- marts contains business-ready tables for reporting
For example, if a CSV file lands in GCS under gs://demo-landing/sales/orders/, a first load can place it into raw_sales.orders_ext (an external table over the files) or raw_sales.orders_raw (a native table). Then a SQL transformation can produce the clean table.
```sql
create or replace table `demo.clean_sales.orders` as
select
  cast(order_id as string) as order_id,
  cast(customer_id as string) as customer_id,
  cast(order_timestamp as timestamp) as order_ts,
  cast(total_amount as numeric) as total_amount,
  current_timestamp() as processed_at
from `demo.raw_sales.orders_raw`
where order_id is not null;
```
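One gap worth pointing out: the statement above casts and filters, but it does not actually deduplicate, even though that is one of the jobs of the clean layer. A minimal sketch of folding deduplication into the same step, assuming the row with the latest order_timestamp should win (that tie-breaking rule is my assumption, not something the architecture dictates):

```sql
create or replace table `demo.clean_sales.orders` as
select
  cast(order_id as string) as order_id,
  cast(customer_id as string) as customer_id,
  cast(order_timestamp as timestamp) as order_ts,
  cast(total_amount as numeric) as total_amount,
  current_timestamp() as processed_at
from `demo.raw_sales.orders_raw`
where order_id is not null
-- keep one row per order_id, preferring the most recent order_timestamp
qualify row_number() over (
  partition by order_id
  order by order_timestamp desc
) = 1;
```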
Later, the mart layer can aggregate this into business-friendly tables:
```sql
create or replace table `demo.marts.daily_sales` as
select
  date(order_ts) as order_date,
  sum(total_amount) as revenue,
  count(distinct order_id) as order_count
from `demo.clean_sales.orders`
group by 1;
```
Ingestion choices
For small teams, I usually see two common ingestion patterns.
1. File-based ingestion
This is the easier pattern. Upstream systems place JSON, CSV, or Parquet files in Cloud Storage. A Cloud Storage event can publish to Pub/Sub, and a consumer job can process the file.
Pseudo-flow:
```
GCS upload -> Pub/Sub message -> Cloud Run job or Dataflow pipeline -> BigQuery raw table
```
If transformation is light, Cloud Run is often enough. If you are handling many files, larger volumes, or streaming data, Dataflow starts to make more sense.
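For the first hop into BigQuery, the consumer job does not even need custom parsing code. A minimal sketch using BigQuery's LOAD DATA statement, reusing the landing path and raw table from the earlier example (the CSV options and the wildcard are assumptions about the file layout):

```sql
-- append everything under the landing prefix into the raw table
load data into `demo.raw_sales.orders_raw`
from files (
  format = 'CSV',
  skip_leading_rows = 1,  -- assumes each file has a header row
  uris = ['gs://demo-landing/sales/orders/*.csv']
);
```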
2. Scheduled extraction
Some systems cannot push files. In that case, use Cloud Scheduler to call a Cloud Run job or a small extraction service. That job fetches data from an API and writes to GCS or directly to BigQuery staging tables.
This is still a valid architecture. Not every pipeline needs streaming. I think small teams sometimes overestimate how much real-time processing they actually need.
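If the extraction job writes to a staging table first, promoting rows into the raw layer with a MERGE keeps re-runs idempotent, which also helps with the duplicate-ingestion caveat discussed later. A sketch, assuming a hypothetical staging table demo.raw_sales.orders_stage with the same columns as the raw table:

```sql
-- upsert from staging so that re-running the extraction is safe
-- (orders_stage is a hypothetical staging table, not from the setup above)
merge `demo.raw_sales.orders_raw` t
using `demo.raw_sales.orders_stage` s
on t.order_id = s.order_id
when matched then update set
  customer_id     = s.customer_id,
  order_timestamp = s.order_timestamp,
  total_amount    = s.total_amount
when not matched then insert
  (order_id, customer_id, order_timestamp, total_amount)
values
  (s.order_id, s.customer_id, s.order_timestamp, s.total_amount);
```

Note that BigQuery rejects a MERGE when more than one source row matches the same target row, so the staging table itself should be deduplicated first.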
Infrastructure as code
I would definitely manage this using Terraform from the beginning. Even if the first version is small, it avoids many manual configuration surprises later. A very small example could be:
```hcl
resource "google_storage_bucket" "landing" {
  name     = "demo-landing"
  location = "australia-southeast1"
}

resource "google_bigquery_dataset" "raw_sales" {
  dataset_id = "raw_sales"
  location   = "australia-southeast1"
}

resource "google_pubsub_topic" "file_events" {
  name = "file-events"
}
```
In production, I would also add separate service accounts, least-privilege IAM, remote Terraform state, and environment separation such as dev and prod projects. For a demo, people skip these things. For a real team, these are worth doing early.
Orchestration without making it too heavy
A lot of small platforms can start with nothing more than scheduled jobs and event triggers. Once dependencies grow, then introduce orchestration. On GCP, I would consider the below order:
- Start with Cloud Scheduler + Cloud Run Jobs
- Move to Workflows if you need retries and multi-step control
- Move to Composer only when you truly need DAG-level orchestration across many pipelines
That progression is usually easier to operate than starting with Composer on day one. Composer is powerful, but it is not free from overhead. Someone still needs to own it.
Monitoring and caveats
This architecture is simple, but there are still things to be careful about:
- Schema drift - source files change quietly, and raw loads start failing or, worse, load bad data
- Duplicate ingestion - event-driven pipelines should be idempotent
- BigQuery cost surprises - unpartitioned large tables can become expensive
- IAM sprawl - too many broad permissions accumulate quickly
- Missing observability - if you do not log row counts and failures, debugging becomes slow
For BigQuery tables that grow, I would partition by ingestion date or event date and cluster on commonly filtered columns. For example, an orders table might be partitioned by date(order_ts) and clustered by customer_id.
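As an illustration, a partitioned and clustered version of the clean orders table could be declared like this (a sketch; the column types follow the earlier clean-layer query):

```sql
create table `demo.clean_sales.orders`
(
  order_id     string,
  customer_id  string,
  order_ts     timestamp,
  total_amount numeric,
  processed_at timestamp
)
partition by date(order_ts)  -- one partition per order date
cluster by customer_id;      -- cluster on a commonly filtered column
```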
For monitoring, even a simple audit table helps a lot:
```sql
insert into `demo.monitoring.pipeline_runs`
  (run_id, pipeline_name, status, row_count, loaded_at)
values
  ('abc-123', 'orders_ingestion', 'SUCCESS', 15420, current_timestamp());
```
This does not replace full observability, but it gives the team something concrete to query when a stakeholder says the dashboard looks wrong.
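On top of that table, even a trivial query answers most first questions, for example what ran in the last week and whether anything failed:

```sql
select pipeline_name, status, row_count, loaded_at
from `demo.monitoring.pipeline_runs`
where loaded_at >= timestamp_sub(current_timestamp(), interval 7 day)
order by loaded_at desc;
```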
What changes in production
For a production-grade version, I would make a few changes:
- separate projects for dev and prod
- CI/CD with plan and apply controls for Terraform
- stricter IAM and dedicated service accounts per workload
- data quality checks before publishing to marts (a small sketch follows this list)
- dead-letter handling for failed events
- better metadata and lineage tracking
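For the data quality step, BigQuery's ASSERT statement is often enough to fail a run before bad data reaches the marts. A small sketch with two checks against the clean orders table (the specific rules are illustrative, not prescriptive):

```sql
-- abort the run if the clean layer contains duplicate orders
assert (
  select count(*) = count(distinct order_id)
  from `demo.clean_sales.orders`
) as 'duplicate order_id values in clean_sales.orders';

-- abort the run if any order has a negative amount
assert not exists (
  select 1
  from `demo.clean_sales.orders`
  where total_amount < 0
) as 'negative total_amount values in clean_sales.orders';
```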
I would also think carefully before choosing Dataflow everywhere. It is excellent, but if most transformations are SQL-based, BigQuery scheduled queries or dbt can keep the platform much simpler. The best architecture for a small team is often the one that removes moving parts, not the one that showcases every managed service in GCP.
Conclusion
As we have seen, a small team data platform on GCP does not need to be complicated to be useful. Cloud Storage, BigQuery, a simple processing layer, and a modest amount of orchestration can cover a lot of real use cases. Start with the simplest setup that your team can reliably run, and only add more services when the current shape truly becomes a bottleneck.
