Terraform basics for data engineering infrastructure
Infrastructure as code is no longer something only a platform or DevOps team needs to care about. In many data engineering projects, the same person who builds a pipeline also needs to provision the bucket, dataset, scheduler, service account, or some other cloud resource that makes that pipeline work. In this article, let us look at the basics of Terraform and why it is useful for data engineering infrastructure. We will use a small GCP example so the workflow is easy to follow, and then look at what would usually change in a production setup.
Why Terraform is useful for data engineering
When we create cloud resources manually from the console, things work fast in the beginning, but after a while it becomes difficult to track what was created, who changed it, and how to recreate the same setup in another environment. That is where Terraform helps. We describe the infrastructure in code, run Terraform, and let it create or update the resources.
For a data engineer, this becomes useful in cases like:
- creating GCS buckets for landing files
- creating BigQuery datasets and tables
- managing Pub/Sub topics and subscriptions
- provisioning service accounts and IAM bindings
- setting up Cloud Scheduler, Cloud Run, or Dataform related resources
The biggest benefit is not just automation. It is repeatability. If your development environment works, you can use the same pattern for staging and production with only a few controlled changes.
What Terraform actually does
Terraform compares two things:
- the infrastructure code you wrote
- the infrastructure that already exists
Based on that, it calculates what needs to be created, updated, or deleted. This is why the terraform plan step is so useful. Before changing anything, we can see what Terraform is about to do.
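For example, for a single bucket resource, an abridged plan output might look roughly like this (the resource and bucket names are placeholders, and most attributes are omitted):

```text
Terraform will perform the following actions:

  # google_storage_bucket.landing_bucket will be created
  + resource "google_storage_bucket" "landing_bucket" {
      + name          = "my-demo-dataeng-bucket-123"
      + force_destroy = true
    }

Plan: 1 to add, 0 to change, 0 to destroy.
```

The `+` prefix marks attributes that will be created; `~` and `-` mark updates and deletions. Reading this summary before applying is the whole point of the plan step.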
A simple way to think about it is like this:
| Step | What it does | Why it matters |
|---|---|---|
| terraform init | Downloads provider plugins and initializes the folder | Needed once before Terraform can work |
| terraform plan | Shows the proposed changes | Helps review before applying |
| terraform apply | Executes the changes | Creates or updates resources |
| terraform destroy | Deletes managed resources | Useful for cleaning up demo environments |
A simple GCP example
For a beginner example, creating a GCS bucket is enough. It is simple, but it still shows the main Terraform flow clearly.
Let us assume we have a folder like this:
```text
terraform-demo/
├── main.tf
├── variables.tf
└── terraform.tfvars
```
main.tf
```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_storage_bucket" "landing_bucket" {
  name                        = var.bucket_name
  location                    = var.region
  force_destroy               = true
  uniform_bucket_level_access = true
}
```
variables.tf
```hcl
variable "project_id" {
  type = string
}

variable "region" {
  type = string
}

variable "bucket_name" {
  type = string
}
```
terraform.tfvars
```hcl
project_id  = "my-gcp-project"
region      = "australia-southeast1"
bucket_name = "my-demo-dataeng-bucket-123"
```
This example is intentionally small. In real projects, you would probably split providers, IAM, storage, and compute into separate files or even separate modules. But for learning, keeping everything simple is better.
Running Terraform locally
Once the files are ready, the usual flow is:

```shell
terraform init
terraform plan
terraform apply
```
If the plan looks correct, type yes during apply and Terraform will create the bucket. After that, if you run terraform plan again without changing anything, it should show no changes. That is a good sign because it means your code matches the deployed infrastructure.
One thing beginners notice quickly is the terraform.tfstate file. Terraform uses this state file to keep track of the resources it manages. Without it, Terraform would not know what it created previously.
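The state file is plain JSON. A heavily abridged sketch of what it records for the bucket above might look like this (the exact fields vary by Terraform version):

```json
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "google_storage_bucket",
      "name": "landing_bucket",
      "instances": [
        {
          "attributes": {
            "name": "my-demo-dataeng-bucket-123"
          }
        }
      ]
    }
  ]
}
```

Treat this file as Terraform's internal bookkeeping: do not edit it by hand, and do not commit it to version control, because it can contain sensitive values.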
What to be careful about
This is usually where the first real problems begin. Terraform is simple in a demo, but there are a few things to be careful about.
1. State file management
If you keep the state file only on your laptop, that is fine for practice but not ideal for a team setup. Another engineer could make changes separately and then your local state would be out of sync. In GCP projects, it is common to store the Terraform state in a dedicated GCS bucket.
A backend configuration could look like this:
```hcl
terraform {
  backend "gcs" {
    bucket = "terraform-state-files"
    prefix = "dataeng-demo"
  }
}
```
With this, multiple runs can use the same remote state instead of depending on one local file.
2. IAM permissions
It is tempting to use your own cloud account for Terraform, especially in a demo. But it is better to create a separate service account with only the permissions needed. For example, if you are only creating buckets, you do not need broad project owner access.
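As a sketch of that idea, the service account and a narrowly scoped role can themselves be managed in Terraform. The account name and role below are assumptions for the bucket-only example, not requirements:

```hcl
# A dedicated service account for running Terraform
resource "google_service_account" "terraform" {
  account_id   = "terraform-dataeng"
  display_name = "Terraform for data engineering demo"
}

# Grant only what bucket management needs, not roles/owner
resource "google_project_iam_member" "storage_admin" {
  project = var.project_id
  role    = "roles/storage.admin"
  member  = "serviceAccount:${google_service_account.terraform.email}"
}
```

If the project later manages BigQuery or Pub/Sub as well, add those specific roles one at a time instead of falling back to owner access.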
3. Unique resource names
Some cloud resources, like GCS buckets, need globally unique names. If you copy examples directly, the apply will fail because someone else already used that name. Adding a suffix based on the environment or project is usually enough.
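One hedged way to do that is the hashicorp/random provider's random_id resource, which appends a short random hex suffix instead of relying on a hand-picked var.bucket_name. The var.env variable and naming pattern here are illustrative assumptions:

```hcl
# Requires the hashicorp/random provider in required_providers
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "google_storage_bucket" "landing" {
  # Combines project, purpose, environment, and a random suffix
  name                        = "${var.project_id}-landing-${var.env}-${random_id.bucket_suffix.hex}"
  location                    = var.region
  uniform_bucket_level_access = true
}
```

Because the suffix is stored in state, the name stays stable across later plans instead of changing on every run.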
4. Drift from manual changes
If someone edits the bucket manually from the GCP console, the Terraform code and the actual resource can drift apart. Terraform will detect some of this in the next plan, but it is still better to avoid manual changes as much as possible.
How this fits into a data engineering workflow
A very common pattern is that Terraform creates the infrastructure and another tool runs the data workload. For example:
- Terraform creates a GCS bucket, BigQuery dataset, and service account
- A Python or Spark job writes files to the bucket
- BigQuery external tables or load jobs read those files
- Cloud Scheduler or Airflow triggers the workflow regularly
We might also provision a scheduled query or Pub/Sub topic using Terraform, but the transformation logic itself would remain in SQL, Python, dbt, or Spark. Terraform is not a pipeline tool. It is the tool that prepares the environment around the pipeline.
For example, we could provision a BigQuery dataset like this:
```hcl
resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics"
  location   = "australia-southeast1"
}
```
And then later a SQL job could write into it:
```sql
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT order_date, COUNT(*) AS total_orders
FROM raw.orders
GROUP BY order_date;
```
That is a good separation of concerns. Terraform provisions the dataset, and SQL manages the data inside it.
What changes in production
For a simple demo, one folder with a few Terraform files is enough. In production, I would usually make a few changes:
- keep remote state in a locked shared backend
- separate environments like dev, test, and prod
- use modules for reusable patterns
- run terraform plan and apply from CI/CD instead of a laptop
- use service accounts and secrets management properly
- add policy checks or at least code review before apply
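A hedged sketch of how the environment and module points might look in practice; the module path, inputs, and file names are assumptions, not a fixed convention:

```hcl
# modules/gcs-bucket encapsulates the bucket pattern once;
# each environment then reuses it with its own inputs
module "landing_bucket" {
  source     = "./modules/gcs-bucket"
  project_id = var.project_id
  region     = var.region
  env        = var.env
}
```

Each environment gets its own variable file, applied with something like terraform apply -var-file=env/dev.tfvars, while the module code stays shared across dev, test, and prod.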
Another useful production habit is to keep the plan review separate from apply. It is easy to make mistakes in IAM, networking, or deletion settings, and those mistakes can become expensive very quickly.
Limitations of Terraform
Terraform is very useful, but it is not perfect.
- It does not replace good environment design
- It can become messy if everything is kept in one large root module
- Some provider features lag behind the cloud platform
- Importing existing manually created resources can take effort
- Poor state handling can create confusing failures
So the goal should not be to put every possible thing into Terraform on day one. Start with the core infrastructure that benefits most from repeatability, then expand gradually.
Conclusion
Terraform is one of the most useful tools a data engineer can learn when working on cloud platforms. Even if your first use case is just a bucket, dataset, or IAM binding, the habit of managing infrastructure as code pays off quickly. Start with a very small example, understand the init, plan, and apply flow well, and then grow from there. Once those basics are clear, it becomes much easier to manage real data engineering infrastructure with confidence.
