Terraform Basics for Data Engineers: A Practical Walkthrough

Infrastructure as code has moved from being a niche platform-engineering skill to something every data engineer bumps into. A few years ago, my job stopped at making the pipeline run — someone else owned the VMs, the storage buckets, the IAM bindings. These days, especially in cloud-native teams, you are expected to own the infrastructure your pipelines depend on.

This article walks through getting started with Terraform on Google Cloud from a data engineer’s perspective. We will create a BigQuery dataset and a Cloud Storage bucket — the kind of resources you reach for in almost every data project. By the end, you will have Terraform running locally, a working configuration, and a clear sense of what to watch out for when this goes beyond a demo.

Why Terraform and Not ClickOps?

Clicking around the GCP console to create a dataset or a bucket is fine when you are experimenting. It falls apart quickly once you need to replicate that setup across dev, staging, and production, or when someone else on the team needs to understand what resources exist and why.

Terraform gives you a declarative way to describe infrastructure. You write what you want, and Terraform works out which API calls to make. The configuration files act as living documentation. If a bucket gets deleted by accident, terraform apply recreates it from the config (the bucket itself, not the objects that were in it), assuming your state file is safe, which we will come back to.

For data engineers, the sweet spot is managing things like:

  • GCS buckets for landing zones and data lake storage
  • BigQuery datasets and tables
  • Pub/Sub topics and subscriptions
  • Cloud Functions or Cloud Run services for lightweight transforms
  • IAM permissions scoped to specific datasets or buckets

You are not replacing a platform team with Terraform. You are making your data infrastructure reproducible and reviewable.
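
As a rough illustration of two items from the list above, the sketch below declares a Pub/Sub topic with a pull subscription and grants a service account read access to a single BigQuery dataset. The resource names, the raw_data dataset ID, and the service account address are placeholders rather than values from a real project, and the block assumes a configured Google provider.

```hcl
# Hypothetical names throughout; adjust to your project.
resource "google_pubsub_topic" "events" {
  name = "ingest-events"
}

resource "google_pubsub_subscription" "events_pull" {
  name  = "ingest-events-pull"
  topic = google_pubsub_topic.events.id

  # Seconds a subscriber has to acknowledge a message before redelivery.
  ack_deadline_seconds = 60
}

# Dataset-scoped grant: read access to one dataset, not the whole project.
resource "google_bigquery_dataset_iam_member" "reader" {
  dataset_id = "raw_data"
  role       = "roles/bigquery.dataViewer"
  member     = "serviceAccount:pipeline-sa@your-gcp-project-id.iam.gserviceaccount.com"
}
```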

Setting Up Terraform

Installation is straightforward on macOS or Linux:

brew tap hashicorp/tap
brew install hashicorp/tap/terraform

On Windows, grab the binary from the HashiCorp releases page and add it to your PATH.

Once installed, verify it works:

terraform version

GCP Authentication

Terraform needs credentials to talk to GCP. For local development, this walkthrough uses a dedicated service account rather than your personal user account, because that makes it easier to scope permissions tightly.

  1. In the GCP console, go to IAM & Admin → Service Accounts and create a new one (e.g., terraform-sa).
  2. Grant it the roles you need. For our example, Storage Admin and BigQuery Data Editor are enough.
  3. Create a JSON key under Actions → Manage Keys → Add Key → Create New Key → JSON. Download the file and rename it to keys.json.
  4. Move it somewhere outside your repo (never commit it) and set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keys.json"

For production, skip the long-lived JSON key entirely. Use Workload Identity Federation or a service account attached to the environment (Cloud Build, GitHub Actions with OIDC, etc.). Long-lived keys are a security risk and a compliance headache.
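
One way to keep key files out of the picture is to leave credentials out of the configuration entirely and let the provider use whatever identity the environment supplies. The sketch below also sets the Google provider's impersonate_service_account argument. Note that impersonation is a different mechanism from Workload Identity Federation, and the service account address and the two variables are assumptions for illustration.

```hcl
provider "google" {
  project = var.project_id
  region  = var.region

  # No `credentials` argument: the provider falls back to Application Default
  # Credentials from the environment (for example, a CI runner's identity).

  # Optional, and an assumption for this sketch: act as a dedicated service
  # account instead of the caller's own identity. The caller needs the
  # Service Account Token Creator role on this account.
  impersonate_service_account = "terraform-sa@your-gcp-project-id.iam.gserviceaccount.com"
}
```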

Writing Your First Terraform Configuration

Create a new directory and a file called main.tf:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_storage_bucket" "landing_zone" {
  name          = "${var.project_id}-landing-zone"
  location      = var.region
  force_destroy = false

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 90
    }
  }
}

resource "google_bigquery_dataset" "raw" {
  dataset_id = "raw_data"
  location   = var.region
}

And a variables.tf:

variable "project_id" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "GCP region for resources"
  type        = string
  default     = "australia-southeast1"
}

Create a terraform.tfvars file (also never commit this):

project_id = "your-gcp-project-id"
region     = "australia-southeast1"

The configuration above does three things: it sets up the Google provider, creates a GCS bucket with a 90-day lifecycle rule (auto-delete objects older than 90 days), and creates a BigQuery dataset called raw_data. Not exciting on its own, but it is the foundation you build on.
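
If other tools or teammates need the names of what was just created, output values are one way to expose them. This is a minimal sketch that assumes the resource names from main.tf above; the file name outputs.tf and the output names are my own choice.

```hcl
# outputs.tf (any .tf file in the same directory would work)
output "landing_zone_bucket" {
  description = "Name of the landing-zone bucket"
  value       = google_storage_bucket.landing_zone.name
}

output "landing_zone_url" {
  description = "gs:// URL of the landing-zone bucket"
  value       = google_storage_bucket.landing_zone.url
}

output "raw_dataset_id" {
  description = "ID of the raw BigQuery dataset"
  value       = google_bigquery_dataset.raw.dataset_id
}
```

After an apply, terraform output prints these values, which is handy for feeding bucket or dataset names into pipeline configuration.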

Initialise and Apply

Run these commands in order:

terraform init
terraform plan
terraform apply

init downloads the Google provider plugin. plan shows you what Terraform intends to create without making any changes — always run this and read the output. apply executes the plan; you will be prompted to type yes before it proceeds.

A quick comparison of Terraform commands worth knowing from day one:

Command              What It Does
terraform init       Initialises the working directory, downloads providers
terraform plan       Dry-run: shows what will be created, changed, or destroyed
terraform apply      Applies the configuration to provision resources
terraform destroy    Tears down everything defined in the configuration
terraform fmt        Formats your .tf files to the canonical style
terraform validate   Checks syntax and basic logic without hitting any APIs

State: The Part People Get Wrong Early

When Terraform creates resources, it records what it built in a state file (terraform.tfstate). This file is the source of truth that Terraform uses on the next run to figure out what changed.

By default, the state file sits on your local disk. That is fine for a solo experiment but a disaster for a team. If two people run terraform apply from different machines with separate state files, resources will drift or conflict. If your laptop dies, you lose the mapping between your config and reality — and cleaning up orphaned resources by hand is miserable.

For anything beyond a toy project, use a remote backend. GCS works well:

terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"
    prefix = "data-infra"
  }
}

Create the bucket manually once (the backend bucket has to exist before terraform init can use it, so this configuration cannot create it for you), then migrate your local state:

terraform init -migrate-state

Adding a Second Resource Without Duplicating Everything

Once you have more than a handful of resources, copying and pasting bucket definitions gets old. Terraform has for_each and count for this — but keep it simple at first. Here is a practical example where you need separate buckets for raw, staging, and curated data layers:

locals {
  data_layers = ["raw", "staging", "curated"]
}

resource "google_storage_bucket" "data_layers" {
  for_each = toset(local.data_layers)
  name     = "${var.project_id}-${each.key}-data"
  location = var.region
}

This creates three buckets with one block. Much cleaner.
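
Two follow-ups tend to come up straight away: how to refer to the individual buckets, and how to vary a setting per layer. The sketch below shows one way to do each. The output name, the layer_retention_days map, and the "-tmp" bucket suffix are hypothetical.

```hcl
# Map of layer name => bucket name, built from the for_each resource above.
# A single instance is addressed as google_storage_bucket.data_layers["raw"].
output "data_layer_buckets" {
  value = {
    for layer, bucket in google_storage_bucket.data_layers :
    layer => bucket.name
  }
}

locals {
  # Hypothetical per-layer retention, in days.
  layer_retention_days = {
    raw     = 30
    staging = 14
  }
}

# for_each over a map: each.key is the layer name, each.value the retention.
resource "google_storage_bucket" "expiring_layers" {
  for_each = local.layer_retention_days
  name     = "${var.project_id}-${each.key}-tmp"
  location = var.region

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = each.value
    }
  }
}
```

Using a set or map with for_each keys each instance by name, so removing one layer from the collection only affects that layer's bucket.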

What Changes in Production

A local setup is the starting point. Here is what a production-grade Terraform setup for data infrastructure usually looks like:

  • Remote state stored in GCS with versioning enabled, so you can roll back if a bad apply corrupts things.
  • State locking to prevent two pipeline runs from applying simultaneously. The GCS backend supports this natively: it writes a lock object next to the state file for the duration of each operation.
  • CI/CD-driven applies: you push to a branch, GitHub Actions runs terraform plan and posts the output as a PR comment. Merging to main triggers terraform apply. Nobody runs Terraform from their laptop against production.
  • Smaller modules: separate configurations for networking, storage, and compute rather than one monolithic main.tf with hundreds of resources.
  • Terraform Cloud or Atlantis for teams that need review workflows around infrastructure changes — same idea as code review, but for infrastructure.
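
For the first bullet, one option is to define the state bucket in a small, separate bootstrap configuration instead of creating it by hand. The following is a sketch under that assumption: it reuses the bucket name from the backend example and assumes a region variable like the one defined earlier.

```hcl
# Lives in its own "bootstrap" configuration with local state, separate from
# the configuration that uses this bucket as its backend.
resource "google_storage_bucket" "terraform_state" {
  name     = "my-terraform-state-bucket" # bucket names are globally unique
  location = var.region

  # Keep old generations of the state object so a bad write can be rolled back.
  versioning {
    enabled = true
  }

  uniform_bucket_level_access = true

  # Refuse to delete the bucket while it still contains objects.
  force_destroy = false

  lifecycle {
    # Make any plan that would destroy the state bucket fail with an error.
    prevent_destroy = true
  }
}
```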

Practical Limitations and Gotchas

A few things that have bitten me and are worth knowing early:

  1. Terraform is not great at managing data inside services. It can create a BigQuery dataset, but it should not be inserting rows into tables. That is what your pipeline does. Use Terraform for the infrastructure skeleton, not the data payload.

  2. Some GCP resources do not support in-place updates. Renaming a dataset (changing its dataset_id), for example, forces a destroy-and-recreate. Terraform marks this in the plan output as a replacement, so read it carefully.

  3. IAM changes are eventually consistent. If your Terraform config creates a bucket and immediately grants a service account access to it, the apply might fail on a race condition. Adding an explicit depends_on can help, but sometimes you just need to run apply twice.

  4. Secrets in state files. If you pass a password or an API key into a resource attribute, that value ends up in the state file in plain text. Use Secret Manager or a similar vault, and reference secrets by resource ID rather than embedding their values.

  5. Provider version pinning matters. The Google provider moves fast. Without a version constraint, a new provider release can change behaviour between runs. Pin to a major version (~> 5.0) at minimum, and commit the dependency lock file (.terraform.lock.hcl) that terraform init generates, so everyone resolves the same provider build.
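
Gotchas 2 and 3 can be illustrated in configuration. The sketch below adds a prevent_destroy guard to the earlier dataset block (it would replace that block, not sit beside it) and uses depends_on where Terraform cannot infer the ordering by itself. The Pub/Sub topic and the bucket notification are hypothetical additions, not part of the walkthrough's configuration.

```hcl
# Variant of the earlier dataset block with a guard added.
resource "google_bigquery_dataset" "raw" {
  dataset_id = "raw_data"
  location   = var.region

  lifecycle {
    # Any plan that would destroy this dataset, including a
    # destroy-and-recreate, fails with an error instead.
    prevent_destroy = true
  }
}

# The Cloud Storage service agent that publishes bucket notifications.
data "google_storage_project_service_account" "gcs" {}

resource "google_pubsub_topic" "landing_events" {
  name = "landing-zone-events"
}

resource "google_pubsub_topic_iam_member" "gcs_publisher" {
  topic  = google_pubsub_topic.landing_events.id
  role   = "roles/pubsub.publisher"
  member = "serviceAccount:${data.google_storage_project_service_account.gcs.email_address}"
}

resource "google_storage_notification" "landing_zone" {
  bucket         = google_storage_bucket.landing_zone.name
  payload_format = "JSON_API_V1"
  topic          = google_pubsub_topic.landing_events.id
  event_types    = ["OBJECT_FINALIZE"]

  # The notification never references the IAM grant, so the ordering has to
  # be stated explicitly.
  depends_on = [google_pubsub_topic_iam_member.gcs_publisher]
}
```

depends_on only fixes the order of operations. It does not wait for IAM propagation, so an occasional second apply may still be needed.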

Wrapping Up

Terraform is one of those tools where the first hour feels slow — you install it, you write a config, you stare at plan output — but the payoff compounds quickly. Once your buckets, datasets, and IAM bindings are defined in code, reproducing an environment or onboarding a new team member stops being a multi-day exercise.

For data engineers, the bar is not “become a DevOps expert.” It is “own the infrastructure your pipelines run on, and make sure someone else can pick it up without a handover document.” Terraform helps with exactly that.

Start with a bucket and a dataset. Add remote state. Then let the config grow alongside your project. The alternative is a messy console full of resources nobody remembers creating — and nobody wants to be the person cleaning that up.

This post is licensed under CC BY 4.0 by the author.