Terraform basics for data engineering infrastructure
Infrastructure as code is no longer something only a platform or DevOps team needs to care about. In many data engineering projects, the same person who builds a pipeline also needs to provision the bucket, dataset, scheduler, service account, or some other cloud resource that makes that pipeline work. In this article, let us look at the basics of Terraform and why it is useful for data engineering infrastructure. We will use a small GCP example so the workflow is easy to follow, and then look at what would usually change in a production setup.
Why Terraform is useful for data engineering
When we create cloud resources manually from the console, things work fast in the beginning, but after a while it becomes difficult to track what was created, who changed it, and how to recreate the same setup in another environment. That is where Terraform helps. We describe the infrastructure in code, run Terraform, and let it create or update the resources.
For a data engineer, this becomes useful in cases like:
- creating GCS buckets for landing files
- creating BigQuery datasets and tables
- managing Pub/Sub topics and subscriptions
- provisioning service accounts and IAM bindings
- setting up Cloud Scheduler, Cloud Run, or Dataform related resources
The biggest benefit is not just automation. It is repeatability. If your development environment works, you can use the same pattern for staging and production with only a few controlled changes.
What Terraform actually does
Terraform compares two things:
- the infrastructure code you wrote
- the infrastructure that already exists
Based on that, it calculates what needs to be created, updated, or deleted. This is why the terraform plan step is so useful. Before changing anything, we can see what Terraform is about to do.
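For example, for a single bucket resource, an abridged plan output might look roughly like this (the resource and bucket names are placeholders, and most attributes are omitted):

```text
Terraform will perform the following actions:

  # google_storage_bucket.landing_bucket will be created
  + resource "google_storage_bucket" "landing_bucket" {
      + name          = "my-demo-dataeng-bucket-123"
      + force_destroy = true
    }

Plan: 1 to add, 0 to change, 0 to destroy.
```

The `+` prefix marks attributes that will be created; `~` and `-` mark updates and deletions. Reading this summary before applying is the whole point of the plan step.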
A simple way to think about it is like this:
| Step | What it does | Why it matters |
|---|---|---|
| terraform init | Downloads provider plugins and initializes the folder | Needed once before Terraform can work |
| terraform plan | Shows the proposed changes | Helps review before applying |
| terraform apply | Executes the changes | Creates or updates resources |
| terraform destroy | Deletes managed resources | Useful for cleaning up demo environments |
A simple GCP example
For a beginner example, creating a GCS bucket is enough. It is simple, but it still shows the main Terraform flow clearly.
Let us assume we have a folder like this:
```text
terraform-demo/
├── main.tf
├── variables.tf
└── terraform.tfvars
```
main.tf
```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_storage_bucket" "landing_bucket" {
  name                        = var.bucket_name
  location                    = var.region
  force_destroy               = true
  uniform_bucket_level_access = true
}
```
variables.tf
```hcl
variable "project_id" {
  type = string
}

variable "region" {
  type = string
}

variable "bucket_name" {
  type = string
}
```
terraform.tfvars
```hcl
project_id  = "my-gcp-project"
region      = "australia-southeast1"
bucket_name = "my-demo-dataeng-bucket-123"
```
This example is intentionally small. In real projects, you would probably split providers, IAM, storage, and compute into separate files or even separate modules. But for learning, keeping everything simple is better.
Running Terraform locally
Once the files are ready, the usual flow is:

```shell
terraform init
terraform plan
terraform apply
```
If the plan looks correct, type yes during apply and Terraform will create the bucket. After that, if you run terraform plan again without changing anything, it should show no changes. That is a good sign because it means your code matches the deployed infrastructure.
One thing beginners notice quickly is the terraform.tfstate file. Terraform uses this state file to keep track of the resources it manages. Without it, Terraform would not know what it created previously.
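The state file is plain JSON. A heavily abridged sketch of what it records for the bucket above might look like this (the exact fields vary by Terraform version):

```json
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "google_storage_bucket",
      "name": "landing_bucket",
      "instances": [
        {
          "attributes": {
            "name": "my-demo-dataeng-bucket-123"
          }
        }
      ]
    }
  ]
}
```

Treat this file as Terraform's internal bookkeeping: do not edit it by hand, and do not commit it to version control, because it can contain sensitive values.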
What to be careful about
This is usually where the first real problems begin. Terraform is simple in a demo, but there are a few things to be careful about.
1. State file management
If you keep the state file only on your laptop, that is fine for practice but not ideal for a team setup. Another engineer could make changes separately and then your local state would be out of sync. In GCP projects, it is common to store the Terraform state in a dedicated GCS bucket.
A backend configuration could look like this:
```hcl
terraform {
  backend "gcs" {
    bucket = "terraform-state-files"
    prefix = "dataeng-demo"
  }
}
```
With this, multiple runs can use the same remote state instead of depending on one local file.
2. IAM permissions
It is tempting to use your own cloud account for Terraform, especially in a demo. But it is better to create a separate service account with only the permissions needed. For example, if you are only creating buckets, you do not need broad project owner access.
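As a sketch of that idea, the service account and a narrowly scoped role can themselves be managed in Terraform. The account name and role below are assumptions for the bucket-only example, not requirements:

```hcl
# A dedicated service account for running Terraform
resource "google_service_account" "terraform" {
  account_id   = "terraform-dataeng"
  display_name = "Terraform for data engineering demo"
}

# Grant only what bucket management needs, not roles/owner
resource "google_project_iam_member" "storage_admin" {
  project = var.project_id
  role    = "roles/storage.admin"
  member  = "serviceAccount:${google_service_account.terraform.email}"
}
```

If the project later manages BigQuery or Pub/Sub as well, add those specific roles one at a time instead of falling back to owner access.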
3. Unique resource names
Some cloud resources, like GCS buckets, need globally unique names. If you copy examples directly, the apply will fail because someone else already used that name. Adding a suffix based on the environment or project is usually enough.
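One hedged way to do that is the hashicorp/random provider's random_id resource, which appends a short random hex suffix instead of relying on a hand-picked var.bucket_name. The var.env variable and naming pattern here are illustrative assumptions:

```hcl
# Requires the hashicorp/random provider in required_providers
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "google_storage_bucket" "landing" {
  # Combines project, purpose, environment, and a random suffix
  name                        = "${var.project_id}-landing-${var.env}-${random_id.bucket_suffix.hex}"
  location                    = var.region
  uniform_bucket_level_access = true
}
```

Because the suffix is stored in state, the name stays stable across later plans instead of changing on every run.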
4. Drift from manual changes
If someone edits the bucket manually from the GCP console, the Terraform code and the actual resource can drift apart. Terraform will detect some of this in the next plan, but it is still better to avoid manual changes as much as possible.
How this fits into a data engineering workflow
A very common pattern is that Terraform creates the infrastructure and another tool runs the data workload. For example:
- Terraform creates a GCS bucket, BigQuery dataset, and service account
- A Python or Spark job writes files to the bucket
- BigQuery external tables or load jobs read those files
- Cloud Scheduler or Airflow triggers the workflow regularly
We might also provision a scheduled query or Pub/Sub topic using Terraform, but the transformation logic itself would remain in SQL, Python, dbt, or Spark. Terraform is not a pipeline tool. It is the tool that prepares the environment around the pipeline.
For example, we could provision a BigQuery dataset like this:
```hcl
resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics"
  location   = "australia-southeast1"
}
```
And then later a SQL job could write into it:
```sql
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT order_date, COUNT(*) AS total_orders
FROM raw.orders
GROUP BY order_date;
```
That is a good separation of concerns. Terraform provisions the dataset, and SQL manages the data inside it.
What changes in production
For a simple demo, one folder with a few Terraform files is enough. In production, I would usually make a few changes:
- keep remote state in a locked shared backend
- separate environments like dev, test, and prod
- use modules for reusable patterns
- run terraform plan and apply from CI/CD instead of a laptop
- use service accounts and secrets management properly
- add policy checks or at least code review before apply
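A hedged sketch of how the environment and module points might look in practice; the module path, inputs, and file names are assumptions, not a fixed convention:

```hcl
# modules/gcs-bucket encapsulates the bucket pattern once;
# each environment then reuses it with its own inputs
module "landing_bucket" {
  source     = "./modules/gcs-bucket"
  project_id = var.project_id
  region     = var.region
  env        = var.env
}
```

Each environment gets its own variable file, applied with something like terraform apply -var-file=env/dev.tfvars, while the module code stays shared across dev, test, and prod.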
Another useful production habit is to keep the plan review separate from apply. It is easy to make mistakes in IAM, networking, or deletion settings, and those mistakes can become expensive very quickly.
Limitations of Terraform
Terraform is very useful, but it is not perfect.
- It does not replace good environment design
- It can become messy if everything is kept in one large root module
- Some provider features lag behind the cloud platform
- Importing existing manually created resources can take effort
- Poor state handling can create confusing failures
So the goal should not be to put every possible thing into Terraform on day one. Start with the core infrastructure that benefits most from repeatability, then expand gradually.
Conclusion
Terraform is one of the most useful tools a data engineer can learn when working on cloud platforms. Even if your first use case is just a bucket, dataset, or IAM binding, the habit of managing infrastructure as code pays off quickly. Start with a very small example, understand the init, plan, and apply flow well, and then grow from there. Once those basics are clear, it becomes much easier to manage real data engineering infrastructure with confidence.
