Using GitHub Actions for simple data pipeline CI/CD
In this article, let us see how to use GitHub Actions for a simple data pipeline CI/CD setup and why this approach is useful for a small team. If you have a lightweight ETL job, a dbt project, or a Terraform module for pipeline infrastructure, you usually do not need a very complex deployment platform on day one. GitHub Actions is often enough to run checks, package code, and deploy changes in a repeatable way.
For our use case, let us assume we have a small pipeline that reads CSV files from cloud storage, performs a transformation, and writes the final table into a warehouse. Along with that, we also maintain the infrastructure as code in the same repository. We want a simple workflow where every pull request runs validations, and every merge to main deploys the latest approved change.
What GitHub Actions is solving here
When people say CI/CD, it sometimes sounds bigger than what we really need. For a data engineering team, the first practical needs are usually the following:
- run formatting and lint checks
- run unit tests for transformation logic
- validate SQL or dbt models
- run Terraform plan before merge
- deploy code automatically after merge
Without automation, someone has to remember all these steps and run them manually. That works for some time, but eventually things get missed. A broken SQL file or a bad Terraform change can go into production just because nobody noticed.
A simple repository structure
We could have a repository structure like this:
```text
.
├── .github/workflows/
│   ├── ci.yml
│   └── deploy.yml
├── pipeline/
│   ├── job.py
│   ├── requirements.txt
│   └── tests/
├── sql/
│   └── transformations.sql
└── infra/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
```
This is enough for many beginner or intermediate projects. The main point is not the folder names. The point is that code, tests, and deployment logic are all version controlled together.
CI workflow for pull requests
The first workflow should run on pull requests. This is where we catch problems early.
A basic .github/workflows/ci.yml could look like this:
```yaml
name: CI

on:
  pull_request:
    branches: [ main ]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r pipeline/requirements.txt
          pip install pytest flake8

      - name: Run lint
        run: flake8 pipeline

      - name: Run tests
        run: pytest pipeline/tests

      - name: Check SQL files exist
        run: ls -l sql
```
This is intentionally simple. We are not trying to build the perfect pipeline yet. We just want enough checks so obviously broken code does not get merged.
If you are using dbt, then instead of only checking that SQL files exist, you would likely run something like the following:
```bash
dbt deps
dbt parse
dbt test --select state:modified+
```
That gives more confidence before merge.
Deploy workflow on merge to main
Now let us see the deployment side. When a change is merged to main, we may want to deploy infrastructure and then run a release step for the pipeline code.
A simple deploy.yml could be:
```yaml
name: Deploy

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: infra
        run: terraform init

      - name: Terraform Apply
        working-directory: infra
        run: terraform apply -auto-approve
        env:
          TF_VAR_project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Package pipeline
        run: zip -r pipeline.zip pipeline

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: pipeline-package
          path: pipeline.zip
```
For a very small setup, this is enough. We could apply Terraform and package the code in the same workflow. If the pipeline is running on Cloud Functions, Cloud Run, Lambda, or some scheduled container job, then the last step would change based on that platform.
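As one illustration, on Cloud Functions the final step might look something like the sketch below. The function name, runtime, and entry point here are placeholders for this example, not values defined anywhere in the setup above:

```yaml
      - name: Deploy to Cloud Functions
        run: |
          gcloud functions deploy my-pipeline-job \
            --region="${{ secrets.GCP_REGION }}" \
            --runtime=python311 \
            --source=pipeline \
            --entry-point=main \
            --trigger-http
```

The equivalent for Lambda or Cloud Run would just swap this step for the matching CLI or deploy action; the rest of the workflow stays the same.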
Managing secrets
One thing to be careful about is credentials. In a local system, engineers often keep a service account key or cloud credentials file and everything works. In GitHub Actions, we should not commit those files into the repository.
For our use case, secrets would usually go into GitHub repository secrets:
- GCP_PROJECT_ID
- GCP_REGION
- GCP_SA_KEY, or a workload identity setup
- database connection values if needed
If you are on AWS, it would be similar with access keys or, better, role-based federation. In production, I would prefer short-lived credentials using OIDC instead of long-lived service account keys. That takes a bit more setup, but it is safer.
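On GCP, the OIDC route uses workload identity federation with the google-github-actions/auth action. A minimal sketch looks like the following, where the provider path and service account email are placeholders you would replace with your own values:

```yaml
permissions:
  contents: read
  id-token: write   # required so the job can request an OIDC token

steps:
  - uses: actions/checkout@v4

  - name: Authenticate to GCP via workload identity
    uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/github
      service_account: deployer@my-project.iam.gserviceaccount.com
```

After this step, later steps in the same job can call gcloud or Terraform without any long-lived key stored in the repository.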
A quick comparison
| Option | Good for | Limitation |
|---|---|---|
| GitHub Actions only | Small teams, simple deployments, one repo | Can become messy if too many pipelines are mixed together |
| GitHub Actions + Terraform | Infra and app deployment from one place | Need to manage secrets and state carefully |
| Dedicated orchestration tool | Large teams, complex release process | More overhead and more moving parts |
For many beginner projects, GitHub Actions only is a good starting point. It is already close to the code review process, and most engineers are comfortable using GitHub.
Adding a simple data quality gate
If your pipeline loads data into a warehouse, it is useful to add a small validation step. Even a basic SQL check helps. For example, after deployment, we could run a query and fail the workflow if the latest partition has no rows.
```sql
SELECT COUNT(*) AS row_count
FROM analytics.sales_daily
WHERE load_date = CURRENT_DATE();
```
Then in the workflow, you could call a CLI tool or small Python script that checks whether row_count is greater than zero. This is not a full observability solution, but it catches some obvious issues.
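A minimal sketch of such a check script could look like this. The warehouse query itself is stubbed out here; in practice, a previous workflow step would run the SQL above with your warehouse client or CLI and pass the resulting count to this script:

```python
import sys


def check_row_count(row_count: int, minimum: int = 1) -> bool:
    """Return True when the latest partition has at least `minimum` rows."""
    return row_count >= minimum


if __name__ == "__main__" and len(sys.argv) > 1:
    # In the workflow, row_count would come from the warehouse query;
    # here it is passed as a command-line argument for simplicity.
    count = int(sys.argv[1])
    if not check_row_count(count):
        print(f"Data quality gate failed: row_count={count}")
        sys.exit(1)  # a non-zero exit code fails the GitHub Actions step
    print(f"Data quality gate passed: row_count={count}")
```

The important detail is the non-zero exit code: GitHub Actions marks the step as failed, which stops the rest of the workflow.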
Things to be careful about
There are some limitations with this approach.
- GitHub Actions is easy to start with, so people tend to put too much logic inside YAML. After some point, that becomes hard to maintain. If logic gets bigger, move it into scripts.
- Running terraform apply directly on every merge may be risky if many people are changing infrastructure. At minimum, review the plan carefully.
- If the same workflow handles infra deploy, pipeline packaging, and data validation, troubleshooting becomes slower. Sometimes splitting workflows makes more sense.
- For pipelines that need strict approvals, rollback workflows, or multi-environment promotion, this basic setup may feel limited.
In a demo project, I am fine with one repository and two workflows. In a production setup, I would usually separate dev, qa, and prod, add manual approval before production, and avoid long-lived secrets. I would also store Terraform state remotely and lock it properly so concurrent deployments do not create issues.
A practical pattern that works well
A pattern I like for small teams is this:
- pull request triggers lint, tests, and Terraform plan
- merge to main triggers deploy to dev
- production deploy is manual or tag based
This keeps the process simple while still giving some control. It also helps avoid the common mistake where every merge goes directly into production without any pause.
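The manual or tag-based production trigger can be sketched like this. The workflow name and environment name are assumptions for illustration; pairing the job with a GitHub environment that has required reviewers gives you the manual approval gate:

```yaml
name: Deploy to production

on:
  push:
    tags: [ 'v*' ]      # release by pushing a version tag
  workflow_dispatch:     # or trigger manually from the Actions tab

jobs:
  deploy-prod:
    runs-on: ubuntu-latest
    # An environment with required reviewers pauses here for approval
    environment: production
    steps:
      - uses: actions/checkout@v4
      # same deploy steps as deploy.yml, pointed at the prod project
```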
Conclusion
GitHub Actions is a good fit when you want simple CI/CD for a data pipeline and do not want to introduce a lot of extra tooling. It helps standardize checks, reduce manual deployment steps, and keep the release flow close to the code. Start small, keep the workflows readable, and move to a more structured setup only when the project actually needs it.
