Using GitHub Actions for simple data pipeline CI/CD

In this article, let us see how to use GitHub Actions for a simple data pipeline CI/CD setup and why this approach is useful for a small team. If you have a lightweight ETL job, a dbt project, or a Terraform module for pipeline infrastructure, you usually do not need a very complex deployment platform on day one. GitHub Actions is often enough to run checks, package code, and deploy changes in a repeatable way.

For our use case, let us assume we have a small pipeline that reads CSV files from cloud storage, performs a transformation, and writes the final table into a warehouse. Along with that, we also maintain the infrastructure as code in the same repository. We want a simple workflow where every pull request runs validations, and every merge to main deploys the latest approved change.
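To make the discussion concrete, here is a rough sketch of the kind of job we have in mind. The column names, the cleaning rules, and the `run` helper are made up for illustration; the real job would read from cloud storage and load into the warehouse instead of working on an in-memory string.

```python
import csv
import io

def transform(rows):
    """Drop rows with a non-positive amount and normalize the region name."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount > 0:
            cleaned.append({"region": row["region"].strip().lower(), "amount": amount})
    return cleaned

def run(csv_text):
    """Parse CSV text and return transformed rows ready to load."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return transform(rows)

print(run("region,amount\n US ,10.5\nEU,-3\napac,2\n"))
```

Keeping `transform` as a pure function of its input rows is what makes the unit tests in the CI workflow cheap to write.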

What GitHub Actions is solving here

When people say CI/CD, it sometimes sounds bigger than what we really need. For a data engineering team, the first practical needs are usually the following:

  • run formatting and lint checks
  • run unit tests for transformation logic
  • validate SQL or dbt models
  • run Terraform plan before merge
  • deploy code automatically after merge

Without automation, someone has to remember all these steps and run them manually. That works for some time, but eventually things get missed. A broken SQL file or a bad Terraform change can go into production just because nobody noticed.

A simple repository structure

We could have a repository structure like this:

.
├── .github/workflows/
│   ├── ci.yml
│   └── deploy.yml
├── pipeline/
│   ├── job.py
│   ├── requirements.txt
│   └── tests/
├── sql/
│   └── transformations.sql
└── infra/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

This is enough for many beginner or intermediate projects. The main point is not the folder names. The point is that code, tests, and deployment logic are all version controlled together.

CI workflow for pull requests

The first workflow should run on pull requests. This is where we catch problems early.

A basic .github/workflows/ci.yml could look like this:

name: CI

on:
  pull_request:
    branches: [ main ]

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r pipeline/requirements.txt
          pip install pytest flake8

      - name: Run lint
        run: flake8 pipeline

      - name: Run tests
        run: pytest pipeline/tests

      - name: Check SQL files exist
        run: ls -l sql

This is intentionally simple. We are not trying to build the perfect pipeline yet. We just want enough checks so obviously broken code does not get merged.

If you are using dbt, then instead of only checking that SQL files exist, you would likely run something like the following:

dbt deps
dbt parse
# state:modified+ needs manifest artifacts from a previous run,
# passed via --state (or the DBT_STATE environment variable)
dbt test --select state:modified+ --state path/to/prod-artifacts

That gives more confidence before merge.

Deploy workflow on merge to main

Now let us see the deployment side. When a change is merged to main, we may want to deploy infrastructure and then run a release step for the pipeline code.

A simple deploy.yml could be:

name: Deploy

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: infra
        run: terraform init

      - name: Terraform Apply
        working-directory: infra
        run: terraform apply -auto-approve
        env:
          TF_VAR_project_id: ${{ secrets.GCP_PROJECT_ID }}

      - name: Package pipeline
        run: zip -r pipeline.zip pipeline

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: pipeline-package
          path: pipeline.zip

For a very small setup, this is enough. We could apply Terraform and package the code in the same workflow. If the pipeline is running on Cloud Functions, Cloud Run, Lambda, or some scheduled container job, then the last step would change based on that platform.
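As one illustration, if the packaged code ran on Google Cloud Functions, the final step might be shaped roughly like this. The function name, entry point, and runtime are placeholders, and an authentication step (for example with google-github-actions/auth) would have to come before it:

```yaml
      - name: Deploy to Cloud Functions
        run: |
          gcloud functions deploy daily-etl \
            --region "${{ secrets.GCP_REGION }}" \
            --runtime python311 \
            --source pipeline \
            --entry-point run \
            --trigger-http
```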

Managing secrets

One thing to be careful about is credentials. On a local machine, engineers often keep a service account key or a cloud credentials file on disk, and everything works. In GitHub Actions, we should not commit those files into the repository.

For our use case, secrets would usually go into GitHub repository secrets:

  • GCP_PROJECT_ID
  • GCP_REGION
  • GCP_SA_KEY or workload identity setup
  • database connection values if needed

If you are on AWS, it would be similar with access keys or, better, role-based federation. In production, I would prefer short-lived credentials using OIDC instead of long-lived service account keys. That takes a bit more setup, but it is safer.
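For GCP, the OIDC route has roughly the following shape. The `permissions` block is required so the runner can request an ID token, and the provider and service account values here are placeholders, not real resources:

```yaml
permissions:
  id-token: write
  contents: read

steps:
  - uses: actions/checkout@v4

  - name: Authenticate to Google Cloud via OIDC
    uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/my-repo
      service_account: deployer@my-project.iam.gserviceaccount.com
```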

A quick comparison

| Option | Good for | Limitation |
| --- | --- | --- |
| GitHub Actions only | Small teams, simple deployments, one repo | Can become messy if too many pipelines are mixed together |
| GitHub Actions + Terraform | Infra and app deployment from one place | Need to manage secrets and state carefully |
| Dedicated orchestration tool | Large teams, complex release process | More overhead and more moving parts |

For many beginner projects, GitHub Actions only is a good starting point. It is already close to the code review process, and most engineers are comfortable using GitHub.

Adding a simple data quality gate

If your pipeline loads data into a warehouse, it is useful to add a small validation step. Even a basic SQL check helps. For example, after deployment, we could run a query and fail the workflow if the latest partition has no rows.

SELECT COUNT(*) AS row_count
FROM analytics.sales_daily
WHERE load_date = CURRENT_DATE();

Then in the workflow, you could call a CLI tool or small Python script that checks whether row_count is greater than zero. This is not a full observability solution, but it catches some obvious issues.
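A minimal sketch of that check, as a Python function the workflow could call. Here the count is passed in directly; in practice it would come from the warehouse client that ran the query above:

```python
import sys

def check_row_count(row_count, table="analytics.sales_daily"):
    """Exit non-zero (failing the workflow step) if the latest load is empty."""
    if row_count <= 0:
        print(f"data quality gate failed: {table} has no rows for today")
        sys.exit(1)
    print(f"data quality gate passed: {table} has {row_count} rows")

check_row_count(42)
```

Because `sys.exit(1)` makes the step fail, the workflow stops right there instead of silently publishing an empty table.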

Things to be careful about

There are some limitations with this approach.

  1. GitHub Actions is easy to start with, so people tend to put too much logic inside YAML. After some point, that becomes hard to maintain. If logic gets bigger, move it into scripts.
  2. Running terraform apply directly on every merge may be risky if many people are changing infrastructure. At minimum, review the plan carefully.
  3. If the same workflow handles infra deploy, pipeline packaging, and data validation, troubleshooting becomes slower. Sometimes splitting workflows makes more sense.
  4. For pipelines that need strict approvals, rollback workflows, or multi-environment promotion, this basic setup may feel limited.

In a demo project, I am fine with one repository and two workflows. In a production setup, I would usually separate dev, qa, and prod, add manual approval before production, and avoid long-lived secrets. I would also store Terraform state remotely and lock it properly so concurrent deployments do not create issues.

A practical pattern that works well

A pattern I like for small teams is this:

  • pull request triggers lint, tests, and Terraform plan
  • merge to main triggers deploy to dev
  • production deploy is manual or tag based

This keeps the process simple while still giving some control. It also helps avoid the common mistake where every merge goes directly into production without any pause.
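In workflow terms, the production part of this pattern is just a different trigger. A sketch, assuming version tags like v1.2.0 are used for releases:

```yaml
on:
  push:
    tags:
      - 'v*'           # production deploy only when a version tag is pushed
  workflow_dispatch:   # or run it manually from the Actions tab
```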

Conclusion

GitHub Actions is a good fit when you want simple CI/CD for a data pipeline and do not want to introduce a lot of extra tooling. It helps standardize checks, reduce manual deployment steps, and keep the release flow close to the code. Start small, keep the workflows readable, and move to a more structured setup only when the project actually needs it.

This post is licensed under CC BY 4.0 by the author.