GitHub Actions for Data Pipeline CI/CD: A Practical Starting Point

In this article let us look at how you can use GitHub Actions to set up a simple CI/CD pipeline for your data workflows. If you are building dbt models, Airflow DAGs, or Python-based ETL scripts and want to stop manually running things on your laptop before pushing to production, this is for you.

Most data engineers I know got into CI/CD not because someone taught them, but because something broke in prod and they needed a way to catch it earlier. That is how I picked it up too — one broken pipeline at a time.

We will walk through a real workflow that checks code formatting, runs tests, and deploys to a staging environment. By the end you should be able to take the YAML snippets here and adapt them to your own repo.

Why GitHub Actions for data pipelines?

You could use Jenkins, GitLab CI, or Azure DevOps. But if your code already lives on GitHub, Actions is the path of least resistance. No separate server to manage, no extra auth to wire up. The free tier gives you 2,000 minutes per month for private repos and unlimited for public ones, which is plenty when you are starting out.

For data pipelines specifically, the stuff you want CI/CD to do tends to be simpler than what a backend service needs. You are not running integration tests against microservices. You are more likely checking that:

  • SQL files parse without errors
  • Python scripts pass basic tests
  • dbt models compile and run against a dev schema
  • Airflow DAGs deploy to a GCS bucket or S3
  • Changes to a specific folder trigger the right pipeline

This is all well within what GitHub Actions can handle.

Setting up your first workflow

All your workflows live in .github/workflows/ at the root of your repo. Create a file called data-pipeline-ci.yml and let us start small.

name: Data Pipeline CI

on:
  pull_request:
    paths:
      - 'dags/**'
      - 'models/**'
      - 'scripts/**'
    branches:
      - main

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install ruff pytest

      - name: Lint Python
        run: ruff check scripts/

      - name: Run tests
        run: pytest tests/ -v

A few things going on here:

  • paths means the workflow only runs when files in those directories change. No point running a dbt test pipeline when someone updates the README.
  • ruff is fast and replaces half a dozen linting tools. If you are still using flake8, black, and isort separately, give ruff a look.
  • runs-on: ubuntu-latest is fine for most data work unless you need heavy compute. In that case you can use a self-hosted runner.
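
The workflow above lints but does not yet check formatting. A minimal sketch of a formatting gate, assuming you also use ruff as your formatter (the step name is illustrative):

```yaml
      - name: Check Python formatting
        # Fails the job if any file would be reformatted; nothing is rewritten in CI
        run: ruff format --check scripts/
```

Running ruff format scripts/ locally fixes whatever this step reports.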

Adding SQL linting for dbt users

If you work with dbt, catching SQL issues before they hit production saves a lot of pain. I have opened way too many PRs where someone aliased a column in a CTE but forgot to use it downstream.

Install sqlfluff and add a step:

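      - name: Lint SQL models
        run: |
          pip install sqlfluff
          # sqlfluff needs a dialect, either from a .sqlfluff config or from the flag below
          sqlfluff lint models/ --dialect bigquery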

If your team is new to linting, you can start with sqlfluff fix models/ to auto-correct the easy stuff, then add sqlfluff lint as the CI gate. That way you are not overwhelming people with 200 linting errors on day one.

A comparison of a few tools you might use at different stages:

| Stage | Tool | What it catches |
|---|---|---|
| Python linting | ruff | Unused imports, style issues, code smells |
| SQL linting | sqlfluff | Missing aliases, bad joins, formatting drift |
| dbt compilation | dbt compile | Invalid model references, syntax errors in Jinja |
| Unit tests | pytest | Logic bugs in transformation functions |
| Data tests | dbt test / Great Expectations | Nulls where they should not be, duplicate keys, freshness |

Running dbt in CI

If your pipeline is dbt, you want more than just linting. You want to know if the thing actually compiles and runs. Here is how you might extend the workflow:

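      - name: Set up dbt profile
        env:
          # Pass the secret through an environment variable so the shell never re-parses its contents
          DBT_PROFILES_YML: ${{ secrets.DBT_PROFILES_YML }}
        run: |
          mkdir -p ~/.dbt
          printf '%s\n' "$DBT_PROFILES_YML" > ~/.dbt/profiles.yml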

      - name: Install dbt
        run: pip install dbt-bigquery

      - name: Compile dbt models
        run: dbt compile --target dev

      - name: Run dbt tests on changed models
        run: |
          # Only run models that changed plus their downstream dependencies
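          dbt run --select state:modified+ --target dev --defer --state ./prod-artifacts/
          # state: selectors need --state on every command that uses them
          dbt test --select state:modified+ --target dev --defer --state ./prod-artifacts/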

The state:modified+ selector checks what actually changed and runs only those models and their downstream dependents. It needs a prod-artifacts/ folder with a manifest.json from your last production run. You can store that as a CI artifact or pull it from your deployment bucket.
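
One way to get that manifest into the job is to copy it from a bucket before the dbt steps run. This is only a sketch: it assumes the runner is already authenticated to GCP and that your production deployment uploads target/manifest.json somewhere. The bucket name and path are made up.

```yaml
      - name: Fetch production manifest
        run: |
          mkdir -p prod-artifacts
          # Hypothetical bucket and path; point this at wherever your prod run saves manifest.json
          gcloud storage cp gs://my-dbt-artifacts/prod/manifest.json prod-artifacts/manifest.json
```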

A small gotcha here: your CI job needs warehouse credentials, such as a service account key for BigQuery or a key pair for Snowflake. Store them as a GitHub Actions secret (DBT_PROFILES_YML in the example above) and avoid hardcoding anything.

Deploying Airflow DAGs with GitHub Actions

For Airflow on Cloud Composer or MWAA, your DAGs usually live in a GCS or S3 bucket. Deployment is just syncing files. Here is a job that does that:

  deploy-dags:
    needs: lint-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}  # placeholder secret name; use the one defined in your repo

      - name: Sync DAGs to GCS
        run: |
          gcloud storage rsync dags/ gs://us-central1-composer--abc123/dags/ \
            --delete-unmatched-destination-objects \
            --recursive

A few things worth noting:

  • needs: lint-and-test makes sure tests pass before deployment kicks off. If linting fails, you do not push broken DAGs.
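  • if: github.ref == 'refs/heads/main' restricts deployment to runs on the main branch. On a pull_request event github.ref is refs/pull/<number>/merge, so PRs run the tests but skip this job. For the deploy to run at all, the workflow also needs a push trigger on main alongside the pull_request trigger.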
  • --delete-unmatched-destination-objects removes DAGs you deleted from the repo. Without it, renamed or removed DAG files would sit in the bucket forever and you would need to clean them up manually. This is the kind of thing that sounds optional until you have 40 stale DAGs in your Composer environment.
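
For the deploy job to run, the workflow has to be triggered by pushes to main as well as by pull requests. A minimal sketch of such a trigger block, reusing the path filters from earlier (which paths to watch on push is an assumption; adjust to what you actually deploy):

```yaml
on:
  pull_request:
    branches:
      - main
    paths:
      - 'dags/**'
      - 'models/**'
      - 'scripts/**'
  push:
    branches:
      - main
    paths:
      - 'dags/**'   # assumption: only DAG changes need a redeploy
```

With this in place, a PR runs lint-and-test only, and the merge commit on main runs lint-and-test followed by deploy-dags.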

Environment separation

For a real project you will want at least two environments: staging and production. You can use GitHub environments to control this:

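  deploy-prod:
    needs: deploy-staging   # your staging deploy job (deploy-dags in the earlier example)
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      # ...then the same auth and sync steps as the staging job, pointed at the prod bucket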

GitHub environments let you add protection rules like required reviewers, wait timers, or restricted branches. For production data pipelines, having someone else approve before the DAGs go live is a good safety net — no matter how confident you feel about a change at 4 PM on a Friday.
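
Environments can also carry their own variables and secrets, which lets one set of steps target different buckets. A sketch of a staging job under that assumption; DAGS_BUCKET and GCP_SA_KEY are hypothetical names you would define on each environment:

```yaml
  deploy-staging:
    needs: lint-and-test
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - uses: google-github-actions/auth@v2
        with:
          # Hypothetical secret name; an environment-level secret overrides a repo-level one
          credentials_json: ${{ secrets.GCP_SA_KEY }}

      - name: Sync DAGs to this environment's bucket
        run: |
          # DAGS_BUCKET is a hypothetical variable set separately on staging and production
          gcloud storage rsync dags/ "gs://${{ vars.DAGS_BUCKET }}/dags/" \
            --delete-unmatched-destination-objects \
            --recursive
```

The production job can then reuse the same steps with environment: production, and the values resolve to whatever that environment defines.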

Scheduling and triggers beyond push

Not everything needs to run on every commit. You can use a cron schedule for workflows that refresh dev data, run long-running validations, or snapshot production metadata:

on:
  schedule:
    - cron: '0 6 * * 1'  # Every Monday at 06:00 UTC
  workflow_dispatch:      # Manual trigger from the GitHub UI

The workflow_dispatch trigger is underrated. It gives you a button in the GitHub UI to run the workflow whenever you need it, with optional input parameters. Handy for things like “re-deploy the DAGs to dev because someone accidentally deleted them from the bucket.”
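
As an illustration of those input parameters, here is a small sketch of a manually triggered workflow. The input name target, its options, and the job are all made up for the example:

```yaml
on:
  workflow_dispatch:
    inputs:
      target:                # hypothetical input name
        description: 'Environment to re-deploy DAGs to'
        type: choice
        options:
          - dev
          - staging
        default: dev
        required: true

jobs:
  redeploy:
    runs-on: ubuntu-latest
    steps:
      # Replace the echo with your real sync step; inputs.target holds the chosen value
      - run: echo "Re-deploying DAGs to ${{ inputs.target }}"
```

The choice type renders as a dropdown in the GitHub UI, which avoids typos in environment names.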

Practical limitations and things to watch out for

  • Runner disk space. The default runner has about 14 GB. If your pipeline pulls large datasets for testing, you will hit that ceiling fast. Use sampling or run integration tests on a separate schedule with a larger runner.
  • Secret size limits. Individual secrets are capped at 48 KB. If your service account key is huge (some are), you may need to split it or use workload identity federation instead — which is actually the better approach anyway.
  • Build minutes for private repos. 2,000 minutes go quickly if you have a monorepo and a dozen workflows triggering on every push. Use paths filters aggressively.
  • The YAML gets messy. Once you have five or six jobs with conditions and dependencies, the single-file approach becomes painful. Look into reusable workflows or composite actions before it turns into a 300-line wall of YAML.
  • Different tools, different patterns. A Spark job that runs on Databricks needs a different CI/CD approach than dbt or Airflow. You are not building a single “data pipeline CI/CD” — you are stitching together patterns for each component. The GitHub Actions part is the easy bit; knowing what to test and how to deploy each piece is where the real work lives.
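
To make the reusable workflow suggestion concrete, here is a rough sketch of the DAG sync pulled into its own file and called from another workflow. File names, the bucket input, and the secret names are assumptions for illustration. The two parts live in separate files and are shown together only for brevity.

```yaml
# .github/workflows/deploy-dags.yml (hypothetical reusable workflow)
on:
  workflow_call:
    inputs:
      bucket:
        description: 'Destination bucket name, without the gs:// prefix'
        type: string
        required: true
    secrets:
      gcp_credentials:
        required: true

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.gcp_credentials }}
      - run: gcloud storage rsync dags/ "gs://${{ inputs.bucket }}/dags/" --recursive
---
# Caller, in a different workflow file in the same repo
jobs:
  deploy-staging:
    uses: ./.github/workflows/deploy-dags.yml
    with:
      bucket: my-staging-composer-bucket          # hypothetical bucket name
    secrets:
      gcp_credentials: ${{ secrets.GCP_SA_KEY }}  # hypothetical secret name
```

A second caller job for production would differ only in its with and secrets values, which keeps the sync logic in one place.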

Production versus demo

In a simple demo, you can get away with a single workflow file, one environment, and no approval gates. In production you will want:

  • Separate staging and production environments with the GitHub environment feature
  • A defer strategy for dbt that only builds and tests changed models
  • Workload identity federation instead of long-lived service account keys
  • Notification on failure — Slack or email — because CI/CD failures are invisible until someone checks GitHub
  • A rollback plan for DAG deployments (keep the last N versions in your bucket or use a tool that supports rollback)
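
Two of these are quick to sketch: keyless authentication and a failure notification. The example assumes a workload identity pool already exists in your GCP project and that you have a Slack incoming webhook. The provider path, service account, and SLACK_WEBHOOK_URL secret are placeholders.

```yaml
  deploy-dags:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write        # lets the job request an OIDC token for GCP
    steps:
      - uses: actions/checkout@v4

      - name: Authenticate to GCP without a stored key
        uses: google-github-actions/auth@v2
        with:
          # Both values are placeholders for resources you create in your own project
          workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider
          service_account: ci-deployer@my-project.iam.gserviceaccount.com

      # ...sync step as in the earlier deploy job...

      - name: Notify Slack on failure
        if: failure()          # runs only when an earlier step in this job failed
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}   # hypothetical secret
        run: |
          curl -sS -X POST -H 'Content-Type: application/json' \
            --data "{\"text\": \"DAG deploy failed: ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}\"}" \
            "$SLACK_WEBHOOK_URL"
```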

None of this is hard to add. But adding it after something breaks is always more stressful than setting it up ahead of time.

Summing up

GitHub Actions gives you enough to build a solid CI/CD pipeline for your data workflows without needing to manage extra infrastructure. Start with linting and basic tests, add deployment when you are confident the tests catch real issues, and layer on environment separation and approvals as you go.

It will not be perfect on day one. But having a workflow that catches a broken SQL file or a missing Python import before it reaches production is already a huge step up from running things off your laptop and hoping for the best.

This post is licensed under CC BY 4.0 by the author.