GitHub Actions for Data Pipeline CI/CD: A Practical Starting Point
In this article, let us look at how you can use GitHub Actions to set up a simple CI/CD pipeline for your data workflows. If you are building dbt models, Airflow DAGs, or Python-based ETL scripts and want to stop running things manually on your laptop before pushing to production, this is for you.
Most data engineers I know got into CI/CD not because someone taught them, but because something broke in prod and they needed a way to catch it earlier. That is how I picked it up too — one broken pipeline at a time.
We will walk through a real workflow that checks code formatting, runs tests, and deploys to a staging environment. By the end you should be able to take the YAML snippets here and adapt them to your own repo.
Why GitHub Actions for data pipelines?
You could use Jenkins, GitLab CI, or Azure DevOps. But if your code already lives on GitHub, Actions is the path of least resistance. No separate server to manage, no extra auth to wire up. The free tier gives you 2,000 minutes per month for private repos and unlimited for public ones, which is plenty when you are starting out.
For data pipelines specifically, the stuff you want CI/CD to do tends to be simpler than what a backend service needs. You are not running integration tests against microservices. You are more likely checking that:
- SQL files parse without errors
- Python scripts pass basic tests
- dbt models compile and run against a dev schema
- Airflow DAGs deploy to a GCS bucket or S3
- Changes to a specific folder trigger the right pipeline
This is all well within what GitHub Actions can handle.
Setting up your first workflow
All your workflows live in .github/workflows/ at the root of your repo. Create a file called data-pipeline-ci.yml and let us start small.
    name: Data Pipeline CI
    on:
      pull_request:
        paths:
          - 'dags/**'
          - 'models/**'
          - 'scripts/**'
        branches:
          - main
    jobs:
      lint-and-test:
        runs-on: ubuntu-latest
        steps:
          - name: Check out code
            uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.11'
          - name: Install dependencies
            run: |
              pip install -r requirements.txt
              pip install ruff pytest
          - name: Lint Python
            run: ruff check scripts/
          - name: Run tests
            run: pytest tests/ -v
A few things going on here:
- paths means the workflow only runs when files in those directories change. No point running a dbt test pipeline when someone updates the README.
- ruff is fast and replaces half a dozen linting tools. If you are still using flake8, black, and isort separately, give ruff a look.
- runs-on: ubuntu-latest is fine for most data work unless you need heavy compute. In that case you can use a self-hosted runner.
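The workflow above lints but does not yet check formatting. A minimal sketch of how that could be added, assuming ruff is also used as the formatter and that the Python code lives in scripts/ as in the example. The cache setting is an optional extra, not something the original workflow uses. The formatting step has to come after the dependency install, since that is where ruff gets installed.

```yaml
# Sketch: steps for the lint-and-test job
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.11'
    cache: 'pip'   # reuse downloaded packages between runs

- name: Check formatting
  # Exits non-zero if any file would be reformatted; it does not modify files
  run: ruff format --check scripts/
```

Running the formatter in check mode keeps CI read-only: it reports drift and leaves the actual reformatting to the author of the PR.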
Adding SQL linting for dbt users
If you work with dbt, catching SQL issues before they hit production saves a lot of pain. I have opened way too many PRs where someone aliased a column in a CTE but forgot to use it downstream.
Install sqlfluff and add a step:
    - name: Lint SQL models
      run: |
        pip install sqlfluff
        # Assumes a .sqlfluff config file in the repo sets the dialect; otherwise pass --dialect
        sqlfluff lint models/
If your team is new to linting, you can start with sqlfluff fix models/ to auto-correct the easy stuff, then add sqlfluff lint as the CI gate. That way you are not overwhelming people with 200 linting errors on day one.
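One way to soften the rollout further is to run the lint step in an advisory mode first. The sketch below is an assumption about how you might stage it, not part of the original workflow: continue-on-error lets the job pass even when the step fails, so findings show up in the logs without blocking merges. It assumes a .sqlfluff config file sets the dialect.

```yaml
- name: Lint SQL models (advisory)
  # The job still succeeds if this step fails, so the team can
  # see the findings before the check becomes a hard gate
  continue-on-error: true
  run: |
    pip install sqlfluff
    sqlfluff lint models/
```

Once the backlog of findings is cleared, removing continue-on-error turns the same step into the CI gate.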
A comparison of a few tools you might use at different stages:
| Stage | Tool | What it catches |
|---|---|---|
| Python linting | ruff | Unused imports, style issues, code smells |
| SQL linting | sqlfluff | Missing aliases, bad joins, formatting drift |
| dbt compilation | dbt compile | Invalid model references, syntax errors in Jinja |
| Unit tests | pytest | Logic bugs in transformation functions |
| Data tests | dbt test / Great Expectations | Nulls where they should not be, duplicate keys, freshness |
Running dbt in CI
If your pipeline is dbt, you want more than just linting. You want to know if the thing actually compiles and runs. Here is how you might extend the workflow:
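    - name: Set up dbt profile
      env:
        # Pass the secret through an environment variable so quotes inside it survive the shell
        DBT_PROFILES_YML: ${{ secrets.DBT_PROFILES_YML }}
      run: |
        mkdir -p ~/.dbt
        echo "$DBT_PROFILES_YML" > ~/.dbt/profiles.yml
    - name: Install dbt
      run: pip install dbt-bigquery
    - name: Compile dbt models
      run: dbt compile --target dev
    - name: Run dbt tests on changed models
      run: |
        # Only run models that changed plus their downstream dependencies
        dbt run --select state:modified+ --target dev --defer --state ./prod-artifacts/
        # The state: selector needs --state on every command that uses it
        dbt test --select state:modified+ --target dev --defer --state ./prod-artifacts/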
The state:modified+ selector checks what actually changed and runs only those models and their downstream dependents. It needs a prod-artifacts/ folder with a manifest.json from your last production run. You can store that as a CI artifact or pull it from your deployment bucket.
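If you go the bucket route, the fetch can be a single step placed before the dbt commands. This is a sketch under assumptions: the bucket name and object path are hypothetical, and the step expects the runner to have gcloud available and to be authenticated to GCP already.

```yaml
- name: Fetch production manifest
  run: |
    mkdir -p prod-artifacts
    # Hypothetical bucket and path; point this at wherever your
    # production run uploads target/manifest.json
    gcloud storage cp gs://my-dbt-artifacts/prod/manifest.json prod-artifacts/manifest.json
```

The production job would need a matching upload step after each successful run so that the manifest stays current.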
A small gotcha here: your CI job needs a service account key to connect to BigQuery or Snowflake. Store it as a GitHub Actions secret (DBT_PROFILES_YML in the example above) and avoid hardcoding anything.
Deploying Airflow DAGs with GitHub Actions
For Airflow on Cloud Composer or MWAA, your DAGs usually live in a GCS or S3 bucket. Deployment is just syncing files. Here is a job that does that:
    deploy-dags:
      needs: lint-and-test
      runs-on: ubuntu-latest
      if: github.ref == 'refs/heads/main'
      steps:
        - uses: actions/checkout@v4
        - name: Authenticate to GCP
          uses: google-github-actions/auth@v2
          with:
            # Placeholder secret name; use whatever your repo calls its service account key secret
            credentials_json: ${{ secrets.GCP_CREDENTIALS_JSON }}
        - name: Sync DAGs to GCS
          run: |
            gcloud storage rsync dags/ gs://us-central1-composer--abc123/dags/ \
              --delete-unmatched-destination-objects \
              --recursive
A few things worth noting:
- needs: lint-and-test makes sure tests pass before deployment kicks off. If linting fails, you do not push broken DAGs.
- if: github.ref == 'refs/heads/main' restricts deployment to the main branch only. PRs from feature branches will run tests but will not deploy. For this job to run at all, the workflow also needs a push trigger on main; under the pull_request trigger alone, github.ref is refs/pull/<number>/merge and the condition is never true.
- --delete-unmatched-destination-objects removes DAGs you deleted from the repo. Without it, renamed or removed DAG files would sit in the bucket forever and you would need to clean them up manually. This is the kind of thing that sounds optional until you have 40 stale DAGs in your Composer environment.
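One way to wire up the triggers so that PRs only test while merges to main also deploy. This is a sketch; limiting the push trigger to dags/ is an assumption about which changes should cause a deployment.

```yaml
on:
  pull_request:
    branches:
      - main
    paths:
      - 'dags/**'
      - 'models/**'
      - 'scripts/**'
  push:
    branches:
      - main
    paths:
      - 'dags/**'   # assumption: only DAG changes should trigger a deploy
```

On a push to main, lint-and-test runs first and deploy-dags follows. On a pull request, the github.ref condition is false and the deploy job is skipped.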
Environment separation
For a real project you will want at least two environments: staging and production. You can use GitHub environments to control this:
    deploy-prod:
      needs: deploy-staging
      runs-on: ubuntu-latest
      environment: production
      steps:
        # same sync logic, but with prod bucket
GitHub environments let you add protection rules like required reviewers, wait timers, or restricted branches. For production data pipelines, having someone else approve before the DAGs go live is a good safety net — no matter how confident you feel about a change at 4 PM on a Friday.
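The deploy-prod job depends on a deploy-staging job. A minimal sketch of what that job could look like, reusing the sync logic from the DAG deployment example. The staging bucket name and the secret name are hypothetical.

```yaml
deploy-staging:
  needs: lint-and-test
  runs-on: ubuntu-latest
  if: github.ref == 'refs/heads/main'
  environment: staging
  steps:
    - uses: actions/checkout@v4
    - name: Authenticate to GCP
      uses: google-github-actions/auth@v2
      with:
        # Hypothetical secret name
        credentials_json: ${{ secrets.GCP_CREDENTIALS_JSON }}
    - name: Sync DAGs to the staging bucket
      run: |
        # Hypothetical bucket name; replace with your staging Composer bucket
        gcloud storage rsync dags/ gs://my-staging-composer-bucket/dags/ \
          --delete-unmatched-destination-objects \
          --recursive
```

Secrets can be defined per environment, so the same secret name can resolve to different credentials in staging and production.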
Scheduling and triggers beyond push
Not everything needs to run on every commit. You can use a cron schedule for workflows that refresh dev data, run long-running validations, or snapshot production metadata:
    on:
      schedule:
        - cron: '0 6 * * 1'  # Every Monday at 06:00 UTC
      workflow_dispatch:      # Manual trigger from the GitHub UI
The workflow_dispatch trigger is underrated. It gives you a button in the GitHub UI to run the workflow whenever you need it, with optional input parameters. Handy for things like “re-deploy the DAGs to dev because someone accidentally deleted them from the bucket.”
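The input parameters are declared under the trigger itself. A small sketch, where the input name target_env and its options are made up for illustration:

```yaml
on:
  workflow_dispatch:
    inputs:
      target_env:            # hypothetical input name
        description: 'Environment to re-deploy DAGs to'
        type: choice
        options:
          - dev
          - staging
        default: dev

jobs:
  redeploy:
    runs-on: ubuntu-latest
    steps:
      - name: Show the selected environment
        run: echo "Re-deploying DAGs to ${{ inputs.target_env }}"
```

The choice type renders as a dropdown in the GitHub UI, which avoids typos in environment names.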
Practical limitations and things to watch out for
- Runner disk space. The default runner has about 14 GB. If your pipeline pulls large datasets for testing, you will hit that ceiling fast. Use sampling or run integration tests on a separate schedule with a larger runner.
- Secret size limits. Individual secrets are capped at 48 KB. If your service account key is huge (some are), you may need to split it or use workload identity federation instead — which is actually the better approach anyway.
- Build minutes for private repos. 2,000 minutes go quickly if you have a monorepo and a dozen workflows triggering on every push. Use paths filters aggressively.
- The YAML gets messy. Once you have five or six jobs with conditions and dependencies, the single-file approach becomes painful. Look into reusable workflows or composite actions before it turns into a 300-line wall of YAML.
- Different tools, different patterns. A Spark job that runs on Databricks needs a different CI/CD approach than dbt or Airflow. You are not building a single “data pipeline CI/CD” — you are stitching together patterns for each component. The GitHub Actions part is the easy bit; knowing what to test and how to deploy each piece is where the real work lives.
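For the workload identity federation option mentioned above, the authentication step swaps the key for a provider and a service account. This is a sketch that assumes the identity pool, provider, and service account already exist in GCP; every resource name below is a placeholder.

```yaml
deploy-dags:
  runs-on: ubuntu-latest
  permissions:
    contents: read
    id-token: write   # lets the job request an OIDC token from GitHub
  steps:
    - uses: actions/checkout@v4
    - name: Authenticate to GCP without a key file
      uses: google-github-actions/auth@v2
      with:
        # Placeholder values for resources you create in GCP
        workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider
        service_account: my-deployer@my-project.iam.gserviceaccount.com
```

With this setup there is no long-lived key to store, rotate, or leak, and the secret size limit stops being a concern.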
Production versus demo
In a simple demo, you can get away with a single workflow file, one environment, and no approval gates. In production you will want:
- Separate staging and production environments with the GitHub environment feature
- A defer strategy for dbt that only builds and tests changed models
- Workload identity federation instead of long-lived service account keys
- Notification on failure — Slack or email — because CI/CD failures are invisible until someone checks GitHub
- A rollback plan for DAG deployments (keep the last N versions in your bucket or use a tool that supports rollback)
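As an illustration of the notification item, a step like the following could be appended to a job. It is a sketch that assumes a Slack incoming webhook whose URL is stored in a secret; the secret name is hypothetical.

```yaml
- name: Notify Slack on failure
  if: failure()   # runs only when an earlier step in this job has failed
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}  # hypothetical secret name
  run: |
    curl -sS -X POST -H 'Content-type: application/json' \
      --data "{\"text\": \"CI failed: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"}" \
      "$SLACK_WEBHOOK_URL"
```

The message links straight to the failed run, so whoever sees it in Slack does not have to hunt for it in GitHub.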
None of this is hard to add. But adding it after something breaks is always more stressful than setting it up ahead of time.
Summing up
GitHub Actions gives you enough to build a solid CI/CD pipeline for your data workflows without needing to manage extra infrastructure. Start with linting and basic tests, add deployment when you are confident the tests catch real issues, and layer on environment separation and approvals as you go.
It will not be perfect on day one. But having a workflow that catches a broken SQL file or a missing Python import before it reaches production is already a huge step up from running things off your laptop and hoping for the best.
