Post

dbt Basics for Analytics Engineering Teams: A Practical Guide

In this article let us look at dbt (data build tool) — what it actually is, why analytics engineering teams reach for it, and how to get a basic project running. I will walk through setting up models, adding tests, generating documentation, and point out the things that tripped me up when I first started using it. If you are doing transformations in your data warehouse with a mess of scheduled stored procedures or layered views, dbt might be what you are looking for.

What Problem Does dbt Solve?

Most teams working with a cloud data warehouse (BigQuery, Snowflake, Redshift) follow a pattern: raw data lands in the warehouse, someone writes SQL to clean and reshape it, and then more SQL to build the tables that power dashboards. In the early days this is just a handful of scripts or views. Over time it grows into hundreds of interdependent transformations, nobody knows what depends on what, and a change to one table quietly breaks three downstream reports.

dbt brings software engineering practices to SQL transformations. You write SELECT statements in .sql files, dbt handles the DDL (CREATE TABLE, CREATE VIEW), figures out dependencies, and runs everything in the right order. It also gives you testing and documentation for free. You do not need to know Python — if you can write SQL, you can use dbt.

Setting Up dbt Core

dbt comes in two flavours: dbt Cloud (managed, with a web IDE and scheduler) and dbt Core (open source, command-line). Let us set up dbt Core since that is what most teams start evaluating.

1
pip install dbt-bigquery

You can swap dbt-bigquery for dbt-snowflake, dbt-redshift, or dbt-postgres depending on your warehouse. Once installed, create a new project:

1
dbt init my_analytics_project

This creates a directory structure:

1
2
3
4
5
6
7
my_analytics_project/
├── dbt_project.yml
├── models/
│   └── example/
├── tests/
├── macros/
└── analyses/

The dbt_project.yml is where you configure your project — which warehouse profile to use, where your models live, and how materializations work. The real action happens inside models/.

You will also need a profiles.yml file (usually in ~/.dbt/) that tells dbt how to connect to your warehouse. For BigQuery, it looks something like:

1
2
3
4
5
6
7
8
9
10
my_analytics_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: my-gcp-project
      dataset: analytics_dev
      threads: 4
      keyfile: /path/to/keys.json

Run dbt debug to check everything is wired up before moving on.

Models: Where the Work Happens

A dbt model is just a .sql file containing a SELECT statement. That is it. dbt takes that SELECT and materializes it as a view or table in your warehouse.

Let us build a simple model. Say you have a raw orders table from your application database and you want to create a cleaned version with consistent column names and a few derived fields.

models/staging/stg_orders.sql:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
with source as (
    select * from 
),

renamed as (
    select
        id as order_id,
        user_id,
        amount / 100.0 as amount_dollars,
        status,
        created_at as order_date
    from source
)

select * from renamed

The `` function is a Jinja reference to a source you define in a .yml file. This lets you declare where your raw data lives once and reference it everywhere — much cleaner than hardcoding schema names across twenty models.

sources.yml:

1
2
3
4
5
6
7
version: 2
sources:
  - name: app_db
    database: raw_data
    tables:
      - name: orders
      - name: users

Now you can build another model on top of this one. models/marts/daily_order_summary.sql:

1
2
3
4
5
6
select
    date(order_date) as order_day,
    status,
    count(*) as order_count,
    sum(amount_dollars) as total_revenue
from 

Notice ``. This is the killer feature. Instead of hardcoding table names, you reference another model by its file name. dbt uses these references to build a DAG (directed acyclic graph) and run models in the correct order. Run dbt run and dbt builds stg_orders first, then daily_order_summary.

Materializations

By default, models are created as views. You can change this in dbt_project.yml or per-model.

MaterializationBehaviourWhen to use
viewRebuilt on every queryLightweight, no storage cost
tableRebuilt on every dbt runFaster queries, needs storage
incrementalAppends new rows since last runLarge fact tables
ephemeralInlined as CTE in downstream modelsKeep intermediate logic tidy without creating objects

For our daily_order_summary, a table makes more sense:

1
2
3
4
5

select
    date(order_date) as order_day,
    ...

Testing: Stop Bad Data Before It Ships

dbt has two kinds of tests: generic and singular. Generic tests are the standard ones — unique, not_null, referential_integrity, accepted_values. You add them in the same .yml file where you define your models.

1
2
3
4
5
6
7
8
9
10
11
12
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']

Run dbt test and dbt checks every constraint and tells you which rows violated them. This is genuinely useful — I have caught duplicate order IDs and unexpected status values on the first run more times than I can count.

Singular tests are custom SQL files you put in tests/ that should return zero rows. For example, checking that no order has a negative amount:

1
2
3
4
-- tests/assert_no_negative_order_amounts.sql
select order_id, amount_dollars
from 
where amount_dollars < 0

Documentation That Updates Itself

Add descriptions in your .yml files and run dbt docs generate followed by dbt docs serve. You get a local website with a DAG visualisation showing how your models connect, plus column-level lineage. It is not going to replace Confluence, but for the data team it is the fastest way to understand what depends on what without digging through scripts.

Things I Wish I Knew Earlier

Jinja is powerful, but do not overdo it. You can write complex Jinja logic inside your models — loops, conditionals, variables. It is tempting to build a clever macro that handles ten edge cases. Resist that temptation in the beginning. Start with plain SQL models and introduce macros only when you are repeating yourself. Over-engineered Jinja makes debugging miserable.

dbt is not an orchestrator. dbt handles the T in ELT (transformation). It does not extract or load data. You still need something like Airflow, Prefect, or Dagster to orchestrate the full pipeline — extract, load raw data, then trigger dbt run. dbt Cloud has a built-in scheduler, but with dbt Core you need to set that up yourself.

Statefulness and incremental models need care. When you use incremental models, dbt needs to know what it has already loaded. This depends on the is_incremental() Jinja function and a unique key. Get the unique key wrong and you will either drop rows or duplicate them. Always validate incremental logic on a small dataset first.

Source freshness. dbt can check whether your source tables have been updated recently using dbt source freshness. If your upstream EL tool fails silently and stops loading data, stale source checks will catch it before your stakeholders notice. Configure freshness in your sources file:

1
2
3
4
5
6
7
8
sources:
  - name: app_db
    freshness:
      warn_after: {count: 4, period: hour}
      error_after: {count: 12, period: hour}
    tables:
      - name: orders
        loaded_at_field: updated_at

Production Use vs. a Demo

A local dbt project with three models and no tests will run fine on your laptop. In a real team, a few things change:

  • Use dbt Cloud or a CI/CD pipeline. Running dbt from your laptop against production data is a recipe for accidentally dropping the wrong table. Set up a GitHub Actions workflow or use dbt Cloud jobs to run on a schedule.
  • Separate dev and prod environments. Use dbt targets to point at a dev dataset while you work. Merge to main, and CI runs against prod. This stops half-finished models from polluting the reporting layer.
  • Slim CI. When your project has hundreds of models, running all of them on every PR is slow and expensive. dbt supports “state:modified” to only build and test models that changed. Combine this with deferral to reference unchanged models from the production environment.
  • Grant permissions deliberately. dbt can manage grants via the grants config. If everyone has broad write access to your reporting schema, add explicit grants in dbt so only the right service accounts can write.

Is dbt Right for Your Team?

dbt works well when your transformations are primarily SQL, your data lives in a supported cloud warehouse, and you have multiple people contributing to the transformation layer. It is less useful if your transformations need Python or Spark (dbt now has Python models, but they are limited), or if you are on a legacy on-prem database that dbt does not support well.

The biggest win is not the tool itself — it is the workflow. When every transformation is version-controlled, tested, and documented, you stop being afraid to refactor. That is the real value.

This post is licensed under CC BY 4.0 by the author.