dbt basics for analytics engineering teams

In this article, let us go through the basics of dbt and why analytics engineering teams use it so often. If your team is writing SQL transformations directly in the warehouse, copying queries between notebooks, or maintaining long scripts in an orchestration tool, dbt gives you a cleaner way to manage that work. It helps us structure SQL models, test the data, document the lineage, and deploy changes in a repeatable way.

For a small demo, dbt might feel like one more tool to install. But once a team has more than a handful of models, different environments, and multiple people changing transformation logic, it starts making a lot of sense.

What dbt is really doing

dbt stands for data build tool. In simple terms, it takes SQL files that we maintain in a project and turns them into tables or views in our warehouse. It also lets us define tests, documentation, dependencies between models, and reusable logic with Jinja.

The important thing to understand is that dbt is not an ingestion tool. It does not move files from S3 or GCS into the warehouse. It works after the data is already there. For our use case, we can think of dbt as the transformation layer between raw data and analytics-ready data.

A simple flow could look like this:

  1. Raw data lands in warehouse tables
  2. dbt models clean and rename the columns
  3. More dbt models join and aggregate the data
  4. BI dashboards or downstream jobs read the final models

A simple project structure

When you create a dbt project, you will see a structure that looks something like this:

my_dbt_project/
  models/
    staging/
    marts/
  seeds/
  snapshots/
  tests/
  dbt_project.yml
  profiles.yml

You do not need all of these on day one, but it is useful to know what they are for.

  • models/ contains the SQL transformations
  • seeds/ contains static CSV files if you need small lookup data
  • snapshots/ is used for tracking slowly changing records
  • tests/ can contain custom tests
  • dbt_project.yml is the main project configuration
  • profiles.yml stores connection details to your warehouse
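Since profiles.yml is usually the first thing that trips people up, here is a minimal sketch of what it can look like, assuming a BigQuery warehouse and a project called my_dbt_project (the GCP project id and dataset names are placeholders):

my_dbt_project:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project        # placeholder GCP project id
      dataset: analytics_dev         # dataset that dbt writes into during development
      threads: 4

Other warehouses follow the same shape; only the keys inside the output block change.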

In many teams, the models folder gets organized into layers like staging, intermediate, and marts. That keeps the transformations easier to understand.
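A common way to reflect those layers is to set default materializations per folder in dbt_project.yml. A small sketch, assuming the project and folder names shown above:

models:
  my_dbt_project:
    staging:
      +materialized: view       # cheap to rebuild, always reflects the raw data
    marts:
      +materialized: table      # reporting models are queried often, so persist them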

Example: from raw orders to reporting model

Let us assume we already have a raw table called raw.orders with these columns:

  • order_id
  • customer_id
  • order_ts
  • amount
  • status

A first dbt model in models/staging/stg_orders.sql could look like this:

select
    cast(order_id as string) as order_id,
    cast(customer_id as string) as customer_id,
    cast(order_ts as timestamp) as order_timestamp,
    cast(amount as numeric) as order_amount,
    lower(status) as order_status
from raw.orders
where order_id is not null

This model does not do anything fancy. It just standardizes names and types. That is actually a good start. In many projects, the biggest issue is that raw data has inconsistent column types and naming.

Then we can build a reporting model in models/marts/fct_orders.sql:

select
    order_id,
    customer_id,
    date(order_timestamp) as order_date,
    order_amount,
    order_status
from {{ ref('stg_orders') }}
where order_status in ('completed', 'shipped')

The ref() function is one of the most useful things in dbt. It creates the dependency between models and lets dbt build them in the correct order. It also avoids hardcoding the final table name.

Using sources and tests

In a simple demo, people often jump directly into models. But it is better to define your source tables properly. For example, a sources.yml file can tell dbt where the raw tables live:

version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders

Then in the model you can use:

select *
from {{ source('raw', 'orders') }}

This is better than hardcoding raw.orders everywhere. If the source schema or table name changes later, you only need to update it in one place.
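An optional addition to the same sources.yml is a freshness check, so dbt can warn when the raw table stops receiving new data. This is only a sketch, and _loaded_at is a hypothetical audit timestamp column in the raw table:

version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at              # hypothetical audit column
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

You can then run dbt source freshness to check it.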

Now let us add some basic tests. In a YAML file for the staging model, we could define:

version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null

These tests are very basic, but they catch common issues. If duplicate order IDs suddenly appear, or if customer IDs start arriving as null, we would know quickly.

For analytics teams, this is one of the biggest wins. Instead of discovering bad data only when a dashboard breaks, we can fail the transformation run earlier.
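Beyond these column-level tests, the tests/ folder mentioned earlier can hold custom singular tests. A singular test is just a SQL query that should return zero rows. A small sketch, with a hypothetical file name, that fails whenever an order has a negative amount:

-- tests/assert_no_negative_order_amounts.sql
select
    order_id,
    order_amount
from {{ ref('stg_orders') }}
where order_amount < 0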

A quick comparison

Approach | Good for | Main problem
Manual SQL scripts | Very small one-off work | Hard to maintain and review
SQL inside orchestration tool | Simple scheduled jobs | Logic and orchestration get mixed together
dbt project | Team-based warehouse transformations | Needs project structure and discipline

This is why dbt fits well when the transformations are growing but you still want to keep the work mostly SQL-first.

Running dbt locally

A typical local workflow looks like this:

dbt debug
dbt run
dbt test
dbt docs generate

  • dbt debug checks if the profile and warehouse connection are working
  • dbt run builds the models
  • dbt test runs the declared tests
  • dbt docs generate builds the metadata and lineage docs

In practice, developers usually run only selected models while working on a change. For example:

dbt run --select stg_orders+

The + tells dbt to include downstream dependencies too. This is useful when you update a staging model and want to confirm that the reporting models still work.
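A few other selection patterns that come up often (the model names match the examples above):

dbt run --select +fct_orders          # fct_orders plus everything it depends on
dbt test --select stg_orders          # only the tests attached to stg_orders
dbt run --select models/staging       # every model under the staging folder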

Where dbt helps a team immediately

From my understanding, teams start seeing value from dbt in a few places very quickly:

  1. Code review becomes easier because SQL is kept in version control instead of hidden inside a UI
  2. Dependencies are clearer because ref() makes lineage explicit
  3. Basic data quality improves because tests are easy to add
  4. Naming becomes more consistent when models are split into layers
  5. Onboarding is easier because new team members can read the project structure

This might sound simple, but these are the exact places where transformation projects become messy if not handled early.

Limitations and things to be careful about

There are a few practical limitations too.

First, dbt does not replace everything in the data platform. You still need ingestion, orchestration, access control, and monitoring around it. Beginners sometimes expect dbt to solve the full pipeline problem, but what it really solves well is the transformation layer.

Second, if your models are poorly designed, dbt will not magically fix that. You can still create long, confusing SQL with too many joins and too much business logic in one file. The tool helps, but the team still needs good modeling discipline.

Third, warehouse cost matters. It is easy to keep stacking models and rerunning large transformations without thinking about compute usage. For a demo dataset this is fine, but in production you need to look at incremental models, partitioning, clustering, and the schedule of your jobs.

Fourth, Jinja is useful but can also become overused. A small amount of templating is good. Too much custom macro logic can make a SQL project harder to read than it should be.
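As a rough reference point, a small single-purpose macro like the one below is usually fine; the file name and logic are only an illustration. Trouble starts when macros hide joins or business rules that reviewers can no longer see in the model itself.

{# macros/normalize_status.sql, an illustrative file name #}
{% macro normalize_status(column_name) %}
    lower(trim({{ column_name }}))
{% endmacro %}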

What changes in a production use case

For a simple demo, we might keep everything in one project, run it from a local machine, and point directly to one warehouse schema. In production, we usually make a few changes:

  • separate dev, test, and prod environments
  • run dbt through CI/CD instead of only from local
  • use a scheduler like Airflow, Dagster, or dbt Cloud
  • add more tests, especially relationship and accepted values tests
  • introduce incremental models for large tables (see the sketch after this list)
  • document ownership and model purpose clearly
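For the incremental point, here is a minimal sketch of an incremental version of fct_orders, assuming order_timestamp is a reliable high-water mark for new rows:

{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_timestamp,
    date(order_timestamp) as order_date,
    order_amount,
    order_status
from {{ ref('stg_orders') }}

{% if is_incremental() %}
where order_timestamp > (select max(order_timestamp) from {{ this }})
{% endif %}

On the first run dbt builds the full table; on later runs it only processes rows newer than what is already there.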

A production team may also define conventions such as stg_ for staging models, int_ for intermediate logic, and fct_ or dim_ for marts. Those conventions are not mandatory, but they help a lot once the project grows.

A small mindset shift

One thing I like about dbt is that it pushes analytics work closer to software engineering practices without forcing everyone to become application developers. Analysts and analytics engineers can still work mainly in SQL, but now they get version control, testing, modularity, and deployable changes.

That is probably why dbt became so common in modern warehouse teams. It fits the way many data teams already want to work, but gives a cleaner structure around it.

To conclude, dbt is a very good starting point if your team already has data in the warehouse and wants a more reliable way to manage transformations. You do not need to use every feature from day one. Start with a few staging and mart models, add basic tests, and keep the structure simple. Once the team gets comfortable, it is easy to expand from there.

This post is licensed under CC BY 4.0 by the author.