Post

ETL vs ELT: A Practical Guide with Real Examples

In this article, let us look at what ETL and ELT actually mean in practice — not the textbook definitions you have already read a hundred times, but what happens when you sit down to build a pipeline and need to decide which approach to pick.

I have worked on projects that used both patterns, and honestly, the choice usually comes down to what your data warehouse can handle and how much you care about controlling data before it lands. Let us walk through both with concrete examples.

What is the actual difference?

If you have been in data engineering for more than a week, you have seen this diagram: source → extract → transform → load → warehouse. That is ETL. ELT flips the last two steps: extract → load → transform. The data lands in the warehouse raw, and you transform it there using SQL or the warehouse’s compute.

But the real difference is not the order of letters. It is about where the transformation happens and what that means for how you work.

AspectETLELT
Transform locationExternal compute (Spark, Python, etc.)Inside the warehouse (SQL, dbt)
Data lands clean?Yes, transformed before loadNo, raw data lands first
Best forCompliance-heavy, fixed-schema needsExploratory, schema-later workflows
ReprocessingRe-run the pipelineRe-run SQL transforms
Storage cost sensitivityHigher (you store less raw data)Lower (modern warehouses handle this)
Time to first insightSlower (transform then load)Faster (load first, figure out later)

ETL in practice: A Python pipeline

Let us say you are pulling data from an API and writing it to a PostgreSQL warehouse. You need to clean it up before it hits the warehouse because the downstream reporting tool expects clean data, and your warehouse is not built for heavy transforms.

Here is what a simple ETL pipeline might look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import requests
import pandas as pd
from sqlalchemy import create_engine

def extract():
    resp = requests.get("https://api.example.com/orders")
    return pd.DataFrame(resp.json()["data"])

def transform(df):
    # Drop columns we do not need
    df = df[["order_id", "customer_id", "amount", "order_date"]]
    # Rename to match warehouse convention
    df = df.rename(columns={"order_date": "order_timestamp"})
    # Remove rows with missing order_id
    df = df.dropna(subset=["order_id"])
    # Convert amount to cents (avoid floats)
    df["amount_cents"] = (df["amount"] * 100).astype(int)
    return df

def load(df, engine):
    df.to_sql("orders", engine, if_exists="append", index=False)

if __name__ == "__main__":
    engine = create_engine("postgresql://user:pass@localhost:5432/warehouse")
    raw = extract()
    clean = transform(raw)
    load(clean, engine)

This is straightforward. You control exactly what goes into the warehouse. If the API returns garbage, your transform catches it before it hits the database. For regulated industries or environments where storage is expensive, this is often the right call.

But there are downsides. If someone later asks “hey, did we drop any columns that might have been useful?” — too bad, they are gone. You cannot go back and re-derive something from raw data that you never stored.

ELT in practice: Using BigQuery + dbt

Now let us look at the same use case but with an ELT approach. We use BigQuery as the warehouse and dbt for transforms. The philosophy here is: load everything, figure out the schema later.

First, load the raw data. This could be a simple script or a managed service like BigQuery Data Transfer or Fivetran:

1
2
3
4
5
6
7
8
9
10
11
from google.cloud import bigquery
import requests

def load_raw():
    client = bigquery.Client()
    resp = requests.get("https://api.example.com/orders")
    rows = resp.json()["data"]
    # Load as-is, no transformation
    errors = client.insert_rows_json("raw.orders_raw", rows)
    if errors:
        print(f"Insert errors: {errors}")

Notice we are not dropping columns, renaming anything, or cleaning. The raw table gets whatever the API returns. Messy, but recoverable.

Then dbt takes over. Here is a simple dbt model:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- models/staging/stg_orders.sql
with source as (
    select * from 
),

cleaned as (
    select
        order_id,
        customer_id,
        -- Store as cents to avoid float issues
        cast(amount * 100 as int64) as amount_cents,
        cast(order_date as timestamp) as order_timestamp
    from source
    where order_id is not null
)

select * from cleaned

If you realize next week that you need a column you originally thought was useless, you just edit the dbt model. The raw data is still there. No pipeline rebuild needed.

This is the biggest advantage of ELT. You decouple ingestion from transformation. The ingestion job is dumb — it just dumps data. The smart stuff lives in SQL, which is easier to iterate on than Python scripts.

When ETL still makes sense

ELT has become the default in a lot of data teams, especially with tools like dbt and warehouses like BigQuery and Snowflake. But ETL is not dead. Here are situations where I would still reach for it:

PII and compliance. If you are dealing with personally identifiable information and regulations say you cannot store raw PII in the warehouse, you need to mask or drop it before load. ETL gives you that control at the ingestion layer.

Cost control on managed Spark clusters. If your transformation logic is complex — ML feature engineering, window functions over years of data, graph processing — running it inside the warehouse can get expensive fast. Doing it on a Dataproc or EMR cluster with spot instances might be cheaper.

Downstream systems that do not speak SQL. If you are feeding data into a NoSQL store, a search index, or a feature store, SQL transforms in the warehouse are not enough. You need a Python/Java pipeline anyway.

Strict schema contracts. If your downstream consumers have strict expectations and cannot handle schema drift, transforming before load means you catch issues early rather than propagating bad data.

When ELT makes more sense

You are on a modern cloud warehouse. BigQuery, Snowflake, Redshift Spectrum — these are built for ELT. The compute scales separately from storage, so running transforms on raw data is not punishingly expensive.

The source schema changes often. APIs evolve. Third-party integrations are unpredictable. With ELT, you can load the new schema as new columns and handle the differences in your dbt models, instead of having your pipeline break every time the source changes.

You have analysts who write SQL. In an ETL world, analysts have to ask engineers to change the pipeline if they need a new column. In ELT, they can often write the SQL themselves if the raw data is already there.

You are still figuring out what matters. Early in a project, you might not know which columns are useful. ELT lets you load everything now and decide later. ETL forces you to decide up front.

Things to watch out for

ELT is not a silver bullet. Here are a few things I have learned the hard way:

  • Raw data gets messy fast. If you have ten sources each with their own schema quirks, your raw layer becomes a swamp. You need discipline around naming conventions and documentation even in the raw zone. A medallion architecture (bronze → silver → gold) helps here.
  • Warehouse costs can creep. Running dbt models every hour on terabytes of raw data adds up. Partition and cluster your raw tables from day one, or you will pay for it later.
  • Not every warehouse handles JSON well. If your source data is deeply nested JSON and your warehouse is not great at flattening it (looking at you, older versions of Redshift), you might be better off transforming before load.
  • Orchestration still matters. With ELT you now have two things to orchestrate: the ingestion jobs and the dbt runs. If ingestion finishes late, your dbt models run on stale data unless you have proper dependencies set up. Tools like Airflow or Dagster help here.

A real scenario: Both patterns in one project

On one project, we used both. We had sensitive customer data that needed masking before it hit the warehouse — that part was ETL. But we also had clickstream data coming in at high volume where we wanted the raw events preserved for future analysis — that was ELT.

The lesson: you do not have to pick one and marry it. Different data sources in the same project can follow different patterns. What matters is that you understand the trade-offs and make the choice consciously, not because a blog post told you one pattern is dead.

Wrapping up

ETL and ELT are not competing religions. They are tools. If your warehouse can handle the compute and you value flexibility, lean toward ELT. If you need control before data lands — for compliance, cost, or downstream contract reasons — ETL is still a perfectly valid choice.

The worst thing you can do is pick one because it sounds modern and then spend six months fighting it. Pick the one that fits what you are actually building.

This post is licensed under CC BY 4.0 by the author.