ETL vs ELT with practical examples

In this article, let us understand ETL and ELT through practical examples: why teams choose one over the other, and what changes when we move from a small demo pipeline to a production data platform. If you are building pipelines in a warehouse like BigQuery, Snowflake, or Redshift, this question comes up often because the choice affects where your transformation logic runs, how much you store, and how easy it is to debug failures.

A lot of tutorials explain ETL and ELT as simple definitions. That is useful for interviews, but when we actually build pipelines the more important question is this: where should the transformation happen for our use case?

What is the difference?

ETL means Extract, Transform, Load. We take data from a source system, transform it before it reaches the destination, and then load the final shaped data into the target table.

ELT means Extract, Load, Transform. We first load the raw data into the target platform, usually a cloud data warehouse or data lake, and then run the transformations inside that platform.

A simple way to think about it is:

  • ETL: the transformation happens in the pipeline engine before loading. Best fit when the data needs strong validation or reshaping before storage.
  • ELT: the transformation happens in the warehouse after loading raw data. Best fit when warehouse compute is strong and we want flexible downstream modeling.

Both are valid. The better choice depends on the source data, latency, cost, governance rules, and what tools your team already knows.

A practical ETL example

Let us assume we receive a CSV file every hour from an external vendor. The file has customer orders, but the column names are inconsistent and some records are invalid.

For our use case, we may not want to load the raw file directly into analytics tables because:

  1. The file may contain duplicate rows.
  2. Some records may have missing order IDs.
  3. The vendor may change column order without warning.
  4. We want only cleaned business-ready data in the warehouse table.

In this case ETL can be a good fit. We could build a small job in Python or Spark that reads the file, validates it, standardizes fields, removes bad records, and only then loads the cleaned output.

Pseudo-code for this could look like the following:

rows = read_csv("orders.csv")
clean_rows = []
seen_ids = set()

for row in rows:
    order_id = row.get("order_id", "").strip()

    # Drop records with missing order IDs.
    if not order_id:
        continue

    # Drop duplicate rows (the vendor file may repeat orders).
    if order_id in seen_ids:
        continue
    seen_ids.add(order_id)

    clean_rows.append({
        "order_id": order_id,
        "customer_id": row["cust_id"],
        "order_amount": float(row["amount"]),
        "order_ts": parse_timestamp(row["created_at"]),
    })

load_to_table(clean_rows, "analytics.orders")

Here the transformation happens before the load. The warehouse receives only the cleaned version.

This approach is useful when the source data quality is poor or when compliance rules say we should not store certain raw fields at all. It is also useful when the target system is not powerful enough to do heavy transformations efficiently.

A practical ELT example

Now let us take another example. We ingest application events from a product database and some JSON event streams. The volume is high, schemas evolve over time, and analysts want access to raw data for debugging and experimentation.

In this case ELT is often easier. We first land the source data as raw tables, then use SQL models to create trusted analytics tables.

A simple flow could be:

  1. Extract orders from the source database.
  2. Load them into raw.orders.
  3. Run SQL transformations to build staging.orders_clean.
  4. Build final business tables like mart.daily_sales.

Example SQL for the transform step:

create or replace table staging.orders_clean as
select
  cast(order_id as string) as order_id,
  cast(customer_id as string) as customer_id,
  cast(order_total as numeric) as order_total,
  timestamp(order_created_at) as order_created_at,
  date(timestamp(order_created_at)) as order_date
from raw.orders
where order_id is not null
  and order_total is not null;

Then for the reporting table:

create or replace table mart.daily_sales as
select
  order_date,
  count(*) as total_orders,
  sum(order_total) as total_sales
from staging.orders_clean
group by order_date;

This style works well in BigQuery because we can keep the raw table, rerun transforms easily, and separate ingestion from business logic. If tomorrow the finance team asks for a different revenue definition, we do not need to re-extract the source. We can update the SQL model and rebuild the downstream table.

When ETL makes more sense

From my experience, ETL is the better choice in situations like these:

  • We must reject or mask sensitive fields before they reach the warehouse.
  • The source data is messy and expensive to store in raw form.
  • We need row-level validations before loading.
  • The destination is an operational store or a database that is not designed for large transform workloads.

For example, if a payment file contains card-related fields that should never land in analytics storage, transforming before load is safer.
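As a rough sketch, a masking step like this could run before the load. The field names (`card_number`, `email`) are hypothetical, and a real pipeline would usually drive the policy from a schema or config rather than hard-coding it:

```python
def mask_sensitive(row):
    """Drop or mask fields that must never reach the warehouse.

    Field names here are illustrative, not from a real schema.
    """
    masked = dict(row)
    # Never store the raw card number, not even masked.
    masked.pop("card_number", None)
    # Keep only a hint of the email for support lookups.
    if "email" in masked:
        user, _, domain = masked["email"].partition("@")
        masked["email"] = user[:1] + "***@" + domain
    return masked
```

Running every row through a function like this before the load step is what makes this an ETL-shaped safeguard: the warehouse never sees the sensitive values at all.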

When ELT makes more sense

ELT fits better when:

  • We are using a modern analytical warehouse.
  • The team is comfortable managing transformations in SQL or dbt.
  • We want to preserve raw data for audit or reprocessing.
  • Source schemas change often and we need flexibility.

This is why many cloud-native data platforms prefer ELT. Compute in the warehouse is scalable, and SQL-based transforms are easier for many teams to review and maintain.

Things to be careful about

Even though ELT is popular, it is not automatically better. Keeping raw data means extra storage cost. It can also create governance issues if personally identifiable information lands in raw tables without enough controls.

ETL also has its own problems. If too much business logic is embedded in pipeline code, debugging becomes harder. Small changes may require application deployments instead of just changing SQL. If we only store the transformed output, backfilling historical logic changes can be painful.

One common mistake is mixing the two without being intentional. For example, we may do partial cleanup before load, then more transforms after load, but we do not document which layer owns which rule. That usually creates confusion during incidents.

What changes in production

For a simple demo, one script is enough. In production, I would add a few things:

  1. Separate raw, staging, and mart layers clearly.
  2. Data quality checks such as null checks, uniqueness checks, and freshness checks.
  3. Metadata like load timestamps, source file names, and batch IDs.
  4. Incremental loading strategy instead of full reloads where possible.
  5. Error handling for bad records, not just silent drops.
  6. Cost monitoring, especially for ELT-heavy warehouse jobs.
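The quality checks in step 2 can be sketched as simple batch-level assertions. This is a minimal illustration, assuming in-memory rows with `order_id` and `order_ts` fields; in practice these checks would run as SQL tests or a framework like dbt tests:

```python
from datetime import datetime, timedelta, timezone

def run_quality_checks(rows, max_age_hours=24):
    """Return the names of failed checks for a batch of order rows."""
    failures = []

    # Null check: every row must carry an order_id.
    if any(not r.get("order_id") for r in rows):
        failures.append("null_order_id")

    # Uniqueness check: order_id should not repeat within the batch.
    ids = [r["order_id"] for r in rows if r.get("order_id")]
    if len(ids) != len(set(ids)):
        failures.append("duplicate_order_id")

    # Freshness check: the newest record should be recent enough.
    newest = max((r["order_ts"] for r in rows if r.get("order_ts")), default=None)
    if newest is None or datetime.now(timezone.utc) - newest > timedelta(hours=max_age_hours):
        failures.append("stale_batch")

    return failures
```

The useful part is that checks return names instead of raising immediately, so the pipeline can decide whether a given failure blocks the load or only raises an alert.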

For ETL pipelines, I would also keep a rejected-records table or dead-letter bucket so that bad rows can be reviewed later. For ELT pipelines, I would add access controls on raw datasets because those tables usually contain more sensitive and less curated information.
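A minimal sketch of the rejected-records idea, assuming in-memory rows; a real pipeline would write `rejected` to a dead-letter table or bucket instead of returning it:

```python
def split_valid(rows):
    """Route each row to the clean batch or the rejected batch.

    Rejected rows keep a reason so they can be reviewed later
    instead of being silently dropped.
    """
    clean, rejected = [], []
    for row in rows:
        if not row.get("order_id"):
            rejected.append({"row": row, "reason": "missing order_id"})
        elif row.get("order_amount") is None:
            rejected.append({"row": row, "reason": "missing order_amount"})
        else:
            clean.append(row)
    return clean, rejected
```

Keeping the reason next to each rejected row makes incident review much faster: instead of re-deriving why a record failed, we can count rejections by reason and spot a vendor-side change quickly.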

My practical rule of thumb

If the warehouse is the main compute engine and we want flexibility, I would start with ELT. If data must be heavily validated, standardized, or filtered before storage, I would choose ETL. In many real systems we end up with a hybrid approach, but it is still useful to decide which step owns the main transformation logic.

ETL and ELT are not competing buzzwords. They are two patterns for solving data movement problems. The right one is the one that makes your pipeline easier to operate, easier to trust, and easier to change when the business asks for something new next week.

This post is licensed under CC BY 4.0 by the author.