Building your first Airflow DAG for ETL
In this article, let us build our first Apache Airflow DAG for a simple ETL pipeline. If you are starting with data engineering, you will quickly notice that writing the transformation code is only one part of the work. You also need a reliable way to run it on time, handle failures, and understand what happened when something goes wrong. That is where Airflow fits in well.
For our use case, we will take a small CSV file, stage it, run a transformation, and load the result into an analytics table. This is not meant to be a production-ready enterprise setup, but it is enough to understand how a DAG is structured and what decisions matter. Once the basics are clear, it is much easier to extend the same pattern.
What is a DAG in Airflow
Airflow uses the term DAG, which stands for Directed Acyclic Graph. In simple words, it is just a workflow with tasks and dependencies. One task runs first, the next waits, and so on.
For an ETL pipeline, a basic DAG usually has steps like:
- Check whether input data is available
- Extract the data
- Transform it
- Load it into the target system
- Optionally run quality checks
The nice thing with Airflow is that each of these steps can be tracked separately. If the load step fails, you do not have to guess whether extract completed or not.
A simple ETL flow
Let us say we receive a daily file called sales.csv. We want to clean the data and load daily totals into a reporting table. Our demo flow will do the following:
- read the CSV from a local folder
- remove rows with a missing `order_id`
- cast the `amount` column to numeric
- aggregate the sales by date
- load the output to a Postgres table
This could be local for a demo, but in a real use case your source might be S3, GCS, an API, or a landing database.
Project structure
A simple Airflow project could look like this:
```
dags/
    sales_etl_dag.py
scripts/
    transform_sales.py
data/
    sales.csv
```
For a first DAG, it is better to keep things simple. Do not add too many operators, sensors, and custom plugins on day one. First make the flow visible and reliable.
Writing the DAG
Below is an example DAG using Python operators.
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
import pandas as pd
from sqlalchemy import create_engine


def extract():
    df = pd.read_csv('/opt/airflow/data/sales.csv')
    df.to_parquet('/opt/airflow/data/staged_sales.parquet', index=False)


def transform():
    df = pd.read_parquet('/opt/airflow/data/staged_sales.parquet')
    df = df[df['order_id'].notna()]
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df = df.dropna(subset=['amount'])
    result = df.groupby('sale_date', as_index=False)['amount'].sum()
    result.to_parquet('/opt/airflow/data/final_sales.parquet', index=False)


def load():
    df = pd.read_parquet('/opt/airflow/data/final_sales.parquet')
    engine = create_engine('postgresql+psycopg2://user:password@postgres:5432/analytics')
    df.to_sql('daily_sales', engine, if_exists='append', index=False)


default_args = {
    'owner': 'airflow',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='sales_etl_demo',
    start_date=datetime(2025, 1, 1),
    schedule='@daily',
    catchup=False,
    default_args=default_args,
    tags=['etl', 'demo'],
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract_task >> transform_task >> load_task
```
This is enough to understand the core pattern. We define the DAG, define the tasks, and then define the dependency order.
What each part is doing
A few pieces are worth paying attention to.
- `start_date` tells Airflow when the DAG becomes valid
- `schedule='@daily'` means it runs every day
- `catchup=False` stops Airflow from trying to backfill all older dates
- `retries` and `retry_delay` help with temporary failures
- `tags` make the DAG easier to filter in the UI
When I first started with orchestration tools, one small confusion was that scheduling and execution date are not always the same thing. Airflow tracks logical run dates, so if you work with partitions, file names, or date-based SQL, make sure you understand which date your task is actually processing.
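One way to make this concrete is to let the logical date drive which file the task reads. Airflow passes the logical run date to a PythonOperator callable as the templated `ds` string (`YYYY-MM-DD`); the date-partitioned file name below is a made-up example, not part of the original setup:

```python
# Sketch: derive the input file for a run from Airflow's logical date.
# The sales_<ds>.csv naming is an assumption for illustration.
def input_path_for(ds: str) -> str:
    return f"/opt/airflow/data/sales_{ds}.csv"


def extract(ds, **context):
    # In a DAG, the scheduler fills in ds, so each run reads its own partition.
    path = input_path_for(ds)
    ...  # read and stage the file for that date


print(input_path_for("2025-01-01"))  # /opt/airflow/data/sales_2025-01-01.csv
```

With this pattern, a backfill for an older date automatically picks up the right file instead of whatever happens to be on disk today.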
Loading with SQL
If your transformation is SQL-based, it is often cleaner to keep the business logic in SQL rather than inside Python code. For example, after staging the raw file into a table, your transformation might look like this:
```sql
insert into analytics.daily_sales (sale_date, total_amount)
select
    sale_date,
    sum(cast(amount as numeric)) as total_amount
from staging.sales_raw
where order_id is not null
group by sale_date;
```
For many teams, this is easier to review and maintain because SQL expresses the transformation more directly. Airflow then becomes the orchestrator, not the place where all data logic lives.
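If you want to try this insert-select pattern without a running warehouse, it can be sketched against an in-memory SQLite database. The table names are simplified here, and SQLite has no `numeric` type, so the cast uses `real`:

```python
import sqlite3

# Self-contained demo of the insert-select aggregation pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales_raw (order_id text, sale_date text, amount text);
    create table daily_sales (sale_date text, total_amount real);
""")
conn.executemany(
    "insert into sales_raw values (?, ?, ?)",
    [("1", "2025-01-01", "10.5"),
     ("2", "2025-01-01", "4.5"),
     (None, "2025-01-01", "99.0"),  # dropped by the filter: missing order_id
     ("3", "2025-01-02", "7.0")],
)
conn.execute("""
    insert into daily_sales (sale_date, total_amount)
    select sale_date, sum(cast(amount as real))
    from sales_raw
    where order_id is not null
    group by sale_date
""")
rows = conn.execute("select * from daily_sales order by sale_date").fetchall()
print(rows)  # [('2025-01-01', 15.0), ('2025-01-02', 7.0)]
```

The same SQL would typically run in Airflow through a SQL operator pointed at a warehouse connection, rather than through Python code like this.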
PythonOperator vs other approaches
For a first DAG, PythonOperator is usually fine. But depending on your stack, you may choose something else.
| Approach | Good for | Limitation |
|---|---|---|
| PythonOperator | Quick custom logic, file handling, simple demos | Can become messy if all logic stays inside DAG code |
| BashOperator | Running scripts you already have | Harder to test and observe deeply |
| SQL operators | Warehouse transformations | Limited when you need file or API processing |
| Docker/Kubernetes operators | Isolated task environments | More setup required for beginners |
For our simple example, Python keeps things easy. In production, I would normally move the heavy transformation code into versioned scripts or containers instead of keeping everything in one DAG file.
Things to be careful about
This is the part that matters more in real work than in demos.
1. Idempotency
If your DAG retries, it should not create duplicate data. Suppose the load task writes rows and then fails before marking the task complete. On retry, you may insert the same rows again.
A safer pattern is to load into a partition for a specific date, or delete-and-insert for the run window, or use a merge/upsert strategy if your database supports it.
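The delete-and-insert variant can be sketched with SQLite, again with simplified table names. Running the load twice for the same date leaves exactly one row instead of two:

```python
import sqlite3

# Sketch of an idempotent load for one run window: delete the rows
# for this logical date, then insert, inside a single transaction.
conn = sqlite3.connect(":memory:")
conn.execute("create table daily_sales (sale_date text, total_amount real)")


def load_day(conn, sale_date, total_amount):
    with conn:  # delete + insert commit (or roll back) together
        conn.execute("delete from daily_sales where sale_date = ?", (sale_date,))
        conn.execute("insert into daily_sales values (?, ?)",
                     (sale_date, total_amount))


load_day(conn, "2025-01-01", 15.0)
load_day(conn, "2025-01-01", 15.0)  # simulated retry of the same run
count = conn.execute("select count(*) from daily_sales").fetchone()[0]
print(count)  # 1
```

The transaction matters: if the insert fails midway, the delete rolls back too, so a retry always starts from a consistent state.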
2. Connections and secrets
In the example above, the database connection string is hardcoded, which is okay only for a demo. In practice, store this in Airflow Connections, a secret manager, or environment variables. Keeping credentials in DAG code is not a good idea.
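A minimal step in that direction is reading the connection string from the environment rather than the DAG file. The variable name `ANALYTICS_DB_URI` is an assumption for this sketch; in a real deployment you would more likely define an Airflow Connection and fetch it through a hook, with the value set outside the codebase:

```python
import os

# Demo only: in practice the URI is set by the scheduler environment or
# a secret manager, never assigned in code like this.
os.environ["ANALYTICS_DB_URI"] = "postgresql+psycopg2://user:***@postgres:5432/analytics"


def get_db_uri() -> str:
    uri = os.environ.get("ANALYTICS_DB_URI")
    if not uri:
        raise RuntimeError("ANALYTICS_DB_URI is not set")
    return uri


# In load(), the hardcoded string would then become:
# engine = create_engine(get_db_uri())
print(get_db_uri())
```

Failing fast when the variable is missing is deliberate: a load task that silently falls back to a default database is harder to debug than one that refuses to start.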
3. Observability
Airflow gives logs and task history, but that alone may not be enough. If a pipeline is important, you also want row counts, validation checks, and alerts. A DAG that finishes successfully but loads bad data is still a failed pipeline from a business point of view.
4. Small files and local state
Writing intermediate files on local disk is fine for learning. In a shared production environment, local files can become fragile because workers may be ephemeral or tasks may run on different nodes. Object storage or staging tables are usually safer.
What changes in production
For a simple tutorial, we are reading one local CSV and loading one table. In a production use case, I would usually change a few things:
- use object storage instead of local disk
- parameterize the run date
- use Airflow Connections for credentials
- add data quality checks after load
- make the load idempotent
- move transformation logic into separate tested modules or SQL models
- set up alerts for repeated failures
You might also split the pipeline further. For example, ingestion could be one DAG and downstream transformations could be another. That depends on how independent the steps are and how your team wants to operate them.
A small next step
Once this first DAG is running, the next useful improvement is to add a validation task. Even a simple row count or null check can save time. For example, you can add a task that fails if the transformed dataset is empty when the source file is not empty. That catches bad assumptions early.
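Such a validation task can be a few lines. The sketch below takes plain row counts (how you obtain them depends on your storage); in Airflow, raising an exception from the callable is enough to mark the task as failed:

```python
# Sketch: fail loudly when the transform produced nothing from a
# non-empty source. The function name and signature are made up here.
def validate_counts(source_rows: int, transformed_rows: int) -> None:
    if source_rows > 0 and transformed_rows == 0:
        raise ValueError(
            f"Transform produced 0 rows from {source_rows} source rows"
        )


validate_counts(100, 98)  # fine: a few bad rows were dropped

try:
    validate_counts(100, 0)
except ValueError as e:
    print(e)  # Transform produced 0 rows from 100 source rows
```

Wired in as a fourth task between transform and load, this turns a silent data problem into a visible red square in the Airflow UI.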
Airflow is a large tool, so it is easy to feel that you need to understand everything before writing your first workflow. In practice, you do not. Start with one small DAG, a few tasks, clear dependencies, and sensible retries. That is enough to learn the model.
For our use case, the main goal was not to build the most advanced ETL pipeline, but to understand how orchestration fits around ETL work. Once you are comfortable with that, adding sensors, branching, task groups, or containerized tasks becomes much easier.
