Building your first Airflow DAG for ETL
In this article, let us build our first Apache Airflow DAG for a simple ETL pipeline. If you are starting with data engineering, you will quickly notice that writing the transformation code is only one part of the work. You also need a reliable way to run it on time, handle failures, and understand what happened when something goes wrong. That is where Airflow fits in well.
For our use case, we will take a small CSV file, stage it, run a transformation, and load the result into an analytics table. This is not meant to be a production-ready enterprise setup, but it is enough to understand how a DAG is structured and what decisions matter. Once the basics are clear, it is much easier to extend the same pattern.
What is a DAG in Airflow
Airflow uses the term DAG, which stands for Directed Acyclic Graph. In simple words, it is just a workflow with tasks and dependencies. One task runs first, the next waits, and so on.
For an ETL pipeline, a basic DAG usually has steps like:
- Check whether input data is available
- Extract the data
- Transform it
- Load it into the target system
- Optionally run quality checks
The nice thing with Airflow is that each of these steps can be tracked separately. If the load step fails, you do not have to guess whether extract completed or not.
A simple ETL flow
Let us say we receive a daily file called sales.csv. We want to clean the data and load daily totals into a reporting table. Our demo flow will do the following:
- read the CSV from a local folder
- remove rows with a missing `order_id`
- cast the `amount` column to numeric
- aggregate the sales by date
- load the output to a Postgres table
This could be local for a demo, but in a real use case your source might be S3, GCS, an API, or a landing database.
Project structure
A simple Airflow project could look like this:
```
dags/
    sales_etl_dag.py
scripts/
    transform_sales.py
data/
    sales.csv
```
For a first DAG, it is better to keep things simple. Do not add too many operators, sensors, and custom plugins on day one. First make the flow visible and reliable.
Writing the DAG
Below is an example DAG using Python operators.
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
import pandas as pd
from sqlalchemy import create_engine


def extract():
    df = pd.read_csv('/opt/airflow/data/sales.csv')
    df.to_parquet('/opt/airflow/data/staged_sales.parquet', index=False)


def transform():
    df = pd.read_parquet('/opt/airflow/data/staged_sales.parquet')
    df = df[df['order_id'].notna()]
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df = df.dropna(subset=['amount'])
    result = df.groupby('sale_date', as_index=False)['amount'].sum()
    result.to_parquet('/opt/airflow/data/final_sales.parquet', index=False)


def load():
    df = pd.read_parquet('/opt/airflow/data/final_sales.parquet')
    engine = create_engine('postgresql+psycopg2://user:password@postgres:5432/analytics')
    df.to_sql('daily_sales', engine, if_exists='append', index=False)


default_args = {
    'owner': 'airflow',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='sales_etl_demo',
    start_date=datetime(2025, 1, 1),
    schedule='@daily',
    catchup=False,
    default_args=default_args,
    tags=['etl', 'demo'],
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract_task >> transform_task >> load_task
```
This is enough to understand the core pattern. We define the DAG, define the tasks, and then define the dependency order.
What each part is doing
A few pieces are worth paying attention to.
- `start_date` tells Airflow when the DAG becomes valid
- `schedule='@daily'` means it runs every day
- `catchup=False` stops Airflow from trying to backfill all older dates
- `retries` and `retry_delay` help with temporary failures
- `tags` make the DAG easier to filter in the UI
When I first started with orchestration tools, one small confusion was that scheduling and execution date are not always the same thing. Airflow tracks logical run dates, so if you work with partitions, file names, or date-based SQL, make sure you understand which date your task is actually processing.
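One way to make this concrete is to let the logical date drive which file the task reads. Airflow passes the logical run date to a PythonOperator callable as the templated `ds` string (`YYYY-MM-DD`); the date-partitioned file name below is a made-up example, not part of the original setup:

```python
# Sketch: derive the input file for a run from Airflow's logical date.
# The sales_<ds>.csv naming is an assumption for illustration.
def input_path_for(ds: str) -> str:
    return f"/opt/airflow/data/sales_{ds}.csv"


def extract(ds, **context):
    # In a DAG, the scheduler fills in ds, so each run reads its own partition.
    path = input_path_for(ds)
    ...  # read and stage the file for that date


print(input_path_for("2025-01-01"))  # /opt/airflow/data/sales_2025-01-01.csv
```

With this pattern, a backfill for an older date automatically picks up the right file instead of whatever happens to be on disk today.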
Loading with SQL
If your transformation is SQL-based, it is often cleaner to keep the business logic in SQL rather than inside Python code. For example, after staging the raw file into a table, your transformation might look like this:
```sql
insert into analytics.daily_sales (sale_date, total_amount)
select
    sale_date,
    sum(cast(amount as numeric)) as total_amount
from staging.sales_raw
where order_id is not null
group by sale_date;
```
For many teams, this is easier to review and maintain because SQL expresses the transformation more directly. Airflow then becomes the orchestrator, not the place where all data logic lives.
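If you want to try this insert-select pattern without a running warehouse, it can be sketched against an in-memory SQLite database. The table names are simplified here, and SQLite has no `numeric` type, so the cast uses `real`:

```python
import sqlite3

# Self-contained demo of the insert-select aggregation pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales_raw (order_id text, sale_date text, amount text);
    create table daily_sales (sale_date text, total_amount real);
""")
conn.executemany(
    "insert into sales_raw values (?, ?, ?)",
    [("1", "2025-01-01", "10.5"),
     ("2", "2025-01-01", "4.5"),
     (None, "2025-01-01", "99.0"),  # dropped by the filter: missing order_id
     ("3", "2025-01-02", "7.0")],
)
conn.execute("""
    insert into daily_sales (sale_date, total_amount)
    select sale_date, sum(cast(amount as real))
    from sales_raw
    where order_id is not null
    group by sale_date
""")
rows = conn.execute("select * from daily_sales order by sale_date").fetchall()
print(rows)  # [('2025-01-01', 15.0), ('2025-01-02', 7.0)]
```

The same SQL would typically run in Airflow through a SQL operator pointed at a warehouse connection, rather than through Python code like this.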
PythonOperator vs other approaches
For a first DAG, PythonOperator is usually fine. But depending on your stack, you may choose something else.
| Approach | Good for | Limitation |
|---|---|---|
| PythonOperator | Quick custom logic, file handling, simple demos | Can become messy if all logic stays inside DAG code |
| BashOperator | Running scripts you already have | Harder to test and observe deeply |
| SQL operators | Warehouse transformations | Limited when you need file or API processing |
| Docker/Kubernetes operators | Isolated task environments | More setup required for beginners |
For our simple example, Python keeps things easy. In production, I would normally move the heavy transformation code into versioned scripts or containers instead of keeping everything in one DAG file.
Things to be careful about
This is the part that matters more in real work than in demos.
1. Idempotency
If your DAG retries, it should not create duplicate data. Suppose the load task writes rows and then fails before marking the task complete. On retry, you may insert the same rows again.
A safer pattern is to load into a partition for a specific date, or delete-and-insert for the run window, or use a merge/upsert strategy if your database supports it.
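The delete-and-insert variant can be sketched with SQLite, again with simplified table names. Running the load twice for the same date leaves exactly one row instead of two:

```python
import sqlite3

# Sketch of an idempotent load for one run window: delete the rows
# for this logical date, then insert, inside a single transaction.
conn = sqlite3.connect(":memory:")
conn.execute("create table daily_sales (sale_date text, total_amount real)")


def load_day(conn, sale_date, total_amount):
    with conn:  # delete + insert commit (or roll back) together
        conn.execute("delete from daily_sales where sale_date = ?", (sale_date,))
        conn.execute("insert into daily_sales values (?, ?)",
                     (sale_date, total_amount))


load_day(conn, "2025-01-01", 15.0)
load_day(conn, "2025-01-01", 15.0)  # simulated retry of the same run
count = conn.execute("select count(*) from daily_sales").fetchone()[0]
print(count)  # 1
```

The transaction matters: if the insert fails midway, the delete rolls back too, so a retry always starts from a consistent state.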
2. Connections and secrets
In the example above, the database connection string is hardcoded, which is okay only for a demo. In practice, store this in Airflow Connections, a secret manager, or environment variables. Keeping credentials in DAG code is not a good idea.
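A minimal step in that direction is reading the connection string from the environment rather than the DAG file. The variable name `ANALYTICS_DB_URI` is an assumption for this sketch; in a real deployment you would more likely define an Airflow Connection and fetch it through a hook, with the value set outside the codebase:

```python
import os

# Demo only: in practice the URI is set by the scheduler environment or
# a secret manager, never assigned in code like this.
os.environ["ANALYTICS_DB_URI"] = "postgresql+psycopg2://user:***@postgres:5432/analytics"


def get_db_uri() -> str:
    uri = os.environ.get("ANALYTICS_DB_URI")
    if not uri:
        raise RuntimeError("ANALYTICS_DB_URI is not set")
    return uri


# In load(), the hardcoded string would then become:
# engine = create_engine(get_db_uri())
print(get_db_uri())
```

Failing fast when the variable is missing is deliberate: a load task that silently falls back to a default database is harder to debug than one that refuses to start.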
3. Observability
Airflow gives logs and task history, but that alone may not be enough. If a pipeline is important, you also want row counts, validation checks, and alerts. A DAG that finishes successfully but loads bad data is still a failed pipeline from a business point of view.
4. Small files and local state
Writing intermediate files on local disk is fine for learning. In a shared production environment, local files can become fragile because workers may be ephemeral or tasks may run on different nodes. Object storage or staging tables are usually safer.
What changes in production
For a simple tutorial, we are reading one local CSV and loading one table. In a production use case, I would usually change a few things:
- use object storage instead of local disk
- parameterize the run date
- use Airflow Connections for credentials
- add data quality checks after load
- make the load idempotent
- move transformation logic into separate tested modules or SQL models
- set up alerts for repeated failures
You might also split the pipeline further. For example, ingestion could be one DAG and downstream transformations could be another. That depends on how independent the steps are and how your team wants to operate them.
A small next step
Once this first DAG is running, the next useful improvement is to add a validation task. Even a simple row count or null check can save time. For example, you can add a task that fails if the transformed dataset is empty when the source file is not empty. That catches bad assumptions early.
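Such a validation task can be a few lines. The sketch below takes plain row counts (how you obtain them depends on your storage); in Airflow, raising an exception from the callable is enough to mark the task as failed:

```python
# Sketch: fail loudly when the transform produced nothing from a
# non-empty source. The function name and signature are made up here.
def validate_counts(source_rows: int, transformed_rows: int) -> None:
    if source_rows > 0 and transformed_rows == 0:
        raise ValueError(
            f"Transform produced 0 rows from {source_rows} source rows"
        )


validate_counts(100, 98)  # fine: a few bad rows were dropped

try:
    validate_counts(100, 0)
except ValueError as e:
    print(e)  # Transform produced 0 rows from 100 source rows
```

Wired in as a fourth task between transform and load, this turns a silent data problem into a visible red square in the Airflow UI.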
Airflow is a large tool, so it is easy to feel that you need to understand everything before writing your first workflow. In practice, you do not. Start with one small DAG, a few tasks, clear dependencies, and sensible retries. That is enough to learn the model.
For our use case, the main goal was not to build the most advanced ETL pipeline, but to understand how orchestration fits around ETL work. Once you are comfortable with that, adding sensors, branching, task groups, or containerized tasks becomes much easier.
