Databricks Workflows for Scheduled Jobs: A Practical Guide
When I started building data pipelines on Databricks, the orchestration piece was always the awkward part. You would write your notebooks, get the transformations working, and then you had to figure out how to actually run them on a schedule without someone manually hitting “Run All” every morning.
We tried Airflow first — and don’t get me wrong, Airflow is great — but when everything you are orchestrating lives inside Databricks anyway, it starts to feel like you are maintaining two different things just to run one pipeline. That is where Databricks Workflows comes in.
In this article, let us walk through what Databricks Workflows actually is, how to set up a scheduled job with multiple tasks, how it compares to Airflow, and the things I wish someone had told me before I used it in production.
What is Databricks Workflows?
Databricks Workflows is the native orchestration service inside the Databricks platform. It lets you define jobs that can run notebooks, Python scripts, JAR files, dbt transformations, or even SQL queries — and schedule them to run on a recurring basis.
If you have used the “Jobs” tab in the Databricks UI before, you have already seen the basics. But Workflows goes further — you can define multi-task jobs with dependencies, retry policies, alerts, and even parameterised runs. Think of it as a lightweight orchestrator that lives right next to your data, rather than a separate service you need to host somewhere else.
Setting Up Your First Workflow
Let us walk through creating a job that runs two notebooks: one that ingests raw data, and a second that transforms it. The second task should only run after the first one succeeds.
Step 1: Create the Notebooks
Assume we have a notebook called ingest_sales_data that reads raw CSV files from a DBFS mount and writes the result as a Delta table:
# ingest_sales_data notebook
df = spark.read.option("header", "true").csv("/mnt/raw/sales_*.csv")
df.write.format("delta").mode("overwrite").saveAsTable("bronze.sales_raw")
And a second notebook called transform_sales that does some cleaning and aggregation:
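# transform_sales notebook
from pyspark.sql import functions as F

df = spark.table("bronze.sales_raw")
# Drop repeated transactions and non-positive amounts before aggregating
df_clean = df.dropDuplicates(["transaction_id"]).filter("amount > 0")
# Alias the aggregate: the default name "sum(amount)" contains parentheses, which Delta rejects unless column mapping is enabled
(df_clean.groupBy("date", "product_category")
    .agg(F.sum("amount").alias("total_amount"))
    .write.format("delta").mode("overwrite").saveAsTable("silver.sales_daily"))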
Step 2: Create the Job from the UI
Go to Workflows in the sidebar and click Create Job. Give it a name like sales_pipeline_daily.
Add your first task — choose the ingest_sales_data notebook. Then add a second task for transform_sales. In the second task, under Depends on, select the first task. This means the transform only runs if the ingestion succeeds.
You can also configure:
- Cluster: Pick an existing cluster or let it spin up a job cluster (more on that later)
- Retries: How many times to retry on failure before giving up
- Timeout: Maximum time a task can run before being killed
- Email notifications: Get an alert when the job fails or succeeds
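The options in the list above correspond to fields on a task definition when the job is expressed as JSON. Here is a minimal sketch, assuming Jobs API 2.1 field names; the helper name, the email address, and the retry numbers are placeholders for illustration rather than recommendations:

```python
import json


def with_reliability_settings(task, max_retries=2, retry_wait_seconds=300,
                              timeout_seconds=1800, alert_emails=None):
    """Return a copy of a task dict with retry, timeout and email settings added."""
    configured = dict(task)
    configured["max_retries"] = max_retries
    # The API expects the wait between retries in milliseconds.
    configured["min_retry_interval_millis"] = retry_wait_seconds * 1000
    configured["timeout_seconds"] = timeout_seconds
    if alert_emails:
        configured["email_notifications"] = {"on_failure": list(alert_emails)}
    return configured


ingest_task = with_reliability_settings(
    {
        "task_key": "ingest_sales",
        "notebook_task": {"notebook_path": "/Shared/ingest_sales_data"},
        "job_cluster_key": "main_cluster",
    },
    alert_emails=["data-team@example.com"],  # placeholder address
)
print(json.dumps(ingest_task, indent=2))
```

Keeping these settings in one helper makes it harder to forget them on a new task.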
Step 3: Set the Schedule
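Under Schedule, set it to run daily at, say, 6:00 AM UTC. You can also use cron syntax if you need something more specific. Databricks expects Quartz cron syntax, which starts with a seconds field, rather than the five-field Unix format:
0 0 6 * * ?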
Click Create, and your job is ready. You can trigger it manually to test before the schedule kicks in.
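When the schedule is defined through the Jobs API it is written as a Quartz cron expression, which has a leading seconds field and uses ? in one of the two day fields. The helper below is a small sketch of how a daily schedule could be generated; the function name is my own and not part of any Databricks library:

```python
def daily_quartz_cron(hour, minute=0):
    """Build a Quartz cron expression that fires once a day at hour:minute."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Quartz fields: seconds minutes hours day-of-month month day-of-week.
    # "?" means "no specific value" and goes in one of the two day fields.
    return f"0 {minute} {hour} * * ?"


schedule = {
    "quartz_cron_expression": daily_quartz_cron(6),
    "timezone_id": "UTC",
}
print(schedule["quartz_cron_expression"])  # 0 0 6 * * ?
```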
Doing the Same Thing with Code
Clicking around the UI is fine for a one-off, but if you have multiple environments (dev, staging, prod), you want this defined as code. The Databricks CLI or the REST API lets you do exactly that.
Here is what the JSON payload for the same job looks like when using the Jobs API:
{
  "name": "sales_pipeline_daily",
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "ingest_sales",
      "notebook_task": {
        "notebook_path": "/Shared/ingest_sales_data",
        "source": "WORKSPACE"
      },
      "job_cluster_key": "main_cluster",
      "timeout_seconds": 1800
    },
    {
      "task_key": "transform_sales",
      "depends_on": [{"task_key": "ingest_sales"}],
      "notebook_task": {
        "notebook_path": "/Shared/transform_sales",
        "source": "WORKSPACE"
      },
      "job_cluster_key": "main_cluster"
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "main_cluster",
      "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        "spark_conf": {
          "spark.sql.adaptive.enabled": "true"
        }
      }
    }
  ]
}
With the legacy Databricks CLI you can deploy this using databricks jobs create --json-file job.json. The newer CLI takes the payload as databricks jobs create --json @job.json instead. If you are managing this through CI/CD, you would version this file in your repo and apply it as part of your deployment pipeline.
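If you would rather call the REST API directly from a deployment script, the request is a single authenticated POST. The sketch below only builds the request and does not send it; the host and token are placeholders, the function name is my own, and the endpoint path assumes Jobs API 2.1:

```python
import json
import urllib.request


def build_create_job_request(host, token, job_settings):
    """Build (but do not send) a POST request to the Jobs API create endpoint."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/create",
        data=json.dumps(job_settings).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host, token and settings. In CI the first two would come from
# secrets, and the settings would be loaded from job.json with json.load().
request = build_create_job_request(
    "https://example.cloud.databricks.com",
    "dapi-placeholder",
    {"name": "sales_pipeline_daily", "tasks": []},
)
print(request.get_method(), request.full_url)
# To actually create the job: urllib.request.urlopen(request)
```

Separating "build the request" from "send the request" keeps the script easy to test without touching a real workspace.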
Databricks Workflows vs Airflow: Where Each Fits
I have used both, and I don’t think one replaces the other entirely. Here is how I would compare them on the things that actually matter day to day:
| Aspect | Databricks Workflows | Apache Airflow |
|---|---|---|
| Setup | Zero — lives inside Databricks | Need a server or MWAA / Cloud Composer |
| Best for | Pipelines where everything is on Databricks | Complex cross-system orchestration |
| Monitoring | Built into Databricks UI, job run history | Rich DAG view, logs per task |
| Retry / alerting | Built in, basic email alerts | Flexible — callbacks, Slack, custom hooks |
| External dependencies | Limited (dbt, SQL) | Huge operator ecosystem |
| Cost | Part of your Databricks compute cost | Separate infrastructure cost |
| Code-as-config | Jobs API / CLI / Terraform | DAGs are Python files |
If your pipeline only touches things inside Databricks — notebooks, Delta tables, SQL warehouses — Workflows is the simpler choice. You skip the overhead of running an Airflow instance and you get tight integration with the rest of the platform.
If you need to trigger an AWS Lambda, then wait for an email, and then kick off a Databricks job based on the result, Airflow is still the better fit. Or honestly, maybe a combination of both — Airflow to orchestrate the big picture and call Databricks Workflows as a step.
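In the combined setup, the outer orchestrator only has to trigger an existing job. Airflow's Databricks provider ships operators for this; as far as I understand it, the trigger underneath is a call to the Jobs API run-now endpoint. A rough sketch of that request body, with a hypothetical job id and parameter name:

```python
import json


def build_run_now_payload(job_id, notebook_params=None):
    """Body for POST /api/2.1/jobs/run-now, which triggers an existing job."""
    payload = {"job_id": job_id}
    if notebook_params:
        # Notebook parameters are passed as string key/value pairs.
        payload["notebook_params"] = {k: str(v) for k, v in notebook_params.items()}
    return payload


# Hypothetical job id and parameter name, for illustration only.
payload = build_run_now_payload(123, {"run_date": "2024-01-31"})
print(json.dumps(payload))
```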
Things to Be Careful About
Here are a few things I learned the hard way.
Job clusters vs interactive clusters. When you create a job, you can either reuse an existing interactive (all-purpose) cluster or let Workflows spin up a dedicated job cluster. Always use a job cluster for scheduled runs. They are billed at the lower jobs compute rate, you only pay while the job runs, and they shut down automatically when the job finishes. The trade-off is a few minutes of startup time per run, which rarely matters for a scheduled batch job. Keeping an interactive cluster running 24/7 just to run one job a day is a waste.
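One way to enforce this is a small check in CI that flags any task pinned to an all-purpose cluster. This is a sketch assuming the Jobs API 2.1 layout, where such tasks carry an existing_cluster_id field; the job definition and cluster id below are made up:

```python
def tasks_on_all_purpose_clusters(job_settings):
    """Return the task keys that are pinned to an existing all-purpose cluster."""
    return [
        task["task_key"]
        for task in job_settings.get("tasks", [])
        if "existing_cluster_id" in task
    ]


# Hypothetical job definition: one task on a job cluster, one on an interactive cluster.
job_settings = {
    "name": "sales_pipeline_daily",
    "tasks": [
        {"task_key": "ingest_sales", "job_cluster_key": "main_cluster"},
        {"task_key": "transform_sales", "existing_cluster_id": "0123-456789-abcdefgh"},
    ],
}
print(tasks_on_all_purpose_clusters(job_settings))  # ['transform_sales']
```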
Notebook dependencies. Workflows does not manage library dependencies across tasks the way Airflow does. If two notebooks need different versions of the same library, you need separate clusters or you need to handle that with %pip install in each notebook. The latter works but makes your run slower and less predictable.
Timeout defaults. By default a task has no timeout at all, and I have seen jobs hang silently for hours because a notebook got stuck waiting for resources. Set a realistic timeout on every task. It is better to fail fast and get alerted than to find out at the end of the day that nothing ran.
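A quick way to catch missing timeouts is to scan the job definition before deploying it. The sketch below assumes the Jobs API 2.1 field timeout_seconds, where a missing value or 0 means no timeout, and only looks at task-level settings; the sample definition is illustrative:

```python
def tasks_without_timeout(job_settings):
    """Return task keys whose timeout is missing or 0 (0 means no timeout)."""
    return [
        task["task_key"]
        for task in job_settings.get("tasks", [])
        if not task.get("timeout_seconds")
    ]


# Illustrative definition: only the first task has a timeout set.
job_settings = {
    "tasks": [
        {"task_key": "ingest_sales", "timeout_seconds": 1800},
        {"task_key": "transform_sales"},
    ]
}
print(tasks_without_timeout(job_settings))  # ['transform_sales']
```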
Parameter passing. You can pass parameters between tasks using dbutils.jobs.taskValues.set() and dbutils.jobs.taskValues.get(). It works well for simple strings, but if you need to pass a DataFrame or a large result, you are better off writing to a Delta table and reading from the downstream task. The parameter system is not designed for heavy data movement.
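For small values, Databricks provides the task values utility, dbutils.jobs.taskValues. Because dbutils only exists inside a notebook, the sketch below takes the utility as an argument and includes a local stand-in so it can run anywhere; the stand-in class, the function names, and the key names are all my own:

```python
class InMemoryTaskValues:
    """Local stand-in mimicking the set/get keyword signatures of dbutils.jobs.taskValues."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, taskKey, key, default=None, debugValue=None):
        # The stand-in has a single shared store, so taskKey is ignored here.
        return self._store.get(key, default)


def publish_ingest_summary(task_values, row_count, table_name):
    """Upstream task: share small scalars, and the table name rather than the data."""
    task_values.set(key="row_count", value=row_count)
    task_values.set(key="output_table", value=table_name)


def read_ingest_summary(task_values, upstream_task="ingest_sales"):
    """Downstream task: read the values published by the upstream task."""
    row_count = task_values.get(taskKey=upstream_task, key="row_count", default=0)
    table_name = task_values.get(taskKey=upstream_task, key="output_table", default="")
    return row_count, table_name


# Inside a Databricks notebook you would pass dbutils.jobs.taskValues instead.
task_values = InMemoryTaskValues()
publish_ingest_summary(task_values, 1250, "bronze.sales_raw")
print(read_ingest_summary(task_values))
```

Note that the upstream task passes the name of the table it wrote, not the rows themselves. The downstream task then reads the data with spark.table().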
Retry behaviour. By default, a retry re-runs the entire failed task from scratch. There is no partial retry or checkpointing built in. If your notebook writes to multiple tables, make sure each write is idempotent so a re-run does not create duplicates.
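A full overwrite, like the one in the example notebooks, is already safe to re-run. Appends are not. One common way to make an incremental write idempotent is a Delta MERGE keyed on the columns that identify a row. The helper below is a sketch that only builds the SQL string; the function name and the daily_updates view are hypothetical:

```python
def build_merge_statement(target, source, key_columns):
    """Build a Delta MERGE that upserts source rows into target on the given keys."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


# "daily_updates" is a hypothetical temp view holding the freshly computed rows.
statement = build_merge_statement(
    "silver.sales_daily", "daily_updates", ["date", "product_category"]
)
print(statement)
# In the notebook: spark.sql(statement)
```

Running the same MERGE twice with the same source rows leaves the target unchanged the second time, which is exactly what you want from a retried task.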
What Changes in Production
In a production setup, here is what I would add beyond the simple example above:
Use the REST API or Terraform to create and update jobs. Never click through the UI for anything that runs in production. You want a version-controlled definition that goes through code review.
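If you go the REST API route, the deployment script has to decide whether a job already exists. Here is a rough sketch of that decision, assuming jobs are matched by name and that the Jobs API 2.1 list, create, and reset endpoints are used. The function name is my own, and matching by name is a simplification, since job names are not guaranteed to be unique:

```python
def plan_deployment(job_settings, existing_jobs):
    """Decide whether to create a job or overwrite an existing one with the same name.

    existing_jobs is the "jobs" list returned by GET /api/2.1/jobs/list.
    Returns a tuple of (endpoint, request_body).
    """
    for job in existing_jobs:
        if job.get("settings", {}).get("name") == job_settings["name"]:
            # jobs/reset replaces all settings of the existing job.
            body = {"job_id": job["job_id"], "new_settings": job_settings}
            return "/api/2.1/jobs/reset", body
    return "/api/2.1/jobs/create", job_settings


# Hypothetical list response containing one job with a matching name.
existing = [{"job_id": 42, "settings": {"name": "sales_pipeline_daily"}}]
endpoint, body = plan_deployment({"name": "sales_pipeline_daily", "tasks": []}, existing)
print(endpoint, body["job_id"])  # /api/2.1/jobs/reset 42
```

Terraform handles this bookkeeping for you, which is a good reason to prefer it once you have more than a handful of jobs.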
Set up a webhook or integration for alerts. The built-in email notifications are fine to start, but for anything serious you want alerts going to Slack or PagerDuty. You can use Databricks SQL alerts or set up a webhook-based notification destination.
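Once a notification destination exists in the workspace, a job can reference it by id. The sketch below assumes the Jobs API 2.1 webhook_notifications field; the helper name and the destination id are placeholders, and you would look up the real id in your workspace settings:

```python
def add_failure_webhooks(job_settings, destination_ids):
    """Return a copy of the job settings that notifies the given destinations on failure."""
    updated = dict(job_settings)
    updated["webhook_notifications"] = {
        "on_failure": [{"id": destination_id} for destination_id in destination_ids]
    }
    return updated


# Placeholder destination id for illustration.
job_settings = add_failure_webhooks(
    {"name": "sales_pipeline_daily"},
    ["00000000-0000-0000-0000-000000000000"],
)
print(job_settings["webhook_notifications"])
```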
Separate job clusters per environment. Dev jobs should not share a cluster spec with production. Different node types, different auto-scaling limits — you tune these differently.
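A simple way to keep environments apart is one shared base spec plus per-environment overrides. This is a sketch only: the sizing numbers are invented, and the field names assume the new_cluster block of the Jobs API (num_workers for fixed size, autoscale for a range):

```python
import copy

BASE_CLUSTER = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
}

# Hypothetical sizing per environment; tune these for your own workloads.
ENV_OVERRIDES = {
    "dev": {"num_workers": 1},
    "prod": {"autoscale": {"min_workers": 2, "max_workers": 8}},
}


def cluster_spec_for(env):
    """Return the new_cluster spec for an environment: base settings plus overrides."""
    if env not in ENV_OVERRIDES:
        raise ValueError(f"unknown environment: {env}")
    spec = copy.deepcopy(BASE_CLUSTER)
    spec.update(copy.deepcopy(ENV_OVERRIDES[env]))
    return spec


print(cluster_spec_for("dev"))
print(cluster_spec_for("prod"))
```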
Use run_as to control permissions. You can configure a job to run as a service principal rather than a user account. This matters when people leave the team and their personal tokens expire.
Monitor cost with tags. Add custom tags to your job clusters so you can track which pipelines are costing the most in your Databricks bill. The Cost Attribution feature in Databricks helps with this.
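Both the run-as identity and the cost tags can live in the job definition itself. The sketch below assumes the Jobs API 2.1 fields run_as (with service_principal_name holding the application id) and custom_tags on each job cluster; the helper name, the application id, and the tag values are placeholders:

```python
import copy


def harden_for_production(job_settings, service_principal_app_id, tags):
    """Return a copy of the job settings with a run-as identity and cluster tags added."""
    updated = copy.deepcopy(job_settings)
    updated["run_as"] = {"service_principal_name": service_principal_app_id}
    for job_cluster in updated.get("job_clusters", []):
        cluster = job_cluster["new_cluster"]
        # Merge with any tags already present; new tags win on conflict.
        cluster["custom_tags"] = {**cluster.get("custom_tags", {}), **tags}
    return updated


# Placeholder application id and tag values for illustration.
base = {
    "name": "sales_pipeline_daily",
    "job_clusters": [
        {"job_cluster_key": "main_cluster", "new_cluster": {"num_workers": 2}}
    ],
}
prod = harden_for_production(
    base,
    "11111111-2222-3333-4444-555555555555",
    {"team": "data-eng", "pipeline": "sales"},
)
print(prod["run_as"], prod["job_clusters"][0]["new_cluster"]["custom_tags"])
```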
Wrapping Up
Databricks Workflows is one of those tools that does not get a lot of attention because it just works. If your team is already on Databricks for Spark and Delta Lake, it makes sense to keep your orchestration there too — fewer moving parts, no separate scheduler to maintain, and the pricing is baked into what you already pay.
It is not a full replacement for Airflow in every scenario, but for the majority of batch pipelines that run inside Databricks, it is the more practical choice.
