Databricks Workflows for Scheduled Jobs

In this article, let us look at how to use Databricks Workflows for scheduled jobs, why you might choose it over bringing in a separate orchestrator, and what it looks like in a simple real project. If your data pipeline already runs inside Databricks, Workflows is often the quickest way to schedule it without adding Airflow, Step Functions, or another tool too early.

For a small to medium use case, that simplicity matters. You can create a job, attach one or more tasks, set a schedule, pass parameters, and monitor the runs from the same place where your notebooks or Python files already live. That does not mean it replaces every orchestration tool, but for many scheduled batch jobs it is enough.

What is Databricks Workflows

Databricks Workflows is the built-in orchestration feature in Databricks. You can use it to run notebooks, Python scripts, SQL tasks, dbt tasks, or Delta Live Tables on a schedule or based on task dependencies.

If I already have a bronze to silver pipeline in Databricks, I usually prefer starting with Workflows because:

  • the compute is already there
  • logs are in one place
  • retries and task dependencies are built in
  • there is less infrastructure to manage

If the entire pipeline spans many systems, then an external orchestrator may still make more sense. But if most of the work is Spark, SQL, and Delta tables, Workflows is a very reasonable starting point.

A simple use case

Let us say we receive sales files every morning into cloud storage. We want a scheduled job at 6 AM UTC that does the following:

  1. Read the raw files into a bronze Delta table
  2. Clean and deduplicate into a silver table
  3. Aggregate daily revenue into a gold table

This is a classic batch pipeline, and it maps nicely to a Databricks Workflow with three tasks.

Creating the tasks

You can create a workflow from the Databricks UI, but it is better to keep your code in notebooks or Python files under source control. For a small example, our task flow could look like this.

# bronze_ingest.py
from pyspark.sql import functions as F

# Job parameters passed from the workflow arrive as widget values
input_path = dbutils.widgets.get("input_path")
run_date = dbutils.widgets.get("run_date")

# Read the day's raw CSV files from cloud storage
df = spark.read.option("header", True).csv(f"{input_path}/{run_date}/*.csv")

# Stamp each row with an ingestion timestamp and append to bronze
(df
 .withColumn("ingested_at", F.current_timestamp())
 .write
 .format("delta")
 .mode("append")
 .saveAsTable("demo.sales_bronze"))

Then the silver task can do basic cleanup.

# silver_transform.py
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("demo.sales_bronze")

# Keep only the most recently ingested row for each order_id
window_spec = Window.partitionBy("order_id").orderBy(F.col("ingested_at").desc())

clean_df = (df
    .withColumn("rn", F.row_number().over(window_spec))
    .filter("rn = 1")
    .drop("rn"))

(clean_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("demo.sales_silver"))
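The window plus row_number pattern above boils down to "keep the newest row per order_id". As a plain-Python sketch of the same idea (latest_per_key is a hypothetical helper for illustration, not part of the pipeline):

```python
def latest_per_key(rows):
    """Keep the most recently ingested row per order_id.

    `rows` is a list of dicts with at least "order_id" and
    "ingested_at" keys; this mirrors what the Spark window +
    row_number + filter("rn = 1") pattern does above.
    """
    latest = {}
    for row in rows:
        key = row["order_id"]
        # Replace the stored row whenever a newer ingestion arrives
        if key not in latest or row["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = row
    return list(latest.values())
```

In Spark the same logic runs distributed per partition key, but thinking of it this way makes the deduplication rule easy to review.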

And the gold task could be SQL.

INSERT OVERWRITE demo.daily_revenue_gold
SELECT
  order_date,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT order_id) AS order_count
FROM demo.sales_silver
GROUP BY order_date;

The nice thing here is that the workflow does not care if one task is Python and another is SQL. You can connect them in sequence and Databricks will handle the dependency order.

Scheduling the workflow

In the workflow configuration, you can create three tasks:

  • bronze_ingest
  • silver_transform depends on bronze_ingest
  • gold_aggregate depends on silver_transform

Then attach a schedule if you want it to run every day at 6 AM UTC. Note that Databricks schedules use Quartz cron syntax, so the expression is 0 0 6 * * ? rather than the standard five-field 0 6 * * *.
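To make that concrete, a sketch of the job definition in Jobs API style JSON could look like the following. The file paths and the warehouse ID are placeholders, and the exact fields depend on how you deploy:

```json
{
  "name": "daily_sales_pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "bronze_ingest",
      "spark_python_task": {"python_file": "bronze_ingest.py"}
    },
    {
      "task_key": "silver_transform",
      "depends_on": [{"task_key": "bronze_ingest"}],
      "spark_python_task": {"python_file": "silver_transform.py"}
    },
    {
      "task_key": "gold_aggregate",
      "depends_on": [{"task_key": "silver_transform"}],
      "sql_task": {
        "file": {"path": "gold_aggregate.sql"},
        "warehouse_id": "<warehouse-id>"
      }
    }
  ]
}
```

The depends_on entries are what give you the bronze → silver → gold ordering without any extra orchestration code.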

For the first task, it is useful to pass parameters. For example:

{
  "input_path": "s3://company-raw-sales",
  "run_date": ""
}

The exact dynamic value syntax can vary based on task type and how you configure it, so I always test this once from the UI before assuming the parameter value is what I expect. Date handling is one of those small things that can cause confusion later.
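Because the parameter can arrive empty or malformed, I like to normalize it at the top of the notebook. A small sketch, where resolve_run_date is a hypothetical helper (not a Databricks API) and the fallback-to-today behavior is a design choice:

```python
from datetime import date, datetime

def resolve_run_date(raw, today=None):
    """Return the run date as a YYYY-MM-DD string.

    If the workflow passed a concrete date, validate and use it.
    If the parameter came through empty (e.g. a dynamic value did
    not resolve), fall back to today's date.
    """
    today = today or date.today()
    if not raw:
        return today.isoformat()
    # Fail early if the parameter is not a valid date, instead of
    # silently reading from the wrong storage path later.
    datetime.strptime(raw, "%Y-%m-%d")
    return raw
```

Failing fast on a bad date is deliberate: a wrong path that returns zero files can otherwise look like a successful run.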

Job clusters versus all-purpose clusters

One good practice is to use a job cluster for scheduled runs instead of an always-running interactive cluster. A job cluster starts for the workflow run and terminates after it finishes.

For example, a simple cluster config could be:

{
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2
}

For a demo this is fine. In a production use case, I would also think about:

  • cluster policies
  • instance pools
  • cost controls
  • retry settings
  • alerting on failure
  • service principal or managed identity instead of user-owned credentials

These things are easy to ignore in the beginning, but they matter once multiple pipelines start using the platform.

Monitoring and retries

This is one of the more useful parts of Workflows. You can see the DAG view, task history, duration, logs, and retry behavior from the Databricks UI. For many teams this is enough visibility without building another monitoring layer on day one.

I usually set a small retry policy for transient issues, for example a cloud storage read timeout or short cluster startup issue. But I would not hide real data quality problems behind too many retries. If the input is bad, the job should fail loudly enough for someone to notice.
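As a sketch, a conservative task-level retry policy in Jobs API JSON might look like the following; the numbers are my own defaults, not a recommendation for every pipeline:

```json
{
  "max_retries": 2,
  "min_retry_interval_millis": 300000,
  "retry_on_timeout": false
}
```

Two retries spaced five minutes apart gives transient issues a second chance, while a genuinely broken input still fails within a predictable window.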

Where Workflows fits compared to other tools

A quick comparison looks like this:

| Tool | Best fit | What to watch for |
| --- | --- | --- |
| Databricks Workflows | Pipelines mostly inside Databricks | Less ideal for cross-platform orchestration |
| Airflow | Many systems and complex dependencies | More infrastructure and maintenance |
| AWS Step Functions | AWS-native service orchestration | Not as natural for Databricks-first pipelines |
| GitHub Actions | Lightweight scheduled automation and CI/CD | Limited for heavy data orchestration |

For our use case, Workflows is attractive because the compute, code, logs, and scheduling stay close together. That reduces moving parts.

Limitations to be aware of

There are a few things I would keep in mind before standardizing on it for everything.

  1. If your pipeline needs to coordinate across many external systems, Workflows can start feeling too narrow.
  2. The UI makes it easy to get started, but unmanaged UI changes can drift away from what is stored in Git.
  3. Parameter handling and environment promotion need some discipline between dev, test, and prod.
  4. If a workflow becomes very large, debugging dependencies and ownership can get messy.

For that reason, I like Workflows most when the scope is clear. If the main concern is running Databricks jobs on a schedule, it does that well. If the main concern is enterprise-wide orchestration across ten services, I would evaluate other tools.

Managing it in production

In a real production setup, I would avoid creating everything manually from the UI. I would define jobs through Databricks Asset Bundles, Terraform, or another repeatable deployment approach. That way the schedule, cluster settings, permissions, and task definitions are version controlled.
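As a sketch, a minimal Databricks Asset Bundles definition (databricks.yml) for this job might look like the following; the workspace host and notebook path are placeholders, and real setups would add permissions and per-environment overrides:

```yaml
bundle:
  name: sales-pipeline

targets:
  dev:
    workspace:
      host: https://<your-workspace-url>  # placeholder

resources:
  jobs:
    daily_sales_pipeline:
      name: daily_sales_pipeline
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: UTC
      tasks:
        - task_key: bronze_ingest
          notebook_task:
            notebook_path: ./notebooks/bronze_ingest  # placeholder path
```

Deploying with `databricks bundle deploy -t dev` then keeps the job definition in Git rather than in someone's browser.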

I would also add a few operational basics:

  • notifications to email or chat on failures
  • clear ownership for each job
  • naming standards for jobs and tasks
  • separate configs per environment
  • checks for data freshness after successful runs

A workflow that says success in the UI is only part of the story. We still need to know whether the expected data actually landed in the target tables.
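A freshness check can be as simple as comparing the newest ingested_at in the target table against the current time. A sketch, where is_fresh is a hypothetical helper and the 26-hour threshold is an arbitrary allowance for a daily job:

```python
from datetime import datetime, timedelta

def is_fresh(latest_ingested_at, now, max_lag=timedelta(hours=26)):
    """True if the newest row is recent enough for a daily pipeline.

    In a notebook, latest_ingested_at would come from something like
    spark.table("demo.sales_silver").agg(F.max("ingested_at")),
    collected to a Python datetime.
    """
    return (now - latest_ingested_at) <= max_lag
```

Wiring a check like this into a final task (or an external monitor) turns "the job succeeded" into "the data actually arrived".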

Conclusion

Databricks Workflows is a good option when you want to schedule Databricks-native jobs without introducing another orchestration platform too soon. It is simple to start, supports task dependencies and retries, and keeps monitoring close to the code. For many batch pipelines, that is more than enough. Once the platform grows, you can decide whether to keep using it for Databricks-heavy pipelines or move broader orchestration into a separate tool.

This post is licensed under CC BY 4.0 by the author.