Getting Started with Apache Spark on Databricks
In this article let us walk through what it actually looks like to run Apache Spark on Databricks when you are coming from a self-managed Spark background. If you have been running Spark on EMR, Cloudera, or a home-grown YARN cluster and someone just handed you a Databricks workspace, this is for you.
I am not going to explain what Spark is or why distributed computing matters — there are already a hundred articles for that. Instead we will focus on the parts that feel different when you move to Databricks: how clusters work, where your code lives, and what you need to unlearn from the old way of doing things.
Setting Up Your Databricks Workspace
If you do not have a workspace yet, the quickest way is to sign up on the Databricks website and pick a cloud provider. For most teams, this is already done by whoever manages your cloud account, and you will just get a URL that looks like https://<your-workspace>.cloud.databricks.com.
Once you log in, the sidebar is your starting point. It has a few main sections that you will keep coming back to:
- Workspace — where notebooks and code live, organised in folders like a file system
- Compute — your clusters, SQL warehouses, and anything that runs code
- Data — the catalog, which is where you register tables and browse data assets
- Jobs — for scheduled or triggered production runs
If you are used to SSH-ing into an edge node and running spark-submit, this UI-first approach takes a few days to get comfortable with. But once you settle in, it is hard to go back.
Your First Cluster
Before you can run anything, you need a cluster. Click Compute → Create Compute and pick a cluster type. For learning and development, an all-purpose cluster is what you want. Job clusters are for production runs and we will touch on those later.
Important settings to pay attention to:
- Databricks Runtime Version — this is not just Spark. Databricks ships its own optimised runtime that bundles Spark with performance improvements (Photon, Delta Lake integration, better shuffle). Pick the latest LTS version unless you have a reason not to.
- Worker Type and Count — start small. Two workers with 8–16 GB each is plenty for learning. You can scale up when you have real data.
- Auto Termination — set this to something sensible like 30 or 60 minutes. Nobody wants to explain to their finance team why the cluster was running over the weekend.
Hit create and wait a couple of minutes. The cluster UI shows you Spark UI links, driver logs, and metrics even before you write a single line of code. This alone is a huge upgrade from digging through YARN resource manager pages.
The First Notebook
Databricks notebooks are not just Jupyter with a different logo. They support multiple languages in the same notebook — Python, SQL, Scala, and R — and you can switch between them using magic commands like %sql or %scala at the top of any cell.
Here is a dead-simple first notebook to check that everything works:
1
2
3
# Cell 1: Create a simple DataFrame
df = spark.range(0, 100)
display(df)
The display() function is a Databricks-specific thing. It renders DataFrames as interactive tables with sorting, filtering, and charting built in. You do not need to call .show() and squint at the console output anymore.
Let us read a CSV from DBFS (Databricks File System) and do a small aggregation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Cell 2: Read a CSV from DBFS
from pyspark.sql.functions import col, sum as spark_sum
sales_df = spark.read \
.option("header", "true") \
.option("inferSchema", "true") \
.csv("dbfs:/databricks-datasets/retail-org/sales_orders/")
# Show schema to verify it was inferred correctly
sales_df.printSchema()
# Simple aggregation: total sales by region
sales_df.groupBy("region") \
.agg(spark_sum("total").alias("total_sales")) \
.orderBy(col("total_sales").desc()) \
.display()
If you have worked with Spark before, this code looks familiar — and that is the point. Databricks does not force you to learn a new API. The same PySpark code you wrote on your old cluster works here, just with a better development experience around it.
Where Data Lives
This is where things get different from a typical Spark-on-YARN setup. Databricks has two storage layers you need to know about:
| Storage | What It Is | When To Use It |
|---|---|---|
| DBFS | A distributed file system abstraction over your cloud object storage (S3, ADLS, GCS) | Quick exploration, uploading small CSVs, or referencing datasets that already live in your cloud buckets |
| Delta Lake | An open table format with ACID transactions, time travel, and schema enforcement | Anything production — analytical tables, feature stores, streaming sinks |
DBFS is fine for one-off exploration, but do not build pipelines on top of CSV files in DBFS. You will regret it the moment you need to handle schema changes or concurrent writes. Delta Lake is the default table format in Databricks and you should lean on it from day one.
Here is how you write a DataFrame to a Delta table:
1
2
3
4
# Write as a managed Delta table
sales_df.write \
.mode("overwrite") \
.saveAsTable("default.sales_orders")
Now you can query it from any notebook using SQL:
1
2
3
4
5
%sql
SELECT region, COUNT(*) as order_count, SUM(total) as revenue
FROM default.sales_orders
GROUP BY region
ORDER BY revenue DESC
The fact that you can mix Python and SQL in the same notebook and pass DataFrames between them is something that grows on you quickly.
Notebooks vs Jobs: The Part People Get Wrong
When you start out, it is tempting to just write a notebook, hit “Run All,” and call it a pipeline. That works for exploration but it is not how you should run production workloads.
Here is the difference:
- Notebooks are for interactive development and ad-hoc analysis. They are your scratchpad.
- Jobs are for scheduled, monitored, production execution. You can run a notebook as a job or — even better — package your code as a Python wheel or JAR and run it as a job with proper parameterisation.
To create a job from a notebook, go to Workflows → Create Job, pick your notebook, set a schedule, and define a job cluster. Job clusters are cheaper than all-purpose clusters because they spin up, run the work, and shut down immediately. You only pay for the compute you actually use.
A cleaner pattern for production is to move your logic out of notebooks and into regular Python files packaged as a wheel. You can still develop interactively in a notebook, but when you are ready to go to production, extract the functions into a .py file, upload it as a workspace file or to a repo, and point the job at that instead.
Things I Wish Someone Had Told Me Earlier
Autoscaling is not magic. Databricks can add and remove workers based on load, but it is not instantaneous. If your job spikes from 0 to 100 GB of data in two minutes, the autoscaler will lag behind and you will see slow stages. Size your cluster with a minimum that matches your steady-state workload and let autoscaling handle the peaks.
Photon makes a real difference. If your workload is SQL-heavy and your runtime supports it, turn on Photon. It is a native vectorised query engine that Databricks ships. I have seen aggregations run 3-5x faster with no code changes at all. It costs a bit more per DBU but usually saves money overall because jobs finish faster.
Partition pruning depends on Delta statistics. When you apply a filter on a Delta table, the engine uses column-level min/max statistics to skip files entirely. This works great — unless your data is evenly spread across partitions and the min/max range covers everything. Pay attention to how you partition and Z-order your tables.
The Spark UI is your friend. Databricks exposes the full Spark UI from the cluster detail page. If a job is slow, open it. Look for skewed tasks, shuffle spills, and stages that read way more data than they should. The answer is usually there, in the numbers.
What Changes for Production
A development cluster and a production setup are different things. Here is what you should add before calling something production-ready:
- Use job clusters, not all-purpose clusters. They are cheaper, ephemeral, and you get clean state on every run.
- Store credentials in Databricks Secrets. Do not hardcode API keys or database passwords in your notebooks. Use
dbutils.secrets.get()and scope your secrets properly. - Set up cluster policies. If you have multiple people creating clusters, policies prevent someone from accidentally spinning up a 64-node GPU cluster for a 10 MB CSV file.
- Use Unity Catalog if it is available in your workspace. It gives you fine-grained access control across workspaces and tracks lineage. It is the replacement for the older Hive metastore and worth the migration effort.
- Version your code in Git. Use Databricks Repos to link your notebooks and Python modules to a Git repository. Do not let the workspace folder be the only copy of your production code.
Wrapping Up
Moving from self-managed Spark to Databricks is less about learning new Spark tricks and more about unlearning the operational overhead you used to carry. You stop worrying about cluster bootstrap scripts, log rotation, and dependency hell — and you start spending more time on the data itself.
The trade-off is that you need to understand Databricks-specific concepts like DBUs, Delta Lake, and the workspace model. But these are learnable in a week or two, and once you are through that curve, you will probably not miss the old way of doing things.
Start with a small all-purpose cluster, write a notebook that reads some data and writes a Delta table, then turn it into a scheduled job. That alone covers 80% of what most data engineers need from the platform. The rest — streaming, MLflow, Unity Catalog — can come later, when you actually need them.
