Post

Delta Lake Basics for Data Engineers — A Practical Guide

In this article, let us go through Delta Lake and understand what it actually does for a data engineer. If you have been working with plain Parquet files on a data lake and running into issues like partial writes, schema drift breaking your queries, or not being able to roll back a bad write, then Delta Lake is exactly what you need.

We will set up a local Spark session with Delta Lake, create a Delta table, run some updates and deletes (yes, on Parquet files), use time travel to recover from mistakes, and then talk about what changes when you move this to production.

What Problem Does Delta Lake Solve?

Before Delta Lake, a typical data lake pipeline looked something like this: read from a source, run transformations in Spark, write the output as Parquet files to cloud storage. This works fine until it does not.

Let us say your job fails halfway through a write. With plain Parquet, the output folder now has a mix of partially written files and no way to know which ones are complete. Downstream readers pick up the partial data and things break silently. Or maybe someone adds a new column to the upstream source, and suddenly the Parquet schema does not match, and your pipeline throws an error at 2 AM.

Delta Lake adds a transaction log on top of Parquet. Every write, delete, or update is recorded in a JSON-based log. This gives us:

  • ACID transactions — a write either completes fully or not at all, no partial data visible to readers
  • Time travel — roll back to a previous version of the data, or query what the table looked like an hour ago
  • Schema enforcement and evolution — refuse writes that do not match the schema, or evolve it explicitly when you want to
  • Upserts and deletes — run MERGE, UPDATE, and DELETE SQL on your data lake files

These are not academic features. If you have ever woken up to a broken pipeline because a job overwrote the wrong partition, you will appreciate what the transaction log gives you.

Setting Up Delta Lake with Spark

Delta Lake works as a library on top of Spark. For our local setup, we will use PySpark with the Delta Lake packages. If you are using Databricks, Delta Lake is already included and you can skip this step.

1
2
3
4
5
6
7
8
9
# Local setup with pyspark and delta-spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("delta-lake-basics") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

Three configs are important here. The first pulls the Delta Lake JAR from Maven. The second enables Delta Lake SQL extensions so you can use CREATE TABLE and MERGE syntax. The third registers Delta Lake as the default catalog so spark_catalog understands Delta tables.

For the examples below, we will write everything to a local directory, but in cloud setups you would point this to S3, GCS, or ADLS.

Creating Your First Delta Table

Let us create a simple Delta table from a DataFrame.

1
2
3
4
5
6
7
8
9
10
11
12
data = [
    ("2025-04-01", "customer_a", 150.00),
    ("2025-04-01", "customer_b", 200.50),
    ("2025-04-02", "customer_a", 75.25),
    ("2025-04-02", "customer_c", 310.00),
]

df = spark.createDataFrame(data, ["order_date", "customer_id", "amount"])

df.write.format("delta") \
    .mode("overwrite") \
    .save("/tmp/delta/orders")

Now we have a Delta table on disk. We can read it back and register it as a table for SQL queries.

1
2
3
4
spark.read.format("delta").load("/tmp/delta/orders") \
    .createOrReplaceTempView("orders")

spark.sql("SELECT * FROM orders WHERE order_date = '2025-04-01'").show()

If you look at the directory where the table is stored, you will see Parquet files just like before, but also a _delta_log folder. That folder contains the transaction log that makes everything else possible. Every operation you run appends a new JSON file to that log.

Running Updates and Deletes on Parquet Files

This is where it gets useful. With plain Parquet, you cannot run UPDATE or DELETE without rewriting the entire partition or working around it in your application code. With Delta Lake, these are first-class operations.

1
2
3
4
5
6
7
8
-- Update a single row
UPDATE orders
SET amount = 165.00
WHERE customer_id = 'customer_a' AND order_date = '2025-04-01';

-- Delete rows matching a condition
DELETE FROM orders
WHERE amount < 100.00;

Under the hood, Delta Lake does not actually modify the original Parquet files. It reads the matching files, filters out the rows being updated or deleted, and writes new files — all inside a transaction. Readers either see the old state or the new state, never anything in between.

MERGE for Upserts

For our use case, this is probably the feature you will use the most. If you have incoming CDC events or daily snapshots and need to insert new rows and update existing ones in one shot, MERGE handles it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# New data coming in
updates = [
    ("2025-04-03", "customer_a", 99.00),
    ("2025-04-03", "customer_d", 450.00),
]
updates_df = spark.createDataFrame(updates, ["order_date", "customer_id", "amount"])
updates_df.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO orders AS target
    USING updates AS source
    ON target.customer_id = source.customer_id AND target.order_date = source.order_date
    WHEN MATCHED THEN
        UPDATE SET target.amount = source.amount
    WHEN NOT MATCHED THEN
        INSERT (order_date, customer_id, amount)
        VALUES (source.order_date, source.customer_id, source.amount)
""")

A point worth mentioning: if the ON condition matches more than one row in the target, Delta Lake will throw an error. The join key must uniquely identify a row in the target. If your data can have duplicates, you need to deduplicate before the merge.

Time Travel — Rolling Back a Bad Write

Let us say someone ran a bad UPDATE and corrupted the amount column for all rows. With plain files, you would hope your backup exists and is recent. With Delta Lake, you can query or restore a previous version.

1
2
3
4
5
6
7
8
9
10
11
# See what the table looked like at version 0
spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/tmp/delta/orders") \
    .show()

# Restore the table to version 0
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/delta/orders")
delta_table.restoreToVersion(0)

You can also time travel using a timestamp instead of a version number. This is handy when someone says “what did this table look like before the 7 AM job ran?”.

1
2
3
4
spark.read.format("delta") \
    .option("timestampAsOf", "2025-04-02T06:00:00") \
    .load("/tmp/delta/orders") \
    .show()

Delta Lake vs Plain Parquet — A Quick Comparison

FeaturePlain ParquetDelta Lake
ACID transactionsNo — partial writes possibleYes — atomic commits
Updates and deletesFull partition rewrite requiredSQL UPDATE / DELETE / MERGE
Schema enforcementNone — writes succeed with wrong schemaFails on mismatch, explicit evolution
Time travelNo — only the current stateYes — query or restore any version
CompactionManual, separate processOPTIMIZE command built in
Reader isolationReaders may see partial writesSnapshot isolation

Schema Enforcement and Evolution

By default, Delta Lake checks that the data you are writing matches the table schema. If your DataFrame has an extra column that is not in the target table, the write fails.

1
2
3
4
# This would fail if 'discount' is not in the target schema
df.withColumn("discount", lit(0.10)).write.format("delta") \
    .mode("append") \
    .save("/tmp/delta/orders")

This is actually a good thing. In production pipelines, we have had sources add columns without warning, and schema enforcement caught the issue at write time instead of breaking downstream readers two hours later.

When you do want to add a column, you set mergeSchema to true.

1
2
3
4
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/delta/orders")

Alternatively, you can add the column explicitly using SQL.

1
ALTER TABLE orders ADD COLUMN discount DOUBLE;

Practical Limitations and Caveats

  1. Small file problem. If you run many small inserts or merges, you will end up with a lot of tiny Parquet files. Delta Lake can handle this with the OPTIMIZE command, but you need to schedule it. In production, we run OPTIMIZE daily on tables that receive frequent small writes.

  2. Transaction log size. Every operation adds a JSON commit file to the _delta_log folder. Over time, this log can grow large, especially on high-throughput tables. There is a VACUUM command to clean up old Parquet files, but it does not shrink the log. Delta Lake periodically creates checkpoint files that summarize the log, but if you have millions of commits, it is worth checking log size.

  3. Concurrency. Multiple writers to the same table can conflict. Delta Lake uses optimistic concurrency, so if two jobs try to write at the same time, one will fail and need a retry. For simple append-only workloads this is rarely an issue, but for merge-heavy pipelines you need retry logic.

  4. Not a full database. Delta Lake gives you acid transactions, but it is still file-based storage. If you need sub-second queries with joins across many tables, a warehouse like BigQuery or Snowflake is still the right tool. Delta Lake is for your lakehouse — the place where raw and refined data lives before it hits the serving layer.

  5. Version history retention. By default, Delta Lake keeps 30 days of transaction history. If you need longer retention for audit or compliance, increase delta.logRetentionDuration and delta.deletedFileRetentionDuration, but be aware this increases storage costs.

What Changes in Production

For a simple demo like this, writing to a local directory and running manual restores works. In a production use case, we would also:

  • Write to cloud storage with appropriate bucket policies, versioning, and encryption
  • Set up OPTIMIZE jobs on a schedule (daily or weekly depending on write frequency) to compact small files
  • Run VACUUM periodically to clean up old Parquet files no longer referenced by any snapshot, with a retention period that matches your time travel needs
  • Monitor the transaction log size and set up checkpointing if the table gets heavy writes
  • Use structured streaming with Delta Lake if the data arrives continuously, since Delta supports exactly-once streaming writes
  • Configure retries and idempotency for merge operations, especially when multiple jobs write to the same table
  • Set up alerts for schema mismatch errors so you catch upstream changes early

For smaller teams, Delta Lake on top of S3 or GCS with a simple Spark job is a low-overhead way to get transactional guarantees without running a full warehouse. That is the sweet spot.

Wrapping Up

Delta Lake fills a very specific gap that most data engineers hit at some point: you want the reliability of a database but the scale and cost profile of a data lake. The transaction log is the key insight — once you have ordered, versioned metadata tracking every change, features like time travel and upserts become straightforward.

We could see how Delta Lake works with a few lines of PySpark, and the same patterns carry over directly to cloud deployments. If you are still writing plain Parquet files and hoping nothing breaks, give Delta Lake a try. The setup is minimal, and the first time a bad write happens and you roll back in thirty seconds instead of scrambling for backups, you will understand why it exists.

This post is licensed under CC BY 4.0 by the author.