
Iceberg vs Delta Lake: A Practical Guide for Data Engineers

If you have spent any time in the data engineering space over the last couple of years, you have probably heard about table formats — Delta Lake, Apache Iceberg, Apache Hudi. They all sound like they do similar things, and honestly, at a high level, they do. But when you are the one who has to pick one for a project or migrate an existing pipeline, the differences start to matter.

In this article, we will look at Iceberg and Delta Lake side by side. This is not a deep-dive into internals. It is a practical guide for someone who needs to understand what each one brings, where they work well, and what trade-offs you are making. There is already a Delta Lake basics post on this blog, so we will not rehash the fundamentals of Delta — instead, we will focus on how Iceberg compares and when you would reach for one over the other.

What problem do table formats solve?

Before jumping into the comparison, let us get one thing straight. Parquet, ORC, Avro — these are file formats. A table format sits one level above. It gives you things like:

  • ACID transactions on data lakes
  • Time travel (query data as it looked at some earlier point)
  • Schema evolution without breaking readers
  • Partition evolution (change partition columns without rewriting everything)
  • Efficient upserts and deletes

Without a table format, your data lake is just a pile of files. With one, it starts behaving more like a database table — but still on cheap object storage. That is the pitch, and it is a good one.

Delta Lake: The Databricks-native option

Delta Lake was created by Databricks and open-sourced in 2019, the same year the project moved to the Linux Foundation. It is deeply embedded in the Databricks ecosystem, but you can use it anywhere you run Spark, and increasingly with other engines too.

If you work in a Databricks shop, Delta Lake is the default, and it works really well. Creating a Delta table is as simple as:

CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  order_date DATE
) USING DELTA
LOCATION 's3://my-bucket/orders/';

Once the table exists, you get ACID guarantees, time travel via VERSION AS OF, and MERGE statements that work the way you would expect from a relational database.
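As a rough sketch of what that looks like in Spark SQL: the version number, the timestamp, and the order_updates staging table below are made up for illustration, and the VERSION AS OF syntax assumes a reasonably recent Delta and Spark release.

```sql
-- List the table's commits to find a version number or timestamp to read
DESCRIBE HISTORY orders;

-- Time travel: read the table as of an earlier version or point in time
-- (3 and the timestamp are placeholder values)
SELECT * FROM orders VERSION AS OF 3;
SELECT * FROM orders TIMESTAMP AS OF '2024-01-15 00:00:00';

-- Upsert from a staging table. order_updates is hypothetical and is
-- assumed to have the same columns as orders.
MERGE INTO orders AS t
USING order_updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

The `UPDATE SET *` and `INSERT *` shorthands only work when the source and target column names line up; otherwise, spell out the column assignments.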

One thing I have noticed in practice: the Delta Spark connector is mature and rarely surprises you. If you are doing standard ETL with Spark, it is hard to go wrong with Delta. The OPTIMIZE command is also genuinely useful: it compacts small files and can co-locate related data using Z-ordering, which lets queries that filter on those columns skip more files.

OPTIMIZE orders ZORDER BY (customer_id);

Apache Iceberg: The engine-agnostic one

Iceberg started at Netflix and now lives under the Apache Software Foundation. The big difference is that Iceberg was built from day one to work with multiple query engines — Spark, Trino, Presto, Flink, Hive, and more. It is not tied to any single processing engine or vendor.

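This engine-agnostic design shows up in how Iceberg tracks table metadata. Instead of a transaction log that was designed around Spark (Delta does this with its _delta_log directory), Iceberg uses a layered metadata tree: a table metadata file points to manifest lists (one per snapshot), each manifest list points to manifest files, and the manifest files list the data files. The pointer to the current metadata file is held by the catalog (tables that live directly on a file system without a catalog use a version-hint file instead), and a commit is an atomic swap of that pointer. This means any engine that implements the Iceberg spec can read and write the same table without stepping on each other.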
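One way to see those layers for yourself is through the metadata tables that Iceberg exposes in Spark. The sketch below assumes the table is registered as my_catalog.db.orders; the catalog and namespace names are placeholders, not something from this post.

```sql
-- Snapshots: one row per commit, each pointing at its manifest list
SELECT snapshot_id, committed_at, operation, manifest_list
FROM my_catalog.db.orders.snapshots;

-- Manifests referenced by the current snapshot
SELECT path, added_data_files_count, existing_data_files_count
FROM my_catalog.db.orders.manifests;

-- Data files tracked by those manifests
SELECT file_path, record_count, file_size_in_bytes
FROM my_catalog.db.orders.files;
```

Because these are ordinary queries, they are also a handy way to check how many small files a table has accumulated.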

Creating an Iceberg table in Spark looks similar:

CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  order_date DATE
) USING ICEBERG
LOCATION 's3://my-bucket/orders/';

But the real power comes when a Trino user queries that same table with nothing more than a catalog pointing at it, or when a Flink job writes streaming data into it. That is not something Delta Lake handles as smoothly today, though Delta is improving on this front with UniForm.
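For illustration, suppose a Trino catalog named iceberg has been set up against the same metastore and the table sits in a schema called sales (both names are assumptions). The Trino side is then plain SQL:

```sql
-- Trino: aggregate over the table that Spark created.
-- iceberg and sales are placeholder catalog and schema names.
SELECT customer_id, sum(amount) AS total_spent
FROM iceberg.sales.orders
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;
```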

Head-to-head comparison

Here is a quick table for the things that matter day-to-day:

| Feature | Delta Lake | Apache Iceberg |
| --- | --- | --- |
| Primary engine support | Spark (excellent), others improving | Spark, Trino, Flink, Presto, Hive, Snowflake, BigQuery |
| Time travel | Yes (version number or timestamp) | Yes (snapshot ID or timestamp) |
| Schema evolution | ADD, ALTER columns; RENAME and DROP need column mapping enabled | ADD, DROP, RENAME, REORDER columns |
| Partition evolution | Limited (changing partition columns means rewriting the table) | Yes (metadata-only change) |
| Compaction / optimize | OPTIMIZE with Z-order | Rewrite data files (manual or via engine) |
| Row-level deletes | Standard MERGE | MERGE plus copy-on-write or merge-on-read delete files |
| Streaming writes | Spark Structured Streaming | Flink, Spark Structured Streaming |
| Catalog options | Unity Catalog, Hive Metastore | Hive, Glue, Nessie, JDBC, REST catalog |
| Vendor backing | Databricks | Netflix, Apple, Snowflake, AWS, Google, Dremio, many others |
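To make the schema evolution row concrete, here is a minimal sketch of the Iceberg side in Spark SQL. The my_catalog.db prefix and the currency column are placeholders. Iceberg tracks columns by ID rather than by name or position, so these are metadata-only changes that do not rewrite data files.

```sql
-- Add a new column; existing rows read it as NULL
ALTER TABLE my_catalog.db.orders ADD COLUMN currency STRING;

-- Rename a column without touching the data files
ALTER TABLE my_catalog.db.orders RENAME COLUMN amount TO order_amount;

-- Reorder: move currency so it sits right after order_amount
ALTER TABLE my_catalog.db.orders ALTER COLUMN currency AFTER order_amount;

-- Drop the column again
ALTER TABLE my_catalog.db.orders DROP COLUMN currency;
```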

Where each one shines

Pick Delta Lake when:

  • Your stack is Databricks or Spark-heavy. You will have the smoothest experience.
  • You want one vendor to own the roadmap and support. Databricks drives Delta, and the integration is tight.
  • You are doing classic batch ETL with occasional MERGE operations. Delta handles this well out of the box.
  • Your team does not want to think about catalogs too much. Unity Catalog (if on Databricks) or Hive Metastore works.

Pick Iceberg when:

  • You have multiple query engines hitting the same data. This is the original Iceberg use case.
  • You are not married to Databricks and want vendor flexibility. Iceberg has broad industry support.
  • You need partition evolution — changing partition columns on a live table is huge for production data that grows over time.
  • You are using Snowflake, BigQuery, or Athena as your query layer. All of these now support Iceberg tables natively.
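As a sketch of what partition evolution looks like in practice, the statements below change the partitioning of an Iceberg table in Spark. They assume Iceberg's Spark SQL extensions are enabled and that the table is registered as my_catalog.db.orders (a placeholder name).

```sql
-- Start partitioning new writes by month of order_date
ALTER TABLE my_catalog.db.orders ADD PARTITION FIELD month(order_date);

-- Later, as volume grows, switch new writes to daily partitions
ALTER TABLE my_catalog.db.orders DROP PARTITION FIELD month(order_date);
ALTER TABLE my_catalog.db.orders ADD PARTITION FIELD day(order_date);
```

Existing files keep the layout they were written with; only data written after the change uses the new spec, which is why no rewrite is needed.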

Practical limitations and things to watch out for

Delta Lake limitations:

  • If you are outside the Databricks ecosystem, some features lag. The open-source Delta Spark connector does not always keep pace with the Databricks version.
  • Multi-engine reads with concurrent writes can be tricky. Delta’s protocol assumes Spark-like semantics for conflict resolution.
  • The Delta transaction log is a single directory of JSON commit files, plus periodic Parquet checkpoints. At very high transaction rates (streaming with tiny batches), this can become a bottleneck.

Iceberg limitations:

  • The catalog story can feel fragmented. Do you use Hive Metastore? Glue? Nessie? JDBC? The REST catalog spec is meant to fix this, but it is still stabilising.
  • File compaction is more manual. Delta has OPTIMIZE built in; with Iceberg you will likely need to schedule a compaction job yourself.
  • Documentation and community are smaller than Delta’s. You will find fewer blog posts and Stack Overflow answers when you hit a weird edge case.
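For the compaction point, the job you schedule can be as small as a couple of stored procedure calls from Spark. This is a sketch under assumptions: it needs Iceberg's Spark SQL extensions, my_catalog and db.orders are placeholder names, and the cutoff timestamp is arbitrary.

```sql
-- Compact small data files into larger ones
CALL my_catalog.system.rewrite_data_files(table => 'db.orders');

-- Expire old snapshots so the files they reference can be cleaned up.
-- Pick a cutoff that still covers your time travel needs.
CALL my_catalog.system.expire_snapshots(
  table => 'db.orders',
  older_than => TIMESTAMP '2024-01-01 00:00:00'
);
```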

What changes in production

If you are prototyping or doing a proof of concept, either format will work fine on a few hundred gigabytes. The differences only start to matter when:

  • You have multiple teams querying the same tables with different engines (Iceberg wins).
  • You are doing high-frequency streaming writes with concurrent readers (test both — the answer depends on your exact workload).
  • You need to change partitioning on a live multi-TB table without downtime (Iceberg’s partition evolution is genuinely better here).
  • You care about catalog governance, access control, and audit. Delta with Unity Catalog is hard to beat on Databricks. If you are not on Databricks, Iceberg plus a catalog like Polaris or Glue is a reasonable alternative.

One thing I would recommend: do not make the decision based on a feature checklist. Spin up a small dataset, write some queries, simulate a schema change, do a rollback. See what the actual workflow feels like. A feature that exists on paper but is painful to use does not count.
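A rollback drill along those lines might look like the sketch below. The version number and snapshot ID are placeholders to be replaced with values from your own table history, and the Iceberg call assumes the Spark SQL extensions and a catalog named my_catalog.

```sql
-- Delta: find the version to go back to, then restore it
DESCRIBE HISTORY orders;
RESTORE TABLE orders TO VERSION AS OF 5;

-- Iceberg: find the snapshot to go back to, then roll back to it
SELECT snapshot_id, committed_at
FROM my_catalog.db.orders.snapshots
ORDER BY committed_at;

CALL my_catalog.system.rollback_to_snapshot('db.orders', 1234567890123456789);
```

Run it right after a deliberate bad write or schema change, then check what readers see on each side.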

Wrapping up

Delta Lake and Iceberg are both solid, and the gap between them is narrowing. The choice matters less than it did two years ago, and both projects are improving fast. If your stack is Spark-centric, start with Delta. If you need multi-engine access or want to avoid vendor lock-in, Iceberg is a strong choice. Either way, moving from raw Parquet files on S3 to a proper table format is the bigger win — so do not overthink it.

This post is licensed under CC BY 4.0 by the author.