Iceberg vs Delta Lake for beginners
In this article, let us understand Apache Iceberg and Delta Lake from a beginner's point of view: why these table formats became popular, and when you might choose one over the other. If you are building data pipelines on a data lake and do not want to manage raw parquet files by hand, this comparison is for you.
A lot of teams start with files in S3, GCS, or ADLS and think that storing parquet files is enough. It works initially, but after some time we need features like schema evolution, partition pruning, time travel, upserts, and a reliable way for multiple engines to read the same tables. That is the gap Iceberg and Delta Lake are trying to fill.
Both are table formats for the lakehouse world. They sit on top of object storage and maintain metadata in a way that query engines can understand table state properly. So instead of saying, “my table is just a folder full of parquet files,” we say, “my table has a metadata layer that knows what files belong to which snapshot and what schema and partitions are valid.”
Why not just use parquet files directly?
For simple use cases, plain parquet files are fine. But in practice we run into some problems very quickly:
- We do not have transactions, so concurrent writes can be messy.
- Renaming or dropping columns becomes hard to manage safely.
- Upserts and deletes are not straightforward.
- Query engines may need to scan many files because metadata is limited.
- We often end up building our own table management logic.
This is where these formats help. They add transactional metadata and better table management without giving up the flexibility of object storage.
Very simple difference
If I explain it in one line, I would say Delta Lake became popular first with the Spark and Databricks ecosystem, while Iceberg is often chosen when teams want broader engine interoperability. That is not the complete story, but it is a good starting point for a beginner.
| Area | Delta Lake | Apache Iceberg |
|---|---|---|
| Origin | Started by Databricks | Started by Netflix, now Apache |
| Metadata model | Transaction log with JSON and checkpoints | Snapshot-based metadata tree |
| Ecosystem strength | Very strong in Spark and Databricks | Strong across many engines |
| Engine support | Good, but some features vary outside Spark | Broad support in Trino, Flink, Spark, Snowflake and others |
| Best fit | Teams already using Databricks heavily | Teams wanting open multi-engine access |
This table is simplified, but it helps when you are just getting started.
How Delta Lake works at a high level
Delta Lake stores parquet data files and keeps a _delta_log folder alongside them. This log records every transaction on the table as actions such as adding a file, removing a file, or changing the schema, with periodic checkpoint files that summarize the log so engines do not have to replay it from the beginning. When a query engine reads the table, it uses this log to work out the current state.
A very simplified folder layout may look like this:
```
/orders
  /part-0001.parquet
  /part-0002.parquet
  /_delta_log
    /00000000000000000000.json
    /00000000000000000001.json
    /00000000000000000002.json
```
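To make the log idea concrete, here is a minimal sketch of how an engine replays commits to find the live data files. This is plain Python with no Spark, and the entries are heavily simplified: real Delta commits carry many more fields (stats, partition values, timestamps) than the bare add and remove actions shown here.

```python
# Simplified commit entries: each inner list stands for the actions
# recorded in one _delta_log JSON file, in commit order.
commits = [
    [{"add": {"path": "part-0001.parquet"}}],    # 00000000000000000000.json
    [{"add": {"path": "part-0002.parquet"}}],    # 00000000000000000001.json
    [{"remove": {"path": "part-0001.parquet"}},  # 00000000000000000002.json
     {"add": {"path": "part-0003.parquet"}}],
]

def live_files(commits):
    """Replay commits in order: 'add' makes a file live, 'remove' retires it."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(live_files(commits))  # ['part-0002.parquet', 'part-0003.parquet']
```

The important intuition is that the table state is never stored in one place; it is always the result of folding the log, which is also what makes time travel possible.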
And a write in Spark might look like this:
```python
(df.write
    .format("delta")
    .mode("overwrite")
    .save("s3://demo-lake/orders"))
```
For many teams this is attractive because the developer experience is simple, especially in Spark. You can also do merge operations quite naturally.
```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "s3://demo-lake/orders")

(target.alias("t")
    .merge(source_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```
For a pipeline that receives changed records every hour, this is quite useful.
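Stripped of Spark, the merge is doing classic upsert logic keyed on order_id. This toy sketch (plain dictionaries, hypothetical sample rows) shows the same whenMatchedUpdateAll / whenNotMatchedInsertAll semantics:

```python
# Toy upsert: matched keys are updated, unmatched keys are inserted.
target = {1: {"order_id": 1, "amount": 10.0},
          2: {"order_id": 2, "amount": 25.0}}
source = [{"order_id": 2, "amount": 30.0},   # matched -> update all columns
          {"order_id": 3, "amount": 5.0}]    # not matched -> insert

def merge(target, source, key="order_id"):
    merged = dict(target)
    for row in source:
        merged[row[key]] = row  # overwrite on match, add when new
    return merged

result = merge(target, source)
print(sorted(result))  # [1, 2, 3]
```

The real engine does the same thing at file granularity: it rewrites only the parquet files that contain matched rows and commits the swap atomically through the log.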
How Iceberg works at a high level
Iceberg also stores parquet, avro, or orc data files, but it manages table state using metadata files, manifest lists, and manifests. Instead of scanning a transaction log line by line, engines can read the current snapshot and quickly identify which files belong to the table.
A simplified view looks like this:
```
/orders
  /data/...
  /metadata
    /v1.metadata.json
    /snap-123.avro
    /manifest-1.avro
```
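Engines resolve this tree top-down: table metadata points at the current snapshot, the snapshot points at a manifest list, and each manifest lists data files. The sketch below models that shape with plain dictionaries; real Iceberg stores these as JSON and Avro files with far more detail (column stats, partition data), but the traversal logic is the same idea.

```python
# Toy metadata tree: only the structure, not the real file formats.
metadata = {
    "current-snapshot-id": 123,
    "snapshots": {
        123: {"manifest-list": ["manifest-1", "manifest-2"]},
    },
}
manifests = {
    "manifest-1": ["data/a.parquet", "data/b.parquet"],
    "manifest-2": ["data/c.parquet"],
}

def plan_files(metadata, manifests):
    """Resolve the current snapshot to its data files, no log replay needed."""
    snapshot = metadata["snapshots"][metadata["current-snapshot-id"]]
    files = []
    for manifest in snapshot["manifest-list"]:
        files.extend(manifests[manifest])
    return files

print(plan_files(metadata, manifests))
```

Because manifests also carry per-file statistics, an engine can often skip entire manifests during planning, which is one reason Iceberg scales well on tables with very many files.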
A create table example in Spark SQL can look like this:
```sql
CREATE TABLE lakehouse.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount      DECIMAL(10,2)
)
USING iceberg
PARTITIONED BY (days(order_ts));
```
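The days(order_ts) clause is a hidden partition transform: Iceberg derives the partition value from the column itself, so nobody has to maintain a separate date column. Conceptually the transform maps each timestamp to days since the Unix epoch, roughly like this sketch:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days_transform(ts: datetime) -> int:
    """Rough sketch of Iceberg's days() transform: days since 1970-01-01."""
    return (ts - EPOCH).days

print(days_transform(datetime(1970, 1, 2, tzinfo=timezone.utc)))  # 1
print(days_transform(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)))
```

Because the transform is recorded in table metadata, a filter like WHERE order_ts >= '2024-01-01' can be converted into partition pruning automatically, without the query author ever naming the partition.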
An upsert-style pattern may be written as:
```sql
MERGE INTO lakehouse.sales.orders t
USING staging.orders_delta s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
The main appeal of Iceberg is how broadly engines have adopted it. If your ingestion is in Spark, some transformations are in Flink, and ad hoc queries are in Trino, Iceberg becomes very attractive.
What beginners should compare first
When you are new to both, I think these are the practical questions to ask:
1. Which engines are you really using?
If almost everything is inside Databricks and Spark, Delta Lake is usually an easy choice. If you know multiple engines need first-class access, Iceberg is worth serious consideration.
2. Do you need open interoperability?
This matters more than many teams realize. It is easy to start with one compute engine, but over time reporting, ML, streaming, and governance tools may all want to read the same tables.
3. How complex are your write patterns?
Both support deletes, updates, and merges in many environments, but the exact maturity can differ by engine and catalog setup. Always test your actual write path rather than trusting a feature matrix.
4. How are you managing the catalog?
For production use, the catalog choice matters a lot. For example, you may use Hive Metastore, Glue Catalog, Unity Catalog, or a REST catalog. Beginners often focus only on file format and forget the catalog, but this is a big part of operability.
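For Iceberg in Spark, for example, the catalog is wired up entirely through configuration. The sketch below shows the general shape; the catalog name lakehouse and the REST URI are placeholders, it needs the iceberg-spark-runtime package on the classpath, and you should check the Iceberg docs for the exact properties your catalog type needs.

```python
# Config fragment only: wiring an Iceberg REST catalog into a Spark session.
# Catalog name and URI are placeholders, not real endpoints.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://catalog.example.com")
    .getOrCreate())
```

Once that is in place, table names like lakehouse.sales.orders resolve through the catalog, which is what lets several engines agree on what the table currently is.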
Things to be careful about
There are a few caveats that are worth knowing early.
- Just because a query engine says it supports Iceberg or Delta does not mean every advanced feature is equally mature.
- Compaction and file sizing still matter. Small files will hurt performance no matter how nice the table format is.
- Partition design still matters. A bad partition strategy can make even a good table format perform poorly.
- Metadata cleanup and maintenance jobs are needed. Snapshot expiration, vacuum, and compaction should not be ignored.
- Time travel is useful, but it also means old metadata and files may remain until cleanup runs.
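The compaction point is easy to reason about with a toy model: take many small files and bin-pack them into rewrite groups near a target size. This sketch covers only the planning step (a real compaction job would then rewrite each group into one file and commit the swap); the 128 MB target is just a common illustrative default.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into rewrite tasks near the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 20 small 10 MB files become two rewrite tasks instead of 20 tiny reads.
print(plan_compaction([10] * 20))
```

Both formats ship real versions of this (OPTIMIZE in Delta, rewrite_data_files in Iceberg), but the underlying idea is the same: fewer, larger files mean less per-file overhead at read time.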
For a demo, we might skip some of this and simply create a table and run a few merges. In production, I would also define retention policies, optimize file sizes, monitor failed commits, and clearly document which engines are allowed to write versus only read.
A simple way to decide
If you want a simple starting rule, I would use this:
- Choose Delta Lake if your platform is already centered around Databricks or Spark and you want a smooth developer experience.
- Choose Iceberg if you want an open table format that works well across multiple engines and you expect that flexibility to matter.
This is not a religious choice. Both are solid technologies. The better choice depends more on your platform constraints than on feature marketing.
Conclusion
For beginners, the most important thing is to understand that Iceberg and Delta Lake are solving the same broad problem, which is making data lake tables reliable and manageable. Delta Lake often feels simpler when you are already in the Spark world. Iceberg becomes very compelling when you care about engine interoperability and a more open ecosystem. Start with your actual workloads, tools, and team needs, then choose the one that reduces operational pain rather than the one with the nicest comparison chart.
