
Using Databricks Delta tables for analytics pipelines

In this article, let's walk through how to use Databricks Delta tables in analytics pipelines and why this approach is useful when you want something more reliable than plain parquet files. If you are building ingestion jobs, transformation layers, or incremental reporting datasets, Delta tables give you a practical middle ground between a raw data lake and a fully managed warehouse experience.

For our use case, we will keep it simple. We will assume data arrives from an application database or API as files landing in cloud storage on a schedule. We load them into a bronze table, clean and standardize the data in silver, and finally create a gold table that reporting jobs can use. This is not the only pattern, but it is one I have seen teams understand quickly and extend later.

Why Delta tables help in real pipelines

A lot of teams start with parquet because it is easy. Spark can read it and write it, and most cloud platforms support it. The problems start once conditions are less ideal: files arrive late, schemas change, a job partially fails, or you need to update existing rows instead of just appending new ones.

This is where Delta tables become useful. Delta adds a transaction log on top of your files, which enables reliability features that are awkward to build by hand. In practice, the things I find most useful are:

  1. ACID-style writes for concurrent or retryable jobs
  2. Easier upserts and deletes with MERGE
  3. Time travel when you need to inspect an earlier table state (a short example follows below)
  4. Better schema management compared to unmanaged files
  5. Cleaner batch and streaming patterns on the same table format

It is not magic, but it removes a lot of custom pipeline logic.
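
To make the time travel point concrete, here is a minimal sketch of how I might inspect an earlier state of a table, using the bronze table we create later in this post as the example. The version number is only a placeholder; in practice you would pick it from the table history.

from delta.tables import DeltaTable

# Look at recent commits to find the version or timestamp you care about
history_df = DeltaTable.forName(spark, "bronze.orders_raw").history(10)
history_df.select("version", "timestamp", "operation").show(truncate=False)

# Read the table as it looked at an earlier version (5 is only an example)
previous_df = spark.sql("SELECT * FROM bronze.orders_raw VERSION AS OF 5")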

A simple bronze, silver, gold layout

A practical pattern is to keep three layers.

Layer  | Purpose                  | Typical data shape
Bronze | Raw ingestion            | Minimal changes, mostly append
Silver | Cleaned and standardized | Deduplicated, typed, validated
Gold   | Business-ready outputs   | Aggregated or modeled for reporting

For example, let us say we receive order files every hour in cloud storage. The raw file might contain duplicate rows, string timestamps, and a few optional columns that appear only for some records.

A bronze load can stay simple.

from pyspark.sql.functions import current_timestamp

bronze_df = spark.read.json("/mnt/raw/orders/2025/01/23/*.json")

(bronze_df
  .withColumn("ingestion_time", current_timestamp())
  .write
  .format("delta")
  .mode("append")
  .saveAsTable("bronze.orders_raw"))

Here I am not trying to fix everything during ingestion. The goal is to land the data reliably first. If the source sends something unexpected, I still want the raw copy available.
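
If the raw files can contain malformed lines, I sometimes make that handling explicit rather than relying on reader defaults, so bad lines end up in a dedicated column where they are easy to spot. A small sketch, assuming JSON input; the column name is just a convention.

raw_df = (spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  # Unparsable lines are kept, with their raw text in _corrupt_record
  .json("/mnt/raw/orders/2025/01/23/*.json"))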

Cleaning data in silver tables

The silver layer is where I usually standardize data types, remove obvious bad rows, and deduplicate records based on a business key.

CREATE OR REPLACE TABLE silver.orders_clean AS
SELECT
  CAST(order_id AS STRING) AS order_id,
  CAST(customer_id AS STRING) AS customer_id,
  TO_TIMESTAMP(order_ts) AS order_timestamp,
  CAST(order_total AS DECIMAL(12,2)) AS order_total,
  status,
  ingestion_time
FROM bronze.orders_raw
WHERE order_id IS NOT NULL
  AND order_total IS NOT NULL;

If the source can resend the same order multiple times, then I would not stop at a plain create table statement. I would deduplicate before exposing this table to downstream users.

CREATE OR REPLACE TEMP VIEW latest_orders AS
-- EXCEPT (rn) drops the helper column so a later INSERT * matches the target schema
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingestion_time DESC) AS rn
  FROM silver.orders_clean
) t
WHERE rn = 1;

This pattern is simple, and it works well when the business key is clear.

Upserts are one of the biggest wins

One reason teams move to Delta is that updating existing records becomes much easier. If an order status changes from PENDING to SHIPPED, we usually do not want duplicate rows in the curated layer.

With Delta, we can use MERGE.

MERGE INTO silver.orders_current AS target
USING latest_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
  target.customer_id = source.customer_id,
  target.order_timestamp = source.order_timestamp,
  target.order_total = source.order_total,
  target.status = source.status,
  target.ingestion_time = source.ingestion_time
WHEN NOT MATCHED THEN INSERT *;

In a parquet-only setup, this kind of logic usually becomes a full overwrite or a more complicated rewrite pattern. Delta makes it much more natural for analytics engineering work where records change over time.
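
The same upsert can also be expressed through the Python DeltaTable API, which is convenient when the rest of the job already lives in PySpark. A minimal sketch, assuming the latest_orders view from the previous section is still registered:

from delta.tables import DeltaTable

source_df = spark.table("latest_orders")
target = DeltaTable.forName(spark, "silver.orders_current")

(target.alias("t")
  .merge(source_df.alias("s"), "t.order_id = s.order_id")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute())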

Schema evolution is helpful, but be careful

Schema evolution is one of those features that sounds very convenient, and it is, but it can also hide bad upstream changes if you are too relaxed with it.

For example, if a new nullable column appears from the source, allowing schema evolution may be fine. But if a critical field changes type unexpectedly, you may want the job to fail rather than quietly accept something that breaks later.

A typical write pattern may look like this:

(clean_df
  .write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("silver.orders_current"))

For a demo or a fast-moving internal project, this may be acceptable. In production, I would usually put expectations around which columns may evolve and alert on the rest. It is better to treat schema evolution as controlled flexibility, not unlimited freedom.
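
One way to keep that control is a small guard before the write that fails when a critical column arrives with an unexpected type. This is only a sketch: the column list is an example, error handling would depend on your pipeline, and clean_df is the DataFrame from the write pattern above.

# Columns that should never change type silently (example list)
critical_columns = {"order_id": "string", "order_total": "decimal(12,2)"}

incoming_types = {f.name: f.dataType.simpleString() for f in clean_df.schema.fields}

for column, expected_type in critical_columns.items():
    actual_type = incoming_types.get(column)
    if actual_type != expected_type:
        raise ValueError(
            f"Column '{column}' arrived as '{actual_type}', expected '{expected_type}'"
        )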

Building a gold table for reporting

Once the silver table is stable, the gold layer becomes straightforward. This is where we shape data for dashboards, finance reporting, or downstream extracts.

CREATE OR REPLACE TABLE gold.daily_order_metrics AS
SELECT
  DATE(order_timestamp) AS order_date,
  COUNT(*) AS order_count,
  SUM(order_total) AS total_sales,
  AVG(order_total) AS average_order_value
FROM silver.orders_current
WHERE status NOT IN ('CANCELLED')
GROUP BY DATE(order_timestamp);

At this point, analysts and BI tools do not need to know about the raw ingestion quirks. They just query the curated layer. That separation is one of the main reasons this setup ages better than letting everyone read directly from raw files.

Things to watch out for

Delta tables are useful, but there are still practical caveats.

  1. Small file problems still exist if your jobs write too frequently with tiny batches (a maintenance sketch follows this list).
  2. MERGE can become expensive on very large tables if partitioning and table design are poor.
  3. Vacuum and retention settings need care, especially if you rely on time travel for debugging.
  4. Teams sometimes mix raw, curated, and ad hoc workloads in the same table design, which becomes messy quickly.
  5. Governance still matters. Delta does not automatically solve naming, lineage, or access control.
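
For the first and third points, the usual mitigation is a scheduled maintenance job rather than extra logic inside the pipeline. A minimal sketch, with the table name and retention window as examples:

# Compact small files created by frequent, tiny writes
spark.sql("OPTIMIZE silver.orders_current")

# Remove unreferenced files older than 7 days; newer history stays available
# for time travel and debugging
spark.sql("VACUUM silver.orders_current RETAIN 168 HOURS")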

I have also noticed that beginners sometimes use Delta features everywhere without asking whether the table actually needs them. Some datasets are append-only and simple. For those, the value is mostly operational consistency rather than advanced merge logic.

What I would change in production

For a simple demo, a notebook job writing bronze, silver, and gold tables is enough. In production, I would add a few more controls:

  • Partition strategy based on real query patterns
  • Data quality checks before promoting records into silver (a small sketch follows this list)
  • Monitoring for failed writes, late arrivals, and schema drift
  • Clear retention rules for bronze versus curated layers
  • Separate service principals or clusters for ingestion and transformation jobs
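
As an illustration of the data quality point, a promotion check can be as simple as counting rows that break basic expectations and failing the job when the share is too high. A sketch, with the conditions and threshold as placeholders:

from pyspark.sql import functions as F

candidate_df = spark.table("bronze.orders_raw")

# Rows that violate the most basic expectations for this dataset
bad_rows = candidate_df.filter(
    F.col("order_id").isNull() | F.col("order_total").isNull()
).count()
total_rows = candidate_df.count()

# 1% is only an example threshold; pick one that matches the data
if total_rows > 0 and bad_rows / total_rows > 0.01:
    raise ValueError(f"{bad_rows} of {total_rows} rows failed basic quality checks")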

If the platform supports it, I would also think about using Unity Catalog, job orchestration, and expectations or validation frameworks so that the tables do not become just another pile of files with better marketing. The table format helps, but the operating model still matters.

Conclusion

Delta tables are a practical option when your analytics pipelines need more than basic file appends but you still want the flexibility of a lake-based approach. For me, the biggest benefits are reliable incremental loads, easier upserts, and clearer layering between raw and curated data. If you keep the design simple and stay careful with schema and file layout, Delta tables can make your Databricks pipelines much easier to maintain over time.
