Post

Medallion Architecture Explained: A Practical Guide to Bronze, Silver, and Gold Layers

In this article, let us look at the Medallion Architecture — a data design pattern that organises your data lake into three layers: Bronze, Silver, and Gold. If you have heard terms like “bronze table,” “silver layer,” or “gold data” thrown around in data engineering discussions and wondered what they actually mean in practice, this article is for you.

We will walk through what each layer does, why you would structure your data this way, and then build a small example using Delta Lake and PySpark to see it in action. By the end, you should have a clear idea of when this pattern helps and when it might be overkill.

Where Did Medallion Architecture Come From?

The term was popularised by Databricks, but the idea itself is not new. If you have worked with data warehouses before, you have probably seen similar patterns — raw staging tables feeding into cleansed tables which then feed into aggregated reporting tables. The Medallion Architecture just gives it a name and structures it specifically for data lakes and lakehouses.

The core idea is simple: instead of having one massive table where raw, cleaned, and aggregated data all sit together, you split your data processing into distinct stages. Each stage has a clear purpose and produces tables of increasing quality and business value.

The Three Layers

Bronze Layer — Raw Ingestion

The Bronze layer is where your data lands first. Think of it as the landing zone. You ingest data from source systems as-is, without any transformation. No cleaning, no filtering, no schema changes. The goal is to have an immutable record of exactly what the source gave you.

In practice, Bronze tables usually:

  • Retain the original schema and data types from the source
  • Append new data rather than overwriting (you want the full history)
  • May include additional metadata columns like ingested_at or source_file_name

A typical Bronze table for, say, sales transactions coming from a PostgreSQL database might look like this:

1
2
3
4
5
6
7
+------------+----------+--------+-------------------+
| order_id   | amount   | status | ingested_at       |
+------------+----------+--------+-------------------+
| ORD-001    | 150.00   | shipped| 2025-06-17 10:00  |
| ORD-002    | NULL     | pending| 2025-06-17 10:00  |
| ORD-003    | -50.00   | shipped| 2025-06-17 10:00  |
+------------+----------+--------+-------------------+

Notice the problems? NULL amounts, negative values where they should not be, and no standardisation. That is fine — Bronze does not judge. It just stores.

Silver Layer — Cleaned and Enriched

The Silver layer is where you start making sense of your data. You read from Bronze, apply cleaning rules, standardise formats, deduplicate, and join with reference data. The output is still at a relatively granular level, but the data is now trustworthy enough for most analytical queries.

In our sales example, the Silver table might:

  • Replace NULL amounts with 0 or flag them for review
  • Filter out records with negative amounts (or send them to a quarantine table)
  • Standardise the status field to lowercase
  • Join with a product or customer dimension table to enrich the data

After processing, the same data might look like:

1
2
3
4
5
6
+------------+----------+---------+--------------+-------------------+
| order_id   | amount   | status  | customer_name| processed_at      |
+------------+----------+---------+--------------+-------------------+
| ORD-001    | 150.00   | shipped | Acme Corp    | 2025-06-17 11:00  |
| ORD-002    | 0.00     | pending | Globex Inc   | 2025-06-17 11:00  |
+------------+----------+---------+--------------+-------------------+

ORD-003 with the negative amount has been quarantined. ORD-002’s NULL amount is now 0 and flagged. Data is clean and ready to use.

Gold Layer — Business-Level Aggregates

The Gold layer is where business logic lives. Gold tables are typically aggregated, joined, and shaped specifically for a business use case — a dashboard, a report, or a machine learning feature set. They are optimised for query performance, not for storage efficiency.

A Gold table for a daily sales dashboard might look like:

1
2
3
4
5
6
+------------+-------------+-----------------+
| order_date | total_sales | orders_shipped  |
+------------+-------------+-----------------+
| 2025-06-16 | 12500.50    | 42              |
| 2025-06-17 | 8700.00     | 31              |
+------------+-------------+-----------------+

This table is small, fast to query, and directly answers a business question: “How much did we sell yesterday?”

Medallion Architecture vs Other Approaches

Here is a quick comparison of how the Medallion Architecture stacks up against other common data organisation patterns:

PatternBest ForMain Drawback
Medallion (Bronze/Silver/Gold)Lakehouse environments, incremental processingAdds latency; three stages of ETL
Traditional ETL (Staging → DW → Marts)Mature warehouse setupsLess flexible for ML and unstructured data
ELT (Load raw, transform in-warehouse)Cloud warehouses like BigQuery, SnowflakeCan get expensive if transforms are heavy
One Big Table (OBT)Simple, denormalised reportingPainful to maintain as data sources grow

The Medallion Architecture sits somewhere between traditional ETL and ELT. You land data as-is (like ELT) but process it in defined stages (like ETL). It works particularly well in lakehouse environments where you have both structured and unstructured data and need to serve both analysts and ML engineers from the same platform.

A Practical Example with Delta Lake and PySpark

Let us build a small Medallion pipeline. We will use Delta Lake on top of a data lake (could be S3, ADLS, or GCS) and process data using PySpark.

First, let us set up the Bronze table. We are ingesting order data from CSV files that land in a cloud storage bucket.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Bronze layer — raw ingestion
bronze_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("gs://raw-data-bucket/orders/2025/06/17/*.csv")

bronze_df = bronze_df \
    .withColumn("ingested_at", current_timestamp()) \
    .withColumn("source_file", input_file_name())

bronze_df.write \
    .format("delta") \
    .mode("append") \
    .save("/delta/bronze/orders")

Nothing fancy here. We just read the files and add a couple of metadata columns. The key decision is .mode("append") — we want to keep every record that comes in.

Now the Silver layer. Here we apply cleaning and business rules:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# Silver layer — clean and standardise
bronze_df = spark.read.format("delta").load("/delta/bronze/orders")

from pyspark.sql.functions import col, when, lower

silver_df = bronze_df \
    .withColumn("amount", when(col("amount").isNull(), 0)
                .otherwise(col("amount"))) \
    .withColumn("status", lower(col("status"))) \
    .filter(col("amount") >= 0) \
    .withColumn("processed_at", current_timestamp())

# Deduplicate based on order_id, keeping the latest record
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("order_id").orderBy(col("ingested_at").desc())

silver_df = silver_df \
    .withColumn("rn", row_number().over(window_spec)) \
    .filter(col("rn") == 1) \
    .drop("rn")

silver_df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/delta/silver/orders")

Notice we are using .mode("overwrite") here — unlike Bronze, Silver tables are typically fully recomputed or merged, since the cleaning logic can change over time. In production you would use merge instead of overwrite to handle incremental updates more efficiently.

Finally, the Gold layer — an aggregated view for a daily sales dashboard:

1
2
3
4
5
6
7
8
9
10
11
12
-- Gold layer — daily sales summary
CREATE OR REPLACE TABLE delta.`/delta/gold/daily_sales`
USING DELTA
AS
SELECT
    date(processed_at) AS order_date,
    SUM(amount) AS total_sales,
    COUNT(CASE WHEN status = 'shipped' THEN 1 END) AS orders_shipped,
    COUNT(*) AS total_orders
FROM delta.`/delta/silver/orders`
GROUP BY date(processed_at)
ORDER BY order_date;

You could also write this in PySpark if you prefer. I used SQL here because once the data is clean in Silver, SQL becomes the easiest way to build Gold aggregates.

Things to Watch Out For

Storage costs. You are storing the same data three times (or more if you keep history in Bronze). With Delta Lake’s time travel and versioning, the actual storage footprint can grow quickly. Use retention policies — set delta.logRetentionDuration and delta.deletedFileRetentionDuration to reasonable values. Do not keep Bronze data forever unless you have a compliance reason.

Latency. Every layer adds processing time. If your use case needs sub-minute freshness on dashboards, a three-layer Medallion pipeline might not be the right fit. Consider whether you can combine Silver and Gold for real-time use cases, or use streaming for the layers that need freshness.

Not every dataset needs all three layers. If you have a small reference table (say, a list of country codes), going Bronze → Silver → Gold is ridiculous. Use common sense. Some datasets might skip Bronze entirely if they come from a trusted source. Some might stop at Silver if nobody needs aggregated views.

Schema evolution. Source systems change their schemas. Your Bronze layer handles this fine (just append with mergeSchema), but your Silver transformation logic needs to be aware of it. If a new column appears in Bronze, your Silver job should not silently drop it unless you have made that choice explicitly.

Production Considerations

In a production setup, you would want:

  1. Orchestration. Wire the layers together using Airflow, Databricks Workflows, or Step Functions. Bronze runs first, then Silver, then Gold. Each depends on the previous one completing successfully.

  2. Data quality checks. Add validation steps between layers. Tools like Great Expectations or Delta Live Tables expectations can catch issues before they propagate. If Bronze data is 90% NULLs on a given day, you probably want to halt the Silver job and investigate rather than silently writing garbage.

  3. Incremental processing. Reading the full Bronze table every time you run Silver does not scale. Use Delta Lake’s change data feed or maintain a watermark column to process only new or changed records.

  4. Access control. Your analysts probably only need Silver and Gold. Your data scientists might want Bronze for raw feature engineering. Set up table-level access controls in your catalog (Unity Catalog, AWS Glue, etc.) to keep things tidy and secure.

  5. Monitoring and alerting. If the Bronze ingestion fails, you need to know. Set up alerts on pipeline failures and also on data volume anomalies (e.g., Bronze received 10 records instead of the usual 10,000).

Wrapping Up

The Medallion Architecture is not magic — it is just a disciplined way of organising your data processing. Bronze gives you a safety net (you can always re-process from raw data), Silver gives you clean and reliable data for daily work, and Gold gives you business-ready views optimised for consumption.

Like any pattern, it works well when it fits your problem and becomes overhead when it does not. Start simple — maybe just Bronze and Silver to begin with — and add layers as your team and use cases grow. The architecture should serve you, not the other way around.

This post is licensed under CC BY 4.0 by the author.