
Medallion architecture explained simply

In this article, let us look at medallion architecture in simple terms: what it means, why teams use it, and how you can implement it without turning a small pipeline into a complicated platform project. If you keep hearing terms like bronze, silver, and gold in Databricks, lakehouse, or warehouse discussions, the basic idea is actually not difficult. It is just a way of organizing data so raw inputs, cleaned data, and business-ready outputs do not get mixed together.

For our use case, let us assume we receive daily order files from an operational system. We want to land the data first, clean it, remove duplicates, and then publish a table that BI users can query. That is exactly the kind of flow where medallion architecture makes sense.

What medallion architecture really means

At a high level, medallion architecture separates data into layers:

  • Bronze for raw data as received
  • Silver for cleaned and standardized data
  • Gold for business-ready aggregates or curated tables

People sometimes present it as a strict framework, but it is easier to treat it as a pattern. The real purpose is to avoid doing everything in one step. If raw files, cleaned records, and dashboard outputs all write into the same place, then debugging gets painful very quickly.

With layered data, we can answer basic questions much faster:

  • What did we actually receive from the source?
  • What rules did we apply during cleanup?
  • Which table should analysts use?

That separation is the main value. The names bronze, silver, and gold are just a convenient way to talk about it.

A simple example

Let us say we receive a CSV file every day with order data.

order_id,customer_id,order_ts,product_id,quantity,price,status
1001,C001,2025-01-10T09:01:00Z,P100,2,35.00,PAID
1002,C004,2025-01-10T09:05:00Z,P220,1,18.50,PAID
1002,C004,2025-01-10T09:05:00Z,P220,1,18.50,PAID
1003,C010,2025-01-10T09:10:00Z,P310,1,,PENDING

This file already shows why we need layers. We have a duplicate row and a missing price. If we directly expose this data to reporting users, they will get wrong numbers or start creating their own cleanup logic in every dashboard.

What each layer would contain

A simple implementation could look like this:

Layer  | Purpose                      | Example table       | What happens here
Bronze | Keep source data as received | bronze.orders_raw   | Minimal changes, maybe add load timestamp and file name
Silver | Clean and standardize        | silver.orders_clean | Deduplication, type casting, null handling, basic validation
Gold   | Serve business use cases     | gold.daily_sales    | Aggregations, business metrics, reporting-friendly models

That is all medallion architecture is in practice. We move from raw to trusted to consumable.

Bronze layer

In the bronze layer, I usually try not to be clever. The point is to preserve what came from the source. That helps later when someone asks why a record disappeared or why a value changed.

A bronze table might add a few operational columns:

create table bronze.orders_raw as
select
  *,
  current_timestamp() as ingested_at,
  'orders_2025_01_10.csv' as source_file
from landing.orders_file;

If you are using Spark or a lakehouse setup, you might write this directly from object storage into a Delta or Parquet table. If you are in a warehouse world, you may load the file into a raw schema first. Same idea, different storage.

One thing to be careful about is over-transforming bronze. If you fix every issue at this layer, you lose a faithful copy of the source, and for debugging that raw copy is very useful.

Silver layer

This is where the actual cleanup starts. In the silver layer, we convert data types, remove obvious bad records, deduplicate, and standardize columns so downstream users are not dealing with source-system mess.

A simple silver transformation could look like this:

with ranked_orders as (
  select
    order_id,
    customer_id,
    cast(order_ts as timestamp) as order_ts,
    product_id,
    cast(quantity as int) as quantity,
    cast(price as decimal(10,2)) as price,
    upper(status) as status,
    ingested_at,
    row_number() over (partition by order_id order by ingested_at desc) as rn
  from bronze.orders_raw
)
select
  order_id,
  customer_id,
  order_ts,
  product_id,
  quantity,
  price,
  status
from ranked_orders
where rn = 1
  and price is not null;

This is also where business rules begin, but I try to keep silver focused on data quality and standard structure rather than final reporting logic. For example, renaming columns, fixing timestamp formats, filtering impossible records, and keeping one row per business key are all normal silver tasks.

If there are records you do not want to lose completely, it is often better to quarantine them into a separate rejected table instead of just dropping them. In production, this becomes important because operations teams need to track data quality issues.
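As a rough sketch, a quarantine step can be a simple insert that routes failing rows into a side table with a reason attached. The table name silver.orders_rejected and the reject_reason value here are assumptions for illustration, not names from the original pipeline:

```sql
-- Hypothetical quarantine step: rows failing the price check go to a
-- rejected table with a reason, instead of being silently dropped.
-- Table and column names are illustrative assumptions.
insert into silver.orders_rejected
select
  *,
  'missing_price' as reject_reason,
  current_timestamp() as rejected_at
from bronze.orders_raw
where price is null;
```

Operations teams can then monitor this table and decide whether rejected rows should be fixed at the source or reprocessed later.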

Gold layer

The gold layer is what analysts, dashboards, and downstream applications usually consume. These tables should represent business meaning more clearly than the upstream layers.

For example, a daily sales summary could be:

create table gold.daily_sales as
select
  cast(order_ts as date) as order_date,
  count(distinct order_id) as total_orders,
  sum(quantity * price) as gross_sales
from silver.orders_clean
where status = 'PAID'
group by cast(order_ts as date);

Now the reporting team does not need to know how duplicates were removed or how null prices were handled. They just query a business-ready table.

When this pattern helps

I find medallion architecture useful when:

  • source data is messy or changes over time
  • several teams use the same data for different purposes
  • you want clearer ownership between ingestion and reporting
  • debugging source-to-report issues takes too long in the current setup

If your project is extremely small, you may not need all three layers physically separated on day one. Sometimes one raw table and one curated table are enough. The pattern still helps, but you do not need to force extra layers just to follow a trend.

Things to be careful about

There are a few common mistakes with this approach.

First, teams sometimes copy everything through many layers with no real reason. That increases storage and maintenance cost. Each layer should have a purpose.

Second, the silver layer can become a dumping ground for every transformation. Then nobody knows what silver actually means. I prefer keeping silver focused on standardization and quality, and gold focused on business outputs.

Third, some people assume medallion architecture automatically gives good governance. It does not. You still need naming standards, data tests, ownership, and monitoring.

What changes in production

For a simple demo, manual SQL and a scheduled job are enough. In production, I would usually add:

  • schema checks on incoming bronze data
  • data quality tests on silver tables
  • partitioning or clustering for larger tables
  • lineage or documentation for gold datasets
  • alerting when rejected records spike
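As one illustration of a data quality test on silver, a check can be a query that is expected to return zero rows; any result signals a violation. The table name follows the earlier examples, but the exact check is an assumption:

```sql
-- Minimal quality check: returns rows only if the silver table violates
-- the one-row-per-order rule. An empty result means the check passes.
select
  order_id,
  count(*) as row_count
from silver.orders_clean
group by order_id
having count(*) > 1;
```

Frameworks can wrap checks like this with scheduling and alerting, but the underlying idea stays this simple.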

You may also want incremental processing instead of full reloads, especially once data volumes grow.
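As a sketch of what incremental processing could look like, a merge keyed on order_id can upsert only newly ingested rows into silver instead of rebuilding the table. Exact syntax varies by engine, and staging.orders_batch is an assumed view over the latest bronze batch:

```sql
-- Hypothetical incremental load: upsert the latest batch into silver.
-- staging.orders_batch is an assumed view of newly ingested, cleaned rows.
merge into silver.orders_clean as target
using staging.orders_batch as source
  on target.order_id = source.order_id
when matched then update set
  quantity = source.quantity,
  price    = source.price,
  status   = source.status
when not matched then insert
  (order_id, customer_id, order_ts, product_id, quantity, price, status)
values
  (source.order_id, source.customer_id, source.order_ts,
   source.product_id, source.quantity, source.price, source.status);
```

The same pattern applies between silver and gold, where aggregates can be recomputed only for the affected dates.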

Conclusion

Medallion architecture sounds more complex than it really is. In practice, it is just a clean way to separate raw data, cleaned data, and business-ready outputs. If your pipelines are getting harder to debug or analysts are constantly redoing cleanup logic, this pattern is usually worth adopting. Start simple, keep each layer honest about its purpose, and do not add complexity unless your use case really needs it.

This post is licensed under CC BY 4.0 by the author.