Landing, Bronze, Silver, and Gold Layers Explained

In this article, we will walk through the medallion architecture pattern — landing, bronze, silver, and gold layers — and why teams use this approach when building data lakehouses. If you are coming from a traditional data warehouse background or trying to clean up a messy data lake, this pattern gives you a clear way to organize data as it moves from raw ingestion to business-ready outputs.

I have seen teams try to skip layers to save time, and I have seen teams add too many layers and drown in complexity. In practice, a four-layer setup tends to be the sweet spot for most use cases. Let us go through what each one actually does.

The four layers at a glance

Layer   | Purpose                                  | Who owns it                         | Typical retention
Landing | Raw files as they arrived                | Ingestion pipeline                  | 30-90 days
Bronze  | Schema-on-read copy with basic metadata  | Data engineering                    | Months to years
Silver  | Cleaned, deduplicated, conformed         | Data engineering                    | Indefinite
Gold    | Business aggregates and modeled outputs  | Analytics engineers / domain teams  | Based on use case

This is not a rigid standard. Some teams merge landing and bronze. Some add a platinum layer above gold. But the four-layer model is a practical starting point that most teams can reason about.

Landing layer: keep what actually arrived

The landing layer is where files live immediately after ingestion, before any processing happens. If your pipeline pulls from an API, receives SFTP drops, or listens to a message queue, the landing zone holds those files exactly as they arrived.

For our use case, let us say we receive daily JSON exports from an e-commerce platform. A simple landing structure might look like this:

/landing/ecommerce/orders/2025/01/23/orders_20250123_001.json
/landing/ecommerce/orders/2025/01/23/orders_20250123_002.json
/landing/ecommerce/orders/2025/01/24/orders_20250124_001.json

I like keeping the source timestamp in the path. It makes debugging easier when a file arrives late or when you need to reprocess a specific day. Here is a simple Python pattern for landing files:

import os
from datetime import datetime, timezone

def landing_path(source: str, entity: str, filename: str) -> str:
    # Partition landing paths by UTC date so late or replayed files are easy to find
    now = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated in Python 3.12+
    return f"/landing/{source}/{entity}/{now:%Y/%m/%d}/{filename}"

# Usage: save an API response directly to landing
path = landing_path("ecommerce", "orders", "orders_001.json")
os.makedirs(os.path.dirname(path), exist_ok=True)  # create the date folders on first write
with open(path, "w") as f:
    f.write(api_response)  # api_response: the raw payload from your extract step

One thing I have learned: do not validate or transform data in the landing layer. If the API sends malformed JSON, save it anyway. You want the raw evidence in case you need to replay or investigate later.

Bronze layer: schema-on-read with audit columns

Bronze is your first table layer. This is where you read the landing files and convert them into a queryable format, usually with minimal changes. The goal is not perfection — it is making the data accessible while preserving lineage.

In Databricks with Delta Lake, a bronze load might look like this:

from pyspark.sql.functions import current_timestamp, input_file_name

# Read the landing files as-is; multiLine handles pretty-printed JSON exports
bronze_df = (spark.read
    .option("multiLine", True)
    .json("/landing/ecommerce/orders/2025/01/23/*"))

# Append to bronze with audit columns for lineage and replay
(bronze_df
    .withColumn("_ingestion_timestamp", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.orders"))

Notice the audit columns _ingestion_timestamp and _source_file. These are essential for debugging and replaying specific batches. I always add them to bronze tables.

For BigQuery users, the equivalent would be using an external table pointing at the landing GCS bucket:

CREATE OR REPLACE EXTERNAL TABLE bronze.orders
OPTIONS (
  format = 'JSON',
  uris = ['gs://my-bucket/landing/ecommerce/orders/*'],
  max_bad_records = 10
);

The max_bad_records option is useful because bronze should tolerate some bad data. You can always filter it out in silver.

Silver layer: where the real work happens

Silver is where you apply business rules, deduplication, type casting, and joins between sources. This layer should be the single source of truth for your analytics team.

Here is an example silver transformation that handles duplicates and data quality:

WITH deduplicated AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY _ingestion_timestamp DESC
    ) as rn
  FROM bronze.orders
  WHERE order_id IS NOT NULL
),
cleaned AS (
  SELECT
    order_id,
    CAST(order_date AS DATE) as order_date,
    customer_id,
    CAST(total_amount AS DECIMAL(10,2)) as total_amount,
    UPPER(TRIM(status)) as status,
    _ingestion_timestamp
  FROM deduplicated
  WHERE rn = 1
    AND UPPER(TRIM(status)) IN ('COMPLETED', 'PENDING', 'CANCELLED')
)
SELECT * FROM cleaned;

I like to write silver transformations as idempotent SQL queries that can be run multiple times without creating duplicates. In Databricks, use MERGE statements. In BigQuery, you might use CREATE OR REPLACE TABLE.
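
As a sketch of that MERGE-based pattern in Databricks, assuming the cleaned output above is available as a staging view called silver_staging (a hypothetical name), an upsert keyed on order_id can be re-run safely:

-- Upsert cleaned rows into silver, keyed on order_id.
-- Re-running the same batch rewrites the same rows instead of duplicating them.
MERGE INTO silver.orders AS target
USING silver_staging AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;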

One production pattern I recommend: add a _silver_processed_at column to bronze and only process records where it is still null. This makes incremental loads efficient.
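
A minimal sketch of that incremental pattern, assuming bronze.orders carries a nullable _silver_processed_at column (deduplication omitted for brevity):

-- Promote only bronze rows that have not been processed yet
INSERT INTO silver.orders
SELECT
  order_id,
  CAST(order_date AS DATE),
  customer_id,
  CAST(total_amount AS DECIMAL(10,2)),
  UPPER(TRIM(status)),
  _ingestion_timestamp
FROM bronze.orders
WHERE _silver_processed_at IS NULL;

-- Stamp the rows we just promoted so the next run skips them
UPDATE bronze.orders
SET _silver_processed_at = current_timestamp()
WHERE _silver_processed_at IS NULL;

In a real pipeline you would stamp by batch identifier or wrap both steps in a single transaction, since new rows can arrive between the two statements.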

Gold layer: business-ready outputs

Gold tables are designed for specific use cases — dashboards, ML feature stores, or downstream API consumption. They are often heavily aggregated and denormalized.

A typical gold table might look like this:

CREATE TABLE gold.daily_order_metrics AS
SELECT
  order_date,
  COUNT(DISTINCT order_id) as total_orders,
  COUNT(DISTINCT customer_id) as unique_customers,
  SUM(total_amount) as daily_revenue,
  AVG(total_amount) as avg_order_value,
  COUNT(CASE WHEN status = 'CANCELLED' THEN 1 END) * 100.0 / COUNT(*) as cancellation_rate
FROM silver.orders
GROUP BY order_date;

Gold tables should be documented well because they are what business users and analysts interact with. I usually add column descriptions and ensure the table names clearly indicate what they contain.
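
As a sketch, in Databricks those descriptions can live on the table itself; BigQuery has an equivalent via ALTER COLUMN ... SET OPTIONS (description = ...):

-- Attach descriptions so analysts can self-serve from the catalog
COMMENT ON TABLE gold.daily_order_metrics IS 'One row per day of order activity, rebuilt from silver.orders';

ALTER TABLE gold.daily_order_metrics
  ALTER COLUMN cancellation_rate COMMENT 'Percentage of orders cancelled that day (0-100)';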

Common pitfalls to avoid

After implementing this pattern a few times, here are the mistakes I see most often:

  1. Trying to do too much in bronze: Bronze is not the place for complex joins or business logic. Keep it simple.

  2. Not partitioning silver tables: Silver tables can get huge. Partition by date or another natural key to keep queries fast (see the sketch after this list).

  3. Treating gold as permanent: Business requirements change. Design gold tables to be rebuildable from silver, not as the final word.

  4. Skipping the landing zone: It is tempting to go straight from API to bronze. But when you need to replay data after finding a bug in your bronze logic, you will wish you had kept those raw files.
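
For pitfall 2, a minimal sketch of a date-partitioned silver table. Delta Lake syntax is shown, mirroring the silver schema above; in BigQuery you would use PARTITION BY order_date instead:

-- Partition silver by order_date so daily queries prune files
CREATE TABLE silver.orders (
  order_id STRING,
  order_date DATE,
  customer_id STRING,
  total_amount DECIMAL(10,2),
  status STRING,
  _ingestion_timestamp TIMESTAMP
)
USING DELTA
PARTITIONED BY (order_date);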

What changes in production

For a demo or proof of concept, you can run these transformations manually or on a schedule. In production, you will want:

  • Incremental processing: Only process new or changed records, not full reloads
  • Data quality checks: Use tools like Great Expectations or dbt tests between layers; a plain-SQL version is sketched after this list
  • Lineage tracking: Know which silver records came from which bronze records
  • Retention policies: Landing files should auto-delete after 30-90 days to save storage costs
  • Access controls: Bronze is usually internal-only, while gold might be accessed by analysts
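
Even without a dedicated tool, a quality gate can be a plain SQL query whose result fails the pipeline when any count is nonzero. A minimal sketch, with illustrative check names:

-- Fail the load if any check reports failures > 0
SELECT 'duplicate order_id' AS check_name, COUNT(*) AS failures
FROM (
  SELECT order_id FROM silver.orders GROUP BY order_id HAVING COUNT(*) > 1
)
UNION ALL
SELECT 'null order_date', COUNT(*)
FROM silver.orders
WHERE order_date IS NULL;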

The four-layer model scales well because each layer has a single responsibility. When something breaks, you know where to look.

Wrapping up

The landing, bronze, silver, and gold pattern is not magic — it is just a way to organize data transformations so they are understandable and maintainable. Start simple, add complexity only when you need it, and always keep the raw data around until you are sure your pipelines work.

If you are just getting started, focus on getting data into bronze reliably. Silver and gold can come later once you understand your data sources better.