Designing Your First Medallion Lakehouse on AWS (Without Overengineering)

If you’re building a new data platform on AWS, medallion architecture (bronze/silver/gold) is one of the cleanest patterns to start with.

But many teams either:

  • under-design it (everything in one layer), or
  • over-design it (too many zones and too much process too early).

This guide gives a practical middle path.

What medallion architecture solves

A medallion setup creates progressive trust levels in data:

  • Bronze: source truth and replay point
  • Silver: cleaned, standardized, quality-validated
  • Gold: business-consumable and AI-consumable models

This separation makes operations, debugging, and ownership easier.

Layer responsibilities

Bronze (raw)

  • store source as received
  • immutable where possible
  • minimal transformation
  • retention policy for replay/backfill

Silver (clean)

  • normalize types and schema
  • dedupe and apply core quality checks
  • enforce standard business keys

Gold (curated)

  • build domain models (orders, customer 360, product performance)
  • optimize for analytics and application use cases
  • include tested metrics and definitions
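The silver responsibilities above can be sketched as a small transform. This is an illustrative pure-Python version (not Glue/Spark API), assuming a hypothetical `orders` entity with an `order_id` business key and an `amount` field:

```python
from datetime import datetime, timezone

def to_silver(bronze_rows):
    """Illustrative silver step: normalize types, apply core quality
    checks, and dedupe on the business key (order_id here)."""
    seen = set()
    silver = []
    for row in bronze_rows:
        # Normalize types and schema.
        order_id = str(row.get("order_id", "")).strip()
        raw_amount = row.get("amount")
        amount = float(raw_amount) if raw_amount not in (None, "") else None
        # Core quality checks: required key, non-negative amount.
        if not order_id or amount is None or amount < 0:
            continue
        # Dedupe on the standard business key.
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "amount": amount,
            "processed_at": datetime.now(timezone.utc).isoformat(),
        })
    return silver
```

In a real Glue job the same three moves (normalize, validate, dedupe) map to DataFrame operations, but the layer intent is identical.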

A practical AWS implementation

flowchart LR
    A[Sources] --> B[S3 Bronze]
    B --> C[Glue Transform]
    C --> D[S3 Silver]
    D --> E[dbt Models]
    E --> F[S3 Gold]
    F --> G[Athena/BI]
    F --> H[AI Features]

Recommended components:

  • S3 for storage layers
  • Glue for heavy transformation jobs
  • dbt for curated SQL modeling and tests
  • Athena for serving and ad hoc analytics
  • Step Functions/EventBridge for orchestration
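The orchestration piece can be sketched as a Step Functions state machine. This is a minimal Amazon States Language outline, assuming hypothetical job and function names (`silver_orders_job`, `silver-quality-gate`); real definitions need full ARNs, retries, and input/output mapping:

```json
{
  "Comment": "Sketch only: job and function names are placeholders",
  "StartAt": "RunSilverTransform",
  "States": {
    "RunSilverTransform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "silver_orders_job" },
      "Next": "RunQualityGate"
    },
    "RunQualityGate": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "silver-quality-gate" },
      "ResultSelector": { "passed.$": "$.Payload.passed" },
      "Next": "CheckGate"
    },
    "CheckGate": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.passed", "BooleanEquals": true, "Next": "BuildGoldModels" }
      ],
      "Default": "BlockPromotion"
    },
    "BuildGoldModels": { "Type": "Succeed" },
    "BlockPromotion": { "Type": "Fail", "Error": "QualityGateFailed" }
  }
}
```

The Choice state is what enforces "a failed gate blocks promotion" at the orchestration level rather than by convention.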

Naming and partitioning conventions

Good conventions reduce long-term pain.

Suggested path pattern

s3://data-lake/bronze/<domain>/<entity>/dt=YYYY-MM-DD/
s3://data-lake/silver/<domain>/<entity>/dt=YYYY-MM-DD/
s3://data-lake/gold/<domain>/<model>/dt=YYYY-MM-DD/
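Centralizing the convention in one helper keeps paths consistent across jobs. A minimal sketch (the `lake_path` helper and `data-lake` bucket name are assumptions for illustration):

```python
from datetime import date

LAYERS = {"bronze", "silver", "gold"}

def lake_path(layer: str, domain: str, name: str, dt: date,
              bucket: str = "data-lake") -> str:
    """Build an S3 prefix following the suggested pattern.
    `name` is the entity (bronze/silver) or model (gold)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3://{bucket}/{layer}/{domain}/{name}/dt={dt:%Y-%m-%d}/"
```

Every writer importing this one function is cheaper than fixing divergent path styles across a dozen Glue jobs later.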

File format

  • prefer Parquet for silver/gold
  • avoid long-term CSV in analytic layers

Partition rule

  • partition by fields commonly filtered in queries
  • don’t create tiny partitions for low-volume entities

Quality controls by layer

Quality should increase as data moves upward.

  • Bronze: schema detect + ingestion metadata
  • Silver: null, duplicate, and conformance checks
  • Gold: business rule tests and freshness SLAs

A failed quality gate should block promotion to the next layer.
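A gate can be as simple as a list of named checks that must all pass before promotion. A minimal sketch with illustrative silver-level checks (the `order_id` key is an assumed example):

```python
def run_gate(rows, checks):
    """Run named quality checks over a batch; return (passed, failures).
    A non-empty failure list means promotion is blocked."""
    failures = [name for name, check in checks if not check(rows)]
    return (len(failures) == 0, failures)

# Silver-style checks: no null keys, no duplicate keys.
silver_checks = [
    ("no_null_keys", lambda rows: all(r.get("order_id") for r in rows)),
    ("no_duplicates",
     lambda rows: len({r["order_id"] for r in rows}) == len(rows)),
]
```

Gold gates follow the same shape, just with business-rule and freshness checks instead.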

Cost controls baked into design

To keep Athena/Glue costs predictable:

  1. compact small files during silver writes
  2. avoid full reloads when incremental is possible
  3. query gold models by default, not raw datasets
  4. enforce time filters in analytics queries
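Point 1 above is mostly a planning problem: group many small files into fewer write batches near a target size. A greedy sketch (the 128 MB target is an assumption, not an AWS requirement):

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedy compaction plan: batch small files so each rewrite
    produces roughly target_mb of output, reducing file count."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Fewer, larger Parquet files cut both Athena scan overhead and S3 request counts.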

Architecture decisions, not query-level tuning, usually drive the bulk of cloud data cost.

Team ownership model

A simple ownership split works well:

  • platform/data engineering owns bronze/silver reliability
  • analytics engineering owns gold semantic models
  • shared ownership on quality contracts

This keeps delivery speed high without creating confusion during incident response.

Where AI fits in this design

The gold layer should be designed so it can also support AI use cases:

  • entity-centric tables
  • consistent identifiers
  • freshness metadata
  • documented definitions

These reduce friction when building feature pipelines or retrieval systems later.

Common mistakes to avoid

  1. putting business metrics directly in bronze transforms
  2. no replay strategy for failed partitions
  3. too many layer variants (bronze1, bronze2, etc.)
  4. no clear data contracts between layers
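Mistake 4 is the cheapest to prevent: write the contract down as data and check rows against it at the layer boundary. A minimal sketch, assuming a hypothetical silver `orders` contract:

```python
# Hypothetical contract for one silver entity: field -> expected type.
SILVER_ORDERS_CONTRACT = {
    "order_id": str,
    "amount": float,
    "dt": str,
}

def violates_contract(row, contract=SILVER_ORDERS_CONTRACT):
    """Return a list of violations for one row (missing fields or
    wrong types). An empty list means the row conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in row:
            problems.append(f"missing:{field}")
        elif not isinstance(row[field], expected):
            problems.append(f"type:{field}")
    return problems
```

The same contract file can drive both the producer's gate and the consumer's expectations, which is what makes it a contract rather than documentation.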

Minimal rollout plan

Week 1-2:

  • define bronze/silver/gold path conventions
  • implement one domain end to end

Week 3-4:

  • add quality gates and basic lineage
  • add orchestration + alerting

Month 2:

  • extend pattern to additional domains
  • add cost dashboards and optimization loops

Final take

A medallion lakehouse is not about complexity.

It’s about creating clear trust boundaries in data so teams can move faster with fewer production surprises.

If you keep layer intent simple and strict, your architecture will scale for both BI and AI workloads.

This post is licensed under CC BY 4.0 by the author.