S3 data lake folder design best practices
In this article, let us look at how to design a folder structure for an S3-based data lake, why it matters early, and what kind of layout usually works well for a simple project without making production painful later. When people start with S3, it is easy to treat folder design as just a naming preference. But after a few pipelines, backfills, and consumers, a bad structure becomes hard to live with.
I have seen this happen in simple projects where files were just dropped into one bucket with names that made sense only to the first person who created them. That works for a few days. Later, when you need to debug one source, reload one table, or apply lifecycle rules, everything starts to feel messy. In practice, it is better to put some structure around the lake before many jobs depend on it.
Why folder design matters in S3
S3 is object storage, not a file system in the traditional sense. We still use prefixes that look like folders because they help us organize data and make downstream processing easier. A good prefix design helps with:
- separating raw, processed, and curated data
- defining partition strategy clearly
- making Athena, Spark, Glue, and other tools easier to work with
- applying retention and lifecycle rules
- making backfills and reprocessing less risky
- keeping access control simpler
If the layout is inconsistent, even simple jobs become full of special cases.
A practical starting structure
A simple and useful pattern is to separate the lake into zones. For example:
```
s3://company-data-lake/
  raw/
  staging/
  curated/
  sandbox/
```
And within each zone, organize by source system and dataset:
```
s3://company-data-lake/raw/salesforce/accounts/
s3://company-data-lake/raw/shopify/orders/
s3://company-data-lake/curated/finance/daily_revenue/
```
This is easier to manage than putting everything directly under one flat bucket path. It also makes ownership conversations simpler. A raw prefix usually means data landed as-is. A curated prefix means it is cleaned and ready for wider consumption.
Recommended path pattern
For many datasets, a path pattern like below works well:
```
s3://company-data-lake/raw/{source_system}/{dataset}/load_date=YYYY-MM-DD/part-000.parquet
```
Example:
```
s3://company-data-lake/raw/orders_api/orders/load_date=2025-01-15/part-000.parquet
```
For curated tables, I usually prefer business-style partitions instead of only ingestion partitions when that makes query sense:
```
s3://company-data-lake/curated/orders/order_date=2025-01-15/region=apac/part-000.parquet
```
The important thing is to be consistent. If one dataset uses dt=2025-01-15, another uses date=2025/01/15, and another uses 2025-01-15 without a key, then maintenance becomes annoying very quickly.
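One way to enforce that consistency is to generate paths from a single helper instead of formatting strings inside each pipeline. A minimal sketch, assuming a bucket named `company-data-lake` and a hypothetical `build_raw_path` helper:

```python
from datetime import date

# Assumed bucket name for illustration
BUCKET = "company-data-lake"

def build_raw_path(source_system: str, dataset: str, load_date: date) -> str:
    """Return the canonical raw-zone prefix for one daily load.

    Every pipeline that uses this helper produces the same
    load_date=YYYY-MM-DD layout, so the key name can never drift
    into dt= or date= variants.
    """
    return (
        f"s3://{BUCKET}/raw/{source_system}/{dataset}/"
        f"load_date={load_date.isoformat()}/"
    )

print(build_raw_path("orders_api", "orders", date(2025, 1, 15)))
# s3://company-data-lake/raw/orders_api/orders/load_date=2025-01-15/
```

If a second date style is ever needed, it gets added here once rather than invented per job.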
Partitioning: do it with a purpose
Partitioning in a data lake is useful, but over-partitioning is also a common mistake. People sometimes create too many partition columns because it looks organized. Later they get lots of tiny files and poor performance.
A simple rule is to partition by fields that are actually used for filtering often. Date is the most common one. Region, country, or tenant can also help in some cases. But if a field has too many possible values, think twice before using it as a folder partition.
A quick comparison looks like this:
| Approach | Good for | Things to watch |
|---|---|---|
| `load_date=YYYY-MM-DD` | raw ingestion and reprocessing | not always best for analytics queries |
| `event_date=YYYY-MM-DD` | query-heavy analytical tables | late-arriving data needs care |
| multi-level partitions like `year/month/day` | broad tool compatibility | path depth gets noisy |
| too many partition keys | narrow access patterns | many small files and harder maintenance |
If you are not sure, start with one date partition. Add more only when there is a clear reason.
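The over-partitioning risk is easy to estimate up front: the number of leaf prefixes is roughly the product of the distinct values of each partition key. A quick back-of-the-envelope sketch (the key names and cardinalities are made-up examples):

```python
from math import prod

def leaf_partition_count(cardinalities: dict[str, int]) -> int:
    """Rough upper bound on leaf prefixes: the product of the
    number of distinct values per partition key."""
    return prod(cardinalities.values())

# One year of daily data, partitioned only by date:
print(leaf_partition_count({"event_date": 365}))  # 365 prefixes: fine

# Adding region plus a high-cardinality customer key:
print(leaf_partition_count(
    {"event_date": 365, "region": 5, "customer_id": 10_000}
))
# 18250000 prefixes: almost certainly over-partitioned,
# and each one tends to hold a handful of tiny files
```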
Raw, staging, and curated should mean something
These names are useful only if the team treats them consistently. My rough definition is:
- raw: data copied from source with minimal or no transformation
- staging: temporary or intermediate outputs used by pipelines
- curated: cleaned and modeled datasets used by downstream consumers
For example, a raw JSON feed from an API might land like this:
```
s3://company-data-lake/raw/payments_api/transactions/load_date=2025-01-15/data.json
```
Then a Spark or Glue job converts it into Parquet in staging:
```
s3://company-data-lake/staging/payments/transactions_parquet/load_date=2025-01-15/
```
And the final analytics-ready table goes into curated:
```
s3://company-data-lake/curated/payments/transactions/event_date=2025-01-15/
```
This makes debugging easier because you can see where the pipeline changed the data.
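The zone move itself can be sketched as a prefix rewrite. This is only the path logic, not the actual data conversion (the Spark or Glue job still renames datasets and swaps partition keys as needed); `promote` is a hypothetical helper name:

```python
def promote(path: str, from_zone: str, to_zone: str) -> str:
    """Rewrite the zone segment of an s3:// path, e.g. staging -> curated.

    Fails loudly if the path is not in the expected zone, which catches
    jobs accidentally pointed at the wrong layer.
    """
    bucket, _, key = path.removeprefix("s3://").partition("/")
    segments = key.split("/")
    if segments[0] != from_zone:
        raise ValueError(f"expected zone {from_zone!r}, got {segments[0]!r}")
    segments[0] = to_zone
    return f"s3://{bucket}/" + "/".join(segments)

print(promote(
    "s3://company-data-lake/staging/payments/transactions/load_date=2025-01-15/",
    "staging", "curated",
))
# s3://company-data-lake/curated/payments/transactions/load_date=2025-01-15/
```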
Naming conventions that help later
A few conventions save a lot of confusion:
- Use lowercase names in bucket prefixes and dataset names.
- Avoid spaces and special characters.
- Use singular or plural consistently. I usually prefer plural for datasets like `orders`, `customers`, `transactions`.
- Keep source system names stable. Do not rename them casually once downstream jobs depend on them.
- Use explicit partition keys like `load_date=2025-01-15` instead of hidden date folders.
For example, this is harder to understand:
```
s3://lake/raw/app1/orders/2025/01/15/
```
This is more self-explanatory:
```
s3://lake/raw/app1/orders/load_date=2025-01-15/
```
Tools like Athena and Glue also work nicely with Hive-style partition paths.
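Part of why explicit keys help is that the partition values can be recovered mechanically from the path itself. A small sketch of that parsing (the helper name is an assumption):

```python
def partition_values(prefix: str) -> dict[str, str]:
    """Extract Hive-style key=value partition segments from an S3 prefix."""
    parts = {}
    for segment in prefix.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

# Explicit keys: the partition is self-describing
print(partition_values("s3://lake/raw/app1/orders/load_date=2025-01-15/"))
# {'load_date': '2025-01-15'}

# Hidden date folders: nothing recoverable without out-of-band convention
print(partition_values("s3://lake/raw/app1/orders/2025/01/15/"))
# {}
```

This is essentially what Hive-compatible tools do when they map paths to partition columns.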
Example Glue or Athena table alignment
If your S3 layout is consistent, the table definition becomes easier. A simplified Athena example could look like this:
```sql
CREATE EXTERNAL TABLE curated_orders (
  order_id string,
  customer_id string,
  amount decimal(10,2),
  region string
)
PARTITIONED BY (event_date date)
STORED AS PARQUET
LOCATION 's3://company-data-lake/curated/orders/';
```
Then the partition paths under that location stay predictable:
```
event_date=2025-01-14/
event_date=2025-01-15/
```
This is much cleaner than trying to point one external table at a bucket path with mixed naming styles.
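One operational note: Athena does not pick up new partition folders automatically. With Hive-style paths like the ones above, new partitions can be registered in bulk or one at a time:

```sql
-- Scan the table LOCATION and register all Hive-style partitions
MSCK REPAIR TABLE curated_orders;

-- Or register a single new day explicitly
ALTER TABLE curated_orders ADD IF NOT EXISTS
  PARTITION (event_date = '2025-01-15');
```

With inconsistent path styles, neither of these works cleanly, which is another argument for the explicit key=value convention.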
Things to be careful about
There are a few common problems.
1. Using one bucket for everything without clear prefixes
This makes permissions, lifecycle policies, and troubleshooting harder. One bucket can still work, but the prefixes need discipline.
2. Mixing raw and transformed files together
If JSON, CSV, and final Parquet files all land in the same path, consumers get confused and automation becomes brittle.
3. Too many small files
Even if the folder structure is good, writing thousands of tiny files hurts query performance and job efficiency. Compaction matters.
4. Partitioning by ingestion when the business queries by event date
For demos, load date is fine. In production, analytical queries often care more about event time than ingestion time. You may need both as metadata, but not necessarily both as path partitions.
5. Renaming prefixes after many jobs are built
This looks harmless but creates downstream breakage, Glue crawler confusion, and permission drift. It is better to decide a reasonable pattern early.
What I would change in production
For a simple demo, one bucket and a few clean prefixes are enough. In production, I would usually add a few more controls:
- separate buckets or at least stricter IAM boundaries for sensitive data
- lifecycle policies for raw and staging zones
- file format standards, usually Parquet for curated data
- data quality checks before promoting data into curated
- compaction strategy to avoid small-file problems
- clear ownership for each source and dataset
I would also document the path contract in one short page. That sounds small, but it helps new engineers avoid inventing a different layout for every new pipeline.
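For the lifecycle policies, the zone prefixes map directly onto S3 lifecycle rules. A minimal sketch of such a configuration; the rule IDs and day counts are example values, and applying it would go through `boto3`'s `put_bucket_lifecycle_configuration`:

```python
# Example lifecycle rules keyed off the zone prefixes.
# Day counts and rule IDs are illustrative assumptions.
lifecycle = {
    "Rules": [
        {
            # Staging is intermediate output: delete it after two weeks
            "ID": "expire-staging",
            "Filter": {"Prefix": "staging/"},
            "Status": "Enabled",
            "Expiration": {"Days": 14},
        },
        {
            # Raw is kept for reprocessing, but moved to cheaper storage
            "ID": "archive-raw",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
    ]
}

# To apply (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="company-data-lake", LifecycleConfiguration=lifecycle
# )
```

Rules like these only work because the zone names are stable prefixes, which is one more reason not to rename them later.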
Conclusion
S3 folder design in a data lake does not need to be complicated, but it should be intentional. A simple structure with clear zones, stable naming, and sensible partitioning is usually enough to keep the lake usable as it grows. If you set this up early, your future pipeline work, reprocessing, and analytics will be much easier to manage.
