S3 Data Lake Folder Design – Best Practices from the Trenches
You have probably heard the phrase “S3 is schema on read” a hundred times. What people say less often is that your folder structure becomes your schema — whether you planned it that way or not.
In this article, let us walk through how to design an S3 folder layout that actually works in production. Not the theory, not the vendor whitepaper version — the stuff you learn after you inherit a bucket with 8 million objects and a partition scheme that makes Athena scans cost $40 a pop.
We will cover naming conventions, how to think about partitioning, when to use Hive-style partitions versus flat prefixes, and a few patterns that save you from migrating the entire lake six months in.
Why Folder Design Matters More Than You Think
S3 is just an object store. There are no real folders, no indexes. Every / in a key is a convention, nothing more. But that convention feeds directly into:
- Query performance: Athena, Redshift Spectrum, and Spark all use prefixes as partition filters. A well-structured prefix means you scan 50 MB instead of 500 GB.
- Lifecycle rules: Expiration is based on object age and scoped by prefix or tag. If some data should expire after 90 days, it needs a prefix that isolates it from data with a different retention period.
- Access control: IAM policies with s3:prefix conditions are far easier to write when your folder scheme is predictable.
- Operational sanity: Ever tried debugging a pipeline that ingests to a prefix that has three different date formats across different ingestion sources? Do not be that team.
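To make the access-control point concrete, here is a minimal sketch of a policy document that lets a principal list and read a single prefix. The helper name read_policy_for_prefix, the bucket name, and the prefix are hypothetical; s3:prefix is the condition key S3 evaluates on s3:ListBucket requests.

```python
import json


def read_policy_for_prefix(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document scoped to a single key prefix.

    Hypothetical helper: bucket and prefix are placeholders.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so the prefix goes in a condition.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is an object-level action, so the prefix goes in the ARN.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = read_policy_for_prefix("my-data-lake", "silver/domain=orders/")
print(json.dumps(policy, indent=2))
```

With a predictable layout, granting a team access to one domain is a single call with a different prefix, rather than a hand-written list of paths.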
A Folder Structure That Scales
Here is a structure I have seen work well across multiple projects:
s3://my-data-lake/
  raw/
    source=mysql-orders/
      ingest_year=2025/
        ingest_month=07/
          ingest_day=08/
            orders_20250708_143000.csv
    source=kafka-events/
      ingest_year=2025/
        ingest_month=07/
          ingest_day=08/
            events_20250708_143000.parquet
  bronze/
    source=mysql-orders/
      event_year=2025/
        event_month=07/
          event_day=08/
            ...
  silver/
    domain=orders/
      ...
  gold/
    dataset=daily_order_summary/
      ...
A few things to notice here:
The layer comes first. This is deliberate. Most lifecycle rules operate at the layer level — raw data gets deleted after 30 days, bronze after 180, silver and gold are kept long-term. If you put the date at the top level, you end up writing one lifecycle rule per date prefix, which is a nightmare.
Source or domain comes second. In the raw and bronze layers, we partition by source system. In silver and gold, we switch to business domain. This reflects the fact that raw data is organised by where it came from, and curated data is organised by what it means.
Ingestion time vs event time. The raw layer uses ingest_* prefixes — the date the data arrived. The bronze layer switches to event_* — the date the event actually happened. This distinction matters more than you might think. If your ingestion pipeline fails for two days and replays, the ingest_ partition tells you when data was loaded, while event_ tells you what data you are looking at. Both are useful, but they serve different purposes.
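The ingest-time versus event-time split is easier to keep consistent when keys are built by one small helper instead of string concatenation scattered across pipelines. A minimal sketch; the function names raw_key and bronze_key and the file name orders.parquet are hypothetical.

```python
from datetime import datetime, timezone


def raw_key(source: str, ingested_at: datetime, filename: str) -> str:
    """Key for the raw layer, partitioned by when the data arrived."""
    return (
        f"raw/source={source}/"
        f"ingest_year={ingested_at:%Y}/ingest_month={ingested_at:%m}/"
        f"ingest_day={ingested_at:%d}/{filename}"
    )


def bronze_key(source: str, event_time: datetime, filename: str) -> str:
    """Key for the bronze layer, partitioned by when the event happened."""
    return (
        f"bronze/source={source}/"
        f"event_year={event_time:%Y}/event_month={event_time:%m}/"
        f"event_day={event_time:%d}/{filename}"
    )


# An order placed on 6 July that only reached the lake on 8 July after a replay.
event_time = datetime(2025, 7, 6, 9, 15, tzinfo=timezone.utc)
ingested_at = datetime(2025, 7, 8, 14, 30, tzinfo=timezone.utc)

raw = raw_key("mysql-orders", ingested_at, "orders_20250708_143000.csv")
bronze = bronze_key("mysql-orders", event_time, "orders.parquet")
print(raw)
print(bronze)
```

The same record lands under ingest_day=08 in raw and event_day=06 in bronze, which is exactly the distinction described above.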
Hive-Style Partitions: When They Help and When They Hurt
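Hive-style partitions look like year=2025/month=07/day=08/. Glue crawlers, Athena (through MSCK REPAIR TABLE), and Spark can discover them and take the partition column names straight from the path. They are great when:
- You use Athena or Glue Crawlers heavily
- You want new partitions picked up by a crawler or MSCK REPAIR TABLE rather than registered one by one with ALTER TABLE ADD PARTITION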
- Your queries almost always filter on these columns
But they come with trade-offs:
| Aspect | Hive-Style (key=value/) | Flat Prefix (value/) |
|---|---|---|
| Athena auto-detection | Yes, via MSCK REPAIR TABLE or a Glue crawler | No, needs explicit partitions or partition projection in the table definition |
| Readability by humans | Good — self-documenting | Okay, needs external docs |
| S3 API listing performance | Same as flat | Same as flat |
| Flexibility to rename the partition column | Harder: the key name is part of every object path, so a rename means copying every object | Easier: the column name lives only in the table definition |
| Works with non-Hive engines | Yes, most tools parse it | Yes, everything works |
| S3 key length | Longer (extra bytes per object) | Shorter |
In practice, I default to Hive-style for date partitions in silver and gold layers where Athena is the primary query engine. For raw and bronze, I use flat prefixes — they are simpler, and raw data is more likely to be consumed by Spark jobs that do not care about Hive conventions.
One thing that bit me early on: Hive-style partition names are part of the object key. If you start with year=2025/ and later want to switch to dt=2025/, you are renaming every object, which is an expensive copy operation in S3. Decide on your partition key names before you have a few hundred thousand objects.
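Because the partition names live in the key itself, any tool can recover them with plain string handling. A minimal sketch of that parsing; parse_hive_partitions and the example keys are hypothetical, and this is not how any particular engine implements it.

```python
def parse_hive_partitions(key: str) -> dict[str, str]:
    """Return the key=value path segments of an object key as a dict."""
    partitions = {}
    # The last segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


hive = parse_hive_partitions(
    "silver/domain=orders/year=2025/month=07/day=08/part-0000.parquet"
)
flat = parse_hive_partitions("silver/orders/2025/07/08/part-0000.parquet")
print(hive)
print(flat)
```

The Hive-style key yields its column names and values; the flat key yields nothing, which is why a flat layout depends on a table definition to say what each path level means.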
Partition Granularity: Finding the Right Level
The most common mistake I see is over-partitioning. Teams partition by year, month, day, hour, tenant, and region on day one, and then their raw bucket has 40,000 prefixes with four objects each.
S3 can handle the request rate, but listing slows down when you have an extremely high number of prefixes. More importantly, small files hurt query performance in every engine: each file adds request and task-scheduling overhead, and even columnar formats lose most of their advantage when files are tiny. A Spark job reading 10,000 tiny CSV files is going to be slower than one reading 100 larger Parquet files.
A good starting point:
- Raw layer: Partition by ingestion date (daily is fine). Do not overthink it — raw data is transient.
- Bronze layer: Partition by event date, daily. Optionally add a second level for source if you have many sources landing in the same bronze bucket.
- Silver layer: Partition by the most common query filter for that dataset. If 80% of queries filter by region, partition by region.
- Gold layer: Partition by whatever the downstream consumer (dashboard, ML model) needs. Often daily snapshots.
If you find yourself partitioning by a column that has 3,000 unique values, stop. That is not a partition key, that is a regular column. Use it as a sort key in your file format instead.
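A quick way to spot over-partitioning is to group a bucket listing by parent prefix and look at the average object size per partition. The sketch below assumes you already have (key, size) pairs, for example from an S3 Inventory report; the 64 MiB threshold and the sample keys are assumptions, not a recommendation from any vendor.

```python
from collections import defaultdict

# Assumed threshold: partitions averaging under 64 MiB per object get flagged.
MIN_AVG_BYTES = 64 * 1024 * 1024


def undersized_partitions(objects, min_avg_bytes=MIN_AVG_BYTES):
    """Group (key, size_in_bytes) pairs by parent prefix and flag small ones.

    Returns a list of (prefix, object_count, average_bytes), sorted by prefix.
    """
    sizes_by_prefix = defaultdict(list)
    for key, size in objects:
        prefix = key.rsplit("/", 1)[0] + "/"
        sizes_by_prefix[prefix].append(size)

    flagged = []
    for prefix, sizes in sorted(sizes_by_prefix.items()):
        average = sum(sizes) / len(sizes)
        if average < min_avg_bytes:
            flagged.append((prefix, len(sizes), average))
    return flagged


listing = [
    ("raw/source=iot/day=08/hour=13/a.csv", 2_000_000),
    ("raw/source=iot/day=08/hour=13/b.csv", 4_000_000),
    ("bronze/source=iot/day=08/part-0.parquet", 256 * 1024 * 1024),
]
flagged = undersized_partitions(listing)
print(flagged)
```

If most of a layer shows up in the flagged list, the partition scheme is probably one level too deep for the data volume.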
Naming Conventions That Stick
One of the quiet killers in a data lake is inconsistent naming. I have seen the same bucket use created_date, dt, date, event_date, and p_date as partition keys — all from different pipelines feeding into the same silver layer. Whoever inherits that will not thank you.
A few rules I follow:
- Pick one date format and use it everywhere. ISO 8601 (YYYY-MM-DD) is the only reasonable choice. Not MM-DD-YYYY, not epoch seconds in a folder name.
- Use snake_case for all keys. event_year not eventYear. Hive and Athena treat column names as case-insensitive and lowercase them, so camelCase partition keys are just asking for case-mismatch bugs.
- Do not put file extensions in folder names. I once saw raw_csv/ and raw_parquet/ as top-level prefixes. The file extension already tells you the format. Use the layer, then the content.
- Prefix temporary or staging areas with an underscore. E.g., _temp/ or _staging/. It sorts ahead of lowercase prefixes in the console and signals to everyone that those objects are not part of the permanent dataset.
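Conventions like these hold up better when a pipeline checks them before writing. A minimal sketch of such a check; the function names are hypothetical, and the rule that keys ending in _date must hold a YYYY-MM-DD value is an assumed house rule used only for illustration.

```python
import re
from datetime import date

SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_partition_segment(segment: str) -> list[str]:
    """Return a list of naming problems for one key=value path segment."""
    name, sep, value = segment.partition("=")
    if not sep:
        return [f"{segment!r} is not a key=value segment"]
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append(f"partition key {name!r} is not snake_case")
    # Assumed house rule: keys ending in _date must hold an ISO 8601 date.
    if name.endswith("_date") and not is_iso_date(value):
        problems.append(f"value {value!r} is not an ISO 8601 date (YYYY-MM-DD)")
    return problems


for segment in ["event_date=2025-07-08", "eventYear=2025", "event_date=07-08-2025"]:
    print(segment, check_partition_segment(segment))
```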
Lifecycle Rules and How Your Folder Design Affects Them
S3 lifecycle rules match on prefix, not suffix, and expiration is based on object age counted from when the object was created, not on any date embedded in the key. So what your folder design has to provide is a prefix that groups objects sharing the same retention period, at the highest level where that boundary makes sense.
Example: you want to expire most raw data after 30 days but keep audit data for 90. If your structure is raw/source=mysql/ingest_year=2025/ingest_month=06/ingest_day=15/file.csv, there is no single prefix for “everything with 30-day retention.” You either:
- Tag each object with its retention class at write time and filter lifecycle rules on the tag (more complex but more precise), or
- Restructure so the retention class is higher in the prefix hierarchy
For most teams, option two is simpler. One pattern I have used: put a broad retention bucket at a higher level for lifecycle purposes while keeping detailed partitions deeper.
raw/
  retention_30d/
    source=mysql/
      year=2025/month=07/day=08/
  retention_90d/
    source=audit/
      year=2025/month=04/
You set one lifecycle rule on retention_30d/ and another on retention_90d/. It is not elegant, but it works reliably and costs nothing in query performance.
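The two rules can be sketched as a lifecycle configuration in the shape that boto3's put_bucket_lifecycle_configuration accepts. The helper name expiration_rule, the rule IDs, and the bucket name are hypothetical, and the sketch only builds the dict rather than calling AWS.

```python
def expiration_rule(prefix: str, days: int) -> dict:
    """One lifecycle rule that expires objects under prefix after `days` days."""
    return {
        "ID": f"expire-{prefix.strip('/').replace('/', '-')}-after-{days}d",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Age is counted from object creation, not from any date in the key.
        "Expiration": {"Days": days},
    }


lifecycle_configuration = {
    "Rules": [
        expiration_rule("raw/retention_30d/", 30),
        expiration_rule("raw/retention_90d/", 90),
    ]
}

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=lifecycle_configuration)
for rule in lifecycle_configuration["Rules"]:
    print(rule["ID"], rule["Filter"]["Prefix"], rule["Expiration"]["Days"])
```

Adding a new source never requires a new rule: it only has to be written under the retention prefix that matches its policy.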
Things to Be Careful About
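In-place overwrites. Since December 2020, S3 has provided strong read-after-write consistency, so a read issued after a successful overwrite returns the new object. What S3 does not provide is atomicity across objects: if a pipeline overwrites a dataset in place (e.g., a daily full refresh of a dimension table spread over several files), a consumer reading mid-refresh can see a mix of old and new files. Better yet, avoid in-place overwrites: write new objects and update a pointer or view.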
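One way to avoid in-place overwrites is to write each refresh under a new run prefix and then replace a single small pointer object. A minimal sketch, using a plain dict as a stand-in for the bucket; the _current.json pointer name and the run= prefix are hypothetical conventions, not an S3 feature.

```python
import json


def publish_refresh(store: dict, dataset: str, run_id: str, files: dict) -> None:
    """Write a full refresh under a new prefix, then repoint readers to it."""
    run_prefix = f"{dataset}/run={run_id}/"
    for name, body in files.items():
        store[run_prefix + name] = body
    # A single-object write is the switch: readers see the old run or the new one.
    store[f"{dataset}/_current.json"] = json.dumps({"prefix": run_prefix})


def current_files(store: dict, dataset: str) -> dict:
    """Resolve the pointer and return only the files of the published run."""
    prefix = json.loads(store[f"{dataset}/_current.json"])["prefix"]
    return {k: v for k, v in store.items() if k.startswith(prefix)}


bucket = {}
dataset = "silver/domain=customers"
publish_refresh(bucket, dataset, "2025-07-07", {"part-0.parquet": b"old"})
publish_refresh(bucket, dataset, "2025-07-08", {"part-0.parquet": b"new"})
visible = current_files(bucket, dataset)
print(visible)
```

The previous run stays in place until a cleanup job or lifecycle rule removes it, which also gives you a cheap rollback.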
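Prefix limits and API costs. LIST requests are billed per request, and each request returns at most 1,000 keys, so listing 100 million objects takes at least 100,000 paginated calls. If all of those objects sit under a single prefix with no intermediate partitioning, you cannot narrow or parallelise the listing, and it becomes slow as well as costly. This is rare, but I have seen it happen with IoT ingest use cases.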
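The arithmetic behind listing cost is simple enough to sketch. Each LIST call returns at most 1,000 keys; the price per 1,000 requests is passed in as a parameter, and the 0.005 used in the example is a placeholder assumption, so check current pricing for your region.

```python
import math

KEYS_PER_LIST_REQUEST = 1000  # maximum keys S3 returns per LIST call


def list_requests_needed(object_count: int) -> int:
    """Number of paginated LIST calls needed to enumerate object_count keys."""
    # Even an empty prefix costs one request to find out it is empty.
    return max(1, math.ceil(object_count / KEYS_PER_LIST_REQUEST))


def listing_cost(object_count: int, price_per_1000_requests: float) -> float:
    """Cost of one full listing; the price is an input, not a quoted rate."""
    return list_requests_needed(object_count) / 1000 * price_per_1000_requests


requests_needed = list_requests_needed(100_000_000)
cost_per_listing = listing_cost(100_000_000, price_per_1000_requests=0.005)
print(requests_needed, cost_per_listing)
```

A single full listing is modest in dollar terms; the pain comes from the number of sequential round trips and from jobs that repeat the listing on every run.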
Cross-region replication. If you replicate your bucket to another region, S3 replicates the full key path. Your folder structure stays identical, which is good. Just make sure your downstream consumers are aware of which region they are reading from — Athena and Glue do not auto-switch.
Small files problem. Your partition scheme can make this worse. If you partition by hour and each partition gets 2–3 small CSV files from a low-volume source, consider running a compaction job that merges them into larger files. Or coarsen the partitioning to daily.
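The planning half of a compaction job can be sketched without touching S3: given the files in a partition, decide which ones get merged together. This is a greedy sketch under assumptions; plan_compaction, the 128 MiB target, and the sample keys are hypothetical, and the actual merge would be done by your engine of choice.

```python
def plan_compaction(files, target_bytes=128 * 1024 * 1024):
    """Group (key, size) pairs into batches of roughly target_bytes each.

    Greedy and order-preserving: a batch is closed once adding the next
    file would push it past the target.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Six hourly files of 40 MiB each from a low-volume source.
hourly_files = [(f"hour={h:02d}/events.csv", 40 * 1024 * 1024) for h in range(6)]
batches = plan_compaction(hourly_files)
for batch in batches:
    print(batch)
```

Each batch becomes one output file, so six hourly files collapse into two objects of about 120 MiB.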
Production vs PoC: What Changes
In a proof of concept, you can just dump everything to s3://my-bucket/data/ and write queries that scan the whole thing. It works fine for a few gigabytes.
In production, the things you need to add are:
- Proper lifecycle rules with cost estimates (S3 Intelligent-Tiering helps if access patterns are unpredictable)
- A compaction job for small files in the raw and bronze layers
- A data catalog (Glue, or at least documented prefix conventions) so people know what is where
- Bucket policies that prevent accidental deletes on silver and gold layers
- Monitoring on S3 request metrics so you notice if a bad query is scanning everything
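As an illustration of the bucket-policy item, here is a minimal sketch of a statement that denies deletes under the curated layers to everyone except one administrative role. The helper name, bucket name, and role ARN (including the 123456789012 account ID) are placeholders, and a real policy would need review against your own access model.

```python
import json


def deny_delete_policy(bucket: str, protected_prefixes: list[str], admin_role_arn: str) -> dict:
    """Bucket policy that blocks object deletes under the given prefixes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeletesOnCuratedLayers",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}/{prefix}*" for prefix in protected_prefixes
                ],
                # Placeholder exemption so one role can still clean up.
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": admin_role_arn}},
            }
        ],
    }


policy = deny_delete_policy(
    "my-data-lake",
    ["silver/", "gold/"],
    "arn:aws:iam::123456789012:role/lake-admin",
)
print(json.dumps(policy, indent=2))
```

A deny on deletes does not stop an object from being overwritten by a new upload, so it is usually paired with bucket versioning.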
Wrapping Up
Your S3 folder structure is not going to be perfect on day one, and that is fine. What matters is that you make deliberate choices about layer ordering, partition granularity, and naming — and that you document those choices somewhere that is not a Slack thread from six months ago.
Start simple. Prefer fewer partition levels and add more only when query patterns and data volumes justify them. Write your lifecycle rules early, before the bucket has a year of data and you start sweating the storage bill. And for the love of everything, pick one date format and stick to it.
This stuff is not glamorous, but when your Athena query finishes in 8 seconds instead of 40 seconds and your monthly S3 bill drops by 30%, you will be glad you spent an afternoon thinking about folders.
