S3 data lake folder design best practices
In this article, let us look at how to design a folder structure for an S3-based data lake, why it matters early, and what kind of layout usually works well for a simple project without making production painful later. When people start with S3, it is easy to treat folder design as just a naming preference. But after a few pipelines, backfills, and consumers, a bad structure becomes hard to live with.
I have seen this happen in simple projects where files were just dropped into one bucket with names that made sense only to the first person who created them. That works for a few days. Later, when you need to debug one source, reload one table, or apply lifecycle rules, everything starts to feel messy. In practice, it is better to put some structure around the lake before many jobs depend on it.
Why folder design matters in S3
S3 is object storage, not a file system in the traditional sense. We still use prefixes that look like folders because they help us organize data and make downstream processing easier. A good prefix design helps with:
- separating raw, processed, and curated data
- defining partition strategy clearly
- making Athena, Spark, Glue, and other tools easier to work with
- applying retention and lifecycle rules
- making backfills and reprocessing less risky
- keeping access control simpler
If the layout is inconsistent, even simple jobs become full of special cases.
A practical starting structure
A simple and useful pattern is to separate the lake into zones. For example:
```
s3://company-data-lake/
  raw/
  staging/
  curated/
  sandbox/
```
And within each zone, organize by source system and dataset:
```
s3://company-data-lake/raw/salesforce/accounts/
s3://company-data-lake/raw/shopify/orders/
s3://company-data-lake/curated/finance/daily_revenue/
```
This is easier to manage than putting everything directly under one flat bucket path. It also makes ownership conversations simpler. A raw prefix usually means data landed as-is. A curated prefix means it is cleaned and ready for wider consumption.
Recommended path pattern
For many datasets, a path pattern like below works well:
```
s3://company-data-lake/raw/{source_system}/{dataset}/load_date=YYYY-MM-DD/part-000.parquet
```
Example:
```
s3://company-data-lake/raw/orders_api/orders/load_date=2025-01-15/part-000.parquet
```
For curated tables, I usually prefer business-style partitions instead of only ingestion partitions when that makes query sense:
```
s3://company-data-lake/curated/orders/order_date=2025-01-15/region=apac/part-000.parquet
```
The important thing is to be consistent. If one dataset uses dt=2025-01-15, another uses date=2025/01/15, and another uses 2025-01-15 without a key, then maintenance becomes annoying very quickly.
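One way to enforce that consistency is to generate paths from a single helper instead of formatting strings inside each pipeline. A minimal sketch, assuming a bucket named `company-data-lake` and a hypothetical `build_raw_path` helper:

```python
from datetime import date

# Assumed bucket name for illustration
BUCKET = "company-data-lake"

def build_raw_path(source_system: str, dataset: str, load_date: date) -> str:
    """Return the canonical raw-zone prefix for one daily load.

    Every pipeline that uses this helper produces the same
    load_date=YYYY-MM-DD layout, so the key name can never drift
    into dt= or date= variants.
    """
    return (
        f"s3://{BUCKET}/raw/{source_system}/{dataset}/"
        f"load_date={load_date.isoformat()}/"
    )

print(build_raw_path("orders_api", "orders", date(2025, 1, 15)))
# s3://company-data-lake/raw/orders_api/orders/load_date=2025-01-15/
```

If a second date style is ever needed, it gets added here once rather than invented per job.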
Partitioning: do it with a purpose
Partitioning in a data lake is useful, but over-partitioning is also a common mistake. People sometimes create too many partition columns because it looks organized. Later they get lots of tiny files and poor performance.
A simple rule is to partition by fields that are actually used for filtering often. Date is the most common one. Region, country, or tenant can also help in some cases. But if a field has too many possible values, think twice before using it as a folder partition.
A quick comparison looks like this:
| Approach | Good for | Things to watch |
|---|---|---|
| `load_date=YYYY-MM-DD` | raw ingestion and reprocessing | not always best for analytics queries |
| `event_date=YYYY-MM-DD` | query-heavy analytical tables | late-arriving data needs care |
| multi-level partitions like `year/month/day` | broad tool compatibility | path depth gets noisy |
| too many partition keys | narrow access patterns | many small files and harder maintenance |
If you are not sure, start with one date partition. Add more only when there is a clear reason.
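The over-partitioning risk is easy to estimate up front: the number of leaf prefixes is roughly the product of the distinct values of each partition key. A quick back-of-the-envelope sketch (the key names and cardinalities are made-up examples):

```python
from math import prod

def leaf_partition_count(cardinalities: dict[str, int]) -> int:
    """Rough upper bound on leaf prefixes: the product of the
    number of distinct values per partition key."""
    return prod(cardinalities.values())

# One year of daily data, partitioned only by date:
print(leaf_partition_count({"event_date": 365}))  # 365 prefixes: fine

# Adding region plus a high-cardinality customer key:
print(leaf_partition_count(
    {"event_date": 365, "region": 5, "customer_id": 10_000}
))
# 18250000 prefixes: almost certainly over-partitioned,
# and each one tends to hold a handful of tiny files
```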
Raw, staging, and curated should mean something
These names are useful only if the team treats them consistently. My rough definition is:
- raw: data copied from source with minimal or no transformation
- staging: temporary or intermediate outputs used by pipelines
- curated: cleaned and modeled datasets used by downstream consumers
For example, a raw JSON feed from an API might land like this:
```
s3://company-data-lake/raw/payments_api/transactions/load_date=2025-01-15/data.json
```
Then a Spark or Glue job converts it into Parquet in staging:
```
s3://company-data-lake/staging/payments/transactions_parquet/load_date=2025-01-15/
```
And the final analytics-ready table goes into curated:
```
s3://company-data-lake/curated/payments/transactions/event_date=2025-01-15/
```
This makes debugging easier because you can see where the pipeline changed the data.
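The zone move itself can be sketched as a prefix rewrite. This is only the path logic, not the actual data conversion (the Spark or Glue job still renames datasets and swaps partition keys as needed); `promote` is a hypothetical helper name:

```python
def promote(path: str, from_zone: str, to_zone: str) -> str:
    """Rewrite the zone segment of an s3:// path, e.g. staging -> curated.

    Fails loudly if the path is not in the expected zone, which catches
    jobs accidentally pointed at the wrong layer.
    """
    bucket, _, key = path.removeprefix("s3://").partition("/")
    segments = key.split("/")
    if segments[0] != from_zone:
        raise ValueError(f"expected zone {from_zone!r}, got {segments[0]!r}")
    segments[0] = to_zone
    return f"s3://{bucket}/" + "/".join(segments)

print(promote(
    "s3://company-data-lake/staging/payments/transactions/load_date=2025-01-15/",
    "staging", "curated",
))
# s3://company-data-lake/curated/payments/transactions/load_date=2025-01-15/
```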
Naming conventions that help later
A few conventions save a lot of confusion:
- Use lowercase names in bucket prefixes and dataset names.
- Avoid spaces and special characters.
- Use singular or plural consistently. I usually prefer plural for datasets like `orders`, `customers`, `transactions`.
- Keep source system names stable. Do not rename them casually once downstream jobs depend on them.
- Use explicit partition keys like `load_date=2025-01-15` instead of hidden date folders.
For example, this is harder to understand:
```
s3://lake/raw/app1/orders/2025/01/15/
```
This is more self-explanatory:
```
s3://lake/raw/app1/orders/load_date=2025-01-15/
```
Tools like Athena and Glue also work nicely with Hive-style partition paths.
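Part of why explicit keys help is that the partition values can be recovered mechanically from the path itself. A small sketch of that parsing (the helper name is an assumption):

```python
def partition_values(prefix: str) -> dict[str, str]:
    """Extract Hive-style key=value partition segments from an S3 prefix."""
    parts = {}
    for segment in prefix.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

# Explicit keys: the partition is self-describing
print(partition_values("s3://lake/raw/app1/orders/load_date=2025-01-15/"))
# {'load_date': '2025-01-15'}

# Hidden date folders: nothing recoverable without out-of-band convention
print(partition_values("s3://lake/raw/app1/orders/2025/01/15/"))
# {}
```

This is essentially what Hive-compatible tools do when they map paths to partition columns.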
Example Glue or Athena table alignment
If your S3 layout is consistent, the table definition becomes easier. A simplified Athena example could look like this:
```sql
CREATE EXTERNAL TABLE curated_orders (
  order_id string,
  customer_id string,
  amount decimal(10,2),
  region string
)
PARTITIONED BY (event_date date)
STORED AS PARQUET
LOCATION 's3://company-data-lake/curated/orders/';
```
Then the partition paths under that location stay predictable:
```
event_date=2025-01-14/
event_date=2025-01-15/
```
This is much cleaner than trying to point one external table at a bucket path with mixed naming styles.
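One operational note: Athena does not pick up new partition folders automatically. With Hive-style paths like the ones above, new partitions can be registered in bulk or one at a time:

```sql
-- Scan the table LOCATION and register all Hive-style partitions
MSCK REPAIR TABLE curated_orders;

-- Or register a single new day explicitly
ALTER TABLE curated_orders ADD IF NOT EXISTS
  PARTITION (event_date = '2025-01-15');
```

With inconsistent path styles, neither of these works cleanly, which is another argument for the explicit key=value convention.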
Things to be careful about
There are a few common problems.
1. Using one bucket for everything without clear prefixes
This makes permissions, lifecycle policies, and troubleshooting harder. One bucket can still work, but the prefixes need discipline.
2. Mixing raw and transformed files together
If JSON, CSV, and final Parquet files all land in the same path, consumers get confused and automation becomes brittle.
3. Too many small files
Even if the folder structure is good, writing thousands of tiny files hurts query performance and job efficiency. Compaction matters.
4. Partitioning by ingestion when the business queries by event date
For demos, load date is fine. In production, analytical queries often care more about event time than ingestion time. You may need both as metadata, but not necessarily both as path partitions.
5. Renaming prefixes after many jobs are built
This looks harmless but creates downstream breakage, Glue crawler confusion, and permission drift. It is better to decide a reasonable pattern early.
What I would change in production
For a simple demo, one bucket and a few clean prefixes are enough. In production, I would usually add a few more controls:
- separate buckets or at least stricter IAM boundaries for sensitive data
- lifecycle policies for raw and staging zones
- file format standards, usually Parquet for curated data
- data quality checks before promoting data into curated
- compaction strategy to avoid small-file problems
- clear ownership for each source and dataset
I would also document the path contract in one short page. That sounds small, but it helps new engineers avoid inventing a different layout for every new pipeline.
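For the lifecycle policies, the zone prefixes map directly onto S3 lifecycle rules. A minimal sketch of such a configuration; the rule IDs and day counts are example values, and applying it would go through `boto3`'s `put_bucket_lifecycle_configuration`:

```python
# Example lifecycle rules keyed off the zone prefixes.
# Day counts and rule IDs are illustrative assumptions.
lifecycle = {
    "Rules": [
        {
            # Staging is intermediate output: delete it after two weeks
            "ID": "expire-staging",
            "Filter": {"Prefix": "staging/"},
            "Status": "Enabled",
            "Expiration": {"Days": 14},
        },
        {
            # Raw is kept for reprocessing, but moved to cheaper storage
            "ID": "archive-raw",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
    ]
}

# To apply (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="company-data-lake", LifecycleConfiguration=lifecycle
# )
```

Rules like these only work because the zone names are stable prefixes, which is one more reason not to rename them later.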
Conclusion
S3 folder design in a data lake does not need to be complicated, but it should be intentional. A simple structure with clear zones, stable naming, and sensible partitioning is usually enough to keep the lake usable as it grows. If you set this up early, your future pipeline work, reprocessing, and analytics will be much easier to manage.
