Partitioning Strategy for Data Lake Tables: A Practical Walkthrough
In this article, I want to walk through how we approach partitioning for data lake tables. I have seen this done wrong enough times that I think it is worth writing down what actually works in practice. We will look at why partitioning matters, how to pick a strategy, and what goes wrong when you get it wrong. This is based on running partitioned tables in production for a few years now, mostly with Apache Iceberg and Spark.
Why Bother with Partitioning?
Let us start with the basics. When you have a table with a few million rows, you probably do not need to think about partitioning. But once you get into billions of rows, scanning the whole table for every query becomes expensive in both time and money. Partitioning lets you skip reading data you do not need.
For our use case, we had a table tracking user events. It grew to about 8 billion rows over two years. A simple count query took about 45 seconds to run. After we partitioned by date, that same query dropped to under 3 seconds when filtered to a single day. That is the kind of difference we are talking about.
Common Partitioning Strategies
There are a few patterns that tend to work well. Here is what I have seen in practice:
Time-Based Partitioning
This is the most common approach. You partition by date, usually year-month-day or just year-month. It works well when your queries almost always filter by time range.
CREATE TABLE events (
    event_id STRING,
    user_id STRING,
    event_type STRING,
    event_time TIMESTAMP,
    data STRING
) USING iceberg
PARTITIONED BY (days(event_time));
I like using days() in Iceberg because it handles the date truncation for you. You could also use months() or years() depending on your data volume. For our event table, daily partitions worked well because we usually query specific days or ranges of a few days.
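To make the transform concrete, here is a small Python sketch of what days() does conceptually: Iceberg's spec represents day partitions as whole days since the Unix epoch, so every timestamp within the same UTC day maps to the same partition value. The function name is mine, not an Iceberg API.

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Mimic Iceberg's days() transform: a timestamp maps to
    the number of whole days since the Unix epoch (1970-01-01 UTC)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

# Two events on the same UTC day land in the same partition.
a = days_transform(datetime(2023, 6, 15, 8, 30, tzinfo=timezone.utc))
b = days_transform(datetime(2023, 6, 15, 23, 59, tzinfo=timezone.utc))
print(a == b)  # True
```

Because the partition value is derived from event_time, a WHERE clause on event_time is enough for the engine to prune partitions; you never filter on the derived day value directly.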
Hash Partitioning
Sometimes you want to spread data evenly across partitions to avoid hot spots. This is common when you have a high-cardinality column like user_id.
CREATE TABLE user_profiles (
    user_id STRING,
    profile_data STRING,
    updated_at TIMESTAMP
) USING iceberg
PARTITIONED BY (bucket(16, user_id));
The bucket(16, user_id) means Iceberg will hash the user_id into 16 buckets. This spreads writes evenly and can help with lookup queries if you filter by user_id.
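The hash-then-modulo idea behind bucket() can be sketched in a few lines of Python. Note the hedge: Iceberg's actual bucket transform uses a 32-bit Murmur3 hash; MD5 stands in here purely because it is in the standard library and deterministic, so this shows the mechanism, not Iceberg's exact bucket assignments.

```python
import hashlib

NUM_BUCKETS = 16

def bucket(user_id: str, n: int = NUM_BUCKETS) -> int:
    """Illustrative hash bucketing: hash the key, then take it
    modulo the bucket count. Iceberg really uses Murmur3-32;
    MD5 here is only a stand-in for the demonstration."""
    h = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:4], "big")
    return h % n

# The same id always maps to the same bucket, and every bucket
# value falls in [0, 16), so writes spread across 16 partitions.
print(bucket("user-123"), bucket("user-456"))
```

The determinism is what makes lookups work: a query filtering on user_id lets the engine recompute the bucket and scan only one of the 16 partitions.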
Combined Partitioning
You can also combine strategies. This is useful when you have multiple access patterns.
CREATE TABLE events (
    event_id STRING,
    user_id STRING,
    event_type STRING,
    event_time TIMESTAMP,
    data STRING
) USING iceberg
PARTITIONED BY (days(event_time), bucket(8, user_id));
This gives you time-based pruning plus user-based distribution. We tried this for a while but ended up simplifying back to just time-based because most of our queries did not benefit from the user bucketing. Your mileage may vary.
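With the combined spec, each row's partition key is effectively a (day, bucket) pair, which is why a time-only filter still has to touch all 8 buckets for each matching day. A conceptual sketch, again using an illustrative hash rather than Iceberg's Murmur3:

```python
import hashlib
from datetime import datetime, timezone

def partition_key(event_time: datetime, user_id: str, n_buckets: int = 8):
    """Conceptual combined partition key for
    PARTITIONED BY (days(event_time), bucket(8, user_id)):
    a (days-since-epoch, bucket) pair. The hash is illustrative,
    not Iceberg's actual Murmur3 bucket transform."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    day = (event_time - epoch).days
    h = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:4], "big")
    return (day, h % n_buckets)

key = partition_key(datetime(2023, 6, 15, 12, 0, tzinfo=timezone.utc), "user-123")
print(key)
```

A filter on event_time alone prunes to 8 partitions per day; a filter on both event_time and user_id prunes to one. That multiplication of partition count per day is also why combined specs over-partition so easily on low-volume tables.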
Comparison of Strategies
| Strategy | Best For | Watch Out For | Query Pattern |
|---|---|---|---|
| Time-based | Event data, logs, time-series | Too many small partitions if daily on low-volume data | Time range filters |
| Hash/bucket | High-cardinality lookups, even write distribution | Query must filter on bucketed column to benefit | Point lookups, joins |
| Combined | Multiple access patterns | Added complexity, may over-partition | Mixed patterns |
| No partitioning | Small tables, full scans | Query cost grows with table size | Aggregations on full dataset |
What I Learned the Hard Way
The Small Files Problem
If you partition too granularly, you end up with thousands of tiny files. We once partitioned by hour on a table that only got a few thousand events per hour. That created hundreds of tiny Parquet files per day. Queries got slower, not faster, because of all the metadata overhead.
Rule of thumb: aim for files that are at least 100MB. If your partition produces files smaller than that, you are probably partitioning too finely.
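A quick back-of-the-envelope check can tell you whether a candidate granularity will clear the ~100MB bar before you create the table. The compression ratio below is a hypothetical Parquet factor, not a measured one; plug in your own numbers.

```python
def estimate_partition_mb(rows_per_period: int, avg_row_bytes: int,
                          compression_ratio: float = 5.0) -> float:
    """Rough on-disk size of one partition's data in MB.
    compression_ratio is an assumed Parquet compression factor."""
    return rows_per_period * avg_row_bytes / compression_ratio / (1024 * 1024)

# A few thousand events per hour at ~500 bytes each compresses to
# a tiny fraction of the 100MB target, so hourly is far too fine;
# daily is 24x larger but may still warrant monthly on low volume.
hourly = estimate_partition_mb(5_000, 500)
daily = estimate_partition_mb(5_000 * 24, 500)
print(round(hourly, 2), round(daily, 2))
```

If the estimate for your chosen period comes out well under 100MB, coarsen the granularity (daily to monthly, say) before writing billions of rows into the wrong layout.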
Partition Evolution
One nice thing about Iceberg is you can change partitioning without rewriting all your data. We migrated from monthly to daily partitioning mid-project using:
ALTER TABLE events ADD PARTITION FIELD days(event_time);
The old data stays in monthly partitions, new data goes to daily. Queries still work across both. This saved us a big migration headache.
Partition Column Selection
Pick a column that appears in your WHERE clauses. I have seen people partition by a column that never gets filtered on. That does nothing for query performance and just adds overhead to writes.
Hive-Style vs Hidden Partitioning
Iceberg has this concept of hidden partitioning. In Hive, your partition columns show up as actual columns in the table. In Iceberg, you can partition by a transformation like days(event_time) without having a separate date column. The partition is tracked in metadata but does not clutter your schema.
For our use case, this was great because we did not need to maintain a separate date column just for partitioning.
Production Considerations
If you are doing this for real, not just a demo, there are a few more things to think about:
Compaction: Over time, you will get small files from streaming writes or frequent updates. Set up a compaction job to rewrite small files into larger ones. We run this daily on our partitioned tables.
Monitoring: Track partition sizes and file counts. We have alerts if a partition grows beyond a certain size or if we see too many small files piling up.
Retention: With time-based partitioning, you can easily drop old partitions. This is cheaper than deleting rows. We set up automated retention jobs that drop partitions older than 2 years.
Partition Pruning: Make sure your query engine is actually using partition pruning. In Spark, you can check the query plan to see whether partition filters are being pushed down.
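The retention job mentioned above mostly boils down to date arithmetic: compute a cutoff and collect the daily partitions that fall before it. A minimal sketch, with `today` pinned only to keep the example deterministic (the function and 730-day window are my stand-ins for the "older than 2 years" policy):

```python
from datetime import date, timedelta

def partitions_to_drop(partition_days: list[date],
                       retention_days: int = 730,
                       today: date = date(2024, 6, 1)) -> list[date]:
    """Select daily partitions older than the retention window.
    730 days approximates the 2-year retention; `today` is fixed
    here so the example is reproducible."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in partition_days if d < cutoff]

parts = [date(2022, 1, 1), date(2022, 7, 1), date(2024, 5, 1)]
print(partitions_to_drop(parts))  # only the 2022-01-01 partition is past cutoff
```

Dropping a whole partition this way is a metadata operation in Iceberg, which is exactly why it is so much cheaper than row-level deletes.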
When Not to Partition
I should mention that partitioning is not always the answer. If your table is under a few hundred million rows and you mostly do full table scans, partitioning might just add complexity without benefit.
Also, if your queries filter on columns that do not correlate with your partition key, you will not get the pruning benefits. In that case, you might be better off with Z-ordering or sorting within files instead of partitioning.
Wrapping Up
Partitioning is one of those things that seems simple but has enough edge cases to trip you up. Start with time-based partitioning for event data, watch your file sizes, and do not be afraid to adjust your strategy as you learn more about your query patterns. The flexibility of modern table formats like Iceberg makes it much easier to evolve your partitioning over time without painful migrations.
If you are just getting started, my advice is to keep it simple. Daily partitions on a timestamp column will get you 80% of the benefits with minimal complexity. You can always make it more sophisticated later.
