Apache Spark Transformations Every Data Engineer Should Know
In this article, let us go through the Apache Spark transformations that come up again and again in real pipelines. I am not going to list every function in the API. Instead, I will focus on the ones that either confuse people or have hidden performance traps that can blow up a job when the data grows.
We will look at examples using the DataFrame API since that is what most teams use these days, but I will mention the RDD equivalents where it matters.
1. map vs flatMap
This one still trips people up. map applies a function to each row and returns one row back. flatMap can return zero, one, or many rows per input row.
For our use case, if we have a DataFrame of log lines where each line contains multiple comma-separated values, flatMap is what we want to explode those into separate records.
1
2
3
4
5
# map: one input -> one output
df.select(col("raw_log")).rdd.map(lambda row: row[0].upper())
# flatMap: one input -> many outputs
df.select(col("raw_log")).rdd.flatMap(lambda row: row[0].split(","))
With DataFrames, you would normally use explode instead of flatMap, but the concept is the same. I have seen people try to do nested list parsing with map and then wonder why they get arrays inside arrays.
2. filter and where
These are the same thing. where is just an alias for filter. I tend to use filter out of habit, but teams usually settle on one and stick with it.
1
2
df.filter(col("status") == "error")
df.where((col("date") >= "2025-01-01") & (col("date") <= "2025-01-31"))
One thing I noticed in practice: if you chain too many filters, Spark sometimes does not push all of them down to the source. If you are reading Parquet, check the physical plan to see if predicate pushdown is actually happening. I have seen cases where filtering after a join prevented partition pruning and the job scanned way more data than needed.
3. reduceByKey vs groupByKey
This is probably the most expensive mistake people make with RDDs. Both group by a key, but reduceByKey does a map-side combine before shuffling. groupByKey shuffles everything.
| Transformation | Shuffle Size | Use When |
|---|---|---|
reduceByKey | Smaller | You can merge values locally like sum, count, or max |
groupByKey | Larger | You really need all values together for each key |
1
2
3
4
5
# Better: less data shuffled
rdd.reduceByKey(lambda a, b: a + b)
# Worse: sends all values across the network
rdd.groupByKey().mapValues(sum)
With DataFrames, groupBy is usually optimized automatically, but if you drop down to RDDs for a custom operation, remember this distinction. In a production use case, I would avoid RDDs entirely unless there is a very specific reason.
4. withColumn and drop
These are bread and butter for DataFrame work. withColumn adds or replaces a column. drop removes it.
1
2
3
(df.withColumn("processed_date", to_date(col("timestamp")))
.withColumn("is_error", col("status_code") >= 500)
.drop("raw_payload"))
A caveat: if you chain many withColumn calls, Spark builds a deep expression tree. I have seen pipelines with fifty-plus withColumn steps start to slow down during analysis. If that happens, you can break it up with a select to trim the plan.
5. Window functions
Window functions are powerful but easy to misuse. They let you calculate things like running totals or rankings without collapsing rows.
1
2
3
4
5
6
from pyspark.sql.window import Window
window_spec = Window.partitionBy("user_id").orderBy("event_time")
(df.withColumn("row_num", row_number().over(window_spec))
.withColumn("running_total", sum("amount").over(window_spec)))
The catch is that if your window is large or unbounded, Spark has to hold a lot of state in memory. I once had a job run out of memory because the partition was too big and the window frame was much wider than needed. For our use case, if you only need the latest record per key, a groupBy with max is often cheaper than a window.
6. Joins
Spark has several join strategies: broadcast hash join, shuffle hash join, sort-merge join, and shuffle replicate join. You do not always control which one is picked, but you can hint it.
1
2
3
4
from pyspark.sql.functions import broadcast
# Force a broadcast join if the lookup table is small
df.join(broadcast(lookup_df), "user_id")
In production, I would check the SQL tab in the Spark UI to see what join type was chosen. If a small table is causing a shuffle join, that is usually a sign that Spark misestimated its size. You can also tune spark.sql.autoBroadcastJoinThreshold if many lookups are small.
7. repartition vs coalesce
These both change the number of partitions, but they behave very differently.
| Method | Behavior | Use When |
|---|---|---|
repartition | Full shuffle, even if reducing partitions | You need better data distribution |
coalesce | Narrow transformation, usually avoids shuffle | You are reducing partitions near the end |
1
2
# Before writing: reduce to 10 files without shuffling
df.coalesce(10).write.parquet("s3://bucket/output/")
I made the mistake of using repartition(10) before a write and watching the job shuffle gigabytes of data just to produce fewer files. In a production use case, coalesce is almost always what you want for the final write step unless you have badly skewed partitions.
8. distinct vs dropDuplicates
distinct removes duplicate rows based on all columns. dropDuplicates lets you specify which columns to consider.
1
2
3
4
5
# Remove exact duplicates across all columns
df.distinct()
# Remove duplicates based on ID only, keeping the first matching row
df.dropDuplicates(["event_id"])
One thing to be careful about: dropDuplicates without an explicit order is non-deterministic about which row it keeps. If you care about getting the latest record, use a window function or an explicit groupBy with max after defining the business rule.
9. cache and persist
If you reuse a DataFrame multiple times, caching can save recomputation. But it is not free.
1
2
3
4
5
6
7
from pyspark import StorageLevel
# Cache in memory only
df.cache()
# Cache in memory and disk
df.persist(StorageLevel.MEMORY_AND_DISK)
I have seen people cache every intermediate DataFrame and then run out of executor memory. My rule of thumb is simple: only cache if the DataFrame is expensive to compute and is used more than once. In a production use case, I would also unpersist when done so downstream stages have room to run.
Practical limitations
- Not all transformations are lazy in the way people imagine. Actions trigger execution, but transformations like
sortstill introduce expensive shuffles. - The order of operations matters. Filter early to reduce shuffle size.
- Spark UI is your friend. If a stage is slow, look for skewed partitions, spill, or unexpected shuffles.
- Default configurations are rarely right for production. Tuning partition sizes, broadcast thresholds, and memory often yields bigger gains than rewriting logic.
What would change in production?
For a simple demo, you can get away with reading a CSV, doing a few transformations, and writing back out. In production, we would also:
- Add checkpointing or write-ahead logging for long-running streaming jobs
- Use Delta Lake or Iceberg for ACID guarantees and easier recovery
- Partition output by date or business key so downstream queries stay efficient
- Monitor garbage collection pauses and executor memory usage
- Set up retries and idempotency so the same job can run twice without duplicating data
These transformations are the building blocks of most Spark pipelines. Knowing how they work under the hood, and where the common traps are, makes the difference between a job that finishes in minutes and one that hangs for hours. Start with the simple API, but always keep an eye on what Spark is doing underneath.
