Cost optimization basics for AWS data pipelines
In this article, let us look at some cost optimization basics for AWS data pipelines: why they matter early, and which simple changes usually reduce the bill without making the platform too complicated. This matters because many teams build a working pipeline first and only look at cost later. I have seen this happen often: the jobs run, the tables get built, everyone is happy, and then a month later the Athena queries, Glue job runs, and EMR cluster usage start looking more expensive than expected.
For our use case, think of a small batch pipeline that lands CSV files in S3, uses Glue or Spark to clean them, and then makes the data available for analytics through Athena or Redshift. We do not need advanced FinOps to improve this. A few practical habits already make a big difference.
Where AWS pipeline cost usually comes from
Before trying to optimize, it helps to know the common sources of spend. A simple view is below.
| Area | What usually drives cost | Common beginner mistake |
|---|---|---|
| S3 storage | Data volume, duplicate files, lifecycle retention | Keeping every intermediate file forever |
| Athena | Data scanned per query | Querying raw CSV and using select * |
| Glue | DPU runtime and job duration | Oversized jobs for small datasets |
| EMR | Cluster uptime and instance type | Leaving clusters running idle |
| Redshift | Cluster size and inefficient queries | Loading too much raw detail into expensive tables |
Usually the biggest savings come from reducing unnecessary scanning and not keeping compute running longer than needed.
1. Start with S3 layout and file format
A lot of cost optimization starts before any SQL runs. If raw files are badly organized, every downstream service becomes more expensive.
Let us say source files land like this:
```text
s3://company-raw/orders/orders_2025_01_21.csv
s3://company-raw/orders/orders_2025_01_22.csv
```
This works, but it is not ideal once the data grows. I prefer keeping data in partition-style folders so queries and jobs can read less data.
```text
s3://company-raw/orders/dt=2025-01-21/orders.csv
s3://company-raw/orders/dt=2025-01-22/orders.csv
```
Even better, after the raw load, convert the files into Parquet instead of continuing with CSV for analytics use cases. Athena pricing is based on data scanned, so columnar compressed files usually help a lot.
For example, a Glue job can read CSV and write Parquet:
```python
# Spark picks up dt as a column from the dt= folder names in the raw layout
df = spark.read.option("header", "true").csv("s3://company-raw/orders/")
df.write.mode("overwrite").partitionBy("dt").parquet("s3://company-curated/orders/")
```
For our use case, this one change is often more valuable than trying to fine tune every query later.
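One small follow-up worth mentioning: after the Glue job writes new dt= folders, Athena only sees them once the partitions are registered in the Glue catalog. Below is a minimal sketch with boto3, assuming the curated table is already defined as curated_orders in a database called analytics and that a results bucket exists (all three names are made up here):

```python
import boto3

athena = boto3.client("athena")

# Ask Athena to pick up any dt= folders the Glue job just wrote.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE curated_orders",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://company-athena-results/"},
)
```

For a single known date, ALTER TABLE ... ADD PARTITION is cheaper than repairing the whole table, but the idea is the same: if Athena cannot see the partitions, it cannot prune them.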
2. Be careful with Athena query habits
Athena is very easy to start with, which is exactly why teams sometimes spend more than they expect. It is convenient, but convenience can hide waste.
A query like below is a classic example:
```sql
select *
from curated_orders
where order_date between date '2025-01-01' and date '2025-01-31';
```
If the table is wide and not partitioned properly, this scans far more data than needed. A better pattern is to select only the columns we need and filter on the partition column directly.
```sql
select order_id, customer_id, order_amount
from curated_orders
where dt between '2025-01-01' and '2025-01-31';
```
Two practical notes here:
- Make sure the partition column is actually used in the where clause.
- Do not expose raw tables to everyone if most people only need curated data.
If analysts keep querying raw CSV data, the platform bill will show it sooner or later.
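One way to make this visible early is to check how much data individual queries actually scanned, since that is what Athena bills for. A minimal boto3 sketch, assuming we already have a query execution ID from the Athena history or from start_query_execution:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical execution ID; in practice it comes from
# start_query_execution or the Athena query history.
query_id = "11111111-2222-3333-4444-555555555555"

result = athena.get_query_execution(QueryExecutionId=query_id)
stats = result["QueryExecution"]["Statistics"]

# Athena charges per data scanned, so this number maps directly to cost.
scanned_gb = stats["DataScannedInBytes"] / (1024 ** 3)
print(f"Data scanned: {scanned_gb:.2f} GB")
```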
3. Right-size Glue jobs instead of assuming bigger is safer
I think many beginner teams over-allocate Glue jobs because they are worried about failures. That is understandable, but it increases cost quickly. If a small transformation handles a few hundred MB of daily data, it usually does not need a large job configuration.
A simple job config might look like this in Terraform or JSON style settings:
```json
{
  "glue_version": "4.0",
  "worker_type": "G.1X",
  "number_of_workers": 2,
  "timeout": 30
}
```
I would start smaller, measure runtime, and increase only if needed. If a job runs in 6 minutes with 2 workers, jumping to 10 workers may not create proportional value.
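To make that call from data instead of guesses, it helps to look at recent run times before changing the worker count. A small boto3 sketch, with a made-up job name:

```python
import boto3

glue = boto3.client("glue")

# Look at the last few runs of the job (name is hypothetical) to see
# whether it already finishes fast enough with the current workers.
runs = glue.get_job_runs(JobName="orders-clean-daily", MaxResults=10)

for run in runs["JobRuns"]:
    # ExecutionTime is in seconds; DPU-hours are what drive Glue cost.
    print(run["Id"], run["JobRunState"], run.get("ExecutionTime"))
```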
Also check whether the job is rereading the full dataset every day when only one partition changed. Incremental processing is both a performance and cost optimization.
Pseudo-code for a simple incremental pattern:
```python
process_date = args["dt"]
input_path = f"s3://company-raw/orders/dt={process_date}/"
output_path = "s3://company-curated/orders/"
```
This is much better than scanning the full bucket on every run.
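Spelled out a bit more, a daily incremental Glue run could look roughly like the sketch below. This is only a sketch, assuming the job receives dt as a job argument and uses the raw and curated paths from earlier:

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Glue passes job arguments such as --dt 2025-01-21 via sys.argv.
args = getResolvedOptions(sys.argv, ["dt"])
process_date = args["dt"]

spark = SparkSession.builder.getOrCreate()

# Read only the raw partition for the date being processed.
df = spark.read.option("header", "true").csv(
    f"s3://company-raw/orders/dt={process_date}/"
)

# Write Parquet into the matching partition of the curated dataset,
# leaving every other partition untouched.
df.write.mode("overwrite").parquet(
    f"s3://company-curated/orders/dt={process_date}/"
)
```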
4. Shut down EMR when the work is done
If the team uses EMR, idle cost becomes the main thing to watch. The cluster may be fine during active processing, but a cluster sitting around between jobs is usually just wasted spend.
For batch workloads, I prefer transient clusters where possible. Create the cluster, run the job, write the outputs, and terminate it. In simple terms:
```bash
aws emr create-cluster ...
aws emr add-steps ...
aws emr terminate-clusters --cluster-ids j-XXXXXXXX
```
In a production use case, orchestration should handle this automatically and also alert if a cluster stays alive longer than expected. If the team has continuous usage or many short jobs, then a longer running cluster may still make sense, but it should be a deliberate choice.
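As a simple safety net, a small script run on a schedule (a Lambda or a cron job, for example) can flag clusters that have been alive longer than expected. A sketch with boto3; the four hour threshold is just an assumption to tune per team:

```python
from datetime import datetime, timedelta, timezone

import boto3

emr = boto3.client("emr")

# How long a batch cluster may live before we start asking questions.
max_age = timedelta(hours=4)
now = datetime.now(timezone.utc)

# Only clusters that are still alive are interesting here.
clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])

for cluster in clusters["Clusters"]:
    created = cluster["Status"]["Timeline"]["CreationDateTime"]
    if now - created > max_age:
        print(f"Cluster {cluster['Id']} ({cluster['Name']}) up for {now - created}")
```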
5. Use lifecycle policies for raw and intermediate data
One easy win that gets forgotten is S3 lifecycle management. Raw data may need long retention, but intermediate files, temp outputs, and duplicate staging folders often do not.
A simple lifecycle idea is:
- keep raw source files for 90 days or based on compliance need
- keep temporary staging files for 7 days
- move old infrequently used data to cheaper storage classes
Example lifecycle configuration:
```json
{
  "Rules": [
    {
      "ID": "cleanup-staging",
      "Prefix": "staging/",
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```
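If the team prefers applying this from code instead of the console, a minimal boto3 sketch could look like the one below. The bucket name is made up, and note that this call replaces the bucket's existing lifecycle configuration, so every rule you want to keep must be included:

```python
import boto3

s3 = boto3.client("s3")

# Apply the staging cleanup rule from above; bucket name is hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="company-curated",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cleanup-staging",
                "Filter": {"Prefix": "staging/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```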
This is not glamorous work, but it prevents slow storage growth from going unnoticed.
6. Watch Redshift table design and workload scope
If curated data is loaded into Redshift, cost optimization is not only about the cluster size. It is also about what we choose to keep there. Not every raw event or historical detail needs to live in a heavy analytics warehouse table.
For our use case, I would keep frequently queried reporting tables in Redshift and leave cold detail data in S3 plus Athena if that is acceptable for the user experience. We could also aggregate before loading.
For example, instead of loading every click event into a dashboard schema, create a daily summary first:
```sql
create table reporting.daily_orders as
select order_date, count(*) as order_count, sum(order_amount) as revenue
from staging.orders
group by order_date;
```
This reduces storage and often improves dashboard performance too.
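If that summary needs refreshing on a schedule, one option is the Redshift Data API, so the orchestrator does not need a persistent database connection. This is only a sketch, with the cluster, database, and user names made up, and written as a daily insert rather than the one-time create table above:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Append yesterday's summary; cluster, database, and user are assumptions.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="reporting_user",
    Sql="""
        insert into reporting.daily_orders
        select order_date, count(*) as order_count, sum(order_amount) as revenue
        from staging.orders
        where order_date = current_date - 1
        group by order_date;
    """,
)
```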
Things to be careful about
There are a few caveats with all this.
- Over-optimizing too early can waste engineering time if the data volume is still tiny.
- Smaller jobs are cheaper, but if they become unreliable then the operational cost comes back in another form.
- Lifecycle rules should match recovery and compliance needs, otherwise cleanup can remove data you still need.
- Query optimization depends on good table design. If partitions are wrong, even disciplined SQL will not help much.
So I would treat cost optimization as practical guardrails, not as a one-time cleanup task.
Conclusion
If I had to start simple, I would focus on partitioned S3 layouts, Parquet for analytics, tighter Athena query habits, smaller incremental Glue jobs, transient EMR clusters for batch work, and lifecycle rules for stale data. These are not complicated ideas, but they usually cut waste early and make the platform easier to scale later.
