Querying Your S3 Data Lake with Amazon Athena — A Practical Guide

Posted Jul 15, 2025

By Ashok KS 9 min read

If you have data sitting in S3 and someone asks you a question that needs SQL to answer, you basically have two paths: spin up something that loads the data somewhere first, or query it where it lives. Athena lets you do the second one. You point it at files in S3, write SQL, and get results. No clusters to manage, no data to copy. In this article we will walk through a realistic setup: creating tables over Parquet and CSV files in S3, dealing with partitioning, and a few things that will bite you the first time you try this on a real dataset.

When Athena Makes Sense

Athena is not a database. It is a query engine that reads files directly from S3 and charges you based on how much data it scans. This shapes when you would reach for it and when you wouldn’t.

Scenario	Athena Fit	Notes
Ad-hoc analysis on S3 data	Great	Point at existing files, write SQL, done
Dashboard with sub-second latency	Bad	Athena is not low-latency; queries take seconds minimum
Scheduled reports on partitioned data	Good	Combine with saved queries or Step Functions
Heavy ETL transformations	Bad	Use Glue Jobs or EMR for that; Athena is for querying
Log analysis on raw JSON/CSV	Good	One-off investigation without building a pipeline
Replacing a traditional data warehouse	Maybe	Works for moderate volumes but watch your costs

If your use case is “I want to run SQL on files in S3 without standing up infrastructure,” Athena is probably the right call. If you are building something that needs consistent sub-second response times, look elsewhere.

Setting Up Your First Table

Let us assume you already have data in S3. For this walkthrough, we have Parquet files stored under s3://my-data-lake/sales/. Each file contains rows with columns like order_id, customer_id, amount, and order_date.

Athena needs a table definition before you can query. The most common way is using the Glue Data Catalog. You can create tables from the Athena console, through the Glue console, or using DDL statements in Athena itself. We will use DDL because it is reproducible and version-controllable.

Step 1: Create a Database

  
CREATE DATABASE IF NOT EXISTS sales_analytics;

This is just a namespace in the Glue Catalog. Nothing gets created in S3.

Step 2: Create an External Table for Parquet Files

  
CREATE EXTERNAL TABLE IF NOT EXISTS sales_analytics.orders (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE,
  order_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

A few things worth pointing out:

EXTERNAL means Athena does not own the data. Dropping the table removes the metadata only; your S3 files stay untouched.
The LOCATION points to a folder prefix. Every file under that prefix gets read when you query.
STORED AS PARQUET tells Athena what format to expect. It handles Parquet, ORC, Avro, JSON, CSV, and a few others.

Now you can run:

  
SELECT * FROM sales_analytics.orders LIMIT 10;

Behind the scenes, Athena reads the Parquet files from S3, applies the schema, and streams back results. There is no loading step. This is the part that feels like magic the first time.

Partitioning — This Is Where the Cost Lives

If you have a year’s worth of sales data and you query only last week’s orders, Athena still scans every file under s3://my-data-lake/sales/ unless you partition. You pay for scanned bytes, so this matters a lot.

Partitioned Table Example

Assume your folder structure looks like this:

s3://my-data-lake/sales/year=2025/month=06/day=15/
s3://my-data-lake/sales/year=2025/month=06/day=16/
...

This is Hive-style partitioning. You can create a table that understands it:

  
CREATE EXTERNAL TABLE IF NOT EXISTS sales_analytics.orders_partitioned (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/';

After creating the table, you need to load the partition metadata:

  
MSCK REPAIR TABLE sales_analytics.orders_partitioned;

This scans the S3 prefix and registers every partition it finds. For buckets with many partitions, MSCK REPAIR TABLE can be slow. A faster alternative is ALTER TABLE ... ADD PARTITION if you know the partitions in advance, which you often do in automated pipelines:

  
ALTER TABLE sales_analytics.orders_partitioned
ADD PARTITION (year=2025, month=6, day=15)
LOCATION 's3://my-data-lake/sales/year=2025/month=06/day=15/';

Now when you query with a filter on the partition columns, Athena only scans the matching folders:

  
SELECT SUM(amount)
FROM sales_analytics.orders_partitioned
WHERE year = 2025 AND month = 6 AND day = 15;

Without partitioning, this query might scan terabytes. With it, maybe a few megabytes. The cost difference is not subtle.

Querying CSVs and JSON — The Messy Reality

Parquet is clean. The schema is embedded in the file. CSV and JSON are where things get interesting.

CSV Headaches

  
CREATE EXTERNAL TABLE sales_analytics.orders_csv (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE,
  order_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-data-lake/sales-csv/'
TBLPROPERTIES ('skip.header.line.count'='1');

A few things that go wrong in practice:

Header rows: If you forget skip.header.line.count, your column headers become a data row and you get type mismatch errors on the first query.
Commas inside quoted fields: Athena’s default CSV SerDe does not handle quoted fields gracefully. If your data has commas inside values, use the OpenCSVSerde or convert to Parquet first.
Date formats: Athena expects yyyy-MM-dd by default. If your CSV uses MM/dd/yyyy, you will need to read it as STRING and cast it in the query.

JSON and Nested Structures

For JSON, use the JSON SerDe:

  
CREATE EXTERNAL TABLE sales_analytics.orders_json (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE,
  items ARRAY<STRUCT<item_id:STRING, quantity:INT, price:DOUBLE>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/sales-json/';

Athena handles nested JSON with structs and arrays, which works well for most use cases. The one gotcha is that every line in your JSON files must be a single JSON object (JSON Lines format). If you have a single pretty-printed JSON file with an array of objects at the root, Athena will not parse it correctly. You would need to preprocess it or use a different approach.

CTAS — Create Table as Select

If you query the same raw CSV files repeatedly, it is worth converting them to Parquet with partitioning in one shot:

  
CREATE TABLE sales_analytics.orders_optimized
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by = ARRAY['year', 'month', 'day'],
  external_location = 's3://my-data-lake/sales-optimized/'
)
AS SELECT
  order_id,
  customer_id,
  amount,
  YEAR(order_date) AS year,
  MONTH(order_date) AS month,
  DAY(order_date) AS day
FROM sales_analytics.orders_csv;

This does a one-time scan of the CSV data, converts everything to compressed Parquet with partitioning, and writes it to a new S3 location. From that point on you query the optimized table and your costs drop significantly. For any data you query regularly, this step more than pays for itself.

Cost Control — Things You Should Do Before Someone Notices the Bill

Athena charges $5 per terabyte scanned. That sounds cheap until someone runs SELECT * on a year of uncompressed JSON. A few practical safeguards:

Use workgroups with data usage limits. You can set a per-query data scan limit at the workgroup level. If someone accidentally runs a query that would scan 10 TB, Athena blocks it instead of running it.
Partition everything you query regularly. Partitioning is the single biggest lever on cost. Even a basic year/month partition cuts scan volume by 90% or more for most date-range queries.
Use columnar formats. Parquet and ORC are compressed and columnar. Athena only reads the columns referenced in your query. A SELECT order_id on a Parquet table with 50 columns reads substantially fewer bytes than the same query on CSV.
Check the console estimate. Before running a query in the console, Athena shows an estimated scanned bytes figure. It is not always accurate but it will catch the obvious disasters.
Monitor with CloudWatch. Set up billing alerts tied to Athena usage. You do not want to discover a $400 query at the end of the month.

Production Considerations

In a demo, you create tables manually in the Athena console and it works fine. In production, a few things change:

Infrastructure as Code: Your table DDL should live in Terraform or CloudFormation (using the aws_glue_catalog_table or AWS::Glue::Table resources). This gives you repeatable environments and change tracking.
Partition management: Instead of running MSCK REPAIR TABLE manually, add a step to your data pipeline that runs ALTER TABLE ... ADD PARTITION after writing new data. Or use Glue Crawlers on a schedule, though crawlers add their own costs and can be slow on large buckets.
IAM permissions: Athena needs permissions to read from your S3 buckets and write query results to an output bucket. Lock this down with least-privilege policies. The query results bucket should have a lifecycle policy to clean up old results, otherwise you accumulate gigabytes of CSV output files that nobody ever looks at.
Query result location: Set a default output location for each workgroup. If you leave it unset, Athena will fail queries and the error message is not particularly helpful.
Named queries and saved queries: If you have a set of queries that get run regularly, save them in the Athena console or version them in your repo. Do not rely on someone copying and pasting from a Slack thread from six months ago.

A Quick Comparison: Athena vs Redshift Spectrum vs Presto on EMR

Since Athena uses Presto under the hood, you might wonder why you would use Redshift Spectrum or run Presto yourself on EMR.

Approach	When to Use
Athena	Ad-hoc queries, light scheduled workloads, no existing cluster infrastructure
Redshift Spectrum	You already have a Redshift cluster and want to query S3 data alongside your warehouse tables
Presto on EMR	You need more control over resources, query concurrency, or you have a dedicated analytics platform team

For most teams without an existing Redshift footprint, Athena is the simplest starting point.

Wrapping Up

Athena does one thing well: it lets you run SQL against files in S3 without managing servers. The trade-off is that you need to care about file formats, partitioning, and query patterns in a way you would not with a database that manages its own storage. Get your data into Parquet, partition it by the columns you filter on most often, and set a data usage limit in your workgroup. Those three things cover most of what goes wrong in practice.

It is not a replacement for a data warehouse, but for the right workloads it saves an enormous amount of pipeline code that would otherwise exist just to load data somewhere before you can ask a simple question.

Data Engineering, AWS

This post is licensed under CC BY 4.0 by the author.