Getting started with AWS Glue crawlers
In this article, let us look at how to get started with AWS Glue crawlers, what problem they solve, and why you might want to use them in a simple data lake setup. If you keep files in S3 and want AWS to work out the schema, partitions, and table metadata without you manually defining every table, a crawler is usually the first thing to try.
When teams start building on AWS, it is common to land CSV, JSON, or Parquet files into S3 first and worry about metadata later. That works only for a while. Once Athena, Glue jobs, or downstream ETL starts depending on those files, we need a reliable way to register the datasets in the Glue Data Catalog. That is where crawlers help.
What does an AWS Glue crawler do
A crawler scans data from a source such as S3, identifies the format, tries to infer the schema, and then creates or updates table definitions in the AWS Glue Data Catalog. Those catalog tables can then be queried by services like Athena or used by other Glue jobs.
For a beginner setup, the mental model is simple:
- S3 stores the files
- Glue crawler reads the files and figures out metadata
- Glue Data Catalog stores the table definitions
- Athena or Glue ETL uses that metadata
This is useful because you do not have to manually create every external table up front, especially when you are exploring a new dataset.
A small example use case
Let us say we have sales files landing into S3 like this:
s3://demo-analytics-raw/sales/year=2025/month=01/part-000.csv
s3://demo-analytics-raw/sales/year=2025/month=02/part-000.csv
And the CSV contents look like this:
order_id,customer_id,amount,order_timestamp
1001,C001,49.99,2025-01-10T10:30:00Z
1002,C002,15.50,2025-01-10T11:00:00Z
Instead of manually creating a table in Athena, we can point a crawler at s3://demo-analytics-raw/sales/ and let it discover:
- columns such as order_id, customer_id, amount, order_timestamp
- partitions such as year and month
- the file format
Prerequisites
Before creating the crawler, we need a few things ready:
- An S3 bucket with sample data
- A Glue database in the Data Catalog
- An IAM role that Glue can assume
- Permission for that role to read from S3 and write logs to CloudWatch
For a quick demo, the S3 read permission is usually enough. In production, I would still keep the role narrow and only allow access to the required buckets and prefixes.
Create a Glue database
From the AWS Glue console, create a database like demo_raw. This is just a logical container for your discovered tables.
You can also create it using AWS CLI:
aws glue create-database --database-input '{"Name":"demo_raw"}'
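If you want to confirm it was created, you can fetch the database definition back:
aws glue get-database --name demo_raw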
Create the IAM role
Glue needs an IAM role to access the source data. For a simple setup, the role should allow:
- s3:GetObject
- s3:ListBucket
- CloudWatch Logs permissions
A minimal S3 policy may look like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::demo-analytics-raw"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::demo-analytics-raw/*"]
    }
  ]
}
In many demo environments, people just attach a broader managed policy and move on. That is fine for learning, but I would not keep it that way in a real environment.
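If you prefer to script the role too, a minimal sketch with the CLI could look like this. The policy file names are placeholders, and attaching the AWSGlueServiceRole managed policy is one simple way to cover the Glue service and CloudWatch Logs permissions mentioned above:
# trust.json: allow the Glue service to assume the role
# {"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"glue.amazonaws.com"},"Action":"sts:AssumeRole"}]}
aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://trust.json

# managed policy with Glue service and CloudWatch Logs permissions
aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# s3-read.json: the narrow S3 policy shown above
aws iam put-role-policy --role-name GlueCrawlerRole --policy-name s3-read --policy-document file://s3-read.json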
Create the crawler
Now let us create the crawler from the console. In AWS Glue, choose Crawlers and click Create crawler.
Set the source as the S3 path:
s3://demo-analytics-raw/sales/
Choose the IAM role we created, then select the target database demo_raw. Give the crawler a name like sales-raw-crawler.
You can also do the same using AWS CLI:
aws glue create-crawler --name sales-raw-crawler --role GlueCrawlerRole --database-name demo_raw --targets '{"S3Targets":[{"Path":"s3://demo-analytics-raw/sales/"}]}'
Once the crawler is created, run it manually for the first time.
aws glue start-crawler --name sales-raw-crawler
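The run takes a little while. You can check its progress from the CLI; a State of READY together with a last crawl status of SUCCEEDED means the run has finished:
aws glue get-crawler --name sales-raw-crawler --query 'Crawler.{State:State,LastCrawlStatus:LastCrawl.Status}'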
What happens after the crawler runs
If everything goes fine, Glue will create a table in the Data Catalog. If your files are partitioned using folder names like year=2025/month=01, the crawler can also register those as partitions.
After the run, go to the Glue tables section and inspect the generated schema. You should verify a few things instead of assuming it got everything right:
- Did it identify the correct delimiter and format?
- Did numeric columns become strings by mistake?
- Did it detect partitions correctly?
- Did it merge very different files into one table?
This part matters because crawlers are convenient, but they are not magic. They infer based on what they see. If the sample files are inconsistent, the generated schema might not be what you want.
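You can do the same inspection from the CLI. Assuming the crawler named the table sales after the S3 folder, these two calls show the inferred columns and the registered partition values:
aws glue get-table --database-name demo_raw --name sales --query 'Table.StorageDescriptor.Columns'
aws glue get-partitions --database-name demo_raw --table-name sales --query 'Partitions[].Values'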
Query it from Athena
Once the table exists, we can query it from Athena:
SELECT order_id, customer_id, amount
FROM demo_raw.sales
WHERE year = '2025'
AND month = '01'
LIMIT 10;
If the crawler created partition columns as strings, that is normal. Partition values derived from S3 folder names are usually treated as strings unless you design the layout or the table definition differently.
When to use a crawler and when not to
A crawler is helpful, but it is not the only way to define tables. Here is the simple tradeoff:
| Approach | Good for | Not ideal for |
|---|---|---|
| Glue crawler | Quick setup, schema discovery, exploratory datasets | Strict schema control |
| Manual table definition | Stable datasets, known schema, repeatable production setup | Fast experimentation |
| ETL job creating curated output | Clean production layers, transformed datasets | Very early raw ingestion stage |
For raw ingestion zones, I think crawlers are a good fit. For curated datasets that many dashboards depend on, I usually prefer more explicit schema management.
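To make "more explicit schema management" concrete, here is a rough sketch of defining the same table by hand instead of relying on inference. The table name sales_manual and the column types are assumptions for this CSV layout:
aws glue create-table --database-name demo_raw --table-input '{
  "Name": "sales_manual",
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
  "PartitionKeys": [
    {"Name": "year", "Type": "string"},
    {"Name": "month", "Type": "string"}
  ],
  "StorageDescriptor": {
    "Columns": [
      {"Name": "order_id", "Type": "bigint"},
      {"Name": "customer_id", "Type": "string"},
      {"Name": "amount", "Type": "double"},
      {"Name": "order_timestamp", "Type": "string"}
    ],
    "Location": "s3://demo-analytics-raw/sales/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "Parameters": {"field.delim": ","}
    }
  }
}'
The schema is declared up front, so a stray file cannot silently change column types, but you now own the work of keeping the definition in sync with the data.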
Common limitations and caveats
There are a few things to be careful about.
1. Schema inference can be inconsistent
If one file has amount as 49.99 and another file has amount recorded as NA, the crawler may infer a wider type than you expected. That can create confusion later.
2. Mixed file structures are a problem
If you point one crawler at a very broad S3 prefix containing multiple unrelated structures, it may create too many tables or build the wrong grouping. It is better to keep prefixes clean.
3. Crawler runs are not free of operational cost
Even if the direct cost is not huge, frequent crawls on large prefixes are still something to think about. I would not run them every few minutes unless there is a clear need.
4. It only manages metadata
The crawler does not clean bad records, deduplicate data, or apply business transformations. It only helps catalog the data. That distinction is easy to miss when starting with Glue.
A production note
For a demo, running a crawler manually is enough. In production, I would usually make a few changes:
- trigger crawlers on a schedule only where needed
- separate raw, staged, and curated S3 paths clearly
- avoid depending only on inference for important downstream tables
- add data quality validation before promoting data to curated layers
- use infrastructure as code for crawler, database, IAM, and related resources
For example, if a pipeline writes Parquet files daily to a stable prefix, I might still use a crawler initially, but later switch to managing the table definitions more explicitly once the schema is known. That gives fewer surprises.
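One small example of the scheduling point: instead of running the crawler manually, you can attach a cron expression directly to it. The daily 02:00 UTC value below is just an illustration:
aws glue update-crawler --name sales-raw-crawler --schedule "cron(0 2 * * ? *)"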
Simple Terraform example
If you want to manage the crawler as code, a small Terraform resource would look like this:
resource "aws_glue_crawler" "sales_raw" {
name = "sales-raw-crawler"
database_name = aws_glue_catalog_database.demo_raw.name
role = aws_iam_role.glue_crawler.arn
s3_target {
path = "s3://demo-analytics-raw/sales/"
}
}
This becomes useful when you want the same setup across dev, test, and prod rather than creating everything manually from the console.
Conclusion
AWS Glue crawlers are a good starting point when you have files in S3 and need metadata quickly for Athena or Glue jobs. They save time in the early stages and are especially useful for raw datasets. Just make sure to review the inferred schema and do not assume the crawler always understands the data exactly as you intended. For simple discovery, it is very handy. For production-critical tables, I would still be a bit more explicit.
