Getting started with AWS Glue crawlers
In this article, let us look at how to get started with AWS Glue crawlers, what problem they solve, and why you might want to use them in a simple data lake setup. If you keep files in S3 and want AWS to work out the schema, partitions, and table metadata without you manually defining every table, a crawler is usually the first thing to try.
When teams start building on AWS, it is common to land CSV, JSON, or Parquet files into S3 first and worry about metadata later. That works only for a while. Once Athena, Glue jobs, or downstream ETL starts depending on those files, we need a reliable way to register the datasets in the Glue Data Catalog. That is where crawlers help.
What does an AWS Glue crawler do
A crawler scans data from a source such as S3, identifies the format, tries to infer the schema, and then creates or updates table definitions in the AWS Glue Data Catalog. Those catalog tables can then be queried by services like Athena or used by other Glue jobs.
For a beginner setup, the mental model is simple:
- S3 stores the files
- Glue crawler reads the files and figures out metadata
- Glue Data Catalog stores the table definitions
- Athena or Glue ETL uses that metadata
This is useful because you do not have to manually create every external table up front, especially when you are exploring a new dataset.
A small example use case
Let us say we have sales files landing into S3 like this:
s3://demo-analytics-raw/sales/year=2025/month=01/part-000.csv
s3://demo-analytics-raw/sales/year=2025/month=02/part-000.csv
And the CSV contents look like this:
order_id,customer_id,amount,order_timestamp
1001,C001,49.99,2025-01-10T10:30:00Z
1002,C002,15.50,2025-01-10T11:00:00Z
Instead of manually creating a table in Athena, we can point a crawler at s3://demo-analytics-raw/sales/ and let it discover:
- columns such as order_id, customer_id, amount, order_timestamp
- partitions such as year and month
- the file format
Prerequisites
Before creating the crawler, we need a few things ready:
- An S3 bucket with sample data
- A Glue database in the Data Catalog
- An IAM role that Glue can assume
- Permission for that role to read from S3 and write logs to CloudWatch
For a quick demo, the S3 read permission is usually enough. In production, I would still keep the role narrow and only allow access to the required buckets and prefixes.
Create a Glue database
From the AWS Glue console, create a database like demo_raw. This is just a logical container for your discovered tables.
You can also create it using AWS CLI:
aws glue create-database --database-input '{"Name":"demo_raw"}'
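If you want to confirm it was created, you can fetch the database definition back:
aws glue get-database --name demo_raw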
Create the IAM role
Glue needs an IAM role to access the source data. For a simple setup, the role should allow:
- s3:GetObject
- s3:ListBucket
- CloudWatch Logs permissions
A minimal S3 policy may look like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::demo-analytics-raw"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::demo-analytics-raw/*"]
    }
  ]
}
In many demo environments, people just attach a broader managed policy and move on. That is fine for learning, but I would not keep it that way in a real environment.
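If you prefer to script the role too, a minimal sketch with the CLI could look like this. The policy file names are placeholders, and attaching the AWSGlueServiceRole managed policy is one simple way to cover the Glue service and CloudWatch Logs permissions mentioned above:
# trust.json: allow the Glue service to assume the role
# {"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"glue.amazonaws.com"},"Action":"sts:AssumeRole"}]}
aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://trust.json

# managed policy with Glue service and CloudWatch Logs permissions
aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# s3-read.json: the narrow S3 policy shown above
aws iam put-role-policy --role-name GlueCrawlerRole --policy-name s3-read --policy-document file://s3-read.json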
Create the crawler
Now let us create the crawler from the console. In AWS Glue, choose Crawlers and click Create crawler.
Set the source as the S3 path:
s3://demo-analytics-raw/sales/
Choose the IAM role we created, then select the target database demo_raw. Give the crawler a name like sales-raw-crawler.
You can also do the same using AWS CLI:
aws glue create-crawler --name sales-raw-crawler --role GlueCrawlerRole --database-name demo_raw --targets '{"S3Targets":[{"Path":"s3://demo-analytics-raw/sales/"}]}'
Once the crawler is created, run it manually for the first time.
aws glue start-crawler --name sales-raw-crawler
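The run takes a little while. You can check its progress from the CLI; a State of READY together with a last crawl status of SUCCEEDED means the run has finished:
aws glue get-crawler --name sales-raw-crawler --query 'Crawler.{State:State,LastCrawlStatus:LastCrawl.Status}'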
What happens after the crawler runs
If everything goes fine, Glue will create a table in the Data Catalog. If your files are partitioned using folder names like year=2025/month=01, the crawler can also register those as partitions.
After the run, go to the Glue tables section and inspect the generated schema. You should verify a few things instead of assuming it got everything right:
- Did it identify the correct delimiter and format?
- Did numeric columns become strings by mistake?
- Did it detect partitions correctly?
- Did it merge very different files into one table?
This part matters because crawlers are convenient, but they are not magic. They infer based on what they see. If the sample files are inconsistent, the generated schema might not be what you want.
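You can do the same inspection from the CLI. Assuming the crawler named the table sales after the S3 folder, these two calls show the inferred columns and the registered partition values:
aws glue get-table --database-name demo_raw --name sales --query 'Table.StorageDescriptor.Columns'
aws glue get-partitions --database-name demo_raw --table-name sales --query 'Partitions[].Values'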
Query it from Athena
Once the table exists, we can query it from Athena:
SELECT order_id, customer_id, amount
FROM demo_raw.sales
WHERE year = '2025'
AND month = '01'
LIMIT 10;
If the crawler created partition columns as strings, that is normal. Partition values derived from S3 folder names are usually treated as strings unless you design the layout or the table definition differently.
When to use a crawler and when not to
A crawler is helpful, but it is not the only way to define tables. Here is the simple tradeoff:
| Approach | Good for | Not ideal for |
|---|---|---|
| Glue crawler | Quick setup, schema discovery, exploratory datasets | Strict schema control |
| Manual table definition | Stable datasets, known schema, repeatable production setup | Fast experimentation |
| ETL job creating curated output | Clean production layers, transformed datasets | Very early raw ingestion stage |
For raw ingestion zones, I think crawlers are a good fit. For curated datasets that many dashboards depend on, I usually prefer more explicit schema management.
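To make "more explicit schema management" concrete, here is a rough sketch of defining the same table by hand instead of relying on inference. The table name sales_manual and the column types are assumptions for this CSV layout:
aws glue create-table --database-name demo_raw --table-input '{
  "Name": "sales_manual",
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
  "PartitionKeys": [
    {"Name": "year", "Type": "string"},
    {"Name": "month", "Type": "string"}
  ],
  "StorageDescriptor": {
    "Columns": [
      {"Name": "order_id", "Type": "bigint"},
      {"Name": "customer_id", "Type": "string"},
      {"Name": "amount", "Type": "double"},
      {"Name": "order_timestamp", "Type": "string"}
    ],
    "Location": "s3://demo-analytics-raw/sales/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "Parameters": {"field.delim": ","}
    }
  }
}'
The schema is declared up front, so a stray file cannot silently change column types, but you now own the work of keeping the definition in sync with the data.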
Common limitations and caveats
There are a few things to be careful about.
1. Schema inference can be inconsistent
If one file has amount as 49.99 and another file has amount recorded as NA, the crawler may infer a wider type than you expected. That can create confusion later.
2. Mixed file structures are a problem
If you point one crawler at a very broad S3 prefix containing multiple unrelated structures, it may create too many tables or build the wrong grouping. It is better to keep prefixes clean.
3. Crawler runs are not free of operational cost
Even if the direct cost is not huge, frequent crawls on large prefixes are still something to think about. I would not run them every few minutes unless there is a clear need.
4. It only manages metadata
The crawler does not clean bad records, deduplicate data, or apply business transformations. It only helps catalog the data. That distinction is easy to miss when starting with Glue.
A production note
For a demo, running a crawler manually is enough. In production, I would usually make a few changes:
- trigger crawlers on a schedule only where needed
- separate raw, staged, and curated S3 paths clearly
- avoid depending only on inference for important downstream tables
- add data quality validation before promoting data to curated layers
- use infrastructure as code for crawler, database, IAM, and related resources
For example, if a pipeline writes Parquet files daily to a stable prefix, I might still use a crawler initially, but later switch to managing the table definitions more explicitly once the schema is known. That gives fewer surprises.
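One small example of the scheduling point: instead of running the crawler manually, you can attach a cron expression directly to it. The daily 02:00 UTC value below is just an illustration:
aws glue update-crawler --name sales-raw-crawler --schedule "cron(0 2 * * ? *)"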
Simple Terraform example
If you want to manage the crawler as code, a small Terraform resource would look like this:
resource "aws_glue_crawler" "sales_raw" {
name = "sales-raw-crawler"
database_name = aws_glue_catalog_database.demo_raw.name
role = aws_iam_role.glue_crawler.arn
s3_target {
path = "s3://demo-analytics-raw/sales/"
}
}
This becomes useful when you want the same setup across dev, test, and prod rather than creating everything manually from the console.
Conclusion
AWS Glue crawlers are a good starting point when you have files in S3 and need metadata quickly for Athena or Glue jobs. They save time in the early stages and are especially useful for raw datasets. Just make sure to review the inferred schema and do not assume the crawler always understands the data exactly as you intended. For simple discovery, it is very handy. For production-critical tables, I would still be a bit more explicit.
