A practical AWS data platform architecture for a small team
In this article, let us see how to put together a data platform for a small team on AWS without turning it into a huge platform engineering project. This kind of setup is useful when the team wants to ingest files or database extracts, clean them a bit, store them in layers, and make them queryable for analysts, but does not have the time or headcount to run a very complicated stack.
For our use case, the goal is not to build the perfect enterprise platform. The goal is to build something that is simple enough to maintain, cheap enough to justify, and flexible enough to grow later. We will use managed AWS services as much as possible so that the data team can focus more on pipelines than on managing servers.
What we are trying to build
A simple version of the platform can look like this:
- Source systems place files in S3 or export data on a schedule
- S3 stores raw data in a landing area
- AWS Glue jobs clean and transform the data
- Processed data is written back to S3 in curated folders
- Glue Catalog keeps table metadata
- Athena is used for ad hoc querying
- Step Functions orchestrates the pipeline
- CloudWatch captures logs and alerts
This is usually enough for a small team that needs batch ingestion and reporting. If your team is doing near real-time streaming, machine learning feature serving, or heavy warehouse workloads, then the design would change a bit. But for many teams, this simple pattern works well.
Core services and why I would pick them
Below is a practical comparison for a small team:
| Need | Simple AWS choice | Why it fits a small team |
|---|---|---|
| Raw and curated storage | Amazon S3 | Cheap, durable, easy to organize by layer |
| Batch transformation | AWS Glue | Serverless Spark without managing clusters |
| Metadata | Glue Data Catalog | Works well with Athena and Glue jobs |
| Orchestration | AWS Step Functions | Clear visual workflow and retry handling |
| Query layer | Amazon Athena | Good for ad hoc SQL without managing infra |
| Lightweight event logic | AWS Lambda | Useful for validation and small glue code around triggers |
I would avoid introducing too many tools in the first version. For example, a small team usually does not need both EMR and Glue, or both Airflow and Step Functions, unless there is a very clear reason. More components usually means more maintenance.
A simple folder structure in S3
One thing that helps early is keeping the S3 structure predictable. A bucket layout like the one below works well:
```
s3://company-data-platform/
    raw/customer/ingest_date=2025-03-04/
    raw/orders/ingest_date=2025-03-04/
    bronze/customer/
    bronze/orders/
    silver/customer/
    silver/orders/
    gold/daily_sales/
```
The names do not have to be bronze, silver, and gold, but many teams understand this pattern quickly. Raw keeps the original extract. Bronze can be minimally standardized. Silver can hold cleaned, joined, and typed records. Gold can contain business-ready tables.
For a small team, I prefer to keep raw immutable. If a file lands in raw, I do not overwrite it. If something goes wrong later, it is easier to trace back what happened.
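If I want to back that up technically, turning on S3 versioning means even an accidental overwrite in raw keeps the original object around. A minimal boto3 sketch, assuming the bucket from the layout above:

```python
import boto3

s3 = boto3.client("s3")

# Keep every version of an object, so an accidental overwrite in raw/
# never silently destroys the original extract.
s3.put_bucket_versioning(
    Bucket="company-data-platform",
    VersioningConfiguration={"Status": "Enabled"},
)
```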
Sample flow using Step Functions
Let us say we receive a CSV export of orders every night. A simple state machine could do the following:
- Validate file arrival (a small Lambda sketch for this step follows the list)
- Start a Glue crawler or update partitions
- Run a Glue job to clean the data
- Run another Glue job to create an aggregated table
- Send success or failure notification
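As a rough sketch of the first step, a small Lambda can check that the expected file has actually landed before the heavier jobs run. The bucket, key, and input shape below are assumptions; in practice they would come from the state machine input or an S3 event:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def handler(event, context):
    # The state machine is assumed to pass the expected location, e.g.
    # {"bucket": "company-data-platform",
    #  "key": "raw/orders/ingest_date=2025-03-04/orders.csv"}
    bucket = event["bucket"]
    key = event["key"]

    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        # Raising lets Step Functions retry or route the execution to a failure state.
        raise RuntimeError(f"Expected file not found: s3://{bucket}/{key}")

    if head["ContentLength"] == 0:
        raise RuntimeError(f"File is empty: s3://{bucket}/{key}")

    return {"bucket": bucket, "key": key, "size": head["ContentLength"]}
```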
A simplified Step Functions definition could look like this:
```json
{
  "StartAt": "RunBronzeJob",
  "States": {
    "RunBronzeJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "orders-bronze-job"
      },
      "Next": "RunSilverJob"
    },
    "RunSilverJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "orders-silver-job"
      },
      "End": true
    }
  }
}
```
This is not complex, but that is exactly why it is good for a small team. The workflow is visible, retries can be added, and failures are easier to follow than when everything is hidden inside one long script.
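Since the example assumes a nightly orders export, something also has to start the state machine on a schedule. One simple option is an EventBridge rule. The sketch below uses boto3; the schedule, ARNs, and account ID are placeholders:

```python
import boto3

events = boto3.client("events")

# Run the pipeline every night at 02:00 UTC (placeholder schedule).
events.put_rule(
    Name="orders-nightly-pipeline",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# The role ARN must allow events.amazonaws.com to call states:StartExecution.
events.put_targets(
    Rule="orders-nightly-pipeline",
    Targets=[
        {
            "Id": "orders-state-machine",
            "Arn": "arn:aws:states:eu-west-1:111122223333:stateMachine:orders-pipeline",
            "RoleArn": "arn:aws:iam::111122223333:role/orders-events-to-sfn",
        }
    ],
)
```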
A simple Glue job example
If the raw file is CSV and we want to convert it into Parquet with proper types, a Glue PySpark job can do that.
```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_timestamp

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the raw CSV extract; the header row provides the column names.
df = spark.read.option("header", "true").csv("s3://company-data-platform/raw/orders/")

# Cast the string columns into proper types.
cleaned_df = (
    df.withColumn("order_id", col("order_id").cast("long"))
      .withColumn("customer_id", col("customer_id").cast("long"))
      .withColumn("order_ts", to_timestamp(col("order_ts")))
      .withColumn("amount", col("amount").cast("decimal(10,2)"))
)

# Write the cleaned data as Parquet into the silver layer.
cleaned_df.write.mode("overwrite").parquet("s3://company-data-platform/silver/orders/")
```
In a demo, overwrite may be fine. In production, I would be more careful. I would usually write partitioned output, keep idempotency in mind, and avoid blindly overwriting a full table unless I am sure the upstream extract is complete.
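As a hedged sketch of what that can look like, continuing the job above and assuming the run date is passed in as a job argument, dynamic partition overwrite only replaces the partitions the run actually writes, so rerunning one day does not wipe the whole table:

```python
from pyspark.sql.functions import lit

# Only overwrite the partitions this run writes, not the whole table
# (supported in Spark 2.3+ and therefore in recent Glue versions).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Normally taken from a Glue job argument; hardcoded here for brevity.
run_date = "2025-03-04"

(
    cleaned_df
    .withColumn("ingest_date", lit(run_date))
    .write.mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("s3://company-data-platform/silver/orders/")
)
```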
Querying with Athena
Once the data is cataloged, Athena gives analysts or engineers a very easy way to inspect the output. A simple query can look like this:
```sql
SELECT customer_id,
       date(order_ts) AS order_date,
       sum(amount) AS total_amount
FROM silver_orders
WHERE order_ts >= current_date - interval '7' day
GROUP BY customer_id, date(order_ts)
ORDER BY order_date DESC, total_amount DESC;
```
This is one reason I like this setup for small teams. You can land data in S3, transform it with Glue, and expose it quickly with Athena without provisioning a warehouse on day one.
IAM and permissions matter more than people expect
A lot of the pain in AWS data projects is not the transformation logic itself. It is usually permissions, bucket policies, KMS encryption, VPC networking, and service role setup. For our use case, I would create separate IAM roles for Glue, Step Functions, and Lambda rather than reusing one broad admin-style role.
This takes a bit more effort at the start, but it becomes easier to understand failures later. When everything shares one oversized role, debugging and security both get worse.
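As a hedged illustration of what a scoped role can look like, the inline policy below limits a hypothetical Glue job role to the raw and silver prefixes. It is deliberately incomplete: a real Glue role also needs CloudWatch Logs and Glue Data Catalog permissions, and the role name here is an assumption:

```python
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::company-data-platform",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::company-data-platform/raw/orders/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::company-data-platform/silver/orders/*",
        },
    ],
}

# Attach the inline policy to an existing role used only by the orders Glue jobs.
iam.put_role_policy(
    RoleName="orders-glue-job-role",
    PolicyName="orders-s3-access",
    PolicyDocument=json.dumps(policy),
)
```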
Practical limitations and caveats
This simple architecture is good, but it has limits:
- Glue job startup time can feel slow for tiny workloads. If the transformation is very small, Lambda or even a container task may be enough.
- Athena is great for ad hoc analysis, but not every BI workload will perform well on top of data lake tables unless the files are partitioned and stored properly.
- S3 gives flexibility, but without naming standards and schema checks, the lake can get messy quickly.
- Step Functions is easier than a self-managed orchestrator, but if you end up with hundreds of interdependent workflows, you may want a more feature-rich orchestration setup.
- Small teams often ignore observability in the first version. That becomes painful later when a daily job fails quietly for three days. A minimal alerting sketch follows this list.
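For the observability point, even a single EventBridge rule that forwards failed Glue job runs to an SNS topic removes most of the "failed quietly for three days" risk. A minimal sketch; the topic ARN is a placeholder and its resource policy must allow EventBridge to publish:

```python
import json
import boto3

events = boto3.client("events")

# Match Glue job runs that end in FAILED or TIMEOUT.
events.put_rule(
    Name="glue-job-failures",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
    State="ENABLED",
)

# Send matching events to an SNS topic the team actually watches.
events.put_targets(
    Rule="glue-job-failures",
    Targets=[
        {
            "Id": "notify-data-team",
            "Arn": "arn:aws:sns:eu-west-1:111122223333:data-platform-alerts",
        }
    ],
)
```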
I would also be careful with crawlers. They are convenient, but I would not depend on them for every metadata update in production. For stable pipelines, explicit schema management is often less surprising.
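Explicit can be as simple as registering the table definition once in code instead of re-crawling it every night. A hedged boto3 sketch, assuming a Glue database named analytics already exists and the silver layout from earlier:

```python
import boto3

glue = boto3.client("glue")

# Register the silver orders table once, instead of relying on a crawler
# to rediscover the schema on every run.
glue.create_table(
    DatabaseName="analytics",
    TableInput={
        "Name": "silver_orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": "ingest_date", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "customer_id", "Type": "bigint"},
                {"Name": "order_ts", "Type": "timestamp"},
                {"Name": "amount", "Type": "decimal(10,2)"},
            ],
            "Location": "s3://company-data-platform/silver/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```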
What I would change for production
For a simple demo, one AWS account, one bucket, and a handful of jobs may be enough. In production, I would usually add:
- Infrastructure as code using Terraform or CloudFormation (a small sketch follows this list)
- Separate environments like dev, test, and prod
- Better partition strategy and lifecycle policies on S3
- Data quality checks before promoting tables
- Alerts on pipeline failures and delayed data
- A documented naming convention for buckets, jobs, tables, and IAM roles
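For the infrastructure-as-code point, Terraform and CloudFormation both work well. Since the rest of the code in this article is Python, the sketch below uses the AWS CDK Python bindings, which synthesize CloudFormation, just to show the shape; the bucket and stack names are placeholders:

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataPlatformStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One versioned bucket per environment; versioning keeps raw effectively immutable.
        s3.Bucket(
            self,
            "DataLakeBucket",
            bucket_name="company-data-platform-dev",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DataPlatformStack(app, "data-platform-dev")
app.synth()
```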
If the platform grows, I might also introduce a warehouse layer such as Redshift or use Iceberg/Hudi/Delta style table formats depending on the query and update patterns. But I would not start there unless the team already knows that need is real.
Final thoughts
For a small data team, the best AWS platform is often the one that solves today’s ingestion and analytics needs without creating a second job in platform maintenance. S3, Glue, Athena, Step Functions, and a bit of Lambda can go a long way if the layout is clean and the workflows are simple. Start with a boring architecture, make it reliable, and only add more components when the real workload asks for them.
