A practical AWS data platform architecture for a small team
In this article, let us see how to put together a data platform for a small team on AWS without turning it into a huge platform engineering project. This kind of setup is useful when the team wants to ingest files or database extracts, clean them a bit, store them in layers, and make them queryable for analysts, but does not have the time or headcount to run a very complicated stack.
For our use case, the goal is not to build the perfect enterprise platform. The goal is to build something that is simple enough to maintain, cheap enough to justify, and flexible enough to grow later. We will use managed AWS services as much as possible so that the data team can focus more on pipelines than on managing servers.
What we are trying to build
A simple version of the platform can look like this:
- Source systems place files in S3 or export data on a schedule
- S3 stores raw data in a landing area
- AWS Glue jobs clean and transform the data
- Processed data is written back to S3 in curated folders
- Glue Catalog keeps table metadata
- Athena is used for ad hoc querying
- Step Functions orchestrates the pipeline
- CloudWatch captures logs and alerts
This is usually enough for a small team that needs batch ingestion and reporting. If your team is doing near real-time streaming, machine learning feature serving, or heavy warehouse workloads, then the design would change a bit. But for many teams, this simple pattern works well.
Core services and why I would pick them
Below is a practical comparison for a small team:
| Need | Simple AWS choice | Why it fits a small team |
|---|---|---|
| Raw and curated storage | Amazon S3 | Cheap, durable, easy to organize by layer |
| Batch transformation | AWS Glue | Serverless Spark without managing clusters |
| Metadata | Glue Data Catalog | Works well with Athena and Glue jobs |
| Orchestration | AWS Step Functions | Clear visual workflow and retry handling |
| Query layer | Amazon Athena | Good for ad hoc SQL without managing infra |
| Lightweight event logic | AWS Lambda | Useful for validation and small glue code around triggers |
I would avoid introducing too many tools in the first version. For example, a small team usually does not need both EMR and Glue, or both Airflow and Step Functions, unless there is a very clear reason. More components usually means more maintenance.
A simple folder structure in S3
One thing that helps early is keeping the S3 structure predictable. A bucket layout like the one below works well:
```
s3://company-data-platform/
    raw/customer/ingest_date=2025-03-04/
    raw/orders/ingest_date=2025-03-04/
    bronze/customer/
    bronze/orders/
    silver/customer/
    silver/orders/
    gold/daily_sales/
```
The names do not have to be bronze, silver, and gold, but many teams understand this pattern quickly. Raw keeps the original extract. Bronze can be minimally standardized. Silver can hold cleaned, joined, and typed records. Gold can contain business-ready tables.
For a small team, I prefer to keep raw immutable. If a file lands in raw, I do not overwrite it. If something goes wrong later, it is easier to trace back what happened.
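If I want to back that up technically, turning on S3 versioning means even an accidental overwrite in raw keeps the original object around. A minimal boto3 sketch, assuming the bucket from the layout above:

```python
import boto3

s3 = boto3.client("s3")

# Keep every version of an object, so an accidental overwrite in raw/
# never silently destroys the original extract.
s3.put_bucket_versioning(
    Bucket="company-data-platform",
    VersioningConfiguration={"Status": "Enabled"},
)
```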
Sample flow using Step Functions
Let us say we receive a CSV export of orders every night. A simple state machine could do the following:
- Validate file arrival (a small Lambda sketch for this step follows the list)
- Start a Glue crawler or update partitions
- Run a Glue job to clean the data
- Run another Glue job to create an aggregated table
- Send success or failure notification
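As a rough sketch of the first step, a small Lambda can check that the expected file has actually landed before the heavier jobs run. The bucket, key, and input shape below are assumptions; in practice they would come from the state machine input or an S3 event:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def handler(event, context):
    # The state machine is assumed to pass the expected location, e.g.
    # {"bucket": "company-data-platform",
    #  "key": "raw/orders/ingest_date=2025-03-04/orders.csv"}
    bucket = event["bucket"]
    key = event["key"]

    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        # Raising lets Step Functions retry or route the execution to a failure state.
        raise RuntimeError(f"Expected file not found: s3://{bucket}/{key}")

    if head["ContentLength"] == 0:
        raise RuntimeError(f"File is empty: s3://{bucket}/{key}")

    return {"bucket": bucket, "key": key, "size": head["ContentLength"]}
```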
A simplified Step Functions definition could look like this:
```json
{
  "StartAt": "RunBronzeJob",
  "States": {
    "RunBronzeJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "orders-bronze-job"
      },
      "Next": "RunSilverJob"
    },
    "RunSilverJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "orders-silver-job"
      },
      "End": true
    }
  }
}
```
This is not complex, but that is exactly why it is good for a small team. The workflow is visible, retries can be added, and failures are easier to follow than when everything is hidden inside one long script.
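Since the example assumes a nightly orders export, something also has to start the state machine on a schedule. One simple option is an EventBridge rule. The sketch below uses boto3; the schedule, ARNs, and account ID are placeholders:

```python
import boto3

events = boto3.client("events")

# Run the pipeline every night at 02:00 UTC (placeholder schedule).
events.put_rule(
    Name="orders-nightly-pipeline",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# The role ARN must allow events.amazonaws.com to call states:StartExecution.
events.put_targets(
    Rule="orders-nightly-pipeline",
    Targets=[
        {
            "Id": "orders-state-machine",
            "Arn": "arn:aws:states:eu-west-1:111122223333:stateMachine:orders-pipeline",
            "RoleArn": "arn:aws:iam::111122223333:role/orders-events-to-sfn",
        }
    ],
)
```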
A simple Glue job example
If the raw file is CSV and we want to convert it into Parquet with proper types, a Glue PySpark job can do that.
```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_timestamp

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the raw CSV extract; the header row provides the column names.
df = spark.read.option("header", "true").csv("s3://company-data-platform/raw/orders/")

# Cast the string columns into proper types.
cleaned_df = (
    df.withColumn("order_id", col("order_id").cast("long"))
      .withColumn("customer_id", col("customer_id").cast("long"))
      .withColumn("order_ts", to_timestamp(col("order_ts")))
      .withColumn("amount", col("amount").cast("decimal(10,2)"))
)

# Write the cleaned data as Parquet into the silver layer.
cleaned_df.write.mode("overwrite").parquet("s3://company-data-platform/silver/orders/")
```
In a demo, overwrite may be fine. In production, I would be more careful. I would usually write partitioned output, keep idempotency in mind, and avoid blindly overwriting a full table unless I am sure the upstream extract is complete.
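As a hedged sketch of what that can look like, continuing the job above and assuming the run date is passed in as a job argument, dynamic partition overwrite only replaces the partitions the run actually writes, so rerunning one day does not wipe the whole table:

```python
from pyspark.sql.functions import lit

# Only overwrite the partitions this run writes, not the whole table
# (supported in Spark 2.3+ and therefore in recent Glue versions).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Normally taken from a Glue job argument; hardcoded here for brevity.
run_date = "2025-03-04"

(
    cleaned_df
    .withColumn("ingest_date", lit(run_date))
    .write.mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("s3://company-data-platform/silver/orders/")
)
```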
Querying with Athena
Once the data is cataloged, Athena gives analysts or engineers a very easy way to inspect the output. A simple query can look like this:
```sql
SELECT customer_id,
       date(order_ts) AS order_date,
       sum(amount) AS total_amount
FROM silver_orders
WHERE order_ts >= current_date - interval '7' day
GROUP BY customer_id, date(order_ts)
ORDER BY order_date DESC, total_amount DESC;
```
This is one reason I like this setup for small teams. You can land data in S3, transform it with Glue, and expose it quickly with Athena without provisioning a warehouse on day one.
IAM and permissions matter more than people expect
A lot of the pain in AWS data projects is not the transformation logic itself. It is usually permissions, bucket policies, KMS encryption, VPC networking, and service role setup. For our use case, I would create separate IAM roles for Glue, Step Functions, and Lambda rather than reusing one broad admin-style role.
This takes a bit more effort at the start, but it becomes easier to understand failures later. When everything shares one oversized role, debugging and security both get worse.
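As a hedged illustration of what a scoped role can look like, the inline policy below limits a hypothetical Glue job role to the raw and silver prefixes. It is deliberately incomplete: a real Glue role also needs CloudWatch Logs and Glue Data Catalog permissions, and the role name here is an assumption:

```python
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::company-data-platform",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::company-data-platform/raw/orders/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::company-data-platform/silver/orders/*",
        },
    ],
}

# Attach the inline policy to an existing role used only by the orders Glue jobs.
iam.put_role_policy(
    RoleName="orders-glue-job-role",
    PolicyName="orders-s3-access",
    PolicyDocument=json.dumps(policy),
)
```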
Practical limitations and caveats
This simple architecture is good, but it has limits:
- Glue job startup time can feel slow for tiny workloads. If the transformation is very small, Lambda or even a container task may be enough.
- Athena is great for ad hoc analysis, but not every BI workload will perform well on top of data lake tables unless the files are partitioned and stored properly.
- S3 gives flexibility, but without naming standards and schema checks, the lake can get messy quickly.
- Step Functions is easier than a self-managed orchestrator, but if you end up with hundreds of interdependent workflows, you may want a more feature-rich orchestration setup.
- Small teams often ignore observability in the first version. That becomes painful later when a daily job fails quietly for three days. A minimal alerting sketch follows this list.
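For the observability point, even a single EventBridge rule that forwards failed Glue job runs to an SNS topic removes most of the "failed quietly for three days" risk. A minimal sketch; the topic ARN is a placeholder and its resource policy must allow EventBridge to publish:

```python
import json
import boto3

events = boto3.client("events")

# Match Glue job runs that end in FAILED or TIMEOUT.
events.put_rule(
    Name="glue-job-failures",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
    State="ENABLED",
)

# Send matching events to an SNS topic the team actually watches.
events.put_targets(
    Rule="glue-job-failures",
    Targets=[
        {
            "Id": "notify-data-team",
            "Arn": "arn:aws:sns:eu-west-1:111122223333:data-platform-alerts",
        }
    ],
)
```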
I would also be careful with crawlers. They are convenient, but I would not depend on them for every metadata update in production. For stable pipelines, explicit schema management is often less surprising.
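Explicit can be as simple as registering the table definition once in code instead of re-crawling it every night. A hedged boto3 sketch, assuming a Glue database named analytics already exists and the silver layout from earlier:

```python
import boto3

glue = boto3.client("glue")

# Register the silver orders table once, instead of relying on a crawler
# to rediscover the schema on every run.
glue.create_table(
    DatabaseName="analytics",
    TableInput={
        "Name": "silver_orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": "ingest_date", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "customer_id", "Type": "bigint"},
                {"Name": "order_ts", "Type": "timestamp"},
                {"Name": "amount", "Type": "decimal(10,2)"},
            ],
            "Location": "s3://company-data-platform/silver/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```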
What I would change for production
For a simple demo, one AWS account, one bucket, and a handful of jobs may be enough. In production, I would usually add:
- Infrastructure as code using Terraform or CloudFormation (a small sketch follows this list)
- Separate environments like dev, test, and prod
- Better partition strategy and lifecycle policies on S3
- Data quality checks before promoting tables
- Alerts on pipeline failures and delayed data
- A documented naming convention for buckets, jobs, tables, and IAM roles
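For the infrastructure-as-code point, Terraform and CloudFormation both work well. Since the rest of the code in this article is Python, the sketch below uses the AWS CDK Python bindings, which synthesize CloudFormation, just to show the shape; the bucket and stack names are placeholders:

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataPlatformStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One versioned bucket per environment; versioning keeps raw effectively immutable.
        s3.Bucket(
            self,
            "DataLakeBucket",
            bucket_name="company-data-platform-dev",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DataPlatformStack(app, "data-platform-dev")
app.synth()
```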
If the platform grows, I might also introduce a warehouse layer such as Redshift or use Iceberg/Hudi/Delta style table formats depending on the query and update patterns. But I would not start there unless the team already knows that need is real.
Final thoughts
For a small data team, the best AWS platform is often the one that solves today’s ingestion and analytics needs without creating a second job in platform maintenance. S3, Glue, Athena, Step Functions, and a bit of Lambda can go a long way if the layout is clean and the workflows are simple. Start with a boring architecture, make it reliable, and only add more components when the real workload asks for them.
