
AWS CDK basics for data platform teams

In this article, let us see how to get started with AWS CDK for data platform work and why you might want to use it instead of creating resources manually in the AWS Console. If your team builds ingestion jobs, S3 buckets, Lambda functions, Step Functions, or Glue jobs, you quickly realize that clicking around in the console is fine for experimenting but not a good way to maintain environments over time. CDK gives us a way to define those resources in code and deploy them in a repeatable manner.

For data platform teams, this is useful because the same engineers who build the pipelines usually also need to provision the supporting infrastructure. In many teams, the line between data engineering and platform work is already blurred. So even if you are not a full-time cloud engineer, knowing the basics of CDK is helpful.

What is AWS CDK

AWS CDK stands for Cloud Development Kit. It lets us define AWS infrastructure using programming languages like TypeScript, Python, Java, and C#. Under the hood, CDK still generates CloudFormation templates, but we write normal code instead of large YAML or JSON files.

For example, instead of writing a long CloudFormation template to create an S3 bucket and a Lambda function, we can define them in Python or TypeScript and deploy using the CDK CLI.

That makes it easier when our infrastructure has reusable patterns, loops, conditions, or naming conventions shared across many pipelines.
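As a small illustration of that point, the sketch below assumes a hypothetical `<env>-<dataset>-raw` naming scheme (not from any AWS convention) and builds the names in a plain Python loop; in a real stack, each generated name would feed an `s3.Bucket` call inside the same loop.

```python
# Sketch of a shared naming convention, assuming a hypothetical
# "<env>-<dataset>-raw" scheme.
def raw_bucket_name(env: str, dataset: str) -> str:
    return f"{env}-{dataset}-raw"

# In a CDK stack, each name would be passed to s3.Bucket(...) in this loop.
datasets = ["orders", "customers", "payments"]
bucket_names = [raw_bucket_name("dev", d) for d in datasets]
print(bucket_names)
```

This kind of loop is exactly what gets verbose in raw CloudFormation but stays trivial in application code.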

When does CDK fit well for data teams

From my experience, CDK fits well when your team is already comfortable writing application code and wants infrastructure to look similar. It is also good when you have many small AWS services working together for a pipeline and want to package them into one stack.

A simple comparison is below.

| Tool | Good for | Watch out for |
| --- | --- | --- |
| AWS Console | Quick testing and learning | Hard to track, easy to drift |
| CloudFormation | Native AWS IaC | Templates get verbose quickly |
| AWS CDK | Reusable AWS-centric infrastructure in code | Still limited by CloudFormation behavior |
| Terraform | Multi-cloud and broader provider support | Another language/tooling layer to learn |

If your company has already standardized on Terraform for everything, it may be better to stay with that. But if your work is heavily AWS-focused and your team likes writing Python or TypeScript, CDK is very approachable.

Install the prerequisites

For this example, let us use CDK with Python. We need Node.js for the CDK CLI and Python for the application code.

```shell
npm install -g aws-cdk
python3 -m venv .venv
source .venv/bin/activate
pip install aws-cdk-lib constructs
```

We also need AWS credentials configured locally. The simplest path for a demo is to configure the AWS CLI.

```shell
aws configure
```

If you are working in a real project, it is better to use an IAM role or a short-lived SSO-based login instead of long-lived access keys.

Create a new CDK project

Now let us initialize a project.

```shell
mkdir cdk-data-platform-demo
cd cdk-data-platform-demo
cdk init app --language python
source .venv/bin/activate
pip install -r requirements.txt
```

This creates the basic folder structure for the app. We can then define our first stack. For a data platform example, let us create one S3 bucket to hold raw files and one Lambda function that gets triggered when a file lands in the bucket.

Define a simple stack

Open the stack file and update it like below.

```python
from aws_cdk import (
    Stack,
    RemovalPolicy,
    aws_s3 as s3,
    aws_lambda as _lambda,
    aws_s3_notifications as s3n
)
from constructs import Construct

class CdkDataPlatformDemoStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        raw_bucket = s3.Bucket(
            self,
            "RawLandingBucket",
            versioned=True,
            removal_policy=RemovalPolicy.DESTROY,
            auto_delete_objects=True
        )

        ingest_fn = _lambda.Function(
            self,
            "IngestHandler",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="handler.main",
            code=_lambda.Code.from_asset("lambda")
        )

        raw_bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(ingest_fn)
        )

        raw_bucket.grant_read(ingest_fn)
```

Then create the Lambda handler inside a lambda/handler.py file.

```python
def main(event, context):
    print("Received event:", event)
    return {"status": "ok"}
```

This example is intentionally simple, but it shows the pattern. We define the infrastructure in one place and keep the Lambda code inside the same repository.
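If you want the handler to do slightly more than log, a sketch that walks the standard S3 notification event shape might look like the following; the field names follow the documented S3 event structure, and the return shape is just an illustrative choice.

```python
def main(event, context):
    # An S3 notification event carries a list of Records; each record
    # identifies the bucket and object key that triggered the function.
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")
        processed.append(key)
    return {"status": "ok", "processed": processed}
```

From here, the real ingestion logic (parsing, validation, writing to a processed zone) would hang off each bucket/key pair.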

Bootstrap and deploy

Before the first deployment in an AWS account and region, CDK needs a bootstrap step.

```shell
cdk bootstrap
cdk synth
cdk deploy
```

cdk synth is useful because it shows the generated CloudFormation template. I always like checking this once, especially when trying a new construct, because CDK can hide some details and it is still CloudFormation underneath.

After deployment, upload a sample file to the bucket and check CloudWatch logs for the Lambda invocation.

```shell
aws s3 cp sample.csv s3://<your-bucket-name>/incoming/sample.csv
aws logs tail /aws/lambda/<your-lambda-name> --follow
```

This is a small thing, but for data teams it is very useful to test the whole flow end to end instead of just trusting that the deploy succeeded.

Useful patterns for a real data platform

Once the basics work, you can extend the stack to include more services commonly used in a data platform:

  • S3 buckets for raw, processed, and curated layers
  • Lambda or ECS tasks for lightweight ingestion
  • Glue jobs and crawlers
  • EventBridge schedules
  • Step Functions for orchestration
  • IAM roles and policies for service access
  • SNS notifications for failures

One thing I like in CDK is that we can build reusable constructs. For example, if every ingestion pipeline needs one bucket, one dead-letter queue, one Lambda, and one CloudWatch alarm, we can package that as a reusable component and instantiate it multiple times.

That is where CDK starts feeling more natural than copying CloudFormation templates around.

Caveats and limitations

CDK is not magic, and there are a few things to be careful about.

  1. CDK still deploys through CloudFormation. If CloudFormation has a limitation, CDK also has that limitation.
  2. Logical ID changes can cause unwanted resource replacement if you refactor carelessly.
  3. Some defaults are convenient for demos but unsafe for production, like RemovalPolicy.DESTROY.
  4. Team members need both application language knowledge and AWS infrastructure understanding.
  5. Reviewing diffs is not always as straightforward as plain Terraform plans unless your team is disciplined about cdk diff.

For example, in the sample stack above, auto_delete_objects=True is useful for cleaning up a demo stack, but I would be very careful with that in production because it makes deletion much easier than you might want.

What I would change for production

For a simple demo, local credentials and a single stack are enough. In production, I would change a few things:

  • Use separate AWS accounts or at least separate environments for dev, test, and prod
  • Use CI/CD to run cdk synth, cdk diff, tests, and deployments
  • Use IAM roles or AWS SSO instead of static access keys
  • Add bucket encryption, lifecycle rules, and tighter IAM policies
  • Add monitoring, alarms, and dead-letter handling
  • Split infrastructure into logical stacks so one small change does not redeploy everything

For data workloads, I would also think early about naming conventions, retention settings, and how to avoid one pipeline getting broad access to all buckets or databases.

A quick example of where SQL still fits

Even though CDK is infrastructure code, the deployed services often support SQL-based workloads. For example, if your pipeline later loads files into Athena-backed tables, the infrastructure might create the S3 bucket and crawler, while your transformation step still runs SQL like this:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS raw_orders (
  order_id string,
  customer_id string,
  order_total decimal(10,2),
  created_at timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-raw-zone/orders/';
```

(Note that Athena requires the EXTERNAL keyword when a table points at an S3 location, and a row format matching the files, here assumed to be CSV.)

This is one reason CDK is useful for data teams. It does not replace the data logic itself, but it gives a structured way to create the cloud resources around that logic.

Conclusion

AWS CDK is a good starting point if your data platform team works mainly on AWS and prefers defining infrastructure in regular code instead of large templates. For simple demos, it helps you move fast. For larger projects, it gives you a reusable way to standardize pipeline infrastructure. Just keep in mind that CloudFormation behavior still matters, and production setups need more guardrails than the quick examples you might start with locally.

This post is licensed under CC BY 4.0 by the author.