AWS CDK for Data Platform Teams: A Practical Guide

Posted May 20, 2025

By Ashok KS

In a previous article we looked at Terraform for managing data infrastructure on GCP. Terraform is great, but if your team works mostly in AWS and writes Python or TypeScript all day, you might find the AWS CDK fits more naturally into your workflow.

This article covers what the AWS CDK is, how to set it up, and how to build a realistic data platform stack — an S3 data lake bucket, a Glue database, and the IAM roles that tie them together. We will also look at where CDK shines vs where Terraform still wins, and what you need to think about before taking this to production.

Why CDK for Data Platforms?

If you have built data pipelines on AWS, you have probably clicked around the console to create S3 buckets, Glue crawlers, IAM roles, and Step Functions — or you have written a lot of CloudFormation YAML.

CloudFormation works, but the templates get long fast. A simple data pipeline stack can easily cross a thousand lines of YAML, and it is hard to reuse logic across stacks. You end up copy-pasting boilerplate and hoping you did not miss a property.

The CDK solves this by letting you define infrastructure in a general-purpose programming language: TypeScript, JavaScript, Python, Java, C#, or Go. That means you get loops, conditionals, abstractions and autocomplete while writing infrastructure code. For data platform teams that already write Python or TypeScript for their pipelines, this is a natural fit.
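
As a rough illustration of what loops and conditionals buy you, the sketch below derives per-zone bucket settings in plain TypeScript. The zone names (raw, curated, published), the environment names and the bucketSettings helper are all hypothetical and not part of the CDK; inside a stack you would pass each result into a construct such as s3.Bucket.

```typescript
// Hypothetical helper: plain TypeScript, no CDK imports.
type Zone = 'raw' | 'curated' | 'published';
type Env = 'dev' | 'prod';

interface BucketSettings {
  logicalId: string;        // construct ID you would hand to the CDK
  versioned: boolean;
  expireAfterDays?: number; // only set where the data is disposable
}

function bucketSettings(zone: Zone, env: Env): BucketSettings {
  const settings: BucketSettings = {
    logicalId: `${zone.charAt(0).toUpperCase()}${zone.slice(1)}Bucket`,
    // Conditional: keep object versions only in production.
    versioned: env === 'prod',
  };
  // Conditional: dev copies of raw data can be thrown away after a week.
  if (env === 'dev' && zone === 'raw') {
    settings.expireAfterDays = 7;
  }
  return settings;
}

// Loop: one settings object per zone instead of three copy-pasted blocks.
const zones: Zone[] = ['raw', 'curated', 'published'];
const devSettings = zones.map((zone) => bucketSettings(zone, 'dev'));
const prodSettings = zones.map((zone) => bucketSettings(zone, 'prod'));
```

The same pattern in CloudFormation YAML would mean repeating the bucket definition once per zone and per environment.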

Setting Up

1. Install the CDK CLI

The easiest way is via npm:

npm install -g aws-cdk

Verify it installed:

cdk --version

2. Bootstrap Your AWS Account

The CDK needs a small CloudFormation stack in your account to manage assets like Lambda code or Docker images. This is a one-time step per account/region combination:

cdk bootstrap aws://ACCOUNT-NUMBER/ap-southeast-2

This creates an S3 bucket, an ECR repository, and the IAM roles the CDK uses during deployment.

3. Create a new CDK project

I will use TypeScript here since it has the most mature CDK support, but Python works just as well:

mkdir my-data-platform && cd my-data-platform
cdk init app --language typescript

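This gives you a bin/ folder with the entry point and a lib/ folder for your stacks. The generated stack file is named after the project folder; the example in the next section uses lib/data-platform-stack.ts instead.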

Building a Data Platform Stack

Let us build a stack that provisions the foundational pieces of a data platform:

An S3 bucket for the data lake (with lifecycle rules)
A Glue database for the data catalog
An IAM role that Glue jobs can assume
A Step Functions state machine placeholder (not included in the code below)

Here is what the stack looks like in lib/data-platform-stack.ts:

  
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class DataPlatformStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Data lake bucket with lifecycle
    const dataLakeBucket = new s3.Bucket(this, 'DataLakeBucket', {
      bucketName: `data-lake-${this.account}-${this.region}`,
      versioned: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      lifecycleRules: [
        {
          id: 'TransitionToIA',
          transitions: [
            {
              storageClass: s3.StorageClass.INFREQUENT_ACCESS,
              transitionAfter: cdk.Duration.days(30),
            },
            {
              storageClass: s3.StorageClass.GLACIER,
              transitionAfter: cdk.Duration.days(90),
            },
          ],
        },
      ],
    });

    // Glue database
    const glueDb = new glue.CfnDatabase(this, 'DataCatalogDb', {
      catalogId: this.account,
      databaseInput: {
        name: 'data_platform_catalog',
        description: 'Central catalog for data platform tables',
      },
    });

    // IAM role for Glue jobs
    const glueRole = new iam.Role(this, 'GlueJobRole', {
      assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
      ],
    });

    // Grant Glue read/write to the data lake bucket
    dataLakeBucket.grantReadWrite(glueRole);

    // Outputs
    new cdk.CfnOutput(this, 'BucketName', {
      value: dataLakeBucket.bucketName,
    });
    new cdk.CfnOutput(this, 'GlueDbName', {
      value: glueDb.ref,
    });
  }
}
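
Because cdk init generates its own placeholder stack, the entry point has to be pointed at DataPlatformStack before deploying. A minimal sketch of that entry point, assuming the file is bin/my-data-platform.ts (the generated name follows the project folder) and using 'DataPlatformStack' as an illustrative stack ID:

```typescript
#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import { DataPlatformStack } from '../lib/data-platform-stack';

const app = new cdk.App();

// The stack ID 'DataPlatformStack' is an illustrative choice.
new DataPlatformStack(app, 'DataPlatformStack', {
  // Take account and region from the credentials/profile the CLI is using.
  env: {
    account: process.env.CDK_DEFAULT_ACCOUNT,
    region: process.env.CDK_DEFAULT_REGION,
  },
});
```

Setting env is optional for this stack, but it pins the deployment to a known account and region.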

Deploy it:

cdk deploy

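The CDK first synthesizes your code into a CloudFormation template. Because this stack creates IAM resources, cdk deploy then lists the security-related changes and asks for confirmation; once you approve, CloudFormation creates the resources. To preview every planned change before deploying, run cdk diff.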

What I like about this approach is that dataLakeBucket.grantReadWrite(glueRole) is one line of readable code. In raw CloudFormation, the equivalent is maybe 30 lines of IAM policy JSON.
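
The grant methods also take an optional object key pattern, which helps when different roles should see different parts of the lake. A minimal sketch, assuming a hypothetical read-only analyst role and a curated/ prefix, neither of which is part of the stack above:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Hypothetical stack name, used only for this example.
export class LakeAccessStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const bucket = new s3.Bucket(this, 'DataLakeBucket');

    // Hypothetical role; trusting the account root keeps the example short.
    const analystRole = new iam.Role(this, 'AnalystRole', {
      assumedBy: new iam.AccountRootPrincipal(),
    });

    // Object reads are limited to keys under curated/.
    // Bucket-level list permissions still apply to the whole bucket.
    bucket.grantRead(analystRole, 'curated/*');
  }
}
```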

CDK vs Terraform for Data Platforms

Since we covered Terraform earlier, here is how I think about picking one:

	AWS CDK	Terraform
Language	TypeScript, JavaScript, Python, Java, C#, Go	HCL (custom DSL)
Cloud coverage	AWS only (with limited Kubernetes support)	All major clouds + many SaaS providers
State management	Handled by CloudFormation	Self-managed (S3/DynamoDB backend)
Ecosystem maturity	Growing, but smaller community	Very large module registry and community
Best for	AWS-native teams that write code daily	Multi-cloud teams or orgs with existing Terraform
Learning curve	Low if you already know a supported language	Medium — HCL is not hard but is its own thing

For a team that is 100% on AWS and already writes Python or TypeScript for data pipelines, CDK is often the right call. If you have workloads across GCP and AWS, or your platform team already uses Terraform, stick with Terraform.

Practical Limitations

1. It is still CloudFormation under the hood

The CDK synthesizes your code into CloudFormation templates and then CloudFormation deploys them. That means CloudFormation’s limits are your limits. Stacks cannot exceed 500 resources, changes made outside the CDK still show up as drift that you have to detect and reconcile, and if a stack gets stuck in UPDATE_ROLLBACK_FAILED, you have the same pain as with raw CloudFormation.
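
One way to keep an eye on the resource limit is to count the resources in the synthesized template. A small sketch in plain TypeScript; the countResources and nearLimit helpers and the 80% warning threshold are assumptions for illustration, and in practice the template object would come from parsing a file under cdk.out/ after cdk synth:

```typescript
// The part of a CloudFormation template this check cares about.
interface CfnTemplate {
  Resources?: Record<string, { Type: string }>;
}

const STACK_RESOURCE_LIMIT = 500; // per-stack limit quoted above

// Count the resources declared in a synthesized template.
function countResources(template: CfnTemplate): number {
  return Object.keys(template.Resources ?? {}).length;
}

// True once the stack has used the given fraction of the limit.
function nearLimit(template: CfnTemplate, threshold = 0.8): boolean {
  return countResources(template) >= STACK_RESOURCE_LIMIT * threshold;
}

// Stand-in for a template parsed from cdk.out/.
const sample: CfnTemplate = {
  Resources: {
    DataLakeBucket: { Type: 'AWS::S3::Bucket' },
    DataCatalogDb: { Type: 'AWS::Glue::Database' },
    GlueJobRole: { Type: 'AWS::IAM::Role' },
  },
};

console.log(countResources(sample), nearLimit(sample));
```

A check like this could run in CI next to cdk synth, so a stack that is growing too large gets split before a deployment fails.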

2. Construct library coverage is not complete

Not every AWS service has L2 constructs (the nicer high-level ones). For some services you will use L1 constructs — auto-generated from the CloudFormation spec — which look a lot more like raw CloudFormation JSON. AWS Glue is a good example: CfnDatabase above is an L1 construct. You lose some of the fluent API when you drop to L1.
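
To make the difference concrete, here is a sketch of wiring a Glue crawler with L1 constructs. The crawler, the raw/ prefix and the CrawlerStack name are hypothetical additions; the point is only that L1 properties take raw strings such as ARNs, names and S3 paths, not construct objects.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Hypothetical stack name, used only for this example.
export class CrawlerStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // L2 constructs: objects with helper methods such as grantRead().
    const bucket = new s3.Bucket(this, 'DataLakeBucket');
    const role = new iam.Role(this, 'GlueCrawlerRole', {
      assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
      ],
    });
    bucket.grantRead(role);

    const db = new glue.CfnDatabase(this, 'DataCatalogDb', {
      catalogId: this.account,
      databaseInput: { name: 'data_platform_catalog' },
    });

    // L1 construct: properties mirror the CloudFormation resource one to one.
    new glue.CfnCrawler(this, 'RawCrawler', {
      role: role.roleArn,
      databaseName: db.ref,
      targets: {
        s3Targets: [{ path: `s3://${bucket.bucketName}/raw/` }],
      },
    });
  }
}
```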

3. Multi-region and multi-account is more work

Every account/region combination you deploy to has to be bootstrapped with cdk bootstrap. If you have a multi-account setup with a data lake in one account and consumers in another, you also need to set up cross-account deployment trust (cdk bootstrap has a --trust option for this) and the cross-account IAM for the data itself. Terraform’s provider model handles this more naturally in my experience.
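
For the data-access side, a cross-account grant can look like the sketch below. The account IDs are placeholders, and the curated/ prefix and SharedLakeStack name are hypothetical.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Placeholder account IDs; replace with your own.
const LAKE_ACCOUNT = '111111111111';
const CONSUMER_ACCOUNT = '222222222222';

class SharedLakeStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const bucket = new s3.Bucket(this, 'DataLakeBucket');

    // The principal is in another account, so this grant ends up in the
    // bucket policy, not in an identity policy.
    bucket.grantRead(new iam.AccountPrincipal(CONSUMER_ACCOUNT), 'curated/*');
  }
}

const app = new cdk.App();
new SharedLakeStack(app, 'SharedLakeStack', {
  // Explicit env: this stack always deploys to the lake account.
  env: { account: LAKE_ACCOUNT, region: 'ap-southeast-2' },
});
```

Roles in the consumer account still need their own IAM permissions for the bucket, and a bucket encrypted with a customer-managed KMS key needs a matching grant on the key.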

4. State lives in CloudFormation

You cannot easily inspect or modify the state the way you can with a Terraform state file. If something goes wrong, you fix it through the CloudFormation console, not a state file edit. Most of the time this is fine. When it is not fine, it is really not fine.

What Changes in Production

The stack above is a demo. For production, here is what I would add:

Separate stacks per concern: One stack for the S3 buckets, one for Glue, one for IAM. This limits blast radius — if a Glue stack update fails, your buckets are not affected.
KMS encryption with customer-managed keys instead of S3-managed encryption. Data lakes often contain sensitive data and you want control over key rotation.
Bucket policies restricting access to specific VPC endpoints or IAM principals, not just IAM role grants.
Enable server access logging on the S3 bucket. When someone accidentally deletes a prefix, you want a trail.
Pin CDK construct library versions in package.json instead of using ^ ranges. Breaking changes between CDK v2 minor versions are rare but they happen.
CI/CD for your CDK app: Run cdk diff on every PR so the reviewer can see exactly what infrastructure changes are proposed before merging.
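
A few of these points can be sketched as a dedicated storage stack. This is an illustration under assumptions, not a complete production setup: the stack name, the log prefix and the choice to expose the bucket as a public property for other stacks are mine, and VPC endpoint restrictions are left out.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as kms from 'aws-cdk-lib/aws-kms';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Hypothetical storage-only stack; Glue and IAM would live in their own stacks.
export class DataLakeStorageStack extends cdk.Stack {
  // Exposed so other stacks in the same app can reference the bucket.
  public readonly dataLakeBucket: s3.Bucket;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Customer-managed key with automatic rotation enabled.
    const key = new kms.Key(this, 'DataLakeKey', {
      enableKeyRotation: true,
    });

    // Separate bucket that receives the server access logs.
    // Kept on S3-managed encryption, which log delivery supports.
    const accessLogs = new s3.Bucket(this, 'AccessLogsBucket', {
      encryption: s3.BucketEncryption.S3_MANAGED,
    });

    this.dataLakeBucket = new s3.Bucket(this, 'DataLakeBucket', {
      versioned: true,
      encryption: s3.BucketEncryption.KMS,
      encryptionKey: key,
      serverAccessLogsBucket: accessLogs,
      serverAccessLogsPrefix: 'data-lake/', // illustrative prefix
    });
  }
}
```

A Glue or IAM stack can then take dataLakeBucket as a prop, which keeps the storage resources out of the blast radius of those deployments.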

Wrapping Up

The AWS CDK is a solid choice for data platform teams that are deep in the AWS ecosystem and want infrastructure code that reads like the application code they already write. It reduces boilerplate, makes reuse easier, and fits into the same CI/CD pipeline your data pipelines already use.

That said, it is not a replacement for Terraform if you are multi-cloud or have an existing Terraform codebase. The right tool depends on where you sit.

If you are starting fresh on AWS and your team writes Python or TypeScript, give the CDK a try. The bootstrap-and-deploy loop we walked through here should get you going in under an hour.

Data Engineering, AWS, Infrastructure as Code

This post is licensed under CC BY 4.0 by the author.