AWS CDK for Data Platform Teams: A Practical Guide
In a previous article we looked at Terraform for managing data infrastructure on GCP. Terraform is great, but if your team works mostly in AWS and writes Python or TypeScript all day, you might find the AWS CDK fits more naturally into your workflow.
This article covers what the AWS CDK is, how to set it up, and how to build a realistic data platform stack — an S3 data lake bucket, a Glue database, and the IAM roles that tie them together. We will also look at where CDK shines vs where Terraform still wins, and what you need to think about before taking this to production.
Why CDK for Data Platforms?
If you have built data pipelines on AWS, you have probably clicked around the console to create S3 buckets, Glue crawlers, IAM roles, and Step Functions — or you have written a lot of CloudFormation YAML.
CloudFormation works, but the templates get long fast. A simple data pipeline stack can easily cross a thousand lines of YAML, and it is hard to reuse logic across stacks. You end up copy-pasting boilerplate and hoping you did not miss a property.
The CDK solves this by letting you define infrastructure in a general-purpose programming language: TypeScript, JavaScript, Python, Java, C#, or Go. That means you get loops, conditionals, abstractions, and autocomplete while writing infrastructure code. For data platform teams that already write Python or TypeScript for their pipelines, this is a natural fit.
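To make the "loops and conditionals" point concrete, here is a minimal sketch that creates one bucket per data lake zone. The zone names and the `LakeZonesStack` class are hypothetical and only for illustration; the idea is simply that a `for` loop replaces three near-identical YAML resource blocks.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Hypothetical zone names, chosen for illustration only.
const ZONES = ['raw', 'staged', 'curated'] as const;

export class LakeZonesStack extends cdk.Stack {
  // Buckets keyed by zone name so other code can look them up.
  public readonly buckets: Record<string, s3.Bucket> = {};

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Loop: one bucket per zone instead of three copy-pasted definitions.
    for (const zone of ZONES) {
      this.buckets[zone] = new s3.Bucket(this, `${zone}Bucket`, {
        // Conditional: in this sketch only the raw zone keeps object versions.
        versioned: zone === 'raw',
        encryption: s3.BucketEncryption.S3_MANAGED,
      });
    }
  }
}

const app = new cdk.App();
const lakeZones = new LakeZonesStack(app, 'LakeZonesStack');
```

Adding a fourth zone is then a one-word change to the array rather than another block of template.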
Setting Up
1. Install the CDK CLI
The easiest way is via npm:
npm install -g aws-cdk
Verify it installed:
cdk --version
2. Bootstrap your AWS account
The CDK needs a small CloudFormation stack in your account to manage assets like Lambda code or Docker images. This is a one-time step per account/region combination:
cdk bootstrap aws://ACCOUNT-NUMBER/ap-southeast-2
This creates an S3 bucket and the IAM roles that the CDK uses during deployment.
3. Create a new CDK project
I will use TypeScript here since it has the most mature CDK support, but Python works just as well:
mkdir my-data-platform && cd my-data-platform
cdk init app --language typescript
This gives you a bin/ folder with the entry point and a lib/ folder for your stacks.
Building a Data Platform Stack
Let us build a stack that provisions the foundational pieces of a data platform:
- An S3 bucket for the data lake (with lifecycle rules)
- A Glue database for the data catalog
- An IAM role that Glue jobs can assume
- Stack outputs for the bucket name and the Glue database name
Here is what the stack looks like in lib/data-platform-stack.ts:
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as glue from 'aws-cdk-lib/aws-glue';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class DataPlatformStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Data lake bucket with lifecycle
    const dataLakeBucket = new s3.Bucket(this, 'DataLakeBucket', {
      bucketName: `data-lake-${this.account}-${this.region}`,
      versioned: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
      lifecycleRules: [
        {
          id: 'TransitionToIA',
          transitions: [
            {
              storageClass: s3.StorageClass.INFREQUENT_ACCESS,
              transitionAfter: cdk.Duration.days(30),
            },
            {
              storageClass: s3.StorageClass.GLACIER,
              transitionAfter: cdk.Duration.days(90),
            },
          ],
        },
      ],
    });

    // Glue database
    const glueDb = new glue.CfnDatabase(this, 'DataCatalogDb', {
      catalogId: this.account,
      databaseInput: {
        name: 'data_platform_catalog',
        description: 'Central catalog for data platform tables',
      },
    });

    // IAM role for Glue jobs
    const glueRole = new iam.Role(this, 'GlueJobRole', {
      assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'),
      ],
    });

    // Grant Glue read/write to the data lake bucket
    dataLakeBucket.grantReadWrite(glueRole);

    // Outputs
    new cdk.CfnOutput(this, 'BucketName', {
      value: dataLakeBucket.bucketName,
    });
    new cdk.CfnOutput(this, 'GlueDbName', {
      value: glueDb.ref,
    });
  }
}
Deploy it:
cdk deploy
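The CDK synthesizes the CloudFormation template behind the scenes and, if the stack contains security-sensitive changes such as new IAM policies, shows them and asks for confirmation before deploying. To review the full set of planned changes first, run cdk diff.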
What I like about this approach is that dataLakeBucket.grantReadWrite(glueRole) is one line of readable code. In raw CloudFormation, the equivalent is maybe 30 lines of IAM policy JSON.
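If you want to see what that one line expands to, one option is to synthesize a small stack in memory and print the IAM policy the CDK generated. This is a sketch, not part of the stack above: the stack and construct names are made up for the demo, and it uses the `Template` helper from `aws-cdk-lib/assertions`.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Template } from 'aws-cdk-lib/assertions';

// Throwaway app and stack, used only to inspect the synthesized output.
const app = new cdk.App();
const stack = new cdk.Stack(app, 'GrantDemoStack');

const bucket = new s3.Bucket(stack, 'DemoBucket');
const role = new iam.Role(stack, 'DemoRole', {
  assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
});

// The single line under discussion.
bucket.grantReadWrite(role);

// Synthesize in memory and pull out the generated IAM policy resources.
const template = Template.fromStack(stack);
const policies = template.findResources('AWS::IAM::Policy');

// Prints the policy document that CloudFormation will receive.
console.log(JSON.stringify(policies, null, 2));
```

The same `Template` object is also a reasonable starting point for unit tests that assert a role only gets the access you intended.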
CDK vs Terraform for Data Platforms
Since we covered Terraform earlier, here is how I think about picking one:
| | AWS CDK | Terraform |
|---|---|---|
| Language | TypeScript, JavaScript, Python, Java, C#, Go | HCL (custom DSL) |
| Cloud coverage | AWS only (with limited Kubernetes support) | All major clouds + many SaaS providers |
| State management | Handled by CloudFormation | Self-managed (S3/DynamoDB backend) |
| Ecosystem maturity | Growing, but smaller community | Very large module registry and community |
| Best for | AWS-native teams that write code daily | Multi-cloud teams or orgs with existing Terraform |
| Learning curve | Low if you already know a supported language | Medium — HCL is not hard but is its own thing |
For a team that is 100% on AWS and already writes Python or TypeScript for data pipelines, CDK is often the right call. If you have workloads across GCP and AWS, or your platform team already uses Terraform, stick with Terraform.
Practical Limitations
1. It is still CloudFormation under the hood
The CDK synthesizes your code into CloudFormation templates and then CloudFormation deploys them. That means CloudFormation’s limits are your limits. Stacks cannot exceed 500 resources, drift detection still applies, and if a stack gets stuck in UPDATE_ROLLBACK_FAILED, you have the same pain as raw CloudFormation.
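One way to keep an eye on the resource limit is a small check over the synthesized template. The sketch below is an assumption about how you might do it, not a CDK feature: `checkStackSize` and the 80% warning threshold are hypothetical, and in practice you would feed it a template parsed from the JSON files that cdk synth writes to cdk.out/.

```typescript
// The only part of a CloudFormation template this check cares about.
interface CfnTemplate {
  Resources?: Record<string, unknown>;
}

const CFN_RESOURCE_LIMIT = 500; // per-stack limit mentioned above
const WARN_RATIO = 0.8;         // assumed warning threshold, tune to taste

type SizeStatus = 'ok' | 'warn' | 'over';

// Counts the resources in a template and classifies the stack size.
function checkStackSize(
  template: CfnTemplate,
  limit: number = CFN_RESOURCE_LIMIT,
): { count: number; status: SizeStatus } {
  const count = Object.keys(template.Resources ?? {}).length;
  let status: SizeStatus = 'ok';
  if (count > limit) {
    status = 'over';
  } else if (count >= limit * WARN_RATIO) {
    status = 'warn';
  }
  return { count, status };
}

// Inline sample template; a real run would JSON.parse a synthesized file.
const sample: CfnTemplate = { Resources: { Bucket: {}, Role: {}, Database: {} } };
const result = checkStackSize(sample);
console.log(`resources: ${result.count}, status: ${result.status}`);
```

Running something like this in CI gives you a warning well before a deploy fails on the limit, which is usually the moment to split the stack.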
2. Construct library coverage is not complete
Not every AWS service has L2 constructs (the nicer high-level ones). For some services you will use L1 constructs — auto-generated from the CloudFormation spec — which look a lot more like raw CloudFormation JSON. AWS Glue is a good example: CfnDatabase above is an L1 construct. You lose some of the fluent API when you drop to L1.
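To show the difference in feel, here is a rough side-by-side using S3, which has both levels. The stack and construct names are invented for the comparison; the two buckets are meant to express roughly the same intent (versioning plus S3-managed encryption), not to be byte-for-byte identical in the output.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Template } from 'aws-cdk-lib/assertions';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ConstructLevelsStack');

// L1: property names mirror the CloudFormation resource specification.
new s3.CfnBucket(stack, 'L1Bucket', {
  versioningConfiguration: { status: 'Enabled' },
  bucketEncryption: {
    serverSideEncryptionConfiguration: [
      { serverSideEncryptionByDefault: { sseAlgorithm: 'AES256' } },
    ],
  },
});

// L2: the same intent expressed through the higher-level API.
new s3.Bucket(stack, 'L2Bucket', {
  versioned: true,
  encryption: s3.BucketEncryption.S3_MANAGED,
});

// Synthesize in memory so the two results can be compared.
const template = Template.fromStack(stack);
console.log(JSON.stringify(template.findResources('AWS::S3::Bucket'), null, 2));
```

The L1 version also has no helpers like grantReadWrite, so any permissions around it have to be written out by hand.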
3. Multi-region and multi-account is more work
You have to run cdk bootstrap in every account/region combination you deploy to. If you have a multi-account setup with a data lake in one account and consumers in another, you need to manage cross-account IAM and asset publishing yourself. Terraform’s provider model handles this more naturally in my experience.
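As a small illustration of the cross-account IAM part, the sketch below pins a stack to a data lake account and grants read access to a consumer account. Both account IDs are placeholders, and the stack and bucket names are hypothetical. Because the grantee is another account, the CDK has no role to attach a policy to, so the grant ends up as a bucket policy on the lake side.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Template } from 'aws-cdk-lib/assertions';

// Placeholder account IDs, for illustration only.
const LAKE_ACCOUNT = '111111111111';
const CONSUMER_ACCOUNT = '222222222222';

const app = new cdk.App();

// Pin the stack to the account and region that own the data lake.
const lakeStack = new cdk.Stack(app, 'LakeStack', {
  env: { account: LAKE_ACCOUNT, region: 'ap-southeast-2' },
});

const lakeBucket = new s3.Bucket(lakeStack, 'LakeBucket');

// Cross-account read access is expressed as a bucket (resource) policy.
lakeBucket.grantRead(new iam.AccountPrincipal(CONSUMER_ACCOUNT));

const template = Template.fromStack(lakeStack);
console.log(JSON.stringify(template.findResources('AWS::S3::BucketPolicy'), null, 2));
```

This only covers the lake side. Roles in the consumer account still need their own IAM permissions for the bucket, and those live in a stack deployed to that account.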
4. State lives in CloudFormation
You cannot easily inspect or modify the state the way you can with a Terraform state file. If something goes wrong, you fix it through the CloudFormation console, not a state file edit. Most of the time this is fine. When it is not fine, it is really not fine.
What Changes in Production
The stack above is a demo. For production, here is what I would add:
- Separate stacks per concern: One stack for the S3 buckets, one for Glue, one for IAM. This limits blast radius — if a Glue stack update fails, your buckets are not affected.
- KMS encryption with customer-managed keys instead of S3-managed encryption. Data lakes often contain sensitive data and you want control over key rotation.
- Bucket policies restricting access to specific VPC endpoints or IAM principals, not just IAM role grants.
- Enable server access logging on the S3 bucket. When someone accidentally deletes a prefix, you want a trail.
- Pin CDK construct library versions in package.json instead of using ^ ranges. Breaking changes between CDK v2 minor versions are rare but they happen.
- CI/CD for your CDK app: Run cdk diff on every PR so the reviewer can see exactly what infrastructure changes are proposed before merging.
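Two of the items above, the customer-managed KMS key and server access logging, translate fairly directly into bucket properties. The following is a sketch under my own naming assumptions (the stack name, construct IDs, and the data-lake/ log prefix are all hypothetical), not a complete production configuration.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as kms from 'aws-cdk-lib/aws-kms';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Template } from 'aws-cdk-lib/assertions';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'HardenedLakeStack');

// Customer-managed key with automatic rotation turned on.
const lakeKey = new kms.Key(stack, 'LakeKey', {
  enableKeyRotation: true,
});

// Separate bucket that receives the server access logs.
const accessLogs = new s3.Bucket(stack, 'AccessLogsBucket', {
  encryption: s3.BucketEncryption.S3_MANAGED,
});

// Data lake bucket encrypted with the key above and logging to accessLogs.
new s3.Bucket(stack, 'HardenedLakeBucket', {
  versioned: true,
  encryption: s3.BucketEncryption.KMS,
  encryptionKey: lakeKey,
  serverAccessLogsBucket: accessLogs,
  serverAccessLogsPrefix: 'data-lake/',
});

const template = Template.fromStack(stack);
```

Bucket policies for VPC endpoints and the split into separate stacks are left out here to keep the example short.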
Wrapping Up
The AWS CDK is a solid choice for data platform teams that are deep in the AWS ecosystem and want infrastructure code that reads like the application code they already write. It reduces boilerplate, makes reuse easier, and fits into the same CI/CD pipeline your data pipelines already use.
That said, it is not a replacement for Terraform if you are multi-cloud or have an existing Terraform codebase. The right tool depends on where you sit.
If you are starting fresh on AWS and your team writes Python or TypeScript, give the CDK a try. The bootstrap-and-deploy loop we walked through here should get you going in under an hour.
