
Getting Started with Apache Spark on Databricks
In this article, let us walk through what it actually looks like to run Apache Spark on Databricks when you are coming from a self-managed Spark background. If you have been running Spark on EMR, Cl...

In this article, let us go through the Apache Spark transformations that come up again and again in real pipelines. I am not going to list every function in the API. Instead, I will focus on the on...

In this article, let us look at Databricks notebooks vs jobs for production work, why both exist, and where each one fits. When we start building a pipeline on Databricks, most of us begin with a n...

In this article, let us look at a simple data platform architecture on GCP that works well for a small team. This approach is useful when you need to ingest files or events, do some light to medium...

In this article, let us see how to put together a small team data platform on AWS without creating a huge platform engineering project for ourselves. This kind of setup is useful when the team want...

In this article, let us see a simple CDC ingestion pattern that works well when you want to bring source system changes into a data lake or warehouse without building something too fancy on day one....

In this article, let us see how to handle retries and idempotency in ETL jobs, and why this matters when a pipeline fails halfway and we need to run it again without creating bad data. Most teams st...

In this article, let us look at how to set up a proper CI/CD pipeline for Terraform using GitHub Actions. If you have been running Terraform from your local machine, you might have noticed it works...

In this article, I want to walk through how we approach partitioning for data lake tables. I have seen this done wrong enough times that I think it is worth writing down what actually works in prac...

In this article, let us walk through the medallion architecture pattern (landing, bronze, silver, and gold layers) and why teams use this approach when building data lakehouses. If you are coming ...