Batch vs streaming for beginner data engineers

This article looks at what batch processing and streaming mean in data engineering, when to use each one, and why both still matter in real projects. If you are just getting started, these two terms can sound bigger than they really are. In many cases, the decision comes down to how fast the data needs to move and how much complexity your team is ready to manage.

When I first started reading about streaming, it looked like everything modern had to be real time. But in practice, many useful data pipelines are still batch jobs running every hour or every day. Streaming is powerful, but it is not automatically the better choice. It is better to first understand the business need and then pick the simplest approach that solves it.

What is batch processing

Batch processing means collecting data over a period of time and processing it together. You do not handle each event the moment it arrives. Instead, you wait for a schedule or a trigger and then process a group of records.

A simple example would be:

  • application logs land in cloud storage through the day
  • a job runs every night at 1 AM
  • the job reads all new files
  • it transforms the data
  • it loads the results into a warehouse table

This is still a very common pattern. It is easier to build, easier to debug, and usually cheaper to operate.

What is streaming

Streaming means processing data continuously as it arrives, or at least close to real time. Instead of waiting for a full batch, the pipeline keeps reading events and pushing updates forward.

A simple streaming example would be:

  • an ecommerce application publishes order events to Kafka
  • a consumer reads each event
  • the pipeline validates and enriches the event
  • the result is written into an analytics store within seconds

This is useful when the data needs to be acted on quickly: live dashboards, fraud detection, alerts, or near-real-time personalization.
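The read-validate-enrich loop above can be sketched in plain Python. Events here are simulated as dicts rather than read from Kafka, and the field names (`order_id`, `amount`, `country`) are illustrative assumptions, not a real schema:

```python
def validate(event):
    """Check that the event carries the fields downstream steps rely on."""
    return all(key in event for key in ("order_id", "amount", "country"))

def enrich(event):
    """Attach derived fields before writing to the analytics store."""
    enriched = dict(event)
    enriched["amount_usd"] = round(event["amount"] * 1.0, 2)  # placeholder FX rate
    return enriched

def process_stream(events):
    """Process each event as it arrives instead of waiting for a full batch."""
    results = []
    for event in events:
        if not validate(event):
            continue  # a real pipeline would route this to a dead-letter queue
        results.append(enrich(event))
    return results

orders = [
    {"order_id": 1, "amount": 25.0, "country": "DE"},
    {"order_id": 2, "amount": 99.5},  # missing country, dropped by validation
]
print(process_stream(orders))
```

In a real pipeline the `for` loop would sit behind a Kafka consumer and run forever, but the per-event shape of the work is the same.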

A quick comparison

| Area | Batch | Streaming |
| --- | --- | --- |
| Processing pattern | Runs on schedule or trigger | Runs continuously |
| Latency | Minutes to hours | Seconds to milliseconds |
| Operational complexity | Lower | Higher |
| Cost control | Usually easier | Can be harder |
| Debugging | Simpler replay and backfill | More moving parts |
| Best for | Reports, daily loads, periodic sync | Alerts, live metrics, event-driven use cases |

For a beginner, the main thing to remember is that batch optimizes for simplicity and streaming optimizes for speed.

Batch example using SQL and a scheduled job

Let us say we receive CSV files every day in a landing bucket. We want to create a daily sales summary in the warehouse. A batch pipeline would be enough here.

The steps could look like this:

  1. ingest files into a raw table
  2. run a scheduled SQL job
  3. write results into an aggregated table

A sample query might look like this:

```sql
insert into analytics.daily_sales_summary
select
  order_date,
  region,
  count(*) as total_orders,
  sum(order_amount) as total_revenue
from raw.orders
where order_date = current_date - interval '1 day'
group by order_date, region;
```

And your orchestration could be as simple as a cron job or an Airflow DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id='daily_sales_summary',
    start_date=datetime(2025, 1, 1),
    schedule='0 1 * * *',  # every night at 1 AM
    catchup=True           # missed runs since start_date are backfilled
):
    build_summary = SQLExecuteQueryOperator(
        task_id='build_summary',
        conn_id='warehouse',
        sql='sql/daily_sales_summary.sql'
    )
```

This is straightforward. If the job fails, you can rerun it. If a source file arrives late, you can backfill the missing day. This is one reason batch is often the right starting point.
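Backfilling boils down to finding which daily partitions are missing and rerunning the job for each one. A minimal sketch of that bookkeeping, with a made-up set of already-loaded dates for illustration:

```python
from datetime import date, timedelta

def missing_partitions(start, end, loaded):
    """Return every date in [start, end] that has no loaded partition."""
    days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(days))
    return [d for d in expected if d not in loaded]

# Suppose Jan 2 and Jan 4 never landed in the warehouse
loaded = {date(2025, 1, 1), date(2025, 1, 3)}
gaps = missing_partitions(date(2025, 1, 1), date(2025, 1, 4), loaded)
print(gaps)  # the dates a rerun of the daily job should target
```

An orchestrator like Airflow does this accounting for you, but it is worth understanding that a backfill is nothing more mysterious than this.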

Streaming example using events

Now let us take a different use case. Suppose a payments platform wants to detect suspicious transactions quickly. Waiting until tomorrow morning is not useful. This is where streaming starts making sense.

A simplified streaming flow might look like this:

```text
application -> Kafka topic -> stream processor -> fraud score table -> alert service
```

Pseudo-code for a streaming consumer might look like this:

```python
for event in transaction_stream:
    # Flag large transactions coming from outside the trusted-country list
    if event.amount > 5000 and event.country not in trusted_countries:
        risk = 'high'
    else:
        risk = 'normal'

    write_to_analytics(event, risk)

    if risk == 'high':
        send_alert(event)
```

Even this tiny example shows the difference. The pipeline is always running. You now need to care about consumer lag, duplicate events, message ordering, checkpoints, and how to restart the job safely.

When batch is the better choice

For beginner data engineers, batch is often enough when:

  • reports are consumed once per day
  • source systems export files periodically
  • a few hours of delay is acceptable
  • the team is small and wants low maintenance
  • backfills are expected

There is nothing old-fashioned about this. If the business reads a dashboard every morning, a nightly job may be the cleanest solution.

When streaming is worth it

Streaming is worth the extra complexity when:

  • business value depends on low latency
  • users expect live dashboards or notifications
  • events are naturally produced one by one
  • downstream systems need immediate updates
  • late action is almost as bad as no action

The important part is to be honest about the latency requirement. Sometimes people say they need real time when a 15-minute micro-batch would be perfectly fine. That is worth checking before designing a more complex system.

Practical limitations and caveats

There are a few things to be careful about.

1. Streaming does not remove data quality issues

If bad events come in, they will just move faster. You still need schema validation, dead-letter handling, and a plan for malformed data.
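Dead-letter handling can be as simple as splitting the stream in two. A minimal sketch, assuming events are dicts and `REQUIRED` lists the fields a valid event must carry (both names are illustrative):

```python
REQUIRED = ("event_id", "amount")

def route(events):
    """Split incoming events into valid ones and a dead-letter list."""
    valid, dead_letter = [], []
    for event in events:
        if all(field in event for field in REQUIRED):
            valid.append(event)
        else:
            dead_letter.append(event)  # keep malformed events for inspection and replay
    return valid, dead_letter

valid, dlq = route([
    {"event_id": 1, "amount": 10},
    {"amount": 5},  # malformed: no event_id
])
print(len(valid), len(dlq))  # 1 1
```

The point is that bad data needs somewhere to go in both batch and streaming; streaming just forces you to decide earlier.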

2. Batch can still be near real time

You do not have to choose between once a day and full streaming. Many teams run a job every 5 or 15 minutes. That is often a good middle ground.
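A micro-batch run typically picks up only the records newer than the last processed watermark and then advances it. A sketch of that pattern, with rows as made-up (timestamp, payload) tuples:

```python
def micro_batch(rows, watermark):
    """Process only rows that arrived after the last watermark."""
    new_rows = [r for r in rows if r[0] > watermark]
    new_watermark = max((r[0] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [(100, "a"), (200, "b"), (300, "c")]
batch, wm = micro_batch(rows, watermark=150)
print(batch, wm)  # picks up b and c, advances the watermark to 300
```

Run this every 5 or 15 minutes on a schedule and you get most of the freshness of streaming with the operational simplicity of batch.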

3. Reprocessing is usually easier in batch

With batch, you can rerun a partition or a date range. With streaming, replay is possible, but it takes more discipline around event retention, idempotency, and state management.

4. Cost can surprise you

A streaming stack may run all the time even when traffic is low. Batch lets you pay mostly when the job runs. For low-volume workloads, this difference matters.

5. Exactly-once is harder than it sounds

Many beginner architectures assume each message is processed once and only once. In reality, duplicate processing can happen. It is safer to design for idempotency than to depend fully on perfect delivery semantics.
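Designing for idempotency can be as simple as keying writes by event id, so a duplicate delivery overwrites one row instead of double counting. A sketch with an in-memory dict standing in for the destination table:

```python
store = {}  # stands in for a destination table keyed by event id

def idempotent_write(event):
    """Upsert by event_id: reprocessing the same event leaves exactly one row."""
    store[event["event_id"]] = event

for event in [
    {"event_id": "e1", "amount": 50},
    {"event_id": "e1", "amount": 50},  # duplicate delivery
    {"event_id": "e2", "amount": 75},
]:
    idempotent_write(event)

print(len(store))  # 2, despite three deliveries
```

In a real warehouse this is a merge or upsert keyed on the event id; the design choice is the same either way.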

What changes in production

In a demo, you might use one topic, one consumer, and a simple transformation. In production, things usually expand quickly.

For batch systems, production often adds:

  • partitioning and clustering for large tables
  • retry strategy and failure alerts
  • data quality checks
  • proper lineage and logging
  • backfill tooling

For streaming systems, production often adds even more:

  • schema registry or version management
  • checkpointing and state management
  • dead-letter queues
  • monitoring for lag and throughput
  • autoscaling and capacity planning
  • idempotent writes to destinations

That is why I usually suggest beginners learn batch properly first. Once you understand ingestion, transformation, partitioning, orchestration, and warehouse modeling, streaming becomes easier to reason about.

A simple rule of thumb

If your stakeholder asks, “Can I see yesterday’s numbers every morning?” start with batch.

If they ask, “Can I know within a few seconds when this event happens?” then evaluate streaming.

And if the answer is somewhere in between, try a frequent batch or micro-batch design before jumping straight into a full streaming platform.

Conclusion

Batch and streaming are both useful, and one is not automatically more advanced or more correct than the other. Batch is usually simpler to build and maintain, while streaming is useful when low latency actually matters. If you are a beginner data engineer, start by understanding the tradeoff clearly. In many real projects, choosing the simpler option first is what keeps the platform reliable and the team sane.

This post is licensed under CC BY 4.0 by the author.