Batch vs Streaming: A Practical Guide for Beginner Data Engineers
When I started building data pipelines, I kept hearing “batch” and “streaming” thrown around like they were completely different worlds. It took me a while to realise they are not as far apart as the industry makes them sound. In this article let us walk through what batch and streaming actually mean, how they differ in practice, and — most importantly — how to decide which one your use case needs.
The Short Version
Batch processing means you process data in chunks, on a schedule. A nightly job that reads yesterday’s sales data, transforms it, and writes it to a reporting table — that is batch.
Stream processing means you process data as it arrives, one event at a time or in micro-batches. A pipeline that reads click events from Kafka and updates a live dashboard every few seconds is streaming.
The distinction is simpler than people make it: batch waits and processes data on a schedule, while streaming processes each event as it arrives. Which one you need comes down to how soon you need the result.
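To make the contrast concrete, here is a minimal sketch in plain Python. The order amounts and the function names `batch_total` and `streaming_totals` are made up for illustration. The point is only that batch computes once over a finished chunk, while streaming emits an updated result after every event.

```python
from typing import Iterable, Iterator

# Hypothetical order amounts, in arrival order.
ORDERS = [20.0, 35.5, 12.0, 50.0]


def batch_total(orders: Iterable[float]) -> float:
    """Batch: wait until the whole chunk is available, then compute once."""
    return sum(orders)


def streaming_totals(orders: Iterable[float]) -> Iterator[float]:
    """Streaming: update and emit the result after every single event."""
    running = 0.0
    for amount in orders:
        running += amount
        yield running


print(batch_total(ORDERS))             # 117.5
print(list(streaming_totals(ORDERS)))  # [20.0, 55.5, 67.5, 117.5]
```

Both approaches end at the same number. What differs is when intermediate answers become visible.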
Why This Matters More Than You Think
As a data engineer, the batch-vs-streaming choice affects everything downstream — the tools you pick, how you handle failures, even the structure of your data model. I have seen teams default to streaming because it sounds cool, only to spend weeks debugging exactly-once semantics when a nightly batch job would have been fine.
Let us break it down with a real scenario.
Example: E-Commerce Order Pipeline
Imagine you work for an e-commerce company. Orders come in through a web app, get stored in a PostgreSQL database, and various teams need the data:
- Finance wants a daily revenue report. They check it once in the morning.
- Operations wants to know which items are running low on stock so they can reorder.
- Marketing wants to trigger a discount offer the moment someone abandons their cart.
Three teams, three different latency needs. Let us see how batch and streaming serve each one.
Finance: Batch Makes Perfect Sense
Finance checks numbers once a day. No one is refreshing a revenue dashboard at 2 AM. A scheduled job that runs every night at midnight, aggregates sales by product category, and writes a summary table is exactly what they need.
```sql
-- Simple batch aggregation, runs nightly
INSERT INTO daily_revenue_summary (report_date, category, total_sales)
SELECT
    DATE(order_timestamp) AS report_date,
    product_category,
    SUM(order_amount) AS total_sales
FROM orders
WHERE DATE(order_timestamp) = CURRENT_DATE - 1
GROUP BY 1, 2;
```
This is classic batch. You can run it in Apache Spark, Dataflow (batch mode), dbt, or even plain SQL on a cron job. It is simple, reliable, and cheap.
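Because batch jobs get rerun after failures, it helps if a rerun cannot double-count. Below is a rough sketch of the same aggregation wrapped in a delete-then-insert transaction. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere. The function name `run_daily_revenue` and the sample rows are my own assumptions, not part of the pipeline above.

```python
import sqlite3
from datetime import date


def run_daily_revenue(conn: sqlite3.Connection, report_date: date) -> None:
    """Rebuild one day's summary rows; safe to rerun for the same date."""
    day = report_date.isoformat()
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM daily_revenue_summary WHERE report_date = ?", (day,)
        )
        conn.execute(
            """
            INSERT INTO daily_revenue_summary (report_date, category, total_sales)
            SELECT DATE(order_timestamp), product_category, SUM(order_amount)
            FROM orders
            WHERE DATE(order_timestamp) = ?
            GROUP BY DATE(order_timestamp), product_category
            """,
            (day,),
        )


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_timestamp TEXT, product_category TEXT, order_amount REAL)"
)
conn.execute(
    "CREATE TABLE daily_revenue_summary (report_date TEXT, category TEXT, total_sales REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2024-05-01 09:15:00", "books", 20.0),
        ("2024-05-01 17:40:00", "books", 15.0),
        ("2024-05-01 11:00:00", "toys", 30.0),
        ("2024-05-02 08:00:00", "toys", 99.0),
    ],
)

run_daily_revenue(conn, date(2024, 5, 1))
run_daily_revenue(conn, date(2024, 5, 1))  # rerun: the summary is replaced, not duplicated

rows = conn.execute(
    "SELECT report_date, category, total_sales FROM daily_revenue_summary ORDER BY category"
).fetchall()
print(rows)  # [('2024-05-01', 'books', 35.0), ('2024-05-01', 'toys', 30.0)]
```

The delete and the insert share one transaction, so a failure halfway through leaves the previous summary untouched.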
Operations: Near-Real-Time Might Be Enough
Operations needs to know stock levels, but they do not need sub-second updates. If the dashboard refreshes every 5 minutes, that is fine. This is where micro-batch processing shines — process small chunks of data on a short interval rather than waiting a full day.
Tools like Spark Structured Streaming (which by default runs as micro-batches under the hood) or Dataflow with a short window work well here. You get freshness that feels real-time without the operational complexity of a true event-by-event system.
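```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inventory-micro-batch").getOrCreate()

# Spark Structured Streaming reading from a Delta table in micro-batches
inventory_stream = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1)  # read at most one new file per micro-batch
    .load("/data/inventory")
    .groupBy("product_id")
    .agg({"quantity": "sum"})
    .writeStream
    .format("delta")  # assumed sink; the original snippet did not name one
    .option("checkpointLocation", "/data/checkpoints/inventory_totals")  # hypothetical path
    .trigger(processingTime="5 minutes")
    .outputMode("complete")  # rewrite the full aggregate on every trigger
    .start("/data/inventory_totals")  # hypothetical output path
)
```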
Marketing: Genuine Streaming
Cart abandonment is time-sensitive — you want to trigger a message within seconds of the user leaving. A 5-minute micro-batch will not cut it. This is where you need an event-driven, low-latency streaming system like Apache Kafka + Apache Flink, or Kafka Streams.
The pipeline looks different: events flow from the web app to Kafka, a Flink job picks them up in real time, detects idle carts using a session window, and pushes a message to the notification service.
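The session-window idea can be sketched without Flink. The following plain-Python simulation is a toy built on assumptions: the `CartEvent` fields, the `AbandonedCartDetector` class, and the 60-second idle gap are all hypothetical, and it only notices idle carts when another event arrives. A real Flink job would rely on event-time timers and fault-tolerant state instead.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CartEvent:
    user_id: str
    kind: str  # "add_to_cart" or "checkout"
    ts: int    # event time in seconds


class AbandonedCartDetector:
    """Toy session window: a cart session closes after idle_gap_s without activity."""

    def __init__(self, idle_gap_s: int) -> None:
        self.idle_gap_s = idle_gap_s
        self.last_seen: Dict[str, int] = {}  # open sessions: user -> last activity time

    def on_event(self, event: CartEvent) -> List[str]:
        """Process one event; return users whose session timed out without a checkout."""
        expired = sorted(
            user
            for user, seen in self.last_seen.items()
            if event.ts - seen >= self.idle_gap_s
        )
        for user in expired:
            del self.last_seen[user]
        if event.kind == "checkout":
            self.last_seen.pop(event.user_id, None)   # a purchase closes the session
        else:
            self.last_seen[event.user_id] = event.ts  # activity extends the session
        return expired


events = [
    CartEvent("alice", "add_to_cart", 0),
    CartEvent("bob", "add_to_cart", 10),
    CartEvent("alice", "add_to_cart", 20),
    CartEvent("bob", "checkout", 30),
    CartEvent("carol", "add_to_cart", 100),
    CartEvent("dave", "add_to_cart", 200),
]

detector = AbandonedCartDetector(idle_gap_s=60)
for event in events:
    for user in detector.on_event(event):
        print(f"t={event.ts}: send discount offer to {user}")
# t=100: send discount offer to alice
# t=200: send discount offer to carol
```

Bob never gets an offer because his checkout closed the session before the idle gap elapsed.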
Comparison Table
Here is a side-by-side look at how batch and streaming stack up across the dimensions that matter day to day:
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimised for bulk) | Moderate (more overhead per event) |
| Cost | Cheaper — run and shut down | Higher — infrastructure stays up |
| Failure handling | Retry the whole job | Harder — stateful, exactly-once is tricky |
| Tooling maturity | Very mature (Spark, dbt, Airflow) | Growing fast but more complex |
| Debugging | Easier — reprocess fixed data | Harder — events move fast, logs are noisy |
| Data ordering | Easy — sort before processing | Hard — events arrive out of order |
When to Choose What: A Practical Framework
After building pipelines on both sides for a few years, this is how I think about the decision:
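- Does anyone need the result in under a minute? If no, batch or micro-batch is probably fine.
- Is the business SLA under 5 minutes? If yes, a daily or hourly batch is out: use micro-batch for minute-level freshness and true streaming once the requirement drops to seconds.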
- Are you joining multiple data sources with different arrival times? Streaming joins are painful. If you can tolerate a delay, batch or micro-batch makes joins far simpler.
- How much data? At terabytes a day, an hourly batch might be cheaper and easier than streaming every event.
- What does your team know? A working batch pipeline built by your team beats a broken streaming pipeline you cannot debug.
Things That Catch Beginners Out
Streaming Is Not Just “Fast Batch”
The biggest mistake I made was thinking streaming was just batch with a shorter interval. It is not. Streaming introduces state management, watermarks, event-time skew, and backpressure, concepts you almost never have to think about in batch.
For example, in batch you can easily compute a running total because you have all the rows. In streaming, you need to maintain state, decide how long to hold it, and handle late-arriving data that might update a total you already emitted.
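Here is a small simulation of that problem, assuming tumbling event-time windows and a watermark that trails the newest event by a fixed allowed lateness. The class `WindowedSum`, the 60-second window, and the 30-second lateness are illustrative choices of mine, not any framework's API.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class WindowedSum:
    """Tumbling event-time windows closed by a simple watermark."""

    def __init__(self, window_s: int, allowed_lateness_s: int) -> None:
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.state: DefaultDict[int, float] = defaultdict(float)  # window start -> total
        self.max_event_ts = 0
        self.dropped = 0

    def on_event(self, ts: int, amount: float) -> List[Tuple[int, float]]:
        """Add one event; return (window_start, total) for windows that just closed."""
        self.max_event_ts = max(self.max_event_ts, ts)
        watermark = self.max_event_ts - self.allowed_lateness_s
        start = ts - ts % self.window_s
        if start + self.window_s <= watermark:
            self.dropped += 1            # its window was already emitted: too late
        else:
            self.state[start] += amount  # on time, or late but within the allowance
        closed = sorted(s for s in self.state if s + self.window_s <= watermark)
        return [(s, self.state.pop(s)) for s in closed]


agg = WindowedSum(window_s=60, allowed_lateness_s=30)
# (event_time, amount) pairs; note that 55 and 20 arrive out of order.
stream = [(10, 5.0), (50, 3.0), (70, 2.0), (55, 1.0), (95, 4.0), (20, 9.0), (130, 1.0)]
for ts, amount in stream:
    for window_start, total in agg.on_event(ts, amount):
        print(f"window [{window_start}, {window_start + 60}) closed with total {total}")
print("dropped late events:", agg.dropped)
# window [0, 60) closed with total 9.0
# dropped late events: 1
```

The event at time 55 arrives late but still counts, because the watermark has not passed its window yet. The event at time 20 arrives after the window was emitted, so the sketch drops it. Deciding what to do in that second case (drop, emit a correction, or hold state longer) is exactly the design work batch never asks of you.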
Exactly-Once Is Hard
Every streaming framework talks about exactly-once guarantees, but in practice it is nuanced. You need source and sink connectors that support it, checkpointing that works reliably, and idempotent writes at the destination. In batch, you just delete the output partition and rerun the job.
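The idempotent-write part is the easiest piece to show in isolation. This sketch assumes every event carries a unique `event_id` (a hypothetical field) and uses Python's built-in `sqlite3` as a stand-in sink. Delivery is at-least-once, but because the sink ignores IDs it has already stored, a replay after a crash does not change the result.

```python
import sqlite3
from typing import Iterable, Tuple


def write_events(sink: sqlite3.Connection, events: Iterable[Tuple[str, float]]) -> None:
    """Idempotent write: event_id is the primary key, so replayed rows are ignored."""
    with sink:  # commit the whole batch or nothing
        sink.executemany(
            "INSERT OR IGNORE INTO order_events (event_id, amount) VALUES (?, ?)",
            events,
        )


sink = sqlite3.connect(":memory:")
sink.execute("CREATE TABLE order_events (event_id TEXT PRIMARY KEY, amount REAL)")

first_attempt = [("evt-1", 20.0), ("evt-2", 35.5)]
write_events(sink, first_attempt)

# Simulated crash before offsets were committed: the same events are redelivered,
# together with one new event.
write_events(sink, first_attempt + [("evt-3", 12.0)])

count, total = sink.execute("SELECT COUNT(*), SUM(amount) FROM order_events").fetchone()
print(count, total)  # 3 67.5
```

This only covers the destination. The source still has to be replayable and the framework's checkpoints still have to line up with the writes.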
Monitoring Is Different
Batch pipelines have a clear start and end. You check if the job succeeded. With streaming, the job runs forever. You need to monitor throughput, consumer lag, watermark progress, and checkpoint health — not just up-or-down status.
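Consumer lag is the easiest of these to reason about: it is the gap between the newest offset in each partition and the offset your consumer group has committed. A small sketch follows, with made-up offsets and a hypothetical alert threshold. In a real setup the two dictionaries would come from your Kafka client or monitoring tool.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: newest offset minus the group's committed offset."""
    # Assumption: a partition with no committed offset counts as fully unread.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


def lagging_partitions(lag: Dict[int, int], threshold: int) -> List[int]:
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, behind in lag.items() if behind > threshold)


# Hypothetical snapshot: partition -> offset.
end_offsets = {0: 1500, 1: 1420, 2: 980}
committed = {0: 1500, 1: 1100, 2: 975}

lag = consumer_lag(end_offsets, committed)
print(lag)                                     # {0: 0, 1: 320, 2: 5}
print(sum(lag.values()))                       # 325
print(lagging_partitions(lag, threshold=100))  # [1]
```

A single snapshot says little. What matters is whether lag keeps growing across snapshots, which means the job cannot keep up with the input rate.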
Production vs Demo
Here is something I wish someone told me earlier: the streaming demos you see online — reading from a socket, printing to console — look easy because they skip the hard parts. In production you need to think about:
- Schema evolution: Your events will change structure over time. How do you handle that without breaking the pipeline?
- Dead letter queues: What happens when a malformed event arrives? You cannot just crash the job.
- Scaling: Streaming jobs handle variable load. Auto-scaling with Kafka consumer groups or Flink’s adaptive scheduling needs tuning.
- Cost: Leaving a Flink cluster or Dataflow streaming job running 24/7 costs money. Compare that to a batch job that runs for 20 minutes a day.
For a demo, none of this matters. For a production system, all of it does.
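Of the items above, the dead letter queue is the simplest to sketch. The version below keeps both outputs in plain Python lists; in production the dead letters would more likely go to a separate Kafka topic or table. The required fields and the function name `route_events` are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, List, Tuple

REQUIRED_FIELDS = {"order_id", "amount"}  # hypothetical event schema


def route_events(raw_events: Iterable[str]) -> Tuple[List[Dict], List[Dict]]:
    """Parse events; send bad ones to a dead letter list instead of crashing."""
    good: List[Dict] = []
    dead_letters: List[Dict] = []
    for raw in raw_events:
        try:
            event = json.loads(raw)  # raises a ValueError subclass on malformed JSON
            if not isinstance(event, dict):
                raise ValueError("event is not a JSON object")
            missing = REQUIRED_FIELDS - event.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            good.append({"order_id": event["order_id"], "amount": float(event["amount"])})
        except (ValueError, TypeError) as exc:
            # Keep the raw payload and the reason so someone can inspect and replay it.
            dead_letters.append({"raw": raw, "error": str(exc)})
    return good, dead_letters


incoming = [
    '{"order_id": "A1", "amount": 20.0}',
    '{"order_id": "A2"}',
    "not json",
    '{"order_id": "A3", "amount": "12.5"}',
]
good, dead_letters = route_events(incoming)
print(len(good), len(dead_letters))  # 2 2
print(dead_letters[0]["error"])      # missing fields: ['amount']
```

The job keeps running through the two bad events, and nothing is silently lost.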
The Middle Ground: Lambda and Kappa
You may have heard of Lambda Architecture (batch layer + speed layer) or Kappa Architecture (streaming only, replay for reprocessing). In practice, most teams I have worked with do not adopt either name formally — they just mix approaches as needed. A common pattern: use streaming for the hot path (real-time dashboards, alerts) and batch for the cold path (daily aggregates, ML feature engineering, data quality checks). Both read from the same source of truth, like a Kafka topic or a data lake.
What I Would Tell My Younger Self
When you are starting out, focus on batch first. Master data modelling, transformation logic, scheduling, and monitoring. These skills transfer directly. Once you are comfortable, pick up streaming with a small project — maybe a simple Kafka consumer that enriches events and writes to a database. The concepts layer on top of what you already know.
Do not let anyone make you feel like batch is legacy and streaming is the future. They solve different problems, and a senior engineer knows which problem they are solving before they pick the tool.
In the next article we could look at building a simple streaming pipeline end to end with Kafka and Flink. Until then, take a real pipeline you maintain and ask yourself: does this actually need to be real-time? The answer is probably no — and that is a good thing.
