RAG Data Pipelines: Chunking, Metadata, and Freshness
RAG systems are often evaluated by model choice, but retrieval quality usually matters more.
And retrieval quality is mostly a data pipeline design problem.
Three pipeline decisions that drive RAG quality
- how you chunk documents
- which metadata you keep
- how fresh your index is
Chunking strategy
Avoid one-size-fits-all chunking.
Good practice:
- keep semantic units together
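- avoid splitting tables and code blocks in the middle
- use overlap sparingly: enough to carry context across a chunk boundary, not so much that near-duplicate chunks crowd the results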
Chunk too small -> weak context. Chunk too large -> noisy retrieval.
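One way to keep semantic units together is to pack whole paragraphs into chunks instead of cutting at a fixed character offset. The sketch below is a minimal illustration under stated assumptions, not a recommended implementation: the name `chunk_paragraphs`, the blank-line paragraph rule, and the default values for `max_chars` and `overlap_paragraphs` are all hypothetical and should be tuned per corpus.

```python
def chunk_paragraphs(text: str, max_chars: int = 800, overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, carrying a small overlap forward.

    Assumption: paragraphs are separated by blank lines. max_chars is a soft
    limit: a single long paragraph, or overlap plus the next paragraph, may
    exceed it, because a paragraph is never split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        # Length of the chunk if this paragraph were added (plus separators).
        candidate_len = sum(len(p) for p in current) + len(para) + 2 * len(current)
        if current and candidate_len > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the last paragraph(s) forward so context spans the boundary.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Raising `max_chars` trades noisier retrieval for more context per chunk, while setting `overlap_paragraphs=0` removes duplication at the cost of harder boundaries.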
Metadata design
Each chunk should carry useful metadata:
- source id
- source type
- owner/team
- last updated timestamp
- domain tags
Metadata lets you filter candidates before ranking (for example by source type or team) and prefer recent or authoritative sources when ranking.
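As a rough sketch of how the fields listed above might travel with each chunk, the record and filter below use only the standard library. The class name `Chunk`, the field names, and `filter_chunks` are illustrative assumptions, not the schema of any particular vector store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    source_id: str
    source_type: str            # e.g. "wiki", "ticket", "pdf" (hypothetical values)
    owner: str                  # owning person or team
    last_updated: datetime      # timezone-aware timestamp of the source
    domain_tags: frozenset = frozenset()


def filter_chunks(
    chunks: Iterable[Chunk],
    source_type: Optional[str] = None,
    tag: Optional[str] = None,
    updated_after: Optional[datetime] = None,
) -> list:
    """Keep only chunks that match every filter that was provided."""
    selected = []
    for chunk in chunks:
        if source_type is not None and chunk.source_type != source_type:
            continue
        if tag is not None and tag not in chunk.domain_tags:
            continue
        if updated_after is not None and chunk.last_updated < updated_after:
            continue
        selected.append(chunk)
    return selected
```

Applying a filter like this before similarity search shrinks the candidate set, so the ranker only compares chunks that could plausibly answer the query.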
Freshness pipeline
Retrieval freshness often drifts because index updates are treated as a side task.
Build explicit freshness SLAs:
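- ingestion lag threshold (time from a source change to ingestion)
- index update lag threshold (time from ingestion to a searchable index entry)
- stale-document alerts (the index still serves an outdated version)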
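The three checks above can be expressed as a small function over per-document timestamps. This is a sketch under assumptions: the `DocTimestamps` fields, the violation labels, and the rule that a document is stale once the combined lag budget has passed without the index catching up are all hypothetical choices, not a standard definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class DocTimestamps:
    source_id: str
    source_updated_at: datetime
    ingested_at: Optional[datetime]   # None if never ingested
    indexed_at: Optional[datetime]    # None if never indexed


def freshness_violations(
    doc: DocTimestamps,
    now: datetime,
    max_ingestion_lag: timedelta,
    max_index_lag: timedelta,
) -> list:
    """Return the names of the freshness SLAs this document currently breaks."""
    violations = []
    # The latest version counts as ingested only if ingestion ran after the change.
    ingested = doc.ingested_at is not None and doc.ingested_at >= doc.source_updated_at
    ingestion_lag = (doc.ingested_at if ingested else now) - doc.source_updated_at
    if ingestion_lag > max_ingestion_lag:
        violations.append("ingestion_lag")
    if ingested:
        indexed = doc.indexed_at is not None and doc.indexed_at >= doc.ingested_at
        index_lag = (doc.indexed_at if indexed else now) - doc.ingested_at
        if index_lag > max_index_lag:
            violations.append("index_lag")
    # Assumed staleness rule: the index has not caught up within the total budget.
    index_behind = doc.indexed_at is None or doc.indexed_at < doc.source_updated_at
    if index_behind and now - doc.source_updated_at > max_ingestion_lag + max_index_lag:
        violations.append("stale_document")
    return violations
```

Running a check like this on a schedule and alerting on any non-empty result turns freshness from a side task into a monitored property.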
Minimal RAG pipeline shape
flowchart LR
A[Source docs/data] --> B[Normalize]
B --> C[Chunk + Tag]
C --> D[Embed]
D --> E[Index]
E --> F[Query + Retrieve]
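To make the stages in the diagram concrete, here is a toy end-to-end sketch using only the standard library. Everything in it is an assumption for illustration: the function names are hypothetical, the fixed word-window chunker is deliberately naive, and the bag-of-words `embed` merely stands in for a real embedding model and vector index.

```python
import math
import re
from collections import Counter


def normalize(raw: str) -> str:
    """Normalize: collapse whitespace and lowercase (real pipelines also strip markup)."""
    return re.sub(r"\s+", " ", raw).strip().lower()


def chunk_and_tag(doc_id: str, text: str, size: int = 8) -> list:
    """Chunk + Tag: naive fixed word windows, each tagged with its source id."""
    words = text.split()
    return [
        {"source_id": doc_id, "text": " ".join(words[i:i + size])}
        for i in range(0, len(words), size)
    ]


def embed(text: str) -> Counter:
    """Embed: toy bag-of-words vector standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(docs: dict) -> list:
    """Index: an in-memory list of chunk records with their vectors."""
    index = []
    for doc_id, raw in docs.items():
        for chunk in chunk_and_tag(doc_id, normalize(raw)):
            index.append({**chunk, "vector": embed(chunk["text"])})
    return index


def retrieve(index: list, query: str, k: int = 1) -> list:
    """Query + Retrieve: return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vector, entry["vector"]), reverse=True)
    return ranked[:k]
```

Each function maps to one box in the flowchart, so a stage can be swapped out (for example, a structure-aware chunker or a real embedding call) without touching the others.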
Final take
RAG quality is not just prompt tuning.
It is pipeline quality.
If chunking, metadata, and freshness are designed well, your model will look smarter without changing the model.
This post is licensed under CC BY 4.0 by the author.