
RAG Data Pipelines: Chunking, Metadata, and Freshness

RAG systems are often evaluated by model choice, but retrieval quality usually matters more.

And retrieval quality is mostly a data pipeline design problem.

Three pipeline decisions that drive RAG quality

  1. how you chunk documents
  2. which metadata you keep
  3. how fresh your index is

Chunking strategy

Avoid one-size-fits-all chunking.

Good practice:

  • keep semantic units together
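  • avoid splitting tables and code blocks in the middle of the structure
  • use overlap sparingly: enough to carry context across chunk boundaries, not so much that near-duplicate chunks crowd the results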

Chunks that are too small carry too little context to answer a question. Chunks that are too large pull in unrelated text and make retrieval noisy.
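As a rough illustration of the trade-off above, here is a minimal sketch of a chunker that packs whole paragraphs into chunks and repeats the last paragraph of one chunk at the start of the next. The function name `chunk_paragraphs` and the parameters `max_chars` and `overlap` are hypothetical, and treating a blank-line-separated paragraph as the "semantic unit" is an assumption; real documents may need format-aware splitting for tables and code.

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of a chunk are repeated at the start of the
    next one. max_chars is a soft limit: a paragraph is never split, so an
    oversized paragraph (or overlap plus one paragraph) may exceed it.
    """
    # Assumption: a blank line separates semantic units (paragraphs).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry the overlap forward.
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With three 40-character paragraphs, `max_chars=90`, and `overlap=1`, this yields two chunks that share the middle paragraph. Raising `overlap` trades index size and duplicate hits for more context at the boundaries.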

Metadata design

Each chunk should carry useful metadata:

  • source id
  • source type
  • owner/team
  • last updated timestamp
  • domain tags

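Metadata lets you filter candidates before ranking (for example, by source type or owner) and adjust the ranking afterwards (for example, by preferring recently updated sources).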
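One way to picture the fields listed above is a small record attached to every chunk, plus a filter that runs before any similarity ranking. This is a sketch only: the class names `ChunkMetadata` and `Chunk`, the function `filter_chunks`, and its parameters are hypothetical, not taken from any particular vector store.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChunkMetadata:
    source_id: str
    source_type: str            # e.g. "wiki", "ticket", "pdf" (illustrative values)
    owner: str                  # owning person or team
    last_updated: datetime
    domain_tags: frozenset = frozenset()


@dataclass(frozen=True)
class Chunk:
    text: str
    meta: ChunkMetadata


def filter_chunks(
    chunks: Iterable[Chunk],
    *,
    source_type: Optional[str] = None,
    required_tag: Optional[str] = None,
    updated_after: Optional[datetime] = None,
) -> list:
    """Keep only chunks whose metadata matches every filter that is set."""
    kept = []
    for chunk in chunks:
        meta = chunk.meta
        if source_type is not None and meta.source_type != source_type:
            continue
        if required_tag is not None and required_tag not in meta.domain_tags:
            continue
        if updated_after is not None and meta.last_updated <= updated_after:
            continue
        kept.append(chunk)
    return kept
```

In practice most vector stores can apply filters like these on the server side. The point of the sketch is only that the fields must be captured at chunking time, or there is nothing to filter on later.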

Freshness pipeline

Retrieval freshness often drifts because index updates are treated as a side task.

Build explicit freshness SLAs:

  • ingestion lag threshold
  • index update lag threshold
  • stale-document alerts
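The thresholds above could be checked per document along the following lines. This is a minimal sketch under assumptions: the names `FreshnessSLA`, `DocTimestamps`, and `check_freshness` are hypothetical, ingestion lag is taken to mean the time from a source change to its ingestion, and index lag the time from ingestion to the index update. A version that is still pending is measured against the current time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class FreshnessSLA:
    max_ingestion_lag: timedelta   # source change -> ingested
    max_index_lag: timedelta       # ingested -> searchable in the index


@dataclass(frozen=True)
class DocTimestamps:
    source_id: str
    source_updated_at: datetime
    ingested_at: Optional[datetime]   # None if never ingested
    indexed_at: Optional[datetime]    # None if never indexed


def check_freshness(doc: DocTimestamps, sla: FreshnessSLA, now: datetime) -> list:
    """Return the SLA violations for one document; an empty list means fresh."""
    alerts = []
    # The latest source version is pending until an ingest happens after it.
    ingest_pending = doc.ingested_at is None or doc.ingested_at < doc.source_updated_at
    ingest_end = now if ingest_pending else doc.ingested_at
    if ingest_end - doc.source_updated_at > sla.max_ingestion_lag:
        alerts.append("ingestion_lag")
    if not ingest_pending:
        # Index lag is only meaningful once the latest version is ingested.
        index_pending = doc.indexed_at is None or doc.indexed_at < doc.ingested_at
        index_end = now if index_pending else doc.indexed_at
        if index_end - doc.ingested_at > sla.max_index_lag:
            alerts.append("index_lag")
    return alerts
```

Any non-empty result can feed a stale-document alert. Running a check like this on a schedule across all sources is one way to stop index updates from being a side task.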

Minimal RAG pipeline shape

flowchart LR
    A[Source docs/data] --> B[Normalize]
    B --> C[Chunk + Tag]
    C --> D[Embed]
    D --> E[Index]
    E --> F[Query + Retrieve]
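
To make the stages in the flowchart concrete, here is a toy end-to-end sketch in plain Python. Everything in it is an assumption made for illustration: the function names are hypothetical, the "embedding" is just a bag-of-words count standing in for a real embedding model, and the "index" is an in-memory list rather than a vector store.

```python
import math
import re
from collections import Counter


def normalize(raw: str) -> str:
    """Normalize: collapse whitespace inside paragraphs, keep paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", raw)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def chunk_and_tag(source_id: str, text: str) -> list:
    """Chunk + Tag: one chunk per paragraph, tagged with its source id."""
    return [{"source_id": source_id, "text": p} for p in text.split("\n\n") if p]


def embed(text: str) -> Counter:
    """Embed: toy bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs: dict) -> list:
    """Index: a list of (vector, chunk) pairs built from {source_id: raw_text}."""
    index = []
    for source_id, raw in docs.items():
        for chunk in chunk_and_tag(source_id, normalize(raw)):
            index.append((embed(chunk["text"]), chunk))
    return index


def retrieve(index: list, query: str, k: int = 1) -> list:
    """Query + Retrieve: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Each function maps to one box in the diagram, so swapping in a real embedding model or vector store changes `embed`, `build_index`, and `retrieve` while the normalize and chunk stages stay the same.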

Final take

RAG quality is not just prompt tuning.

It is pipeline quality.

If chunking, metadata, and freshness are designed well, your model will look smarter without changing the model.

This post is licensed under CC BY 4.0 by the author.