WHY RAG PIPELINES FAIL IN PRODUCTION
Most retrieval-augmented generation systems fail not because the embeddings are wrong, but because the chunking strategy misunderstands how humans ask questions.
The gap between a query and the passage that answers it is often structural rather than semantic: the answer is in the corpus, but it has been split across chunks or cut off from the context that makes it usable. When you optimize for retrieval accuracy in isolation, you often destroy the contextual coherence that makes an answer actually useful. Fix the chunking before you tune the embeddings.
Many engineers building RAG systems reach for the most sophisticated embedding model they can find. It's the wrong instinct. The embedding model sits downstream of the chunking strategy: it can only represent the chunks it is given, so a perfect embedding of a badly chunked document faithfully retrieves the wrong passage.
The Chunking Problem
Documents aren't written the way questions are asked. A technical manual structures information hierarchically: title, section, subsection, procedure. A user asking 'how do I reset my password' is searching for a specific action buried inside a larger security section. Fixed-size chunking ignores that hierarchy. It cuts at character counts, so a procedure can be separated from its heading or split across two chunks, and that's where retrieval breaks.
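To make the failure concrete, here is a minimal sketch of fixed-size chunking applied to a made-up manual excerpt. The MANUAL text, the 60-character chunk size, and the fixed_size_chunks helper are all assumptions chosen for illustration, not taken from any real system.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical manual excerpt: headings followed by a single procedure.
MANUAL = (
    "Security\n"
    "Account access\n"
    "To reset your password: open Settings, choose Security, "
    "select Reset Password, then confirm via the emailed link.\n"
)

chunks = fixed_size_chunks(MANUAL, 60)

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))

# Which chunks hold the start of the procedure and which hold its last step?
starts = [i for i, c in enumerate(chunks) if "reset your password" in c]
ends = [i for i, c in enumerate(chunks) if "emailed link" in c]
```

With these assumed values, the start of the reset procedure and its final step land in different chunks, so a retriever that returns only the best-matching chunk hands the generator an incomplete procedure.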
Semantic chunking (splitting on topic boundaries rather than character counts) keeps each chunk about one thing, which improves retrieval precision. It's slower to index, but the trade-off is almost always worth it in production. Pair this with hierarchical retrieval (match on the small chunk, then return its parent section or document) and you remove many of the 'partially relevant' failures that show up in evaluation.
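A rough sketch of the chunk-plus-parent idea follows. It makes two loud simplifications: paragraph breaks stand in for real topic-boundary detection, and a toy word-overlap score stands in for embedding similarity. The SECTIONS data and every function name here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    parent_id: str  # key of the section this chunk was cut from
    text: str


def chunk_by_paragraph(sections: dict[str, str]) -> list[Chunk]:
    """Split each section on blank lines (a stand-in for topic boundaries)."""
    chunks: list[Chunk] = []
    for parent_id, body in sections.items():
        for paragraph in body.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                chunks.append(Chunk(len(chunks), parent_id, paragraph))
    return chunks


def overlap_score(query: str, text: str) -> int:
    """Toy lexical score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return len(query_words & text_words)


def retrieve_with_parent(
    query: str, chunks: list[Chunk], sections: dict[str, str]
) -> tuple[Chunk, str]:
    """Match on the small chunk, then return it together with its parent."""
    best = max(chunks, key=lambda c: overlap_score(query, c.text))
    return best, sections[best.parent_id]


# Hypothetical corpus: two sections of a manual.
SECTIONS = {
    "security": (
        "Passwords must be at least 12 characters.\n\n"
        "To reset a password, open Settings and choose Reset Password.\n\n"
        "Two-factor codes expire after 30 seconds."
    ),
    "billing": (
        "Invoices are issued monthly.\n\n"
        "To update a card, open Billing and choose Payment Methods."
    ),
}

chunks = chunk_by_paragraph(SECTIONS)
best, parent = retrieve_with_parent("how do I reset my password", chunks, SECTIONS)
print(best.text)
print(parent)
```

The small chunk gives a precise match, and the parent section restores the surrounding context (password rules, two-factor behavior) that the generator may need to give a complete answer.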
What Actually Works
Three things matter more than embedding model choice: chunk overlap strategy, metadata filtering, and query rewriting. Overlap ensures context isn't severed at arbitrary boundaries. Metadata filtering lets you eliminate irrelevant documents before they reach the reranker. Query rewriting — expanding or decomposing the user's question — bridges the vocabulary gap between how users ask and how documents are written.
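Each of the three can be sketched in a few lines. What follows is a minimal illustration under stated assumptions: the metadata keys, the DOCS records, and the SYNONYMS table are invented, and the rule-based expand_query is a stand-in for whatever rewriting step (often an LLM call) a real pipeline would use.

```python
def overlapping_windows(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding windows of `size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows


def filter_by_metadata(docs: list[dict], **required: str) -> list[dict]:
    """Keep only documents whose metadata matches every required key/value."""
    return [
        doc for doc in docs
        if all(doc.get("meta", {}).get(key) == value for key, value in required.items())
    ]


# Hypothetical synonym table bridging user vocabulary and document vocabulary.
SYNONYMS = {"reset": ["change", "recover"], "password": ["credentials"]}


def expand_query(query: str) -> list[str]:
    """Return the normalized query plus one variant per known synonym."""
    tokens = query.lower().split()
    variants = [" ".join(tokens)]
    for term, alternatives in SYNONYMS.items():
        if term in tokens:
            for alt in alternatives:
                variants.append(" ".join(alt if t == term else t for t in tokens))
    return variants


# Hypothetical document records with metadata attached at index time.
DOCS = [
    {"id": "sec-1", "meta": {"product": "console", "lang": "en"}},
    {"id": "sec-2", "meta": {"product": "mobile", "lang": "en"}},
    {"id": "bill-1", "meta": {"product": "console", "lang": "de"}},
]

print(overlapping_windows("open settings then choose reset password".split(), 4, 2))
print([d["id"] for d in filter_by_metadata(DOCS, product="console", lang="en")])
print(expand_query("Reset password"))
```

The order matters in practice: filter on metadata first so the expensive stages only see plausible candidates, then run each rewritten query against the overlapping chunks and merge the results before reranking.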
The systems I've seen fail in production all had the same shape: a strong embedding model, minimal retrieval logic, and no evaluation harness to catch degradation. Fix the chunking first. Evaluate the retrieval separately from the generation. Ship incrementally.
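Evaluating retrieval on its own needs very little machinery. The sketch below assumes a small hand-labeled set of queries with known relevant chunk ids, and scores any retriever function with recall at k and mean reciprocal rank. The LABELED set, the FAKE_RANKINGS lookup, and the function names are hypothetical placeholders for a real retriever and real labels.

```python
from statistics import mean
from typing import Callable


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant id")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    retrieve: Callable[[str], list[str]],
    labeled: list[tuple[str, set[str]]],
    k: int,
) -> dict[str, float]:
    """Score a retriever on labeled queries, with no generation step involved."""
    recalls, reciprocal_ranks = [], []
    for query, relevant in labeled:
        ranked = retrieve(query)
        recalls.append(recall_at_k(ranked, relevant, k))
        reciprocal_ranks.append(reciprocal_rank(ranked, relevant))
    return {"recall_at_k": mean(recalls), "mrr": mean(reciprocal_ranks)}


# Hypothetical labels: query -> ids of the chunks that actually answer it.
LABELED = [
    ("reset password", {"sec-2"}),
    ("update card", {"bill-1", "bill-3"}),
]

# Stand-in for a real retriever: canned rankings keyed by query.
FAKE_RANKINGS = {
    "reset password": ["sec-1", "sec-2", "sec-9"],
    "update card": ["bill-1", "sec-4", "bill-3"],
}

metrics = evaluate_retrieval(lambda q: FAKE_RANKINGS[q], LABELED, k=2)
print(metrics)
```

Run a harness like this on every change to chunking, filtering, or rewriting. A drop in these numbers points at the retrieval layer before any generated answer has to be read.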