
Why Most RAG Implementations Underperform (And What to Do About It)

December 5, 2024 · Burhan Pasha · RAG, AI, Engineering
RAG / RETRIEVAL PIPELINE

The Prototype Works. Production Does Not.

A basic RAG pipeline takes about an afternoon to build. Chunk your documents, embed them, store them in a vector database, retrieve the top-k chunks for a query, and pass them to an LLM with a prompt. The demo produces answers that look correct. The first real users find answers that are confidently wrong.

The problem is not the architecture. The problem is that the decisions that determine retrieval quality (chunking strategy, embedding model selection, retrieval scoring, context assembly, and what happens when relevant content is not found) are treated as defaults when they should be treated as the core of the system.

Most RAG implementations underperform because the document processing layer was not designed; it was copied from a tutorial.

Chunking Is the Highest-Leverage Decision

How you split documents determines what the retrieval system can find. A chunk that starts mid-sentence because it hit a character limit carries no semantic coherence. A chunk that contains three unrelated topics scores well for three different queries and provides signal for none of them.

The chunking strategy should be derived from the document structure and the query patterns you expect. Technical documentation with defined sections chunks differently than legal contracts. Customer support transcripts chunk differently than product manuals. A flat character-count split applied uniformly across all document types is a sign that the document processing layer was not actually designed.

Practical approaches that consistently outperform naive chunking:

  • Semantic chunking: split on meaning boundaries detected by a small model, not on character counts
  • Structural chunking: use document structure (headings, paragraphs, tables) to define chunk boundaries where structure exists
  • Hierarchical chunking: store both fine-grained chunks for retrieval and their parent context for generation, so retrieved content lands in the LLM with surrounding context intact

The right approach depends on your documents. What is not right is picking one approach without examining your documents first.
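To make the structural option concrete, here is a minimal sketch of heading-based chunking with a paragraph fallback for oversized sections. The Markdown heading convention, the `max_chars` limit, and the function name are illustrative assumptions, not a prescription:

```python
import re

def structural_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Split on heading boundaries first, then on paragraph boundaries
    when a section exceeds max_chars. Limits are illustrative."""
    # Split where a Markdown-style heading starts a line.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Fall back to paragraph boundaries inside oversized sections,
        # so no chunk ever starts mid-sentence.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(buf)
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(buf)
    return chunks
```

A hierarchical variant would additionally record each chunk's parent section so the generation step can expand retrieved chunks back into their surrounding context.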

Retrieval Quality Is Not Just Vector Similarity

Top-k cosine similarity retrieval is a starting point, not a complete solution. The limitations are predictable: queries that are phrased differently from the document language score poorly despite being semantically equivalent; documents with high lexical overlap but low relevance score well; multi-part queries retrieve chunks that address one part and miss the others.

Hybrid retrieval, combining dense vector search with sparse BM25 keyword search and reranking the combined results, consistently outperforms either method alone. The dense search catches semantic matches; the sparse search catches exact terminology that embedding models sometimes miss. A cross-encoder reranker applied to the merged results produces significantly better relevance ordering than either retrieval method's native scoring.

This is not theoretical. On real production document corpora, hybrid retrieval with reranking typically doubles the proportion of queries where the most relevant chunk ranks in the top three compared to pure vector search. That difference is the difference between a system users trust and one they stop using.
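The merge step of hybrid retrieval is often done with Reciprocal Rank Fusion before the reranker runs. A minimal sketch, assuming the dense and sparse rankings come from your embedding index and BM25 index respectively (the chunk IDs below are hypothetical placeholders; k=60 is the commonly used smoothing constant):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: combine ranked chunk-id lists from the
    dense and sparse retrievers into one relevance ordering."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); ids ranked well by
            # both retrievers accumulate the highest fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-query results from the two retrievers.
dense = ["c12", "c07", "c03", "c44"]   # vector-similarity order
sparse = ["c07", "c99", "c12"]         # BM25 order
merged = rrf_merge([dense, sparse])    # c07 wins: ranked high by both
```

A cross-encoder reranker would then re-score only the top of `merged`, which keeps the expensive model off the long tail of candidates.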

The Context Assembly Problem

What you put in the LLM's context window matters as much as what you retrieve. Retrieved chunks that are individually relevant may combine into a context that is contradictory, redundant, or missing the connective tissue needed to answer the query.

Context assembly decisions:

Ordering: chunks that score higher against the query should generally come earlier in the context, but verify this against your specific model. Many models attend more strongly to the beginning and end of the context window than to the middle.

Deduplication: documents with multiple similar sections will produce similar chunks. Retrieving several near-duplicate chunks wastes context window and can bias the response toward over-represented content.

Metadata injection: source document, section title, date. Giving the LLM grounding information alongside the content reduces hallucination and allows the model to reason about the provenance of what it is citing.

Missing content handling: the system needs a defined behavior when retrieval does not find relevant content. An LLM instructed to answer from context will often answer confidently from training data instead, producing responses that are plausible but ungrounded. Explicit not-found handling and confidence scoring prevent this failure mode.

What Grounding Actually Requires

A grounded RAG system does not just retrieve and generate. It attributes. Every claim in the output traces to a specific source chunk. The UI surfaces those sources so users can verify. The system has a defined policy for queries where grounding cannot be established.

Attribution is not a UX feature. It is a system architecture requirement. It means the retrieval system must preserve source metadata through every processing step. It means the generation prompt must instruct the model to cite sources in a parseable format. It means the output layer must render those citations in a way that is actually useful.
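One way to make citations parseable is to have the prompt require a marker such as `[chunk:ID]` after each claim, then validate the markers against the IDs that were actually retrieved. The marker format and IDs below are an assumed convention for illustration; anything cited that was never retrieved is a grounding failure worth surfacing:

```python
import re

def extract_citations(answer: str,
                      valid_ids: set[str]) -> tuple[set[str], set[str]]:
    """Pull [chunk:ID] markers from a generated answer and split them
    into grounded citations and unknown (likely hallucinated) ones."""
    cited = set(re.findall(r"\[chunk:([\w-]+)\]", answer))
    return cited & valid_ids, cited - valid_ids

# Hypothetical model output and the ids retrieval actually returned.
answer = ("Refunds take five business days [chunk:faq-12]. "
          "A restocking fee may apply [chunk:tos-3].")
grounded, unknown = extract_citations(answer, {"faq-12", "faq-08"})
```

Here `tos-3` was never retrieved, so the second claim should be flagged or dropped before the answer reaches the user.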

Most implementations skip this because it adds complexity. It is also the feature that determines whether knowledge workers trust the system enough to rely on it for anything that matters.

The Documents You Did Not Design For

Every RAG system eventually encounters a document type it was not designed for. A new file format, an unusually structured contract, a scanned document with imperfect OCR, a table that needs different handling than prose. How the system handles these cases, whether it degrades gracefully or produces confident nonsense, is a function of how robust the document processing pipeline is.

The pipeline should classify documents before processing them, apply processing strategies appropriate to each type, flag documents that fall outside known patterns for human review, and log processing decisions in a way that makes debugging possible when outputs are wrong.
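A minimal sketch of that classify-route-flag step, assuming a simple extension-based routing table (real classifiers inspect content, not just extensions; the strategy names are placeholders):

```python
from pathlib import Path

# Illustrative routing table: extension -> processing strategy.
STRATEGIES = {".md": "structural", ".pdf": "layout-aware", ".csv": "table"}

def route_document(path: str) -> dict:
    """Pick a processing strategy for a document; anything outside
    known patterns is quarantined for human review. The returned
    record doubles as the log entry, so a bad output can be traced
    back to the routing decision that produced it."""
    ext = Path(path).suffix.lower()
    strategy = STRATEGIES.get(ext)
    return {
        "path": path,
        "strategy": strategy or "quarantine",
        "needs_review": strategy is None,
    }
```

The point is not the routing table; it is that unknown inputs produce an explicit review flag instead of silently flowing through the default path.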

Document processing is not glamorous. It is also where the majority of production RAG failures originate.