Retrieval-augmented generation (RAG) is the workhorse of every serious LLM product in 2026: chatbots, copilots, search, agents. It's also where most teams get stuck, because the gap between a notebook demo and a production system that answers correctly 95%+ of the time is enormous.
The minimum viable production stack
- Chunking: semantic + structural (not naive 512-token splits)
- Embeddings: text-embedding-3-large or Voyage v3 for domain content
- Vector store: pgvector for <10M chunks, Qdrant or Pinecone beyond
- Hybrid search: BM25 + vector, fused with Reciprocal Rank Fusion
- Reranker: Cohere Rerank 3 or a fine-tuned cross-encoder
- Generator: GPT-4o or Claude 3.5 Sonnet with citation forcing
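To make the hybrid search line concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It assumes each retriever returns an ordered list of document IDs, best first. The function name, the sample IDs, and k=60 (a commonly used default) are illustrative choices, not something this stack prescribes.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k damps the influence of top positions.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 index and a vector index for one query.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is the main reason it is a convenient default for fusing the two.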
Where naive RAG breaks
- Retrieves the wrong chunk because the question is phrased differently than the doc
- Hallucinates an answer when no relevant chunk exists
- Cites the right doc but summarizes it incorrectly
- Costs explode as the corpus grows because no caching layer exists
What actually fixes it
The four-step upgrade path
1) Add a query rewriter to expand short user questions.
2) Add hybrid search with reranking.
3) Add a 'no answer' classifier so the model abstains when retrieval is weak.
4) Build an eval set of 100+ real questions and run it on every prompt change.
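Step 3 can start much simpler than a trained classifier. The sketch below uses a plain score threshold as a stand-in: if no retrieved chunk clears a minimum reranker score, the pipeline returns a fixed abstention instead of calling the generator. The `Hit` type, the 0.5 threshold, the assumption that scores are normalized to [0, 1], and `fake_generate` are all hypothetical; the threshold should be tuned on your own eval set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    chunk_id: str
    score: float  # assumed: reranker relevance normalized to [0, 1]


NO_ANSWER = "NO_ANSWER"


def should_abstain(hits: List[Hit], min_score: float = 0.5) -> bool:
    """Abstain when no retrieved chunk clears the relevance threshold."""
    return not any(hit.score >= min_score for hit in hits)


def answer(
    question: str,
    hits: List[Hit],
    generate: Callable[[str, List[Hit]], str],
) -> str:
    """Call the generator only when retrieval looks strong enough."""
    if should_abstain(hits):
        return NO_ANSWER
    return generate(question, hits)


def fake_generate(question: str, hits: List[Hit]) -> str:
    # Placeholder for a real LLM call; it only shows which chunk was used.
    return f"answer grounded in {hits[0].chunk_id}"


weak = [Hit("c1", 0.12), Hit("c2", 0.31)]
strong = [Hit("c7", 0.91), Hit("c2", 0.31)]

print(answer("What is the refund window?", weak, fake_generate))
print(answer("What is the refund window?", strong, fake_generate))
```

Keeping the gate outside the generator prompt means it can be evaluated on its own: the eval set from step 4 can include questions with no answer in the corpus and check that the pipeline abstains on them.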
Cost control
- Cache embeddings — never re-embed unchanged docs
- Use a small model (Haiku, GPT-4o-mini) for rewriting and classification
- Stream responses to cut perceived latency without changing cost
- Set per-tenant rate limits before the first bill arrives, not after
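The embedding cache from the first bullet can be sketched as a lookup keyed by a hash of the model name plus the chunk text, so unchanged chunks are never sent to the embedding API twice. This is an in-memory illustration only: `EmbeddingCache`, `fake_embed`, and the model name are hypothetical, and a production version would presumably persist the store in a database or key-value service.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Content-addressed cache: same model + same text means no new API call."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_name: str):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the underlying embedder was called

    def _key(self, text: str) -> str:
        # Include the model name so switching models invalidates old vectors.
        payload = f"{self._model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding API call.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, model_name="example-model")
cache.embed("refund policy: 30 days")
cache.embed("refund policy: 30 days")  # unchanged chunk: served from cache
cache.embed("refund policy: 60 days")  # edited chunk: embedded again
print(cache.misses)
```

Hashing the chunk text rather than the file path means a re-ingested document only pays for the chunks whose content actually changed.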