AI Engineering 10 min read April 28, 2026

RAG in 2026: The Architecture That Actually Works in Production

Most RAG demos look great and fail in production. Here's the chunking, retrieval, and eval setup that ships — with the gotchas we learned the hard way.

RAG is the workhorse of most serious LLM products in 2026: chatbots, copilots, search, agents. It's also where most teams get stuck, because the gap between a notebook demo and a production system that answers correctly 95%+ of the time is enormous.

The minimum viable production stack

  • Chunking: semantic + structural (not naive 512-token splits)
  • Embeddings: text-embedding-3-large or Voyage v3 for domain content
  • Vector store: pgvector for <10M chunks, Qdrant or Pinecone beyond
  • Hybrid search: BM25 + vector, fused with Reciprocal Rank Fusion
  • Reranker: Cohere Rerank 3 or a fine-tuned cross-encoder
  • Generator: GPT-4o or Claude 3.5 Sonnet with citation forcing (every claim in the answer must cite a retrieved chunk)
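
To make the hybrid search bullet concrete, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of document ids; the function name, the sample ids, and the constant k=60 (a commonly used default) are illustrative assumptions, not part of any specific library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, tune it on your evals.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a BM25 index and a vector index.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The fused list is what you would hand to the reranker.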

Where naive RAG breaks

  1. Retrieves the wrong chunk because the question is phrased differently than the doc
  2. Hallucinates an answer when no relevant chunk exists
  3. Cites the right doc but summarizes it incorrectly
  4. Costs explode as the corpus grows because no caching layer exists

What actually fixes it

The four-step upgrade path

  1. Add a query rewriter to expand short user questions.
  2. Add hybrid search with reranking.
  3. Add a 'no answer' classifier so the model abstains when retrieval is weak.
  4. Build an eval set of 100+ real questions and run it on every prompt change.
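
The abstention step can start out much simpler than a trained classifier. The sketch below swaps in a plain score threshold on reranker output, which is an assumption on our part: the `RetrievedChunk` type, the 0.5 cutoff, the [0, 1] score scale, and the `generate` callback are all hypothetical and should be calibrated against your own eval set.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    score: float  # reranker relevance; a [0, 1] scale is assumed here


ABSTAIN_MESSAGE = "I couldn't find this in the indexed documents."


def select_context(chunks, min_score=0.5, max_chunks=4):
    """Keep only chunks that clear the threshold, best first."""
    strong = [c for c in chunks if c.score >= min_score]
    strong.sort(key=lambda c: c.score, reverse=True)
    return strong[:max_chunks]


def answer_or_abstain(chunks, generate):
    """Call the generator only when retrieval found something usable."""
    context = select_context(chunks)
    if not context:
        # Retrieval was weak: abstain instead of letting the model guess.
        return ABSTAIN_MESSAGE
    return generate(context)
```

A threshold like this is cheap to ship, and the abstention rate it produces is itself a useful metric to track in the eval run.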

Cost control

  • Cache embeddings — never re-embed unchanged docs
  • Use a small model (Haiku, GPT-4o-mini) for rewriting and classification
  • Stream responses to cut perceived latency without changing cost
  • Set per-tenant rate limits before the bill, not after
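
As an illustration of the embedding cache, here is a minimal in-memory sketch keyed by a hash of the model name plus the chunk text. The `EmbeddingCache` class and the `embed_fn` callback are hypothetical; in production the dictionary would typically be a persistent store such as a database table or Redis.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # assumed signature: embed_fn(text) -> list of floats
        self._store = {}
        self.misses = 0  # number of times the embedding API was actually called

    @staticmethod
    def _key(text, model):
        # Include the model name so switching models invalidates old vectors.
        payload = f"{model}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text, model="text-embedding-3-large"):
        key = self._key(text, model)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Hashing the content rather than the document id means an edited document only pays for the chunks that actually changed.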

Frequently Asked Questions

Do I need a vector database or is pgvector enough?

For most companies under 10M chunks, pgvector on managed Postgres is faster to ship and cheaper to run. Move to Qdrant or Pinecone when you hit scale or need advanced filtering.

How do I know if my RAG system is actually good?

Build a labeled eval set of 100+ real user questions with expected answers. Score every change against it. Without evals, you're guessing.
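
A minimal sketch of such an eval loop is shown below. The case format, the `answer_fn` callback (your RAG pipeline), and the substring-based `contains_expected` check are all assumptions for illustration; real setups often replace the check with an LLM judge or a stricter matcher.

```python
def contains_expected(got, expected):
    """Crude correctness check: the expected answer appears in the output."""
    return expected.lower() in got.lower()


def run_eval(cases, answer_fn, is_correct=contains_expected):
    """Run every labeled case through the pipeline and collect failures.

    cases: list of {"question": str, "expected": str} dicts (assumed format).
    Returns (accuracy, failures) so regressions can be inspected, not just counted.
    """
    failures = []
    for case in cases:
        got = answer_fn(case["question"])
        if not is_correct(got, case["expected"]):
            failures.append(
                {"question": case["question"], "expected": case["expected"], "got": got}
            )
    accuracy = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return accuracy, failures
```

Running this in CI on every prompt or retrieval change, and failing the build when accuracy drops below the last recorded value, turns the eval set into a regression test.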

Can I skip reranking?

You can, but in our experience answer accuracy usually drops 10–20%; measure the difference on your own eval set before deciding. Reranking is the cheapest single upgrade you can make.
