
Building a Private RAG System: Lessons from the Field

Onyx Team
RAG LLM On-Premise Privacy

Retrieval-Augmented Generation (RAG) has become the default pattern for enterprise LLM deployments. But when your documents are confidential—legal briefs, patient records, proprietary research—you can’t just pipe them through a third-party API.

Here’s what we’ve learned building private RAG systems for European clients.

Architecture Fundamentals

A sovereign RAG stack has four layers:

  1. Document ingestion — Parse PDFs, Word docs, emails, etc. into structured chunks
  2. Embedding & indexing — Generate vector representations using a local embedding model; store in a vector database (we typically use Qdrant or Milvus on-prem)
  3. Retrieval — Query the index to find relevant context for a user’s question
  4. Generation — Feed retrieved chunks to a local LLM (Llama, Mistral, or similar) to synthesize an answer

All of this happens inside your infrastructure. No data leaves your network.
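To make the four layers concrete, here is a minimal sketch of the wiring, assuming sentence-transformers for local embeddings. A NumPy array stands in for the vector database, and the generation step only assembles the prompt a local LLM would receive; the sample chunks and question are illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Local embedding model (Option 1 below); nothing here calls an external API.
embedder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# 1. Ingestion — in practice this is PDF/Word/email parsing; plain strings here.
chunks = [
    "Employees may work remotely up to three days per week.",
    "Expense reports must be filed within 30 days of purchase.",
]

# 2. Embedding & indexing — normalized vectors so dot product equals cosine similarity.
#    A NumPy array stands in for Qdrant/Milvus to keep the sketch self-contained.
index = embedder.encode(chunks, normalize_embeddings=True)

# 3. Retrieval — top-k nearest chunks for the question.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]

# 4. Generation — assemble the grounded prompt a local LLM (Llama, Mistral) would receive.
def build_prompt(question: str, context: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the sources below and cite them.\n\n{numbered}\n\nQuestion: {question}"

question = "How often can I work from home?"
print(build_prompt(question, retrieve(question)))
```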

Key Decisions

Embedding Model

  • Option 1 — Use a lightweight multilingual model like paraphrase-multilingual-mpnet-base-v2 (runs on CPU, ~500ms/chunk; roughly timed in the sketch below)
  • Option 2 — Fine-tune a domain-specific embedder if you have labeled data (worth it for legal/medical verticals)
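As a sanity check on the CPU latency figure for Option 1, here is a quick timing sketch. It assumes sentence-transformers is installed; the sample text, batch size, and hardware are assumptions that will change the numbers considerably.

```python
import time
from sentence_transformers import SentenceTransformer

# Force CPU to mirror the Option 1 scenario; a GPU would be much faster.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2", device="cpu")
chunks = ["Ein Beispielabsatz aus einem internen Vertragsdokument."] * 32  # multilingual input

start = time.perf_counter()
model.encode(chunks, batch_size=8, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"~{elapsed / len(chunks) * 1000:.0f} ms per chunk on CPU")
```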

Vector Database

  • Qdrant — Great Rust performance, easy Docker deployment, solid filtering (usage sketched below)
  • Milvus — More features and scales to billions of vectors, but a heavier operational footprint
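A hedged sketch of the Qdrant path: a 768-dimensional collection matching the mpnet embedder above, an upsert, and a filtered top-k search against a local Docker deployment. The collection name, payload field, and zero vectors are placeholders for your own data.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")  # assumes an on-prem Docker deployment

# 768 dimensions to match paraphrase-multilingual-mpnet-base-v2.
client.recreate_collection(
    collection_name="contracts",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Index one chunk with a payload field we can filter on later.
client.upsert(
    collection_name="contracts",
    points=[PointStruct(id=1, vector=[0.0] * 768, payload={"department": "legal"})],
)

# Filtered top-k search: only chunks tagged "legal" are considered.
hits = client.search(
    collection_name="contracts",
    query_vector=[0.0] * 768,  # replace with a real query embedding
    limit=10,
    query_filter=Filter(
        must=[FieldCondition(key="department", match=MatchValue(value="legal"))]
    ),
)
```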

LLM

  • 7B models (Mistral, Llama 3) — Fast, fit on consumer GPUs, good for Q&A
  • 13B-70B models — Better reasoning, more expensive to run, overkill for simple retrieval tasks

We’ve found that a 7B model + good retrieval beats a 70B model + poor retrieval every time.
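To illustrate the generation layer, here is a sketch that feeds retrieved chunks to a local 7B model, assuming it is served behind an OpenAI-compatible endpoint (vLLM, llama.cpp server, or similar). The URL and model name are placeholders for your own deployment.

```python
import requests

def generate(question: str, context_chunks: list[str]) -> str:
    # Ground the model in the retrieved chunks and ask it to cite them.
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite the context you used; say 'not in the documents' if the answer is missing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # local inference server, stays on-network
        json={
            "model": "mistral-7b-instruct",           # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,                       # keep answers close to the context
        },
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```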

Challenges

  1. Chunking strategy — Naive fixed-size chunks break mid-sentence or split tables. Use semantic chunking (paragraph-aware) or hierarchical indexing; a paragraph-aware splitter is sketched after this list.
  2. Query-document mismatch — User questions often don’t match document phrasing. Add query rewriting or hypothetical document generation (HyDE).
  3. Context window limits — Even with 8k-32k context windows, you can’t just dump 50 pages. Rank and filter retrieved chunks aggressively.
  4. Hallucination — Local models hallucinate less than you’d think if they’re given good context, but citation/source tracking is essential for trust.
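For the chunking challenge, here is a simple paragraph-aware splitter that keeps paragraphs whole and packs them into chunks up to a character budget. The budget is an illustrative value, and oversized paragraphs are kept whole here; in practice you would split those further or fall back to sentence-level splitting.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1500) -> list[str]:
    # Split on blank lines so sentences and tables inside a paragraph stay together.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Flush the current chunk once adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```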

Performance Benchmarks

For a typical on-prem setup (single RTX 4090):

  • Embedding: ~2-5 docs/sec
  • Retrieval: <100ms for top-k=10
  • Generation: 30-50 tokens/sec (7B model)
  • End-to-end latency: 2-4 seconds for a typical query

Scale horizontally by sharding the vector index or load-balancing inference across multiple GPUs.

When to Choose RAG vs. Fine-Tuning

  • RAG — When your knowledge base updates frequently, or you need source citations
  • Fine-tuning — When you need to teach the model domain-specific reasoning or style

Most organizations start with RAG and add fine-tuning later if needed.

Next Steps

If you’re evaluating RAG for your organization:

  • Start with a small pilot (single use case, 1,000-10,000 docs)
  • Measure retrieval precision (are the right docs surfacing?) before optimizing generation; a simple precision@k check is sketched below
  • Plan for continuous improvement: embedding models and LLMs evolve fast
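To measure retrieval precision during a pilot, a minimal precision@k harness over a small hand-labeled set is often enough. The queries, document IDs, and the retrieve_ids callable below are hypothetical stand-ins for your own retriever and corpus.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the top-k retrieved doc IDs that are actually relevant.
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

# A tiny hand-labeled evaluation set: query plus the doc IDs that should surface.
labeled = [
    {"query": "What is the remote work policy?", "relevant": {"hr-001", "hr-007"}},
    {"query": "How do I file an expense report?", "relevant": {"fin-014"}},
]

def evaluate(retrieve_ids, k: int = 5) -> float:
    # retrieve_ids(query, k) should return ranked document IDs from your retriever.
    scores = [
        precision_at_k(retrieve_ids(ex["query"], k), ex["relevant"], k)
        for ex in labeled
    ]
    return sum(scores) / len(scores)
```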

Need help scoping or implementing? We’d love to chat.