Building a Private RAG System: Lessons from the Field
Retrieval-Augmented Generation (RAG) has become the default pattern for enterprise LLM deployments. But when your documents are confidential—legal briefs, patient records, proprietary research—you can’t just pipe them through a third-party API.
Here’s what we’ve learned building private RAG systems for European clients.
Architecture Fundamentals
A sovereign RAG stack has four layers:
- Document ingestion — Parse PDFs, Word docs, emails, etc. into structured chunks
- Embedding & indexing — Generate vector representations using a local embedding model; store in a vector database (we typically use Qdrant or Milvus on-prem)
- Retrieval — Query the index to find relevant context for a user’s question
- Generation — Feed retrieved chunks to a local LLM (Llama, Mistral, or similar) to synthesize an answer
All of this happens inside your infrastructure. No data leaves your network.
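To make the four layers concrete, here is a minimal end-to-end sketch. It assumes sentence-transformers for the embedding step and keeps the index in memory; the `generate_answer` placeholder stands in for whatever local LLM you deploy.

```python
# Minimal end-to-end RAG sketch: ingest -> embed/index -> retrieve -> generate.
# Assumes sentence-transformers is installed; the LLM call is a placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def ingest(raw_texts: list[str]) -> list[str]:
    # Placeholder chunking: one chunk per paragraph (see "Chunking strategy" below).
    return [p.strip() for t in raw_texts for p in t.split("\n\n") if p.strip()]

def build_index(chunks: list[str]) -> np.ndarray:
    # Embed all chunks; in production this matrix lives in a vector DB, not in RAM.
    return embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def generate_answer(query: str, context: list[str]) -> str:
    # Placeholder: call your local LLM here (llama.cpp, vLLM, Ollama, ...).
    prompt = ("Answer using only the context below.\n\n"
              + "\n---\n".join(context)
              + f"\n\nQuestion: {query}")
    return prompt  # replace with the model's completion

chunks = ingest(["First document text...\n\nSecond paragraph..."])
index = build_index(chunks)
question = "What does the document say?"
print(generate_answer(question, retrieve(question, chunks, index)))
```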
Key Decisions
Embedding Model
- Option 1 — Use a lightweight multilingual model like paraphrase-multilingual-mpnet-base-v2 (runs on CPU, ~500 ms/chunk)
- Option 2 — Fine-tune a domain-specific embedder if you have labeled data (worth it for legal/medical verticals)
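If you go with Option 1, a quick way to sanity-check the CPU latency figure on your own hardware is a sketch like this (the sample text is arbitrary):

```python
# Rough CPU timing check for the Option 1 embedder (numbers vary by hardware).
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2", device="cpu")

chunk = "Ein Beispielabsatz aus einem Vertragsdokument. " * 10  # multi-sentence chunk
start = time.perf_counter()
vec = model.encode(chunk)
print(f"dim={len(vec)}, latency={(time.perf_counter() - start) * 1000:.0f} ms")
```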
Vector Database
- Qdrant — Great Rust performance, easy Docker deployment, solid filtering
- Milvus — More features, scales to billions of vectors, but heavier operational footprint
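As a rough sketch of what the Qdrant path looks like in practice (the collection name, payload fields, and 768-dimension size are illustrative, not prescriptive):

```python
# Minimal Qdrant sketch: create a collection, upsert a chunk vector, run a filtered search.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # on-prem Docker deployment

client.create_collection(
    collection_name="contracts",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)

chunk_vector = [0.0] * 768   # in practice: embedder.encode(chunk).tolist()
query_vector = [0.0] * 768   # in practice: embedder.encode(question).tolist()

client.upsert(
    collection_name="contracts",
    points=[
        models.PointStruct(
            id=1,
            vector=chunk_vector,
            payload={"source": "contract_001.pdf", "lang": "de"},
        )
    ],
)

hits = client.search(
    collection_name="contracts",
    query_vector=query_vector,
    limit=10,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="lang", match=models.MatchValue(value="de"))]
    ),
)
for hit in hits:
    print(hit.score, hit.payload["source"])
```

The payload filter is what makes on-prem Qdrant pleasant for confidential corpora: you can restrict retrieval by language, department, or access level without a second index.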
LLM
- 7B models (Mistral, Llama 3) — Fast, fit on consumer GPUs, good for Q&A
- 13B-70B models — Better reasoning, more expensive to run, overkill for simple retrieval tasks
We’ve found that a 7B model + good retrieval beats a 70B model + poor retrieval every time.
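The generation step can stay model-agnostic if you run the 7B model behind a local OpenAI-compatible server (vLLM, Ollama, llama.cpp server, and similar). A sketch, with the URL and model name as placeholders:

```python
# Sketch of the generation step against a local OpenAI-compatible server.
# The base URL, model name, and example chunks are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # stays on-prem

retrieved_chunks = [
    "[contract_001.pdf] The notice period is three months...",
    "[contract_007.pdf] Either party may terminate for cause...",
]

prompt = (
    "Answer the question using ONLY the context below. "
    "Cite the source in brackets for every claim.\n\n"
    "Context:\n" + "\n".join(retrieved_chunks) + "\n\n"
    "Question: What is the notice period?"
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",            # whatever name your local server exposes
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,                        # low temperature keeps answers grounded
)
print(response.choices[0].message.content)
```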
Challenges
- Chunking strategy — Naive fixed-size chunks break mid-sentence or split tables. Use semantic chunking (paragraph-aware) or hierarchical indexing; see the chunking sketch after this list.
- Query-document mismatch — User questions often don’t match document phrasing. Add query rewriting or hypothetical document generation (HyDE); a HyDE sketch follows this list.
- Context window limits — Even with 8k-32k context windows, you can’t just dump 50 pages. Rank and filter retrieved chunks aggressively.
- Hallucination — Local models hallucinate less than you’d think if they’re given good context, but citation/source tracking is essential for trust.
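For the chunking point, a minimal paragraph-aware chunker looks something like this; the 300-token budget and whitespace token count are simplifying assumptions:

```python
# Paragraph-aware chunking sketch: split on blank lines, then pack paragraphs
# into chunks up to a token budget so nothing is cut mid-sentence.
def chunk_by_paragraph(text: str, max_tokens: int = 300) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        length = len(para.split())  # crude token count; swap in a real tokenizer
        if current and current_len + length > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Tables and headings need extra handling; the point is simply that chunk boundaries should follow the document’s own structure.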
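And for the query-document mismatch point, one way to bolt on HyDE using the same kind of local chat endpoint shown earlier (prompt wording and model name are illustrative):

```python
# HyDE sketch: ask the local LLM to draft a hypothetical answer, then embed that
# draft instead of the raw question so phrasing matches document language better.
def hyde_query_vector(question: str, llm_client, embedder):
    draft = llm_client.chat.completions.create(
        model="mistral-7b-instruct",  # same local server as the generation step
        messages=[{
            "role": "user",
            "content": f"Write a short passage that could answer: {question}",
        }],
    ).choices[0].message.content
    # Retrieval then uses this vector in place of the raw question's embedding.
    return embedder.encode(draft, normalize_embeddings=True)
```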
Performance Benchmarks
For a typical on-prem setup (single RTX 4090):
- Embedding: ~2-5 docs/sec
- Retrieval: <100ms for top-k=10
- Generation: 30-50 tokens/sec (7B model)
- End-to-end latency: 2-4 seconds for a typical query
Scale horizontally by sharding the vector index or load-balancing inference across multiple GPUs.
When to Choose RAG vs. Fine-Tuning
- RAG — When your knowledge base updates frequently, or you need source citations
- Fine-tuning — When you need to teach the model domain-specific reasoning or style
Most organizations start with RAG and add fine-tuning later if needed.
Next Steps
If you’re evaluating RAG for your organization:
- Start with a small pilot (single use case, 1000-10000 docs)
- Measure retrieval precision (are the right docs surfacing?) before optimizing generation; a precision@k sketch follows below
- Plan for continuous improvement: embedding models and LLMs evolve fast
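For the retrieval-precision step, a small labeled set and a precision@k function go a long way. A sketch, assuming your retriever returns chunk or source IDs and the eval-set format shown here:

```python
# Sketch of measuring retrieval precision@k on a small labeled pilot set.
def precision_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """eval_set items look like {"question": str, "relevant_ids": set[str]}."""
    total = 0.0
    for item in eval_set:
        retrieved_ids = retrieve(item["question"], k=k)
        hits = sum(1 for rid in retrieved_ids if rid in item["relevant_ids"])
        total += hits / k
    return total / len(eval_set)

# Example with a stub retriever; in practice this wraps your vector DB query.
eval_set = [{"question": "What is the notice period?", "relevant_ids": {"contract_001#p3"}}]
print(precision_at_k(eval_set, lambda q, k: ["contract_001#p3", "contract_002#p1"], k=2))
```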
Need help scoping or implementing? We’d love to chat.