
Building a Private RAG System: Lessons from the Field

Onyx Team
RAG LLM On-Premise Privacy

Retrieval-Augmented Generation (RAG) has become the default pattern for enterprise LLM deployments. But when your documents are confidential—legal briefs, patient records, proprietary research—you can’t just pipe them through a third-party API.

Here’s what we’ve learned building private RAG systems for European clients.

Architecture Fundamentals

A sovereign RAG stack has four layers:

  1. Document ingestion — Parse PDFs, Word docs, emails, etc. into structured chunks
  2. Embedding & indexing — Generate vector representations using a local embedding model; store in a vector database (we typically use Qdrant or Milvus on-prem)
  3. Retrieval — Query the index to find relevant context for a user’s question
  4. Generation — Feed retrieved chunks to a local LLM (Llama, Mistral, or similar) to synthesize an answer

All of this happens inside your infrastructure. No data leaves your network.
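To make the four layers concrete, here is a minimal sketch of the wiring, assuming sentence-transformers for local embeddings. A NumPy array stands in for the vector database, and the generation step only assembles the prompt a local LLM would receive; the sample chunks and question are illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Local embedding model (Option 1 below); nothing here calls an external API.
embedder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# 1. Ingestion — in practice this is PDF/Word/email parsing; plain strings here.
chunks = [
    "Employees may work remotely up to three days per week.",
    "Expense reports must be filed within 30 days of purchase.",
]

# 2. Embedding & indexing — normalized vectors so dot product equals cosine similarity.
#    A NumPy array stands in for Qdrant/Milvus to keep the sketch self-contained.
index = embedder.encode(chunks, normalize_embeddings=True)

# 3. Retrieval — top-k nearest chunks for the question.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]

# 4. Generation — assemble the grounded prompt a local LLM (Llama, Mistral) would receive.
def build_prompt(question: str, context: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the sources below and cite them.\n\n{numbered}\n\nQuestion: {question}"

question = "How often can I work from home?"
print(build_prompt(question, retrieve(question)))
```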

Key Decisions

Embedding Model

  • Option 1 — Use a lightweight multilingual model like paraphrase-multilingual-mpnet-base-v2 (runs on CPU, ~500ms/chunk; roughly timed in the sketch below)
  • Option 2 — Fine-tune a domain-specific embedder if you have labeled data (worth it for legal/medical verticals)
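As a sanity check on the CPU latency figure for Option 1, here is a quick timing sketch. It assumes sentence-transformers is installed; the sample text, batch size, and hardware are assumptions that will change the numbers considerably.

```python
import time
from sentence_transformers import SentenceTransformer

# Force CPU to mirror the Option 1 scenario; a GPU would be much faster.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2", device="cpu")
chunks = ["Ein Beispielabsatz aus einem internen Vertragsdokument."] * 32  # multilingual input

start = time.perf_counter()
model.encode(chunks, batch_size=8, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"~{elapsed / len(chunks) * 1000:.0f} ms per chunk on CPU")
```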

Vector Database

  • Qdrant — Great Rust performance, easy Docker deployment, solid filtering (usage sketched below)
  • Milvus — More features and scales to billions of vectors, but a heavier operational footprint
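A hedged sketch of the Qdrant path: a 768-dimensional collection matching the mpnet embedder above, an upsert, and a filtered top-k search against a local Docker deployment. The collection name, payload field, and zero vectors are placeholders for your own data.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")  # assumes an on-prem Docker deployment

# 768 dimensions to match paraphrase-multilingual-mpnet-base-v2.
client.recreate_collection(
    collection_name="contracts",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Index one chunk with a payload field we can filter on later.
client.upsert(
    collection_name="contracts",
    points=[PointStruct(id=1, vector=[0.0] * 768, payload={"department": "legal"})],
)

# Filtered top-k search: only chunks tagged "legal" are considered.
hits = client.search(
    collection_name="contracts",
    query_vector=[0.0] * 768,  # replace with a real query embedding
    limit=10,
    query_filter=Filter(
        must=[FieldCondition(key="department", match=MatchValue(value="legal"))]
    ),
)
```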

LLM

  • 7B models (Mistral, Llama 3) — Fast, fit on consumer GPUs, good for Q&A
  • 13B-70B models — Better reasoning, more expensive to run, overkill for simple retrieval tasks

We’ve found that a 7B model + good retrieval beats a 70B model + poor retrieval every time.
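To illustrate the generation layer, here is a sketch that feeds retrieved chunks to a local 7B model, assuming it is served behind an OpenAI-compatible endpoint (vLLM, llama.cpp server, or similar). The URL and model name are placeholders for your own deployment.

```python
import requests

def generate(question: str, context_chunks: list[str]) -> str:
    # Ground the model in the retrieved chunks and ask it to cite them.
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite the context you used; say 'not in the documents' if the answer is missing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # local inference server, stays on-network
        json={
            "model": "mistral-7b-instruct",           # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,                       # keep answers close to the context
        },
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```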

Challenges

  1. Chunking strategy — Naive fixed-size chunks break mid-sentence or split tables. Use semantic chunking (paragraph-aware) or hierarchical indexing; a paragraph-aware splitter is sketched after this list.
  2. Query-document mismatch — User questions often don’t match document phrasing. Add query rewriting or hypothetical document generation (HyDE).
  3. Context window limits — Even with 8k-32k context windows, you can’t just dump 50 pages. Rank and filter retrieved chunks aggressively.
  4. Hallucination — Local models hallucinate less than you’d think if they’re given good context, but citation/source tracking is essential for trust.
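For the chunking challenge, here is a simple paragraph-aware splitter that keeps paragraphs whole and packs them into chunks up to a character budget. The budget is an illustrative value, and oversized paragraphs are kept whole here; in practice you would split those further or fall back to sentence-level splitting.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1500) -> list[str]:
    # Split on blank lines so sentences and tables inside a paragraph stay together.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Flush the current chunk once adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```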

Performance Benchmarks

For a typical on-prem setup (single RTX 4090):

  • Embedding: ~2-5 docs/sec
  • Retrieval: <100ms for top-k=10
  • Generation: 30-50 tokens/sec (7B model)
  • End-to-end latency: 2-4 seconds for a typical query

Scale horizontally by sharding the vector index or load-balancing inference across multiple GPUs.

When to Choose RAG vs. Fine-Tuning

  • RAG — When your knowledge base updates frequently, or you need source citations
  • Fine-tuning — When you need to teach the model domain-specific reasoning or style

Most organizations start with RAG and add fine-tuning later if needed.

Next Steps

If you’re evaluating RAG for your organization:

  • Start with a small pilot (single use case, 1,000-10,000 docs)
  • Measure retrieval precision (are the right docs surfacing?) before optimizing generation; a simple precision@k check is sketched below
  • Plan for continuous improvement: embedding models and LLMs evolve fast
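To measure retrieval precision during a pilot, a minimal precision@k harness over a small hand-labeled set is often enough. The queries, document IDs, and the retrieve_ids callable below are hypothetical stand-ins for your own retriever and corpus.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the top-k retrieved doc IDs that are actually relevant.
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

# A tiny hand-labeled evaluation set: query plus the doc IDs that should surface.
labeled = [
    {"query": "What is the remote work policy?", "relevant": {"hr-001", "hr-007"}},
    {"query": "How do I file an expense report?", "relevant": {"fin-014"}},
]

def evaluate(retrieve_ids, k: int = 5) -> float:
    # retrieve_ids(query, k) should return ranked document IDs from your retriever.
    scores = [
        precision_at_k(retrieve_ids(ex["query"], k), ex["relevant"], k)
        for ex in labeled
    ]
    return sum(scores) / len(scores)
```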

Need help scoping or implementing? We’d love to chat.