RAG Local Setup: Build Retrieval-Augmented Generation Without APIs
RAG Stack Quick Start
Core Components:
- Ollama for running the LLM and the embedding model locally
- LangChain for orchestration (loaders, splitters, chains)
- Chroma as the vector database
Quick Install:
pip install langchain langchain-ollama chromadb
ollama pull nomic-embed-text && ollama pull llama3.1:70b
What is RAG?
Retrieval-Augmented Generation (RAG) enhances LLMs with external knowledge. Instead of relying only on training data, RAG:
- Retrieves relevant documents from a knowledge base
- Augments the prompt with this context
- Generates accurate, grounded responses
RAG Architecture
Query → Embedding → Vector Search → Relevant Docs → LLM + Context → Response
The vector search step is backed by a vector database built from your own documents.
Why Local RAG?
| Cloud RAG | Local RAG |
|---|---|
| $0.0001+ per embedding | $0 after hardware |
| Data sent to cloud | 100% private |
| Rate limits | Unlimited |
| Internet required | Works offline |
| OpenAI/Cohere lock-in | Open source |
Complete Local RAG Setup
Step 1: Install Dependencies
# Core packages
pip install langchain langchain-ollama langchain-community
pip install chromadb sentence-transformers
# Document loaders
pip install pypdf unstructured python-docx
# Pull models
ollama pull nomic-embed-text
ollama pull llama3.1:70b
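# Note: llama3.1:70b needs roughly 40 GB+ of VRAM at Ollama's default quantization;
# llama3.1:8b is a lighter drop-in alternative for smaller GPUs.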
Step 2: Create Vector Store
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize embeddings
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
# Load documents
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len
)
chunks = text_splitter.split_documents(documents)
# Create vector store
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Created vector store with {len(chunks)} chunks")
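Before wiring in the LLM, it is worth sanity-checking the store with a direct similarity search; the query string below is just an example:
# Quick retrieval check (example query)
results = vectorstore.similarity_search("What does the report conclude?", k=3)
for doc in results:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:120])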
Step 3: Build RAG Chain
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Initialize LLM
llm = ChatOllama(
model="llama3.1:70b",
temperature=0.3
)
# Create retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5}
)
# RAG prompt template
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
# Build chain
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Query
response = rag_chain.invoke("What are the key findings in the report?")
print(response)
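Because the chain is a standard LCEL runnable, you can also stream the answer token by token instead of waiting for the full response; a minimal sketch:
# Stream the response; StrOutputParser yields plain string chunks
for token in rag_chain.stream("Summarize the report in three sentences."):
    print(token, end="", flush=True)
print()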
Step 4: Add Document Processing
from langchain_community.document_loaders import (
PyPDFLoader,
Docx2txtLoader,
TextLoader,
UnstructuredMarkdownLoader
)
def load_document(file_path: str):
"""Load document based on file extension"""
if file_path.endswith('.pdf'):
return PyPDFLoader(file_path).load()
elif file_path.endswith('.docx'):
return Docx2txtLoader(file_path).load()
elif file_path.endswith('.txt'):
return TextLoader(file_path).load()
elif file_path.endswith('.md'):
return UnstructuredMarkdownLoader(file_path).load()
else:
raise ValueError(f"Unsupported file type: {file_path}")
def add_documents_to_vectorstore(file_paths: list):
"""Add new documents to existing vector store"""
all_chunks = []
for path in file_paths:
docs = load_document(path)
chunks = text_splitter.split_documents(docs)
all_chunks.extend(chunks)
vectorstore.add_documents(all_chunks)
print(f"Added {len(all_chunks)} chunks from {len(file_paths)} documents")
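Example usage with illustrative file paths (point these at your own documents):
# Illustrative paths; replace with real files in your ./documents folder
add_documents_to_vectorstore([
    "./documents/new_report.pdf",
    "./documents/meeting_notes.md",
])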
Vector Database Comparison
| Database | Setup | Speed | Features | Best For |
|---|---|---|---|---|
| Chroma | Easy | Good | Persistent, filters | Beginners, prototypes |
| FAISS | Medium | Fastest | GPU support | Large datasets |
| Qdrant | Medium | Good | Full-featured | Production |
| Weaviate | Complex | Good | GraphQL, modules | Enterprise |
| Milvus | Complex | Excellent | Distributed | Massive scale |
Chroma Setup (Recommended Start)
import chromadb
from chromadb.config import Settings
# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")
# Create collection
collection = client.get_or_create_collection(
name="documents",
metadata={"hnsw:space": "cosine"}
)
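The native client can also store and query raw text directly; a small sketch with made-up texts and IDs (without an explicit embedding function, Chroma falls back to its built-in default model, which it downloads on first use):
# Add two example texts; IDs must be unique strings
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Ollama runs LLMs locally.", "Chroma persists embeddings on disk."],
)
# Query by text; Chroma embeds the query with the collection's embedding function
hits = collection.query(query_texts=["How do I run models locally?"], n_results=1)
print(hits["documents"][0])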
FAISS Setup (High Performance)
from langchain_community.vectorstores import FAISS
# Create FAISS index
vectorstore = FAISS.from_documents(
documents=chunks,
embedding=embeddings
)
# Save to disk
vectorstore.save_local("./faiss_index")
# Load later
vectorstore = FAISS.load_local(
"./faiss_index",
embeddings,
allow_dangerous_deserialization=True
)
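The reloaded index exposes the same retriever interface as Chroma, so it plugs into the chain from Step 3 unchanged; for example:
# Use the FAISS store as a retriever, exactly like the Chroma version
faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = faiss_retriever.invoke("What are the key findings?")
print(f"Retrieved {len(docs)} documents")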
Embedding Models Comparison
| Model | Dimensions | Quality | Speed | VRAM |
|---|---|---|---|---|
| nomic-embed-text | 768 | Excellent | Fast | 1GB |
| mxbai-embed-large | 1024 | Best | Medium | 2GB |
| all-MiniLM-L6-v2 | 384 | Good | Fastest | CPU |
| bge-large-en | 1024 | Excellent | Medium | 2GB |
| e5-large-v2 | 1024 | Excellent | Medium | 2GB |
Using Different Embedding Models
# Ollama embeddings (recommended)
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# HuggingFace embeddings (CPU-friendly)
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
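Whichever model you choose, a quick check confirms it loads and that its vector size matches the dimensions listed in the table above:
# embed_query returns a single vector; its length is the embedding dimension
vector = embeddings.embed_query("local RAG test sentence")
print(f"Embedding dimension: {len(vector)}")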
Advanced RAG Techniques
Hybrid Search (Dense + Sparse)
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# BM25 for keyword matching
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with ensemble
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, dense_retriever],
weights=[0.3, 0.7] # Favor dense search
)
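Note that BM25Retriever depends on the rank_bm25 package (pip install rank_bm25). Querying the ensemble works like any other retriever; a brief example with an arbitrary query:
# The ensemble merges and re-weights results from both retrievers
docs = ensemble_retriever.invoke("installation requirements")
for doc in docs[:3]:
    print(doc.metadata.get("source", "unknown"), "-", doc.page_content[:80])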
Reranking for Better Relevance
# Local reranking with a cross-encoder (no external reranking API required)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank_documents(query: str, docs: list, top_k: int = 3):
"""Rerank documents using cross-encoder"""
pairs = [[query, doc.page_content] for doc in docs]
scores = reranker.predict(pairs)
# Sort by score
scored_docs = list(zip(docs, scores))
scored_docs.sort(key=lambda x: x[1], reverse=True)
return [doc for doc, score in scored_docs[:top_k]]
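A common pattern is to over-retrieve and then rerank down to a small top-k before building the prompt; for example:
# Retrieve a wide candidate set, then keep only the best-scoring chunks
question = "What are the key findings in the report?"
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).invoke(question)
top_docs = rerank_documents(question, candidates, top_k=3)
context = "\n\n".join(doc.page_content for doc in top_docs)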
Metadata Filtering
# Add metadata during indexing
chunks_with_metadata = []
for chunk in chunks:
chunk.metadata["source_type"] = "technical_doc"
chunk.metadata["date"] = "2026-01"
chunks_with_metadata.append(chunk)
# Filter during retrieval
retriever = vectorstore.as_retriever(
search_kwargs={
"k": 5,
"filter": {"source_type": "technical_doc"}
}
)
Production RAG Setup
Complete RAG Application
import os
from pathlib import Path
from typing import List, Optional
class LocalRAG:
def __init__(
self,
persist_dir: str = "./rag_db",
embedding_model: str = "nomic-embed-text",
llm_model: str = "llama3.1:70b"
):
self.persist_dir = persist_dir
# Initialize embeddings
self.embeddings = OllamaEmbeddings(model=embedding_model)
# Initialize LLM
self.llm = ChatOllama(model=llm_model, temperature=0.3)
# Initialize or load vector store
if os.path.exists(persist_dir):
self.vectorstore = Chroma(
persist_directory=persist_dir,
embedding_function=self.embeddings
)
else:
self.vectorstore = None
# Text splitter
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
def add_documents(self, file_paths: List[str]):
"""Add documents to the knowledge base"""
all_chunks = []
for path in file_paths:
docs = load_document(path)
chunks = self.text_splitter.split_documents(docs)
all_chunks.extend(chunks)
if self.vectorstore is None:
self.vectorstore = Chroma.from_documents(
documents=all_chunks,
embedding=self.embeddings,
persist_directory=self.persist_dir
)
else:
self.vectorstore.add_documents(all_chunks)
return len(all_chunks)
def query(
self,
question: str,
k: int = 5,
filter: Optional[dict] = None
) -> str:
"""Query the knowledge base"""
if self.vectorstore is None:
return "No documents in knowledge base"
# Retrieve relevant docs
retriever = self.vectorstore.as_retriever(
search_kwargs={"k": k, "filter": filter}
)
        docs = retriever.invoke(question)
# Format context
context = "\n\n".join(doc.page_content for doc in docs)
# Generate response
prompt = f"""Answer based on the context below.
Context:
{context}
Question: {question}
Answer:"""
response = self.llm.invoke(prompt)
return response.content
def get_sources(self, question: str, k: int = 3) -> List[dict]:
"""Get source documents for a query"""
docs = self.vectorstore.similarity_search(question, k=k)
return [
{
"content": doc.page_content[:200],
"source": doc.metadata.get("source", "unknown")
}
for doc in docs
]
# Usage
rag = LocalRAG()
rag.add_documents(["./docs/report.pdf", "./docs/manual.pdf"])
answer = rag.query("What are the main conclusions?")
print(answer)
Hardware Requirements
| Setup | VRAM | Performance | Use Case |
|---|---|---|---|
| CPU Only | 0GB | 5 docs/sec | Small datasets |
| 8GB GPU | 8GB | 50 docs/sec | Personal use |
| 16GB GPU | 16GB | 100 docs/sec | Professional |
| 24GB GPU | 24GB | 150 docs/sec | Enterprise |
Key Takeaways
- Local RAG is fully viable with Ollama, Chroma, and LangChain
- nomic-embed-text is the best local embedding model for most cases
- Chroma is easiest for beginners; FAISS for performance
- Chunk size matters; experiment with values in the 500-1500 range (see the sketch after this list)
- Hybrid search improves accuracy over dense-only retrieval
- 16GB+ VRAM recommended for smooth production use
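The chunk-size takeaway is easy to explore empirically; a quick sketch, assuming the documents list from Step 2 is still loaded (RecursiveCharacterTextSplitter counts characters by default, so treat the numbers as approximate):
# Compare how different chunk sizes split the same documents
for size in (500, 1000, 1500):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    print(f"chunk_size={size}: {len(splitter.split_documents(documents))} chunks")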
Next Steps
- Build AI agents that use RAG for knowledge
- Set up MCP servers for file access
- Compare vector databases in depth
- Optimize your GPU for RAG workloads
RAG transforms LLMs from general assistants into experts on your specific documents—all running locally, privately, and without ongoing costs.