RAG Local Setup: Build Retrieval-Augmented Generation Without APIs

February 4, 2026
18 min read
Local AI Master Research Team

RAG Stack Quick Start

Core Components:

  Embeddings: nomic-embed-text
  Vector DB:  Chroma
  LLM:        Llama 3.1 70B
  Framework:  LangChain

Quick Install:
pip install langchain langchain-ollama chromadb
ollama pull nomic-embed-text && ollama pull llama3.1:70b

What is RAG?

Retrieval-Augmented Generation (RAG) enhances LLMs with external knowledge. Instead of relying only on training data, RAG:

  1. Retrieves relevant documents from a knowledge base
  2. Augments the prompt with this context
  3. Generates accurate, grounded responses

RAG Architecture

Query → Embedding → Vector Search → Relevant Docs → LLM + Context → Response
                         ↑
                   Vector Database
                   (Your Documents)
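
Before wiring up the real stack, here is a toy sketch of that loop in plain Python (no libraries; the keyword-overlap "retrieval" is only a stand-in for the vector search built in the steps below):

# Toy illustration of retrieve -> augment -> generate.
# Real embeddings, a vector DB, and an LLM replace each piece in the setup below.
documents = [
    "Llama 3.1 is an open-weight language model.",
    "Chroma is a local vector database.",
    "nomic-embed-text produces 768-dimensional embeddings.",
]

def retrieve(question, docs, k=2):
    # Stand-in for vector search: rank docs by word overlap with the question
    q_words = set(question.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)[:k]

def build_prompt(question, context_docs):
    # "Augment": inject the retrieved text into the prompt the LLM will see
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is Chroma?", retrieve("What is Chroma?", documents)))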

Why Local RAG?

Cloud RAG                  Local RAG
$0.0001+ per embedding     $0 after hardware
Data sent to cloud         100% private
Rate limits                Unlimited
Internet required          Works offline
OpenAI/Cohere lock-in      Open source

Complete Local RAG Setup

Step 1: Install Dependencies

# Core packages
pip install langchain langchain-ollama langchain-community
pip install chromadb sentence-transformers

# Document loaders (docx2txt is required by Docx2txtLoader used later)
pip install pypdf unstructured python-docx docx2txt

# Pull models
ollama pull nomic-embed-text
ollama pull llama3.1:70b
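
Before building anything, a quick sanity check confirms Ollama is running and both models are present (this assumes Ollama's default port, 11434):

# Optional sanity check
ollama list                              # should list nomic-embed-text and llama3.1:70b
curl http://localhost:11434/api/tags     # same information via Ollama's HTTP API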

Step 2: Create Vector Store

from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Load documents
loader = DirectoryLoader(
    "./documents",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
documents = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print(f"Created vector store with {len(chunks)} chunks")

Step 3: Build RAG Chain

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize LLM
llm = ChatOllama(
    model="llama3.1:70b",
    temperature=0.3
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# RAG prompt template
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Build chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Query
response = rag_chain.invoke("What are the key findings in the report?")
print(response)
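
For interactive use, the same chain can also stream the answer token by token instead of returning it all at once (the question here is just an example):

# Stream the response as it is generated
for token in rag_chain.stream("Summarize the report's methodology."):
    print(token, end="", flush=True)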

Step 4: Add Document Processing

from langchain_community.document_loaders import (
    PyPDFLoader,
    Docx2txtLoader,
    TextLoader,
    UnstructuredMarkdownLoader
)

def load_document(file_path: str):
    """Load document based on file extension"""
    if file_path.endswith('.pdf'):
        return PyPDFLoader(file_path).load()
    elif file_path.endswith('.docx'):
        return Docx2txtLoader(file_path).load()
    elif file_path.endswith('.txt'):
        return TextLoader(file_path).load()
    elif file_path.endswith('.md'):
        return UnstructuredMarkdownLoader(file_path).load()
    else:
        raise ValueError(f"Unsupported file type: {file_path}")

def add_documents_to_vectorstore(file_paths: list):
    """Add new documents to existing vector store"""
    all_chunks = []

    for path in file_paths:
        docs = load_document(path)
        chunks = text_splitter.split_documents(docs)
        all_chunks.extend(chunks)

    vectorstore.add_documents(all_chunks)
    print(f"Added {len(all_chunks)} chunks from {len(file_paths)} documents")

Vector Database Comparison

Database    Setup      Speed      Features             Best For
Chroma      Easy       Good       Persistent, filters  Beginners, prototypes
FAISS       Medium     Fastest    GPU support          Large datasets
Qdrant      Medium     Good       Full-featured        Production
Weaviate    Complex    Good       GraphQL, modules     Enterprise
Milvus      Complex    Excellent  Distributed          Massive scale

Chroma Setup (Beginner-Friendly)

import chromadb
from chromadb.config import Settings

# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)
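
With the native client, documents and queries go straight through the collection. A minimal sketch (the IDs and texts are illustrative; with no embedding function specified, Chroma falls back to its built-in default embedder):

# Add raw texts; Chroma embeds them with its built-in default model
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "RAG retrieves context before generating an answer.",
        "Chroma stores embeddings on local disk.",
    ],
    metadatas=[{"source": "notes"}, {"source": "notes"}],
)

# Query by text; results come back ranked by cosine distance
results = collection.query(query_texts=["How does RAG work?"], n_results=2)
print(results["documents"][0])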

FAISS Setup (High Performance)

from langchain_community.vectorstores import FAISS

# Create FAISS index
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)

# Save to disk
vectorstore.save_local("./faiss_index")

# Load later
vectorstore = FAISS.load_local(
    "./faiss_index",
    embeddings,
    allow_dangerous_deserialization=True
)

Embedding Models Comparison

Model               Dimensions  Quality    Speed    VRAM
nomic-embed-text    768         Excellent  Fast     1GB
mxbai-embed-large   1024        Best       Medium   2GB
all-MiniLM-L6-v2    384         Good       Fastest  CPU
bge-large-en        1024        Excellent  Medium   2GB
e5-large-v2         1024        Excellent  Medium   2GB

Using Different Embedding Models

# Ollama embeddings (recommended)
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# HuggingFace embeddings (CPU-friendly)
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

Advanced RAG Techniques

Hybrid Search (Dense + Sparse)

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 for keyword matching (requires: pip install rank_bm25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with ensemble
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # Favor dense search
)
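
The ensemble retriever drops straight into the Step 3 chain; only the context source changes:

# Same chain as Step 3, with the hybrid retriever supplying context
hybrid_rag_chain = (
    {"context": ensemble_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(hybrid_rag_chain.invoke("What are the key findings in the report?"))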

Reranking for Better Relevance

# For local reranking, use a cross-encoder (no cloud reranking API needed)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_documents(query: str, docs: list, top_k: int = 3):
    """Rerank documents using cross-encoder"""
    pairs = [[query, doc.page_content] for doc in docs]
    scores = reranker.predict(pairs)

    # Sort by score
    scored_docs = list(zip(docs, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored_docs[:top_k]]
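
One way to use it: over-fetch with the vector store, then keep only the top reranked chunks before building the prompt. A sketch reusing the pieces defined above:

# Retrieve generously, then rerank down to the best few chunks
def retrieve_and_rerank(question: str, top_k: int = 3):
    candidates = vectorstore.similarity_search(question, k=15)  # over-fetch
    return rerank_documents(question, candidates, top_k=top_k)

docs = retrieve_and_rerank("What are the key findings in the report?")
context = "\n\n".join(doc.page_content for doc in docs)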

Metadata Filtering

# Add metadata during indexing (before building the vector store)
for chunk in chunks:
    chunk.metadata["source_type"] = "technical_doc"
    chunk.metadata["date"] = "2026-01"

# Filter during retrieval
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"source_type": "technical_doc"}
    }
)

Production RAG Setup

Complete RAG Application

import os
from typing import List, Optional

class LocalRAG:
    def __init__(
        self,
        persist_dir: str = "./rag_db",
        embedding_model: str = "nomic-embed-text",
        llm_model: str = "llama3.1:70b"
    ):
        self.persist_dir = persist_dir

        # Initialize embeddings
        self.embeddings = OllamaEmbeddings(model=embedding_model)

        # Initialize LLM
        self.llm = ChatOllama(model=llm_model, temperature=0.3)

        # Initialize or load vector store
        if os.path.exists(persist_dir):
            self.vectorstore = Chroma(
                persist_directory=persist_dir,
                embedding_function=self.embeddings
            )
        else:
            self.vectorstore = None

        # Text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )

    def add_documents(self, file_paths: List[str]):
        """Add documents to the knowledge base"""
        all_chunks = []

        for path in file_paths:
            docs = load_document(path)
            chunks = self.text_splitter.split_documents(docs)
            all_chunks.extend(chunks)

        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=all_chunks,
                embedding=self.embeddings,
                persist_directory=self.persist_dir
            )
        else:
            self.vectorstore.add_documents(all_chunks)

        return len(all_chunks)

    def query(
        self,
        question: str,
        k: int = 5,
        filter: Optional[dict] = None
    ) -> str:
        """Query the knowledge base"""
        if self.vectorstore is None:
            return "No documents in knowledge base"

        # Retrieve relevant docs
        retriever = self.vectorstore.as_retriever(
            search_kwargs={"k": k, "filter": filter}
        )
        docs = retriever.invoke(question)

        # Format context
        context = "\n\n".join(doc.page_content for doc in docs)

        # Generate response
        prompt = f"""Answer based on the context below.

Context:
{context}

Question: {question}

Answer:"""

        response = self.llm.invoke(prompt)
        return response.content

    def get_sources(self, question: str, k: int = 3) -> List[dict]:
        """Get source documents for a query"""
        if self.vectorstore is None:
            return []
        docs = self.vectorstore.similarity_search(question, k=k)
        return [
            {
                "content": doc.page_content[:200],
                "source": doc.metadata.get("source", "unknown")
            }
            for doc in docs
        ]

# Usage
rag = LocalRAG()
rag.add_documents(["./docs/report.pdf", "./docs/manual.pdf"])
answer = rag.query("What are the main conclusions?")
print(answer)

Hardware Requirements

Setup      VRAM   Performance    Use Case
CPU Only   0GB    5 docs/sec     Small datasets
8GB GPU    8GB    50 docs/sec    Personal use
16GB GPU   16GB   100 docs/sec   Professional
24GB GPU   24GB   150 docs/sec   Enterprise

Key Takeaways

  1. Local RAG is fully viable with Ollama, Chroma, and LangChain
  2. nomic-embed-text is a strong default embedding model: excellent quality, fast, and only ~1GB of VRAM
  3. Chroma is easiest for beginners; FAISS for performance
  4. Chunk size matters: experiment in the 500-1500 character range
  5. Hybrid search improves accuracy over dense-only retrieval
  6. 16GB+ VRAM recommended for smooth production use

Next Steps

  1. Build AI agents that use RAG for knowledge
  2. Set up MCP servers for file access
  3. Compare vector databases in depth
  4. Optimize your GPU for RAG workloads

RAG transforms LLMs from general assistants into experts on your specific documents—all running locally, privately, and without ongoing costs.
