RAG Local Setup: Build Retrieval-Augmented Generation Without APIs
RAG Stack Quick Start
Core Components:
- Ollama for running the LLM and the embedding model locally
- LangChain for orchestration (loaders, splitters, chains)
- Chroma as the vector database
Quick Install:
pip install langchain langchain-ollama chromadb
ollama pull nomic-embed-text && ollama pull llama3.1:70b
What is RAG?
Retrieval-Augmented Generation (RAG) enhances LLMs with external knowledge. Instead of relying only on training data, RAG:
- Retrieves relevant documents from a knowledge base
- Augments the prompt with this context
- Generates accurate, grounded responses
RAG Architecture
Query → Embedding → Vector Search → Relevant Docs → LLM + Context → Response
The vector search step is backed by a vector database built from your own documents.
Why Local RAG?
| Cloud RAG | Local RAG |
|---|---|
| $0.0001+ per embedding | $0 after hardware |
| Data sent to cloud | 100% private |
| Rate limits | Unlimited |
| Internet required | Works offline |
| OpenAI/Cohere lock-in | Open source |
Complete Local RAG Setup
Step 1: Install Dependencies
# Core packages
pip install langchain langchain-ollama langchain-community
pip install chromadb sentence-transformers
# Document loaders
pip install pypdf unstructured python-docx
# Pull models
ollama pull nomic-embed-text
ollama pull llama3.1:70b
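# Note: llama3.1:70b needs roughly 40 GB+ of VRAM at Ollama's default quantization;
# llama3.1:8b is a lighter drop-in alternative for smaller GPUs.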
Step 2: Create Vector Store
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize embeddings
embeddings = OllamaEmbeddings(
model="nomic-embed-text",
base_url="http://localhost:11434"
)
# Load documents
loader = DirectoryLoader(
"./documents",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len
)
chunks = text_splitter.split_documents(documents)
# Create vector store
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Created vector store with {len(chunks)} chunks")
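Before wiring in the LLM, it is worth sanity-checking the store with a direct similarity search; the query string below is just an example:
# Quick retrieval check (example query)
results = vectorstore.similarity_search("What does the report conclude?", k=3)
for doc in results:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:120])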
Step 3: Build RAG Chain
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Initialize LLM
llm = ChatOllama(
model="llama3.1:70b",
temperature=0.3
)
# Create retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 5}
)
# RAG prompt template
template = """Answer the question based only on the following context.
If you cannot answer from the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
# Build chain
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Query
response = rag_chain.invoke("What are the key findings in the report?")
print(response)
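Because the chain is a standard LCEL runnable, you can also stream the answer token by token instead of waiting for the full response; a minimal sketch:
# Stream the response; StrOutputParser yields plain string chunks
for token in rag_chain.stream("Summarize the report in three sentences."):
    print(token, end="", flush=True)
print()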
Step 4: Add Document Processing
from langchain_community.document_loaders import (
PyPDFLoader,
Docx2txtLoader,
TextLoader,
UnstructuredMarkdownLoader
)
def load_document(file_path: str):
"""Load document based on file extension"""
if file_path.endswith('.pdf'):
return PyPDFLoader(file_path).load()
elif file_path.endswith('.docx'):
return Docx2txtLoader(file_path).load()
elif file_path.endswith('.txt'):
return TextLoader(file_path).load()
elif file_path.endswith('.md'):
return UnstructuredMarkdownLoader(file_path).load()
else:
raise ValueError(f"Unsupported file type: {file_path}")
def add_documents_to_vectorstore(file_paths: list):
"""Add new documents to existing vector store"""
all_chunks = []
for path in file_paths:
docs = load_document(path)
chunks = text_splitter.split_documents(docs)
all_chunks.extend(chunks)
vectorstore.add_documents(all_chunks)
print(f"Added {len(all_chunks)} chunks from {len(file_paths)} documents")
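Example usage with illustrative file paths (point these at your own documents):
# Illustrative paths; replace with real files in your ./documents folder
add_documents_to_vectorstore([
    "./documents/new_report.pdf",
    "./documents/meeting_notes.md",
])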
Vector Database Comparison
| Database | Setup | Speed | Features | Best For |
|---|---|---|---|---|
| Chroma | Easy | Good | Persistent, filters | Beginners, prototypes |
| FAISS | Medium | Fastest | GPU support | Large datasets |
| Qdrant | Medium | Good | Full-featured | Production |
| Weaviate | Complex | Good | GraphQL, modules | Enterprise |
| Milvus | Complex | Excellent | Distributed | Massive scale |
Chroma Setup (Recommended Start)
import chromadb
from chromadb.config import Settings
# Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")
# Create collection
collection = client.get_or_create_collection(
name="documents",
metadata={"hnsw:space": "cosine"}
)
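The native client can also store and query raw text directly; a small sketch with made-up texts and IDs (without an explicit embedding function, Chroma falls back to its built-in default model, which it downloads on first use):
# Add two example texts; IDs must be unique strings
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Ollama runs LLMs locally.", "Chroma persists embeddings on disk."],
)
# Query by text; Chroma embeds the query with the collection's embedding function
hits = collection.query(query_texts=["How do I run models locally?"], n_results=1)
print(hits["documents"][0])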
FAISS Setup (High Performance)
from langchain_community.vectorstores import FAISS
# Create FAISS index
vectorstore = FAISS.from_documents(
documents=chunks,
embedding=embeddings
)
# Save to disk
vectorstore.save_local("./faiss_index")
# Load later
vectorstore = FAISS.load_local(
"./faiss_index",
embeddings,
allow_dangerous_deserialization=True
)
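The reloaded index exposes the same retriever interface as Chroma, so it plugs into the chain from Step 3 unchanged; for example:
# Use the FAISS store as a retriever, exactly like the Chroma version
faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = faiss_retriever.invoke("What are the key findings?")
print(f"Retrieved {len(docs)} documents")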
Embedding Models Comparison
| Model | Dimensions | Quality | Speed | VRAM |
|---|---|---|---|---|
| nomic-embed-text | 768 | Excellent | Fast | 1GB |
| mxbai-embed-large | 1024 | Best | Medium | 2GB |
| all-MiniLM-L6-v2 | 384 | Good | Fastest | CPU |
| bge-large-en | 1024 | Excellent | Medium | 2GB |
| e5-large-v2 | 1024 | Excellent | Medium | 2GB |
Using Different Embedding Models
# Ollama embeddings (recommended)
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# HuggingFace embeddings (CPU-friendly)
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
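Whichever model you choose, a quick check confirms it loads and that its vector size matches the dimensions listed in the table above:
# embed_query returns a single vector; its length is the embedding dimension
vector = embeddings.embed_query("local RAG test sentence")
print(f"Embedding dimension: {len(vector)}")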
Advanced RAG Techniques
Hybrid Search (Dense + Sparse)
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# BM25 for keyword matching
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Combine with ensemble
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, dense_retriever],
weights=[0.3, 0.7] # Favor dense search
)
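Note that BM25Retriever depends on the rank_bm25 package (pip install rank_bm25). Querying the ensemble works like any other retriever; a brief example with an arbitrary query:
# The ensemble merges and re-weights results from both retrievers
docs = ensemble_retriever.invoke("installation requirements")
for doc in docs[:3]:
    print(doc.metadata.get("source", "unknown"), "-", doc.page_content[:80])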
Reranking for Better Relevance
# Local reranking with a cross-encoder (no external reranking API required)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank_documents(query: str, docs: list, top_k: int = 3):
"""Rerank documents using cross-encoder"""
pairs = [[query, doc.page_content] for doc in docs]
scores = reranker.predict(pairs)
# Sort by score
scored_docs = list(zip(docs, scores))
scored_docs.sort(key=lambda x: x[1], reverse=True)
return [doc for doc, score in scored_docs[:top_k]]
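A common pattern is to over-retrieve and then rerank down to a small top-k before building the prompt; for example:
# Retrieve a wide candidate set, then keep only the best-scoring chunks
question = "What are the key findings in the report?"
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).invoke(question)
top_docs = rerank_documents(question, candidates, top_k=3)
context = "\n\n".join(doc.page_content for doc in top_docs)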
Metadata Filtering
# Add metadata during indexing
chunks_with_metadata = []
for chunk in chunks:
chunk.metadata["source_type"] = "technical_doc"
chunk.metadata["date"] = "2026-01"
chunks_with_metadata.append(chunk)
# Filter during retrieval
retriever = vectorstore.as_retriever(
search_kwargs={
"k": 5,
"filter": {"source_type": "technical_doc"}
}
)
Production RAG Setup
Complete RAG Application
import os
from pathlib import Path
from typing import List, Optional
class LocalRAG:
def __init__(
self,
persist_dir: str = "./rag_db",
embedding_model: str = "nomic-embed-text",
llm_model: str = "llama3.1:70b"
):
self.persist_dir = persist_dir
# Initialize embeddings
self.embeddings = OllamaEmbeddings(model=embedding_model)
# Initialize LLM
self.llm = ChatOllama(model=llm_model, temperature=0.3)
# Initialize or load vector store
if os.path.exists(persist_dir):
self.vectorstore = Chroma(
persist_directory=persist_dir,
embedding_function=self.embeddings
)
else:
self.vectorstore = None
# Text splitter
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
def add_documents(self, file_paths: List[str]):
"""Add documents to the knowledge base"""
all_chunks = []
for path in file_paths:
docs = load_document(path)
chunks = self.text_splitter.split_documents(docs)
all_chunks.extend(chunks)
if self.vectorstore is None:
self.vectorstore = Chroma.from_documents(
documents=all_chunks,
embedding=self.embeddings,
persist_directory=self.persist_dir
)
else:
self.vectorstore.add_documents(all_chunks)
return len(all_chunks)
def query(
self,
question: str,
k: int = 5,
filter: Optional[dict] = None
) -> str:
"""Query the knowledge base"""
if self.vectorstore is None:
return "No documents in knowledge base"
# Retrieve relevant docs
retriever = self.vectorstore.as_retriever(
search_kwargs={"k": k, "filter": filter}
)
        docs = retriever.invoke(question)
# Format context
context = "\n\n".join(doc.page_content for doc in docs)
# Generate response
prompt = f"""Answer based on the context below.
Context:
{context}
Question: {question}
Answer:"""
response = self.llm.invoke(prompt)
return response.content
def get_sources(self, question: str, k: int = 3) -> List[dict]:
"""Get source documents for a query"""
docs = self.vectorstore.similarity_search(question, k=k)
return [
{
"content": doc.page_content[:200],
"source": doc.metadata.get("source", "unknown")
}
for doc in docs
]
# Usage
rag = LocalRAG()
rag.add_documents(["./docs/report.pdf", "./docs/manual.pdf"])
answer = rag.query("What are the main conclusions?")
print(answer)
Hardware Requirements
| Setup | VRAM | Performance | Use Case |
|---|---|---|---|
| CPU Only | 0GB | 5 docs/sec | Small datasets |
| 8GB GPU | 8GB | 50 docs/sec | Personal use |
| 16GB GPU | 16GB | 100 docs/sec | Professional |
| 24GB GPU | 24GB | 150 docs/sec | Enterprise |
Key Takeaways
- Local RAG is fully viable with Ollama, Chroma, and LangChain
- nomic-embed-text is the best local embedding model for most cases
- Chroma is easiest for beginners; FAISS for performance
- Chunk size matters; experiment with values in the 500-1500 range (see the sketch after this list)
- Hybrid search improves accuracy over dense-only retrieval
- 16GB+ VRAM recommended for smooth production use
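The chunk-size takeaway is easy to explore empirically; a quick sketch, assuming the documents list from Step 2 is still loaded (RecursiveCharacterTextSplitter counts characters by default, so treat the numbers as approximate):
# Compare how different chunk sizes split the same documents
for size in (500, 1000, 1500):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    print(f"chunk_size={size}: {len(splitter.split_documents(documents))} chunks")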
Next Steps
- Build AI agents that use RAG for knowledge
- Set up MCP servers for file access
- Compare vector databases in depth
- Optimize your GPU for RAG workloads
RAG transforms LLMs from general assistants into experts on your specific documents—all running locally, privately, and without ongoing costs.