Developer Guide

Build a Local RAG Pipeline: Ollama + ChromaDB Step-by-Step

April 23, 2026
19 min read
Local AI Master Research Team


I have shipped four RAG systems into production over the last 18 months: an internal company knowledge bot, a legal document Q&A tool, a medical reference assistant, and a personal "search my entire life" archive. The first one was a disaster. The fourth one is good. The difference between them came down to four boring decisions: how I chunked, which embedding model I used, how I measured retrieval quality, and where I drew the line between "fix the prompt" and "fix the data."

This guide is the playbook I wish I had on day one. Ollama for the LLM, ChromaDB for the vector store, sentence-transformers or nomic-embed-text for embeddings — every component runs on your machine. No OpenAI key. No vendor lock-in. No data leaving your laptop.

The result: a working private RAG system you can deploy as an internal API in about 90 minutes.


Quick Start: A Working RAG in 5 Commands {#quick-start}

# 1. Install Ollama and pull models
brew install ollama
ollama pull llama3.1:8b           # the LLM
ollama pull nomic-embed-text       # embeddings model

# 2. Install Python deps
pip install chromadb llama-index llama-index-llms-ollama \
  llama-index-embeddings-ollama langchain-text-splitters

# 3. Drop your documents into ./docs/
mkdir docs && cp ~/Downloads/*.pdf docs/

# 4. Run the indexer (script provided below)
python rag_index.py

# 5. Ask a question
python rag_query.py "What is our company's PTO policy?"

That works. It is not production-quality yet — but it answers questions about your own documents, fully offline, in five commands. The rest of this guide explains what to fix before you trust it for anything important.


What RAG Actually Is (and What It Is Not) {#what-rag-is}

RAG — Retrieval Augmented Generation — is three boring components in a trench coat:

  1. A retriever that finds relevant document chunks for a question.
  2. A vector store that holds embeddings of your chunks.
  3. An LLM that reads the retrieved chunks and writes an answer.

That's it. The magic is in the joins, not the components.

What RAG is not: a magic way to get the LLM to "know" your documents. The LLM still hallucinates. It just hallucinates less because it sees relevant text in its context window. The retrieval quality is the ceiling on the whole system.

User question
   │
   ▼
[ Retriever ] ──► [ Embeddings ] ──► [ ChromaDB ]
                                          │
                              top-k chunks│
                                          ▼
                                  [ Ollama LLM ]
                                          │
                                          ▼
                                       Answer

If the retriever returns garbage, the LLM produces a confident, well-written answer to the wrong question. This is the failure mode 80% of buggy RAG systems share.


Architecture: Why These Components {#architecture}

| Component | Choice | Why |
| --- | --- | --- |
| LLM runtime | Ollama 0.4+ | One-line install, OpenAI-compatible API, models hot-swap |
| Embedding model | nomic-embed-text or BGE-M3 | Best open embedding quality / cost ratio |
| Vector store | ChromaDB 0.5+ | Local-first, persistent, easy to back up, no Docker required |
| Orchestration | LlamaIndex 0.12+ or LangChain | Both work — LlamaIndex is simpler for retrieval-only |
| API layer | FastAPI | Production-grade async HTTP in 30 lines |

You can swap the vector store for Qdrant, Weaviate, or pgvector. ChromaDB wins for local development because it persists to a single directory and requires no infrastructure. For team-scale deployments Qdrant scales further, but most internal RAG projects never outgrow ChromaDB.


Step 1: Choose the Right Embedding Model {#embeddings}

This is the most underrated decision in any RAG system. Your embedding model determines whether the right chunk gets retrieved.

| Model | Size | Dim | English | Multilingual | Speed |
| --- | --- | --- | --- | --- | --- |
| nomic-embed-text | 137 MB | 768 | Excellent | Decent | Fast |
| mxbai-embed-large | 670 MB | 1024 | Best | Decent | Medium |
| bge-m3 | 2.27 GB | 1024 | Excellent | Excellent | Medium |
| all-MiniLM-L6-v2 | 91 MB | 384 | Good | Poor | Very Fast |
| snowflake-arctic-embed:l | 670 MB | 1024 | Excellent | Decent | Medium |

My recommendations:

  • For English-only documents: nomic-embed-text (fast, 768-dim, very high recall on standard benchmarks)
  • For multilingual or technical documents: bge-m3
  • For prototypes and tiny corpora: all-MiniLM-L6-v2 (fast, but lower recall)

Pull the embedding model into Ollama:

ollama pull nomic-embed-text

Verify dimensions:

import requests
r = requests.post("http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "test string"})
print(len(r.json()["embedding"]))   # 768

Match this dimension when you create your ChromaDB collection. A dimension mismatch is the source of half the "RAG returns nothing" bug reports.
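
A small guard worth adding before you index (a sketch: check_dims.py is a hypothetical helper, and the DB path and collection name match the indexer built in Step 3). It compares a fresh embedding's dimension against one already stored, so a later model swap fails loudly instead of silently retrieving nothing.

# check_dims.py: hypothetical guard script; names match rag_index.py below
import chromadb
import requests

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("kb")

# Dimension of the model you are about to index with
r = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "dimension probe"},
    timeout=30,
)
new_dim = len(r.json()["embedding"])

# Dimension of whatever is already stored (skip the check on an empty collection)
stored = collection.get(limit=1, include=["embeddings"])
existing = stored.get("embeddings")
if existing is not None and len(existing) > 0:
    stored_dim = len(existing[0])
    assert stored_dim == new_dim, f"collection holds {stored_dim}-dim vectors, model produces {new_dim}"
print(f"Embedding dimension: {new_dim}")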


Step 2: Chunking — The Most Important Decision {#chunking}

If you only get one thing right in your RAG system, get chunking right.

Bad chunking — splitting on arbitrary character counts — destroys semantic boundaries and gives the retriever incoherent fragments. Good chunking respects the structure of your documents.

Chunking strategy by document type:

| Document Type | Strategy | Chunk Size | Overlap |
| --- | --- | --- | --- |
| Long-form articles, books | Recursive character split | 1000 chars | 200 |
| Code | Symbol-aware (function/class) | 800 chars | 100 |
| Markdown | Header-aware | 1500 chars | 200 |
| HTML | Tag-aware | 1200 chars | 150 |
| Tables / structured data | Row-based | 1 row | 0 |
| Short snippets (chat logs) | Whole-document | n/a | n/a |

A working chunker for mixed text:

# rag_chunker.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)

def chunk(text: str, metadata: dict) -> list[dict]:
    chunks = splitter.split_text(text)
    return [
        {"text": c, "metadata": {**metadata, "chunk_idx": i, "len": len(c)}}
        for i, c in enumerate(chunks)
    ]

The 200-character overlap matters more than people realize. Without overlap, a question whose answer spans a chunk boundary is unanswerable.

For PDFs specifically, prefer open-parse or Docling over PyMuPDF when you need table preservation. PDF chunking is its own rabbit hole — bad PDF parsing is the second-most common cause of bad RAG.
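
For Markdown sources, the header-aware strategy from the table above keeps each heading attached to its chunks. A minimal sketch using langchain's MarkdownHeaderTextSplitter; the heading-level mapping and the 1500-char second pass are just reasonable defaults:

# markdown_chunker.py: header-aware splitting for .md files
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
# Second pass keeps oversized sections under the 1500-char target from the table
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)

def chunk_markdown(md_text: str) -> list[dict]:
    sections = header_splitter.split_text(md_text)  # Documents carrying header metadata
    out = []
    for section in sections:
        for piece in size_splitter.split_text(section.page_content):
            out.append({"text": piece, "metadata": dict(section.metadata)})
    return out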


Step 3: Build the Index {#build-index}

Here is the full indexer. It loads documents, chunks them, embeds with Ollama, and persists to ChromaDB.

# rag_index.py
import chromadb
import os
from pathlib import Path
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction
from langchain_text_splitters import RecursiveCharacterTextSplitter
import pypdf

DB_DIR = "./chroma_db"
DOCS_DIR = "./docs"
COLLECTION_NAME = "kb"
OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"

client = chromadb.PersistentClient(path=DB_DIR)

embed_fn = OllamaEmbeddingFunction(
    url=f"{OLLAMA_URL}/api/embeddings",
    model_name=EMBED_MODEL,
)

collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"},
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

def read_pdf(path: Path) -> str:
    reader = pypdf.PdfReader(path)
    # Extract each page once; skip pages with no extractable text
    texts = (p.extract_text() or "" for p in reader.pages)
    return "\n\n".join(t for t in texts if t.strip())

def read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="ignore")

READERS = {".pdf": read_pdf, ".md": read_text, ".txt": read_text}

def index_all():
    docs_added = 0
    for path in Path(DOCS_DIR).rglob("*"):
        if path.suffix.lower() not in READERS:
            continue
        text = READERS[path.suffix.lower()](path)
        chunks = splitter.split_text(text)
        ids = [f"{path.stem}-{i}" for i in range(len(chunks))]
        metas = [{"source": str(path), "chunk": i} for i in range(len(chunks))]
        collection.upsert(documents=chunks, ids=ids, metadatas=metas)
        docs_added += len(chunks)
        print(f"Indexed {path.name}: {len(chunks)} chunks")
    print(f"\nDone. {docs_added} chunks total.")

if __name__ == "__main__":
    index_all()

Run it:

python rag_index.py

Index size on disk for ChromaDB: roughly 10 KB per chunk including the embedding vector. A 100,000-chunk index is about 1 GB.
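
To sanity-check those numbers against your own corpus, a quick sketch (paths and names match the indexer above):

# index_stats.py: report chunk count and on-disk size of the ChromaDB directory
import chromadb
from pathlib import Path

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("kb")

count = collection.count()
size_bytes = sum(f.stat().st_size for f in Path("./chroma_db").rglob("*") if f.is_file())
print(f"{count} chunks, {size_bytes / 1e6:.1f} MB on disk "
      f"(~{size_bytes / max(count, 1) / 1024:.1f} KB per chunk)")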


Step 4: Build the Query Service {#query-service}

# rag_query.py
import chromadb
import sys
import requests
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction

DB_DIR = "./chroma_db"
COLLECTION_NAME = "kb"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.1:8b"
TOP_K = 5

client = chromadb.PersistentClient(path=DB_DIR)
embed_fn = OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name=EMBED_MODEL,
)
collection = client.get_collection(COLLECTION_NAME, embedding_function=embed_fn)

PROMPT_TEMPLATE = """You are a careful assistant. Answer ONLY using the context below.
If the answer is not in the context, say "I do not know based on the provided documents."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

def ask(question: str, top_k: int = TOP_K) -> dict:
    result = collection.query(query_texts=[question], n_results=top_k)
    chunks = result["documents"][0]
    sources = [m["source"] for m in result["metadatas"][0]]

    context = "\n\n---\n\n".join(
        f"[Source: {s}]\n{c}" for s, c in zip(sources, chunks)
    )

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": LLM_MODEL,
            "prompt": PROMPT_TEMPLATE.format(context=context, question=question),
            "stream": False,
            "options": {"temperature": 0.1, "num_ctx": 4096},
        },
        timeout=120,
    )
    answer = response.json()["response"].strip()
    return {"answer": answer, "sources": list(set(sources))}

if __name__ == "__main__":
    q = " ".join(sys.argv[1:]) or "What is in these documents?"
    out = ask(q)
    print(f"\nANSWER:\n{out['answer']}\n\nSOURCES:")
    for s in out["sources"]:
        print(f"  - {s}")

Run it:

python rag_query.py "What does the security policy say about laptop encryption?"

Output includes the answer plus the source files used. Source attribution is mandatory in any RAG system you let anyone trust.


Step 5: Wrap It in a FastAPI Service {#api}

For internal team use, expose the query layer as an HTTP API:

# rag_api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from rag_query import ask

app = FastAPI(title="Local RAG")

class Query(BaseModel):
    question: str
    top_k: int | None = 5

@app.post("/api/ask")
def api_ask(q: Query):
    if not q.question.strip():
        raise HTTPException(400, "empty question")
    return ask(q.question, q.top_k or 5)

@app.get("/api/health")
def health():
    return {"status": "ok"}

Install the API dependencies and start the server:

pip install fastapi uvicorn
uvicorn rag_api:app --host 0.0.0.0 --port 8000

Now curl -X POST http://localhost:8000/api/ask -H 'Content-Type: application/json' -d '{"question":"..."}' is your private knowledge endpoint.
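
The same call from Python, if you prefer a script over curl:

import requests

resp = requests.post(
    "http://localhost:8000/api/ask",
    json={"question": "What does the security policy say about laptop encryption?"},
    timeout=120,
)
print(resp.json()["answer"])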


Step 6: Evaluate Retrieval Quality {#evaluation}

If you cannot measure your RAG system, you cannot improve it. Build a tiny evaluation harness with a hand-curated test set.

# rag_eval.py
import json
from rag_query import ask

# Hand-curated test cases
EVAL = [
    {
        "question": "What is our PTO policy for new hires?",
        "must_contain": ["10 days", "first year"],
        "must_cite": ["hr_handbook.pdf"],
    },
    {
        "question": "Who is the security incident lead?",
        "must_contain": ["security@", "incident"],
        "must_cite": ["security_policy.pdf"],
    },
]

def evaluate():
    pass_count = 0
    for case in EVAL:
        out = ask(case["question"])
        ans = out["answer"].lower()
        srcs = " ".join(out["sources"]).lower()

        contain_ok = all(s.lower() in ans for s in case["must_contain"])
        cite_ok = all(s.lower() in srcs for s in case["must_cite"])

        ok = contain_ok and cite_ok
        pass_count += int(ok)
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {case['question']}")
        if not ok:
            print(f"  Answer: {out['answer'][:200]}")
            print(f"  Sources: {out['sources']}")
    print(f"\n{pass_count}/{len(EVAL)} passed")

if __name__ == "__main__":
    evaluate()

This is not fancy. It is also the difference between a RAG you tune by guessing and one you tune by measuring. Aim for 80%+ pass rate before deploying to a team.

For deeper RAG evaluation, look at RAGAS for automated metrics like faithfulness and context precision. RAGAS works with local LLMs as the judge model — useful for fully self-hosted eval.
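
A hedged sketch of a local-judge RAGAS run. The imports and the evaluate() signature vary between RAGAS versions, and metrics like context precision additionally need reference answers, so treat this as a starting point rather than the definitive API:

# ragas_eval.py: sketch only, adapt to the RAGAS version you install
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

from rag_query import ask, collection, TOP_K

question = "What is our PTO policy for new hires?"
out = ask(question)
hits = collection.query(query_texts=[question], n_results=TOP_K)

dataset = Dataset.from_dict({
    "question": [question],
    "answer": [out["answer"]],
    "contexts": [hits["documents"][0]],
})

# Judge LLM and embeddings both stay local
result = evaluate(
    dataset,
    metrics=[faithfulness],
    llm=ChatOllama(model="llama3.1:8b"),
    embeddings=OllamaEmbeddings(model="nomic-embed-text"),
)
print(result)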


Step 7: Tune Retrieval — The Knobs That Matter {#tune-retrieval}

After your first eval run, you will see failures. Here is the order I tune in:

1. Top-k. Default is 5. For dense, narrowly scoped corpora, 3 may be enough. For diverse corpora, try 8-10.

2. Chunk size. If the LLM keeps saying "context insufficient," your chunks are too small. If retrieval brings back irrelevant content, they are too big.

3. Hybrid search. Pure vector search misses exact-match keywords (names, IDs, codes). Combine with BM25:

# Add BM25 alongside vector retrieval (pip install rank-bm25)
from rank_bm25 import BM25Okapi
# Tokenize the chunks, build a BM25 index, and merge the vector and BM25 top-k lists
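
A fuller sketch of that merge using reciprocal rank fusion; the helper names are illustrative and the collection comes from rag_query.py:

# hybrid_search.py: vector + BM25 retrieval merged with reciprocal rank fusion
from rank_bm25 import BM25Okapi
from rag_query import collection, TOP_K

# Build the BM25 index once over every chunk already in the collection
corpus = collection.get()
docs, ids = corpus["documents"], corpus["ids"]
bm25 = BM25Okapi([d.lower().split() for d in docs])
by_id = dict(zip(ids, docs))

def hybrid_search(question: str, k: int = TOP_K) -> list[str]:
    # Vector ranking from ChromaDB
    vec = collection.query(query_texts=[question], n_results=k * 4)
    vec_ranked = vec["ids"][0]

    # Lexical ranking from BM25
    scores = bm25.get_scores(question.lower().split())
    bm25_ranked = [ids[i] for i in sorted(range(len(scores)), key=lambda i: -scores[i])[: k * 4]]

    # Reciprocal rank fusion: each chunk scores 1 / (60 + rank), summed over both lists
    fused: dict[str, float] = {}
    for ranking in (vec_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranking):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (60 + rank)

    top_ids = sorted(fused, key=fused.get, reverse=True)[:k]
    return [by_id[i] for i in top_ids]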

4. Re-ranking. After retrieving top-20 with vectors, re-rank with a cross-encoder (e.g., mixedbread-ai/mxbai-rerank-large-v1) and keep top-5. This single change took our company knowledge bot from 60% to 87% retrieval accuracy.
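
A sketch of that retrieve-wide-then-re-rank step with sentence-transformers; I am assuming the named checkpoint loads as a CrossEncoder, and any other cross-encoder checkpoint slots in the same way:

# rerank.py: retrieve 20 candidates with vectors, keep the cross-encoder's top 5
from sentence_transformers import CrossEncoder
from rag_query import collection

reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")

def retrieve_reranked(question: str, wide_k: int = 20, final_k: int = 5) -> list[str]:
    hits = collection.query(query_texts=[question], n_results=wide_k)
    candidates = hits["documents"][0]

    # Score every (question, chunk) pair, then keep the highest-scoring chunks
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:final_k]]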

5. Metadata filtering. ChromaDB supports where filters. Filter by document type, date, or department before vector search. Smaller search space = better results.

collection.query(
    query_texts=[q],
    n_results=5,
    where={"department": "engineering"},
)

6. Prompt engineering. Move "ground answers in context" instructions higher. Add: "If the question requires a number or date, quote it directly from the source."


Benchmarks: How Fast Is This? {#benchmarks}

On a MacBook Pro M3 (16GB unified memory):

| Operation | Speed |
| --- | --- |
| Indexing (PDFs) | ~120 chunks/sec |
| Embedding (nomic-embed-text) | ~250 embeddings/sec |
| ChromaDB vector search (100K chunks) | 8-12 ms per query |
| End-to-end query (5 chunks, llama3.1:8b) | 1.4-2.1 sec |
| End-to-end query (5 chunks, llama3.1:70b on 96GB Mac) | 6.8-9.2 sec |

For comparison: an OpenAI Ada-002 + GPT-4 Turbo cloud RAG completes in 1.0-1.6 sec but at $0.012-$0.020 per query, plus you ship every document chunk to OpenAI. At 10,000 queries per month, the cloud bill is $120-$200. Local is free after the hardware.

For our cost analysis, see Ollama vs ChatGPT API cost breakdown.


The 12 Mistakes That Break Local RAG {#pitfalls}

  1. Wrong embedding dimension. Pull the model, hardcode the dim, or read it dynamically. Mismatch = silent wrong answers.
  2. Tiny chunk overlap. 0% overlap loses cross-boundary answers. 200 chars is the sweet spot.
  3. No source attribution. Users will not trust answers without citations. Bake source IDs into every response.
  4. Mixing chunk sizes across formats. PDFs and code need different splitters. One-size-fits-all hurts retrieval.
  5. Skipping evaluation. Without an eval set you are tuning by vibes.
  6. Trusting the LLM to refuse. Add an "I do not know" instruction to the prompt. Local LLMs often confabulate without it.
  7. Forgetting hybrid search. Vector-only misses literal IDs, names, and codes.
  8. Indexing duplicates. Use upsert with stable IDs; never let duplicate chunks accumulate.
  9. No incremental re-index. Build a "watch documents folder" daemon (a sketch follows this list) or your index goes stale.
  10. Ignoring token budgets. Llama 3.1 8B advertises a 128K context window, but practical accuracy drops above 16K. Truncate context.
  11. Letting users dump huge questions. Long, multi-part questions confuse retrieval. Split or pre-rewrite.
  12. No backups. Back up the chroma_db directory. Rebuilding from raw documents takes hours on a real corpus.
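
For mistake 9, a minimal folder watcher, assuming the watchdog package and reusing the indexer's module-level objects (it does not handle deletions, which a real daemon should):

# watch_docs.py: re-index files as they change (sketch)
import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from rag_index import DOCS_DIR, READERS, collection, splitter

def reindex_file(path: Path) -> None:
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        return
    chunks = splitter.split_text(reader(path))
    if not chunks:
        return
    ids = [f"{path.stem}-{i}" for i in range(len(chunks))]
    metas = [{"source": str(path), "chunk": i} for i in range(len(chunks))]
    collection.upsert(documents=chunks, ids=ids, metadatas=metas)  # stable IDs keep this idempotent
    print(f"Re-indexed {path.name}: {len(chunks)} chunks")

class ReindexHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            reindex_file(Path(event.src_path))

    def on_modified(self, event):
        if not event.is_directory:
            reindex_file(Path(event.src_path))

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(ReindexHandler(), DOCS_DIR, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()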

Going to Production: Hardening Checklist {#production}

If you are about to deploy this internally, walk through these:

  • Stable doc IDs and incremental re-indexing
  • Source attribution in every response
  • Eval set of 20-50 questions with expected answers
  • Re-ranker enabled (top-20 vector → top-5 cross-encoder)
  • Hybrid search (vector + BM25)
  • Metadata filtering wired to your document taxonomy
  • Backups of the ChromaDB persistent directory
  • FastAPI behind authentication (keys, OIDC, or mTLS)
  • Rate limiting (60 req/min per user)
  • Audit logging of every question and answer
  • Health check endpoint and Prometheus metrics
  • Run the full pipeline against our Ollama production deployment checklist

For the auth, monitoring, and audit pieces, our securing Ollama guide covers the patterns we use.


What I Would Build Next {#next-steps}

The pipeline above is solid for single-tenant team RAG. Three additions worth considering:

1. Streaming responses. Switch stream: True in the Ollama call and yield chunks via FastAPI's StreamingResponse. Massive UX improvement.
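
A sketch of that wiring as additions to rag_api.py; the /api/ask/stream route name is just an example:

# additions to rag_api.py: stream tokens back as Ollama produces them
import json

import requests
from fastapi.responses import StreamingResponse

from rag_query import LLM_MODEL, PROMPT_TEMPLATE, TOP_K, collection

def stream_ollama(prompt: str):
    payload = {"model": LLM_MODEL, "prompt": prompt, "stream": True}
    with requests.post(
        "http://localhost:11434/api/generate", json=payload, stream=True, timeout=120
    ) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # Ollama streams newline-delimited JSON objects
            if not chunk.get("done"):
                yield chunk["response"]

@app.post("/api/ask/stream")
def api_ask_stream(q: Query):
    hits = collection.query(query_texts=[q.question], n_results=TOP_K)
    context = "\n\n---\n\n".join(hits["documents"][0])
    prompt = PROMPT_TEMPLATE.format(context=context, question=q.question)
    return StreamingResponse(stream_ollama(prompt), media_type="text/plain")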

2. Conversation memory. Store chat history per user and feed last 2-3 turns plus retrieved context. The model gives much better follow-up answers.
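
A naive in-memory sketch; in production you would persist history per user (Redis, SQLite) instead of a process-local dict:

# Keep the last three turns per user and prepend them to the question
from collections import defaultdict, deque

from rag_query import ask

HISTORY: dict[str, deque] = defaultdict(lambda: deque(maxlen=3))

def ask_with_memory(user_id: str, question: str) -> dict:
    past = "\n".join(f"Q: {q}\nA: {a}" for q, a in HISTORY[user_id])
    enriched = f"Previous turns:\n{past}\n\nCurrent question: {question}" if past else question
    out = ask(enriched)
    HISTORY[user_id].append((question, out["answer"]))
    return out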

3. Tool calling. Let the model trigger live searches, calendar lookups, or SQL queries when retrieval is insufficient. See our Ollama function calling guide.

For the broader RAG ecosystem context, the canonical reference is the ChromaDB official documentation. For embedding model evaluation, the MTEB leaderboard is the most rigorous public benchmark.


Closing Thoughts {#closing}

The first time I built a local RAG, I obsessed over which LLM to use. It barely mattered. The 8B and 70B models gave roughly equivalent answers when retrieval was good, and both gave terrible answers when retrieval was bad. The whole game is the retrieval pipeline — embeddings, chunking, hybrid search, re-ranking. The LLM is the last 10% of the work.

If you take one thing away: build the eval harness before you tune anything. Without measurement you will spend weeks "improving" your RAG with no idea whether it is getting better or worse.

The code in this guide is a complete starting point. Drop your documents in ./docs/, run the indexer, and you have a private RAG that runs without ever calling out to the internet.
