Developer Guide

Build a Local RAG Pipeline: Ollama + ChromaDB Step-by-Step

April 23, 2026
19 min read
Local AI Master Research Team


I have shipped four RAG systems into production over the last 18 months: an internal company knowledge bot, a legal document Q&A tool, a medical reference assistant, and a personal "search my entire life" archive. The first one was a disaster. The fourth one is good. The difference between them came down to four boring decisions: how I chunked, which embedding model I used, how I measured retrieval quality, and where I drew the line between "fix the prompt" and "fix the data."

This guide is the playbook I wish I had on day one. Ollama for the LLM, ChromaDB for the vector store, sentence-transformers or nomic-embed-text for embeddings — every component runs on your machine. No OpenAI key. No vendor lock-in. No data leaving your laptop.

The result: a working private RAG system you can deploy as an internal API in about 90 minutes.


Quick Start: A Working RAG in 5 Commands {#quick-start}

# 1. Install Ollama and pull models
brew install ollama
ollama pull llama3.1:8b           # the LLM
ollama pull nomic-embed-text       # embeddings model

# 2. Install Python deps
pip install chromadb llama-index llama-index-llms-ollama \
  llama-index-embeddings-ollama langchain-text-splitters

# 3. Drop your documents into ./docs/
mkdir docs && cp ~/Downloads/*.pdf docs/

# 4. Run the indexer (script provided below)
python rag_index.py

# 5. Ask a question
python rag_query.py "What is our company's PTO policy?"

That works. It is not production-quality yet — but it answers questions about your own documents, fully offline, in five commands. The rest of this guide explains what to fix before you trust it for anything important.


What RAG Actually Is (and What It Is Not) {#what-rag-is}

RAG — Retrieval Augmented Generation — is three boring components in a trench coat:

  1. A retriever that finds relevant document chunks for a question.
  2. A vector store that holds embeddings of your chunks.
  3. An LLM that reads the retrieved chunks and writes an answer.

That's it. The magic is in the joins, not the components.

What RAG is not: a magic way to get the LLM to "know" your documents. The LLM still hallucinates. It just hallucinates less because it sees relevant text in its context window. The retrieval quality is the ceiling on the whole system.

User question
   │
   ▼
[ Retriever ] ──► [ Embeddings ] ──► [ ChromaDB ]
                                          │
                              top-k chunks│
                                          ▼
                                  [ Ollama LLM ]
                                          │
                                          ▼
                                       Answer

If the retriever returns garbage, the LLM produces a confident, well-written answer to the wrong question. This is the failure mode 80% of buggy RAG systems share.


Architecture: Why These Components {#architecture}

| Component | Choice | Why |
| --- | --- | --- |
| LLM runtime | Ollama 0.4+ | One-line install, OpenAI-compatible API, models hot-swap |
| Embedding model | nomic-embed-text or BGE-M3 | Best open embedding quality / cost ratio |
| Vector store | ChromaDB 0.5+ | Local-first, persistent, easy to back up, no Docker required |
| Orchestration | LlamaIndex 0.12+ or LangChain | Both work — LlamaIndex is simpler for retrieval-only |
| API layer | FastAPI | Production-grade async HTTP in 30 lines |

You can swap the vector store for Qdrant, Weaviate, or pgvector. ChromaDB wins for local development because it persists to a single directory and requires no infrastructure. For team-scale deployments Qdrant scales further, but most internal RAG projects never outgrow ChromaDB.


Step 1: Choose the Right Embedding Model {#embeddings}

This is the most underrated decision in any RAG system. Your embedding model determines whether the right chunk gets retrieved.

| Model | Size | Dim | English | Multilingual | Speed |
| --- | --- | --- | --- | --- | --- |
| nomic-embed-text | 137 MB | 768 | Excellent | Decent | Fast |
| mxbai-embed-large | 670 MB | 1024 | Best | Decent | Medium |
| bge-m3 | 2.27 GB | 1024 | Excellent | Excellent | Medium |
| all-MiniLM-L6-v2 | 91 MB | 384 | Good | Poor | Very Fast |
| snowflake-arctic-embed:l | 670 MB | 1024 | Excellent | Decent | Medium |

My recommendations:

  • For English-only documents: nomic-embed-text (fast, 768-dim, very high recall on standard benchmarks)
  • For multilingual or technical documents: bge-m3
  • For prototypes and tiny corpora: all-MiniLM-L6-v2 (fast, but lower recall)

Pull the embedding model into Ollama:

ollama pull nomic-embed-text

Verify dimensions:

import requests
r = requests.post("http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "test string"})
print(len(r.json()["embedding"]))   # 768

Match this dimension when you create your ChromaDB collection. A dimension mismatch is the source of half the "RAG returns nothing" bug reports.
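
A small guard worth adding before you index (a sketch: check_dims.py is a hypothetical helper, and the DB path and collection name match the indexer built in Step 3). It compares a fresh embedding's dimension against one already stored, so a later model swap fails loudly instead of silently retrieving nothing.

# check_dims.py: hypothetical guard script; names match rag_index.py below
import chromadb
import requests

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("kb")

# Dimension of the model you are about to index with
r = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "dimension probe"},
    timeout=30,
)
new_dim = len(r.json()["embedding"])

# Dimension of whatever is already stored (skip the check on an empty collection)
stored = collection.get(limit=1, include=["embeddings"])
existing = stored.get("embeddings")
if existing is not None and len(existing) > 0:
    stored_dim = len(existing[0])
    assert stored_dim == new_dim, f"collection holds {stored_dim}-dim vectors, model produces {new_dim}"
print(f"Embedding dimension: {new_dim}")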


Step 2: Chunking — The Most Important Decision {#chunking}

If you only get one thing right in your RAG system, get chunking right.

Bad chunking — splitting on arbitrary character counts — destroys semantic boundaries and gives the retriever incoherent fragments. Good chunking respects the structure of your documents.

Chunking strategy by document type:

| Document Type | Strategy | Chunk Size | Overlap |
| --- | --- | --- | --- |
| Long-form articles, books | Recursive character split | 1000 chars | 200 |
| Code | Symbol-aware (function/class) | 800 chars | 100 |
| Markdown | Header-aware | 1500 chars | 200 |
| HTML | Tag-aware | 1200 chars | 150 |
| Tables / structured data | Row-based | 1 row | 0 |
| Short snippets (chat logs) | Whole-document | n/a | n/a |

A working chunker for mixed text:

# rag_chunker.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)

def chunk(text: str, metadata: dict) -> list[dict]:
    chunks = splitter.split_text(text)
    return [
        {"text": c, "metadata": {**metadata, "chunk_idx": i, "len": len(c)}}
        for i, c in enumerate(chunks)
    ]

The 200-character overlap matters more than people realize. Without overlap, a question whose answer spans a chunk boundary is unanswerable.

For PDFs specifically, prefer open-parse or Docling over PyMuPDF when you need table preservation. PDF chunking is its own rabbit hole — bad PDF parsing is the second-most common cause of bad RAG.
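
For Markdown sources, the header-aware strategy from the table above keeps each heading attached to its chunks. A minimal sketch using langchain's MarkdownHeaderTextSplitter; the heading-level mapping and the 1500-char second pass are just reasonable defaults:

# markdown_chunker.py: header-aware splitting for .md files
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
# Second pass keeps oversized sections under the 1500-char target from the table
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)

def chunk_markdown(md_text: str) -> list[dict]:
    sections = header_splitter.split_text(md_text)  # Documents carrying header metadata
    out = []
    for section in sections:
        for piece in size_splitter.split_text(section.page_content):
            out.append({"text": piece, "metadata": dict(section.metadata)})
    return out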


Step 3: Build the Index {#build-index}

Here is the full indexer. It loads documents, chunks them, embeds with Ollama, and persists to ChromaDB.

# rag_index.py
import chromadb
import os
from pathlib import Path
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction
from langchain_text_splitters import RecursiveCharacterTextSplitter
import pypdf

DB_DIR = "./chroma_db"
DOCS_DIR = "./docs"
COLLECTION_NAME = "kb"
OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"

client = chromadb.PersistentClient(path=DB_DIR)

embed_fn = OllamaEmbeddingFunction(
    url=f"{OLLAMA_URL}/api/embeddings",
    model_name=EMBED_MODEL,
)

collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"},
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

def read_pdf(path: Path) -> str:
    reader = pypdf.PdfReader(path)
    # Extract each page once; skip pages with no extractable text
    texts = (p.extract_text() or "" for p in reader.pages)
    return "\n\n".join(t for t in texts if t.strip())

def read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="ignore")

READERS = {".pdf": read_pdf, ".md": read_text, ".txt": read_text}

def index_all():
    docs_added = 0
    for path in Path(DOCS_DIR).rglob("*"):
        if path.suffix.lower() not in READERS:
            continue
        text = READERS[path.suffix.lower()](path)
        chunks = splitter.split_text(text)
        ids = [f"{path.stem}-{i}" for i in range(len(chunks))]
        metas = [{"source": str(path), "chunk": i} for i in range(len(chunks))]
        collection.upsert(documents=chunks, ids=ids, metadatas=metas)
        docs_added += len(chunks)
        print(f"Indexed {path.name}: {len(chunks)} chunks")
    print(f"\nDone. {docs_added} chunks total.")

if __name__ == "__main__":
    index_all()

Run it:

python rag_index.py

Index size on disk for ChromaDB: roughly 10 KB per chunk including the embedding vector. A 100,000-chunk index is about 1 GB.
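
To sanity-check those numbers against your own corpus, a quick sketch (paths and names match the indexer above):

# index_stats.py: report chunk count and on-disk size of the ChromaDB directory
import chromadb
from pathlib import Path

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("kb")

count = collection.count()
size_bytes = sum(f.stat().st_size for f in Path("./chroma_db").rglob("*") if f.is_file())
print(f"{count} chunks, {size_bytes / 1e6:.1f} MB on disk "
      f"(~{size_bytes / max(count, 1) / 1024:.1f} KB per chunk)")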


Step 4: Build the Query Service {#query-service}

# rag_query.py
import chromadb
import sys
import requests
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction

DB_DIR = "./chroma_db"
COLLECTION_NAME = "kb"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.1:8b"
TOP_K = 5

client = chromadb.PersistentClient(path=DB_DIR)
embed_fn = OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name=EMBED_MODEL,
)
collection = client.get_collection(COLLECTION_NAME, embedding_function=embed_fn)

PROMPT_TEMPLATE = """You are a careful assistant. Answer ONLY using the context below.
If the answer is not in the context, say "I do not know based on the provided documents."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

def ask(question: str, top_k: int = TOP_K) -> dict:
    result = collection.query(query_texts=[question], n_results=top_k)
    chunks = result["documents"][0]
    sources = [m["source"] for m in result["metadatas"][0]]

    context = "\n\n---\n\n".join(
        f"[Source: {s}]\n{c}" for s, c in zip(sources, chunks)
    )

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": LLM_MODEL,
            "prompt": PROMPT_TEMPLATE.format(context=context, question=question),
            "stream": False,
            "options": {"temperature": 0.1, "num_ctx": 4096},
        },
        timeout=120,
    )
    answer = response.json()["response"].strip()
    return {"answer": answer, "sources": list(set(sources))}

if __name__ == "__main__":
    q = " ".join(sys.argv[1:]) or "What is in these documents?"
    out = ask(q)
    print(f"\nANSWER:\n{out['answer']}\n\nSOURCES:")
    for s in out["sources"]:
        print(f"  - {s}")

Run it:

python rag_query.py "What does the security policy say about laptop encryption?"

Output includes the answer plus the source files used. Source attribution is mandatory in any RAG system you let anyone trust.


Step 5: Wrap It in a FastAPI Service {#api}

For internal team use, expose the query layer as an HTTP API:

# rag_api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from rag_query import ask

app = FastAPI(title="Local RAG")

class Query(BaseModel):
    question: str
    top_k: int | None = 5

@app.post("/api/ask")
def api_ask(q: Query):
    if not q.question.strip():
        raise HTTPException(400, "empty question")
    return ask(q.question, q.top_k or 5)

@app.get("/api/health")
def health():
    return {"status": "ok"}

Install the API dependencies and start the server:

pip install fastapi uvicorn
uvicorn rag_api:app --host 0.0.0.0 --port 8000

Now curl -X POST http://localhost:8000/api/ask -H 'Content-Type: application/json' -d '{"question":"..."}' is your private knowledge endpoint.
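
The same call from Python, if you prefer a script over curl:

import requests

resp = requests.post(
    "http://localhost:8000/api/ask",
    json={"question": "What does the security policy say about laptop encryption?"},
    timeout=120,
)
print(resp.json()["answer"])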


Step 6: Evaluate Retrieval Quality {#evaluation}

If you cannot measure your RAG system, you cannot improve it. Build a tiny evaluation harness with a hand-curated test set.

# rag_eval.py
import json
from rag_query import ask

# Hand-curated test cases
EVAL = [
    {
        "question": "What is our PTO policy for new hires?",
        "must_contain": ["10 days", "first year"],
        "must_cite": ["hr_handbook.pdf"],
    },
    {
        "question": "Who is the security incident lead?",
        "must_contain": ["security@", "incident"],
        "must_cite": ["security_policy.pdf"],
    },
]

def evaluate():
    pass_count = 0
    for case in EVAL:
        out = ask(case["question"])
        ans = out["answer"].lower()
        srcs = " ".join(out["sources"]).lower()

        contain_ok = all(s.lower() in ans for s in case["must_contain"])
        cite_ok = all(s.lower() in srcs for s in case["must_cite"])

        ok = contain_ok and cite_ok
        pass_count += int(ok)
        status = "PASS" if ok else "FAIL"
        print(f"[{status}] {case['question']}")
        if not ok:
            print(f"  Answer: {out['answer'][:200]}")
            print(f"  Sources: {out['sources']}")
    print(f"\n{pass_count}/{len(EVAL)} passed")

if __name__ == "__main__":
    evaluate()

This is not fancy. It is also the difference between a RAG you tune by guessing and one you tune by measuring. Aim for 80%+ pass rate before deploying to a team.

For deeper RAG evaluation, look at RAGAS for automated metrics like faithfulness and context precision. RAGAS works with local LLMs as the judge model — useful for fully self-hosted eval.
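
A hedged sketch of a local-judge RAGAS run. The imports and the evaluate() signature vary between RAGAS versions, and metrics like context precision additionally need reference answers, so treat this as a starting point rather than the definitive API:

# ragas_eval.py: sketch only, adapt to the RAGAS version you install
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

from rag_query import ask, collection, TOP_K

question = "What is our PTO policy for new hires?"
out = ask(question)
hits = collection.query(query_texts=[question], n_results=TOP_K)

dataset = Dataset.from_dict({
    "question": [question],
    "answer": [out["answer"]],
    "contexts": [hits["documents"][0]],
})

# Judge LLM and embeddings both stay local
result = evaluate(
    dataset,
    metrics=[faithfulness],
    llm=ChatOllama(model="llama3.1:8b"),
    embeddings=OllamaEmbeddings(model="nomic-embed-text"),
)
print(result)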


Step 7: Tune Retrieval — The Knobs That Matter {#tune-retrieval}

After your first eval run, you will see failures. Here is the order I tune in:

1. Top-k. Default is 5. For dense, narrowly scoped corpora, 3 may be enough. For diverse corpora, try 8-10.

2. Chunk size. If the LLM keeps saying "context insufficient," your chunks are too small. If retrieval brings back irrelevant content, they are too big.

3. Hybrid search. Pure vector search misses exact-match keywords (names, IDs, codes). Combine with BM25:

# Add BM25 alongside vector retrieval (pip install rank-bm25)
from rank_bm25 import BM25Okapi
# Tokenize the chunks, build a BM25 index, and merge the vector and BM25 top-k lists
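
A fuller sketch of that merge using reciprocal rank fusion; the helper names are illustrative and the collection comes from rag_query.py:

# hybrid_search.py: vector + BM25 retrieval merged with reciprocal rank fusion
from rank_bm25 import BM25Okapi
from rag_query import collection, TOP_K

# Build the BM25 index once over every chunk already in the collection
corpus = collection.get()
docs, ids = corpus["documents"], corpus["ids"]
bm25 = BM25Okapi([d.lower().split() for d in docs])
by_id = dict(zip(ids, docs))

def hybrid_search(question: str, k: int = TOP_K) -> list[str]:
    # Vector ranking from ChromaDB
    vec = collection.query(query_texts=[question], n_results=k * 4)
    vec_ranked = vec["ids"][0]

    # Lexical ranking from BM25
    scores = bm25.get_scores(question.lower().split())
    bm25_ranked = [ids[i] for i in sorted(range(len(scores)), key=lambda i: -scores[i])[: k * 4]]

    # Reciprocal rank fusion: each chunk scores 1 / (60 + rank), summed over both lists
    fused: dict[str, float] = {}
    for ranking in (vec_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranking):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (60 + rank)

    top_ids = sorted(fused, key=fused.get, reverse=True)[:k]
    return [by_id[i] for i in top_ids]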

4. Re-ranking. After retrieving top-20 with vectors, re-rank with a cross-encoder (e.g., mixedbread-ai/mxbai-rerank-large-v1) and keep top-5. This single change took our company knowledge bot from 60% to 87% retrieval accuracy.
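
A sketch of that retrieve-wide-then-re-rank step with sentence-transformers; I am assuming the named checkpoint loads as a CrossEncoder, and any other cross-encoder checkpoint slots in the same way:

# rerank.py: retrieve 20 candidates with vectors, keep the cross-encoder's top 5
from sentence_transformers import CrossEncoder
from rag_query import collection

reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")

def retrieve_reranked(question: str, wide_k: int = 20, final_k: int = 5) -> list[str]:
    hits = collection.query(query_texts=[question], n_results=wide_k)
    candidates = hits["documents"][0]

    # Score every (question, chunk) pair, then keep the highest-scoring chunks
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:final_k]]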

5. Metadata filtering. ChromaDB supports where filters. Filter by document type, date, or department before vector search. Smaller search space = better results.

collection.query(
    query_texts=[q],
    n_results=5,
    where={"department": "engineering"},
)

6. Prompt engineering. Move "ground answers in context" instructions higher. Add: "If the question requires a number or date, quote it directly from the source."


Benchmarks: How Fast Is This? {#benchmarks}

On a MacBook Pro M3 (16GB unified memory):

| Operation | Speed |
| --- | --- |
| Indexing (PDFs) | ~120 chunks/sec |
| Embedding (nomic-embed-text) | ~250 embeddings/sec |
| ChromaDB vector search (100K chunks) | 8-12 ms per query |
| End-to-end query (5 chunks, llama3.1:8b) | 1.4-2.1 sec |
| End-to-end query (5 chunks, llama3.1:70b on 96GB Mac) | 6.8-9.2 sec |

For comparison: an OpenAI Ada-002 + GPT-4 Turbo cloud RAG completes in 1.0-1.6 sec but at $0.012-$0.020 per query, plus you ship every document chunk to OpenAI. At 10,000 queries per month, the cloud bill is $120-$200. Local is free after the hardware.

For our cost analysis, see Ollama vs ChatGPT API cost breakdown.


The 12 Mistakes That Break Local RAG {#pitfalls}

  1. Wrong embedding dimension. Pull the model, hardcode the dim, or read it dynamically. Mismatch = silent wrong answers.
  2. Tiny chunk overlap. 0% overlap loses cross-boundary answers. 200 chars is the sweet spot.
  3. No source attribution. Users will not trust answers without citations. Bake source IDs into every response.
  4. Mixing chunk sizes across formats. PDFs and code need different splitters. One-size-fits-all hurts retrieval.
  5. Skipping evaluation. Without an eval set you are tuning by vibes.
  6. Trusting the LLM to refuse. Add an "I do not know" instruction to the prompt. Local LLMs often confabulate without it.
  7. Forgetting hybrid search. Vector-only misses literal IDs, names, and codes.
  8. Indexing duplicates. Use upsert with stable IDs; never let duplicate chunks accumulate.
  9. No incremental re-index. Build a "watch documents folder" daemon (a sketch follows this list) or your index goes stale.
  10. Ignoring token budgets. Llama 3.1 8B advertises a 128K context window, but practical accuracy drops above 16K. Truncate context.
  11. Letting users dump huge questions. Long, multi-part questions confuse retrieval. Split or pre-rewrite.
  12. No backups. Back up the chroma_db directory. Rebuilding from raw documents takes hours on a real corpus.
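
For mistake 9, a minimal folder watcher, assuming the watchdog package and reusing the indexer's module-level objects (it does not handle deletions, which a real daemon should):

# watch_docs.py: re-index files as they change (sketch)
import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from rag_index import DOCS_DIR, READERS, collection, splitter

def reindex_file(path: Path) -> None:
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        return
    chunks = splitter.split_text(reader(path))
    if not chunks:
        return
    ids = [f"{path.stem}-{i}" for i in range(len(chunks))]
    metas = [{"source": str(path), "chunk": i} for i in range(len(chunks))]
    collection.upsert(documents=chunks, ids=ids, metadatas=metas)  # stable IDs keep this idempotent
    print(f"Re-indexed {path.name}: {len(chunks)} chunks")

class ReindexHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            reindex_file(Path(event.src_path))

    def on_modified(self, event):
        if not event.is_directory:
            reindex_file(Path(event.src_path))

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(ReindexHandler(), DOCS_DIR, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()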

Going to Production: Hardening Checklist {#production}

If you are about to deploy this internally, walk through these:

  • Stable doc IDs and incremental re-indexing
  • Source attribution in every response
  • Eval set of 20-50 questions with expected answers
  • Re-ranker enabled (top-20 vector → top-5 cross-encoder)
  • Hybrid search (vector + BM25)
  • Metadata filtering wired to your document taxonomy
  • Backups of the ChromaDB persistent directory
  • FastAPI behind authentication (keys, OIDC, or mTLS)
  • Rate limiting (60 req/min per user)
  • Audit logging of every question and answer
  • Health check endpoint and Prometheus metrics
  • Run the full pipeline against our Ollama production deployment checklist

For the auth, monitoring, and audit pieces, our securing Ollama guide covers the patterns we use.


What I Would Build Next {#next-steps}

The pipeline above is solid for single-tenant team RAG. Three additions worth considering:

1. Streaming responses. Switch stream: True in the Ollama call and yield chunks via FastAPI's StreamingResponse. Massive UX improvement.
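
A sketch of that wiring as additions to rag_api.py; the /api/ask/stream route name is just an example:

# additions to rag_api.py: stream tokens back as Ollama produces them
import json

import requests
from fastapi.responses import StreamingResponse

from rag_query import LLM_MODEL, PROMPT_TEMPLATE, TOP_K, collection

def stream_ollama(prompt: str):
    payload = {"model": LLM_MODEL, "prompt": prompt, "stream": True}
    with requests.post(
        "http://localhost:11434/api/generate", json=payload, stream=True, timeout=120
    ) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # Ollama streams newline-delimited JSON objects
            if not chunk.get("done"):
                yield chunk["response"]

@app.post("/api/ask/stream")
def api_ask_stream(q: Query):
    hits = collection.query(query_texts=[q.question], n_results=TOP_K)
    context = "\n\n---\n\n".join(hits["documents"][0])
    prompt = PROMPT_TEMPLATE.format(context=context, question=q.question)
    return StreamingResponse(stream_ollama(prompt), media_type="text/plain")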

2. Conversation memory. Store chat history per user and feed last 2-3 turns plus retrieved context. The model gives much better follow-up answers.
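
A naive in-memory sketch; in production you would persist history per user (Redis, SQLite) instead of a process-local dict:

# Keep the last three turns per user and prepend them to the question
from collections import defaultdict, deque

from rag_query import ask

HISTORY: dict[str, deque] = defaultdict(lambda: deque(maxlen=3))

def ask_with_memory(user_id: str, question: str) -> dict:
    past = "\n".join(f"Q: {q}\nA: {a}" for q, a in HISTORY[user_id])
    enriched = f"Previous turns:\n{past}\n\nCurrent question: {question}" if past else question
    out = ask(enriched)
    HISTORY[user_id].append((question, out["answer"]))
    return out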

3. Tool calling. Let the model trigger live searches, calendar lookups, or SQL queries when retrieval is insufficient. See our Ollama function calling guide.

For the broader RAG ecosystem context, the canonical reference is the ChromaDB official documentation. For embedding model evaluation, the MTEB leaderboard is the most rigorous public benchmark.


Closing Thoughts {#closing}

The first time I built a local RAG, I obsessed over which LLM to use. It barely mattered. The 8B and 70B models gave roughly equivalent answers when retrieval was good, and both gave terrible answers when retrieval was bad. The whole game is the retrieval pipeline — embeddings, chunking, hybrid search, re-ranking. The LLM is the last 10% of the work.

If you take one thing away: build the eval harness before you tune anything. Without measurement you will spend weeks "improving" your RAG with no idea whether it is getting better or worse.

The code in this guide is a complete starting point. Drop your documents in ./docs/, run the indexer, and you have a private RAG that runs without ever calling out to the internet.
