
Ollama Semantic Search: Build a Private Document Search Engine

April 23, 2026
23 min read
LocalAimaster Research Team


Imagine searching a company drive of 40,000 PDFs and getting an answer that understands "the legal review of the 2024 Acme acquisition" even when no document contains those exact words. That is semantic search, and the only thing standing between you and having it on your own hardware is a few hundred lines of Python and a working Ollama install.

Cloud semantic search exists. It also costs $0.10 per 1000 documents indexed plus a per-query fee, sends every document to OpenAI or Cohere or Voyage, and can be subpoenaed. None of those are acceptable for legal, healthcare, finance, R&D, or any organisation where the document contents are the actual asset. Local Ollama embeddings remove the bill, the privacy concern, and the dependency on an external service all at once.

This guide walks through the full pipeline: model selection, chunking, vector store choice, indexing, hybrid search, reranking, and benchmarks on a real 50K-document corpus.

Quick Start: ollama pull nomic-embed-text followed by the indexer script below produces a working semantic search engine over a folder of PDFs. Everything else in this guide is about making it good.


Table of Contents

  1. What Semantic Search Actually Does
  2. Choosing an Embedding Model
  3. Chunking Strategy
  4. Vector Store Comparison
  5. Building the Indexer
  6. Building the Search API
  7. Hybrid Search and Reranking
  8. Benchmarks on a 50K-Document Corpus
  9. Pitfalls
  10. Frequently Asked Questions

What Semantic Search Actually Does {#what-it-does}

A keyword search engine indexes the words in each document and matches the words in the query. It works perfectly when you remember the exact phrasing — and badly when you do not. "Quarterly numbers" misses a document titled "Q3 financials."

Semantic search indexes the meaning of each chunk of text as a vector — a list of 768 or 1024 floating-point numbers. At query time, the query is also embedded into a vector, and you find the document chunks whose vectors are closest by cosine similarity. "Quarterly numbers" and "Q3 financials" produce nearby vectors because the embedding model has learned that they describe similar things.
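In code, "closest by cosine similarity" is a one-line formula. A toy sketch in pure Python, with made-up 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real models emit 768 or 1024 dimensions.
q3_financials = [0.9, 0.1, 0.2]
quarterly_numbers = [0.85, 0.15, 0.25]  # semantically close -> nearby vector
cafeteria_menu = [0.1, 0.9, 0.1]        # unrelated -> distant vector

print(cosine_similarity(q3_financials, quarterly_numbers))  # close to 1.0
print(cosine_similarity(q3_financials, cafeteria_menu))     # much lower
```

Real embedding vectors behave the same way; the model's job is to place related texts near each other in that high-dimensional space.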

The architecture in three parts:

  1. Indexing pipeline — read documents, split into chunks, embed each chunk with Ollama, store the vector in a vector database alongside the source text and metadata.
  2. Query pipeline — embed the query, find the top-K nearest vectors, retrieve the source chunks.
  3. Optional reranker / hybrid layer — re-score the top-K with a more accurate (slower) model, or combine with keyword search.

Everything else is implementation detail. The interesting choices are which embedding model, which chunk size, which vector store, and whether you need the reranker.

For background on what local AI provides over cloud services, the private AI knowledge base guide walks through the broader use case.


Choosing an Embedding Model {#embedding-models}

Ollama hosts several embedding models. Three are worth using; the rest are mostly historical.

| Model | Dimensions | Max tokens | Size on disk | MTEB avg | Speed (4090) |
|---|---|---|---|---|---|
| nomic-embed-text v1.5 | 768 | 8192 (effective ~512) | 274 MB | 62.4 | 5800 docs/min |
| mxbai-embed-large | 1024 | 512 | 670 MB | 64.7 | 4200 docs/min |
| bge-m3 | 1024 | 8192 | 2.3 GB | 66.5 | 2100 docs/min |
| snowflake-arctic-embed | 1024 | 512 | 670 MB | 64.0 | 4500 docs/min |
| all-minilm | 384 | 256 | 90 MB | 56.3 | 12000 docs/min |

nomic-embed-text is the right default. Fast, small, MTEB scores are competitive, and it handles documents up to 512 tokens cleanly. For most semantic search workloads it is indistinguishable from the heavier models.

bge-m3 is the right pick for multilingual collections or long documents. It supports up to 8K input tokens natively and produces three vectors per input (dense, sparse, multi-vector) which lets you build hybrid retrieval inside a single model.

mxbai-embed-large is the accuracy-first choice for English-only collections where indexing time is not a concern.

Pull all three and benchmark on your own data — MTEB averages do not always predict performance on a specific domain.

ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3

For a deeper comparison of local vs cloud embedding quality, the local embeddings vs OpenAI embeddings analysis covers head-to-head retrieval quality.


Chunking Strategy {#chunking}

Chunking is where most homegrown semantic search engines lose 20 points of recall they did not need to lose.

Rule 1: Match Chunk Size to Model Limit

Most Ollama embedding models cap at 512 effective tokens despite documenting longer limits. Chunks larger than that are silently truncated. Use 200-400 token chunks for nomic, mxbai, snowflake. Use 800-1200 for bge-m3.

Rule 2: Add Overlap

A 300-token chunk with 30 tokens of overlap with the next chunk costs 10 percent more storage and prevents queries that span chunk boundaries from missing matches.

Rule 3: Respect Document Structure

Splitting in the middle of a sentence destroys the semantic signal. Use a recursive splitter that prefers paragraph breaks, then sentence breaks, then word breaks.

from typing import List

def chunk_text(text: str, target_tokens: int = 300, overlap: int = 30) -> List[str]:
    # Approximate tokens as words * 1.3
    target_words = int(target_tokens / 1.3)
    overlap_words = int(overlap / 1.3)

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], []
    word_count = 0

    for para in paragraphs:
        words = para.split()
        if word_count + len(words) <= target_words:
            buffer.append(para)
            word_count += len(words)
        else:
            if buffer:
                chunks.append("\n\n".join(buffer))
            # Start next buffer with overlap from previous; on the very first
            # oversized paragraph the buffer is empty and there is no tail
            tail = " ".join(" ".join(buffer).split()[-overlap_words:]) if buffer else ""
            buffer = [tail, para] if tail else [para]
            word_count = len(tail.split()) + len(words)

    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks

Rule 4: Prepend Document Title

Adding the document title to the start of every chunk improves retrieval for queries that mention the document by name. f"Document: {title}\n\n{chunk_text}" is the simplest version.

Rule 5: Test Two or Three Chunk Sizes

Build a small evaluation set (50 queries with known correct chunks) and measure recall@5 for chunk sizes of 200, 300, 500. The best size for your data may differ from common defaults.
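Recall@K is simple enough to write inline. A sketch of the evaluation loop, assuming an eval set of (query, correct chunk IDs) pairs and some search function that returns ranked chunk IDs — the names and toy data here are illustrative:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the relevant chunks that appear in the top-k results
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def evaluate(eval_set, search_fn, k: int = 5) -> float:
    # eval_set: list of (query, set of correct chunk IDs)
    scores = [recall_at_k(search_fn(q), relevant, k) for q, relevant in eval_set]
    return sum(scores) / len(scores)

# Toy example with a fake search function
eval_set = [("q3 revenue", {"c1", "c7"}), ("acme merger", {"c3"})]
fake_search = lambda q: ["c1", "c2", "c3", "c4", "c5"]
print(evaluate(eval_set, fake_search))  # 0.75: (1/2 + 1/1) / 2
```

Run the same loop for each candidate chunk size and keep the winner.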


Vector Store Comparison {#vector-stores}

| Store | Setup time | Filter support | Ideal scale | Notes |
|---|---|---|---|---|
| ChromaDB | 2 min | Excellent | < 5M vectors | Best DX, embedded mode |
| FAISS | 10 min | Manual | 1M-1B vectors | Pure speed, no metadata server |
| pgvector | 15 min | Excellent (SQL) | < 50M vectors | If you already run Postgres |
| Qdrant | 5 min (Docker) | Excellent | 1M-1B vectors | Production-grade, good API |
| Weaviate | 10 min (Docker) | Excellent | 1M-1B vectors | Built-in modules, heavier |
| Milvus | 30 min | Excellent | 100M+ vectors | Enterprise-scale, complex |

For most Ollama-based semantic search projects: ChromaDB to prototype, Qdrant or pgvector for production, FAISS only when you need raw speed with millions of vectors.

ChromaDB also has the smallest learning curve — three method calls and you have a working store.

import chromadb
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection("docs")

Building the Indexer {#indexer}

A working end-to-end indexer for PDFs in a folder.

import os
import hashlib
from pathlib import Path
import chromadb
import ollama
from pypdf import PdfReader

EMBED_MODEL = "nomic-embed-text"
CHUNK_TOKENS = 300
OVERLAP_TOKENS = 30

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(
    "docs",
    metadata={"hnsw:space": "cosine"},
)

def extract_text(pdf_path: Path) -> str:
    reader = PdfReader(str(pdf_path))
    return "\n\n".join((p.extract_text() or "") for p in reader.pages)

def chunk_text(text, target=300, overlap=30):
    target_words = int(target / 1.3)
    overlap_words = int(overlap / 1.3)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer, count = [], [], 0
    for para in paragraphs:
        words = para.split()
        if count + len(words) <= target_words:
            buffer.append(para)
            count += len(words)
        else:
            if buffer:
                chunks.append("\n\n".join(buffer))
            tail = " ".join(" ".join(buffer).split()[-overlap_words:]) if buffer else ""
            buffer = [tail, para] if tail else [para]
            count = len(tail.split()) + len(words)
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks

def index_folder(folder: str):
    folder = Path(folder)
    pdfs = list(folder.rglob("*.pdf"))
    print(f"Indexing {len(pdfs)} PDFs from {folder}")

    batch_texts, batch_ids, batch_meta = [], [], []
    for pdf in pdfs:
        try:
            text = extract_text(pdf)
        except Exception as e:
            print(f"skip {pdf}: {e}")
            continue

        title = pdf.stem
        for i, chunk in enumerate(chunk_text(text)):
            doc_id = hashlib.sha1(f"{pdf}::{i}".encode()).hexdigest()[:16]
            payload = f"Document: {title}\n\n{chunk}"
            batch_texts.append(payload)
            batch_ids.append(doc_id)
            batch_meta.append({"path": str(pdf), "title": title, "chunk": i})

            if len(batch_texts) >= 64:
                embed_and_store(batch_texts, batch_ids, batch_meta)
                batch_texts, batch_ids, batch_meta = [], [], []

    if batch_texts:
        embed_and_store(batch_texts, batch_ids, batch_meta)

def embed_and_store(texts, ids, metas):
    result = ollama.embed(model=EMBED_MODEL, input=texts)
    collection.upsert(
        ids=ids,
        embeddings=result.embeddings,
        documents=texts,
        metadatas=metas,
    )
    print(f"  +{len(texts)} chunks indexed")

if __name__ == "__main__":
    index_folder("./documents")

Three details matter here. The batch size of 64 minimises Ollama API overhead. The metadata stores the original path and chunk index so query results can show context. The cosine HNSW space matches normalised embeddings, which is what nomic-embed-text produces.

For a deeper look at production indexing patterns, the official ChromaDB docs at docs.trychroma.com cover collection sharding, persistence, and migrations.


Building the Search API {#search-api}

A FastAPI service that accepts a query and returns ranked results.

from fastapi import FastAPI, Query
import chromadb
import ollama

app = FastAPI()
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_collection("docs")

@app.get("/search")
def search(q: str = Query(..., min_length=2), k: int = 5):
    query_emb = ollama.embed(model="nomic-embed-text", input=[q]).embeddings[0]
    results = collection.query(
        query_embeddings=[query_emb],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        hits.append({
            "title": meta["title"],
            "path": meta["path"],
            "chunk": meta["chunk"],
            "score": 1 - dist,         # convert cosine distance to similarity
            "preview": doc[:280],
        })
    return {"query": q, "results": hits}

Run with uvicorn search_api:app --port 8090 and query with curl "http://localhost:8090/search?q=quarterly+revenue" (quote the URL so the shell does not interpret the ? and +).

This is already a usable search engine. Everything below makes it better.


Hybrid Search and Reranking {#hybrid}

Pure vector search misses exact matches (product codes, names, acronyms) and produces less reliable rankings than a combination of dense retrieval, sparse retrieval, and a reranker.

Step 1: BM25 Sparse Index

from rank_bm25 import BM25Okapi
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by word_tokenize on newer NLTK
from nltk.tokenize import word_tokenize

# Build the BM25 index alongside the vector index
all_docs = collection.get(include=["documents"])
tokenised = [word_tokenize(d.lower()) for d in all_docs["documents"]]
bm25 = BM25Okapi(tokenised)
ids = all_docs["ids"]

Step 2: Reciprocal Rank Fusion

def rrf(rankings: list[list[str]], k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

def hybrid_search(q: str, top_k: int = 50):
    # Dense
    q_emb = ollama.embed(model="nomic-embed-text", input=[q]).embeddings[0]
    dense = collection.query(query_embeddings=[q_emb], n_results=top_k)
    dense_ids = dense["ids"][0]

    # Sparse
    bm25_scores = bm25.get_scores(word_tokenize(q.lower()))
    bm25_top = sorted(range(len(bm25_scores)), key=lambda i: -bm25_scores[i])[:top_k]
    sparse_ids = [ids[i] for i in bm25_top]

    return rrf([dense_ids, sparse_ids])[:10]
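To see why the fusion works, run rrf on two toy rankings (a self-contained copy of the function above): a document ranked near the top of both lists beats one that only a single ranking put first.

```python
def rrf(rankings: list[list[str]], k: int = 60):
    # Each document scores 1/(k + rank) per ranking it appears in
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

dense = ["a", "b", "c", "d"]   # vector search ranking
sparse = ["e", "b", "c", "a"]  # BM25 ranking
fused = [doc_id for doc_id, score in rrf([dense, sparse])]
print(fused)  # ['b', 'a', 'c', 'e', 'd']
```

"b" is second in both lists and ends up first overall; "a" tops the dense list but its last-place sparse rank drags it down. That stability under disagreement is the point of RRF.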

Step 3: Cross-Encoder Reranker (Optional)

from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(q: str, candidates: list[tuple[str, float]]):
    got = collection.get(ids=[c[0] for c in candidates])
    # collection.get does not guarantee result order matches the requested ids,
    # so map ids to documents explicitly before building the scoring pairs
    docs = dict(zip(got["ids"], got["documents"]))
    pairs = [[q, docs[c[0]]] for c in candidates]
    scores = reranker.predict(pairs)
    out = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c[0] for c, s in out]

The cross-encoder runs on CPU at acceptable speed for top-50 reranking — about 200-400 ms wall time per query.

Quality Improvement (Internal Benchmark)

Measured on a 50-query evaluation set against a 50K-chunk legal corpus:

| Pipeline | Recall@5 | MRR | P95 latency |
|---|---|---|---|
| Vector only | 0.61 | 0.48 | 110 ms |
| Vector + BM25 (RRF) | 0.74 | 0.58 | 145 ms |
| Vector + BM25 + Reranker | 0.83 | 0.69 | 380 ms |

The reranker adds 200 ms but lifts recall by 9 points. For interactive search, that tradeoff is almost always worth it.


Benchmarks on a 50K-Document Corpus {#benchmarks}

Hardware: single RTX 4090, AMD 7950X, 64 GB DDR5, ChromaDB on local NVMe.

Indexing Throughput

| Embedding model | Chunks/min | Total time (50K chunks) |
|---|---|---|
| nomic-embed-text | 5800 | 8 min 38 s |
| mxbai-embed-large | 4200 | 11 min 54 s |
| bge-m3 (dense only) | 2100 | 23 min 49 s |
| nomic on CPU only | 240 | 3 h 28 min |

Query Latency (p50 / p95)

| Pipeline | Embedding | Search | Total |
|---|---|---|---|
| Vector only | 32 / 58 ms | 8 / 14 ms | 41 / 78 ms |
| Vector + BM25 | 32 / 58 ms | 22 / 41 ms | 56 / 105 ms |
| Vector + BM25 + reranker (CPU) | 32 / 58 ms | 22 / 41 ms | 250 / 400 ms |

Storage Footprint

50K chunks at 768 dimensions: ~150 MB for vectors + ~600 MB for source text in ChromaDB. Total under 1 GB. The HNSW index in memory is ~200 MB. Even a 500K-chunk corpus is comfortable on 16 GB RAM.
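Those numbers are plain arithmetic: a float32 dimension is 4 bytes, so raw vector storage scales linearly with chunk count and dimensions. A quick back-of-envelope sketch:

```python
def vector_storage_mb(chunks: int, dims: int, bytes_per_float: int = 4) -> float:
    # Raw float32 vector storage only; excludes index overhead and source text
    return chunks * dims * bytes_per_float / (1024 ** 2)

print(vector_storage_mb(50_000, 768))   # ~146 MB raw, ~150 MB with overhead
print(vector_storage_mb(500_000, 768))  # ~1465 MB of raw vectors at 500K chunks
```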

For an even broader RAG pipeline including chat over the retrieved chunks, the Ollama + ChromaDB RAG pipeline is the natural follow-up.


Pitfalls {#pitfalls}

Pitfall 1: Forgetting to Normalise Vectors

Symptom: Cosine similarity returns wildly inconsistent scores.

Cause: ChromaDB with hnsw:space=cosine expects unit vectors. Some embedding models normalise output; some do not.

Fix: Either use hnsw:space=l2 (works regardless), or explicitly normalise:

import numpy as np
arr = np.array(embeddings)
arr = arr / np.linalg.norm(arr, axis=1, keepdims=True)
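The reason l2 "works regardless" is that on unit vectors, squared L2 distance is a monotone function of cosine similarity: ||a - b||^2 = 2 - 2*cos(a, b), so both metrics produce the same ranking once vectors are normalised. A pure-Python check of the identity:

```python
import math

def normalise(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos_sim(a, b):
    return sum(x * y for x, y in zip(a, b))  # vectors are already unit-length

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalise([3.0, 1.0, 2.0])
b = normalise([2.5, 1.5, 2.0])
# On unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(l2_sq(a, b), 2 - 2 * cos_sim(a, b))  # the two values are equal
```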

Pitfall 2: PDF Extraction Producing Garbage

Symptom: Search returns nonsense — fragments of footers, page numbers, headers.

Cause: pypdf and pdfplumber both extract per-page text, including running headers.

Fix: Strip top/bottom 10 percent of each page, or use a proper layout-aware extractor like Unstructured.io or Marker.
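An alternative cheap heuristic, sketched below (the helper and threshold are this guide's own invention, not a pypdf feature): drop any line that repeats across most pages, since running headers recur while body text does not.

```python
from collections import Counter

def strip_repeated_lines(pages: list[str], threshold: float = 0.6) -> list[str]:
    # Count, per distinct line, how many pages it appears on
    line_counts = Counter(
        line.strip()
        for page in pages
        for line in set(page.splitlines())
        if line.strip()
    )
    cutoff = len(pages) * threshold
    cleaned = []
    for page in pages:
        # Keep only lines that do NOT recur on most pages
        kept = [l for l in page.splitlines() if line_counts[l.strip()] <= cutoff]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "ACME Corp Confidential\nIntro text here\nPage 1",
    "ACME Corp Confidential\nMore body text\nPage 2",
    "ACME Corp Confidential\nConclusion\nPage 3",
]
print(strip_repeated_lines(pages)[0])  # header stripped, body text kept
```

Page numbers change per page, so they survive this filter; a small regex pass ("Page \d+") handles those.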

Pitfall 3: Indexing the Same Document Twice

Symptom: Top-5 results for some queries are five copies of the same chunk.

Cause: Re-running the indexer without deduplication, or hash collision on document IDs.

Fix: Use upsert() instead of add() so re-indexing replaces rather than duplicates.

Pitfall 4: Embedding Model Mismatch

Symptom: Search relevance collapses after a "small" change.

Cause: Indexed with nomic-embed-text, querying with mxbai-embed-large (or vice versa). Vectors from different models are not comparable.

Fix: Store the embedding model name in collection metadata. Refuse to query if the requested model differs.
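A minimal guard, sketched with plain dicts (the embed_model key and helper name are conventions for this guide, not a ChromaDB feature — write the key into the collection metadata at index time and check it at query time):

```python
def assert_model_matches(collection_metadata: dict, query_model: str) -> None:
    # "embed_model" is our own metadata key, stored when the index was built
    indexed_with = collection_metadata.get("embed_model")
    if indexed_with and indexed_with != query_model:
        raise ValueError(
            f"Collection indexed with {indexed_with!r}, "
            f"refusing to query with {query_model!r}"
        )

# At index time the metadata would look like:
meta = {"hnsw:space": "cosine", "embed_model": "nomic-embed-text"}

assert_model_matches(meta, "nomic-embed-text")  # passes silently
try:
    assert_model_matches(meta, "mxbai-embed-large")
except ValueError as e:
    print(e)  # mismatched model is refused
```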

Pitfall 5: No Evaluation Set

Symptom: Cannot tell whether changes are improving or regressing quality.

Cause: No held-out queries with known correct answers.

Fix: Build 50-100 queries with manually-judged correct chunk IDs. Run them after every meaningful change. Without this, every "improvement" is a guess.

For broader troubleshooting beyond semantic search specifics, the Ollama troubleshooting guide covers the underlying server-side issues.


Final Notes

A working private semantic search engine is a weekend project. A great one — fast, accurate, hybrid, with a reliable reranker, evaluation set, and an indexer that handles dirty PDFs cleanly — is two to three weeks. Both are dramatically less work than negotiating with a cloud vendor over a sensitive data agreement, and both leave you with infrastructure you fully control.

Pull nomic-embed-text. Index a folder of PDFs with the script above. Query it from the FastAPI service. Notice that even a naive implementation works surprisingly well. Then add BM25, then a reranker, then an evaluation set, in that order. By the time you reach the end of this guide, the search engine you have built is competitive with paid tools that charge per query, runs in your VPC, and never sends a document to anyone.

The cloud semantic search market exists because building this from scratch used to require ML expertise and dedicated infrastructure. Ollama plus ChromaDB plus 200 lines of Python removed that moat. The interesting work is no longer how to build it — it is what to do with the search engine once you have it.

Written by Pattanaik Ramswarup