Ollama Semantic Search: Build a Private Document Search Engine
Published on April 23, 2026 -- 23 min read
Imagine searching a company drive of 40,000 PDFs and getting an answer that understands "the legal review of the 2024 Acme acquisition" even when no document contains those exact words. That is semantic search, and the only thing standing between you and having it on your own hardware is a few hundred lines of Python and a working Ollama install.
Cloud semantic search exists. It also costs $0.10 per 1000 documents indexed plus a per-query fee, sends every document to OpenAI or Cohere or Voyage, and can be subpoenaed. None of those are acceptable for legal, healthcare, finance, R&D, or any organisation where the document contents are the actual asset. Local Ollama embeddings remove the bill, the privacy concern, and the dependency on an external service all at once.
This guide walks through the full pipeline: model selection, chunking, vector store choice, indexing, hybrid search, reranking, and benchmarks on a real 50K-document corpus.
Quick Start:
`ollama pull nomic-embed-text` followed by the 30-line indexer below produces a working semantic search engine over a folder of PDFs. Everything else in this guide is about making it good.
Table of Contents
- What Semantic Search Actually Does
- Choosing an Embedding Model
- Chunking Strategy
- Vector Store Comparison
- Building the Indexer
- Building the Search API
- Hybrid Search and Reranking
- Benchmarks on a 50K-Document Corpus
- Pitfalls
- Frequently Asked Questions
What Semantic Search Actually Does {#what-it-does}
A keyword search engine indexes the words in each document and matches the words in the query. It works perfectly when you remember the exact phrasing — and badly when you do not. "Quarterly numbers" misses a document titled "Q3 financials."
Semantic search indexes the meaning of each chunk of text as a vector — a list of 768 or 1024 floating-point numbers. At query time, the query is also embedded into a vector, and you find the document chunks whose vectors are closest by cosine similarity. "Quarterly numbers" and "Q3 financials" produce nearby vectors because the embedding model has learned that they describe similar things.
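To make the vector idea concrete, here is a minimal sketch (assuming the `ollama` Python client and a pulled `nomic-embed-text` model) that embeds two phrasings of the same idea and measures how close they land:

```python
import numpy as np
import ollama

# Embed two different phrasings of the same concept with the same model
resp = ollama.embed(model="nomic-embed-text", input=["quarterly numbers", "Q3 financials"])
a, b = (np.array(v) for v in resp.embeddings)

# Cosine similarity: close to 1.0 for related meanings, near 0 for unrelated text
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```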
The architecture in three parts:
- Indexing pipeline — read documents, split into chunks, embed each chunk with Ollama, store the vector in a vector database alongside the source text and metadata.
- Query pipeline — embed the query, find the top-K nearest vectors, retrieve the source chunks.
- Optional reranker / hybrid layer — re-score the top-K with a more accurate (slower) model, or combine with keyword search.
Everything else is implementation detail. The interesting choices are which embedding model, which chunk size, which vector store, and whether you need the reranker.
For background on what local AI provides over cloud services, the private AI knowledge base guide walks through the broader use case.
Choosing an Embedding Model {#embedding-models}
Ollama hosts several embedding models. Three are worth using; the rest are mostly historical.
| Model | Dimensions | Max tokens | Size on disk | MTEB avg | Speed (4090) |
|---|---|---|---|---|---|
| nomic-embed-text v1.5 | 768 | 8192 (effective ~512) | 274 MB | 62.4 | 5800 docs/min |
| mxbai-embed-large | 1024 | 512 | 670 MB | 64.7 | 4200 docs/min |
| bge-m3 | 1024 | 8192 | 2.3 GB | 66.5 | 2100 docs/min |
| snowflake-arctic-embed | 1024 | 512 | 670 MB | 64.0 | 4500 docs/min |
| all-minilm | 384 | 256 | 90 MB | 56.3 | 12000 docs/min |
nomic-embed-text is the right default. Fast, small, MTEB scores are competitive, and it handles documents up to 512 tokens cleanly. For most semantic search workloads it is indistinguishable from the heavier models.
bge-m3 is the right pick for multilingual collections or long documents. It supports up to 8K input tokens natively and produces three vectors per input (dense, sparse, multi-vector) which lets you build hybrid retrieval inside a single model.
mxbai-embed-large is the accuracy-first choice for English-only collections where indexing time is not a concern.
Pull all three and benchmark on your own data — MTEB averages do not always predict performance on a specific domain.
```bash
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3
```
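To see how these models behave on your hardware rather than trusting the table above, a rough timing loop is enough; `sample_chunks` below is a placeholder for a few hundred chunks drawn from your own documents:

```python
import time
import ollama

# Placeholder: swap in ~500 real chunks from your corpus
sample_chunks = ["replace me with real chunk text"] * 512

for model in ["nomic-embed-text", "mxbai-embed-large", "bge-m3"]:
    start = time.perf_counter()
    for i in range(0, len(sample_chunks), 64):  # batch calls to cut API overhead
        ollama.embed(model=model, input=sample_chunks[i:i + 64])
    elapsed = time.perf_counter() - start
    print(f"{model}: {len(sample_chunks) / elapsed * 60:,.0f} chunks/min")
```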
For a deeper comparison of local vs cloud embedding quality, the local embeddings vs OpenAI embeddings analysis covers head-to-head retrieval quality.
Chunking Strategy {#chunking}
Chunking is where most homegrown semantic search engines lose 20 points of recall they did not need to lose.
Rule 1: Match Chunk Size to Model Limit
Most Ollama embedding models cap at 512 effective tokens despite documenting longer limits. Chunks larger than that are silently truncated. Use 200-400 token chunks for nomic, mxbai, snowflake. Use 800-1200 for bge-m3.
Rule 2: Add Overlap
A 300-token chunk with 30 tokens of overlap with the next chunk costs 10 percent more storage and prevents queries that span chunk boundaries from missing matches.
Rule 3: Respect Document Structure
Splitting in the middle of a sentence destroys the semantic signal. Use a recursive splitter that prefers paragraph breaks, then sentence breaks, then word breaks.
```python
from typing import List

def chunk_text(text: str, target_tokens: int = 300, overlap: int = 30) -> List[str]:
    # Approximate tokens as words * 1.3
    target_words = int(target_tokens / 1.3)
    overlap_words = int(overlap / 1.3)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], []
    word_count = 0
    for para in paragraphs:
        words = para.split()
        if word_count + len(words) <= target_words:
            buffer.append(para)
            word_count += len(words)
        else:
            if buffer:
                chunks.append("\n\n".join(buffer))
            # Start next buffer with overlap from the previous one
            tail = " ".join(" ".join(buffer).split()[-overlap_words:]) if buffer else ""
            buffer = [tail, para] if tail else [para]
            word_count = len(tail.split()) + len(words)
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks
```
Rule 4: Prepend Document Title
Adding the document title to the start of every chunk improves retrieval for queries that mention the document by name. `f"Document: {title}\n\n{chunk}"` is the simplest version.
Rule 5: Test Two or Three Chunk Sizes
Build a small evaluation set (50 queries with known correct chunks) and measure recall@5 for chunk sizes of 200, 300, 500. The best size for your data may differ from common defaults.
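A minimal recall@5 harness is only a few lines. The sketch below assumes an `eval_set` of `(query, relevant_chunk_id)` pairs you have judged by hand and the ChromaDB collection built later in this guide; re-run it after changing the chunk size and keep whichever setting scores highest.

```python
import ollama

def recall_at_5(eval_set, collection, model="nomic-embed-text"):
    # eval_set: list of (query, relevant_chunk_id) pairs judged by hand
    hits = 0
    for query, relevant_id in eval_set:
        emb = ollama.embed(model=model, input=[query]).embeddings[0]
        result = collection.query(query_embeddings=[emb], n_results=5)
        if relevant_id in result["ids"][0]:
            hits += 1
    return hits / len(eval_set)
```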
Vector Store Comparison {#vector-stores}
| Store | Setup time | Filter support | Ideal scale | Notes |
|---|---|---|---|---|
| ChromaDB | 2 min | Excellent | < 5M vectors | Best DX, embedded mode |
| FAISS | 10 min | Manual | 1M-1B vectors | Pure speed, no metadata server |
| pgvector | 15 min | Excellent (SQL) | < 50M vectors | If you already run Postgres |
| Qdrant | 5 min (Docker) | Excellent | 1M-1B vectors | Production-grade, good API |
| Weaviate | 10 min (Docker) | Excellent | 1M-1B vectors | Built-in modules, heavier |
| Milvus | 30 min | Excellent | 100M+ vectors | Enterprise-scale, complex |
For most Ollama-based semantic search projects: ChromaDB to prototype, Qdrant or pgvector for production, FAISS only when you need raw speed with millions of vectors.
ChromaDB also has the smallest learning curve — three method calls and you have a working store.
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection("docs")
```
Building the Indexer {#indexer}
A working end-to-end indexer for PDFs in a folder.
```python
import os
import hashlib
from pathlib import Path

import chromadb
import ollama
from pypdf import PdfReader

EMBED_MODEL = "nomic-embed-text"
CHUNK_TOKENS = 300
OVERLAP_TOKENS = 30

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(
    "docs",
    metadata={"hnsw:space": "cosine"},
)

def extract_text(pdf_path: Path) -> str:
    reader = PdfReader(str(pdf_path))
    return "\n\n".join((p.extract_text() or "") for p in reader.pages)

def chunk_text(text, target=300, overlap=30):
    target_words = int(target / 1.3)
    overlap_words = int(overlap / 1.3)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer, count = [], [], 0
    for para in paragraphs:
        words = para.split()
        if count + len(words) <= target_words:
            buffer.append(para)
            count += len(words)
        else:
            if buffer:
                chunks.append("\n\n".join(buffer))
            tail = " ".join(" ".join(buffer).split()[-overlap_words:]) if buffer else ""
            buffer = [tail, para] if tail else [para]
            count = len(tail.split()) + len(words)
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks

def index_folder(folder: str):
    folder = Path(folder)
    pdfs = list(folder.rglob("*.pdf"))
    print(f"Indexing {len(pdfs)} PDFs from {folder}")
    batch_texts, batch_ids, batch_meta = [], [], []
    for pdf in pdfs:
        try:
            text = extract_text(pdf)
        except Exception as e:
            print(f"skip {pdf}: {e}")
            continue
        title = pdf.stem
        for i, chunk in enumerate(chunk_text(text)):
            doc_id = hashlib.sha1(f"{pdf}::{i}".encode()).hexdigest()[:16]
            payload = f"Document: {title}\n\n{chunk}"
            batch_texts.append(payload)
            batch_ids.append(doc_id)
            batch_meta.append({"path": str(pdf), "title": title, "chunk": i})
            if len(batch_texts) >= 64:
                embed_and_store(batch_texts, batch_ids, batch_meta)
                batch_texts, batch_ids, batch_meta = [], [], []
    if batch_texts:
        embed_and_store(batch_texts, batch_ids, batch_meta)

def embed_and_store(texts, ids, metas):
    result = ollama.embed(model=EMBED_MODEL, input=texts)
    collection.upsert(
        ids=ids,
        embeddings=result.embeddings,
        documents=texts,
        metadatas=metas,
    )
    print(f" +{len(texts)} chunks indexed")

if __name__ == "__main__":
    index_folder("./documents")
```
Three details matter here. The batch size of 64 minimises Ollama API overhead. The metadata stores the original path and chunk index so query results can show context. The cosine HNSW space matches normalised embeddings, which is what nomic-embed-text produces.
For a deeper look at production indexing patterns, the official ChromaDB docs at docs.trychroma.com cover collection sharding, persistence, and migrations.
Building the Search API {#search-api}
A FastAPI service that accepts a query and returns ranked results.
```python
from fastapi import FastAPI, Query
import chromadb
import ollama

app = FastAPI()
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_collection("docs")

@app.get("/search")
def search(q: str = Query(..., min_length=2), k: int = 5):
    query_emb = ollama.embed(model="nomic-embed-text", input=[q]).embeddings[0]
    results = collection.query(
        query_embeddings=[query_emb],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        hits.append({
            "title": meta["title"],
            "path": meta["path"],
            "chunk": meta["chunk"],
            "score": 1 - dist,  # convert cosine distance to similarity
            "preview": doc[:280],
        })
    return {"query": q, "results": hits}
```
Run with `uvicorn search_api:app --port 8090` and query it with `curl "http://localhost:8090/search?q=quarterly+revenue"`.
This is already a usable search engine. Everything below makes it better.
Hybrid Search and Reranking {#hybrid}
Pure vector search misses exact matches (product codes, names, acronyms) and produces less reliable rankings than the combination of dense + sparse + reranker.
Step 1: Add BM25 Keyword Search
```python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

# Build the BM25 index alongside the vector index
all_docs = collection.get(include=["documents"])
tokenised = [word_tokenize(d.lower()) for d in all_docs["documents"]]
bm25 = BM25Okapi(tokenised)
ids = all_docs["ids"]
```
Step 2: Reciprocal Rank Fusion
```python
def rrf(rankings: list[list[str]], k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

def hybrid_search(q: str, top_k: int = 50):
    # Dense
    q_emb = ollama.embed(model="nomic-embed-text", input=[q]).embeddings[0]
    dense = collection.query(query_embeddings=[q_emb], n_results=top_k)
    dense_ids = dense["ids"][0]
    # Sparse
    bm25_scores = bm25.get_scores(word_tokenize(q.lower()))
    bm25_top = sorted(range(len(bm25_scores)), key=lambda i: -bm25_scores[i])[:top_k]
    sparse_ids = [ids[i] for i in bm25_top]
    return rrf([dense_ids, sparse_ids])[:10]
```
Step 3: Cross-Encoder Reranker (Optional)
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(q: str, candidates: list[tuple[str, float]]):
    cand_ids = [c[0] for c in candidates]
    got = collection.get(ids=cand_ids, include=["documents"])
    # Re-align documents with candidate order; get() may not preserve the requested id order
    doc_by_id = dict(zip(got["ids"], got["documents"]))
    pairs = [[q, doc_by_id[i]] for i in cand_ids]
    scores = reranker.predict(pairs)
    out = sorted(zip(cand_ids, scores), key=lambda x: -x[1])
    return [doc_id for doc_id, _ in out]
```
The cross-encoder runs on CPU at acceptable speed for top-50 reranking — about 200-400 ms wall time per query.
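One way to wire the three steps into the FastAPI service from earlier is sketched below; the `/search/hybrid` route and the response shape are illustrative choices, not something prescribed by the code above.

```python
@app.get("/search/hybrid")
def search_hybrid(q: str = Query(..., min_length=2)):
    # Dense + BM25 fused with RRF, then cross-encoder re-scoring
    candidates = hybrid_search(q)
    ranked_ids = rerank(q, candidates)
    records = collection.get(ids=ranked_ids, include=["documents", "metadatas"])
    # Re-key by id so results follow the reranked order, not get()'s return order
    by_id = {i: (d, m) for i, d, m in zip(records["ids"], records["documents"], records["metadatas"])}
    hits = []
    for rid in ranked_ids:
        doc, meta = by_id[rid]
        hits.append({"title": meta["title"], "path": meta["path"], "preview": doc[:280]})
    return {"query": q, "results": hits}
```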
Quality Improvement (Internal Benchmark)
Measured on a 50-query evaluation set against a 50K-chunk legal corpus:
| Pipeline | Recall@5 | MRR | P95 latency |
|---|---|---|---|
| Vector only | 0.61 | 0.48 | 110 ms |
| Vector + BM25 (RRF) | 0.74 | 0.58 | 145 ms |
| Vector + BM25 + Reranker | 0.83 | 0.69 | 380 ms |
The reranker adds 200 ms but lifts recall by 9 points. For interactive search, that tradeoff is almost always worth it.
Benchmarks on a 50K-Document Corpus {#benchmarks}
Hardware: single RTX 4090, AMD 7950X, 64 GB DDR5, ChromaDB on local NVMe.
Indexing Throughput
| Embedding model | Chunks/min | Total time (50K chunks) |
|---|---|---|
| nomic-embed-text | 5800 | 8 min 38 s |
| mxbai-embed-large | 4200 | 11 min 54 s |
| bge-m3 (dense only) | 2100 | 23 min 49 s |
| nomic on CPU only | 240 | 3 h 28 min |
Query Latency (p50 / p95)
| Pipeline | Embedding | Search | Total |
|---|---|---|---|
| Vector only | 32 / 58 ms | 8 / 14 ms | 41 / 78 ms |
| Vector + BM25 | 32 / 58 ms | 22 / 41 ms | 56 / 105 ms |
| Vector + BM25 + reranker (CPU) | 32 / 58 ms | 22 / 41 ms | 250 / 400 ms |
Storage Footprint
50K chunks at 768 dimensions: ~150 MB for vectors + ~600 MB for source text in ChromaDB. Total under 1 GB. The HNSW index in memory is ~200 MB. Even a 500K-chunk corpus is comfortable on 16 GB RAM.
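The vector figure is easy to sanity-check: 768 float32 values per chunk is about 3 KB, so 50K chunks of raw vectors land right around the number above.

```python
chunks, dims, bytes_per_float32 = 50_000, 768, 4
raw_vector_mb = chunks * dims * bytes_per_float32 / 1e6
print(f"{raw_vector_mb:.0f} MB of raw vectors before index overhead")  # ~154 MB
```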
For an even broader RAG pipeline including chat over the retrieved chunks, the Ollama + ChromaDB RAG pipeline is the natural follow-up.
Pitfalls {#pitfalls}
Pitfall 1: Forgetting to Normalise Vectors
Symptom: Cosine similarity returns wildly inconsistent scores.
Cause: ChromaDB with hnsw:space=cosine expects unit vectors. Some embedding models normalise output; some do not.
Fix: Either use hnsw:space=l2 (works regardless), or explicitly normalise:
```python
import numpy as np

arr = np.array(embeddings)
arr = arr / np.linalg.norm(arr, axis=1, keepdims=True)
```
Pitfall 2: PDF Extraction Producing Garbage
Symptom: Search returns nonsense — fragments of footers, page numbers, headers.
Cause: pypdf and pdfplumber both extract per-page text, including running headers.
Fix: Strip top/bottom 10 percent of each page, or use a proper layout-aware extractor like Unstructured.io or Marker.
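A rough version of the cropping fix, using pdfplumber, might look like the sketch below; the 10 percent margin is the heuristic from above, not a universal constant, so tune it against a few of your own documents.

```python
import pdfplumber

def extract_body_text(pdf_path: str, margin: float = 0.10) -> str:
    # Crop away the top and bottom of each page to drop running headers and footers
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            top = page.height * margin
            bottom = page.height * (1 - margin)
            body = page.crop((0, top, page.width, bottom))
            pages.append(body.extract_text() or "")
    return "\n\n".join(pages)
```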
Pitfall 3: Indexing the Same Document Twice
Symptom: Top-5 results for some queries are five copies of the same chunk.
Cause: Re-running the indexer without deduplication, or hash collision on document IDs.
Fix: Use upsert() instead of add() so re-indexing replaces rather than duplicates.
Pitfall 4: Embedding Model Mismatch
Symptom: Search relevance collapses after a "small" change.
Cause: Indexed with nomic-embed-text, querying with mxbai-embed-large (or vice versa). Vectors from different models are not comparable.
Fix: Store the embedding model name in collection metadata. Refuse to query if the requested model differs.
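A lightweight version of that guard, assuming the model name is recorded when the collection is first created, might look like:

```python
# At index time: record which embedding model produced the vectors
collection = client.get_or_create_collection(
    "docs",
    metadata={"hnsw:space": "cosine", "embedding_model": EMBED_MODEL},
)

# At query time: refuse to embed the query with a different model
indexed_model = (collection.metadata or {}).get("embedding_model")
if indexed_model and indexed_model != EMBED_MODEL:
    raise RuntimeError(
        f"Collection indexed with {indexed_model}; refusing to query with {EMBED_MODEL}"
    )
```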
Pitfall 5: No Evaluation Set
Symptom: Cannot tell whether changes are improving or regressing quality.
Cause: No held-out queries with known correct answers.
Fix: Build 50-100 queries with manually-judged correct chunk IDs. Run them after every meaningful change. Without this, every "improvement" is a guess.
For broader troubleshooting beyond semantic search specifics, the Ollama troubleshooting guide covers the underlying server-side issues.
Final Notes
A working private semantic search engine is a weekend project. A great one — fast, accurate, hybrid, with a reliable reranker, evaluation set, and an indexer that handles dirty PDFs cleanly — is two to three weeks. Both are dramatically less work than negotiating with a cloud vendor over a sensitive data agreement, and both leave you with infrastructure you fully control.
Pull nomic-embed-text. Index a folder of PDFs with the script above. Query it from the FastAPI service. Notice that even a naive implementation works surprisingly well. Then add BM25, then a reranker, then an evaluation set, in that order. By the time you reach the end of this guide, the search engine you have built is competitive with paid tools that charge per query, runs in your VPC, and never sends a document to anyone.
The cloud semantic search market exists because building this from scratch used to require ML expertise and dedicated infrastructure. Ollama plus ChromaDB plus 200 lines of Python removed that moat. The interesting work is no longer how to build it — it is what to do with the search engine once you have it.