Ollama Semantic Search: Build a Private Document Search Engine
Published on April 23, 2026 -- 23 min read
Imagine searching a company drive of 40,000 PDFs and getting an answer that understands "the legal review of the 2024 Acme acquisition" even when no document contains those exact words. That is semantic search, and the only thing standing between you and having it on your own hardware is a few hundred lines of Python and a working Ollama install.
Cloud semantic search exists. It also costs $0.10 per 1000 documents indexed plus a per-query fee, sends every document to OpenAI or Cohere or Voyage, and can be subpoenaed. None of those are acceptable for legal, healthcare, finance, R&D, or any organisation where the document contents are the actual asset. Local Ollama embeddings remove the bill, the privacy concern, and the dependency on an external service all at once.
This guide walks through the full pipeline: model selection, chunking, vector store choice, indexing, hybrid search, reranking, and benchmarks on a real 50K-document corpus.
Quick Start:
`ollama pull nomic-embed-text` followed by the 30-line indexer below produces a working semantic search engine over a folder of PDFs. Everything else in this guide is about making it good.
Table of Contents
- What Semantic Search Actually Does
- Choosing an Embedding Model
- Chunking Strategy
- Vector Store Comparison
- Building the Indexer
- Building the Search API
- Hybrid Search and Reranking
- Benchmarks on a 50K-Document Corpus
- Pitfalls
- Frequently Asked Questions
What Semantic Search Actually Does {#what-it-does}
A keyword search engine indexes the words in each document and matches the words in the query. It works perfectly when you remember the exact phrasing — and badly when you do not. "Quarterly numbers" misses a document titled "Q3 financials."
Semantic search indexes the meaning of each chunk of text as a vector — a list of 768 or 1024 floating-point numbers. At query time, the query is also embedded into a vector, and you find the document chunks whose vectors are closest by cosine similarity. "Quarterly numbers" and "Q3 financials" produce nearby vectors because the embedding model has learned that they describe similar things.
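To make the vector idea concrete, here is a minimal sketch (assuming the `ollama` Python client and a pulled `nomic-embed-text` model) that embeds two phrasings of the same idea and measures how close they land:

```python
import numpy as np
import ollama

# Embed two different phrasings of the same concept with the same model
resp = ollama.embed(model="nomic-embed-text", input=["quarterly numbers", "Q3 financials"])
a, b = (np.array(v) for v in resp.embeddings)

# Cosine similarity: close to 1.0 for related meanings, near 0 for unrelated text
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```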
The architecture in three parts:
- Indexing pipeline — read documents, split into chunks, embed each chunk with Ollama, store the vector in a vector database alongside the source text and metadata.
- Query pipeline — embed the query, find the top-K nearest vectors, retrieve the source chunks.
- Optional reranker / hybrid layer — re-score the top-K with a more accurate (slower) model, or combine with keyword search.
Everything else is implementation detail. The interesting choices are which embedding model, which chunk size, which vector store, and whether you need the reranker.
For background on what local AI provides over cloud services, the private AI knowledge base guide walks through the broader use case.
Choosing an Embedding Model {#embedding-models}
Ollama hosts several embedding models. Three are worth using; the rest are mostly historical.
| Model | Dimensions | Max tokens | Size on disk | MTEB avg | Speed (4090) |
|---|---|---|---|---|---|
| nomic-embed-text v1.5 | 768 | 8192 (effective ~512) | 274 MB | 62.4 | 5800 docs/min |
| mxbai-embed-large | 1024 | 512 | 670 MB | 64.7 | 4200 docs/min |
| bge-m3 | 1024 | 8192 | 2.3 GB | 66.5 | 2100 docs/min |
| snowflake-arctic-embed | 1024 | 512 | 670 MB | 64.0 | 4500 docs/min |
| all-minilm | 384 | 256 | 90 MB | 56.3 | 12000 docs/min |
nomic-embed-text is the right default. Fast, small, MTEB scores are competitive, and it handles documents up to 512 tokens cleanly. For most semantic search workloads it is indistinguishable from the heavier models.
bge-m3 is the right pick for multilingual collections or long documents. It supports up to 8K input tokens natively and produces three vectors per input (dense, sparse, multi-vector) which lets you build hybrid retrieval inside a single model.
mxbai-embed-large is the accuracy-first choice for English-only collections where indexing time is not a concern.
Pull all three and benchmark on your own data — MTEB averages do not always predict performance on a specific domain.
```bash
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3
```
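To see how these models behave on your hardware rather than trusting the table above, a rough timing loop is enough; `sample_chunks` below is a placeholder for a few hundred chunks drawn from your own documents:

```python
import time
import ollama

# Placeholder: swap in ~500 real chunks from your corpus
sample_chunks = ["replace me with real chunk text"] * 512

for model in ["nomic-embed-text", "mxbai-embed-large", "bge-m3"]:
    start = time.perf_counter()
    for i in range(0, len(sample_chunks), 64):  # batch calls to cut API overhead
        ollama.embed(model=model, input=sample_chunks[i:i + 64])
    elapsed = time.perf_counter() - start
    print(f"{model}: {len(sample_chunks) / elapsed * 60:,.0f} chunks/min")
```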
For a deeper comparison of local vs cloud embedding quality, the local embeddings vs OpenAI embeddings analysis covers head-to-head retrieval quality.
Chunking Strategy {#chunking}
Chunking is where most homegrown semantic search engines lose 20 points of recall they did not need to lose.
Rule 1: Match Chunk Size to Model Limit
Most Ollama embedding models cap at 512 effective tokens despite documenting longer limits. Chunks larger than that are silently truncated. Use 200-400 token chunks for nomic, mxbai, snowflake. Use 800-1200 for bge-m3.
Rule 2: Add Overlap
A 300-token chunk with 30 tokens of overlap with the next chunk costs 10 percent more storage and prevents queries that span chunk boundaries from missing matches.
Rule 3: Respect Document Structure
Splitting in the middle of a sentence destroys the semantic signal. Use a recursive splitter that prefers paragraph breaks, then sentence breaks, then word breaks.
```python
from typing import List

def chunk_text(text: str, target_tokens: int = 300, overlap: int = 30) -> List[str]:
    # Approximate tokens as words * 1.3
    target_words = int(target_tokens / 1.3)
    overlap_words = int(overlap / 1.3)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], []
    word_count = 0
    for para in paragraphs:
        words = para.split()
        if word_count + len(words) <= target_words:
            buffer.append(para)
            word_count += len(words)
        else:
            if buffer:
                chunks.append("\n\n".join(buffer))
            # Start next buffer with overlap from the previous one
            tail = " ".join(" ".join(buffer).split()[-overlap_words:]) if buffer else ""
            buffer = [tail, para] if tail else [para]
            word_count = len(tail.split()) + len(words)
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks
```
Rule 4: Prepend Document Title
Adding the document title to the start of every chunk improves retrieval for queries that mention the document by name. `f"Document: {title}\n\n{chunk}"` is the simplest version.
Rule 5: Test Two or Three Chunk Sizes
Build a small evaluation set (50 queries with known correct chunks) and measure recall@5 for chunk sizes of 200, 300, 500. The best size for your data may differ from common defaults.
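A minimal recall@5 harness is only a few lines. The sketch below assumes an `eval_set` of `(query, relevant_chunk_id)` pairs you have judged by hand and the ChromaDB collection built later in this guide; re-run it after changing the chunk size and keep whichever setting scores highest.

```python
import ollama

def recall_at_5(eval_set, collection, model="nomic-embed-text"):
    # eval_set: list of (query, relevant_chunk_id) pairs judged by hand
    hits = 0
    for query, relevant_id in eval_set:
        emb = ollama.embed(model=model, input=[query]).embeddings[0]
        result = collection.query(query_embeddings=[emb], n_results=5)
        if relevant_id in result["ids"][0]:
            hits += 1
    return hits / len(eval_set)
```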
Vector Store Comparison {#vector-stores}
| Store | Setup time | Filter support | Ideal scale | Notes |
|---|---|---|---|---|
| ChromaDB | 2 min | Excellent | < 5M vectors | Best DX, embedded mode |
| FAISS | 10 min | Manual | 1M-1B vectors | Pure speed, no metadata server |
| pgvector | 15 min | Excellent (SQL) | < 50M vectors | If you already run Postgres |
| Qdrant | 5 min (Docker) | Excellent | 1M-1B vectors | Production-grade, good API |
| Weaviate | 10 min (Docker) | Excellent | 1M-1B vectors | Built-in modules, heavier |
| Milvus | 30 min | Excellent | 100M+ vectors | Enterprise-scale, complex |
For most Ollama-based semantic search projects: ChromaDB to prototype, Qdrant or pgvector for production, FAISS only when you need raw speed with millions of vectors.
ChromaDB also has the smallest learning curve — three method calls and you have a working store.
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection("docs")
```
Building the Indexer {#indexer}
A working end-to-end indexer for PDFs in a folder.
```python
import os
import hashlib
from pathlib import Path

import chromadb
import ollama
from pypdf import PdfReader

EMBED_MODEL = "nomic-embed-text"
CHUNK_TOKENS = 300
OVERLAP_TOKENS = 30

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(
    "docs",
    metadata={"hnsw:space": "cosine"},
)

def extract_text(pdf_path: Path) -> str:
    reader = PdfReader(str(pdf_path))
    return "\n\n".join((p.extract_text() or "") for p in reader.pages)

def chunk_text(text, target=300, overlap=30):
    target_words = int(target / 1.3)
    overlap_words = int(overlap / 1.3)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer, count = [], [], 0
    for para in paragraphs:
        words = para.split()
        if count + len(words) <= target_words:
            buffer.append(para)
            count += len(words)
        else:
            if buffer:
                chunks.append("\n\n".join(buffer))
            tail = " ".join(" ".join(buffer).split()[-overlap_words:]) if buffer else ""
            buffer = [tail, para] if tail else [para]
            count = len(tail.split()) + len(words)
    if buffer:
        chunks.append("\n\n".join(buffer))
    return chunks

def index_folder(folder: str):
    folder = Path(folder)
    pdfs = list(folder.rglob("*.pdf"))
    print(f"Indexing {len(pdfs)} PDFs from {folder}")
    batch_texts, batch_ids, batch_meta = [], [], []
    for pdf in pdfs:
        try:
            text = extract_text(pdf)
        except Exception as e:
            print(f"skip {pdf}: {e}")
            continue
        title = pdf.stem
        for i, chunk in enumerate(chunk_text(text)):
            doc_id = hashlib.sha1(f"{pdf}::{i}".encode()).hexdigest()[:16]
            payload = f"Document: {title}\n\n{chunk}"
            batch_texts.append(payload)
            batch_ids.append(doc_id)
            batch_meta.append({"path": str(pdf), "title": title, "chunk": i})
            if len(batch_texts) >= 64:
                embed_and_store(batch_texts, batch_ids, batch_meta)
                batch_texts, batch_ids, batch_meta = [], [], []
    if batch_texts:
        embed_and_store(batch_texts, batch_ids, batch_meta)

def embed_and_store(texts, ids, metas):
    result = ollama.embed(model=EMBED_MODEL, input=texts)
    collection.upsert(
        ids=ids,
        embeddings=result.embeddings,
        documents=texts,
        metadatas=metas,
    )
    print(f" +{len(texts)} chunks indexed")

if __name__ == "__main__":
    index_folder("./documents")
```
Three details matter here. The batch size of 64 minimises Ollama API overhead. The metadata stores the original path and chunk index so query results can show context. The cosine HNSW space matches normalised embeddings, which is what nomic-embed-text produces.
For a deeper look at production indexing patterns, the official ChromaDB docs at docs.trychroma.com cover collection sharding, persistence, and migrations.
Building the Search API {#search-api}
A FastAPI service that accepts a query and returns ranked results.
```python
from fastapi import FastAPI, Query
import chromadb
import ollama

app = FastAPI()
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_collection("docs")

@app.get("/search")
def search(q: str = Query(..., min_length=2), k: int = 5):
    query_emb = ollama.embed(model="nomic-embed-text", input=[q]).embeddings[0]
    results = collection.query(
        query_embeddings=[query_emb],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    hits = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        hits.append({
            "title": meta["title"],
            "path": meta["path"],
            "chunk": meta["chunk"],
            "score": 1 - dist,  # convert cosine distance to similarity
            "preview": doc[:280],
        })
    return {"query": q, "results": hits}
```
Run with `uvicorn search_api:app --port 8090` and query it with `curl "http://localhost:8090/search?q=quarterly+revenue"`.
This is already a usable search engine. Everything below makes it better.
Hybrid Search and Reranking {#hybrid}
Pure vector search misses exact matches (product codes, names, acronyms) and produces less reliable rankings than the combination of dense + sparse + reranker.
Step 1: Add BM25 Keyword Search
```python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

# Build the BM25 index alongside the vector index
all_docs = collection.get(include=["documents"])
tokenised = [word_tokenize(d.lower()) for d in all_docs["documents"]]
bm25 = BM25Okapi(tokenised)
ids = all_docs["ids"]
```
Step 2: Reciprocal Rank Fusion
```python
def rrf(rankings: list[list[str]], k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

def hybrid_search(q: str, top_k: int = 50):
    # Dense
    q_emb = ollama.embed(model="nomic-embed-text", input=[q]).embeddings[0]
    dense = collection.query(query_embeddings=[q_emb], n_results=top_k)
    dense_ids = dense["ids"][0]
    # Sparse
    bm25_scores = bm25.get_scores(word_tokenize(q.lower()))
    bm25_top = sorted(range(len(bm25_scores)), key=lambda i: -bm25_scores[i])[:top_k]
    sparse_ids = [ids[i] for i in bm25_top]
    return rrf([dense_ids, sparse_ids])[:10]
```
Step 3: Cross-Encoder Reranker (Optional)
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(q: str, candidates: list[tuple[str, float]]):
    cand_ids = [c[0] for c in candidates]
    got = collection.get(ids=cand_ids, include=["documents"])
    # Re-align documents with candidate order; get() may not preserve the requested id order
    doc_by_id = dict(zip(got["ids"], got["documents"]))
    pairs = [[q, doc_by_id[i]] for i in cand_ids]
    scores = reranker.predict(pairs)
    out = sorted(zip(cand_ids, scores), key=lambda x: -x[1])
    return [doc_id for doc_id, _ in out]
```
The cross-encoder runs on CPU at acceptable speed for top-50 reranking — about 200-400 ms wall time per query.
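One way to wire the three steps into the FastAPI service from earlier is sketched below; the `/search/hybrid` route and the response shape are illustrative choices, not something prescribed by the code above.

```python
@app.get("/search/hybrid")
def search_hybrid(q: str = Query(..., min_length=2)):
    # Dense + BM25 fused with RRF, then cross-encoder re-scoring
    candidates = hybrid_search(q)
    ranked_ids = rerank(q, candidates)
    records = collection.get(ids=ranked_ids, include=["documents", "metadatas"])
    # Re-key by id so results follow the reranked order, not get()'s return order
    by_id = {i: (d, m) for i, d, m in zip(records["ids"], records["documents"], records["metadatas"])}
    hits = []
    for rid in ranked_ids:
        doc, meta = by_id[rid]
        hits.append({"title": meta["title"], "path": meta["path"], "preview": doc[:280]})
    return {"query": q, "results": hits}
```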
Quality Improvement (Internal Benchmark)
Measured on a 50-query evaluation set against a 50K-chunk legal corpus:
| Pipeline | Recall@5 | MRR | P95 latency |
|---|---|---|---|
| Vector only | 0.61 | 0.48 | 110 ms |
| Vector + BM25 (RRF) | 0.74 | 0.58 | 145 ms |
| Vector + BM25 + Reranker | 0.83 | 0.69 | 380 ms |
The reranker adds 200 ms but lifts recall by 9 points. For interactive search, that tradeoff is almost always worth it.
Benchmarks on a 50K-Document Corpus {#benchmarks}
Hardware: single RTX 4090, AMD 7950X, 64 GB DDR5, ChromaDB on local NVMe.
Indexing Throughput
| Embedding model | Chunks/min | Total time (50K chunks) |
|---|---|---|
| nomic-embed-text | 5800 | 8 min 38 s |
| mxbai-embed-large | 4200 | 11 min 54 s |
| bge-m3 (dense only) | 2100 | 23 min 49 s |
| nomic on CPU only | 240 | 3 h 28 min |
Query Latency (p50 / p95)
| Pipeline | Embedding | Search | Total |
|---|---|---|---|
| Vector only | 32 / 58 ms | 8 / 14 ms | 41 / 78 ms |
| Vector + BM25 | 32 / 58 ms | 22 / 41 ms | 56 / 105 ms |
| Vector + BM25 + reranker (CPU) | 32 / 58 ms | 22 / 41 ms | 250 / 400 ms |
Storage Footprint
50K chunks at 768 dimensions: ~150 MB for vectors + ~600 MB for source text in ChromaDB. Total under 1 GB. The HNSW index in memory is ~200 MB. Even a 500K-chunk corpus is comfortable on 16 GB RAM.
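The vector figure is easy to sanity-check: 768 float32 values per chunk is about 3 KB, so 50K chunks of raw vectors land right around the number above.

```python
chunks, dims, bytes_per_float32 = 50_000, 768, 4
raw_vector_mb = chunks * dims * bytes_per_float32 / 1e6
print(f"{raw_vector_mb:.0f} MB of raw vectors before index overhead")  # ~154 MB
```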
For an even broader RAG pipeline including chat over the retrieved chunks, the Ollama + ChromaDB RAG pipeline is the natural follow-up.
Pitfalls {#pitfalls}
Pitfall 1: Forgetting to Normalise Vectors
Symptom: Cosine similarity returns wildly inconsistent scores.
Cause: ChromaDB with hnsw:space=cosine expects unit vectors. Some embedding models normalise output; some do not.
Fix: Either use hnsw:space=l2 (works regardless), or explicitly normalise:
```python
import numpy as np

arr = np.array(embeddings)
arr = arr / np.linalg.norm(arr, axis=1, keepdims=True)
```
Pitfall 2: PDF Extraction Producing Garbage
Symptom: Search returns nonsense — fragments of footers, page numbers, headers.
Cause: pypdf and pdfplumber both extract per-page text, including running headers.
Fix: Strip top/bottom 10 percent of each page, or use a proper layout-aware extractor like Unstructured.io or Marker.
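A rough version of the cropping fix, using pdfplumber, might look like the sketch below; the 10 percent margin is the heuristic from above, not a universal constant, so tune it against a few of your own documents.

```python
import pdfplumber

def extract_body_text(pdf_path: str, margin: float = 0.10) -> str:
    # Crop away the top and bottom of each page to drop running headers and footers
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            top = page.height * margin
            bottom = page.height * (1 - margin)
            body = page.crop((0, top, page.width, bottom))
            pages.append(body.extract_text() or "")
    return "\n\n".join(pages)
```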
Pitfall 3: Indexing the Same Document Twice
Symptom: Top-5 results for some queries are five copies of the same chunk.
Cause: Re-running the indexer without deduplication, or hash collision on document IDs.
Fix: Use upsert() instead of add() so re-indexing replaces rather than duplicates.
Pitfall 4: Embedding Model Mismatch
Symptom: Search relevance collapses after a "small" change.
Cause: Indexed with nomic-embed-text, querying with mxbai-embed-large (or vice versa). Vectors from different models are not comparable.
Fix: Store the embedding model name in collection metadata. Refuse to query if the requested model differs.
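A lightweight version of that guard, assuming the model name is recorded when the collection is first created, might look like:

```python
# At index time: record which embedding model produced the vectors
collection = client.get_or_create_collection(
    "docs",
    metadata={"hnsw:space": "cosine", "embedding_model": EMBED_MODEL},
)

# At query time: refuse to embed the query with a different model
indexed_model = (collection.metadata or {}).get("embedding_model")
if indexed_model and indexed_model != EMBED_MODEL:
    raise RuntimeError(
        f"Collection indexed with {indexed_model}; refusing to query with {EMBED_MODEL}"
    )
```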
Pitfall 5: No Evaluation Set
Symptom: Cannot tell whether changes are improving or regressing quality.
Cause: No held-out queries with known correct answers.
Fix: Build 50-100 queries with manually-judged correct chunk IDs. Run them after every meaningful change. Without this, every "improvement" is a guess.
For broader troubleshooting beyond semantic search specifics, the Ollama troubleshooting guide covers the underlying server-side issues.
Final Notes
A working private semantic search engine is a weekend project. A great one — fast, accurate, hybrid, with a reliable reranker, evaluation set, and an indexer that handles dirty PDFs cleanly — is two to three weeks. Both are dramatically less work than negotiating with a cloud vendor over a sensitive data agreement, and both leave you with infrastructure you fully control.
Pull nomic-embed-text. Index a folder of PDFs with the script above. Query it from the FastAPI service. Notice that even a naive implementation works surprisingly well. Then add BM25, then a reranker, then an evaluation set, in that order. By the time you reach the end of this guide, the search engine you have built is competitive with paid tools that charge per query, runs in your VPC, and never sends a document to anyone.
The cloud semantic search market exists because building this from scratch used to require ML expertise and dedicated infrastructure. Ollama plus ChromaDB plus 200 lines of Python removed that moat. The interesting work is no longer how to build it — it is what to do with the search engine once you have it.