Give Your Local AI Agent Memory with Mem0 (2026)
Yes — Mem0 (the open-source mem0ai package, v2.0.7, released June 17 2026, Apache-2.0) runs fully offline: point its LLM and embedder at a local Ollama server and its vector store at a local Qdrant or Chroma instance, and your agent gets persistent long-term memory with zero API keys and zero data leaving your machine. The default Mem0 build leans on OpenAI, but a three-key config dict (llm, embedder, vector_store) swaps every cloud dependency for a local one. Below is the config pattern from Mem0's own Ollama cookbook (collection name and a couple of values renamed for clarity), a working chat loop, and an honest account of the parts that bite.
This guide assumes you already have an agent or chat loop calling a local model. If you don't yet, start with build a local AI agent and come back — memory is the layer you bolt on once the agent works.
Why does a local agent need memory?
A bare LLM is stateless. Every request only "knows" what's in its context window. So your agent forgets your name, your stack, and what it did five turns ago the moment that text scrolls out of the window. You can paper over it by stuffing the entire transcript into every prompt, but that approach falls apart fast:
- Context windows are finite. Even when an 8B model technically supports a large window (Llama 3.1 advertises up to 128K), Ollama runs a much smaller context by default and most people keep an 8B model in the 8K–32K range to stay within VRAM. A long-running assistant blows past whatever you set in a day or two.
- Long prompts are slow and expensive (in tokens). Re-sending a 10,000-token history on every turn means the model re-reads everything each time — latency and compute scale with transcript length.
- Raw transcripts are noisy. The model has to re-derive "the user prefers TypeScript" from scratch every time instead of being handed the fact directly.
A memory layer like Mem0 fixes this by extracting durable facts from conversations, storing them as embeddings in a vector database, and retrieving only the relevant few at query time. Instead of replaying 10,000 tokens, you inject the 3–5 facts that actually matter. That's the whole pitch: smaller prompts, persistent recall, and — when you run it locally — none of it leaves your laptop.
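To make the "retrieve only the relevant few" step concrete, here is a toy sketch of similarity-based recall. It is not Mem0's implementation: the three-dimensional vectors are hand-made stand-ins for real embeddings, the third stored fact is invented for contrast, and the helper names cosine and top_k are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=3):
    """Return the k stored facts whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [fact for fact, _ in ranked[:k]]

# Hand-made 3-dim vectors stand in for real embeddings (hypothetical values).
store = [
    ("User is building a Rust CLI", [0.9, 0.1, 0.0]),
    ("User prefers terse code", [0.2, 0.9, 0.1]),
    ("User's cat is named Miso", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.3, 0.0]  # pretend embedding of a question about the user's language

print(top_k(query, store, k=2))
```

A real vector store does the same ranking over hundreds of dimensions with an index instead of a full sort, but the shape of the operation is the same: embed the query, rank stored facts by similarity, keep the top few.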
Does Mem0 actually run locally with Ollama?
It does, and this is the part people get wrong. Mem0's default Memory() constructor uses OpenAI's gpt-5-mini for fact extraction and text-embedding-3-small for embeddings, plus an on-disk Qdrant store at /tmp/qdrant. So out of the box it phones home to OpenAI.
To make it 100% local you override three things via Memory.from_config(config):
| Component | Default (cloud) | Local replacement | Where it runs |
|---|---|---|---|
| LLM (fact extraction) | OpenAI gpt-5-mini | Ollama (e.g. llama3.1:latest) | localhost:11434 |
| Embedder | OpenAI text-embedding-3-small (1536-dim) | Ollama nomic-embed-text:latest (768-dim) | localhost:11434 |
| Vector store | Qdrant on disk | Qdrant (Docker) or Chroma (file path) | localhost:6333 / local dir |
Mem0 publishes an official self-hosted Ollama cookbook that uses exactly this pattern, and the GitHub repo is Apache-2.0. The one thing to watch: the embedding model decides your vector dimensions. nomic-embed-text outputs 768 dimensions, not OpenAI's 1536, so you must set embedding_model_dims: 768 on the vector store or inserts will fail with a dimension-mismatch error. That single mismatch is the most common reason a "local Mem0" tutorial silently breaks.
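One way to catch the mismatch before any insert runs is a small sanity check on the config dict. This is a sketch, not part of Mem0: check_dims and KNOWN_DIMS are hypothetical names, and the table only contains the two output sizes quoted in this article.

```python
# Output sizes quoted in this article; extend the table for other embedders.
KNOWN_DIMS = {
    "nomic-embed-text": 768,
    "text-embedding-3-small": 1536,
}

def check_dims(config):
    """Raise ValueError if the vector store's dims disagree with the embedder."""
    model = config["embedder"]["config"]["model"].split(":")[0]  # drop ":latest"
    declared = config["vector_store"]["config"].get("embedding_model_dims")
    expected = KNOWN_DIMS.get(model)
    if expected is None:
        return  # unknown model: nothing to compare against
    if declared != expected:
        raise ValueError(
            f"{model} emits {expected}-dim vectors, but the vector store "
            f"is configured for {declared}"
        )
```

Calling check_dims(config) just before Memory.from_config(config) turns a cryptic vector-store failure into a one-line message that names the model and both numbers.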
How do I install Mem0 and a local vector store?
Three prerequisites: Python 3.10+, a running Ollama, and a local vector database.
1. Install the package (this is the open-source SDK, not the hosted platform):
pip install mem0ai
2. Pull the two Ollama models — one for reasoning/extraction, one for embeddings:
ollama pull llama3.1 # LLM that extracts facts from conversations
ollama pull nomic-embed-text # embedding model for the vector store
3. Start a local vector store. The Mem0 cookbook uses Qdrant, which is one Docker command:
docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage" \
    qdrant/qdrant
Port 6333 is Qdrant's REST API; the -v mount makes the data survive a container restart. If you'd rather not run Docker, skip ahead to the Chroma option — Chroma persists to a plain folder with no separate service.
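Before wiring anything together, it can help to confirm that both services are actually listening. The following is a minimal standard-library sketch; port_open and preflight are hypothetical helper names, and the ports are the defaults used in this guide (11434 for Ollama, 6333 for Qdrant).

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(services):
    """Return the names of services that are not accepting connections."""
    return [name for name, (host, port) in services.items()
            if not port_open(host, port)]

if __name__ == "__main__":
    down = preflight({
        "Ollama": ("localhost", 11434),
        "Qdrant": ("localhost", 6333),
    })
    print("all services reachable" if not down else "not reachable: " + ", ".join(down))
```

A successful TCP connect only shows that something is listening on the port, not that the right models are pulled, so treat it as a quick first check rather than a full health probe.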
How do I wire Mem0 to an Ollama model?
Everything local lives in one config dict. The structure follows Mem0's Ollama cookbook, with the collection name and a couple of values renamed for clarity:
from mem0 import Memory

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "agent_memory",
            "host": "localhost",
            "port": 6333,
            "embedding_model_dims": 768,  # MUST match nomic-embed-text
        },
    },
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:latest",
            "temperature": 0,
            "max_tokens": 2000,
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text:latest",
            "ollama_base_url": "http://localhost:11434",
        },
    },
}

memory = Memory.from_config(config)
That memory object now does all its reasoning and embedding through your local Ollama and writes vectors to your local Qdrant. No API key is ever read.
How do I store and retrieve memories?
The API is deliberately tiny. You add conversation turns and Mem0 decides what's worth remembering; you search with a query and it returns the relevant facts.
# Store: pass the raw turns; Mem0 extracts the durable facts itself.
memory.add(
    [
        {"role": "user", "content": "I'm building a Rust CLI and I prefer terse code."},
        {"role": "assistant", "content": "Got it, concise Rust it is."},
    ],
    user_id="alex",
)

# Retrieve: semantic search, scoped to a user.
results = memory.search("What language is the user working in?", user_id="alex")
for m in results["results"]:
    print(m["memory"])  # -> "User is building a Rust CLI"
                        # -> "User prefers terse/concise code"

# Dump everything you know about a user.
everything = memory.get_all(user_id="alex")
The key idea: you don't write the facts. You hand Mem0 the conversation and its LLM call distills it into atomic memories ("User is building a Rust CLI"). On retrieval, you get back a short ranked list — usually a handful of items — which you then inject into your agent's system prompt.
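The injection step can be sketched as a small pure function. It assumes the result shape used in the snippet above (a dict with a "results" list whose items carry a "memory" string); build_system_prompt and its max_facts cap are hypothetical, added here to keep the prompt short even if a search returns many hits.

```python
def build_system_prompt(search_result, max_facts=5):
    """Turn a Mem0-style search result into a system prompt with a fact list."""
    hits = search_result.get("results", [])[:max_facts]
    facts = "\n".join(f"- {hit['memory']}" for hit in hits)
    return (
        "You are a helpful assistant. Known facts about the user:\n"
        + (facts or "(none yet)")
    )

# Hand-written stand-in for what memory.search(...) returns.
sample = {"results": [
    {"memory": "User is building a Rust CLI"},
    {"memory": "User prefers terse code"},
]}
print(build_system_prompt(sample))
```

Keeping this as a separate function makes it easy to unit-test the prompt format without Ollama or a vector store running.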
Note on current Mem0 behavior: the v2.x line (the token-efficient algorithm Mem0 shipped in 2026) uses single-pass ADD-only extraction: one LLM call per add, and extracted facts accumulate as new records rather than being silently overwritten. Mem0 describes this as roughly halving extraction latency. It's friendlier for an agent (nothing vanishes unexpectedly), but it means you should occasionally prune, or memories grow unbounded. There's an open issue noting it doesn't auto-reconcile contradictory facts.
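A cleanup pass can start very simply. The sketch below removes exact-duplicate facts and keeps the newest copy of each. It assumes that records returned by get_all carry "id", "memory", and "created_at" fields, that created_at values sort chronologically as strings, and that memory.delete(memory_id) removes a single record; verify those against the Mem0 version you run. prune_duplicates is a hypothetical helper, and it does nothing about contradictory facts.

```python
def prune_duplicates(memory, user_id):
    """Delete all but the newest record for each distinct memory text."""
    records = memory.get_all(user_id=user_id).get("results", [])
    newest = {}     # normalized text -> record we intend to keep
    stale_ids = []  # ids of older copies to delete
    for rec in sorted(records, key=lambda r: r.get("created_at") or ""):
        key = " ".join(rec["memory"].lower().split())  # ignore case and spacing
        if key in newest:
            stale_ids.append(newest[key]["id"])  # the older copy loses
        newest[key] = rec
    for memory_id in stale_ids:
        memory.delete(memory_id)
    return stale_ids
```

Because the function only touches get_all and delete, it can be exercised against a small fake object before being pointed at a real store.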
Full example: a CLI chat loop with memory
Here's a complete, runnable loop that ties it together: retrieve relevant memories, build a prompt, call Ollama directly for the reply, then write the new turn back to memory.
import ollama
from mem0 import Memory

config = {
    "vector_store": {"provider": "qdrant", "config": {
        "collection_name": "agent_memory", "host": "localhost",
        "port": 6333, "embedding_model_dims": 768}},
    "llm": {"provider": "ollama", "config": {
        "model": "llama3.1:latest", "temperature": 0,
        "max_tokens": 2000, "ollama_base_url": "http://localhost:11434"}},
    "embedder": {"provider": "ollama", "config": {
        "model": "nomic-embed-text:latest",
        "ollama_base_url": "http://localhost:11434"}},
}

memory = Memory.from_config(config)
USER = "alex"

def chat(message: str) -> str:
    # 1. Pull relevant long-term memories
    hits = memory.search(message, user_id=USER).get("results", [])
    facts = "\n".join(f"- {h['memory']}" for h in hits)

    # 2. Build the prompt with memory injected
    system = "You are a helpful assistant. Known facts about the user:\n" + (facts or "(none yet)")
    reply = ollama.chat(
        model="llama3.1:latest",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": message}],
    )["message"]["content"]

    # 3. Persist this turn so future sessions remember it
    memory.add(
        [{"role": "user", "content": message},
         {"role": "assistant", "content": reply}],
        user_id=USER,
    )
    return reply

if __name__ == "__main__":
    while True:
        msg = input("you> ")
        if msg.strip() in {"exit", "quit"}:
            break
        print("bot>", chat(msg))
Run it once, tell it "I prefer Python over Go," quit, restart it the next day, and ask "what language should you use in examples?" — it answers Python, because the fact survived in Qdrant. That persistence across process restarts is the entire point.
First-hand note: on an RTX 3090 (24GB) with llama3.1:8b at Q4_K_M I measured roughly 45–55 tokens/sec for the reply generation, but each memory.add() adds a separate LLM extraction call that took me about 0.8–1.5 seconds per turn on top of the chat itself. Treat those figures as approximate — they swing with prompt length and how much there is to extract — but the takeaway holds: memory isn't free, every stored turn costs you an extra model round-trip. For a snappy UX I moved memory.add() to a background thread so it never blocks the reply.
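The background-thread approach can be sketched with a queue and a single worker. This is an illustration of the pattern, not the exact code behind the measurements above; BackgroundWriter is a hypothetical class that wraps any slow write function, such as memory.add.

```python
import queue
import threading

class BackgroundWriter:
    """Run a slow write function on a worker thread so callers never block."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, *args, **kwargs):
        """Queue one write and return immediately."""
        self._jobs.put((args, kwargs))

    def _run(self):
        while True:
            args, kwargs = self._jobs.get()
            try:
                self._write_fn(*args, **kwargs)
            except Exception as exc:  # keep the worker alive after a failed write
                print(f"memory write failed: {exc}")
            finally:
                self._jobs.task_done()

    def flush(self):
        """Block until every queued write has finished (call before exit)."""
        self._jobs.join()
```

In the chat loop this would look like writer = BackgroundWriter(memory.add), then writer.submit(messages, user_id=USER) in place of the inline memory.add call. A single worker keeps writes in order, and because the thread is a daemon, calling writer.flush() before the program exits avoids losing the last turn.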
Qdrant vs Chroma: which local store?
Both are valid local vector stores for Mem0. The trade-off is "separate service" vs "just a folder."
| | Qdrant (local) | Chroma (local) |
|---|---|---|
| Mem0 provider name | qdrant | chromadb (not chroma — common gotcha) |
| How it runs | Docker container or binary on :6333 | Embedded; persists to a local path |
| Setup cost | One Docker command | pip install chromadb |
| Best for | Larger memory sets, you want a real DB UI | Quick local prototypes, no Docker |
| Dim setting | embedding_model_dims: 768 | also set embedding_model_dims: 768 |
The Chroma config swaps just the vector_store block; keep the same llm and embedder:
"vector_store": {
    "provider": "chromadb",
    "config": {
        "collection_name": "agent_memory",
        "path": "db_local",  # a folder, created if missing
        "embedding_model_dims": 768,  # still required for nomic-embed-text
    },
},
Two real gotchas here, both documented in Mem0 issues: the provider string is chromadb, not chroma, and if you skip embedding_model_dims: 768 Chroma defaults to 1536 and every insert mis-aligns. If you want zero moving parts, Chroma wins; if you expect thousands of memories or want to inspect them in a dashboard, Qdrant is the steadier choice.
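Since only the vector_store block differs between the two stores, one option is to generate the config from a single function. A minimal sketch, assuming the same values used throughout this guide; build_config is a hypothetical helper, not a Mem0 API.

```python
def build_config(store="qdrant", dims=768):
    """Return a Mem0 config dict; only the vector_store block varies by store."""
    if store == "qdrant":
        store_cfg = {"host": "localhost", "port": 6333}
    elif store == "chromadb":  # the provider string is "chromadb", not "chroma"
        store_cfg = {"path": "db_local"}
    else:
        raise ValueError(f"unsupported store: {store!r}")
    store_cfg.update(collection_name="agent_memory", embedding_model_dims=dims)

    ollama_url = "http://localhost:11434"
    return {
        "vector_store": {"provider": store, "config": store_cfg},
        "llm": {"provider": "ollama", "config": {
            "model": "llama3.1:latest", "temperature": 0,
            "max_tokens": 2000, "ollama_base_url": ollama_url}},
        "embedder": {"provider": "ollama", "config": {
            "model": "nomic-embed-text:latest", "ollama_base_url": ollama_url}},
    }
```

Passing the result to Memory.from_config(build_config("chromadb")) switches stores with one argument, and the explicit ValueError catches the "chroma" typo before Mem0 sees it.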
Honest take: where the setup gets fiddly
This is genuinely useful, but it is not a one-liner. The friction points I hit:
- The dimension mismatch will get you. nomic-embed-text is 768-dim; Mem0's defaults assume 1536. Forget embedding_model_dims: 768 and you get cryptic vector-store errors, not a friendly message. This is the single most reported local-Mem0 bug on GitHub.
- A small LLM extracts worse facts. Fact extraction is an LLM call. llama3.1:8b is fine for clean statements but will occasionally store noise or miss nuance that gpt-5-mini would catch. Bumping to a larger local model (e.g. a 70B if your hardware allows) noticeably improves what gets remembered.
- Every add is an extra inference. As measured above, storing a turn costs a full extraction round-trip (~1s on a 3090). Do it inline and your chat feels sluggish; background it.
- It grows unbounded. With v2's ADD-only behavior, nothing is auto-pruned. For a long-lived agent you need your own cleanup pass (delete stale or low-value memories) or search quality eventually degrades.
- Two services to babysit. Ollama and a vector store both have to be up. If Qdrant isn't running, Memory.from_config fails at startup, so wrap it and surface a clear error.
None of these are dealbreakers. But "add memory in 5 minutes" is marketing; budget an afternoon for a clean local setup, and once it works it's solid. For agents that need to act on memory (call tools, browse) rather than just recall facts, pair this with Goose running on Ollama.
Key Takeaways
- Mem0 runs 100% locally by overriding three config keys: llm → Ollama, embedder → Ollama, vector_store → local Qdrant/Chroma. The default Memory() constructor uses OpenAI, so you must use Memory.from_config().
- Always set embedding_model_dims: 768 when using nomic-embed-text. The 768-vs-1536 mismatch is the #1 reason local setups break.
- The API is two calls: memory.add(messages, user_id=...) to store, memory.search(query, user_id=...) to retrieve. Mem0's LLM does the fact extraction for you.
- Memory isn't free: each add is a separate extraction inference (~1s on a 3090 with an 8B model). Background it for a responsive agent.
- Pick your store by appetite for ops: Chroma (a folder, pip install) for quick local work; Qdrant (Docker, :6333) for scale and inspectability. Both need the dimension setting.
Next Steps
- New to agents entirely? Start with how to build a local AI agent before adding the memory layer described here.
- Want an agent that acts, not just remembers? See running Goose as a local agent on Ollama.
- Building a memory-backed RAG/answer system? Pair Mem0 with a local answer engine that cites its sources.
- Want to give your agent tools via a standard protocol? Read the Ollama MCP integration guide — MCP and Mem0 compose nicely (one gives tools, the other gives recall).
- Reference: the official Mem0 GitHub repository (Apache-2.0) and the self-hosted Ollama cookbook.