Yes. Mem0 (the open-source mem0ai package, v2.0.7, released June 17, 2026, Apache-2.0) runs fully offline: point its LLM and embedder at a local Ollama server and its vector store at a local Qdrant or Chroma instance, and your agent gets persistent long-term memory with zero API keys and zero data leaving your machine. The default Mem0 build leans on OpenAI, but a three-key config dict (llm, embedder, vector_store) swaps every cloud dependency for a local one. Below is the config pattern from Mem0's own Ollama cookbook (collection name and a couple of values renamed for clarity), a working chat loop, and an honest account of the parts that bite.

This guide assumes you already have an agent or chat loop calling a local model. If you don't yet, start with the guide to building a local AI agent and come back; memory is the layer you bolt on once the agent works.

Why does a local agent need memory?

A bare LLM is stateless. Every request only "knows" what's in its context window. So your agent forgets your name, your stack, and what it did five turns ago the moment that text scrolls out of the window. You can paper over it by stuffing the entire transcript into every prompt, but that approach falls apart fast:

Context windows are finite. Even when an 8B model technically supports a large window (Llama 3.1 advertises up to 128K), Ollama runs a much smaller context by default and most people keep an 8B model in the 8K–32K range to stay within VRAM. A long-running assistant blows past whatever you set in a day or two.
Long prompts are slow and expensive (in tokens). Re-sending a 10,000-token history on every turn means the model re-reads everything each time — latency and compute scale with transcript length.
Raw transcripts are noisy. The model has to re-derive "the user prefers TypeScript" from scratch every time instead of being handed the fact directly.
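
To put rough numbers on that scaling, here is a back-of-the-envelope sketch. The figures (50 turns of about 200 tokens each, and a 100-token block of injected facts) are illustrative assumptions, not measurements.

```python
def replay_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens sent when every turn re-sends the whole transcript so far."""
    # Turn k re-sends k turns of text, so the total is (1 + 2 + ... + turns) * tokens_per_turn.
    return tokens_per_turn * turns * (turns + 1) // 2


def memory_tokens(turns: int, tokens_per_turn: int, fact_tokens: int) -> int:
    """Total prompt tokens when each turn sends only the new message plus a fixed block of facts."""
    return turns * (tokens_per_turn + fact_tokens)


print(replay_tokens(50, 200))       # 255000
print(memory_tokens(50, 200, 100))  # 15000
```

Replay cost grows with the square of the conversation length, while the memory approach grows linearly.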

A memory layer like Mem0 fixes this by extracting durable facts from conversations, storing them as embeddings in a vector database, and retrieving only the relevant few at query time. Instead of replaying 10,000 tokens, you inject the 3–5 facts that actually matter. That's the whole pitch: smaller prompts, persistent recall, and — when you run it locally — none of it leaves your laptop.
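
The retrieval step can be pictured with a toy example. This is not Mem0's implementation: the three-number vectors and the stored facts below are made up for illustration, whereas a real embedder such as nomic-embed-text produces 768 numbers per text. The idea is the same, though: score every stored fact against the query vector and keep the closest few.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored facts most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical stored facts with hand-written toy embeddings.
store = [
    ("User is building a Rust CLI", [0.9, 0.1, 0.0]),
    ("User prefers terse code", [0.2, 0.9, 0.1]),
    ("User's cat is named Miso", [0.0, 0.1, 0.9]),
]

query = [1.0, 0.2, 0.0]  # stands in for the embedding of "what language is the user using?"
print(top_k(query, store, k=2))
```

Only the two coding-related facts come back; the unrelated one never reaches the prompt.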

Does Mem0 actually run locally with Ollama?

It does, and this is the part people get wrong. Mem0's default Memory() constructor uses OpenAI's gpt-5-mini for fact extraction and text-embedding-3-small for embeddings, plus an on-disk Qdrant store at /tmp/qdrant. So out of the box it phones home to OpenAI.

To make it 100% local you override three things via Memory.from_config(config):

Component	Default (cloud)	Local replacement	Where it runs
LLM (fact extraction)	OpenAI `gpt-5-mini`	Ollama (e.g. `llama3.1:latest`)	`localhost:11434`
Embedder	OpenAI `text-embedding-3-small` (1536-dim)	Ollama `nomic-embed-text:latest` (768-dim)	`localhost:11434`
Vector store	Qdrant on disk	Qdrant (Docker) or Chroma (file path)	`localhost:6333` / local dir

Mem0 publishes an official self-hosted Ollama cookbook that uses exactly this pattern, and the GitHub repo is Apache-2.0. The one thing to watch: the embedding model decides your vector dimensions. nomic-embed-text outputs 768 dimensions, not the 1536 of OpenAI's text-embedding-3-small, so you must set embedding_model_dims: 768 on the vector store or inserts will fail with a dimension-mismatch error. That single mismatch is the most common reason a "local Mem0" tutorial breaks.
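
One way to catch the mismatch before it reaches the vector store is a small guard that runs ahead of Memory.from_config. The sketch below is a hypothetical helper, not part of Mem0: check_dims and KNOWN_DIMS are names invented here, and the table only covers the two embedders mentioned in this guide.

```python
# Output sizes for the two embedders named in this guide.
KNOWN_DIMS = {
    "nomic-embed-text": 768,
    "text-embedding-3-small": 1536,
}


def check_dims(config: dict) -> None:
    """Raise ValueError if the vector store's dims disagree with the embedder's known size."""
    model = config["embedder"]["config"]["model"].split(":")[0]  # drop a tag such as ":latest"
    expected = KNOWN_DIMS.get(model)
    if expected is None:
        return  # unknown embedder: nothing to compare against
    actual = config["vector_store"]["config"].get("embedding_model_dims")
    if actual != expected:
        raise ValueError(
            f"{model} outputs {expected} dims, but vector_store is set to {actual}"
        )
```

Calling check_dims(config) just before building the Memory object turns a cryptic store error into a readable message.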

How do I install Mem0 and a local vector store?

Three prerequisites: Python 3.10+, a running Ollama, and a local vector database.

1. Install the package (this is the open-source SDK, not the hosted platform):

pip install mem0ai

2. Pull the two Ollama models — one for reasoning/extraction, one for embeddings:

ollama pull llama3.1          # LLM that extracts facts from conversations
ollama pull nomic-embed-text  # embedding model for the vector store

3. Start a local vector store. The Mem0 cookbook uses Qdrant, which is one Docker command:

docker run -p 6333:6333 -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant

Port 6333 is Qdrant's REST API and 6334 is its gRPC port; the -v mount makes the data survive a container restart. If you'd rather not run Docker, skip ahead to the Chroma option: Chroma persists to a plain folder with no separate service.

How do I wire Mem0 to an Ollama model?

Everything local lives in one config dict. This follows the structure in Mem0's Ollama cookbook, with the collection name and a couple of values renamed:

from mem0 import Memory

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "agent_memory",
            "host": "localhost",
            "port": 6333,
            "embedding_model_dims": 768,   # MUST match nomic-embed-text
        },
    },
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:latest",
            "temperature": 0,
            "max_tokens": 2000,
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text:latest",
            "ollama_base_url": "http://localhost:11434",
        },
    },
}

memory = Memory.from_config(config)

That memory object now does all its reasoning and embedding through your local Ollama and writes vectors to your local Qdrant. No API key is ever read.

How do I store and retrieve memories?

The API is deliberately tiny. You add conversation turns and Mem0 decides what's worth remembering; you search with a query and it returns the relevant facts.

# Store — pass the raw turns; Mem0 extracts the durable facts itself.
memory.add(
    [
        {"role": "user", "content": "I'm building a Rust CLI and I prefer terse code."},
        {"role": "assistant", "content": "Got it — concise Rust it is."},
    ],
    user_id="alex",
)

# Retrieve — semantic search, scoped to a user.
results = memory.search("What language is the user working in?", user_id="alex")
for m in results["results"]:
    print(m["memory"])     # -> "User is building a Rust CLI"
                           # -> "User prefers terse/concise code"

# Dump everything you know about a user.
everything = memory.get_all(user_id="alex")

The key idea: you don't write the facts. You hand Mem0 the conversation and its LLM call distills it into atomic memories ("User is building a Rust CLI"). On retrieval, you get back a short ranked list — usually a handful of items — which you then inject into your agent's system prompt.

Note on current Mem0 behavior: the v2.x line (the token-efficient algorithm Mem0 shipped in 2026) uses single-pass ADD-only extraction — one LLM call per add, and extracted facts accumulate as new records rather than being silently overwritten. Mem0 describes this as roughly halving extraction latency. It's friendlier for an agent (nothing vanishes unexpectedly), but it means you should occasionally prune, or memories grow unbounded — there's an open issue noting it doesn't auto-reconcile contradictory facts.
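
A pruning pass can be as simple as deleting memories past a certain age. The sketch below is one crude policy, not something Mem0 ships. It assumes each record returned by get_all carries an "id" and an ISO-8601 "created_at" string; check the records your Mem0 version actually returns before relying on those field names. stale_ids is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def stale_ids(records, max_age_days, now=None):
    """Return the ids of records whose created_at is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    ids = []
    for rec in records:
        created = datetime.fromisoformat(rec["created_at"])
        if created.tzinfo is None:
            created = created.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
        if created < cutoff:
            ids.append(rec["id"])
    return ids


# Intended use with the memory object from this guide (assumes get_all returns
# {"results": [...]} like search does, and that delete takes a memory id):
#
#   for mem_id in stale_ids(memory.get_all(user_id="alex")["results"], max_age_days=90):
#       memory.delete(memory_id=mem_id)
```

Age alone is a blunt criterion; a long-lived agent may prefer to keep old but frequently retrieved facts and drop contradicted ones instead.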

Full example: a CLI chat loop with memory

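Here's a complete loop that ties it together: retrieve relevant memories, build a prompt, call Ollama directly for the reply, then write the new turn back to memory. It imports the ollama Python client, so run pip install ollama first if it isn't already in your environment.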

import ollama
from mem0 import Memory

config = {
    "vector_store": {"provider": "qdrant", "config": {
        "collection_name": "agent_memory", "host": "localhost",
        "port": 6333, "embedding_model_dims": 768}},
    "llm": {"provider": "ollama", "config": {
        "model": "llama3.1:latest", "temperature": 0,
        "max_tokens": 2000, "ollama_base_url": "http://localhost:11434"}},
    "embedder": {"provider": "ollama", "config": {
        "model": "nomic-embed-text:latest",
        "ollama_base_url": "http://localhost:11434"}},
}

memory = Memory.from_config(config)
USER = "alex"

def chat(message: str) -> str:
    # 1. Pull relevant long-term memories
    hits = memory.search(message, user_id=USER).get("results", [])
    facts = "\n".join(f"- {h['memory']}" for h in hits)

    # 2. Build the prompt with memory injected
    system = "You are a helpful assistant. Known facts about the user:\n" + (facts or "(none yet)")
    reply = ollama.chat(
        model="llama3.1:latest",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": message}],
    )["message"]["content"]

    # 3. Persist this turn so future sessions remember it
    memory.add(
        [{"role": "user", "content": message},
         {"role": "assistant", "content": reply}],
        user_id=USER,
    )
    return reply

if __name__ == "__main__":
    while True:
        msg = input("you> ")
        if msg.strip() in {"exit", "quit"}:
            break
        print("bot>", chat(msg))

Run it once, tell it "I prefer Python over Go," quit, restart it the next day, and ask "what language should you use in examples?" — it answers Python, because the fact survived in Qdrant. That persistence across process restarts is the entire point.

First-hand note: on an RTX 3090 (24GB) with llama3.1:8b at Q4_K_M I measured roughly 45–55 tokens/sec for the reply generation, but each memory.add() adds a separate LLM extraction call that took me about 0.8–1.5 seconds per turn on top of the chat itself. Treat those figures as approximate — they swing with prompt length and how much there is to extract — but the takeaway holds: memory isn't free, every stored turn costs you an extra model round-trip. For a snappy UX I moved memory.add() to a background thread so it never blocks the reply.
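
A minimal sketch of that background-write pattern, using only the standard library. This is one possible shape, not the exact code behind the measurements above; start_memory_writer is a hypothetical helper, and it takes the add function as a parameter so it can be tried without Mem0 installed.

```python
import queue
import threading


def start_memory_writer(add_fn):
    """Start a daemon thread that calls add_fn(messages, user_id=...) for each queued job."""
    jobs = queue.Queue()

    def worker():
        while True:
            job = jobs.get()
            if job is None:  # sentinel: stop the worker
                jobs.task_done()
                break
            messages, user_id = job
            try:
                add_fn(messages, user_id=user_id)
            except Exception as exc:  # keep the worker alive if one write fails
                print(f"memory write failed: {exc}")
            finally:
                jobs.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return jobs


# Intended use in the chat loop:
#   jobs = start_memory_writer(memory.add)
#   jobs.put((turn_messages, USER))   # returns immediately; the reply is not blocked
#   jobs.join()                       # before exit, wait for pending writes to finish
```

A single worker keeps writes in order and avoids concurrent calls into the memory object. The trade-off is that a fact from the turn just finished may not be searchable yet when the next message arrives.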

Qdrant vs Chroma: which local store?

Both are valid local vector stores for Mem0. The trade-off is "separate service" vs "just a folder."

	Qdrant (local)	Chroma (local)
Mem0 provider name	`qdrant`	`chromadb` (not `chroma` — common gotcha)
How it runs	Docker container or binary on `:6333`	Embedded; persists to a local path
Setup cost	One Docker command	`pip install chromadb`
Best for	Larger memory sets, you want a real DB UI	Quick local prototypes, no Docker
Dim setting	`embedding_model_dims: 768`	also set `embedding_model_dims: 768`

The Chroma config swaps just the vector_store block — keep the same llm and embedder:

"vector_store": {
    "provider": "chromadb",
    "config": {
        "collection_name": "agent_memory",
        "path": "db_local",            # a folder, created if missing
        "embedding_model_dims": 768,   # still required for nomic-embed-text
    },
},

Two real gotchas here, both documented in Mem0 issues: the provider string is chromadb, not chroma, and if you skip embedding_model_dims: 768 the store falls back to 1536 and no longer matches the 768-dim vectors the embedder produces. If you want zero moving parts, Chroma wins; if you expect thousands of memories or want to inspect them in a dashboard, Qdrant is the steadier choice.
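
To keep the two variants from drifting apart, the shared parts can live in one function that only swaps the store-specific keys. build_config is a hypothetical helper written for this guide; the values are the ones used in the configs above.

```python
OLLAMA_URL = "http://localhost:11434"


def build_config(store: str = "qdrant") -> dict:
    """Return a local Mem0 config dict for the given vector store ("qdrant" or "chromadb")."""
    store_settings = {
        "qdrant": {"host": "localhost", "port": 6333},
        "chromadb": {"path": "db_local"},
    }
    if store not in store_settings:
        raise ValueError(f"unknown store {store!r}; expected one of {sorted(store_settings)}")
    return {
        "vector_store": {
            "provider": store,
            "config": {
                "collection_name": "agent_memory",
                "embedding_model_dims": 768,  # must match nomic-embed-text
                **store_settings[store],
            },
        },
        "llm": {
            "provider": "ollama",
            "config": {
                "model": "llama3.1:latest",
                "temperature": 0,
                "max_tokens": 2000,
                "ollama_base_url": OLLAMA_URL,
            },
        },
        "embedder": {
            "provider": "ollama",
            "config": {"model": "nomic-embed-text:latest", "ollama_base_url": OLLAMA_URL},
        },
    }
```

Passing the result to Memory.from_config(build_config("chromadb")) selects the Chroma variant, and a mistyped provider such as "chroma" fails immediately with a readable error.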

Honest take: where the setup gets fiddly

This is genuinely useful, but it is not a one-liner. The friction points I hit:

The dimension mismatch will get you. nomic-embed-text is 768-dim; Mem0's defaults assume 1536. Forget embedding_model_dims: 768 and you get cryptic vector-store errors, not a friendly message. This is the single most reported local-Mem0 bug on GitHub.
A small LLM extracts worse facts. Fact extraction is an LLM call. llama3.1:8b is fine for clean statements but will occasionally store noise or miss nuance that gpt-5-mini would catch. Bumping to a larger local model (e.g. a 70B if your hardware allows) noticeably improves what gets remembered.
Every add is an extra inference. As measured above, storing a turn costs a full extraction round-trip (~1s on a 3090). Do it inline and your chat feels sluggish; background it.
It grows unbounded. With v2's ADD-only behavior, nothing is auto-pruned. For a long-lived agent you need your own cleanup pass (delete stale or low-value memories) or search quality eventually degrades.
Two services to babysit. Ollama and a vector store both have to be up. If Qdrant isn't running, Memory.from_config fails at startup — wrap it and surface a clear error.
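
For the last point, a startup check along these lines can surface a clear message before Mem0 is touched. It is a sketch under simple assumptions: both services run on localhost at the default ports used in this guide, and a successful TCP connection counts as "up". port_open and preflight are hypothetical helper names.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts a TCP connection at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(services: dict, probe=port_open) -> list:
    """Return the names of services that the probe could not reach."""
    return [name for name, (host, port) in services.items() if not probe(host, port)]


SERVICES = {"Ollama": ("localhost", 11434), "Qdrant": ("localhost", 6333)}

# Intended use at startup, before Memory.from_config(config):
#   down = preflight(SERVICES)
#   if down:
#       raise SystemExit("Not reachable: " + ", ".join(down))
```

A TCP connect only proves that something is listening on the port, not that the model is pulled or the collection exists, so treat it as a first-line check.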

None of these are dealbreakers. But "add memory in 5 minutes" is marketing; budget an afternoon for a clean local setup, and once it works it's solid. For agents that need to act on memory (call tools, browse) rather than just recall facts, pair this with Goose running on Ollama.

Key Takeaways

Mem0 runs 100% locally by overriding three config keys — llm → Ollama, embedder → Ollama, vector_store → local Qdrant/Chroma. The default Memory() constructor uses OpenAI, so you must use Memory.from_config().
Always set embedding_model_dims: 768 when using nomic-embed-text. The 768-vs-1536 mismatch is the #1 reason local setups break.
The API is two calls: memory.add(messages, user_id=...) to store, memory.search(query, user_id=...) to retrieve. Mem0's LLM does the fact extraction for you.
Memory isn't free — each add is a separate extraction inference (~1s on a 3090 with an 8B model). Background it for a responsive agent.
Pick your store by appetite for ops: Chroma (a folder, pip install) for quick local work; Qdrant (Docker, :6333) for scale and inspectability. Both need the dimension setting.

Next Steps

New to agents entirely? Start with how to build a local AI agent before adding the memory layer described here.
Want an agent that acts, not just remembers? See running Goose as a local agent on Ollama.
Building a memory-backed RAG/answer system? Pair Mem0 with a local answer engine that cites its sources.
Want to give your agent tools via a standard protocol? Read the Ollama MCP integration guide — MCP and Mem0 compose nicely (one gives tools, the other gives recall).
Reference: the official Mem0 GitHub repository (Apache-2.0) and the self-hosted Ollama cookbook.

Give Your Local AI Agent Memory with Mem0 (2026)

Local AI Master Research Team

Written by the Local AI Master Team
