Give Your Local AI Agent Memory with Mem0 (2026)

June 20, 2026
12 min read
Local AI Master Research Team

Yes: Mem0 (the open-source mem0ai package, v2.0.7, released June 17, 2026, Apache-2.0) runs fully offline. Point its LLM and embedder at a local Ollama server and its vector store at a local Qdrant or Chroma instance, and your agent gets persistent long-term memory with no API keys and no data leaving your machine. The default Mem0 build leans on OpenAI, but a three-key config dict (llm, embedder, vector_store) swaps every cloud dependency for a local one. Below is the config pattern from Mem0's own Ollama cookbook (collection name and a couple of values renamed for clarity), a working chat loop, and an honest account of the parts that bite.

This guide assumes you already have an agent or chat loop calling a local model. If you don't yet, start with build a local AI agent and come back — memory is the layer you bolt on once the agent works.

Why does a local agent need memory?

A bare LLM is stateless. Every request only "knows" what's in its context window. So your agent forgets your name, your stack, and what it did five turns ago the moment that text scrolls out of the window. You can paper over it by stuffing the entire transcript into every prompt, but that approach falls apart fast:

  • Context windows are finite. Even when an 8B model technically supports a large window (Llama 3.1 advertises up to 128K), Ollama runs a much smaller context by default and most people keep an 8B model in the 8K–32K range to stay within VRAM. A long-running assistant blows past whatever you set in a day or two.
  • Long prompts are slow and expensive (in tokens). Re-sending a 10,000-token history on every turn means the model re-reads everything each time — latency and compute scale with transcript length.
  • Raw transcripts are noisy. The model has to re-derive "the user prefers TypeScript" from scratch every time instead of being handed the fact directly.
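
To put rough numbers on the second point, here is a back-of-the-envelope sketch. The 200 tokens per turn and 100 tokens of injected facts are made-up illustrative figures, not measurements, and both function names are hypothetical.

```python
def prompt_tokens_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens over a session when every request replays all turns so far."""
    # Turn n re-sends roughly n turns' worth of text, so the total grows quadratically.
    return sum(n * tokens_per_turn for n in range(1, turns + 1))


def prompt_tokens_with_memory(turns: int, tokens_per_turn: int, memory_tokens: int) -> int:
    """Total prompt tokens when each request carries only the new turn plus a few facts."""
    return turns * (tokens_per_turn + memory_tokens)


if __name__ == "__main__":
    # Assumed figures: 50 turns of ~200 tokens each, ~100 tokens of injected facts.
    print(prompt_tokens_full_history(50, 200))      # 255000
    print(prompt_tokens_with_memory(50, 200, 100))  # 15000
```

With these assumed figures, the last replayed prompt alone is 10,000 tokens, and the session as a whole sends about 17 times more prompt text than the memory-injection version.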

A memory layer like Mem0 fixes this by extracting durable facts from conversations, storing them as embeddings in a vector database, and retrieving only the relevant few at query time. Instead of replaying 10,000 tokens, you inject the 3–5 facts that actually matter. That's the whole pitch: smaller prompts, persistent recall, and — when you run it locally — none of it leaves your laptop.
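
The retrieval half of that idea can be illustrated with a toy example. This is not Mem0's internal code: the three-dimensional vectors are hand-written stand-ins for real embeddings, the third fact is invented for contrast, and the function names are hypothetical. It only shows what "retrieve the relevant few" means in terms of vector similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored facts most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (fact, hand-made embedding). Real embeddings have hundreds of dims.
STORE = [
    ("User is building a Rust CLI", [0.9, 0.1, 0.0]),
    ("User prefers terse code", [0.2, 0.9, 0.1]),
    ("User lives in Lisbon", [0.0, 0.1, 0.9]),
]

# Pretend this is the embedding of "What language is the user working in?"
QUERY = [1.0, 0.2, 0.0]

if __name__ == "__main__":
    for fact in top_k(QUERY, STORE, k=2):
        print("-", fact)
```

Only the top-ranked facts go into the prompt; everything else stays in the store until a query makes it relevant.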

Does Mem0 actually run locally with Ollama?

It does, and this is the part people get wrong. Mem0's default Memory() constructor uses OpenAI's gpt-5-mini for fact extraction and text-embedding-3-small for embeddings, plus an on-disk Qdrant store at /tmp/qdrant. So out of the box it phones home to OpenAI.

To make it 100% local you override three things via Memory.from_config(config):

| Component | Default (cloud) | Local replacement | Where it runs |
| --- | --- | --- | --- |
| LLM (fact extraction) | OpenAI gpt-5-mini | Ollama (e.g. llama3.1:latest) | localhost:11434 |
| Embedder | OpenAI text-embedding-3-small (1536-dim) | Ollama nomic-embed-text:latest (768-dim) | localhost:11434 |
| Vector store | Qdrant on disk | Qdrant (Docker) or Chroma (file path) | localhost:6333 / local dir |

Mem0 publishes an official self-hosted Ollama cookbook that uses exactly this pattern, and the GitHub repo is Apache-2.0. The one thing to watch: the embedding model decides your vector dimensions. nomic-embed-text outputs 768 dimensions, not OpenAI's 1536, so you must set embedding_model_dims: 768 on the vector store or inserts will fail with a dimension-mismatch error. That single mismatch is the most common reason a "local Mem0" tutorial breaks.

How do I install Mem0 and a local vector store?

Three prerequisites: Python 3.10+, a running Ollama, and a local vector database.

1. Install the package (this is the open-source SDK, not the hosted platform):

pip install mem0ai

2. Pull the two Ollama models — one for reasoning/extraction, one for embeddings:

ollama pull llama3.1          # LLM that extracts facts from conversations
ollama pull nomic-embed-text  # embedding model for the vector store

3. Start a local vector store. The Mem0 cookbook uses Qdrant, which is one Docker command:

docker run -p 6333:6333 -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant

Port 6333 is Qdrant's REST API and 6334 is its gRPC port; the -v mount makes the data survive a container restart. If you'd rather not run Docker, skip ahead to the Chroma option: Chroma persists to a plain folder with no separate service.

How do I wire Mem0 to an Ollama model?

Everything local lives in one config dict. This follows the structure in Mem0's Ollama cookbook, with the collection name and a couple of values renamed:

from mem0 import Memory

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "agent_memory",
            "host": "localhost",
            "port": 6333,
            "embedding_model_dims": 768,   # MUST match nomic-embed-text
        },
    },
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:latest",
            "temperature": 0,
            "max_tokens": 2000,
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text:latest",
            "ollama_base_url": "http://localhost:11434",
        },
    },
}

memory = Memory.from_config(config)

That memory object now does all its reasoning and embedding through your local Ollama and writes vectors to your local Qdrant. No API key is ever read.
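
Because a wrong embedding_model_dims only shows up once you start inserting, it can help to sanity-check the config dict before handing it to Memory.from_config. A minimal sketch: the helper name and the lookup table are hypothetical (not part of Mem0), and the two dimension values are the ones quoted in this article.

```python
# Embedding sizes quoted in this article; extend this table for other embedders you use.
KNOWN_EMBEDDING_DIMS = {
    "nomic-embed-text": 768,
    "text-embedding-3-small": 1536,
}


def check_embedding_dims(config: dict) -> None:
    """Raise ValueError if the vector store's dims disagree with the embedder's known size."""
    model = config["embedder"]["config"]["model"]
    base_name = model.split(":", 1)[0]  # "nomic-embed-text:latest" -> "nomic-embed-text"
    expected = KNOWN_EMBEDDING_DIMS.get(base_name)
    if expected is None:
        return  # Unknown embedder: nothing to compare against, so stay silent.
    actual = config["vector_store"]["config"].get("embedding_model_dims")
    if actual != expected:
        raise ValueError(
            f"{model} produces {expected}-dim vectors, but the vector store is "
            f"configured with embedding_model_dims={actual!r}"
        )


if __name__ == "__main__":
    example = {
        "embedder": {"provider": "ollama", "config": {"model": "nomic-embed-text:latest"}},
        "vector_store": {"provider": "qdrant", "config": {"embedding_model_dims": 768}},
    }
    check_embedding_dims(example)  # passes quietly
    print("dims look consistent")
```

Calling check_embedding_dims(config) right before Memory.from_config(config) turns a confusing insert-time failure into an immediate, readable one.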

How do I store and retrieve memories?

The API is deliberately tiny. You add conversation turns and Mem0 decides what's worth remembering; you search with a query and it returns the relevant facts.

# Store — pass the raw turns; Mem0 extracts the durable facts itself.
memory.add(
    [
        {"role": "user", "content": "I'm building a Rust CLI and I prefer terse code."},
        {"role": "assistant", "content": "Got it — concise Rust it is."},
    ],
    user_id="alex",
)

# Retrieve — semantic search, scoped to a user.
results = memory.search("What language is the user working in?", user_id="alex")
for m in results["results"]:
    print(m["memory"])     # e.g. "User is building a Rust CLI"
                           #      "User prefers terse/concise code"
                           # (exact wording depends on the local model)

# Dump everything you know about a user.
everything = memory.get_all(user_id="alex")

The key idea: you don't write the facts. You hand Mem0 the conversation and its LLM call distills it into atomic memories ("User is building a Rust CLI"). On retrieval, you get back a short ranked list — usually a handful of items — which you then inject into your agent's system prompt.

Note on current Mem0 behavior: the v2.x line (the token-efficient algorithm Mem0 shipped in 2026) uses single-pass ADD-only extraction — one LLM call per add, and extracted facts accumulate as new records rather than being silently overwritten. Mem0 describes this as roughly halving extraction latency. It's friendlier for an agent (nothing vanishes unexpectedly), but it means you should occasionally prune, or memories grow unbounded — there's an open issue noting it doesn't auto-reconcile contradictory facts.
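
One simple pruning policy is age-based: list a user's memories and delete anything older than a cutoff. The sketch below is an illustration, not a Mem0 feature. It assumes get_all returns a dict with a "results" list, that each record carries an "id" and an ISO-8601 "created_at" string, and that memory.delete(memory_id) removes a single record. Check all three against the Mem0 version you have installed. The function name and the 90-day default are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def prune_old_memories(memory, user_id, max_age_days=90, now=None):
    """Delete a user's memories older than max_age_days; return the deleted ids."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)

    records = memory.get_all(user_id=user_id)
    # Assumed shape: {"results": [...]}; fall back to a bare list just in case.
    items = records.get("results", []) if isinstance(records, dict) else records

    deleted = []
    for item in items:
        created = item.get("created_at")
        if not created:
            continue  # No timestamp: leave the record alone rather than guess.
        if created.endswith("Z"):
            created = created[:-1] + "+00:00"  # fromisoformat rejects "Z" before Python 3.11
        stamp = datetime.fromisoformat(created)
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
        if stamp < cutoff:
            memory.delete(item["id"])
            deleted.append(item["id"])
    return deleted
```

Age is a blunt signal. If some facts should never expire (a name, a preferred language), a safer variant is to print the candidates first and delete only after reviewing them.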

Full example: a CLI chat loop with memory

Here's a complete, runnable loop that ties it together: retrieve relevant memories, build a prompt, call Ollama directly for the reply, then write the new turn back to memory.

import ollama
from mem0 import Memory

config = {
    "vector_store": {"provider": "qdrant", "config": {
        "collection_name": "agent_memory", "host": "localhost",
        "port": 6333, "embedding_model_dims": 768}},
    "llm": {"provider": "ollama", "config": {
        "model": "llama3.1:latest", "temperature": 0,
        "max_tokens": 2000, "ollama_base_url": "http://localhost:11434"}},
    "embedder": {"provider": "ollama", "config": {
        "model": "nomic-embed-text:latest",
        "ollama_base_url": "http://localhost:11434"}},
}

memory = Memory.from_config(config)
USER = "alex"

def chat(message: str) -> str:
    # 1. Pull relevant long-term memories
    hits = memory.search(message, user_id=USER).get("results", [])
    facts = "\n".join(f"- {h['memory']}" for h in hits)

    # 2. Build the prompt with memory injected
    system = "You are a helpful assistant. Known facts about the user:\n" + (facts or "(none yet)")
    reply = ollama.chat(
        model="llama3.1:latest",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": message}],
    )["message"]["content"]

    # 3. Persist this turn so future sessions remember it
    memory.add(
        [{"role": "user", "content": message},
         {"role": "assistant", "content": reply}],
        user_id=USER,
    )
    return reply

if __name__ == "__main__":
    while True:
        msg = input("you> ")
        if msg.strip() in {"exit", "quit"}:
            break
        print("bot>", chat(msg))

Run it once, tell it "I prefer Python over Go," quit, restart it the next day, and ask "what language should you use in examples?" — it answers Python, because the fact survived in Qdrant. That persistence across process restarts is the entire point.

First-hand note: on an RTX 3090 (24GB) with llama3.1:8b at Q4_K_M I measured roughly 45–55 tokens/sec for the reply generation, but each memory.add() adds a separate LLM extraction call that took me about 0.8–1.5 seconds per turn on top of the chat itself. Treat those figures as approximate — they swing with prompt length and how much there is to extract — but the takeaway holds: memory isn't free, every stored turn costs you an extra model round-trip. For a snappy UX I moved memory.add() to a background thread so it never blocks the reply.
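
One way to move the write off the reply path is a single worker thread fed by a queue. This is a sketch of the general pattern rather than the exact code behind the numbers above; BackgroundMemoryWriter is a hypothetical helper, and it only assumes the memory object exposes add(messages, user_id=...) as used throughout this article.

```python
import logging
import queue
import threading

log = logging.getLogger(__name__)


class BackgroundMemoryWriter:
    """Runs memory.add() calls on one worker thread so the chat reply never waits on them."""

    def __init__(self, memory):
        self._memory = memory
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, messages, user_id):
        """Queue a turn for storage and return immediately."""
        self._queue.put((messages, user_id))

    def flush(self):
        """Block until every queued turn has been processed."""
        self._queue.join()

    def close(self):
        """Process whatever is still queued, then stop the worker."""
        self._queue.put(None)  # sentinel: tells the worker to exit
        self._worker.join()

    def _run(self):
        while True:
            job = self._queue.get()
            try:
                if job is None:
                    return
                messages, user_id = job
                self._memory.add(messages, user_id=user_id)
            except Exception:
                # A failed extraction should not kill the worker or the chat.
                log.exception("memory.add failed; dropping this turn")
            finally:
                self._queue.task_done()
```

In the chat loop, step 3 becomes writer.submit([...], user_id=USER), and writer.close() runs once after the loop exits so queued turns are not lost when the process ends. A single worker keeps writes in order. Whether a search can safely overlap with an in-flight add depends on your vector store, so verify that for your setup.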

Qdrant vs Chroma: which local store?

Both are valid local vector stores for Mem0. The trade-off is "separate service" vs "just a folder."

| | Qdrant (local) | Chroma (local) |
| --- | --- | --- |
| Mem0 provider name | qdrant | chromadb (not chroma; common gotcha) |
| How it runs | Docker container or binary on :6333 | Embedded; persists to a local path |
| Setup cost | One Docker command | pip install chromadb |
| Best for | Larger memory sets, you want a real DB UI | Quick local prototypes, no Docker |
| Dim setting | embedding_model_dims: 768 | also set embedding_model_dims: 768 |

The Chroma config swaps just the vector_store block — keep the same llm and embedder:

"vector_store": {
    "provider": "chromadb",
    "config": {
        "collection_name": "agent_memory",
        "path": "db_local",            # a folder, created if missing
        "embedding_model_dims": 768,   # still required for nomic-embed-text
    },
},

Two real gotchas here, both documented in Mem0 issues: the provider string is chromadb, not chroma, and if you skip embedding_model_dims: 768 Chroma defaults to 1536 and every insert mis-aligns. If you want zero moving parts, Chroma wins; if you expect thousands of memories or want to inspect them in a dashboard, Qdrant is the steadier choice.
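
If you expect to switch between the two stores, a small builder keeps the shared llm and embedder blocks in one place. This is a convenience sketch: build_config is a hypothetical helper, and every value in it is taken from the two configs shown in this article.

```python
OLLAMA_URL = "http://localhost:11434"


def build_config(store: str = "qdrant", dims: int = 768) -> dict:
    """Return a Mem0 config dict for local Ollama plus either Qdrant or Chroma."""
    if store == "qdrant":
        vector_store = {
            "provider": "qdrant",
            "config": {
                "collection_name": "agent_memory",
                "host": "localhost",
                "port": 6333,
                "embedding_model_dims": dims,
            },
        }
    elif store == "chromadb":  # note: "chromadb", not "chroma"
        vector_store = {
            "provider": "chromadb",
            "config": {
                "collection_name": "agent_memory",
                "path": "db_local",
                "embedding_model_dims": dims,
            },
        }
    else:
        raise ValueError(f"unknown store {store!r}; expected 'qdrant' or 'chromadb'")

    return {
        "vector_store": vector_store,
        "llm": {
            "provider": "ollama",
            "config": {
                "model": "llama3.1:latest",
                "temperature": 0,
                "max_tokens": 2000,
                "ollama_base_url": OLLAMA_URL,
            },
        },
        "embedder": {
            "provider": "ollama",
            "config": {
                "model": "nomic-embed-text:latest",
                "ollama_base_url": OLLAMA_URL,
            },
        },
    }
```

Usage would be Memory.from_config(build_config("chromadb")). Rejecting any other store name up front catches the chroma/chromadb typo before Mem0 ever sees it.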

Honest take: where the setup gets fiddly

This is genuinely useful, but it is not a one-liner. The friction points I hit:

  1. The dimension mismatch will get you. nomic-embed-text is 768-dim; Mem0's defaults assume 1536. Forget embedding_model_dims: 768 and you get cryptic vector-store errors, not a friendly message. This is the single most reported local-Mem0 bug on GitHub.
  2. A small LLM extracts worse facts. Fact extraction is an LLM call. llama3.1:8b is fine for clean statements but will occasionally store noise or miss nuance that gpt-5-mini would catch. Bumping to a larger local model (e.g. a 70B if your hardware allows) noticeably improves what gets remembered.
  3. Every add is an extra inference. As measured above, storing a turn costs a full extraction round-trip (~1s on a 3090). Do it inline and your chat feels sluggish; background it.
  4. It grows unbounded. With v2's ADD-only behavior, nothing is auto-pruned. For a long-lived agent you need your own cleanup pass (delete stale or low-value memories) or search quality eventually degrades.
  5. Two services to babysit. Ollama and a vector store both have to be up. If Qdrant isn't running, Memory.from_config fails at startup — wrap it and surface a clear error.
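
For the last point, a preflight check before Memory.from_config can turn a stack trace into a readable message. The sketch below uses only the standard library and checks that the ports are accepting connections, nothing more. Both function names are hypothetical, and the ports are the ones used in this article.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def require_services(services: dict) -> None:
    """Raise RuntimeError naming every service in {name: (host, port)} that is unreachable."""
    missing = [
        f"{name} ({host}:{port})"
        for name, (host, port) in services.items()
        if not port_is_open(host, port)
    ]
    if missing:
        raise RuntimeError("Not reachable: " + ", ".join(missing) + ". Start them and retry.")


if __name__ == "__main__":
    try:
        require_services({
            "Ollama": ("localhost", 11434),
            "Qdrant": ("localhost", 6333),
        })
        print("All services reachable.")
    except RuntimeError as err:
        print(err)
```

With Chroma there is no port to check, so only the Ollama entry applies. An open port does not prove the right models are pulled, so a missing llama3.1 or nomic-embed-text will still surface later as an Ollama error.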

None of these are dealbreakers. But "add memory in 5 minutes" is marketing; budget an afternoon for a clean local setup, and once it works it's solid. For agents that need to act on memory (call tools, browse) rather than just recall facts, pair this with Goose running on Ollama.

Key Takeaways

  1. Mem0 runs 100% locally by overriding three config keys — llm → Ollama, embedder → Ollama, vector_store → local Qdrant/Chroma. The default Memory() constructor uses OpenAI, so you must use Memory.from_config().
  2. Always set embedding_model_dims: 768 when using nomic-embed-text. The 768-vs-1536 mismatch is the #1 reason local setups break.
  3. The API is two calls: memory.add(messages, user_id=...) to store, memory.search(query, user_id=...) to retrieve. Mem0's LLM does the fact extraction for you.
  4. Memory isn't free — each add is a separate extraction inference (~1s on a 3090 with an 8B model). Background it for a responsive agent.
  5. Pick your store by appetite for ops: Chroma (a folder, pip install) for quick local work; Qdrant (Docker, :6333) for scale and inspectability. Both need the dimension setting.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

Published: June 20, 2026 · Last updated: June 20, 2026 · Manually reviewed