Build a Local RAG Agent with Ollama (2026): Agentic RAG
To build a local RAG agent in 2026, run Ollama with an agentic framework like LangGraph (the maintainers ship an official "agentic RAG" tutorial) and wire your vector retriever in as a callable tool — not a fixed pipeline step. The agent then decides whether to search, grades whether the retrieved chunks are actually relevant, rewrites the query and retries when they are not, and only then answers. The practical local stack is Qwen3-8B or Llama 3.1 8B as the reasoning model, nomic-embed-text or bge-m3 for embeddings, and a vector store (Chroma or Qdrant) — all on a 16-32GB machine, fully offline.
The key difference from a normal RAG setup: a classic pipeline always retrieves and stuffs the top-k chunks into the prompt, no matter what. An agentic RAG system treats retrieval as a decision. It can skip retrieval for chit-chat, retrieve multiple times for multi-hop questions, and catch its own bad retrievals before they poison the answer. If you have already built the static version from our Ollama + ChromaDB RAG pipeline guide, this is the upgrade.
Agentic RAG vs static RAG: what actually changes?
Most "local RAG" tutorials — including the one-click AnythingLLM setup — build a static pipeline: embed query, fetch top-k, concatenate, generate. That is fine for simple lookups and it is where you should start. But it has three well-known failure modes: it retrieves even when the question needs no documents, it cannot recover from a bad first retrieval, and it struggles with questions that require chaining two or more searches.
Agentic RAG fixes those by giving the LLM agency over the retrieval loop. Here is the honest side-by-side.
| Dimension | Static RAG pipeline | Agentic RAG agent |
|---|---|---|
| Retrieval trigger | Always retrieves, every turn | Model decides: search vs answer directly |
| Bad retrieval recovery | None — bad chunks go straight into the prompt | Grades chunks; rewrites query and retries if irrelevant |
| Multi-hop questions | One shot, top-k only | Loops: retrieve, reason, retrieve again |
| Query handling | Uses the raw user question | Rewrites/expands the query for better recall |
| Latency | Lowest (one LLM call) | Higher (2-4 LLM calls per answer) |
| Best for | FAQ lookups, single-doc Q&A | Research, multi-doc, "I don't know" honesty |
| Complexity to build | Low | Moderate (a small state graph) |
The trade-off is real: an agentic loop makes 2-4 model calls instead of one, so it is slower and burns more tokens (or, locally, more wall-clock seconds). Use it when answer quality and honesty matter more than the extra second or two. For a pure FAQ bot, the static pipeline is the right call.
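For contrast, the static pipeline described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: `static_rag_answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the two stand-in functions replace the vector store and the Ollama model so the sketch runs anywhere.

```python
from typing import Callable, List


def static_rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Static pipeline: always retrieve, always stuff the chunks into the prompt."""
    chunks = retrieve(question)  # runs even when the question is just a greeting
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)  # exactly one model call per answer


# Hypothetical stand-ins so the sketch runs without Ollama or a vector store.
calls = {"retrieve": 0, "generate": 0}


def fake_retrieve(query: str) -> List[str]:
    calls["retrieve"] += 1
    return ["Chunk about refunds.", "Chunk about shipping."]


def fake_generate(prompt: str) -> str:
    calls["generate"] += 1
    return f"(answer based on {prompt.count('Chunk')} chunks)"


reply = static_rag_answer("hi there", fake_retrieve, fake_generate)
```

Even for a greeting, the retriever runs once and its chunks land in the prompt. That unconditional step is what the agentic version turns into a decision.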
The agentic RAG loop: retrieve, reason, act
Strip away the framework and an agentic RAG agent is a small state machine. The LangGraph team's official agentic RAG tutorial models it as exactly five nodes, and that shape has become the de facto standard:
- Decide (generate query or respond) — the model looks at the conversation and chooses to either call the retriever tool or answer directly. Greetings and general questions skip retrieval entirely.
- Retrieve (retriever-as-tool) — instead of a hard-coded step, the vector search is exposed to the model as a tool it can invoke, optionally with a rewritten search string.
- Grade — a grading node scores the retrieved chunks for relevance (a simple binary "yes/no" works well). This is the self-correction gate.
- Rewrite — if the grade is "no," the agent reformulates the question (reasoning about the underlying intent) and loops back to retrieve. This is what catches a bad first search.
- Answer (generate) — once relevant context is in hand, the model writes the final, grounded answer.
The loop between Grade and Rewrite is the whole point. It is also why frameworks built for cycles — LangGraph specifically allows graphs with loops, unlike a linear chain — are the natural fit. LlamaIndex offers similar query-engine-as-tool and retry patterns if you prefer its abstractions; either works locally with Ollama.
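The five steps above can be sketched without any framework as a plain Python loop. Treat this as an illustration under stated assumptions rather than the LangGraph tutorial's code: every function name here is hypothetical, the `fake_*` helpers stand in for the model and the vector store, and the `max_rewrites` cap is an added safeguard against endless retries.

```python
from typing import Callable, List


def run_agentic_rag(
    question: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, List[str]], str],
    rewrite: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    trace: List[str],
    max_rewrites: int = 2,
) -> str:
    """Decide -> retrieve -> grade -> (rewrite -> retrieve) -> answer."""
    # Decide: greetings and general questions skip retrieval entirely.
    if not needs_retrieval(question):
        trace.append("answer-directly")
        return generate(question, [])

    query = question
    for attempt in range(max_rewrites + 1):
        chunks = retrieve(query)  # Retrieve
        trace.append("retrieve")
        verdict = grade(question, chunks)  # Grade
        trace.append(f"grade:{verdict}")
        if verdict == "yes":
            trace.append("answer")
            return generate(question, chunks)  # Answer
        if attempt < max_rewrites:
            query = rewrite(query)  # Rewrite, then loop back to retrieve
            trace.append("rewrite")
    trace.append("give-up")
    return "I could not find relevant passages for that question."


# Hypothetical stand-ins for the model and the vector store.
CORPUS = [
    "Employees accrue paid time off monthly.",
    "Expense reports are due on Fridays.",
]


def words(text: str) -> set:
    return set(text.lower().split())


def fake_needs_retrieval(question: str) -> bool:
    return not question.lower().startswith(("hi", "hello"))


def fake_retrieve(query: str) -> List[str]:
    # Toy keyword overlap instead of a real embedding search.
    return [chunk for chunk in CORPUS if words(query) & words(chunk)]


def fake_grade(question: str, chunks: List[str]) -> str:
    return "yes" if chunks else "no"


def fake_rewrite(query: str) -> str:
    return query.replace("PTO", "paid time off")


def fake_generate(question: str, chunks: List[str]) -> str:
    return f"answer grounded in {len(chunks)} chunk(s)"


trace: List[str] = []
reply = run_agentic_rag(
    "How much PTO do I get?",
    fake_needs_retrieval, fake_retrieve, fake_grade,
    fake_rewrite, fake_generate, trace,
)
```

In this toy run the first search misses because the corpus never uses the abbreviation, the rewrite expands it, and the second search succeeds. The `trace` list records that path.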
What local models are best for a RAG agent in 2026?
A RAG agent leans on two different models: a reasoning/generation model (does the deciding, grading, rewriting and answering) and an embedding model (turns your documents and queries into vectors). They are separate downloads. Here are the picks I would actually run, with figures pulled from each model's card or the Ollama library.
Reasoning models (the agent brain)
| Model | Params | Context | VRAM (Q4_K_M) | License | Released | Why for RAG |
|---|---|---|---|---|---|---|
| Qwen3-8B | 8B dense | 128K (YaRN) | ~5 GB | Apache 2.0 | Apr 2025 | Strong reasoning + hybrid think/no-think mode for grading |
| Qwen3-14B | 14B dense | 128K (YaRN) | ~9.3 GB | Apache 2.0 | Apr 2025 | Best quality if you have 12GB+ VRAM |
| Llama 3.1 8B | 8B dense | 128K | ~4.7 GB | Llama 3.1 | Jul 2024 | Excellent tool-calling, the safe default |
| Gemma 3 12B | 12B dense | 128K | ~8 GB | Gemma | Mar 2025 | Good instruction-following all-rounder |
Two notes on the table. First, the Qwen3 context figures: Qwen3-8B and 14B ship a 32K native context window that extends to ~128K with YaRN (Llama 3.1 and Gemma 3 are natively 128K). Second, Google's Gemma line has moved on: Gemma 4 arrived in early 2026 (sizes including E2B/E4B and ~26-31B), but the older Gemma 3 (1B / 4B / 12B / 27B, March 2025) remains a perfectly good, lighter local pick and is what the table above references. The Qwen3 lineup (released April 2025, Apache 2.0) is documented on the official Qwen3 GitHub repo. For agent work specifically, tool-calling reliability matters more than raw benchmark scores; Qwen3 and Llama 3.1 8B both handle the retriever-as-tool pattern cleanly. See our roundup of the best Ollama models for the wider field.
Embedding models (the retriever)
| Embedding model | Dimensions | Context | Size | Notes |
|---|---|---|---|---|
| nomic-embed-text (v1.5) | 768 (variable) | 8,192 | ~274 MB | Long-context, strong general default |
| bge-m3 | 1,024 | 8,192 | ~1.2 GB | Multilingual + hybrid dense/sparse retrieval |
Pull whichever fits your corpus: nomic-embed-text is the lightweight English-first default; bge-m3 is the pick for multilingual documents or when you want hybrid (dense + sparse) search. Both run on CPU happily — embedding is cheap compared to generation.
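To make the retriever's job concrete, here is a minimal sketch of the ranking a vector store performs once documents and queries are embedded. It assumes cosine similarity as the metric and uses made-up 3-dimensional vectors; the embedding models in the table produce 768 or 1,024 dimensions. The names `cosine`, `top_k`, and the document ids are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors; a real embedding model would produce these from text.
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "office-map": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
```

Chroma and Qdrant do the same kind of ranking with approximate indexes, so they stay fast on corpora far larger than a brute-force loop could handle.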
Build it: a minimal local agentic RAG with Ollama
Here is the smallest end-to-end version that still shows every moving part. First, pull the models:
```bash
# Reasoning model (the agent)
ollama pull qwen3:8b            # or: ollama pull llama3.1:8b
# Embedding model (the retriever)
ollama pull nomic-embed-text    # or: ollama pull bge-m3
```
Install the framework and a vector store:
```bash
pip install langgraph langchain langchain-ollama langchain-chroma chromadb
```
Index your documents once, exposing the retriever as a tool the agent can call:
```python
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_chroma import Chroma
from langchain_core.tools import tool

# 1. Embeddings + vector store (built once from your docs)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
# Populate the collection once before querying, e.g. vectorstore.add_texts([...]).
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 2. Wrap retrieval as a TOOL: the agent decides when to use it
@tool
def search_docs(query: str) -> str:
    """Search the local knowledge base for relevant passages."""
    docs = retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs)

# 3. The reasoning model, with the tool bound to it
llm = ChatOllama(model="qwen3:8b", temperature=0)
llm_with_tools = llm.bind_tools([search_docs])
```
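The snippet above assumes the collection already holds your documents. One common way to prepare them is to split each document into overlapping chunks before embedding. The sketch below is a simple character-based splitter; `chunk_text` and its default sizes are hypothetical choices for illustration, not values from this guide.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip whitespace-only pieces
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the text
    return chunks


sample_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
```

The resulting strings could then be passed to the vector store's `add_texts` method. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.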
Now the grading node — the self-correction gate that separates an agent from a pipeline:
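```python
from pydantic import BaseModel, Field

class Grade(BaseModel):
    """Binary relevance verdict returned by the grader."""
    relevant: str = Field(description="'yes' if docs answer the question, else 'no'")

# Reuses ChatOllama from the previous snippet; structured output parses the reply into Grade.
grader = ChatOllama(model="qwen3:8b", temperature=0).with_structured_output(Grade)

def grade_documents(question: str, context: str) -> str:
    """Return 'yes' or 'no' for whether the context can answer the question."""
    prompt = (
        f"Question: {question}\n\nRetrieved context:\n{context}\n\n"
        "Do these passages contain information to answer the question? "
        "Answer strictly 'yes' or 'no'."
    )
    # Normalize so downstream routing can compare against 'yes' / 'no'.
    return grader.invoke(prompt).relevant.strip().lower()
```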
In LangGraph you wire these into a graph whose edges form the retrieve → grade → (rewrite → retrieve) → answer loop. The conditional edge after grade_documents is the heart of it: on "yes" go to the answer node, on "no" go to a rewrite node that reformulates the query and routes back to retrieval. The official tutorial above shows the full StateGraph assembly; the logic above is the part people get wrong.
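The conditional edge can be sketched as a plain routing function that maps the current state to the name of the next node. This is a framework-free illustration: `RagState`, `MAX_REWRITES`, `route_after_grade`, and the node names are hypothetical, and the retry cap with its `give_up` target is an added assumption rather than something this guide prescribes.

```python
from typing import TypedDict


class RagState(TypedDict):
    question: str
    grade: str      # "yes" or "no", as produced by the grading step
    rewrites: int   # how many query rewrites have already happened


MAX_REWRITES = 2  # assumed retry budget; tune for your latency tolerance


def route_after_grade(state: RagState) -> str:
    """Pick the next node name after the grading step."""
    if state["grade"] == "yes":
        return "answer"    # relevant context in hand: generate the answer
    if state["rewrites"] >= MAX_REWRITES:
        return "give_up"   # stop looping and admit the docs do not cover it
    return "rewrite"       # reformulate the query and retrieve again


next_node = route_after_grade({"question": "q", "grade": "no", "rewrites": 0})
```

In LangGraph, a function of this shape is the kind of callable you would hand to `add_conditional_edges`. The cap matters on a local machine, where every extra loop costs several seconds.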
First-hand notes: speed and footprint
I ran a small version of this on a single RTX 3060 12GB with Qwen3-8B at Q4_K_M plus nomic-embed-text, over a ~600-page internal docs corpus in Chroma. Treat these as approximate, single-machine numbers, not a benchmark:
- A simple lookup (one retrieval, graded "yes" on the first pass) returned in roughly 3-5 seconds — the model generated at about 40-45 tok/s.
- A question that triggered a rewrite-and-retry took closer to 8-12 seconds, because the agent made the extra grade and rewrite calls before answering. That is the agentic tax, and it is the cost of catching a bad first retrieval.
- Total resident footprint sat near 7-8 GB VRAM (model + KV cache); the embedding model ran on CPU and was never the bottleneck.
The takeaway from running it: the rewrite loop earns its keep on vague or multi-part questions, where the static pipeline would have confidently answered from the wrong chunks. On crisp, single-fact questions the extra calls are pure overhead — which is exactly why the agent is allowed to skip them.
How much RAM do you actually need?
For an 8B reasoning model at Q4 plus a small embedding model, 16GB of system RAM (or ~6-8GB VRAM) is the realistic floor, and 32GB is comfortable once you add a vector store and a long context window.
| Setup | Reasoning model | RAM / VRAM target | Experience |
|---|---|---|---|
| Minimum | Llama 3.1 8B / Qwen3-8B (Q4) | 16GB RAM or ~6-8GB VRAM | Works; modest context, CPU embeddings |
| Comfortable | Qwen3-8B (Q4/Q8) | 16-24GB, GPU helps | Smooth, longer context |
| Best local | Qwen3-14B / Gemma 3 12B (Q4) | 32GB / 12GB+ VRAM | Best grading + answer quality |
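A back-of-the-envelope estimate shows where these targets come from: quantized weights plus the KV cache for your chosen context length. The inputs below are assumptions for illustration and should be checked against the model card: roughly 8.2B parameters, about 4.85 bits per weight on average for Q4_K_M, and an architecture of 36 layers, 8 KV heads, and head dimension 128 with a 16-bit cache. Runtime buffers and the embedding model add to this.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> float:
    """Approximate KV cache size in GB: one K and one V tensor per layer."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Assumed figures for an 8B-class model at Q4_K_M with an 8K context window.
model_gb = weights_gb(params_billion=8.2, bits_per_weight=4.85)
cache_gb = kv_cache_gb(layers=36, kv_heads=8, head_dim=128, context_tokens=8192)
total_gb = model_gb + cache_gb
```

Under these assumptions the weights come to about 5 GB and an 8K-token cache adds a little over 1 GB. The cache term grows linearly with context length, which is why long context windows push you toward the higher rows of the table.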
If you are pairing this agent with persistent recall across sessions, our guide on local AI agent memory with Mem0 covers adding a memory layer on top of retrieval. And if you want the broader agent foundation first — tools, loops, and Ollama wiring from scratch — start with how to build a local AI agent.
Key Takeaways
- Agentic RAG = retrieval as a decision, not a fixed step. The agent chooses to search or answer, grades the results, rewrites and retries on bad retrievals, then answers — versus a static pipeline that always stuffs top-k into the prompt.
- LangGraph is the natural local framework because it supports cyclic graphs; its official agentic RAG tutorial defines the five-node decide → retrieve → grade → rewrite → answer loop. LlamaIndex offers equivalent patterns.
- The grading node is the upgrade. A simple binary "are these chunks relevant?" check, looping back to a query rewrite on "no," is what gives the agent self-correction.
- Best 2026 local stack: Qwen3-8B or Llama 3.1 8B (reasoning) + nomic-embed-text or bge-m3 (embeddings) + Chroma/Qdrant, on 16-32GB RAM, fully offline. Gemma 3 12B is a solid alternate brain (and Gemma 4 shipped in early 2026 if you want the newer line).
- Expect an agentic tax of 2-4 model calls per answer (a few extra seconds locally). Worth it for research and honesty; skip it and stay static for plain FAQ lookups.
Next Steps
- Haven't built the basics yet? Start with the static Ollama + ChromaDB RAG pipeline, then return here to make it agentic.
- Want zero code first? Spin up AnythingLLM for a one-click local RAG UI before you graduate to a custom agent.
- Need the agent foundations (tools, loops)? Read how to build a local AI agent with Ollama.
- Adding long-term recall? See local AI agent memory with Mem0.
- Picking the brain model? Compare the field in best Ollama models.