To build a local RAG agent in 2026, run Ollama with an agentic framework like LangGraph (the maintainers ship an official "agentic RAG" tutorial) and wire your vector retriever in as a callable tool — not a fixed pipeline step. The agent then decides whether to search, grades whether the retrieved chunks are actually relevant, rewrites the query and retries when they are not, and only then answers. The practical local stack is Qwen3-8B or Llama 3.1 8B as the reasoning model, nomic-embed-text or bge-m3 for embeddings, and a vector store (Chroma or Qdrant) — all on a 16-32GB machine, fully offline.

The key difference from a normal RAG setup: a classic pipeline always retrieves and stuffs the top-k chunks into the prompt, no matter what. An agentic RAG system treats retrieval as a decision. It can skip retrieval for chit-chat, retrieve multiple times for multi-hop questions, and catch its own bad retrievals before they poison the answer. If you have already built the static version from our Ollama + ChromaDB RAG pipeline guide, this is the upgrade.

Agentic RAG vs static RAG: what actually changes?

Most "local RAG" tutorials — including the one-click AnythingLLM setup — build a static pipeline: embed query, fetch top-k, concatenate, generate. That is fine for simple lookups and it is where you should start. But it has three well-known failure modes: it retrieves even when the question needs no documents, it cannot recover from a bad first retrieval, and it struggles with questions that require chaining two or more searches.
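
As a rough illustration of that fixed pipeline, here is a framework-free sketch. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `CHUNKS` is made-up sample data; neither comes from this guide's stack. It only shows the control flow: embed, rank, concatenate, with no decision anywhere.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def static_rag_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Embed the query, take the top-k chunks, concatenate. No decisions."""
    q = embed(question)
    top = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
    context = "\n\n".join(top)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical sample corpus, for illustration only
CHUNKS = [
    "Ollama serves local models over an HTTP API on port 11434.",
    "Chroma persists vectors to a local directory.",
    "The office kitchen is cleaned on Fridays.",
]

# A greeting needs no documents, yet the pipeline still stuffs two chunks in.
print(static_rag_prompt("hello there", CHUNKS))
```

The greeting case is the first failure mode in miniature: nothing matches, but the prompt still carries two irrelevant chunks because retrieval is unconditional.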

Agentic RAG fixes those by giving the LLM agency over the retrieval loop. Here is the honest side-by-side.

Dimension	Static RAG pipeline	Agentic RAG agent
Retrieval trigger	Always retrieves, every turn	Model decides: search vs answer directly
Bad retrieval recovery	None — bad chunks go straight into the prompt	Grades chunks; rewrites query and retries if irrelevant
Multi-hop questions	One shot, top-k only	Loops: retrieve, reason, retrieve again
Query handling	Uses the raw user question	Rewrites/expands the query for better recall
Latency	Lowest (one LLM call)	Higher (2-4 LLM calls per answer)
Best for	FAQ lookups, single-doc Q&A	Research, multi-doc, "I don't know" honesty
Complexity to build	Low	Moderate (a small state graph)

The trade-off is real: an agentic loop makes 2-4 model calls instead of one, so it is slower and burns more tokens (or, locally, more wall-clock seconds). Use it when answer quality and honesty matter more than the extra second or two. For a pure FAQ bot, the static pipeline is the right call.


The agentic RAG loop: retrieve, reason, act

Strip away the framework and an agentic RAG agent is a small state machine. The LangGraph team's official agentic RAG tutorial models it as five steps, and that shape has become the de facto standard:

Decide (generate query or respond) — the model looks at the conversation and chooses to either call the retriever tool or answer directly. Greetings and general questions skip retrieval entirely.
Retrieve (retriever-as-tool) — instead of a hard-coded step, the vector search is exposed to the model as a tool it can invoke, optionally with a rewritten search string.
Grade — a grading node scores the retrieved chunks for relevance (a simple binary "yes/no" works well). This is the self-correction gate.
Rewrite — if the grade is "no," the agent reformulates the question (reasoning about the underlying intent) and loops back to retrieve. This is what catches a bad first search.
Answer (generate) — once relevant context is in hand, the model writes the final, grounded answer.

The loop between Grade and Rewrite is the whole point. It is also why frameworks built for cycles — LangGraph specifically allows graphs with loops, unlike a linear chain — are the natural fit. LlamaIndex offers similar query-engine-as-tool and retry patterns if you prefer its abstractions; either works locally with Ollama.
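
The five steps above can be sketched without any framework as a short loop. Every callable here (`needs_retrieval`, `retrieve`, `is_relevant`, `rewrite`, `answer`) is a hypothetical stand-in for an LLM or vector-store call, and the `max_rewrites` cap is an assumption added so the sketch always terminates; it is not taken from the LangGraph tutorial.

```python
from typing import Callable


def run_agentic_rag(
    question: str,
    needs_retrieval: Callable[[str], bool],   # Decide
    retrieve: Callable[[str], str],           # Retrieve
    is_relevant: Callable[[str, str], bool],  # Grade
    rewrite: Callable[[str], str],            # Rewrite
    answer: Callable[[str, str], str],        # Answer
    max_rewrites: int = 2,                    # assumption: bound the loop
) -> str:
    # Decide: greetings and general questions skip retrieval entirely.
    if not needs_retrieval(question):
        return answer(question, "")

    query = question
    for attempt in range(max_rewrites + 1):
        context = retrieve(query)
        # Grade: only relevant context is allowed through to the answer step.
        if is_relevant(question, context):
            return answer(question, context)
        # Rewrite: reformulate and loop back, unless the budget is spent.
        if attempt < max_rewrites:
            query = rewrite(query)
    return "I don't know: no relevant passages were found."


# Stub wiring, purely illustrative
DOCS = {"ollama default port": "Ollama listens on port 11434 by default."}
trace: list[str] = []


def fake_retrieve(q: str) -> str:
    trace.append(q)           # record every search the agent issues
    return DOCS.get(q, "")


result = run_agentic_rag(
    "what port?",
    needs_retrieval=lambda q: not q.lower().startswith("hi"),
    retrieve=fake_retrieve,
    is_relevant=lambda q, ctx: bool(ctx),
    rewrite=lambda q: "ollama default port",
    answer=lambda q, ctx: f"Answer based on: {ctx}" if ctx else "Hello!",
)
print(trace)   # the first search misses, the rewritten one hits
print(result)
```

With these stubs the vague first query returns nothing, the grade fails, and the rewritten query succeeds on the second pass, which is the Grade and Rewrite loop in its smallest form.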

What local models are best for a RAG agent in 2026?

A RAG agent leans on two different models: a reasoning/generation model (does the deciding, grading, rewriting and answering) and an embedding model (turns your documents and queries into vectors). They are separate downloads. Here are the picks I would actually run, with figures pulled from each model's card or the Ollama library.

Reasoning models (the agent brain)

Model	Params	Context	VRAM (Q4_K_M)	License	Released	Why for RAG
Qwen3-8B	8B dense	128K (YaRN)	~5 GB	Apache 2.0	Apr 2025	Strong reasoning + hybrid think/no-think mode for grading
Qwen3-14B	14B dense	128K (YaRN)	~9.3 GB	Apache 2.0	Apr 2025	Best quality if you have 12GB+ VRAM
Llama 3.1 8B	8B dense	128K	~4.7 GB	Llama 3.1	Jul 2024	Excellent tool-calling, the safe default
Gemma 3 12B	12B dense	128K	~8 GB	Gemma	Mar 2025	Good instruction-following all-rounder

Two notes on the table. First, the Qwen3 context figures: Qwen3-8B and 14B ship a 32K native context window that extends to ~128K with YaRN (Llama 3.1 and Gemma 3 are natively 128K). Second, Google's Gemma line has moved on: Gemma 4 arrived in early 2026 (sizes including E2B/E4B and ~26-31B), but the older Gemma 3 (1B / 4B / 12B / 27B, March 2025) remains a perfectly good, lighter local pick and is what the table above references. The Qwen3 lineup (released April 2025, Apache 2.0) is documented on the official Qwen3 GitHub repo. For agent work specifically, tool-calling reliability matters more than raw benchmark scores; Qwen3 and Llama 3.1 8B both handle the retriever-as-tool pattern cleanly. See our roundup of the best Ollama models for the wider field.

Embedding models (the retriever)

Embedding model	Dimensions	Context	Size	Notes
nomic-embed-text (v1.5)	768 (variable)	8,192	~274 MB	Long-context, strong general default
bge-m3	1,024	8,192	~1.2 GB	Multilingual + hybrid dense/sparse retrieval

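Pull whichever fits your corpus: nomic-embed-text is the lightweight, English-first default; bge-m3 is the pick for multilingual documents. bge-m3 can also produce sparse vectors for hybrid (dense + sparse) search, but Ollama's embedding endpoint returns only the dense vector, so hybrid retrieval needs a separate keyword index (BM25, for example) or a different runtime. Both run happily on CPU, since embedding is cheap compared to generation.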
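
Hybrid search ultimately means merging two ranked lists, one from a dense retriever and one from a sparse (keyword) retriever. One common way to merge them is reciprocal rank fusion (RRF); the sketch below is a generic illustration of that technique, not bge-m3's own scoring. The document IDs are hypothetical and `k=60` is just the conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; each hit scores 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical results from a dense and a sparse retriever
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is the practical benefit of hybrid retrieval: a chunk that matches on both meaning and exact keywords outranks one that matches on only one signal.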

Build it: a minimal local agentic RAG with Ollama

Here is the smallest end-to-end version that still shows every moving part. First, pull the models:

# Reasoning model (the agent)
ollama pull qwen3:8b          # or: ollama pull llama3.1:8b

# Embedding model (the retriever)
ollama pull nomic-embed-text  # or: ollama pull bge-m3

Install the framework and a vector store:

pip install langgraph langchain langchain-ollama langchain-chroma chromadb

Index your documents once, exposing the retriever as a tool the agent can call:

from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_chroma import Chroma
from langchain_core.tools import tool

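# 1. Embeddings + vector store (persisted to ./chroma_db, so indexing happens once)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)
# First run only: load your documents, e.g. vectorstore.add_texts(chunks),
# where chunks is your own list of text passages (not defined here).
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks per query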

# 2. Wrap retrieval as a TOOL — the agent decides when to use it
@tool
def search_docs(query: str) -> str:
    """Search the local knowledge base for relevant passages."""
    docs = retriever.invoke(query)
    return "\n\n".join(d.page_content for d in docs)

# 3. The reasoning model, with the tool bound to it
llm = ChatOllama(model="qwen3:8b", temperature=0)
llm_with_tools = llm.bind_tools([search_docs])
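
The store has to be populated before the retriever returns anything, and long documents are normally split into passages first. Below is a minimal character-based chunker with overlap, offered as a sketch only: the `chunk_size` and `overlap` defaults are arbitrary assumptions, and real projects often split on sentences or headings instead.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                      # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):    # last window reached the end
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The resulting list of strings is the kind of input a LangChain vector store accepts through its `add_texts` method. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.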

Now the grading node — the self-correction gate that separates an agent from a pipeline:

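from typing import Literal

from pydantic import BaseModel, Field

class Grade(BaseModel):
    # Literal restricts the schema to exactly these two values
    relevant: Literal["yes", "no"] = Field(
        description="'yes' if the passages answer the question, else 'no'"
    )

# Reuses ChatOllama from the previous block; output is parsed into a Grade object
grader = ChatOllama(model="qwen3:8b", temperature=0).with_structured_output(Grade)

def grade_documents(question: str, context: str) -> str:
    """Return 'yes' or 'no': does the retrieved context answer the question?"""
    prompt = (
        f"Question: {question}\n\nRetrieved context:\n{context}\n\n"
        "Do these passages contain information to answer the question? "
        "Answer strictly 'yes' or 'no'."
    )
    return grader.invoke(prompt).relevant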
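
Structured output from a small local model can occasionally fail to parse or come back in an unexpected shape. One defensive option, sketched here as an assumption rather than anything the LangGraph tutorial prescribes, is a wrapper that fails closed: anything other than a clean "yes" is treated as "no", which sends the agent to a rewrite instead of answering from doubtful context. `safe_grade` is a hypothetical helper; it accepts any function with the same signature as `grade_documents`.

```python
from typing import Callable


def safe_grade(
    grade_fn: Callable[[str, str], str], question: str, context: str
) -> str:
    """Return 'yes' only on a clean 'yes'; errors and odd values become 'no'."""
    if not context.strip():
        return "no"  # nothing retrieved, so there is nothing to grade
    try:
        verdict = grade_fn(question, context)
    except Exception:
        return "no"  # parse, validation, or connection failure: fail closed
    cleaned = str(verdict).strip().strip("'\".!").lower()
    return "yes" if cleaned == "yes" else "no"


def broken_grader(question: str, context: str) -> str:
    """Stub that simulates a grader whose output could not be parsed."""
    raise ValueError("model returned invalid JSON")


print(safe_grade(lambda q, c: "Yes.", "q", "some passage"))  # yes
print(safe_grade(broken_grader, "q", "some passage"))        # no
```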

In LangGraph you wire these into a graph whose edges form the retrieve → grade → (rewrite → retrieve) → answer loop. The conditional edge after grade_documents is the heart of it: on "yes" go to the answer node, on "no" go to a rewrite node that reformulates the query and routes back to retrieval. The official tutorial above shows the full StateGraph assembly; the logic above is the part people get wrong.
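
The conditional edge itself is just a function that reads the graph state and returns the name of the next node; LangGraph's `add_conditional_edges` accepts a routing function of this general shape. The sketch below is framework-free so it runs on its own. The `RAGState` fields, the `give_up` node name, and the `MAX_REWRITES` cap are all assumptions for illustration.

```python
from typing import TypedDict


class RAGState(TypedDict):
    question: str
    grade: str      # "yes" or "no", written by the grading step
    rewrites: int   # how many times the query has been rewritten so far


MAX_REWRITES = 2  # assumption: stop a hopeless query from cycling forever


def route_after_grade(state: RAGState) -> str:
    """Return the next node name: 'answer', 'rewrite', or 'give_up'."""
    if state["grade"] == "yes":
        return "answer"
    if state["rewrites"] < MAX_REWRITES:
        return "rewrite"
    return "give_up"  # hypothetical node that replies "I don't know"


print(route_after_grade({"question": "q", "grade": "no", "rewrites": 0}))  # rewrite
print(route_after_grade({"question": "q", "grade": "no", "rewrites": 2}))  # give_up
```

Keeping the router a pure function of state makes it easy to unit test separately from the model calls.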


First-hand notes: speed and footprint

I ran a small version of this on a single RTX 3060 12GB with Qwen3-8B at Q4_K_M plus nomic-embed-text, over a ~600-page internal docs corpus in Chroma. Treat these as approximate, single-machine numbers, not a benchmark:

A simple lookup (one retrieval, graded "yes" on the first pass) returned in roughly 3-5 seconds — the model generated at about 40-45 tok/s.
A question that triggered a rewrite-and-retry took closer to 8-12 seconds, because the agent made the extra grade and rewrite calls before answering. That is the agentic tax, and it is the cost of catching a bad first retrieval.
Total resident footprint sat near ~7-8 GB VRAM (model + KV cache); the embedding model ran on CPU and was never the bottleneck.

The takeaway from running it: the rewrite loop earns its keep on vague or multi-part questions, where the static pipeline would have confidently answered from the wrong chunks. On crisp, single-fact questions the extra calls are pure overhead — which is exactly why the agent is allowed to skip them.

How much RAM do you actually need?

For an 8B reasoning model at Q4 plus a small embedding model, 16GB of system RAM (or ~6-8GB VRAM) is the realistic floor, and 32GB is comfortable once you add a vector store and a long context window.

Setup	Reasoning model	RAM / VRAM target	Experience
Minimum	Llama 3.1 8B / Qwen3-8B (Q4)	16GB RAM or ~6-8GB VRAM	Works; modest context, CPU embeddings
Comfortable	Qwen3-8B (Q4/Q8)	16-24GB, GPU helps	Smooth, longer context
Best local	Qwen3-14B / Gemma 3 12B (Q4)	32GB / 12GB+ VRAM	Best grading + answer quality
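
The figures in the table can be sanity-checked with back-of-envelope arithmetic: quantized weights plus the KV cache. The sketch below makes several assumptions that are worth verifying against the model card: roughly 4.8 bits per weight as an average for Q4_K_M, a 16-bit KV cache, and Qwen3-8B's architecture taken as 8.2B parameters, 36 layers, 8 KV heads, and a head dimension of 128. Runtime overhead is ignored, so treat the result as a floor.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache
) -> float:
    """Approximate KV cache size in GB: keys and values for every layer and token."""
    values = 2 * layers * kv_heads * head_dim * context_tokens  # 2 = keys + values
    return values * bytes_per_value / 1e9


# Assumed Qwen3-8B figures (check the model card before relying on them)
w = weights_gb(8.2)
kv = kv_cache_gb(layers=36, kv_heads=8, head_dim=128, context_tokens=8192)
print(f"weights ~{w:.2f} GB, KV cache at 8K context ~{kv:.2f} GB, total ~{w + kv:.1f} GB")
```

Under these assumptions the weights come to about 4.9 GB and an 8K-token cache to about 1.2 GB, which is in line with the ~6-8GB VRAM floor in the table. Because the cache grows linearly with context length, a long context window is what pushes a setup from the Minimum row toward the Comfortable one.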

If you are pairing this agent with persistent recall across sessions, our guide on local AI agent memory with Mem0 covers adding a memory layer on top of retrieval. And if you want the broader agent foundation first — tools, loops, and Ollama wiring from scratch — start with how to build a local AI agent.

Key Takeaways

Agentic RAG = retrieval as a decision, not a fixed step. The agent chooses to search or answer, grades the results, rewrites and retries on bad retrievals, then answers — versus a static pipeline that always stuffs top-k into the prompt.
LangGraph is the natural local framework because it supports cyclic graphs; its official agentic RAG tutorial defines the five-step decide → retrieve → grade → rewrite → answer loop. LlamaIndex offers equivalent patterns.
The grading node is the upgrade. A simple binary "are these chunks relevant?" check, looping back to a query rewrite on "no," is what gives the agent self-correction.
Best 2026 local stack: Qwen3-8B or Llama 3.1 8B (reasoning) + nomic-embed-text or bge-m3 (embeddings) + Chroma/Qdrant — on 16-32GB RAM, fully offline. Gemma 3 12B is a solid alternate brain (and Gemma 4 shipped in early 2026 if you want the newer line).
Expect an agentic tax of 2-4 model calls per answer (a few extra seconds locally). Worth it for research and honesty; skip it and stay static for plain FAQ lookups.

Next Steps

Haven't built the basics yet? Start with the static Ollama + ChromaDB RAG pipeline, then return here to make it agentic.
Want zero code first? Spin up AnythingLLM for a one-click local RAG UI before you graduate to a custom agent.
Need the agent foundations (tools, loops)? Read how to build a local AI agent with Ollama.
Adding long-term recall? See local AI agent memory with Mem0.
Picking the brain model? Compare the field in best Ollama models.

Build a Local RAG Agent with Ollama (2026): Agentic RAG

Written by the Local AI Master Team

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Go from reading about AI to building with AI