Published on June 20, 2026 • 16 min read

To build a local AI agent in 2026, run a tool-calling model on Ollama — Qwen3 (8B or the 30B-A3B MoE), Llama 3.1/3.3, or Mistral Small — then give it a loop that reads the model's tool calls and actually executes them (web search, shell, file read/write) before feeding the results back. The simplest working version is ~60 lines of plain Python against Ollama's OpenAI-compatible `/v1` endpoint; for anything multi-step or production-shaped, reach for a framework like LangGraph, CrewAI, or Goose. An "agent" is just an LLM in a loop with tools and memory — the model decides which tool to call, your code runs it, and the cycle repeats until the task is done. Everything below runs offline, costs $0 in API fees, and keeps your data on your machine.

Be clear-eyed about one thing up front: small local models are not GPT-5-class planners. An 8B model running on a laptop will happily call a tool, but it will also loop, hallucinate arguments, and give up on tasks longer than a few steps. This guide shows you the build and where the cliff is.

The 5 Steps

1. Pick a tool-calling model and pull it with Ollama.
2. Choose a framework (or plain Python); the options are compared below.
3. Wire real tools via function calling or MCP.
4. Add memory so the agent remembers across turns.
5. Run the loop, then test honestly against its limits.

What is a local AI agent, really?
Step 1: Pick a tool-calling model
Step 2: Pick a framework (or plain Python)
Step 3: Wire tools via function calling or MCP
Step 4: Add memory
Step 5: A complete runnable example
The honest limits of small local models
Key Takeaways
Next Steps

What is a local AI agent, really?

A local AI agent is a language model running entirely on your own hardware that can take actions, not just generate text. The pattern is the same one cloud agents use, stripped to its essentials:

1. You give the model a goal and a list of tools (functions it's allowed to call).
2. The model replies with either a final answer or a structured request to call a tool ("run web_search('latest Ollama release')").
3. Your code executes that tool and feeds the result back into the conversation.
4. Repeat until the model produces a final answer.

That loop — model → tool call → execution → observation → model — is the whole game. The "agentic" magic is just the model deciding which tool and what arguments, and your harness being disciplined about running them and capturing the output. Memory (so it remembers prior turns) and a stop condition (so it doesn't loop forever) are the two pieces that turn a toy into something usable.
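
The control flow of that loop can be sketched without a running model. In the sketch below the LLM is replaced by a scripted stand-in, so `fake_model`, the `add` tool, and the simplified message shapes are all illustrative assumptions rather than Ollama's actual API.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(a + b),
}

def fake_model(messages: list[dict]) -> dict:
    """Scripted stand-in for an LLM: request one tool call, then answer."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"The answer is {last['content']}.", "tool_call": None}
    return {"content": None, "tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}

def run_loop(goal: str, model=fake_model, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": goal}]             # memory
    for _ in range(max_steps):                                 # stop condition
        reply = model(messages)                                # model decides
        call = reply["tool_call"]
        if call is None:                                       # final answer
            return reply["content"]
        result = TOOLS[call["name"]](**call["args"])           # execution
        messages.append({"role": "tool", "content": result})   # observation
    return "stopped: max_steps reached"

print(run_loop("What is 2 + 3?"))
```

Swapping `fake_model` for a real chat call is the only structural change needed to turn this into a working agent; the memory list and the step cap stay exactly where they are.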

The thing that makes this practical in 2026 is that Ollama exposes an OpenAI-compatible /v1 endpoint at http://localhost:11434/v1. Any agent framework that speaks the OpenAI API works against your local model by changing one base URL — no cloud key, no per-token billing.
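
To make the "one base URL" point concrete, here is a minimal sketch that builds an OpenAI-style chat request against the local endpoint using only the standard library. The helper names `build_chat_request` and `ask` are hypothetical, and the `Authorization` value is a placeholder on the assumption that Ollama does not check it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"   # Ollama's OpenAI-compatible endpoint

def build_chat_request(prompt: str, model: str = "qwen3:8b",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 # Placeholder key: OpenAI-style clients expect one to be set.
                 "Authorization": "Bearer ollama"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request. Needs Ollama running locally with the model pulled."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, once Ollama is running:
#   print(ask("Reply with the single word: ready"))
```

Pointing the same code at a different OpenAI-compatible server should only require changing `base_url` and the key.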

Step 1: Pick a tool-calling model

Not every local model can call tools reliably. You want a model that was trained for function calling — meaning it emits valid, structured tool-call JSON instead of describing the tool in prose. These are the verified solid picks on Ollama right now:

| Model | Pull command | Params | VRAM (Q4_K_M) | Tool calling | Best for |
|---|---|---|---|---|---|
| Qwen3 8B | `ollama pull qwen3:8b` | 8B dense | ~6–8 GB | Native | Best all-round small agent; hybrid thinking mode |
| Qwen3 30B-A3B | `ollama pull qwen3:30b-a3b` | 30B MoE / 3B active | ~18–19 GB | Native | Strongest local agent if you have 24 GB VRAM |
| Llama 3.1 8B | `ollama pull llama3.1:8b` | 8B | ~6–8 GB | Native | Well-documented, 128K context, broad ecosystem |
| Llama 3.3 70B | `ollama pull llama3.3:70b` | 70B | ~40+ GB | Native | High-quality planning if you have the hardware |
| Mistral Small 3.2 | `ollama pull mistral-small3.2` | 24B | ~14–16 GB | Native | Best-in-class agentic + clean JSON output |

A few grounded facts behind that table:

Qwen3 was released April 29, 2025 in sizes 0.6B, 1.7B, 4B, 8B, 14B, 32B plus the 30B-A3B and 235B-A22B MoE variants. The whole open-weight family ships under Apache 2.0, with native tool calling plus a hybrid "thinking / non-thinking" mode. The 30B-A3B activates only ~3B parameters per token, so it punches well above its speed class — it runs about as fast as an 8B dense model while planning more like a much larger one.
Llama 3.1 (8B/70B) added "state-of-the-art tool use" and a 128K context window — it's the model Ollama's own tool support announcement launched with.
Mistral Small advertises native function calling and reliable JSON output, which matters a lot when your harness is parsing tool-call arguments.

Performance reality check: On an RTX 3090 (24 GB), Qwen3 8B and Llama 3.1 8B at the Q4_K_M quant comfortably push 40+ tok/s — fast enough that the model is rarely the bottleneck; tool latency (a web request, a shell command) usually is. The 30B-A3B MoE also fits in 24 GB and is noticeably smarter at multi-step planning, and because it activates only ~3B parameters per token it stays fast too (community RTX 3090 reports land in the 60–110 tok/s range, depending on context and quant). Treat all of these as approximate, single-box figures — your numbers will shift with context length, GPU, and quant.
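
The VRAM column in the table can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes Q4_K_M averages roughly 4.8 bits per weight and budgets a flat 2 GB for KV cache and runtime overhead; both numbers are rough assumptions, and real usage grows with context length.

```python
def weights_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

def fits(params_billions: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed overhead budget fit in vram_gb."""
    return weights_gb(params_billions) + overhead_gb <= vram_gb

for name, params in [("Qwen3 8B", 8), ("Mistral Small 3.2", 24),
                     ("Qwen3 30B-A3B", 30), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{weights_gb(params):.1f} GB weights, "
          f"fits in 24 GB: {fits(params, 24)}")
```

Note that a MoE model is sized by its total parameter count, not its active count: the 30B-A3B computes with ~3B parameters per token but still has to hold all 30B in memory.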

Rule of thumb: if a model isn't listed under Ollama's Tools category, don't expect reliable agent behavior from it. Base/instruct models without tool training will describe calling a function instead of emitting a call you can parse.

Step 2: Pick a framework (or plain Python)

You do not need a framework to build a local agent — the loop is simple enough to write by hand, and starting there teaches you exactly what every framework is doing under the hood. But once you want multi-agent workflows, checkpointing, or human-in-the-loop, a framework saves real time.

| Option | What it is | Strength | When to use |
|---|---|---|---|
| Plain Python + Ollama | ~60 lines: call `/v1`, parse tool calls, run them | Total control, zero deps, best for learning | Single-agent tasks; understanding the mechanics |
| n8n | Visual/no-code automation with an Ollama node | Drag-and-drop, great for wiring agents to apps | Non-coders, automations triggered by webhooks/cron |
| CrewAI | Multi-agent "crews" with roles | Fast to prototype agent teams, low ceremony | Rapid prototyping where dev speed > fine control |
| LangGraph | Graph-based agent orchestration | Checkpointing, rollback, audit trails, HITL | Production systems needing reliability + control |
| Goose | Block's open-source (Apache 2.0) coding agent | CLI + desktop, large MCP tool ecosystem, any LLM | A ready-made local coding agent, minimal setup |

Notes from the current landscape: CrewAI still leads LangGraph in raw GitHub stars (it crossed 30K while LangGraph sat around 13K), but LangGraph has the stronger pull among enterprise teams building compliance-sensitive systems. In one published 2026 benchmark on Qwen3 32B via Ollama, LangGraph completed ~62% of complex (8+ step) tasks versus CrewAI's ~54%, a real but not enormous gap (one source's words: a "meaningful 8-point gap"). CrewAI remains the faster on-ramp if you just want a team of agents working without learning graph theory. Goose is the shortcut if you want a finished local agent rather than a kit. It is an Apache-2.0 project from Block with 45K+ GitHub stars, contributed to the Linux Foundation's Agentic AI Foundation alongside MCP when the foundation was announced in December 2025. It pairs with Ollama and other local runtimes, and connects to a large MCP tool ecosystem (Block markets thousands of available tools; the repo ships 70+ first-party extensions).

My advice: write the plain-Python loop once (next sections), then graduate to LangGraph or Goose only when you hit a wall the hand-rolled loop can't handle.

Step 3: Wire tools via function calling or MCP

There are two ways to give your agent hands.

Option A — Function calling (define tools in your code)

You describe each tool as a JSON schema and pass it in the tools field of the chat request. Ollama's /api/chat (and the OpenAI-compatible /v1/chat/completions) accept the standard OpenAI tool format, so the model returns a tool_calls array your code reads and executes. This is the most direct path and what the runnable example below uses.

The three tools almost every agent wants:

Web search: a function that queries a search API (or a self-hosted SearXNG instance, which avoids third-party API keys but still needs internet access to reach upstream search engines) and returns snippets.
Shell / command runner — execute a whitelisted command and capture stdout. Sandbox this. Never hand a model raw, unrestricted shell on a machine you care about.
File read/write — read a file into context, or write generated output. Constrain it to a working directory.
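
As an illustration of the shell tool, here is a minimal sketch of an allowlisted command runner. The `ALLOWED` set, the `run_command` name, and the 4000-character cap are assumptions chosen for the example; this is a guardrail, not a sandbox.

```python
import shlex
import subprocess

ALLOWED = {"ls", "cat", "echo", "wc"}     # hypothetical allowlist; tune per task

def run_command(command: str, cwd: str = ".", timeout: int = 10) -> str:
    """Run one allowlisted command without a shell and return its output."""
    try:
        argv = shlex.split(command)
    except ValueError as e:                # e.g. unbalanced quotes
        return f"error: could not parse command: {e}"
    if not argv or argv[0] not in ALLOWED:
        return f"error: command not allowed: {argv[:1]}"
    try:
        # No shell=True, so pipes, ';' and '&&' arrive as literal arguments.
        proc = subprocess.run(argv, cwd=cwd, capture_output=True,
                              text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError) as e:
        return f"error: {e}"
    out = proc.stdout if proc.returncode == 0 else f"error: {proc.stderr}"
    return out[:4000]                      # cap what goes back into context
```

An allowlist of program names does not restrict arguments: `cat` can still read any file the process can reach. For anything beyond experiments, run the agent inside a container or VM with a throwaway working directory.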

Option B — MCP (Model Context Protocol)

MCP, open-sourced by Anthropic in November 2024, is now the de facto standard for connecting agents to tools — think "USB-C for AI tools." Instead of hand-coding every integration, you point your agent at an MCP server (filesystem, GitHub, a database, a browser) and it gains those tools instantly. Agents like Goose lean heavily on MCP to reach a large catalog of community servers, and you can bridge MCP servers to local Ollama models. Use MCP when you want a reusable ecosystem of tools across projects; use plain function calling when you want one or two bespoke tools with no extra moving parts.
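
Bridging MCP to a local model mostly comes down to translating between two JSON shapes. The sketch below assumes the MCP tool descriptor fields `name`, `description`, and `inputSchema`, and the JSON-RPC `tools/call` method; the example tool name is hypothetical, and the transport and initialization handshake are left out entirely.

```python
import json

def mcp_tool_to_openai(tool: dict) -> dict:
    """Convert one MCP tool descriptor into an OpenAI-style tool schema."""
    return {"type": "function", "function": {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "parameters": tool.get("inputSchema",
                               {"type": "object", "properties": {}}),
    }}

def mcp_call_request(request_id: int, name: str, arguments: dict) -> str:
    """Build the JSON-RPC 2.0 message asking an MCP server to run a tool."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "method": "tools/call",
                       "params": {"name": name, "arguments": arguments}})

# Hypothetical descriptor, shaped like one entry of a `tools/list` result.
example = {"name": "read_text_file", "description": "Read a file as text.",
           "inputSchema": {"type": "object",
                           "properties": {"path": {"type": "string"}},
                           "required": ["path"]}}

print(mcp_tool_to_openai(example)["function"]["name"])
print(mcp_call_request(1, "read_text_file", {"path": "notes.txt"}))
```

In practice an MCP client library handles the session and transport; the point here is only that the model sees the same tool schema either way.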

For a deeper, dedicated walkthrough of wiring MCP servers to a local model, see our Ollama + MCP integration guide.

Step 4: Add memory

A bare loop forgets everything once the conversation array is discarded. There are three escalating tiers of memory:

Conversation memory (free): just keep appending messages — user, assistant, tool results — to the messages list you send each turn. This is "memory" within a session and costs you nothing but context-window budget.
Summary memory: when the message list gets long, ask the model to summarize older turns into a compact note and replace them. This keeps you under the context limit on small models, which tend to lose coherence long before they fill their nominal 128K window.
Persistent / semantic memory: store facts in a vector database and retrieve the relevant ones each turn, so the agent "remembers" across sessions and restarts. Libraries like Mem0 specialize in this. We cover the full setup in local AI agent memory with Mem0.

For your first build, tier 1 is enough. Reach for tiers 2–3 the moment your agent needs to run longer than a single context window or remember things between runs.
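
Tier 2 can be sketched in a few lines. The version below assumes every message is a plain dict, that `messages[0]` is the system prompt, and that `summarize` is a callable you supply (for instance, a wrapper around a chat call); the thresholds are arbitrary illustrative defaults.

```python
from typing import Callable

def compact(messages: list[dict], summarize: Callable[[str], str],
            max_messages: int = 12, keep_last: int = 6) -> list[dict]:
    """Replace older turns with one summary message once the list grows long."""
    if len(messages) <= max_messages:
        return messages
    system, rest = messages[:1], messages[1:]   # assumes a leading system prompt
    cut = len(rest) - keep_last
    # Don't start the kept tail on a tool result whose request was summarized away.
    while cut < len(rest) and rest[cut]["role"] == "tool":
        cut += 1
    old, recent = rest[:cut], rest[cut:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    note = {"role": "system",
            "content": "Summary of earlier turns: " + summarize(transcript)}
    return system + [note] + recent

# Stand-in summarizer for demonstration; a real one would call the model.
def count_lines(text: str) -> str:
    return f"{len(text.splitlines())} earlier messages omitted"

history = [{"role": "system", "content": "You are a local AI agent."}]
history += [{"role": "user", "content": f"message {i}"} for i in range(20)]
print(len(compact(history, count_lines)))
```

Calling `compact` at the top of each loop iteration keeps the prompt bounded, at the cost of whatever detail the summary drops.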

Step 5: A complete runnable example

Here's a minimal but complete local agent in plain Python — no framework — using Ollama's OpenAI-compatible endpoint. It has two real tools (a calculator and a sandboxed file reader), a tool-execution loop, and conversation memory. It runs entirely offline once you've pulled the model.

Prereqs:

# 1. Install Ollama (macOS/Windows: download from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a tool-calling model
ollama pull qwen3:8b   # or llama3.1:8b, or mistral-small3.2

# 3. Install the OpenAI client (Ollama speaks the same API)
pip install openai

agent.py:

import json, os
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint. No real API key needed.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
MODEL = "qwen3:8b"

# --- Tool implementations (your code) ---
def calculate(expression: str) -> str:
    try:
        # NOTE: eval is unsafe in production — sandbox or use a real parser.
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"error: {e}"

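def read_file(path: str) -> str:
    # realpath resolves symlinks and ".." before the containment check.
    safe_root = os.path.realpath("./workspace")
    full = os.path.realpath(os.path.join(safe_root, path))
    try:
        # commonpath (unlike startswith) rejects siblings such as ./workspace_old.
        if os.path.commonpath([safe_root, full]) != safe_root:
            return "error: access denied"
        with open(full, encoding="utf-8") as f:
            return f.read()[:4000]          # cap what goes back into context
    except Exception as e:
        return f"error: {e}"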

TOOLS_IMPL = {"calculate": calculate, "read_file": read_file}

# --- Tool schemas (what the model sees) ---
tools = [
    {"type": "function", "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression.",
        "parameters": {"type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a text file from the ./workspace folder.",
        "parameters": {"type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]}}},
]

def run_agent(goal: str, max_steps: int = 6):
    # 'messages' IS the memory — we append to it every turn.
    messages = [
        {"role": "system", "content": "You are a local AI agent. "
         "Use tools when they help. Stop when the task is done."},
        {"role": "user", "content": goal},
    ]
    for step in range(max_steps):
        resp = client.chat.completions.create(
            model=MODEL, messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:        # model gave a final answer
            return msg.content

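        for call in msg.tool_calls:   # run each requested tool
            try:
                fn = TOOLS_IMPL[call.function.name]
                args = json.loads(call.function.arguments)
                result = fn(**args)
            except KeyError:          # model invented a tool name
                result = f"error: unknown tool {call.function.name!r}"
            except (json.JSONDecodeError, TypeError) as e:  # bad JSON or wrong args
                result = f"error: bad arguments: {e}"
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})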
    return "Stopped: hit max_steps without finishing."

if __name__ == "__main__":
    print(run_agent("What is 1487 * 23, and then read notes.txt "
                    "and summarize it in one sentence."))

Run it with python agent.py. Drop a notes.txt into a ./workspace folder first so the file tool has something to read. A tool-trained model will typically (1) call calculate, (2) call read_file, (3) read both observations back, then (4) produce a final answer. That's a real agent loop in ~60 lines, with no cloud and no bill.

What to change next: swap in a web_search tool, raise max_steps, and add tier-2 summary memory once conversations get long. The structure stays identical — you're only adding tools and tightening the stop condition.
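
A `web_search` tool could look something like the sketch below, which targets a self-hosted SearXNG instance. The instance address, the result field names (`title`, `url`, `content`), and the `format=json` parameter are assumptions about your setup; many instances ship with JSON output disabled, so check the instance settings first.

```python
import json
import urllib.parse
import urllib.request

SEARX_URL = "http://localhost:8080/search"   # assumed address of your instance

def format_results(payload: dict, limit: int = 5) -> str:
    """Turn a SearXNG-style JSON payload into short text snippets."""
    lines = []
    for item in payload.get("results", [])[:limit]:
        snippet = (item.get("content") or "")[:300]
        lines.append(f"{item.get('title', '')} | {item.get('url', '')}\n{snippet}")
    return "\n\n".join(lines) or "no results"

def web_search(query: str) -> str:
    """Query the instance and return snippets, or an error string on failure."""
    url = SEARX_URL + "?" + urllib.parse.urlencode({"q": query, "format": "json"})
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            return format_results(json.load(resp))
    except Exception as e:
        return f"error: {e}"

# Schema entry to append to the `tools` list the model sees.
web_search_schema = {"type": "function", "function": {
    "name": "web_search",
    "description": "Search the web and return the top result snippets.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]}}}
```

Registering it means adding `web_search` to `TOOLS_IMPL` and `web_search_schema` to `tools`; the loop itself does not change.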

The honest limits of small local models

This is the section most tutorials skip. If you build the agent above with an 8B model and then ask it to do something genuinely multi-step, you will see the rough edges:

Planning falls apart past ~3–6 steps. Small models lose the thread on long task chains, repeat tool calls, or declare victory early. In the LangGraph-vs-CrewAI benchmark cited above, the frameworks completed only ~54–62% of complex multi-step tasks even with a 32B model, and that's with mature frameworks doing the orchestration.
Argument hallucination. A small model will sometimes invent a file path or emit malformed JSON arguments. Always validate tool inputs in your code and return a clear error the model can recover from (the tools in the example return error strings instead of raising).
No real sandbox by default. A model that can run shell or eval is a model that can wreck your machine if it hallucinates a destructive command. Whitelist commands, constrain file access to a working directory, and never give an agent unrestricted privileges.
Context coherence ≠ context length. "128K context" does not mean an 8B model stays coherent across 128K tokens of agent history. Use summary memory aggressively.
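
Two of these failure modes, hallucinated arguments and repeated tool calls, can be caught cheaply in the harness. The sketch below checks arguments against a simple tool schema and flags exact repeat calls; `validate_args` and `is_repeat` are hypothetical helpers that cover only flat schemas, not full JSON Schema.

```python
import json

# JSON-schema type names mapped to Python types (subset for simple tool schemas).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
              "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, raw_arguments: str):
    """Return (args, None) if the arguments fit the schema, else (None, error)."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as e:
        return None, f"error: arguments are not valid JSON: {e}"
    if not isinstance(args, dict):
        return None, "error: arguments must be a JSON object"
    props = schema.get("properties", {})
    missing = [k for k in schema.get("required", []) if k not in args]
    if missing:
        return None, f"error: missing required arguments: {missing}"
    for key, value in args.items():
        if key not in props:
            return None, f"error: unknown argument {key!r}"
        expected = JSON_TYPES.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            return None, f"error: {key!r} should be of type {props[key]['type']}"
    return args, None

def is_repeat(seen: set, name: str, args: dict) -> bool:
    """True if this exact call was already made; otherwise record it."""
    key = (name, json.dumps(args, sort_keys=True))
    if key in seen:
        return True
    seen.add(key)
    return False
```

Returning the error string as the tool result gives the model a chance to correct itself on the next turn, and a repeat hit is a reasonable signal to stop early or nudge the model with a message saying the call was already made.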

The practical takeaway: for reliable agentic work today, either (a) use the largest model your hardware allows — Qwen3 30B-A3B or Llama 3.3 70B are meaningfully better planners than any 8B — or (b) keep tasks short and well-scoped and let a framework like LangGraph handle retries and checkpointing. Small local agents are excellent for focused, 1–3 step automations and privacy-sensitive tasks; they are not yet drop-in replacements for frontier cloud agents on open-ended work.

Key Takeaways

An agent is an LLM in a loop with tools and memory. Model proposes a tool call → your code runs it → result goes back → repeat until done.
Model choice is the first gate. Use a tool-trained model: Qwen3 8B / 30B-A3B, Llama 3.1/3.3, or Mistral Small. Base models without function-calling training won't emit parseable tool calls.
Ollama's OpenAI-compatible /v1 endpoint (http://localhost:11434/v1) lets every major framework talk to your local model by changing one base URL.
Start with ~60 lines of plain Python to learn the loop; graduate to LangGraph, CrewAI, or Goose for multi-agent, checkpointing, or a ready-made coding agent.
Wire tools via function calling for one-offs, or MCP for a reusable tool ecosystem. Sandbox anything that touches the shell or filesystem.
Add memory in tiers: append-to-messages → summarize → vector store (Mem0) for persistence.
Be honest about limits. Small local models break down past a few steps and can hallucinate tool arguments — scope tasks tightly and validate every input.

Next Steps

Now that you have a working loop, go deeper on the pieces that make a local agent actually useful:

Want a finished agent instead of a kit? Start with Goose + Ollama for a local agent — Block's open-source agent runs against your local model out of the box.
Give your agent a real toolbox. Connect Ollama to tools with MCP so it can reach the filesystem, GitHub, and thousands of other integrations without bespoke code.
Make it remember across sessions. Add persistent recall with our guide to local AI agent memory using Mem0.
Building a coding agent specifically? Cline + Ollama setup turns a local model into an autonomous in-editor coding agent.

Pick a tool-calling model, run the example above, then layer in MCP, memory, and a framework as your tasks grow. Everything stays on your machine, and it all costs $0 in API fees.

How to Build a Local AI Agent (2026): Ollama + Tools, Step by Step

Written by the Local AI Master Team