How to Build a Local AI Agent (2026): Ollama + Tools, Step by Step

June 20, 2026
16 min read
Local AI Master Research Team

To build a local AI agent in 2026, run a tool-calling model on Ollama — Qwen3 (8B or the 30B-A3B MoE), Llama 3.1/3.3, or Mistral Small — then give it a loop that reads the model's tool calls and actually executes them (web search, shell, file read/write) before feeding the results back. The simplest working version is ~60 lines of plain Python against Ollama's OpenAI-compatible `/v1` endpoint; for anything multi-step or production-shaped, reach for a framework like LangGraph, CrewAI, or Goose. An "agent" is just an LLM in a loop with tools and memory — the model decides which tool to call, your code runs it, and the cycle repeats until the task is done. Everything below runs offline, costs $0 in API fees, and keeps your data on your machine.

Be clear-eyed about one thing up front: small local models are not GPT-5-class planners. An 8B model running on a laptop will happily call a tool, but it will also loop, hallucinate arguments, and give up on tasks longer than a few steps. This guide shows you the build and where the cliff is.

The 5 Steps

  1. Pick a tool-calling model and pull it with Ollama.
  2. Choose a framework (or plain Python) — compare below.
  3. Wire real tools via function calling or MCP.
  4. Add memory so the agent remembers across turns.
  5. Run the loop, then test honestly against its limits.

Table of Contents

  1. What is a local AI agent, really?
  2. Step 1: Pick a tool-calling model
  3. Step 2: Pick a framework (or plain Python)
  4. Step 3: Wire tools via function calling or MCP
  5. Step 4: Add memory
  6. Step 5: A complete runnable example
  7. The honest limits of small local models
  8. Key Takeaways
  9. Next Steps

What is a local AI agent, really?

A local AI agent is a language model running entirely on your own hardware that can take actions, not just generate text. The pattern is the same one cloud agents use, stripped to its essentials:

  1. You give the model a goal and a list of tools (functions it's allowed to call).
  2. The model replies with either a final answer or a structured request to call a tool ("run web_search('latest Ollama release')").
  3. Your code executes that tool and feeds the result back into the conversation.
  4. Repeat until the model produces a final answer.

That loop — model → tool call → execution → observation → model — is the whole game. The "agentic" magic is just the model deciding which tool and what arguments, and your harness being disciplined about running them and capturing the output. Memory (so it remembers prior turns) and a stop condition (so it doesn't loop forever) are the two pieces that turn a toy into something usable.
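The loop described above can be sketched without any model at all. This is a minimal illustration, not the article's final code: `scripted_model` is a hypothetical stand-in that requests one tool call and then answers, and the reply format (`tool`, `args`, `content` keys) is invented here for brevity.

```python
from typing import Callable

def agent_loop(model: Callable[[list], dict], tools: dict, goal: str,
               max_steps: int = 5) -> str:
    """Run model -> tool call -> execution -> observation until a final answer."""
    messages = [{"role": "user", "content": goal}]      # conversation memory
    for _ in range(max_steps):                           # stop condition
        reply = model(messages)
        messages.append({"role": "assistant", **reply})
        if "tool" not in reply:                          # no tool requested: final answer
            return reply["content"]
        result = tools[reply["tool"]](**reply["args"])   # your code runs the tool
        messages.append({"role": "tool", "content": str(result)})
    return "stopped: max_steps reached"

# Hypothetical stand-in for a real model: asks for one tool call, then answers.
def scripted_model(messages: list) -> dict:
    if messages[-1]["role"] == "tool":
        return {"content": f"The answer is {messages[-1]['content']}."}
    return {"tool": "add", "args": {"a": 2, "b": 3}}

answer = agent_loop(scripted_model, {"add": lambda a, b: a + b}, "What is 2 + 3?")
print(answer)
```

Swapping `scripted_model` for a real model call is the only structural change a working agent needs; memory and the stop condition are already present as the `messages` list and `max_steps`.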

The thing that makes this practical in 2026 is that Ollama exposes an OpenAI-compatible /v1 endpoint at http://localhost:11434/v1. Any agent framework that speaks the OpenAI API works against your local model by changing one base URL — no cloud key, no per-token billing.
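To make the "one base URL" point concrete, here is a sketch that talks to the endpoint with nothing but the standard library. The helper names `build_chat_request` and `ask` are illustrative, and the `Authorization` value is a placeholder that Ollama is assumed to ignore. Sending the request requires a running Ollama server with the model already pulled.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"   # the only Ollama-specific part

def build_chat_request(prompt: str, model: str = "qwen3:8b",
                       base_url: str = OLLAMA_BASE_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},   # placeholder key (assumption)
        method="POST",
    )

def ask(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's text (needs Ollama running)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

req = build_chat_request("Say hello in five words.")
print(req.get_method(), req.full_url)
```

Pointing the same code at a different OpenAI-compatible server only means passing a different `base_url`.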


Step 1: Pick a tool-calling model

Not every local model can call tools reliably. You want a model that was trained for function calling — meaning it emits valid, structured tool-call JSON instead of describing the tool in prose. These are the verified solid picks on Ollama right now:

| Model | Pull command | Params | VRAM (Q4_K_M) | Tool calling | Best for |
|---|---|---|---|---|---|
| Qwen3 8B | ollama pull qwen3:8b | 8B dense | ~6–8 GB | Native | Best all-round small agent; hybrid thinking mode |
| Qwen3 30B-A3B | ollama pull qwen3:30b-a3b | 30B MoE / 3B active | ~18–19 GB | Native | Strongest local agent if you have 24 GB VRAM |
| Llama 3.1 8B | ollama pull llama3.1:8b | 8B | ~6–8 GB | Native | Well-documented, 128K context, broad ecosystem |
| Llama 3.3 70B | ollama pull llama3.3:70b | 70B | ~40+ GB | Native | High-quality planning if you have the hardware |
| Mistral Small 3.2 | ollama pull mistral-small3.2 | 24B | ~14–16 GB | Native | Best-in-class agentic + clean JSON output |

A few grounded facts behind that table:

  • Qwen3 was released April 29, 2025 in sizes 0.6B, 1.7B, 4B, 8B, 14B, 32B plus the 30B-A3B and 235B-A22B MoE variants. The whole open-weight family ships under Apache 2.0, with native tool calling plus a hybrid "thinking / non-thinking" mode. The 30B-A3B activates only ~3B parameters per token, so it punches well above its speed class — it runs about as fast as an 8B dense model while planning more like a much larger one.
  • Llama 3.1 (8B/70B) added "state-of-the-art tool use" and a 128K context window — it's the model Ollama's own tool support announcement launched with.
  • Mistral Small advertises native function calling and reliable JSON output, which matters a lot when your harness is parsing tool-call arguments.

Performance reality check: On an RTX 3090 (24 GB), Qwen3 8B and Llama 3.1 8B at the Q4_K_M quant comfortably push 40+ tok/s — fast enough that the model is rarely the bottleneck; tool latency (a web request, a shell command) usually is. The 30B-A3B MoE also fits in 24 GB and is noticeably smarter at multi-step planning, and because it activates only ~3B parameters per token it stays fast too (community RTX 3090 reports land in the 60–110 tok/s range, depending on context and quant). Treat all of these as approximate, single-box figures — your numbers will shift with context length, GPU, and quant.
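A back-of-the-envelope check on the VRAM column can be sketched as follows. Both constants are assumptions, not measurements: roughly 4.8 bits per weight for Q4_K_M, and a flat 1.5 GB allowance for KV cache and runtime buffers at short context. Note that a MoE model still needs all of its weights resident, so the 30B-A3B is sized as a 30B model even though only ~3B parameters are active per token.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.8,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights plus a flat overhead allowance."""
    weights_gb = params_billion * bits_per_weight / 8   # 1e9 params * (bits / 8) bytes
    return round(weights_gb + overhead_gb, 1)

MODELS = [("qwen3:8b", 8), ("mistral-small3.2", 24),
          ("qwen3:30b-a3b", 30), ("llama3.3:70b", 70)]

for name, size in MODELS:
    print(f"{name:18} ~{estimate_vram_gb(size)} GB")
```

The estimates land close to the table's ranges, but long contexts grow the KV cache well past the flat allowance used here, so treat the result as a lower bound.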

Rule of thumb: if a model isn't listed under Ollama's Tools category, don't expect reliable agent behavior from it. Base/instruct models without tool training will describe calling a function instead of emitting a call you can parse.


Step 2: Pick a framework (or plain Python)

You do not need a framework to build a local agent — the loop is simple enough to write by hand, and starting there teaches you exactly what every framework is doing under the hood. But once you want multi-agent workflows, checkpointing, or human-in-the-loop, a framework saves real time.

| Option | What it is | Strength | When to use |
|---|---|---|---|
| Plain Python + Ollama | ~60 lines: call /v1, parse tool calls, run them | Total control, zero deps, best for learning | Single-agent tasks; understanding the mechanics |
| n8n | Visual/no-code automation with an Ollama node | Drag-and-drop, great for wiring agents to apps | Non-coders, automations triggered by webhooks/cron |
| CrewAI | Multi-agent "crews" with roles | Fast to prototype agent teams, low ceremony | Rapid prototyping where dev speed > fine control |
| LangGraph | Graph-based agent orchestration | Checkpointing, rollback, audit trails, HITL | Production systems needing reliability + control |
| Goose | Block's open-source (Apache 2.0) coding agent | CLI + desktop, large MCP tool ecosystem, any LLM | A ready-made local coding agent, minimal setup |

Notes from the current landscape: CrewAI still leads LangGraph in raw GitHub stars (it crossed 30K while LangGraph sat around 13K), but LangGraph has the stronger pull among enterprise teams building compliance-sensitive systems. In one published 2026 benchmark on Qwen3 32B via Ollama, LangGraph completed ~62% of complex (8+ step) tasks versus CrewAI's ~54%, a real but not enormous gap (one source calls it a "meaningful 8-point gap"). CrewAI remains the faster on-ramp if you just want a team of agents working without learning a graph-based orchestration model. Goose is the shortcut if you want a finished local agent rather than a kit: it is an Apache-2.0 project from Block with 45K+ GitHub stars, contributed to the Linux Foundation's Agentic AI Foundation in December 2025 alongside MCP, and it pairs with Ollama and other local runtimes. It also connects to a large MCP tool ecosystem (Block markets thousands of available tools; the repo ships 70+ first-party extensions).

My advice: write the plain-Python loop once (next sections), then graduate to LangGraph or Goose only when you hit a wall the hand-rolled loop can't handle.


Step 3: Wire tools via function calling or MCP

There are two ways to give your agent hands.

Option A — Function calling (define tools in your code)

You describe each tool as a JSON schema and pass it in the tools field of the chat request. Ollama's /api/chat (and the OpenAI-compatible /v1/chat/completions) accept the standard OpenAI tool format, so the model returns a tool_calls array your code reads and executes. This is the most direct path and what the runnable example below uses.
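Writing these schemas by hand gets repetitive, so one option is to derive them from the function itself. The sketch below is an illustration under simplifying assumptions: `tool_schema` and `word_count` are hypothetical names, only flat parameters with basic type hints are handled, and anything unannotated falls back to `string`.

```python
import inspect

# Map simple Python type hints to JSON Schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn) -> dict:
    """Build an OpenAI-format tool schema from a function's signature and docstring."""
    props, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        props[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:   # no default: the model must supply it
            required.append(name)
    return {"type": "function", "function": {
        "name": fn.__name__,
        "description": (inspect.getdoc(fn) or "").split("\n")[0],
        "parameters": {"type": "object", "properties": props, "required": required}}}

def word_count(text: str, min_length: int = 1) -> str:
    """Count the words in a piece of text."""
    return str(sum(len(w) >= min_length for w in text.split()))

schema = tool_schema(word_count)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```

The resulting dictionary has the same shape as the hand-written entries passed in the `tools` field, so the two approaches can be mixed.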

The three tools almost every agent wants:

  • Web search: a function that queries a search API (or a self-hosted SearXNG instance, if you'd rather not depend on a third-party API key) and returns snippets. This is the one tool that needs a network connection.
  • Shell / command runner: execute a whitelisted command and capture stdout. Sandbox this. Never hand a model raw, unrestricted shell on a machine you care about.
  • File read/write: read a file into context, or write generated output. Constrain it to a working directory.
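As an illustration of the whitelisting advice for the shell tool, here is a minimal sketch. The allowlist contents and the `run_command` name are assumptions for this example. It checks only the program name, so it does not restrict arguments (`cat` can still read any readable file) and is not a substitute for a real sandbox such as a container or an unprivileged user.

```python
import shlex
import subprocess

ALLOWED_COMMANDS = frozenset({"echo", "ls", "cat", "wc"})   # hypothetical allowlist

def run_command(command: str, allowed=ALLOWED_COMMANDS,
                cwd: str = ".", timeout: int = 10) -> str:
    """Run one allowlisted program without a shell and return its output."""
    try:
        argv = shlex.split(command)
    except ValueError as e:                      # e.g. unbalanced quotes
        return f"error: could not parse command: {e}"
    if not argv or argv[0] not in allowed:
        return f"error: command not allowed: {argv[0] if argv else '(empty)'}"
    try:
        # No shell is involved, so pipes, redirects and `;` chaining are inert text.
        proc = subprocess.run(argv, cwd=cwd, capture_output=True,
                              text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as e:
        return f"error: {e}"
    return (proc.stdout or proc.stderr)[:4000]   # cap what goes back into context

print(run_command("rm -rf ./workspace"))
```

Returning errors as strings rather than raising lets the model see what went wrong and try a different command.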

Option B — MCP (Model Context Protocol)

MCP, open-sourced by Anthropic in November 2024, is now the de facto standard for connecting agents to tools — think "USB-C for AI tools." Instead of hand-coding every integration, you point your agent at an MCP server (filesystem, GitHub, a database, a browser) and it gains those tools instantly. Agents like Goose lean heavily on MCP to reach a large catalog of community servers, and you can bridge MCP servers to local Ollama models. Use MCP when you want a reusable ecosystem of tools across projects; use plain function calling when you want one or two bespoke tools with no extra moving parts.
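Bridging an MCP server to a local model mostly comes down to two translations, sketched below. MCP tools are described with `name`, `description`, and `inputSchema` fields and invoked with a JSON-RPC 2.0 `tools/call` request. The helper names and the example `read_text_file` descriptor are hypothetical, and the transport and initialization handshake are omitted entirely.

```python
import json

def mcp_tool_to_openai(tool: dict) -> dict:
    """Convert an MCP tool descriptor into an OpenAI-format tool schema."""
    return {"type": "function", "function": {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "parameters": tool.get("inputSchema", {"type": "object", "properties": {}})}}

def mcp_call_request(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/call",
                       "params": {"name": name, "arguments": arguments}})

# Hypothetical descriptor, shaped like one entry of a `tools/list` response.
descriptor = {
    "name": "read_text_file",
    "description": "Read a UTF-8 text file.",
    "inputSchema": {"type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"]},
}

openai_tool = mcp_tool_to_openai(descriptor)
request = mcp_call_request("read_text_file", {"path": "notes.txt"})
print(request)
```

In practice an MCP client library handles the connection and handshake; the point here is only that the model-facing schema and the server-facing call carry the same information.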

For a deeper, dedicated walkthrough of wiring MCP servers to a local model, see our Ollama + MCP integration guide.


Step 4: Add memory

A bare loop forgets everything once the conversation array is discarded. There are three escalating tiers of memory:

  1. Conversation memory (free): just keep appending messages — user, assistant, tool results — to the messages list you send each turn. This is "memory" within a session and costs you nothing but context-window budget.
  2. Summary memory: when the message list gets long, ask the model to summarize older turns into a compact note and replace them with it. This keeps you under the context limit on small models (8B models lose coherence long before they fill their nominal 128K context).
  3. Persistent / semantic memory: store facts in a vector database and retrieve the relevant ones each turn, so the agent "remembers" across sessions and restarts. Libraries like Mem0 specialize in this. We cover the full setup in local AI agent memory with Mem0.
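Tier 2 can be sketched as a small function that folds older turns into one note. This is an illustration under stated assumptions: messages are plain dicts, the first message is the system prompt, and `summarize` is any callable you supply (in a real agent, a call to the model; here, a stub). The cut point is moved back so that a tool result is never separated from the assistant message that requested it.

```python
from typing import Callable

def compact(messages: list, summarize: Callable[[list], str],
            max_messages: int = 12, keep_last: int = 4) -> list:
    """Replace older turns with a single summary note once the history grows."""
    if len(messages) <= max_messages:
        return messages
    system, rest = messages[:1], messages[1:]
    cut = max(len(rest) - keep_last, 0)
    # Keep tool results together with the assistant turn that requested them.
    while cut > 0 and rest[cut].get("role") == "tool":
        cut -= 1
    old, recent = rest[:cut], rest[cut:]
    if not old:
        return messages
    note = {"role": "system",
            "content": "Summary of earlier turns: " + summarize(old)}
    return system + [note] + recent

history = [{"role": "system", "content": "You are a local AI agent."}]
history += [{"role": "user", "content": f"message {i}"} for i in range(20)]

# Stub summarizer; a real one would ask the model for a short recap.
compacted = compact(history, summarize=lambda old: f"{len(old)} earlier messages")
print(len(history), "->", len(compacted))
```

Calling `compact` at the top of each loop iteration keeps the request size bounded without changing anything else in the agent.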

For your first build, tier 1 is enough. Reach for tiers 2–3 the moment your agent needs to run longer than a single context window or remember things between runs.
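To show what "remember things between runs" means mechanically, here is a deliberately simple stand-in for tier 3. It is not a vector database and not how Mem0 works: the hypothetical `FileMemory` class stores facts in a JSON file and ranks them by shared words with the query, which is enough to demonstrate the store-then-retrieve pattern.

```python
import json
import os
import re
import tempfile

class FileMemory:
    """Persist facts to a JSON file and retrieve them by naive word overlap."""

    def __init__(self, path: str):
        self.path = path
        self.facts = []
        if os.path.exists(path):                 # reload facts from a previous run
            with open(path, encoding="utf-8") as f:
                self.facts = json.load(f)

    def remember(self, fact: str) -> None:
        self.facts.append(fact)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.facts, f)

    def recall(self, query: str, k: int = 3) -> list:
        words = set(re.findall(r"\w+", query.lower()))
        scored = [(len(words & set(re.findall(r"\w+", fact.lower()))), fact)
                  for fact in self.facts]
        scored = [item for item in scored if item[0] > 0]   # drop unrelated facts
        scored.sort(key=lambda item: item[0], reverse=True)
        return [fact for _, fact in scored[:k]]

path = os.path.join(tempfile.mkdtemp(), "memory.json")
memory = FileMemory(path)
memory.remember("The user's GPU is an RTX 3090")
memory.remember("The user prefers Python")

reloaded = FileMemory(path)                      # simulates a restart
top = reloaded.recall("which GPU does the user have", k=1)
print(top)
```

In an agent, the recalled facts would be prepended to the system prompt each turn; a vector store replaces the word-overlap scoring with embedding similarity.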


Step 5: A complete runnable example

Here's a minimal but complete local agent in plain Python — no framework — using Ollama's OpenAI-compatible endpoint. It has two real tools (a calculator and a sandboxed file reader), a tool-execution loop, and conversation memory. It runs entirely offline once you've pulled the model.

Prereqs:

# 1. Install Ollama (macOS/Windows: download from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a tool-calling model
ollama pull qwen3:8b   # or llama3.1:8b, or mistral-small3.2

# 3. Install the OpenAI client (Ollama speaks the same API)
pip install openai

agent.py:

import json, os
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint. No real API key needed.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
MODEL = "qwen3:8b"

# --- Tool implementations (your code) ---
def calculate(expression: str) -> str:
    try:
        # NOTE: eval is unsafe in production — sandbox or use a real parser.
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"error: {e}"

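def read_file(path: str) -> str:
    # realpath resolves symlinks, so a link inside ./workspace can't escape it.
    safe_root = os.path.realpath("./workspace")
    full = os.path.realpath(os.path.join(safe_root, path))
    try:
        # commonpath (unlike startswith) also rejects siblings such as ./workspace_evil
        if os.path.commonpath([safe_root, full]) != safe_root:
            return "error: access denied"
        with open(full) as f:
            return f.read()[:4000]   # cap output to protect the context window
    except Exception as e:
        return f"error: {e}"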

TOOLS_IMPL = {"calculate": calculate, "read_file": read_file}

# --- Tool schemas (what the model sees) ---
tools = [
    {"type": "function", "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression.",
        "parameters": {"type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a text file from the ./workspace folder.",
        "parameters": {"type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]}}},
]

def run_agent(goal: str, max_steps: int = 6):
    # 'messages' IS the memory — we append to it every turn.
    messages = [
        {"role": "system", "content": "You are a local AI agent. "
         "Use tools when they help. Stop when the task is done."},
        {"role": "user", "content": goal},
    ]
    for step in range(max_steps):
        resp = client.chat.completions.create(
            model=MODEL, messages=messages, tools=tools)
        msg = resp.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:        # model gave a final answer
            return msg.content

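        for call in msg.tool_calls:   # run each requested tool
            fn = TOOLS_IMPL.get(call.function.name)
            try:
                if fn is None:
                    raise ValueError(f"unknown tool {call.function.name}")
                args = json.loads(call.function.arguments)
                result = fn(**args)
            except Exception as e:    # bad tool name or arguments: tell the model
                result = f"error: {e}"
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})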
    return "Stopped: hit max_steps without finishing."

if __name__ == "__main__":
    print(run_agent("What is 1487 * 23, and then read notes.txt "
                    "and summarize it in one sentence."))

Run it with python agent.py. Drop a notes.txt into a ./workspace folder first so the file tool has something to read. On a typical run the model (1) calls calculate, (2) calls read_file, (3) reads both observations back, then (4) produces a final answer. That's a real agent loop in ~60 lines of logic, with no cloud and no bill. Small models do not always follow this exact sequence, which is why max_steps is there.

What to change next: swap in a web_search tool, raise max_steps, and add tier-2 summary memory once conversations get long. The structure stays identical — you're only adding tools and tightening the stop condition.
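A web_search tool could look like the sketch below. It rests on assumptions you should verify for your setup: a self-hosted SearXNG instance at `http://localhost:8080` with JSON output enabled, queried as `/search?q=...&format=json`, returning a `results` list whose items carry `title`, `url`, and `content`. The names `web_search`, `format_results`, and `WEB_SEARCH_SCHEMA` are illustrative.

```python
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://localhost:8080"   # assumption: local SearXNG with JSON enabled

def format_results(payload: dict, limit: int = 3) -> str:
    """Turn a search response into a few compact lines for the model."""
    lines = []
    for item in payload.get("results", [])[:limit]:
        snippet = (item.get("content") or "")[:200]
        lines.append(f"{item.get('title', '')} | {item.get('url', '')} | {snippet}")
    return "\n".join(lines) or "no results"

def web_search(query: str) -> str:
    """Query the search instance; return snippets or an error string."""
    url = f"{SEARXNG_URL}/search?" + urllib.parse.urlencode({"q": query, "format": "json"})
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            return format_results(json.load(resp))
    except Exception as e:               # network down, bad JSON, HTTP error
        return f"error: {e}"

WEB_SEARCH_SCHEMA = {"type": "function", "function": {
    "name": "web_search",
    "description": "Search the web and return the top results.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]}}}

sample = {"results": [{"title": "Ollama", "url": "https://ollama.com",
                       "content": "Run models locally."}]}
print(format_results(sample))
```

To use it in agent.py, add `"web_search": web_search` to `TOOLS_IMPL` and append `WEB_SEARCH_SCHEMA` to `tools`; the loop itself does not change.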


The honest limits of small local models

This is the section most tutorials skip. If you build the agent above with an 8B model and then ask it to do something genuinely multi-step, you will see the rough edges:

  • Planning falls apart past ~3–6 steps. Small models lose the thread on long task chains, repeat tool calls, or declare victory early. In the LangGraph-vs-CrewAI benchmark cited above, the two frameworks completed only ~54–62% of complex multi-step tasks even with a 32B model, and that's with mature frameworks doing the orchestration.
  • Argument hallucination. A small model will sometimes invent a file path or malform JSON arguments. Always validate tool inputs in your code and return a clear error the model can recover from (the example does this).
  • No real sandbox by default. A model that can run shell or eval is a model that can wreck your machine if it hallucinates a destructive command. Whitelist commands, constrain file access to a working directory, and never give an agent unrestricted privileges.
  • Context coherence ≠ context length. "128K context" does not mean an 8B model stays coherent across 128K tokens of agent history. Use summary memory aggressively.
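The advice to validate tool inputs can be made concrete with a small checker that runs before any tool does. This is a sketch, not a full JSON Schema validator: the hypothetical `validate_args` only checks required keys, unknown keys, and top-level primitive types, and it returns a message the model can read and correct.

```python
from typing import Optional

JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(parameters: dict, args: object) -> Optional[str]:
    """Return an error message the model can act on, or None if args look valid."""
    if not isinstance(args, dict):
        return "error: arguments must be a JSON object"
    props = parameters.get("properties", {})
    missing = [k for k in parameters.get("required", []) if k not in args]
    if missing:
        return f"error: missing required argument(s): {', '.join(missing)}"
    unknown = [k for k in args if k not in props]
    if unknown:
        return f"error: unknown argument(s): {', '.join(unknown)}"
    for name, value in args.items():
        expected = JSON_TYPES.get(props[name].get("type"))
        if expected and not isinstance(value, expected):
            return f"error: argument '{name}' must be of type {props[name]['type']}"
    return None

params = {"type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]}

print(validate_args(params, {"path": "notes.txt"}))
print(validate_args(params, {"file": "notes.txt"}))
```

Feeding the returned string back as the tool result gives a small model a concrete hint, which tends to work better than a bare stack trace.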

The practical takeaway: for reliable agentic work today, either (a) use the largest model your hardware allows — Qwen3 30B-A3B or Llama 3.3 70B are meaningfully better planners than any 8B — or (b) keep tasks short and well-scoped and let a framework like LangGraph handle retries and checkpointing. Small local agents are excellent for focused, 1–3 step automations and privacy-sensitive tasks; they are not yet drop-in replacements for frontier cloud agents on open-ended work.
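One cheap guard against the repeated-tool-call failure mode mentioned above is to count identical calls and refuse them past a threshold. The `RepeatGuard` class below is a hypothetical helper for illustration; the threshold of two is an arbitrary assumption.

```python
import json
from collections import Counter

class RepeatGuard:
    """Flag a tool call once the same name and arguments recur too often."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def should_block(self, name: str, arguments: dict) -> bool:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same call.
        key = (name, json.dumps(arguments, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats

guard = RepeatGuard(max_repeats=2)
decisions = [guard.should_block("read_file", {"path": "notes.txt"}) for _ in range(3)]
print(decisions)
```

When `should_block` returns True, a reasonable response is to skip execution and send the model a tool result such as "error: you already made this call; use the earlier result", which often nudges it toward a final answer.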


Key Takeaways

  1. An agent is an LLM in a loop with tools and memory. Model proposes a tool call → your code runs it → result goes back → repeat until done.
  2. Model choice is the first gate. Use a tool-trained model: Qwen3 8B / 30B-A3B, Llama 3.1/3.3, or Mistral Small. Base models without function-calling training won't emit parseable tool calls.
  3. Ollama's OpenAI-compatible /v1 endpoint (http://localhost:11434/v1) lets every major framework talk to your local model by changing one base URL.
  4. Start with ~60 lines of plain Python to learn the loop; graduate to LangGraph, CrewAI, or Goose for multi-agent, checkpointing, or a ready-made coding agent.
  5. Wire tools via function calling for one-offs, or MCP for a reusable tool ecosystem. Sandbox anything that touches the shell or filesystem.
  6. Add memory in tiers: append-to-messages → summarize → vector store (Mem0) for persistence.
  7. Be honest about limits. Small local models break down past a few steps and can hallucinate tool arguments — scope tasks tightly and validate every input.

Next Steps

Pick a tool-calling model and run the example above. Once you have a working loop, layer in MCP, memory, and a framework as your tasks grow. Everything stays on your machine, and it all costs $0 in API fees.

Published: June 20, 2026 · Last updated: June 20, 2026 · Manually reviewed