For a fully local Hermes Agent in 2026, the best Ollama model is Hermes 4.3 36B (Nous Research's own tool-calling model, built on ByteDance's Seed-OSS-36B base, ~21.8 GB at Q4_K_M) if you have a 24 GB GPU — and Qwen3 14B (~9.3 GB) is the best pick for a single 12-16 GB card. Hermes Agent talks to Ollama through its OpenAI-compatible endpoint at http://localhost:11434/v1, so any tool-calling Ollama model works; the two things that actually make-or-break it are (1) the model must support function calling and (2) you must raise Ollama's context window to at least 64K, because the agent's memory and skills eat tokens fast.

Hermes Agent is the open-source (MIT-licensed) autonomous agent from Nous Research. Unlike an in-IDE copilot, it runs as a persistent daemon that accumulates memory across sessions, writes and refines its own reusable skills, runs scheduled cron tasks, and reaches you over 20+ messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, and more). This guide is about doing all of that on your own hardware, with Ollama as the inference backend — no API keys, no per-token cost.

What is Hermes Agent (and why run it on Ollama)?

Hermes Agent is Nous Research's answer to "an agent that gets more capable the longer it runs." The headline pieces, straight from the official docs:

Persistent memory. A closed learning loop with agent-curated memory, cross-session recall and LLM summarization — it remembers your projects and preferences without you repeating them.
Self-improving skills. The agent autonomously creates and refines reusable skills from experience.
60+ built-in tools plus Model Context Protocol (MCP) server support, so it can read files, run shell commands, browse, and call external services.
Built-in cron. Scheduled automations that deliver results to any connected platform.
20+ messaging connectors. One gateway to CLI, Telegram, Discord, Slack, WhatsApp, Signal, Matrix and others.

Running it on Ollama matters because a persistent agent that thinks all day, every day, is exactly the workload where per-token cloud bills get scary. Local inference makes the cost fixed (your electricity) and keeps memory of your projects on your own disk. The trade-off is that you need a model good enough at tool calling to drive the agent reliably — that is the whole game here, and it is why model choice matters more than for a plain chatbot. If you are new to agentic local setups, our guide to running AI agents locally covers the broader landscape first.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

What is the best Ollama model for Hermes Agent in 2026?

The agent only works if the model emits clean, parseable tool calls. Below is the ranking I'd actually use, weighted toward tool-calling reliability and how much VRAM each model needs at Q4_K_M. Parameter counts and base models are from each official model card; tool-calling support is confirmed on each model's Ollama page or card.

Rank	Model	Ollama tag	Params	Approx VRAM (Q4_K_M)	Why it's here
🥇 1	Hermes 4.3 36B	(GGUF, import)	36B dense	~22 GB	Nous Research's own model; native `<tool_call>` format, trained for agentic use
🥈 2	Qwen3 14B	`qwen3:14b`	14.8B dense	~9.3 GB	Most reliable tool calls per GB; fits a 12-16 GB card
🥉 3	Qwen3 8B	`qwen3:8b`	8.2B dense	~5.2 GB	Best on 8-12 GB; great speed/quality balance for ~5 tools
4	Llama-3-Groq-8B-Tool-Use	`llama3-groq-tool-use:8b`	8B dense	~5 GB	Purpose-built for function calling — 89.06% on BFCL
5	Gemma 4 (E4B)	`gemma4`	Effective 4B	~4-6 GB	Native function-calling + multimodal; lightest capable pick

A few honest notes on this table:

There is no official hermes model in the Ollama library the way there is for Qwen3. Hermes 4.3 36B ships as GGUF on Hugging Face, so you import it into Ollama with a Modelfile (shown below). Don't expect a one-line ollama pull hermes for the 36B.
Qwen3 is the pragmatic default. Across community testing it has the most stable tool calling — it rarely hallucinates calls or drops parameters — and the 8B/14B sizes fit normal GPUs. For most people on one consumer card, qwen3:14b is the right answer to the head question.
Llama-3-Groq-8B-Tool-Use is a specialist: Groq fine-tuned it purely for tool use and it hit 89.06% overall on the Berkeley Function Calling Leaderboard (the 70B sibling hit 90.76%). It is older (Llama 3 era) and not multimodal, but for nothing but reliable function calls on a small GPU it's excellent.

For a deeper look at how different local models behave specifically when calling tools, see our Ollama tool calling guide.

Model release + tool-calling facts (verified)

Because "latest/best" claims age fast, here are the dated, sourced facts behind the picks above:

Model	Base / lineage	Released	License	Tool calling
Hermes 4.3 36B	ByteDance Seed-OSS-36B base	Dec 2, 2025	Apache 2.0	Yes — native `<tool_call>`
Hermes 4 (14B / 70B / 405B)	Llama 3.1	Aug 26, 2025	open weights	Yes
Qwen3 (0.6B-235B)	Qwen3	2025	Apache 2.0	Yes (tools + thinking)
Llama-3-Groq-8B-Tool-Use	Llama 3 8B	2024	Llama-3 license	Yes (89.06% BFCL)
Gemma 4 (E2B/E4B/26B-MoE/31B)	Gemma 4	Apr 2, 2026	Apache 2.0	Yes (native function-calling)

Seed-OSS-36B, the base under Hermes 4.3, is itself Apache 2.0 and ships with a very large native context (512K), which is part of why the Hermes 4.3 build is comfortable as a long-running agent brain. Hermes 4.3 was post-trained on Nous Research's decentralized "Psyche" network rather than a single GPU cluster.

How to install Hermes Agent with Ollama

Install the agent with the official one-line script (it works on Linux, macOS, WSL2 and Android/Termux), pull a tool-calling model, and start Ollama with a large context window:

# 1. Install Hermes Agent (official installer)
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# 2. Pull a tool-calling model (Qwen3 14B is the balanced default)
ollama pull qwen3:14b

# 3. IMPORTANT: start Ollama with a large context window
#    Ollama defaults to a small context; the agent needs >= 64K
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

Then point Hermes at Ollama's OpenAI-compatible endpoint. Run the model picker and choose the custom-endpoint option:

hermes model
# Select: "Custom endpoint (self-hosted / vLLM / etc.)"
# API base URL: http://localhost:11434/v1
# API key: (leave blank for local Ollama)
# Model name: qwen3:14b

Or set it directly in Hermes' config file:

model:
  default: qwen3:14b
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 64000

The one setting people miss is the context window. Ollama uses a small default context (4096 tokens on most models) unless you override it, and an agent that carries memory plus skill definitions plus tool schemas will blow through that almost immediately, causing it to "forget" mid-task. Setting OLLAMA_CONTEXT_LENGTH=64000 before ollama serve (and matching context_length in the config) fixes the most common "my agent goes senile" complaint.

Importing Hermes 4.3 36B into Ollama

If you have the VRAM and want Nous Research's own brain driving the agent, grab the GGUF from the Hermes-4.3-36B-GGUF model card and import it. Quant sizes from that card: Q4_K_M is ~21.8 GB, Q5_K_M ~25.6 GB, Q8_0 ~38.4 GB.

# After downloading Hermes-4.3-36B.Q4_K_M.gguf
printf 'FROM ./Hermes-4.3-36B.Q4_K_M.gguf\nPARAMETER num_ctx 64000\n' > Modelfile
ollama create hermes-4.3-36b -f Modelfile
ollama run hermes-4.3-36b "hello"

Then set default: hermes-4.3-36b in the Hermes config.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

How much VRAM does each tier need?

Match the model to your GPU. These are practical totals at Q4_K_M with a roomy context for agent use — bigger than weights-only, because the agent runs long contexts.

Your GPU / VRAM	Recommended model	Notes
8 GB	Qwen3 8B or Llama-3-Groq-8B-Tool-Use	Keep tool count modest (~5); raise context to 32-64K
12 GB	Qwen3 14B (Q4)	The sweet spot — reliable calls, fits with context
16 GB	Qwen3 14B (Q5/Q8) or Gemma 4	Headroom for a longer agent context window
24 GB	Hermes 4.3 36B (Q4_K_M, ~22 GB)	Nous Research's own model end-to-end
48 GB+	Hermes 4.3 36B (Q8) or Hermes 4 70B	Maximum quality; for heavy multi-tool workflows

If your card is below 8 GB, Hermes Agent will technically run a 4B model like Gemma 4 E4B, but tool-calling reliability drops as you shrink — you'll spend more time babysitting failed calls than you save. To size an exact quant against your specific card, plug the numbers into our VRAM calculator.

Testing tool calls (the only test that matters)

The whole agent rides on the model emitting valid tool calls, so verify that before trusting it with cron jobs or messaging. The quickest check is to confirm the model advertises tools in Ollama:

ollama show qwen3:14b
# Look for "tools" in the capabilities line

Then give Hermes a task that forces a tool call — something it cannot answer from parametric memory, like reading a local file or checking the time:

hermes run "What files are in my current directory? Use a tool, don't guess."

If the model is wired correctly you'll see it invoke a shell/file tool and report real results. If instead you see the tool call printed as plain text (e.g. raw <tool_call> JSON in the reply) rather than executed, the model is "describing" calls instead of making them — that's the #1 symptom of a misconfigured backend, covered next.

First-hand: speed on a 24 GB card

On my own RTX 3090 (24 GB), running Hermes Agent against qwen3:14b at Q4_K_M with a 64K context, I measured roughly 35-45 tokens/sec during generation with the whole model on the GPU — fast enough that the agent feels responsive over Telegram. Hermes 4.3 36B at Q4_K_M (~22 GB) fit, but with a long agent context it sat close to the VRAM ceiling and ran noticeably slower (high-20s tok/s) because there was little headroom for KV cache. These are ballpark figures from a single machine, not a controlled benchmark — treat them as approximate and expect the moment any layer spills to system RAM, throughput to fall off a cliff. The practical lesson: on 24 GB, Qwen3 14B leaves room for a big context and feels better day-to-day than squeezing in the 36B.

Troubleshooting

Tool calls show up as text, never execute. The backend isn't parsing them. On Ollama the tool parser is on by default for tool-capable models — confirm with ollama show <model> that it lists tools. If you switched to vLLM or llama.cpp instead, you must pass the tool-parser flags (vLLM: --enable-auto-tool-choice --tool-call-parser hermes; llama.cpp: --jinja).
Agent forgets context / loses the thread. Your context window is too small. Restart Ollama with OLLAMA_CONTEXT_LENGTH=64000 ollama serve and set context_length: 64000 in the Hermes config. Note a Modelfile PARAMETER num_ctx overrides the env var, so set it in the Modelfile for imported models.
Connection refused on :11434. Ollama isn't running, or you used the wrong path. The endpoint must be the OpenAI-compatible one: http://localhost:11434/v1 (note the /v1), not the bare port.
Weak / dumb tool decisions on a small model. An 8B model with 8-10 tools will sometimes skip a needed call. Step up to qwen3:14b, or reduce the number of tools/skills you expose for that task.
36B won't load. Hermes 4.3 36B at Q4_K_M needs ~22 GB — it will not fit a 16 GB card with any real context. Drop to Qwen3 14B, which fits comfortably and still calls tools reliably.

Key Takeaways

Best overall for a 24 GB card: Hermes 4.3 36B (Nous Research's own model, Seed-OSS-36B base, ~22 GB at Q4_K_M, native tool-call format). Best for one 12-16 GB GPU: Qwen3 14B (~9.3 GB).
Hermes Agent connects to Ollama via the OpenAI-compatible endpoint at http://localhost:11434/v1 with the provider set to custom and a blank API key.
Raise the context window. Ollama's small default context starves the agent; set OLLAMA_CONTEXT_LENGTH=64000 and match it in the config — this fixes most "agent forgets" problems.
Tool calling is non-negotiable. Pick a tool-capable model (Qwen3, Hermes, Llama-3-Groq-Tool-Use, Gemma 4) and verify with a forced tool-call test before trusting cron or messaging.
It's MIT-licensed and free to self-host — the recurring cost is electricity, not tokens, which is the real reason to run a persistent agent locally.

Next Steps

New to local agents entirely? Start with our running AI agents locally primer, then come back to wire up Hermes.
Want to build your own from parts instead of using Hermes? See build a local AI agent for the DIY path.
Comparing agent frameworks before you commit? Read AI agent frameworks compared.
Debugging tool calls specifically? Our Ollama tool calling guide walks through the parser flags and common failure modes.
Curious about the older Nous flagship for agents? See our breakdown of Nous Hermes 2 Mixtral 8x7B.

Run Hermes Agent Locally with Ollama (2026 Setup Guide)

Want to go deeper than this article?

What is Hermes Agent (and why run it on Ollama)?

Reading articles is good. Building is better.

What is the best Ollama model for Hermes Agent in 2026?

Model release + tool-calling facts (verified)

How to install Hermes Agent with Ollama

Importing Hermes 4.3 36B into Ollama

Reading articles is good. Building is better.

How much VRAM does each tier need?

Testing tool calls (the only test that matters)

First-hand: speed on a 24 GB card

Troubleshooting

Key Takeaways

Next Steps

Ollama’s running. Here’s what to build with it.

Liked this? 20 full AI courses are waiting.

Local AI Master Research Team

Build Real AI on Your Machine

Want structured AI education?

Continue Your Local AI Journey

How to Install Your First Local AI Model

How to Choose the Right AI Model for Your Computer

Comments (0)

Ready to Go Beyond Tutorials?

Go from reading about AI to building with AI

Related Guides

Running AI Agents Locally

Ollama Tool Calling Guide

Build a Local AI Agent

Written by the Local AI Master Team

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Ollama’s running. Here’s what to build with it.