Run Hermes Agent Locally with Ollama (2026 Setup Guide)
Want to go deeper than this article?
Free account unlocks the first chapter of all 20 courses — RAG, agents, MCP, voice AI, MLOps, real GitHub repos.
Ollama’s running. Here’s what to build with it. Go from “ollama run” to RAG apps, agents, and fine-tuned models — structured and hands-on. First chapter free.
For a fully local Hermes Agent in 2026, the best Ollama model is Hermes 4.3 36B (Nous Research's own tool-calling model, built on ByteDance's Seed-OSS-36B base, ~21.8 GB at Q4_K_M) if you have a 24 GB GPU — and Qwen3 14B (~9.3 GB) is the best pick for a single 12-16 GB card. Hermes Agent talks to Ollama through its OpenAI-compatible endpoint at http://localhost:11434/v1, so any tool-calling Ollama model works; the two things that actually make-or-break it are (1) the model must support function calling and (2) you must raise Ollama's context window to at least 64K, because the agent's memory and skills eat tokens fast.
Hermes Agent is the open-source (MIT-licensed) autonomous agent from Nous Research. Unlike an in-IDE copilot, it runs as a persistent daemon that accumulates memory across sessions, writes and refines its own reusable skills, runs scheduled cron tasks, and reaches you over 20+ messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, and more). This guide is about doing all of that on your own hardware, with Ollama as the inference backend — no API keys, no per-token cost.
What is Hermes Agent (and why run it on Ollama)?
Hermes Agent is Nous Research's answer to "an agent that gets more capable the longer it runs." The headline pieces, straight from the official docs:
- Persistent memory. A closed learning loop with agent-curated memory, cross-session recall and LLM summarization — it remembers your projects and preferences without you repeating them.
- Self-improving skills. The agent autonomously creates and refines reusable skills from experience.
- 60+ built-in tools plus Model Context Protocol (MCP) server support, so it can read files, run shell commands, browse, and call external services.
- Built-in cron. Scheduled automations that deliver results to any connected platform.
- 20+ messaging connectors. One gateway to CLI, Telegram, Discord, Slack, WhatsApp, Signal, Matrix and others.
Running it on Ollama matters because a persistent agent that thinks all day, every day, is exactly the workload where per-token cloud bills get scary. Local inference makes the cost fixed (your electricity) and keeps memory of your projects on your own disk. The trade-off is that you need a model good enough at tool calling to drive the agent reliably — that is the whole game here, and it is why model choice matters more than for a plain chatbot. If you are new to agentic local setups, our guide to running AI agents locally covers the broader landscape first.
Reading articles is good. Building is better.
Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
What is the best Ollama model for Hermes Agent in 2026?
The agent only works if the model emits clean, parseable tool calls. Below is the ranking I'd actually use, weighted toward tool-calling reliability and how much VRAM each model needs at Q4_K_M. Parameter counts and base models are from each official model card; tool-calling support is confirmed on each model's Ollama page or card.
| Rank | Model | Ollama tag | Params | Approx VRAM (Q4_K_M) | Why it's here |
|---|---|---|---|---|---|
| 🥇 1 | Hermes 4.3 36B | (GGUF, import) | 36B dense | ~22 GB | Nous Research's own model; native <tool_call> format, trained for agentic use |
| 🥈 2 | Qwen3 14B | qwen3:14b | 14.8B dense | ~9.3 GB | Most reliable tool calls per GB; fits a 12-16 GB card |
| 🥉 3 | Qwen3 8B | qwen3:8b | 8.2B dense | ~5.2 GB | Best on 8-12 GB; great speed/quality balance for ~5 tools |
| 4 | Llama-3-Groq-8B-Tool-Use | llama3-groq-tool-use:8b | 8B dense | ~5 GB | Purpose-built for function calling — 89.06% on BFCL |
| 5 | Gemma 4 (E4B) | gemma4 | Effective 4B | ~4-6 GB | Native function-calling + multimodal; lightest capable pick |
A few honest notes on this table:
- There is no official
hermesmodel in the Ollama library the way there is for Qwen3. Hermes 4.3 36B ships as GGUF on Hugging Face, so you import it into Ollama with a Modelfile (shown below). Don't expect a one-lineollama pull hermesfor the 36B. - Qwen3 is the pragmatic default. Across community testing it has the most stable tool calling — it rarely hallucinates calls or drops parameters — and the 8B/14B sizes fit normal GPUs. For most people on one consumer card,
qwen3:14bis the right answer to the head question. - Llama-3-Groq-8B-Tool-Use is a specialist: Groq fine-tuned it purely for tool use and it hit 89.06% overall on the Berkeley Function Calling Leaderboard (the 70B sibling hit 90.76%). It is older (Llama 3 era) and not multimodal, but for nothing but reliable function calls on a small GPU it's excellent.
For a deeper look at how different local models behave specifically when calling tools, see our Ollama tool calling guide.
Model release + tool-calling facts (verified)
Because "latest/best" claims age fast, here are the dated, sourced facts behind the picks above:
| Model | Base / lineage | Released | License | Tool calling |
|---|---|---|---|---|
| Hermes 4.3 36B | ByteDance Seed-OSS-36B base | Dec 2, 2025 | Apache 2.0 | Yes — native <tool_call> |
| Hermes 4 (14B / 70B / 405B) | Llama 3.1 | Aug 26, 2025 | open weights | Yes |
| Qwen3 (0.6B-235B) | Qwen3 | 2025 | Apache 2.0 | Yes (tools + thinking) |
| Llama-3-Groq-8B-Tool-Use | Llama 3 8B | 2024 | Llama-3 license | Yes (89.06% BFCL) |
| Gemma 4 (E2B/E4B/26B-MoE/31B) | Gemma 4 | Apr 2, 2026 | Apache 2.0 | Yes (native function-calling) |
Seed-OSS-36B, the base under Hermes 4.3, is itself Apache 2.0 and ships with a very large native context (512K), which is part of why the Hermes 4.3 build is comfortable as a long-running agent brain. Hermes 4.3 was post-trained on Nous Research's decentralized "Psyche" network rather than a single GPU cluster.
How to install Hermes Agent with Ollama
Install the agent with the official one-line script (it works on Linux, macOS, WSL2 and Android/Termux), pull a tool-calling model, and start Ollama with a large context window:
# 1. Install Hermes Agent (official installer)
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
# 2. Pull a tool-calling model (Qwen3 14B is the balanced default)
ollama pull qwen3:14b
# 3. IMPORTANT: start Ollama with a large context window
# Ollama defaults to a small context; the agent needs >= 64K
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
Then point Hermes at Ollama's OpenAI-compatible endpoint. Run the model picker and choose the custom-endpoint option:
hermes model
# Select: "Custom endpoint (self-hosted / vLLM / etc.)"
# API base URL: http://localhost:11434/v1
# API key: (leave blank for local Ollama)
# Model name: qwen3:14b
Or set it directly in Hermes' config file:
model:
default: qwen3:14b
provider: custom
base_url: http://localhost:11434/v1
context_length: 64000
The one setting people miss is the context window. Ollama uses a small default context (4096 tokens on most models) unless you override it, and an agent that carries memory plus skill definitions plus tool schemas will blow through that almost immediately, causing it to "forget" mid-task. Setting OLLAMA_CONTEXT_LENGTH=64000 before ollama serve (and matching context_length in the config) fixes the most common "my agent goes senile" complaint.
Importing Hermes 4.3 36B into Ollama
If you have the VRAM and want Nous Research's own brain driving the agent, grab the GGUF from the Hermes-4.3-36B-GGUF model card and import it. Quant sizes from that card: Q4_K_M is ~21.8 GB, Q5_K_M ~25.6 GB, Q8_0 ~38.4 GB.
# After downloading Hermes-4.3-36B.Q4_K_M.gguf
printf 'FROM ./Hermes-4.3-36B.Q4_K_M.gguf\nPARAMETER num_ctx 64000\n' > Modelfile
ollama create hermes-4.3-36b -f Modelfile
ollama run hermes-4.3-36b "hello"
Then set default: hermes-4.3-36b in the Hermes config.
Reading articles is good. Building is better.
Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
How much VRAM does each tier need?
Match the model to your GPU. These are practical totals at Q4_K_M with a roomy context for agent use — bigger than weights-only, because the agent runs long contexts.
| Your GPU / VRAM | Recommended model | Notes |
|---|---|---|
| 8 GB | Qwen3 8B or Llama-3-Groq-8B-Tool-Use | Keep tool count modest (~5); raise context to 32-64K |
| 12 GB | Qwen3 14B (Q4) | The sweet spot — reliable calls, fits with context |
| 16 GB | Qwen3 14B (Q5/Q8) or Gemma 4 | Headroom for a longer agent context window |
| 24 GB | Hermes 4.3 36B (Q4_K_M, ~22 GB) | Nous Research's own model end-to-end |
| 48 GB+ | Hermes 4.3 36B (Q8) or Hermes 4 70B | Maximum quality; for heavy multi-tool workflows |
If your card is below 8 GB, Hermes Agent will technically run a 4B model like Gemma 4 E4B, but tool-calling reliability drops as you shrink — you'll spend more time babysitting failed calls than you save. To size an exact quant against your specific card, plug the numbers into our VRAM calculator.
Testing tool calls (the only test that matters)
The whole agent rides on the model emitting valid tool calls, so verify that before trusting it with cron jobs or messaging. The quickest check is to confirm the model advertises tools in Ollama:
ollama show qwen3:14b
# Look for "tools" in the capabilities line
Then give Hermes a task that forces a tool call — something it cannot answer from parametric memory, like reading a local file or checking the time:
hermes run "What files are in my current directory? Use a tool, don't guess."
If the model is wired correctly you'll see it invoke a shell/file tool and report real results. If instead you see the tool call printed as plain text (e.g. raw <tool_call> JSON in the reply) rather than executed, the model is "describing" calls instead of making them — that's the #1 symptom of a misconfigured backend, covered next.
First-hand: speed on a 24 GB card
On my own RTX 3090 (24 GB), running Hermes Agent against qwen3:14b at Q4_K_M with a 64K context, I measured roughly 35-45 tokens/sec during generation with the whole model on the GPU — fast enough that the agent feels responsive over Telegram. Hermes 4.3 36B at Q4_K_M (~22 GB) fit, but with a long agent context it sat close to the VRAM ceiling and ran noticeably slower (high-20s tok/s) because there was little headroom for KV cache. These are ballpark figures from a single machine, not a controlled benchmark — treat them as approximate and expect the moment any layer spills to system RAM, throughput to fall off a cliff. The practical lesson: on 24 GB, Qwen3 14B leaves room for a big context and feels better day-to-day than squeezing in the 36B.
Troubleshooting
- Tool calls show up as text, never execute. The backend isn't parsing them. On Ollama the tool parser is on by default for tool-capable models — confirm with
ollama show <model>that it lists tools. If you switched to vLLM or llama.cpp instead, you must pass the tool-parser flags (vLLM:--enable-auto-tool-choice --tool-call-parser hermes; llama.cpp:--jinja). - Agent forgets context / loses the thread. Your context window is too small. Restart Ollama with
OLLAMA_CONTEXT_LENGTH=64000 ollama serveand setcontext_length: 64000in the Hermes config. Note a ModelfilePARAMETER num_ctxoverrides the env var, so set it in the Modelfile for imported models. - Connection refused on :11434. Ollama isn't running, or you used the wrong path. The endpoint must be the OpenAI-compatible one:
http://localhost:11434/v1(note the/v1), not the bare port. - Weak / dumb tool decisions on a small model. An 8B model with 8-10 tools will sometimes skip a needed call. Step up to
qwen3:14b, or reduce the number of tools/skills you expose for that task. - 36B won't load. Hermes 4.3 36B at Q4_K_M needs ~22 GB — it will not fit a 16 GB card with any real context. Drop to Qwen3 14B, which fits comfortably and still calls tools reliably.
Key Takeaways
- Best overall for a 24 GB card: Hermes 4.3 36B (Nous Research's own model, Seed-OSS-36B base, ~22 GB at Q4_K_M, native tool-call format). Best for one 12-16 GB GPU: Qwen3 14B (~9.3 GB).
- Hermes Agent connects to Ollama via the OpenAI-compatible endpoint at
http://localhost:11434/v1with the provider set to custom and a blank API key. - Raise the context window. Ollama's small default context starves the agent; set
OLLAMA_CONTEXT_LENGTH=64000and match it in the config — this fixes most "agent forgets" problems. - Tool calling is non-negotiable. Pick a tool-capable model (Qwen3, Hermes, Llama-3-Groq-Tool-Use, Gemma 4) and verify with a forced tool-call test before trusting cron or messaging.
- It's MIT-licensed and free to self-host — the recurring cost is electricity, not tokens, which is the real reason to run a persistent agent locally.
Next Steps
- New to local agents entirely? Start with our running AI agents locally primer, then come back to wire up Hermes.
- Want to build your own from parts instead of using Hermes? See build a local AI agent for the DIY path.
- Comparing agent frameworks before you commit? Read AI agent frameworks compared.
- Debugging tool calls specifically? Our Ollama tool calling guide walks through the parser flags and common failure modes.
- Curious about the older Nous flagship for agents? See our breakdown of Nous Hermes 2 Mixtral 8x7B.
Ollama’s running. Here’s what to build with it.
Go from “ollama run” to RAG apps, agents, and fine-tuned models — structured and hands-on. First chapter free.
Liked this? 20 full AI courses are waiting.
From fundamentals to RAG, agents, MCP servers, voice AI, and production deployment with real GitHub repos. First chapter free, every course.
Build Real AI on Your Machine
RAG, agents, NLP, vision, and MLOps - chapters across 20 courses that take you from reading about AI to building AI.
Want structured AI education?
20 courses, 495+ chapters, from $9. Understand AI, don't just use it.
Continue Your Local AI Journey
- PILLARBest Ollama Models 2026: 15 Ranked (Coding, Reasoning, Chat)
- 15 Best Free AI Models to Run Locally with Ollama (2026) — No API Key
- Best Local LLMs for Tool & Function Calling (2026 Tested)
- Best Ollama Models for 8GB RAM 2026: 12 Tested Local Picks
- Best Ollama Models for AI Agents 2026: 9 Tested & Ranked
- Build a Local AI Slack & Discord Bot with Ollama (Full Tutorial)
- Build a Local RAG Pipeline: Ollama + ChromaDB Step-by-Step
- Build a Telegram Bot with Local AI (Ollama + Python Tutorial)
- CodeLlama Instruct 7B: Ollama Setup, HumanEval (2026)
- Complete Ollama Guide 2026: Install, Run & Manage 500+ Local AI Models
Comments (0)
No comments yet. Be the first to share your thoughts!