Best Local LLMs for Tool & Function Calling (2026 Tested)
The best local model for tool and function calling in 2026 is Llama-3-Groq-70B-Tool-Use, which scored 90.76% overall on the Berkeley Function Calling Leaderboard (BFCL) — the highest of any open model when Groq published it — with the matching Llama-3-Groq-8B-Tool-Use at 89.06% for people on a single GPU. For a current general-purpose pick, Qwen3 has native tool calling built into its chat template and runs cleanly through Ollama's tools API, while Cohere's Command-R (35B) remains the specialist for multi-step tool use plus RAG. Below, every model is checked against a reproducible Ollama tools-array test so you can see the failure modes, not just the marketing.
If you are wiring a local model into an agent, the metric that actually matters is not raw chat quality — it is whether the model emits valid, parseable JSON that matches your tool schema, every time, without narrating around it. A model that writes beautiful prose but wraps its function call in a markdown fence or invents an argument name will break your agent loop. This guide ranks models on that reliability.
What makes a model good at tool calling (and how we tested it)
Tool calling (also called function calling) is when you hand the model a list of available functions — each with a name, description, and a JSON schema for its arguments — and the model decides which to call and fills in the arguments. Ollama exposes this through the tools array on its /api/chat endpoint, and the model is expected to return a structured tool_calls object rather than free text.
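To make that shape concrete, here is a minimal sketch of what an agent might do with a tool_calls list once it comes back. The message literal is hand-written and only assumed to mirror the structure described above (entries with a function name and an arguments object); the get_weather handler and the REGISTRY mapping are hypothetical stubs for illustration, not part of Ollama.

```python
import json

def get_weather(args):
    # Hypothetical stub: returns canned data instead of calling a real service.
    return {"city": args["city"], "temp_c": 21}

# Maps tool names (as declared in the tools array) to local Python handlers.
REGISTRY = {"get_weather": get_weather}

def dispatch(message):
    """Run every tool call in an assistant message; return tool-role messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = REGISTRY.get(fn["name"])
        if handler is None:
            payload = {"error": f"unknown tool: {fn['name']}"}
        else:
            payload = handler(fn["arguments"])
        results.append({"role": "tool", "content": json.dumps(payload)})
    return results

# Hand-written example, assumed to mirror the response shape described above.
message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"function": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}
    ],
}
tool_messages = dispatch(message)
print(tool_messages)
```

The tool-role messages would then be appended to the conversation and sent back to the model so it can write the final answer.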
Four things separate a good tool-calling model from a frustrating one:
- Valid-JSON rate — does it return arguments that actually parse, with no trailing commas, no markdown fences, no commentary mixed in?
- Argument accuracy — does it map your prompt to the right tool and fill the right fields (correct types, no hallucinated params)?
- Parallel tool calls — when a request needs two lookups at once, can it emit multiple calls in a single turn instead of one-at-a-time?
- Knowing when NOT to call — does it answer directly when no tool is needed, instead of forcing an irrelevant call?
To test reproducibly, define two simple tools and pass them to Ollama's tools API. Here is the minimal harness:
```python
import ollama

# Two simple tools, each described with a JSON schema for its arguments.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "convert_currency",
            "description": "Convert an amount between two currencies",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "from": {"type": "string"},
                    "to": {"type": "string"}
                },
                "required": ["amount", "from", "to"]
            }
        }
    }
]

# One compound prompt that should trigger both tools in a single turn.
resp = ollama.chat(
    model="llama3-groq-tool-use:8b",
    messages=[{"role": "user",
               "content": "Weather in Tokyo in celsius, and convert 50 USD to JPY"}],
    tools=tools,
)

# tool_calls may be absent when the model answers in plain text,
# so read it with .get() rather than indexing directly.
print(resp["message"].get("tool_calls"))
```
A strong model returns two tool_calls from that one prompt — get_weather(city="Tokyo", unit="celsius") and convert_currency(amount=50, from="USD", to="JPY") — with clean JSON and no extra text. A weak model returns one call, forgets the unit, or wraps the JSON in a code fence that breaks the parser. For the full setup walkthrough, see our Ollama tool calling guide and the deeper function calling with tools tutorial.
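One way to turn the strong-versus-weak description above into a repeatable check is to compare the returned calls with the calls you expected. The sketch below is an assumption about how such a scorer could look, not the harness used for this article: score_calls and its result keys are made-up names, and it assumes the tool calls have been converted to plain dicts.

```python
def score_calls(tool_calls, expected):
    """Compare returned tool calls with the expected ones.

    expected maps tool name -> argument values that must be present.
    Assumes each call is a plain dict with function.name and function.arguments.
    """
    calls = tool_calls or []
    got = {c["function"]["name"]: c["function"]["arguments"] for c in calls}
    missing = [name for name in expected if name not in got]
    wrong_args = [
        name for name, want in expected.items()
        if name in got and any(got[name].get(k) != v for k, v in want.items())
    ]
    return {
        "all_tools_called": not missing,
        "args_correct": not missing and not wrong_args,
        "parallel": len(calls) >= len(expected),
        "missing": missing,
        "wrong_args": wrong_args,
    }

expected = {
    "get_weather": {"city": "Tokyo", "unit": "celsius"},
    "convert_currency": {"amount": 50, "from": "USD", "to": "JPY"},
}

# Hand-written responses illustrating the two outcomes described above.
strong = [
    {"function": {"name": "get_weather",
                  "arguments": {"city": "Tokyo", "unit": "celsius"}}},
    {"function": {"name": "convert_currency",
                  "arguments": {"amount": 50, "from": "USD", "to": "JPY"}}},
]
weak = [{"function": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}]

strong_score = score_calls(strong, expected)
weak_score = score_calls(weak, expected)
print(strong_score)
print(weak_score)
```

Running the same prompt many times and averaging these flags gives a rough per-model reliability figure for your own toolset.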
What is the best Ollama model for tool calling in 2026? (Ranked)
The ranking below blends published Berkeley Function Calling Leaderboard (BFCL) results — the standard academic benchmark for function calling — with how each model behaves in Ollama's native tools API. BFCL scores are quoted from each model's own published source where available; where a model has no single published BFCL figure, that cell says so rather than inventing one.
| Rank | Model | Size | Tool support | Published BFCL | Best for |
|---|---|---|---|---|---|
| 🥇 1 | Llama-3-Groq-70B-Tool-Use | 70B dense | Native (fine-tuned) | 90.76% (Groq, BFCL #1 at launch) | Highest accuracy, big GPU/server |
| 🥈 2 | Qwen3 (e.g. 8B / 14B / 32B) | dense + MoE family | Native tool template | ~75.7% (32B, BFCL v3, third-party) | Best modern general-purpose pick |
| 🥉 3 | Llama-3-Groq-8B-Tool-Use | 8B dense | Native (fine-tuned) | 89.06% (Groq, BFCL #3 at launch) | Best single-GPU / small tool model |
| 4 | Command-R | 35B dense | Native (single + multi-step) | No single published BFCL figure | Multi-step tool use + RAG |
| 5 | Hermes 4 (14B / 70B) | dense | Native <tool_call> JSON | No single published BFCL figure | Open agent stacks, low refusals |
| 6 | Mistral (7B v0.3 / Small) | 7B+ | Native function calling | No single published BFCL figure | Lightweight, resource-constrained |
| 7 | Firefunction-v2 | Llama-3-70B base | Native (parallel calls) | On-par-with-GPT-4o (vendor claim) | Hosted-first; heavier to self-host |
A few honest notes on this table. The two Llama-3-Groq-Tool-Use models still hold the cleanest published BFCL numbers of any easily self-hosted option — 90.76% and 89.06% — because Groq fine-tuned them specifically for this one job with Glaive. They are based on Llama 3 (mid-2024), so they are not the newest brains, but for pure schema-following they are hard to beat. Qwen3 is the model most people should reach for today: tool calling is baked into its chat template, it runs through Ollama's tools API without hacks, and it brings far stronger reasoning than the 2024-era Groq finetunes. Command-R earns its spot because Cohere trained it for genuine multi-step tool use and grounded RAG, with publicly available 35B weights and a 128K context.
Why Llama-3-Groq-Tool-Use still tops the published benchmark
When Groq released these two models (in collaboration with Glaive), the 70B variant took the #1 spot on the Berkeley Function Calling Leaderboard at 90.76% overall accuracy, and the 8B variant landed #3 at 89.06% — beating, at the time, several proprietary models on that specific benchmark. They got there by doing one narrow thing extremely well: full fine-tuning plus Direct Preference Optimization (DPO) aimed purely at emitting correct tool calls.
You can run the 8B locally through Ollama:
```shell
ollama pull llama3-groq-tool-use:8b
ollama run llama3-groq-tool-use:8b
```
The trade-off is age. These are Llama-3 finetunes from mid-2024, so their general reasoning and world knowledge trail newer models. If your agent only needs to translate a request into the right function call against a fixed toolset, that does not matter and the Groq 8B is an excellent, lean choice. If your agent also has to reason hard between steps, a newer model like Qwen3 will feel smarter even if its raw BFCL number is lower. You can read Groq's own write-up of the scores on the official Groq Tool Use announcement, and we cover the 8B in depth on our Llama 3 Groq 8B model page.
Best small model for tool calling (single GPU / 8GB-16GB)
Not everyone has a 70B-capable box. If you are on a single consumer GPU, three small models stand out:
- Llama-3-Groq-8B-Tool-Use (8B): the highest published BFCL score in the small class at 89.06%. It needs roughly 6-8 GB of VRAM at a 4-bit quant and is purpose-built for clean tool calls. This is the safe default if tool reliability is the only thing you care about.
- Qwen3 8B: a better all-rounder. Native tool template, strong reasoning, and a long context, so it both calls tools correctly and reasons well between calls. Pick this if your agent does more than dumb function dispatch. See our build-a-local-AI-agent walkthrough for how it slots into an agent loop.
- Mistral 7B (v0.3 or later): native function calling and the lightest footprint here, ideal for resource-constrained or edge setups. It is less specialized than the Groq finetune, so expect to validate its JSON a bit more carefully.
There are also community finetunes built solely for this — Watt-Tool-8B and ToolACE-8B, both based on Llama-3.1-8B-Instruct — that report state-of-the-art small-model results on BFCL. They are worth trying if the mainstream 8B options leave gaps, though they live on Hugging Face rather than the core Ollama library, so you may need to import a GGUF. For sizing any of these against your card, run the numbers through our VRAM calculator first.
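For a quick sanity check before reaching for a calculator, the usual back-of-envelope estimate is weight size plus some runtime overhead. The sketch below is only that: the 4.8 bits per weight is an approximate figure for a Q4_K_M quant, and the flat 1.5 GB overhead for KV cache and runtime buffers is an assumption that grows with context length in practice.

```python
def estimate_vram_gb(params_billions, bits_per_weight=4.8, overhead_gb=1.5):
    """Rough VRAM estimate: quantized weights plus a flat overhead guess.

    bits_per_weight=4.8 approximates a Q4_K_M quant (assumption).
    overhead_gb=1.5 is a placeholder for KV cache and buffers (assumption).
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, size in [("8B", 8), ("35B", 35), ("70B", 70)]:
    print(name, round(estimate_vram_gb(size), 1), "GB")
```

Under these assumptions an 8B model lands a little above 6 GB, a 35B model just under 24 GB, and a 70B model above 40 GB, which lines up with the hardware guidance in this article.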
A note on Qwen2.5, Gemma, and "Gemma 4"
Two quick clarifications, because the naming trips people up:
- Qwen2.5-7B still works fine for tool calling and remains a solid, lighter alternative if you have not moved to Qwen3 — its instruct models support the same tools-array flow. Details are on our Qwen2.5 7B model page.
- There is no "Gemma 4." As of mid-2026 the current Google open model is Gemma 3 (the 27B is the largest). Gemma 3 can do function calling, but it has no dedicated tool-call tokens — you steer it into structured JSON purely through prompting, which makes it less reliable for agents than the natively trained models above. Google also shipped a tiny specialized variant, FunctionGemma (based on Gemma 3 270M), aimed only at function calling. If you want plug-and-play tool calling, prefer Qwen3 or the Groq Tool-Use models over base Gemma 3.
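With a prompt-steered model such as base Gemma 3, the call arrives as free text, so you have to pull the JSON out yourself. The sketch below shows one defensive way to do that. The {"name": ..., "arguments": ...} output format is an assumption about what your prompt asked for, and extract_tool_call is a hypothetical helper, not an Ollama or Gemma API.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_tool_call(text):
    """Return a parsed {"name": ..., "arguments": ...} dict from free text, or None."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    # Take the outermost braces so surrounding commentary is ignored.
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "name" not in obj:
        return None
    return obj

fenced_reply = (
    "Sure, here is the call:\n" + FENCE + "json\n"
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n' + FENCE
)
parsed = extract_tool_call(fenced_reply)
print(parsed)
print(extract_tool_call("It is probably sunny."))
```

Returning None instead of raising lets the agent loop retry or fall back to a plain answer when the model produced no usable call.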
How fast are tool-calling models in practice?
Throughput depends almost entirely on whether the model fits fully in VRAM. On my own RTX 3090 (24GB), the Llama-3-Groq-8B-Tool-Use Q4_K_M quant ran at roughly 55-65 tokens/sec, and Qwen3 8B landed around 45-55 tokens/sec, both fully GPU-offloaded — ballpark figures from a single machine, not a controlled benchmark. The good news for agents: tool calls are usually short outputs (a few dozen JSON tokens), so even a model in the 40-60 tok/s range feels instant per call. The 35B Command-R is noticeably heavier and wants a 24GB card at a 4-bit quant; the 70B Groq model realistically needs two GPUs or a server. The moment any layer spills to system RAM, speed collapses — keep the whole model on the GPU.
| Model | Quant | Approx tokens/sec (24GB GPU) | Practical fit |
|---|---|---|---|
| Llama-3-Groq-8B-Tool-Use | Q4_K_M | ~55-65 | 8-12 GB GPU |
| Qwen3 8B | Q4_K_M | ~45-55 | 8-12 GB GPU |
| Mistral 7B v0.3 | Q4_K_M | ~60+ | 6-8 GB GPU |
| Command-R 35B | Q4_K_M | ~18-25 | 24 GB GPU |
| Llama-3-Groq-70B-Tool-Use | Q4_K_M | n/a (does not fit on one 24 GB card) | ~40 GB+ (2x GPU / server) |
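To see why those throughput figures feel instant for agents, divide the size of a typical tool call by the generation speed. The sketch below assumes a 40-token call (one reading of "a few dozen JSON tokens") and uses the midpoints of the ranges in the table above; it ignores prompt processing time, so real latency will be somewhat higher.

```python
def call_latency_s(output_tokens, tokens_per_sec):
    """Generation time for one tool call, ignoring prompt processing."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec

CALL_TOKENS = 40  # assumed size of one JSON tool call

# Midpoints of the ballpark ranges in the table above.
speeds = {
    "Llama-3-Groq-8B-Tool-Use": 60.0,
    "Qwen3 8B": 50.0,
    "Command-R 35B": 21.5,
}
for model, tps in speeds.items():
    print(f"{model}: {call_latency_s(CALL_TOKENS, tps):.2f} s per call")
```

Even the slowest fully offloaded model here stays under about two seconds per call on these assumptions, so the number of agent steps usually matters more than raw tokens per second.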
Common tool-calling failure modes (and how to fix them)
Even the best model will occasionally misbehave. These are the failures we hit most often in the Ollama tools test, and the fixes:
- JSON wrapped in a markdown fence. The model returns a fenced JSON code block (a triple-backtick json block) instead of a clean tool_calls object, and your parser chokes. Fix: use a model with native tool support (Qwen3, Groq Tool-Use) so Ollama parses it into tool_calls for you, rather than asking a non-tool model to "respond in JSON."
- Hallucinated or renamed arguments. The model invents a field that is not in your schema, or renames city to location. Fix: write tight, unambiguous parameter descriptions, mark fields required, and validate the returned args against your schema before executing. Never trust them blind.
- One call when you needed two (no parallel calls). Weaker models answer only the first half of a compound request. Fix: prefer models that explicitly support parallel tool calls (Firefunction-v2, Qwen3, the Groq finetunes), or split the request into separate turns.
- Calling a tool when none was needed. The model forces an irrelevant function call on a plain question. Fix: add a clear "answer directly if no tool applies" instruction and choose a model trained to abstain; the Groq Tool-Use finetunes were specifically tuned to decide whether to call.
- Wrong types (string where a number belongs). "amount": "50" instead of 50. Fix: coerce/validate types on your side; do not assume the model respects "type": "number".
For end-to-end agent patterns that handle these gracefully, see our local AI agents guide.
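The validate-before-executing advice in the list above can be sketched in a few lines. This is a deliberately small checker that only handles flat string, number, and enum fields, as in the two test tools; validate_args is a hypothetical helper, and a full JSON Schema validator would be the more thorough option for nested schemas.

```python
def validate_args(args, parameters):
    """Check and coerce tool arguments against a tool's 'parameters' schema.

    Returns (clean_args, errors). Handles flat string/number/enum fields only.
    """
    props = parameters.get("properties", {})
    clean, errors = {}, []
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            # Hallucinated or renamed argument.
            errors.append(f"unknown argument: {key}")
            continue
        expected = spec.get("type")
        if expected == "number":
            if isinstance(value, str):
                # Coerce "50" to 50.0 instead of rejecting it outright.
                try:
                    value = float(value)
                except ValueError:
                    errors.append(f"{key}: cannot convert {value!r} to a number")
                    continue
            elif isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{key}: expected a number")
                continue
        elif expected == "string" and not isinstance(value, str):
            errors.append(f"{key}: expected a string")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: {value!r} is not one of {spec['enum']}")
            continue
        clean[key] = value
    for key in parameters.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    return clean, errors

currency_params = {
    "type": "object",
    "properties": {"amount": {"type": "number"},
                   "from": {"type": "string"},
                   "to": {"type": "string"}},
    "required": ["amount", "from", "to"],
}
weather_params = {
    "type": "object",
    "properties": {"city": {"type": "string"},
                   "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
    "required": ["city"],
}

clean, errors = validate_args({"amount": "50", "from": "USD", "to": "JPY"}, currency_params)
clean2, errors2 = validate_args({"location": "Tokyo"}, weather_params)
print(clean, errors)
print(clean2, errors2)
```

Only execute the tool when the error list is empty; otherwise feed the errors back to the model as a tool message and let it retry.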
Key Takeaways
- Llama-3-Groq-70B-Tool-Use leads the published benchmark at 90.76% BFCL (8B at 89.06%) — the cleanest function-calling finetunes you can self-host, if older-generation reasoning is acceptable.
- Qwen3 is the best modern general-purpose pick — native tool template, works directly through Ollama's tools API, and far stronger reasoning than 2024-era finetunes.
- Command-R (35B) is the multi-step + RAG specialist; Hermes 4 and Mistral round out open agent stacks; Firefunction-v2 is parallel-call capable but hosted-first.
- For a single GPU, pick Llama-3-Groq-8B for pure tool reliability or Qwen3 8B for reasoning-plus-tools; both fit on an 8-12 GB GPU at 4-bit.
- There is no "Gemma 4." The current model is Gemma 3, which lacks native tool tokens — prefer natively trained tool models for agents.
- The real metric is valid-JSON rate and argument accuracy, not chat quality. Always validate returned arguments against your schema before executing them.
Next Steps
- New to the mechanics? Start with the Ollama tool calling guide and the hands-on function calling with tools tutorial.
- Building an agent? Follow the build a local AI agent walkthrough and the broader local AI agents guide.
- Choosing the small-GPU model? Compare the Llama 3 Groq 8B and Qwen2.5 7B model pages.
- Not sure a model fits your card? Size any quant with the VRAM calculator before downloading the weights.