The best local model for tool and function calling in 2026 is Llama-3-Groq-70B-Tool-Use. It scored 90.76% overall on the Berkeley Function Calling Leaderboard (BFCL), the highest of any open model when Groq published it, and the matching Llama-3-Groq-8B-Tool-Use reached 89.06% for people on a single GPU. For a current general-purpose pick, Qwen3 has native tool calling built into its chat template and runs cleanly through Ollama's tools API, while Cohere's Command-R (35B) remains the specialist for multi-step tool use plus RAG. Below, every model is checked against a reproducible Ollama tools-array test so you can see the failure modes, not just the marketing.

If you are wiring a local model into an agent, the metric that actually matters is not raw chat quality — it is whether the model emits valid, parseable JSON that matches your tool schema, every time, without narrating around it. A model that writes beautiful prose but wraps its function call in a markdown fence or invents an argument name will break your agent loop. This guide ranks models on that reliability.

What makes a model good at tool calling (and how we tested it)

Tool calling (also called function calling) is when you hand the model a list of available functions — each with a name, description, and a JSON schema for its arguments — and the model decides which to call and fills in the arguments. Ollama exposes this through the tools array on its /api/chat endpoint, and the model is expected to return a structured tool_calls object rather than free text.
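To make the round trip concrete, here is a minimal sketch of what your own code does with a returned tool_calls list: look up each requested function and run it with the arguments the model supplied. The REGISTRY dict, the run_tool_calls helper, and the canned get_weather reply are hypothetical names for illustration. The tool_name key on the reply message follows recent Ollama examples, so treat it as an assumption and check it against your client version.

```python
def get_weather(city, unit="celsius"):
    # Hypothetical stub: a real implementation would query a weather service.
    return f"22 degrees {unit} in {city}"


# Maps the tool names advertised to the model onto local Python callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry):
    """Run each requested call and return one tool-role message per call."""
    messages = []
    for call in tool_calls or []:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        handler = registry.get(name)
        if handler is None:
            content = f"error: unknown tool {name!r}"
        else:
            try:
                content = str(handler(**args))
            except TypeError as exc:
                # Typically raised when the model sends arguments
                # that the function does not accept.
                content = f"error: bad arguments for {name}: {exc}"
        messages.append({"role": "tool", "tool_name": name, "content": content})
    return messages


# Same shape as one entry of a tool_calls list, written as plain dicts.
calls = [{"function": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}]
print(run_tool_calls(calls, REGISTRY))
```

Returning errors as tool messages, instead of raising, lets the model see what went wrong and retry on the next turn.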

Four things separate a good tool-calling model from a frustrating one:

Valid-JSON rate — does it return arguments that actually parse, with no trailing commas, no markdown fences, no commentary mixed in?
Argument accuracy — does it map your prompt to the right tool and fill the right fields (correct types, no hallucinated params)?
Parallel tool calls — when a request needs two lookups at once, can it emit multiple calls in a single turn instead of one-at-a-time?
Knowing when NOT to call — does it answer directly when no tool is needed, instead of forcing an irrelevant call?

To test reproducibly, define two simple tools and pass them to Ollama's tools API. Here is the minimal harness:

import ollama

tools = [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"},
          "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "convert_currency",
      "description": "Convert an amount between two currencies",
      "parameters": {
        "type": "object",
        "properties": {
          "amount": {"type": "number"},
          "from": {"type": "string"},
          "to": {"type": "string"}
        },
        "required": ["amount", "from", "to"]
      }
    }
  }
]

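# One compound prompt that should trigger both tools in a single turn.
resp = ollama.chat(
  model="llama3-groq-tool-use:8b",
  messages=[{"role": "user",
             "content": "Weather in Tokyo in celsius, and convert 50 USD to JPY"}],
  tools=tools,
)

# tool_calls is absent when the model answers in plain text, so use .get()
# rather than indexing, which can raise KeyError.
for call in resp["message"].get("tool_calls") or []:
  print(call["function"]["name"], call["function"]["arguments"])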

A strong model returns two tool_calls from that one prompt — get_weather(city="Tokyo", unit="celsius") and convert_currency(amount=50, from="USD", to="JPY") — with clean JSON and no extra text. A weak model returns one call, forgets the unit, or wraps the JSON in a code fence that breaks the parser. For the full setup walkthrough, see our Ollama tool calling guide and the deeper function calling with tools tutorial.
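The pass/fail description above can be checked mechanically instead of by eye. The sketch below is one way to grade a response. The names grade_tool_calls and EXPECTED are hypothetical, and it assumes each tool call is dict-like with a function name and an arguments mapping, which is the shape used in the harness.

```python
# Expected calls for the compound Tokyo / USD-to-JPY prompt.
EXPECTED = {
    "get_weather": {"city": "Tokyo", "unit": "celsius"},
    "convert_currency": {"amount": 50, "from": "USD", "to": "JPY"},
}


def grade_tool_calls(tool_calls, expected):
    """Compare returned tool calls with the calls we expected.

    tool_calls: list of items shaped like
                {"function": {"name": ..., "arguments": {...}}}
    expected:   dict mapping tool name -> expected arguments
    """
    got = {}
    for call in tool_calls or []:
        fn = call["function"]
        # If a tool is called twice, the later call wins in this simple sketch.
        got[fn["name"]] = dict(fn["arguments"])
    return {
        # Every expected tool was requested (both halves of a compound prompt).
        "all_tools_called": set(expected) <= set(got),
        # No irrelevant tool was requested.
        "no_extra_tools": set(got) <= set(expected),
        # Argument names, values, and types all match exactly.
        "arguments_match": all(got.get(n) == a for n, a in expected.items()),
    }


# A weak response: one call only, and the unit is missing.
weak = [{"function": {"name": "get_weather", "arguments": {"city": "Tokyo"}}}]
print(grade_tool_calls(weak, EXPECTED))
```

Passing an empty dict as expected turns the same function into an abstention check: any call the model makes to a plain question shows up as no_extra_tools being False.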


What is the best Ollama model for tool calling in 2026? (Ranked)

The ranking below blends published Berkeley Function Calling Leaderboard (BFCL) results — the standard academic benchmark for function calling — with how each model behaves in Ollama's native tools API. BFCL scores are quoted from each model's own published source where available; where a model has no single published BFCL figure, that cell says so rather than inventing one.

Rank	Model	Size	Tool support	Published BFCL	Best for
🥇 1	Llama-3-Groq-70B-Tool-Use	70B dense	Native (fine-tuned)	90.76% (Groq, BFCL #1 at launch)	Highest accuracy, big GPU/server
🥈 2	Qwen3 (e.g. 8B / 14B / 32B)	dense + MoE family	Native tool template	~75.7% (32B, BFCL v3, third-party)	Best modern general-purpose pick
🥉 3	Llama-3-Groq-8B-Tool-Use	8B dense	Native (fine-tuned)	89.06% (Groq, BFCL #3 at launch)	Best single-GPU / small tool model
4	Command-R	35B dense	Native (single + multi-step)	Not a single published BFCL figure	Multi-step tool use + RAG
5	Hermes 4 (14B / 70B)	dense	Native `<tool_call>` JSON	Not a single published BFCL figure	Open agent stacks, low refusals
6	Mistral (7B v0.3 / Small)	7B+	Native function calling	Not a single published BFCL figure	Lightweight, resource-constrained
7	Firefunction-v2	Llama-3-70B base	Native (parallel calls)	On-par-with-GPT-4o (vendor claim)	Hosted-first; heavier to self-host

A few honest notes on this table. The two Llama-3-Groq-Tool-Use models still hold the cleanest published BFCL numbers of any easily self-hosted option — 90.76% and 89.06% — because Groq fine-tuned them specifically for this one job with Glaive. They are based on Llama 3 (mid-2024), so they are not the newest brains, but for pure schema-following they are hard to beat. Qwen3 is the model most people should reach for today: tool calling is baked into its chat template, it runs through Ollama's tools API without hacks, and it brings far stronger reasoning than the 2024-era Groq finetunes. Command-R earns its spot because Cohere trained it for genuine multi-step tool use and grounded RAG, with publicly available 35B weights and a 128K context.

Why Llama-3-Groq-Tool-Use still tops the published benchmark

When Groq released these two models (in collaboration with Glaive), the 70B variant took the #1 spot on the Berkeley Function Calling Leaderboard at 90.76% overall accuracy, and the 8B variant landed #3 at 89.06% — beating, at the time, several proprietary models on that specific benchmark. They got there by doing one narrow thing extremely well: full fine-tuning plus Direct Preference Optimization (DPO) aimed purely at emitting correct tool calls.

You can run the 8B locally through Ollama:

ollama pull llama3-groq-tool-use:8b
ollama run llama3-groq-tool-use:8b

The trade-off is age. These are Llama-3 finetunes from mid-2024, so their general reasoning and world knowledge trail newer models. If your agent only needs to translate a request into the right function call against a fixed toolset, that does not matter and the Groq 8B is an excellent, lean choice. If your agent also has to reason hard between steps, a newer model like Qwen3 will feel smarter even if its raw BFCL number is lower. Groq's own write-up of the scores is in the official Groq Tool Use announcement, and we cover the 8B in depth on our Llama 3 Groq 8B model page.

Best small model for tool calling (single GPU / 8GB-16GB)

Not everyone has a 70B-capable box. If you are on a single consumer GPU, three small models stand out:

Llama-3-Groq-8B-Tool-Use (8B): the highest published BFCL score in the small class at 89.06%. At a 4-bit quant it needs roughly 6-8 GB of VRAM, so it fits on an 8-12 GB card, and it is purpose-built for clean tool calls. This is the safe default if tool reliability is the only thing you care about.
Qwen3 8B: a better all-rounder. Native tool template, strong reasoning, and a long context, so it both calls tools correctly and reasons well between calls. Pick this if your agent does more than dumb function dispatch. See our build-a-local-AI-agent walkthrough for how it slots into an agent loop.
Mistral 7B (v0.3 or later): native function calling and the lightest footprint here, ideal for resource-constrained or edge setups. It is less specialized than the Groq finetune, so expect to validate its JSON a bit more carefully.

There are also community finetunes built solely for this — Watt-Tool-8B and ToolACE-8B, both based on Llama-3.1-8B-Instruct — that report state-of-the-art small-model results on BFCL. They are worth trying if the mainstream 8B options leave gaps, though they live on Hugging Face rather than the core Ollama library, so you may need to import a GGUF. For sizing any of these against your card, run the numbers through our VRAM calculator first.


A note on Qwen2.5, Gemma, and "Gemma 4"

Two quick clarifications, because the naming trips people up:

Qwen2.5-7B still works fine for tool calling and remains a solid, lighter alternative if you have not moved to Qwen3 — its instruct models support the same tools-array flow. Details are on our Qwen2.5 7B model page.
There is no "Gemma 4." As of mid-2026 the current Google open model is Gemma 3 (the 27B is the largest). Gemma 3 can do function calling, but it has no dedicated tool-call tokens — you steer it into structured JSON purely through prompting, which makes it less reliable for agents than the natively trained models above. Google also shipped a tiny specialized variant, FunctionGemma (based on Gemma 3 270M), aimed only at function calling. If you want plug-and-play tool calling, prefer Qwen3 or the Groq Tool-Use models over base Gemma 3.
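If you do steer a prompt-only model such as base Gemma 3 into JSON, you have to dig the call out of free text yourself. The following is a rough fallback sketch, not something Ollama provides: extract_json_call is a hypothetical helper that looks inside a markdown fence first, then falls back to the first opening brace in the reply.

```python
import json
import re

# Built from single backticks so this listing stays readable inside a fence.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_call(text):
    """Return the first JSON object found in free text, or None."""
    match = FENCE_RE.search(text)
    # Prefer the fenced body when there is one, else scan the whole reply.
    candidate = match.group(1) if match else text
    start = candidate.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores any trailing prose.
        obj, _ = json.JSONDecoder().raw_decode(candidate[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


reply = (
    "Sure, here is the call:\n"
    + FENCE + "json\n"
    + '{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n'
    + FENCE
)
print(extract_json_call(reply))
```

This only recovers the JSON; it says nothing about whether the arguments are right, so schema validation still applies afterwards.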

How fast are tool-calling models in practice?

Throughput depends almost entirely on whether the model fits fully in VRAM. On my own RTX 3090 (24GB), the Llama-3-Groq-8B-Tool-Use Q4_K_M quant ran at roughly 55-65 tokens/sec, and Qwen3 8B landed around 45-55 tokens/sec, both fully GPU-offloaded — ballpark figures from a single machine, not a controlled benchmark. The good news for agents: tool calls are usually short outputs (a few dozen JSON tokens), so even a model in the 40-60 tok/s range feels instant per call. The 35B Command-R is noticeably heavier and wants a 24GB card at a 4-bit quant; the 70B Groq model realistically needs two GPUs or a server. The moment any layer spills to system RAM, speed collapses — keep the whole model on the GPU.

Model	Quant	Approx tokens/sec (24GB GPU)	Practical fit
Llama-3-Groq-8B-Tool-Use	Q4_K_M	~55-65	8-12 GB GPU
Qwen3 8B	Q4_K_M	~45-55	8-12 GB GPU
Mistral 7B v0.3	Q4_K_M	~60+	6-8 GB GPU
Command-R 35B	Q4_K_M	~18-25	24 GB GPU
Llama-3-Groq-70B-Tool-Use	Q4_K_M	n/a (does not fit in 24 GB)	2x GPU / server, ~40 GB+
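The "practical fit" column follows from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight, plus some headroom for the KV cache and runtime. The sketch below is a back-of-envelope estimate only. The 4.5 bits per weight and the 1.5 GB overhead are assumed defaults chosen for illustration, not measured values, and both function names are hypothetical.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough VRAM need in GB for a quantized model.

    bits_per_weight and overhead_gb are assumptions: real 4-bit quants
    vary, and the KV cache grows with context length.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion, vram_gb, **kwargs):
    """True if the estimate is within the card's VRAM."""
    return estimate_vram_gb(params_billion, **kwargs) <= vram_gb


for name, size in [("8B", 8), ("35B", 35), ("70B", 70)]:
    print(name, round(estimate_vram_gb(size), 1), "GB, fits 24 GB:", fits(size, 24))
```

Under these assumptions the 8B lands near 6 GB, the 35B near 21 GB, and the 70B near 41 GB, which lines up with the table. Long contexts push the real number higher.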

Common tool-calling failure modes (and how to fix them)

Even the best model will occasionally misbehave. These are the failures we hit most often in the Ollama tools test, and the fixes:

JSON wrapped in a markdown fence. The model returns a fenced JSON code block (a triple-backtick json block) instead of a clean tool_calls object, and your parser chokes. Fix: use a model with native tool support (Qwen3, Groq Tool-Use) so Ollama parses it into tool_calls for you, rather than asking a non-tool model to "respond in JSON."
Hallucinated or renamed arguments. The model invents a field that is not in your schema, or renames city to location. Fix: write tight, unambiguous parameter descriptions, mark fields required, and validate the returned args against your schema before executing — never trust them blind.
One call when you needed two (no parallel calls). Weaker models answer only the first half of a compound request. Fix: prefer models that explicitly support parallel tool calls (Firefunction-v2, Qwen3, the Groq finetunes), or split the request into separate turns.
Calling a tool when none was needed. The model forces an irrelevant function call on a plain question. Fix: add a clear "answer directly if no tool applies" instruction and choose a model trained to abstain — the Groq Tool-Use finetunes were specifically tuned to decide whether to call.
Wrong types (string where a number belongs). "amount": "50" instead of 50. Fix: coerce/validate types on your side; do not assume the model respects "type": "number".
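Several of these fixes come down to the same step: check the returned arguments against your schema before running anything. Here is a minimal hand-rolled sketch of that check. validate_args is a hypothetical helper that covers only the string, number, enum, and required keywords used by the two test tools; a full JSON Schema validator would handle far more.

```python
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

CURRENCY_SCHEMA = {
    "type": "object",
    "properties": {
        "amount": {"type": "number"},
        "from": {"type": "string"},
        "to": {"type": "string"},
    },
    "required": ["amount", "from", "to"],
}


def validate_args(args, schema):
    """Return (clean_args, errors) for one tool call's arguments."""
    props = schema.get("properties", {})
    errors = [f"missing required argument: {name}"
              for name in schema.get("required", []) if name not in args]
    clean = {}
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            # Catches invented or renamed fields, e.g. location for city.
            errors.append(f"unknown argument: {name}")
            continue
        kind = spec.get("type")
        if kind == "number":
            if isinstance(value, str):
                try:
                    value = float(value)  # coerce "50" to 50.0
                except ValueError:
                    pass
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{name}: expected a number")
                continue
        elif kind == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected a string")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} is not one of {spec['enum']}")
            continue
        clean[name] = value
    return clean, errors


print(validate_args({"amount": "50", "from": "USD", "to": "JPY"}, CURRENCY_SCHEMA))
print(validate_args({"location": "Tokyo"}, WEATHER_SCHEMA))
```

Execute the tool only when the error list is empty; otherwise send the errors back to the model as the tool result so it can correct the call.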

For end-to-end agent patterns that handle these gracefully, see our local AI agents guide.

Key Takeaways

Llama-3-Groq-70B-Tool-Use leads the published benchmark at 90.76% BFCL (8B at 89.06%) — the cleanest function-calling finetunes you can self-host, if older-generation reasoning is acceptable.
Qwen3 is the best modern general-purpose pick — native tool template, works directly through Ollama's tools API, and far stronger reasoning than 2024-era finetunes.
Command-R (35B) is the multi-step + RAG specialist; Hermes 4 and Mistral round out open agent stacks; Firefunction-v2 is parallel-call capable but hosted-first.
For a single GPU, pick Llama-3-Groq-8B for pure tool reliability or Qwen3 8B for reasoning-plus-tools; both fit on an 8-12 GB card at 4-bit.
There is no "Gemma 4." The current model is Gemma 3, which lacks native tool tokens — prefer natively trained tool models for agents.
The real metric is valid-JSON rate and argument accuracy, not chat quality. Always validate returned arguments against your schema before executing them.

Next Steps

New to the mechanics? Start with the Ollama tool calling guide and the hands-on function calling with tools tutorial.
Building an agent? Follow the build a local AI agent walkthrough and the broader local AI agents guide.
Choosing the small-GPU model? Compare the Llama 3 Groq 8B and Qwen2.5 7B model pages.
Not sure a model fits your card? Size any quant with the VRAM calculator before downloading the weights.

Best Local LLMs for Tool & Function Calling (2026 Tested)


Written by the Local AI Master Team
