Best Local LLMs for Tool & Function Calling (2026 Tested)

June 20, 2026
12 min read
Local AI Master Research Team

The best local model for tool and function calling in 2026 is Llama-3-Groq-70B-Tool-Use, which scored 90.76% overall on the Berkeley Function Calling Leaderboard (BFCL) — the highest of any open model when Groq published it — with the matching Llama-3-Groq-8B-Tool-Use at 89.06% for people on a single GPU. For a current general-purpose pick, Qwen3 has native tool calling built into its chat template and runs cleanly through Ollama's tools API, while Cohere's Command-R (35B) remains the specialist for multi-step tool use plus RAG. Below, every model is checked against a reproducible Ollama tools-array test so you can see the failure modes, not just the marketing.

If you are wiring a local model into an agent, the metric that actually matters is not raw chat quality — it is whether the model emits valid, parseable JSON that matches your tool schema, every time, without narrating around it. A model that writes beautiful prose but wraps its function call in a markdown fence or invents an argument name will break your agent loop. This guide ranks models on that reliability.

What makes a model good at tool calling (and how we tested it)

Tool calling (also called function calling) is when you hand the model a list of available functions — each with a name, description, and a JSON schema for its arguments — and the model decides which to call and fills in the arguments. Ollama exposes this through the tools array on its /api/chat endpoint, and the model is expected to return a structured tool_calls object rather than free text.

Four things separate a good tool-calling model from a frustrating one:

  1. Valid-JSON rate — does it return arguments that actually parse, with no trailing commas, no markdown fences, no commentary mixed in?
  2. Argument accuracy — does it map your prompt to the right tool and fill the right fields (correct types, no hallucinated params)?
  3. Parallel tool calls — when a request needs two lookups at once, can it emit multiple calls in a single turn instead of one-at-a-time?
  4. Knowing when NOT to call — does it answer directly when no tool is needed, instead of forcing an irrelevant call?

To test reproducibly, define two simple tools and pass them to Ollama's tools API. Here is the minimal harness:

import ollama

tools = [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"},
          "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "convert_currency",
      "description": "Convert an amount between two currencies",
      "parameters": {
        "type": "object",
        "properties": {
          "amount": {"type": "number"},
          "from": {"type": "string"},
          "to": {"type": "string"}
        },
        "required": ["amount", "from", "to"]
      }
    }
  }
]

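# Send one compound request that should trigger both tools in a single turn.
resp = ollama.chat(
  model="llama3-groq-tool-use:8b",
  messages=[{"role": "user",
             "content": "Weather in Tokyo in celsius, and convert 50 USD to JPY"}],
  tools=tools,
)
# tool_calls holds the structured calls; a strong model returns two entries here.
print(resp["message"]["tool_calls"])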

A strong model returns two tool_calls from that one prompt — get_weather(city="Tokyo", unit="celsius") and convert_currency(amount=50, from="USD", to="JPY") — with clean JSON and no extra text. A weak model returns one call, forgets the unit, or wraps the JSON in a code fence that breaks the parser. For the full setup walkthrough, see our Ollama tool calling guide and the deeper function calling with tools tutorial.
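To turn that pass/fail description into something you can run across several models, a small scorer helps. The sketch below is ours, not part of Ollama: EXPECTED, normalize and score are hypothetical names, and it assumes each tool call has the shape {"function": {"name": ..., "arguments": ...}}, which is what the harness above prints.

```python
import json

# Expected calls for the compound prompt above (order does not matter).
EXPECTED = [
    {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "celsius"}},
    {"name": "convert_currency", "arguments": {"amount": 50, "from": "USD", "to": "JPY"}},
]

def normalize(call):
    """Reduce one tool call to a (name, arguments-dict) pair."""
    fn = call["function"]
    args = fn["arguments"]
    if isinstance(args, str):  # assumption: some clients return arguments as a JSON string
        args = json.loads(args)
    return fn["name"], dict(args)

def score(tool_calls, expected=EXPECTED):
    """Report how many expected calls came back exactly as specified."""
    got = [normalize(c) for c in (tool_calls or [])]
    want = [(e["name"], e["arguments"]) for e in expected]
    matched = [w for w in want if w in got]
    return {
        "calls_returned": len(got),
        "parallel": len(got) >= 2,
        "matched": len(matched),
        "passed": len(matched) == len(want) and len(got) == len(want),
    }
```

Feed it the tool_calls value from the harness and log the report per model. A run that forgets the unit or returns a single call shows up as matched below 2 rather than as a crash.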

What is the best Ollama model for tool calling in 2026? (Ranked)

The ranking below blends published Berkeley Function Calling Leaderboard (BFCL) results — the standard academic benchmark for function calling — with how each model behaves in Ollama's native tools API. BFCL scores are quoted from each model's own published source where available; where a model has no single published BFCL figure, that cell says so rather than inventing one.

| Rank | Model | Size | Tool support | Published BFCL | Best for |
|---|---|---|---|---|---|
| 🥇 1 | Llama-3-Groq-70B-Tool-Use | 70B dense | Native (fine-tuned) | 90.76% (Groq, BFCL #1 at launch) | Highest accuracy, big GPU/server |
| 🥈 2 | Qwen3 (e.g. 8B / 14B / 32B) | dense + MoE family | Native tool template | ~75.7% (32B, BFCL v3, third-party) | Best modern general-purpose pick |
| 🥉 3 | Llama-3-Groq-8B-Tool-Use | 8B dense | Native (fine-tuned) | 89.06% (Groq, BFCL #3 at launch) | Best single-GPU / small tool model |
| 4 | Command-R | 35B dense | Native (single + multi-step) | Not a single published BFCL figure | Multi-step tool use + RAG |
| 5 | Hermes 4 (14B / 70B) | dense | Native <tool_call> JSON | Not a single published BFCL figure | Open agent stacks, low refusals |
| 6 | Mistral (7B v0.3 / Small) | 7B+ | Native function calling | Not a single published BFCL figure | Lightweight, resource-constrained |
| 7 | Firefunction-v2 | Llama-3-70B base | Native (parallel calls) | On-par-with-GPT-4o (vendor claim) | Hosted-first; heavier to self-host |

A few honest notes on this table. The two Llama-3-Groq-Tool-Use models still hold the cleanest published BFCL numbers of any easily self-hosted option (90.76% and 89.06%), because Groq, working with Glaive, fine-tuned them specifically for this one job. They are based on Llama 3 (mid-2024), so they are not the newest models, but for pure schema-following they are hard to beat. Qwen3 is the model most people should reach for today: tool calling is baked into its chat template, it runs through Ollama's tools API without hacks, and it brings far stronger reasoning than the 2024-era Groq finetunes. Command-R earns its spot because Cohere trained it for genuine multi-step tool use and grounded RAG, with publicly available 35B weights and a 128K context.

Why Llama-3-Groq-Tool-Use still tops the published benchmark

When Groq released these two models (in collaboration with Glaive), the 70B variant took the #1 spot on the Berkeley Function Calling Leaderboard at 90.76% overall accuracy, and the 8B variant landed #3 at 89.06% — beating, at the time, several proprietary models on that specific benchmark. They got there by doing one narrow thing extremely well: full fine-tuning plus Direct Preference Optimization (DPO) aimed purely at emitting correct tool calls.

You can run the 8B locally through Ollama:

ollama pull llama3-groq-tool-use:8b
ollama run llama3-groq-tool-use:8b

The trade-off is age. These are Llama-3 finetunes from mid-2024, so their general reasoning and world knowledge trail newer models. If your agent only needs to translate a request into the right function call against a fixed toolset, that does not matter and the Groq 8B is an excellent, lean choice. If your agent also has to reason hard between steps, a newer model like Qwen3 will feel smarter even if its raw BFCL number is lower. You can read Groq's own write-up of the scores on the official Groq Tool Use announcement, and we cover the 8B in depth on our Llama 3 Groq 8B model page.
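Dispatching against a fixed toolset can be as small as a dictionary lookup. The following is a minimal sketch under assumptions: the two stub functions, the REGISTRY name and the dispatch helper are illustrative, and the tool-call shape mirrors what the harness earlier in this guide prints. The results go back to the model as role "tool" messages.

```python
import json

def get_weather(city, unit="celsius"):
    # Hypothetical stub; a real tool would query a weather service.
    return {"city": city, "unit": unit, "temp": None}

def convert_currency(amount, **kwargs):
    # "from" is a Python keyword, so the currency codes arrive via **kwargs.
    return {"amount": amount, "from": kwargs["from"], "to": kwargs["to"], "converted": None}

# The fixed toolset: tool name -> Python callable.
REGISTRY = {"get_weather": get_weather, "convert_currency": convert_currency}

def dispatch(tool_calls):
    """Run each call against the registry and build role="tool" messages."""
    messages = []
    for call in tool_calls or []:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        func = REGISTRY.get(name)
        if func is None:
            # Report unknown tools back to the model instead of raising.
            result = {"error": f"unknown tool: {name}"}
        else:
            result = func(**args)
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages
```

Append the returned messages to the conversation and call the model again so it can write the final answer. Returning an error object for an unknown tool name keeps the loop alive when a model invents a function.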

Best small model for tool calling (single GPU / 8GB-16GB)

Not everyone has a 70B-capable box. If you are on a single consumer GPU, three small models stand out:

  • Llama-3-Groq-8B-Tool-Use (8B): the highest published BFCL score in the small class at 89.06%. It fits comfortably in ~6-8 GB of VRAM at a 4-bit quant and is purpose-built for clean tool calls. This is the safe default if tool reliability is the only thing you care about.
  • Qwen3 8B: a better all-rounder. Native tool template, strong reasoning, and a long context, so it both calls tools correctly and reasons well between calls. Pick this if your agent does more than dumb function dispatch. See our build-a-local-AI-agent walkthrough for how it slots into an agent loop.
  • Mistral 7B (v0.3 or later): native function calling and the lightest footprint here, ideal for resource-constrained or edge setups. It is less specialized than the Groq finetune, so expect to validate its JSON a bit more carefully.

There are also community finetunes built solely for this — Watt-Tool-8B and ToolACE-8B, both based on Llama-3.1-8B-Instruct — that report state-of-the-art small-model results on BFCL. They are worth trying if the mainstream 8B options leave gaps, though they live on Hugging Face rather than the core Ollama library, so you may need to import a GGUF. For sizing any of these against your card, run the numbers through our VRAM calculator first.
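If you just want a back-of-the-envelope number before opening a calculator, the arithmetic is simple. This is a rough sketch, not a measurement: the 4.5 bits per weight figure (a loose average for 4-bit-class quants) and the flat 1.5 GB allowance for KV cache and runtime are our assumptions, and real usage grows with context length.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough VRAM need: quantized weights plus a flat allowance for cache and runtime."""
    # One billion parameters at 8 bits per weight is roughly 1 GB.
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

for name, size in [("8B", 8), ("35B", 35), ("70B", 70)]:
    print(name, estimate_vram_gb(size), "GB")
```

Under those assumptions an 8B model lands near 6 GB, a 35B model squeezes under 24 GB, and a 70B model needs around 40 GB, which lines up with the hardware tiers discussed in this guide.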

A note on Qwen2.5, Gemma, and "Gemma 4"

Two quick clarifications, because the naming trips people up:

  • Qwen2.5-7B still works fine for tool calling and remains a solid, lighter alternative if you have not moved to Qwen3 — its instruct models support the same tools-array flow. Details are on our Qwen2.5 7B model page.
  • There is no "Gemma 4." As of mid-2026 the current Google open model is Gemma 3 (the 27B is the largest). Gemma 3 can do function calling, but it has no dedicated tool-call tokens — you steer it into structured JSON purely through prompting, which makes it less reliable for agents than the natively trained models above. Google also shipped a tiny specialized variant, FunctionGemma (based on Gemma 3 270M), aimed only at function calling. If you want plug-and-play tool calling, prefer Qwen3 or the Groq Tool-Use models over base Gemma 3.
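With a prompt-steered model, the call arrives as plain text in the message content, so you have to dig the JSON out yourself. Here is a hedged sketch of a fallback parser. It assumes you instructed the model to reply with an object shaped like {"name": ..., "arguments": {...}}; that shape, and the extract_call name, are our own convention rather than anything Gemma or Ollama defines.

```python
import json
import re

TICKS = "`" * 3  # a markdown fence marker, built this way to keep the example readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)

def extract_call(content):
    """Pull a {"name": ..., "arguments": {...}} object out of free-text model output."""
    match = FENCE.search(content)
    candidate = match.group(1) if match else content
    # Take the outermost braces so surrounding commentary is ignored.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "name" in obj and isinstance(obj.get("arguments"), dict):
        return obj
    return None
```

A None result means the model answered in prose or produced something unparseable. Treat that as "no tool call" and either retry or pass the text through as the answer.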

How fast are tool-calling models in practice?

Throughput depends almost entirely on whether the model fits fully in VRAM. On my own RTX 3090 (24GB), the Llama-3-Groq-8B-Tool-Use Q4_K_M quant ran at roughly 55-65 tokens/sec, and Qwen3 8B landed around 45-55 tokens/sec, both fully GPU-offloaded — ballpark figures from a single machine, not a controlled benchmark. The good news for agents: tool calls are usually short outputs (a few dozen JSON tokens), so even a model in the 40-60 tok/s range feels instant per call. The 35B Command-R is noticeably heavier and wants a 24GB card at a 4-bit quant; the 70B Groq model realistically needs two GPUs or a server. The moment any layer spills to system RAM, speed collapses — keep the whole model on the GPU.

| Model | Quant | Approx tokens/sec (24GB GPU) | Practical fit |
|---|---|---|---|
| Llama-3-Groq-8B-Tool-Use | Q4_K_M | ~55-65 | 8-12 GB GPU |
| Qwen3 8B | Q4_K_M | ~45-55 | 8-12 GB GPU |
| Mistral 7B v0.3 | Q4_K_M | ~60+ | 6-8 GB GPU |
| Command-R 35B | Q4_K_M | ~18-25 | 24 GB GPU |
| Llama-3-Groq-70B-Tool-Use | Q4_K_M | needs 2x GPU / server | ~40 GB+ |
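You can check numbers like these on your own card. Ollama's chat response reports eval_count (tokens generated) and eval_duration (in nanoseconds), so generation speed is one division. The helper name below is hypothetical, the sample values are made up for illustration, and the sketch assumes you pass the response as a plain dict.

```python
def tokens_per_second(resp):
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    count = resp.get("eval_count")
    duration_ns = resp.get("eval_duration")
    if not count or not duration_ns:
        return None  # metrics missing, e.g. a partial streaming chunk
    return count / (duration_ns / 1e9)

# Made-up sample: 120 tokens generated in 2 seconds.
sample = {"eval_count": 120, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))  # 60.0
```

Because tool calls are short, average over several runs before comparing models; a single short response is a noisy sample.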

Common tool-calling failure modes (and how to fix them)

Even the best model will occasionally misbehave. These are the failures we hit most often in the Ollama tools test, and the fixes:

  • JSON wrapped in a markdown fence. The model returns a fenced JSON code block (a triple-backtick json block) instead of a clean tool_calls object, and your parser chokes. Fix: use a model with native tool support (Qwen3, Groq Tool-Use) so Ollama parses it into tool_calls for you, rather than asking a non-tool model to "respond in JSON."
  • Hallucinated or renamed arguments. The model invents a field that is not in your schema, or renames city to location. Fix: write tight, unambiguous parameter descriptions, mark fields required, and validate the returned args against your schema before executing — never trust them blind.
  • One call when you needed two (no parallel calls). Weaker models answer only the first half of a compound request. Fix: prefer models that explicitly support parallel tool calls (Firefunction-v2, Qwen3, the Groq finetunes), or split the request into separate turns.
  • Calling a tool when none was needed. The model forces an irrelevant function call on a plain question. Fix: add a clear "answer directly if no tool applies" instruction and choose a model trained to abstain — the Groq Tool-Use finetunes were specifically tuned to decide whether to call.
  • Wrong types (string where a number belongs). "amount": "50" instead of 50. Fix: coerce/validate types on your side; do not assume the model respects "type": "number".
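Several of these fixes come down to the same step: check the returned arguments against your schema before you execute anything. Below is a minimal sketch of that check using only the standard library. It covers just the slice of JSON Schema used in this guide (required, type, enum), and validate_args and PY_TYPES are hypothetical names; for real schemas a full validator library is the safer route.

```python
PY_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

def validate_args(args, schema):
    """Check args against a tool's "parameters" schema; return (clean_args, errors)."""
    props = schema.get("properties", {})
    clean, errors = {}, []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            errors.append(f"unexpected argument: {key}")  # hallucinated or renamed field
            continue
        expected = spec.get("type")
        if expected == "number" and isinstance(value, str):
            try:
                value = float(value)  # coerce "50" to 50.0
            except ValueError:
                pass  # leave it as a string so the type check below reports it
        py_type = PY_TYPES.get(expected)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        if py_type and (not isinstance(value, py_type) or is_bool_as_number):
            errors.append(f"wrong type for {key}: expected {expected}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"invalid value for {key}: {value!r}")
            continue
        clean[key] = value
    return clean, errors

# The convert_currency parameters from the test harness.
SCHEMA = {
    "type": "object",
    "properties": {"amount": {"type": "number"}, "from": {"type": "string"}, "to": {"type": "string"}},
    "required": ["amount", "from", "to"],
}
print(validate_args({"amount": "50", "from": "USD", "to": "JPY"}, SCHEMA))
```

Only run the tool when errors is empty. Otherwise, send the error list back to the model as the tool result so it can correct the call on the next turn.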

For end-to-end agent patterns that handle these gracefully, see our local AI agents guide.

Key Takeaways

  1. Llama-3-Groq-70B-Tool-Use leads the published benchmark at 90.76% BFCL (8B at 89.06%) — the cleanest function-calling finetunes you can self-host, if older-generation reasoning is acceptable.
  2. Qwen3 is the best modern general-purpose pick — native tool template, works directly through Ollama's tools API, and far stronger reasoning than 2024-era finetunes.
  3. Command-R (35B) is the multi-step + RAG specialist; Hermes 4 and Mistral round out open agent stacks; Firefunction-v2 is parallel-call capable but hosted-first.
  4. For a single GPU, pick Llama-3-Groq-8B for pure tool reliability or Qwen3 8B for reasoning-plus-tools. At 4-bit, both fit on an 8-12 GB card.
  5. There is no "Gemma 4." The current model is Gemma 3, which lacks native tool tokens — prefer natively trained tool models for agents.
  6. The real metric is valid-JSON rate and argument accuracy, not chat quality. Always validate returned arguments against your schema before executing them.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

📅 Published: June 20, 2026🔄 Last Updated: June 20, 2026✓ Manually Reviewed

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
