Models

IBM Granite 3 Local Setup Guide (2026): Enterprise-Grade Open Models

May 1, 2026
18 min read
LocalAimaster Research Team


IBM Granite 3 is the enterprise-grade open-weights model family for organizations that care about Apache 2.0 licensing, audited training data, and built-in safety. Sizes range from 1B (mobile) to 8B (production), plus MoE variants for sparse-activation efficiency. The family is strong on code (Granite Code, 3B-34B), structured output, and instruction following, and pairs with Granite Guardian for input/output safety classification. For regulated industries (finance, healthcare, government), Granite is the cleanest open-source choice in 2026.

This guide covers the Granite 3 family, setup across runtimes, the Granite Guardian safety stack, Granite Code for coding assistants, fine-tuning workflows, watsonx integration, and benchmarks vs Llama 3.1 / Mistral / Phi-4.

Table of Contents

  1. What Granite 3 Is
  2. The Granite Family
  3. Apache 2.0 + Audited Training Data
  4. Hardware Requirements
  5. Granite vs Llama / Mistral / Phi-4
  6. Ollama Setup
  7. llama.cpp / vLLM Setup
  8. Chat Template
  9. Granite Code Variants
  10. Granite Guardian Safety Models
  11. Function Calling and JSON Mode
  12. Multilingual Support
  13. Fine-Tuning
  14. watsonx Integration
  15. Real Benchmarks
  16. Production Deployment
  17. Troubleshooting


What Granite 3 Is {#what-it-is}

IBM Granite 3 (ibm-granite/* on Hugging Face) is IBM's flagship open-weights LLM family, released October 2024 (3.0) and updated through 2025-2026 (3.1, 3.2, 3.3). It is Apache 2.0 licensed, covers 12 languages, and supports a 128K context window from 3.2 onward.

Key thesis: enterprise-grade quality with auditable data sources, strong on structured/agentic workloads, and a complete supporting stack (Guardian, Code, Embeddings).


The Granite Family {#family}

| Variant | Use | VRAM (BF16 / Q4) |
|---|---|---|
| Granite 3.2 8B Instruct | General default | 16 GB / 5 GB |
| Granite 3.2 2B Instruct | Edge / mobile | 4 GB / 1.5 GB |
| Granite 3.2 1B Instruct | Tiny edge | 2 GB / 0.7 GB |
| Granite 3.0 1B-A400M MoE | Sparse, 400M activated | 4 GB / 1.5 GB |
| Granite 3.0 3B-A800M MoE | Sparse, 800M activated | 6 GB / 2 GB |
| Granite Code 3B/8B/20B/34B | Coding | varies |
| Granite Embedding 30M / 125M / 278M | RAG | <1 GB |
| Granite Guardian 2B / 8B | Safety classification | 4 GB / 16 GB |

For most production deployments: Granite 3.2 8B Instruct for generation, Granite Guardian 2B as the input/output filter, and Granite Embedding 278M for RAG.


Apache 2.0 + Audited Training Data {#license}

Apache 2.0 license — same commercial-friendly terms as Mistral Small 3. Plus IBM publishes Data Cards documenting:

  • All training data sources
  • Filtering and deduplication processes
  • Bias detection and mitigation
  • IP/copyright due diligence

For regulated industries this provenance matters — IBM provides legal cover that Llama / Qwen / DeepSeek do not.



Hardware Requirements {#requirements}

| GPU | Quant (Granite 3.2 8B) | Tok/s |
|---|---|---|
| RTX 3060 12 GB | Q5_K_M | ~95 |
| RTX 4070 16 GB | Q6_K | ~110 |
| RTX 4090 24 GB | Q8_0 | ~140 |
| Mac M3 Pro | Q5_K_M | ~45 |
| RX 7900 XTX | Q5_K_M | ~85 |

Granite 8B is faster than Mistral Small 3 24B and Phi-4 14B because it's smaller — useful for high-throughput agent/RAG workloads.
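As a rough rule of thumb (not an official sizing formula), you can estimate a quantized model's file size from its parameter count and effective bits per weight; the 5.5 bits/weight figure for Q5_K_M below is an approximation:

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8, with ~10%
# overhead for higher-precision embedding layers and metadata. Illustrative
# only; real quant files vary by quant mix.
def gguf_size_gb(params_billions: float, bits_per_weight: float,
                 overhead: float = 1.10) -> float:
    return params_billions * bits_per_weight / 8 * overhead

# Granite 3.2 8B (~8.1B params) at Q5_K_M (~5.5 effective bits per weight)
print(round(gguf_size_gb(8.1, 5.5), 1))  # → 6.1
```

That lands close to the ~6 GB Q5 memory figure reported in the benchmarks section later in this guide.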


Granite vs Llama / Mistral / Phi-4 {#comparison}

| Benchmark | Granite 3.2 8B | Llama 3.1 8B | Mistral Small 3 24B | Phi-4 14B |
|---|---|---|---|---|
| MMLU | 75.8 | 73.0 | 81.0 | 84.8 |
| HumanEval | 79.9 | 72.6 | 84.8 | 82.6 |
| GSM8K | 88.7 | 84.5 | 90.0 | 95.6 |
| IFEval (instruction) | 80.5 | 76.6 | 83.5 | 79.4 |
| Function calling | 87 | 80 | 89 | 76 |
| RAG groundedness | 89 | 78 | 84 | 75 |

Granite leads on RAG grounding and structured tasks. For pure reasoning, Phi-4. For pure chat, Mistral / Llama.


Ollama Setup {#ollama}

ollama run granite3.2
ollama run granite3.2:2b   # smaller variant
ollama run granite3.2:8b
ollama run granite-code:8b  # coding variant

llama.cpp / vLLM Setup {#runtimes}

llama.cpp

huggingface-cli download bartowski/granite-3.2-8b-instruct-GGUF \
    granite-3.2-8b-instruct-Q5_K_M.gguf --local-dir ./models

./llama-cli -m models/granite-3.2-8b-instruct-Q5_K_M.gguf -ngl 999 -c 32768 -fa

vLLM

vllm serve ibm-granite/granite-3.2-8b-instruct \
    --max-model-len 32768 \
    --enable-prefix-caching
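Once the server is up, vLLM exposes an OpenAI-compatible API at `/v1/chat/completions`. A minimal sketch of the request body (construction only; send it with any HTTP client, and note the sampling values follow the Granite recommendations given below):

```python
# Build an OpenAI-style chat-completions payload for the vLLM server above.
import json

payload = {
    "model": "ibm-granite/granite-3.2-8b-instruct",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize Apache 2.0 in one sentence."},
    ],
    "temperature": 0.5,
    "top_p": 0.95,
    "max_tokens": 256,
}

# POST this JSON to http://localhost:8000/v1/chat/completions
body = json.dumps(payload)
```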

Chat Template {#chat-template}

Granite uses its own format:

<|start_of_role|>system<|end_of_role|>{system}<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{user}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>

All major runtimes auto-load it from tokenizer_config.json.

Recommended sampling: temperature 0.5, min-p 0.05, top-p 0.95.
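To make the token layout concrete, here is a manual formatter that reproduces the template above. In practice you never write this by hand; runtimes load it from tokenizer_config.json:

```python
# Format a single-turn prompt using the Granite 3 chat template shown above.
def format_granite(system: str, user: str) -> str:
    return (
        f"<|start_of_role|>system<|end_of_role|>{system}<|end_of_text|>\n"
        f"<|start_of_role|>user<|end_of_role|>{user}<|end_of_text|>\n"
        f"<|start_of_role|>assistant<|end_of_role|>"
    )

prompt = format_granite("You are helpful.", "Hello")
print(prompt.endswith("<|start_of_role|>assistant<|end_of_role|>"))  # → True
```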


Granite Code Variants {#granite-code}

For coding assistants, use the Granite Code family (3B-34B):

ollama run granite-code:8b      # general coding
ollama run granite-code:20b     # higher quality
ollama run granite-code:34b     # max quality

Pair with Tabby for editor integration. HumanEval scores: 8B ~78%, 20B ~85%, 34B ~88%.


Granite Guardian Safety Models {#guardian}

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

guardian = AutoModelForSequenceClassification.from_pretrained(
    "ibm-granite/granite-guardian-3.2-2b"
).cuda().eval()
tok = AutoTokenizer.from_pretrained("ibm-granite/granite-guardian-3.2-2b")

def classify(text):
    # Minimal classifier for the default risk dimension; selecting another
    # dimension (see the list below) is configured per the model card.
    inputs = tok(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = guardian(**inputs).logits
    return logits.argmax(-1).item()  # 0=safe, 1=unsafe

Risk dimensions: harm, social_bias, profanity, sexual_content, unethical_behavior, jailbreak, function_call_safety, groundedness (RAG), answer_relevance.

For RAG groundedness specifically (does the answer match the retrieved chunks?), Granite Guardian is currently the best open-source classifier. See prompt injection defense.


Function Calling and JSON Mode {#tools}

Granite was trained for structured agentic output. OpenAI-compatible:

{
  "model": "granite3.2",
  "messages": [...],
  "tools": [{"type": "function", "function": {...}}],
  "tool_choice": "auto"
}

JSON mode via response_format works in vLLM with xgrammar. Granite 3.2 is one of the more reliable open-source models for structured output — it rarely produces malformed JSON when given a schema.
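A sketch of the round trip, assuming the OpenAI-compatible format shown above (the `get_weather` tool is hypothetical, purely for illustration):

```python
import json

# Hypothetical tool schema, OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "granite3.2",
    "messages": [{"role": "user", "content": "Weather in Berlin?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# A tool-call response carries the function name plus JSON-encoded arguments:
tool_call = {"name": "get_weather", "arguments": json.dumps({"city": "Berlin"})}
args = json.loads(tool_call["arguments"])
print(args["city"])  # → Berlin
```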


Multilingual Support {#multilingual}

Strong: English, German, Spanish, French, Italian, Portuguese, Japanese, Chinese, Arabic, Czech, Dutch, Korean.

For mainland Chinese: Qwen 2.5 / 3 still wins. For most European business languages: Granite is a solid balanced choice.


Fine-Tuning {#fine-tuning}

QLoRA via Unsloth:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/granite-3.2-8b-instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

The Apache 2.0 license means fine-tuned derivatives can be used commercially and redistributed under your own terms (subject to Apache 2.0's attribution and notice requirements). For a watsonx-style enterprise fine-tuning workflow, use IBM's InstructLab framework.
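Before training, each sample must be rendered into Granite's chat format. A minimal mapping sketch (the `instruction`/`response` field names are assumptions about your dataset schema; the template tokens follow the Chat Template section earlier):

```python
# Map an instruction/response pair into Granite's chat format for SFT.
def to_granite_sft(example: dict) -> dict:
    text = (
        "<|start_of_role|>user<|end_of_role|>"
        f"{example['instruction']}<|end_of_text|>\n"
        "<|start_of_role|>assistant<|end_of_role|>"
        f"{example['response']}<|end_of_text|>"
    )
    return {"text": text}

row = to_granite_sft({"instruction": "Say hi", "response": "Hi!"})
print(row["text"].count("<|end_of_text|>"))  # → 2
```

Apply it with `dataset.map(to_granite_sft)` before handing the dataset to your trainer.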


watsonx Integration {#watsonx}

IBM's watsonx.ai is the paid managed platform. The same Granite weights you self-host also run on watsonx, with:

  • SOC2 / HIPAA compliance
  • Audit logging and lineage tracking
  • watsonx.governance for AI policy
  • RBAC, key management, networking
  • Integration with IBM Cloud + on-premises

For "develop locally with Granite, deploy in regulated production via watsonx," the model weights are identical and migration is seamless.


Real Benchmarks {#benchmarks}

RTX 4090, Granite 3.2 8B Q5_K_M, single user:

| Metric | Value |
|---|---|
| Tok/s (batch=1) | ~95 |
| TTFT (8K context) | ~95 ms |
| Tok/s (vLLM, 16 concurrent) | ~1,800 aggregate |
| Memory (Q5) | ~6 GB |
| RAG groundedness MAP | 0.89 |

Production Deployment {#production}

# vLLM with AWQ for max throughput
vllm serve ibm-granite/granite-3.2-8b-instruct-AWQ \
    --quantization awq \
    --max-model-len 32768 \
    --enable-prefix-caching --enable-chunked-prefill \
    --max-num-seqs 128 \
    --kv-cache-dtype fp8_e4m3

Pair with Granite Guardian behind a LiteLLM gateway for compliance-grade safety filtering.
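The gateway pattern reduces to: classify the input, generate only if it is safe, then classify the output. A control-flow sketch with stand-in callables (`classify` would be Granite Guardian, `generate` your model gateway; both stubs here are illustrative):

```python
# Compliance filter in front of the model: Guardian on input and output.
def guarded_chat(prompt: str, classify, generate) -> str:
    if classify(prompt) == 1:          # 1 = unsafe (Guardian convention above)
        return "Request blocked by safety filter."
    answer = generate(prompt)
    if classify(answer) == 1:
        return "Response withheld by safety filter."
    return answer

# Stub wiring to show the control flow:
print(guarded_chat("hello", classify=lambda t: 0, generate=lambda p: "hi"))  # → hi
```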


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing template | Use bartowski quants, which include the template |
| Tool calls inconsistent | Sampling too creative | Drop temperature to 0.3 |
| OOM | Context too large | Lower --max-model-len |
| Granite Guardian false positives | Thresholds too strict | Tune the classification threshold |



Sources: IBM Granite on Hugging Face | Granite 3 announcement | InstructLab | bartowski quants | Internal benchmarks RTX 4090.

Written by Pattanaik Ramswarup, creator of Local AI Master.