IBM Granite 3 Local Setup Guide (2026): Enterprise-Grade Open Models
IBM Granite 3 is the enterprise-grade open-weights model family for organizations that care about Apache 2.0 licensing, audited training data, and built-in safety. Sizes range from 1B (mobile) to 8B (production), plus MoE variants for sparse-activation efficiency. Strong on code (Granite Code, 3B-34B), structured output, and instruction following. Pairs with Granite Guardian for input/output safety classification. For regulated industries — finance, healthcare, government — Granite is the cleanest open-source choice in 2026.
This guide covers the Granite 3 family, setup across runtimes, the Granite Guardian safety stack, Granite Code for coding assistants, fine-tuning workflows, watsonx integration, and benchmarks vs Llama 3.1 / Mistral / Phi-4.
Table of Contents
- What Granite 3 Is
- The Granite Family
- Apache 2.0 + Audited Training Data
- Hardware Requirements
- Granite vs Llama / Mistral / Phi-4
- Ollama Setup
- llama.cpp / vLLM Setup
- Chat Template
- Granite Code Variants
- Granite Guardian Safety Models
- Function Calling and JSON Mode
- Multilingual Support
- Fine-Tuning
- watsonx Integration
- Real Benchmarks
- Production Deployment
- Troubleshooting
What Granite 3 Is {#what-it-is}
IBM Granite 3 (ibm-granite/* on HuggingFace) is IBM's flagship open-weights LLM family. Released October 2024 (3.0), updated through 2025-2026 (3.1, 3.2, 3.3). Apache 2.0 license. 12-language coverage. 128K context (3.2+).
Key thesis: enterprise-grade quality with auditable data sources, strong on structured/agentic workloads, and a complete supporting stack (Guardian, Code, Embeddings).
The Granite Family {#family}
| Variant | Use | VRAM (BF16/Q4) |
|---|---|---|
| Granite 3.2 8B Instruct | General default | 16 GB / 5 GB |
| Granite 3.2 2B Instruct | Edge / mobile | 4 GB / 1.5 GB |
| Granite 3.2 1B Instruct | Tiny edge | 2 GB / 0.7 GB |
| Granite 3.0 1B-A400M MoE | Sparse 400M activated | 4 GB / 1.5 GB |
| Granite 3.0 3B-A800M MoE | Sparse 800M activated | 6 GB / 2 GB |
| Granite Code 3B/8B/20B/34B | Coding | varies |
| Granite Embedding 30M / 125M / 278M | RAG | <1 GB |
| Granite Guardian 2B / 8B | Safety classification | 4 / 16 GB |
For most production deployments: Granite 3.2 8B Instruct, with Granite Guardian 2B as the input/output filter and Granite Embedding 278M for RAG.
Apache 2.0 + Audited Training Data {#license}
Apache 2.0 license — same commercial-friendly terms as Mistral Small 3. Plus IBM publishes Data Cards documenting:
- All training data sources
- Filtering and deduplication processes
- Bias detection and mitigation
- IP/copyright due diligence
For regulated industries this provenance matters — IBM provides legal cover that Llama / Qwen / DeepSeek do not.
Hardware Requirements {#requirements}
| GPU | Granite 3.2 8B Quant | Tok/s |
|---|---|---|
| RTX 3060 12 GB | Q5_K_M | ~95 |
| RTX 4070 16 GB | Q6_K | ~110 |
| RTX 4090 24 GB | Q8_0 | ~140 |
| Mac M3 Pro | Q5_K_M | ~45 |
| RX 7900 XTX | Q5_K_M | ~85 |
Granite 8B is faster than Mistral Small 3 24B and Phi-4 14B because it's smaller — useful for high-throughput agent/RAG workloads.
Granite vs Llama / Mistral / Phi-4 {#comparison}
| Benchmark | Granite 3.2 8B | Llama 3.1 8B | Mistral Small 3 24B | Phi-4 14B |
|---|---|---|---|---|
| MMLU | 75.8 | 73.0 | 81.0 | 84.8 |
| HumanEval | 79.9 | 72.6 | 84.8 | 82.6 |
| GSM8K | 88.7 | 84.5 | 90.0 | 95.6 |
| IFEval (instruction) | 80.5 | 76.6 | 83.5 | 79.4 |
| Function calling | 87 | 80 | 89 | 76 |
| RAG groundedness | 89 | 78 | 84 | 75 |
Granite leads on RAG grounding and structured tasks. For pure reasoning, Phi-4. For pure chat, Mistral / Llama.
Ollama Setup {#ollama}
```bash
ollama run granite3.2
ollama run granite3.2:2b    # smaller variant
ollama run granite3.2:8b
ollama run granite-code:8b  # coding variant
```
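Once a model is pulled, you can hit Ollama's local REST API directly. A minimal sketch, assuming Ollama's default port 11434 and the requests library:

```python
import requests

# Chat request against the local Ollama server (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "granite3.2",
        "messages": [{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}],
        "stream": False,
        "options": {"temperature": 0.5},
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```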
llama.cpp / vLLM Setup {#runtimes}
llama.cpp
```bash
huggingface-cli download bartowski/granite-3.2-8b-instruct-GGUF \
  granite-3.2-8b-instruct-Q5_K_M.gguf --local-dir ./models

./llama-cli -m models/granite-3.2-8b-instruct-Q5_K_M.gguf -ngl 999 -c 32768 -fa
```
vLLM
```bash
vllm serve ibm-granite/granite-3.2-8b-instruct \
  --max-model-len 32768 \
  --enable-prefix-caching
```
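vLLM exposes an OpenAI-compatible endpoint (port 8000 by default), so the standard openai client works unchanged. A minimal sketch:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="ibm-granite/granite-3.2-8b-instruct",
    messages=[{"role": "user", "content": "What is Granite Guardian used for?"}],
    temperature=0.5,
)
print(resp.choices[0].message.content)
```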
Chat Template {#chat-template}
Granite uses its own format:
```
<|start_of_role|>system<|end_of_role|>{system}<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{user}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
```
All major runtimes auto-load it from tokenizer_config.json.
Recommended sampling: temperature 0.5, min-p 0.05, top-p 0.95.
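If you suspect a runtime is applying the wrong template, you can render it yourself with transformers and eyeball the role markers. A quick check, assuming the HF instruct repo:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
# tokenize=False returns the raw prompt string instead of token IDs
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should contain <|start_of_role|> / <|end_of_role|> markers
```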
Granite Code Variants {#granite-code}
For coding assistants, use the Granite Code family:
```bash
ollama run granite-code:8b   # general coding
ollama run granite-code:20b  # higher quality
ollama run granite-code:34b  # max quality
```
Pair with Tabby for editor integration. HumanEval scores: 8B ~78%, 20B ~85%, 34B ~88%.
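For a quick smoke test without an editor plugin, you can prompt granite-code through Ollama's generate endpoint. A minimal sketch, with model tag and port as in the Ollama section above:

```python
import requests

# One-shot code generation request against granite-code
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "granite-code:8b",
        "prompt": "Write a Python function that validates an ISO 8601 date string.",
        "stream": False,
        "options": {"temperature": 0.2},  # keep sampling low for code
    },
    timeout=300,
)
print(resp.json()["response"])
```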
Granite Guardian Safety Models {#guardian}
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load Guardian 2B as a binary safety classifier (fits in ~4 GB VRAM)
guardian = AutoModelForSequenceClassification.from_pretrained(
    "ibm-granite/granite-guardian-3.2-2b"
).cuda()
tok = AutoTokenizer.from_pretrained("ibm-granite/granite-guardian-3.2-2b")

def classify(text):
    """Return 0 (safe) or 1 (unsafe) for the given text."""
    inputs = tok(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = guardian(**inputs).logits
    return logits.argmax(-1).item()  # 0=safe, 1=unsafe
```
Risk dimensions: harm, social_bias, profanity, sexual_content, unethical_behavior, jailbreak, function_call_safety, groundedness (RAG), answer_relevance.
For RAG groundedness specifically (does the answer match the retrieved chunks?), Granite Guardian is currently the best open-source classifier. See also our guide on prompt injection defense.
Function Calling and JSON Mode {#tools}
Granite was trained for structured agentic output. OpenAI-compatible:
```json
{
  "model": "granite3.2",
  "messages": [...],
  "tools": [{"type": "function", "function": {...}}],
  "tool_choice": "auto"
}
```
JSON mode via response_format works in vLLM with xgrammar. Granite 3.2 is one of the more reliable open-source models for structured output — it rarely produces malformed JSON when given a schema.
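A hedged sketch of schema-constrained output through vLLM's OpenAI-compatible server — the invoice schema is made up for illustration, and guided_json is vLLM's structured-output parameter:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical extraction schema, for illustration only
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "total"],
}

resp = client.chat.completions.create(
    model="ibm-granite/granite-3.2-8b-instruct",
    messages=[{"role": "user", "content": "Extract vendor and total: 'Acme Corp invoice, $1,240.50 due'"}],
    extra_body={"guided_json": schema},  # vLLM-specific: constrain decoding to the schema
)
print(resp.choices[0].message.content)  # e.g. {"vendor": "Acme Corp", "total": 1240.5}
```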
Multilingual Support {#multilingual}
Strong: English, German, Spanish, French, Italian, Portuguese, Japanese, Chinese, Arabic, Czech, Dutch, Korean.
For mainland Chinese: Qwen 2.5 / 3 still wins. For most European business languages: Granite is a solid balanced choice.
Fine-Tuning {#fine-tuning}
QLoRA via Unsloth:
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/granite-3.2-8b-instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
```
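The snippet above only loads the base model in 4-bit. A sketch of the rest of the flow with TRL's SFTTrainer — the LoRA rank, hyperparameters, and dataset are placeholder starting points, and kwarg names shift between TRL versions:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

# Attach LoRA adapters to the quantized base model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,      # placeholder: a HF Dataset with a "text" column
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        output_dir="granite-qlora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()
```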
The Apache 2.0 license means fine-tuned derivatives can be redistributed, including commercially and under a different license. For a watsonx-style enterprise fine-tuning workflow, use IBM's InstructLab framework.
watsonx Integration {#watsonx}
IBM's watsonx.ai is the paid managed platform. The same Granite weights you self-host also run on watsonx, with:
- SOC2 / HIPAA compliance
- Audit logging and lineage tracking
- watsonx.governance for AI policy
- RBAC, key management, networking
- Integration with IBM Cloud + on-premises
For "develop locally with Granite, deploy in regulated production via watsonx," the model weights are identical and migration is seamless.
Real Benchmarks {#benchmarks}
RTX 4090, Granite 3.2 8B Q5_K_M, single user:
| Metric | Value |
|---|---|
| Tok/s (batch=1) | ~95 |
| TTFT (8K context) | ~95 ms |
| Tok/s (vLLM, 16 concurrent) | ~1,800 aggregate |
| Memory (Q5) | ~6 GB |
| RAG groundedness MAP | 0.89 |
Production Deployment {#production}
```bash
# vLLM with AWQ for max throughput
vllm serve ibm-granite/granite-3.2-8b-instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --enable-prefix-caching --enable-chunked-prefill \
  --max-num-seqs 128 \
  --kv-cache-dtype fp8_e4m3
```
Pair with Granite Guardian behind a LiteLLM gateway for compliance-grade safety filtering.
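A minimal sketch of that filter pattern in application code, reusing the classify() helper from the Guardian section and an OpenAI client pointed at the vLLM server (the gateway wiring itself is elided):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def guarded_chat(user_msg: str) -> str:
    # Input filter: block unsafe prompts before they reach the model
    if classify(user_msg) == 1:
        return "Request blocked by input safety filter."

    answer = client.chat.completions.create(
        model="ibm-granite/granite-3.2-8b-instruct-AWQ",
        messages=[{"role": "user", "content": user_msg}],
        temperature=0.5,
    ).choices[0].message.content

    # Output filter: classify the completion before returning it
    if classify(answer) == 1:
        return "Response blocked by output safety filter."
    return answer
```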
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing template | Use bartowski quants which include template |
| Tool calls inconsistent | Sampling too creative | Drop temperature to 0.3 |
| OOM | Big context | Lower max-model-len |
| Granite Guardian false positives | Too strict | Tune classification threshold |
Sources: IBM Granite on Hugging Face | Granite 3 announcement | InstructLab | bartowski quants | internal benchmarks (RTX 4090).