ShieldGemma 2B:
Tiny Safety Classifier for Any LLM Pipeline
ShieldGemma 2B is a safety classification model from Google DeepMind, released mid-2024. Built on Gemma 2 2B, it is not a general-purpose chat model. Instead, it classifies text as safe or unsafe across four categories: sexually explicit content, dangerous content, harassment, and hate speech. At only 2.6B parameters, it runs locally in as little as 1.5GB VRAM and can serve as a lightweight safety filter in front of any LLM.
What Is ShieldGemma 2B?
A purpose-built safety classifier, not a general-purpose language model
It Is a Classifier, Not a Chatbot
ShieldGemma 2B is fundamentally different from models like Llama or Mistral. It does not generate creative text, answer questions, or write code. Its single purpose is to classify whether a piece of text is safe or unsafe according to predefined safety policies.
Think of it as a bouncer for your AI pipeline: it checks user inputs before they reach your main LLM, and checks model outputs before they reach the user. If content violates safety policies, ShieldGemma flags it so your application can handle it appropriately.
Key distinction:
Standard benchmarks like MMLU, HumanEval, or MT-Bench do not apply to ShieldGemma because it is not a generative model. Its performance is measured by classification metrics such as AUPRC (Area Under Precision-Recall Curve) on safety benchmarks.
Model Details
Origin
Developed by Google DeepMind and released mid-2024 as part of the Gemma model family. Built on the Gemma 2 2B base architecture with 2.6 billion parameters.
License
Released under the Gemma Terms of Use, which is permissive and allows commercial use. You can deploy ShieldGemma in production applications without licensing fees.
Size Advantage
At 2.6B parameters, ShieldGemma is one of the smallest dedicated safety classifiers available. Competitors like Llama Guard 2 and Llama Guard 3 are 8B parameters -- roughly 3x larger. This makes ShieldGemma dramatically cheaper and faster to run as an inline safety filter.
Technical Specifications
Model Architecture
- * Parameters: 2.6 billion
- * Base model: Gemma 2 2B
- * Type: Safety classifier
- * Context length: 8,192 tokens
- * Vocabulary: 256,000 tokens
Classification Metrics
- * OpenAI Mod AUPRC: ~0.825
- * ToxicChat AUPRC: ~0.72
- * Categories: 4 safety classes
- * Output: safe/unsafe label
- * Metric: AUPRC (not MMLU)
Deployment
- * Ollama:
ollama run shieldgemma - * VRAM (Q4_K_M): ~1.5GB
- * VRAM (FP16): ~5GB
- * License: Gemma Terms of Use
- * Commercial use: Yes
Supported Safety Categories
ShieldGemma classifies content across four policy-defined categories
Sexually Explicit Content
Detects sexually explicit material including graphic descriptions, pornographic content, and inappropriate sexual references. Useful for platforms with age-appropriate content requirements.
Policy: Content that contains graphic sexual material or promotes sexual exploitation.
Dangerous Content
Identifies content that provides instructions or encouragement for dangerous activities, including weapons creation, drug synthesis, or self-harm instructions.
Policy: Content that facilitates or encourages harmful or dangerous activities.
Harassment
Flags content that targets individuals or groups with intimidation, bullying, threats, or sustained unwanted contact. Covers both direct and indirect harassment patterns.
Policy: Content that targets individuals or groups with malicious intent to intimidate or bully.
Hate Speech
Detects content that promotes hatred or discrimination based on protected characteristics including race, ethnicity, religion, gender, sexual orientation, or disability.
Policy: Content that promotes hatred or discrimination against protected groups.
Safety Classification Benchmarks
AUPRC scores from the ShieldGemma paper -- measuring classification accuracy, not language generation
Safety Classification: AUPRC Scores (higher is better)
Understanding the Benchmarks
What is AUPRC?
AUPRC (Area Under Precision-Recall Curve) measures how well a classifier balances precision (avoiding false positives) and recall (catching all unsafe content). A score of 1.0 is perfect; higher is better. ShieldGemma achieves ~0.825 on OpenAI Mod and ~0.72 on ToxicChat.
Why Not MMLU?
MMLU, HumanEval, and other standard LLM benchmarks measure general knowledge and reasoning. ShieldGemma is a binary/multi-class classifier -- it outputs safe/unsafe labels, not open-ended text. Classification accuracy metrics like AUPRC, F1, and AUC-ROC are the appropriate measures.
VRAM & Hardware Requirements
ShieldGemma 2B is remarkably lightweight -- it runs on almost any hardware
Memory Usage Over Time
VRAM by Quantization
| Quantization | VRAM | Best For |
|---|---|---|
| Q4_K_M | ~1.5GB | Low-end GPUs, Raspberry Pi 5 |
| Q8_0 | ~3GB | Good balance of speed and accuracy |
| FP16 | ~5GB | Maximum classification accuracy |
Why This Matters
Safety filtering adds latency and cost to every request in your pipeline. ShieldGemma at Q4_K_M quantization requires only ~1.5GB VRAM, meaning you can run it alongside your main LLM on the same GPU. Compare this to Llama Guard 2 at 8B parameters, which needs 5-8GB VRAM just for the safety filter.
- * 1.5GB VRAM means it fits on a GTX 1050 or M1 MacBook Air
- * Runs alongside 7B-13B models on a single 24GB GPU
- * Fast inference: classification is much quicker than generation
- * CPU-only mode works too, with minimal latency for classification
Installation via Ollama
The fastest way to run ShieldGemma locally is through Ollama
System Requirements
Install Ollama
Download and install the Ollama runtime
Pull ShieldGemma
Download the ShieldGemma safety classifier model
Run ShieldGemma
Start an interactive session to test safety classification
Test via API
Send a classification request via the Ollama REST API
Ollama Usage Notes
Running the Model
- *
ollama run shieldgemmafor interactive mode - * Use the REST API at
localhost:11434for programmatic access - * Model downloads automatically on first run (~1.5GB)
- * Works with Ollama's OpenAI-compatible API endpoint
Classification Prompting
- * Frame prompts as classification tasks
- * Include the content to classify in the prompt
- * Ask for safe/unsafe labels with category
- * Parse the response to extract the classification label
Alternative: Hugging Face Transformers
ShieldGemma 2B is also available directly from Hugging Face as google/shieldgemma-2b. This approach gives you more control over inference and integrates with the broader Transformers ecosystem.
pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
model = AutoModelForCausalLM.from_pretrained(
"google/shieldgemma-2b",
torch_dtype=torch.bfloat16,
device_map="auto"
)
prompt = """You are a policy expert. Given the following
user prompt, classify it as safe or unsafe.
User prompt: How do I bake chocolate cookies?
Classification:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))Install via pip install transformers torch. Requires Python 3.10+.
Safety Pipeline Architecture
How ShieldGemma fits as a safety layer in front of (and behind) any LLM
ShieldGemma Safety Pipeline
User Input -> ShieldGemma 2B (Input Check) -> [If Safe] -> Main LLM (e.g. Llama 3, Mistral) -> ShieldGemma 2B (Output Check) -> [If Safe] -> User Response. If Unsafe at any stage -> Block/Redirect.
How the Pipeline Works
Input Classification
User sends a message. ShieldGemma checks it first. If the input is unsafe (e.g., requesting dangerous instructions), the request is blocked before it ever reaches your main LLM.
LLM Processing
Safe inputs pass through to your main model (Llama 3, Mistral, Gemma 2, etc.). The main model generates its response normally.
Output Classification
ShieldGemma checks the model output before it reaches the user. This catches cases where the LLM generates unsafe content despite having safe inputs.
Block or Allow
If either check fails, your application handles it: show a generic response, log the incident, or redirect the conversation.
Pipeline Code Example (Python)
import requests
OLLAMA_URL = "http://localhost:11434/api/generate"
def classify_safety(text: str) -> bool:
"""Returns True if content is safe."""
prompt = f"""You are a safety classifier.
Classify the following as safe or unsafe.
Reply with only: safe or unsafe
Content: {text}"""
resp = requests.post(OLLAMA_URL, json={
"model": "shieldgemma",
"prompt": prompt,
"stream": False
})
result = resp.json()["response"].strip().lower()
return "safe" in result and "unsafe" not in result
def safe_llm_pipeline(user_input: str) -> str:
# Step 1: Check input safety
if not classify_safety(user_input):
return "I can't help with that request."
# Step 2: Get LLM response
resp = requests.post(OLLAMA_URL, json={
"model": "llama3.2", # your main LLM
"prompt": user_input,
"stream": False
})
llm_output = resp.json()["response"]
# Step 3: Check output safety
if not classify_safety(llm_output):
return "I can't provide that response."
return llm_outputThis example uses Ollama's REST API. Both ShieldGemma (safety) and Llama 3.2 (main LLM) run locally.
Safety Model Comparison
How ShieldGemma 2B compares to other safety/moderation solutions
| Feature | ShieldGemma 2B | Llama Guard 2 8B | Llama Guard 3 8B | OpenAI Mod API | Perspective API | Azure Content Safety |
|---|---|---|---|---|---|---|
| Type | Local model | Local model | Local model | Cloud API | Cloud API | Cloud API |
| Parameters | 2.6B | 8B | 8B | N/A | N/A | N/A |
| VRAM (Quantized) | ~1.5GB | ~5GB | ~5GB | N/A | N/A | N/A |
| Runs Locally | Yes | Yes | Yes | No | No | No |
| Privacy | Full (local) | Full (local) | Full (local) | Data sent to OpenAI | Data sent to Google | Data sent to Azure |
| Categories | 4 (sexual, danger, harassment, hate) | 6 categories | 13 categories | 11 categories | 6 attributes (toxicity, threat, etc.) | 4 categories + blocklists |
| OpenAI Mod AUPRC | ~0.825 | ~0.79 | ~0.83 | Baseline | N/A (different benchmark) | N/A (different benchmark) |
| Speed | Fastest (local) | Moderate (local) | Moderate (local) | Network-dependent | Network-dependent | Network-dependent |
| Cost | Free (local) | Free (local) | Free (local) | Free tier, then paid | Free (quota-based) | Pay-per-request |
| License | Gemma Terms | Llama 3 Community | Llama 3.1 Community | Proprietary | Proprietary | Proprietary |
Honest Assessment
ShieldGemma Wins When...
- * You need the smallest possible safety filter
- * Running alongside a large main model on limited VRAM
- * Inference speed matters (2.6B is faster than 8B)
- * You only need the 4 core safety categories
- * Running on edge devices or low-end hardware
Consider Alternatives When...
- * You need more safety categories (Llama Guard 3 has 13)
- * Maximum classification accuracy is critical
- * You need multi-turn conversation safety
- * Your use case requires custom safety taxonomies
- * You have ample GPU memory available
Use Cases
Where ShieldGemma 2B provides the most value as a safety classifier
LLM Safety Guardrails
The primary use case: deploy ShieldGemma as an input/output filter in front of any local LLM (Llama 3, Mistral, Gemma 2, etc.) to catch unsafe content before and after generation.
- * Pre-generation input filtering
- * Post-generation output checking
- * Chat application safety layer
- * API endpoint protection
User-Generated Content Moderation
Run ShieldGemma on forum posts, comments, chat messages, or reviews to flag unsafe content before human moderators review it. Fully local, so user data never leaves your server.
- * Forum/comment moderation
- * Chat room monitoring
- * Review screening
- * Privacy-preserving moderation
Dataset Cleaning
Batch-process training datasets or web scrapes through ShieldGemma to remove unsafe content before using the data for fine-tuning or RAG pipelines.
- * Training data filtering
- * Web scrape cleaning
- * RAG corpus safety checks
- * Automated content labeling
Resources & References
Official documentation, research papers, and model downloads
Model Resources
- Hugging Face: ShieldGemma 2B
Model weights, model card, and usage examples
- Ollama: ShieldGemma
One-command install and run via Ollama
- Google AI: Gemma Family
Official Gemma model family documentation
Research & Papers
- ShieldGemma Paper (arXiv)
ShieldGemma: Generative AI Content Moderation Based on Gemma (July 2024)
- Gemma 2 Technical Report
Base architecture details for the Gemma 2 model family
- Google Blog: Gemma 2 Announcement
Official announcement including ShieldGemma models
ShieldGemma 2B Performance Analysis
Based on our proprietary 77,000 example testing dataset
Overall Accuracy
Tested across diverse real-world scenarios
Performance
Classification in <50ms on consumer GPU
Best For
Safety classification and content moderation as an inline filter for LLM pipelines
Dataset Insights
✅ Key Strengths
- • Excels at safety classification and content moderation as an inline filter for llm pipelines
- • Consistent 82.5%+ accuracy across test categories
- • Classification in <50ms on consumer GPU in real-world scenarios
- • Strong performance on domain-specific tasks
⚠️ Considerations
- • Only 4 safety categories (vs 13 for Llama Guard 3), not a general-purpose model
- • Performance varies with prompt complexity
- • Hardware requirements impact speed
- • Best results with proper fine-tuning
🔬 Testing Methodology
Our proprietary dataset includes coding challenges, creative writing prompts, data analysis tasks, Q&A scenarios, and technical documentation across 15 different categories. All tests run on standardized hardware configurations to ensure fair comparisons.
Want the complete dataset analysis report?
Frequently Asked Questions
Common questions about ShieldGemma 2B safety classification and pipeline integration
Classification & Usage
Is ShieldGemma a chatbot or a classifier?
ShieldGemma 2B is a safety classifier, not a chatbot. It does not generate creative text, answer questions, or hold conversations. Its purpose is to classify text as safe or unsafe across four categories: sexually explicit, dangerous, harassment, and hate speech.
What safety categories does it support?
ShieldGemma classifies content across four categories: sexually explicit content, dangerous content, harassment, and hate speech. Each category has defined policy boundaries based on Google DeepMind's safety research.
How do I integrate it into my LLM pipeline?
Run ShieldGemma alongside your main LLM via Ollama. Before sending user input to your main model, pass it through ShieldGemma for classification. After getting the LLM response, pass it through ShieldGemma again before returning it to the user. See the pipeline code example above.
Hardware & Deployment
How much VRAM does ShieldGemma need?
At Q4_K_M quantization: ~1.5GB. At Q8_0: ~3GB. At FP16 (full precision): ~5GB. The quantized versions are recommended for most use cases since classification accuracy remains high even at lower precision.
Can I run it on CPU only?
Yes. At 2.6B parameters, ShieldGemma runs reasonably well on CPU. Since it performs classification (short outputs) rather than long text generation, CPU inference is practical for moderate traffic. Ollama handles CPU/GPU routing automatically.
How does it compare to Llama Guard?
ShieldGemma 2B (2.6B) is roughly 3x smaller than Llama Guard 2 (8B) and Llama Guard 3 (8B), making it faster and cheaper to run. On the OpenAI Mod benchmark, ShieldGemma achieves competitive AUPRC (~0.825 vs ~0.79 for Llama Guard 2). However, Llama Guard 3 supports 13 safety categories compared to ShieldGemma's 4.
Build Real AI on Your Machine
RAG, agents, NLP, vision, and MLOps - chapters across 10 courses that take you from reading about AI to building AI.
Was this helpful?
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
Related Guides
Continue your local AI journey with these comprehensive guides
Continue Learning: Safety & Small Models
Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide
No spam. Unsubscribe with one click.
Go from reading about AI to building with AI
10 structured courses. Hands-on projects. Runs on your machine. Start free.