PaliGemma 3B by Google
Updated: March 16, 2026
Correction Notice (March 2026)
This page previously contained fabricated institutional case studies (MIT, Vatican, NASA), fabricated specialized domain benchmarks, and incorrectly showed PaliGemma as an Ollama chat model. PaliGemma is a vision-language transfer model — it requires image+text input and is designed for fine-tuning, not interactive chat. It is not available on Ollama.
Google's PaliGemma combines a SigLIP vision encoder with the Gemma 2B language model, creating a vision-language model designed as a fine-tuning base for tasks like image captioning, visual Q&A, OCR, and object detection.
A transfer model — meant to be fine-tuned for your specific task, not used as a general chatbot.
What Is PaliGemma?
Vision + Language
Processes images and text together as multimodal input
Transfer Model
Designed for fine-tuning, not zero-shot chat
3 Resolutions
224px, 448px, 896px — tradeoff speed vs detail
Gemma License
Gemma Terms of Use (free, but not Apache 2.0)
PaliGemma (May 2024) is part of Google's PaLI (Pathways Language and Image) research line. It pairs a SigLIP ViT-So400m/14 vision encoder with the Gemma 2B language decoder. The “3B” refers to total parameter count (~3 billion). It was pre-trained on Google's WebLI (Web Language-Image) dataset, then released in “pt” (pretrained) and “mix” (multi-task fine-tuned) variants.
Paper: arXiv:2407.07726 (PaliGemma: A versatile 3B VLM for transfer)
Architecture: SigLIP + Gemma
PaliGemma is a two-part model: a frozen or partially-frozen vision encoder that converts images into token embeddings, and a language decoder that processes those visual tokens alongside text tokens to generate output.
Vision Encoder: SigLIP ViT-So400m/14
- 400M-parameter Vision Transformer
- Trained with sigmoid loss (SigLIP) instead of softmax (CLIP)
- Patch size 14: images are split into 14x14px patches
- 224px: 256 image tokens
- 448px: 1024 image tokens
- 896px: 4096 image tokens
Language Decoder: Gemma 2B
- 2B-parameter autoregressive decoder
- 18 transformer layers, 2048 hidden dim
- 256K vocabulary (SentencePiece)
- Receives visual tokens as a prefix
- Generates text output conditioned on the image
- RoPE positional encoding
How It Works:
Image → SigLIP encodes to visual tokens → visual tokens + text prompt concatenated → Gemma 2B decoder generates response autoregressively. The key insight: visual tokens are treated as a “soft prefix” that the language model conditions on, similar to how prefix-tuning works.
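The length of that visual prefix follows directly from the patch size: a square image of side `res` is cut into `(res / 14)²` patches, and each patch becomes one token. A quick sanity check of the token counts quoted above:

```python
def image_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of visual tokens SigLIP produces for a square image:
    one token per (patch_size x patch_size) pixel patch."""
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2

for res in (224, 448, 896):
    print(res, "->", image_token_count(res))
# 224 -> 256, 448 -> 1024, 896 -> 4096
```

This is why doubling the resolution quadruples the visual prefix length, and with it the prefill cost.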
Model Variants & Resolutions
| Variant | HuggingFace ID | Resolution | Image Tokens | Use Case |
|---|---|---|---|---|
| PT-224 (pretrained) | google/paligemma-3b-pt-224 | 224x224 | 256 | Fine-tuning base (fastest) |
| PT-448 | google/paligemma-3b-pt-448 | 448x448 | 1024 | Fine-tuning (balanced) |
| PT-896 | google/paligemma-3b-pt-896 | 896x896 | 4096 | Fine-detail tasks (OCR, small text) |
| Mix-448 (multi-task) | google/paligemma-3b-mix-448 | 448x448 | 1024 | Ready-to-use (multi-task fine-tuned) |
PT = pretrained (needs fine-tuning for best results). Mix = fine-tuned on a mixture of vision-language tasks (can be used directly). Higher resolution = more detail but slower inference and higher VRAM.
Real Benchmarks
PaliGemma is a transfer model — benchmark scores depend heavily on resolution and fine-tuning. The “mix” variant scores reflect multi-task fine-tuning, not zero-shot ability.
| Benchmark | PaliGemma 3B (mix-448) | Notes |
|---|---|---|
| MMVP (Multimodal Visual Patterns) | 46.0% | Mix-448 variant |
| TextVQA | ~73-78% | Varies by resolution (higher at 896px) |
| AI2D (Diagram Understanding) | ~72% | After fine-tuning on AI2D |
| COCO Captions (CIDEr) | ~140+ | After fine-tuning on COCO |
| RefCOCO (Object Localization) | ~90% | After fine-tuning; outputs bounding boxes |
Source: arXiv:2407.07726 and HuggingFace model card. Scores are approximate and vary by training configuration. PaliGemma shines when fine-tuned for a specific task — not as a general-purpose VLM.
Hardware Requirements
| Configuration | VRAM (Inference) | VRAM (Fine-Tuning with LoRA) | Notes |
|---|---|---|---|
| FP16 (224px) | ~6 GB | ~10 GB | Smallest variant, fastest |
| FP16 (448px) | ~8 GB | ~14 GB | Best balance |
| FP16 (896px) | ~16 GB | ~24 GB+ | High-detail, needs beefy GPU |
| 4-bit quantized (448px) | ~3 GB | ~6 GB | bitsandbytes / GPTQ |
Higher resolution = more image tokens = more VRAM. The 896px variant produces 4096 image tokens, requiring significantly more memory. For most fine-tuning tasks, 448px with LoRA is the sweet spot.
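The table's figures include activations and framework overhead, but a back-of-envelope check on the weights alone shows why the numbers land where they do. A minimal sketch (weights only; the KV cache and activations grow with image token count and are not modeled here):

```python
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30

print(round(weight_gib(3, 2), 1))    # FP16 (2 bytes/param): ~5.6 GiB
print(round(weight_gib(3, 0.5), 1))  # 4-bit (0.5 bytes/param): ~1.4 GiB
```

The gap between these floors and the table's totals is activations, image token buffers, and runtime overhead.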
Setup with HuggingFace Transformers
Not Available on Ollama
PaliGemma is a vision-language model that requires image input. It is not available on Ollama and cannot be used as a text-only chat model. Use HuggingFace Transformers or Google's big_vision library instead.
Basic Inference (Mix variant)
The “mix” variant is pre-fine-tuned on multiple tasks and can be used directly.
```bash
pip install transformers torch pillow
```

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import torch

model_id = "google/paligemma-3b-mix-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load an image
image = Image.open("photo.jpg")

# Ask a question about the image
prompt = "What objects are in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
result = processor.decode(output[0], skip_special_tokens=True)
print(result)
```

Note: You may need to accept the Gemma license on HuggingFace and run huggingface-cli login before downloading.
Task-Specific Prompt Formats
PaliGemma uses specific prompt prefixes for different tasks:
| Task | Prompt Format | Output |
|---|---|---|
| Image Captioning | "caption en" | Natural language description |
| Visual Q&A | "What color is the car?" | Short answer |
| Object Detection | "detect cat" | Bounding box coordinates |
| OCR | "ocr" | Extracted text from image |
| Segmentation | "segment cat" | Segmentation mask tokens |
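For detection, PaliGemma emits location tokens of the form `<locXXXX>`, four per bounding box, with values normalized to a 0-1023 grid in (y_min, x_min, y_max, x_max) order per the model card. A minimal parser sketch, assuming that output format (the example string and image size are illustrative):

```python
import re

LOC = re.compile(r"<loc(\d{4})>")

def parse_detection(output: str, width: int, height: int):
    """Parse 'detect' output such as '<loc0256><loc0128><loc0768><loc0896> cat'
    into pixel-space boxes. Assumes 4 loc tokens per box, values in [0, 1023],
    ordered (y_min, x_min, y_max, x_max) and normalized to 1024."""
    values = [int(v) for v in LOC.findall(output)]
    label = LOC.sub("", output).strip()
    boxes = []
    for i in range(0, len(values) - 3, 4):
        y0, x0, y1, x1 = values[i:i + 4]
        boxes.append((
            x0 / 1024 * width, y0 / 1024 * height,
            x1 / 1024 * width, y1 / 1024 * height,
        ))
    return label, boxes

label, boxes = parse_detection("<loc0256><loc0128><loc0768><loc0896> cat", 448, 448)
print(label, boxes)  # cat [(56.0, 112.0, 392.0, 336.0)]
```

Segmentation output follows a similar pattern with additional `<segXXX>` mask tokens, which need the model's mask codebook to decode.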
Fine-Tuning for Custom Tasks
PaliGemma's primary value is as a fine-tuning base. The “pt” (pretrained) variants are designed to be fine-tuned on your specific dataset for tasks like medical image classification, document understanding, satellite imagery analysis, etc.
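One reason LoRA fine-tuning fits in modest VRAM: each adapted weight matrix gains only two small low-rank factors, A (d_in x r) and B (r x d_out). A rough count using this page's numbers (18 layers, 2048 hidden dim, rank-16 adapters on q_proj and v_proj), treating both projections as 2048-to-2048 for simplicity (an approximation; v_proj is smaller under multi-query attention):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    factor A (d_in x r) plus factor B (r x d_out)."""
    return r * (d_in + d_out)

per_layer = lora_params(2048, 2048, 16) * 2  # q_proj + v_proj
total = per_layer * 18                       # 18 decoder layers
print(total)  # 2359296
```

Roughly 2.4M trainable parameters, under 0.1% of the 3B total, which is why the optimizer state stays small.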
LoRA Fine-Tuning Example:
```python
import torch
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-448",
    torch_dtype=torch.float16
)

# Apply LoRA to the language model's attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train on your image-text dataset...
```

Good Fine-Tuning Tasks
- Medical/scientific image classification
- Document OCR and form extraction
- Product image captioning for e-commerce
- Satellite/drone imagery analysis
- Custom object detection domains
Not Suited For
- General visual conversation (use LLaVA, Qwen-VL)
- Text-only tasks (use Gemma, Llama, Qwen)
- Video understanding
- Interactive chat applications
- Zero-shot complex reasoning about images
PaliGemma 2 (December 2024)
Google released PaliGemma 2 in December 2024, upgrading the language decoder from Gemma 2B to Gemma 2 (available in 3B, 10B, and 28B sizes). This significantly improves reasoning capabilities while keeping the same SigLIP vision encoder.
| Model | Vision Encoder | Language Decoder | Total Params | VRAM (FP16, 448px) |
|---|---|---|---|---|
| PaliGemma (v1) | SigLIP 400M | Gemma 2B | ~3B | ~8 GB |
| PaliGemma 2 (3B) | SigLIP 400M | Gemma 2 2B | ~3B | ~8 GB |
| PaliGemma 2 (10B) | SigLIP 400M | Gemma 2 9B | ~10B | ~22 GB |
| PaliGemma 2 (28B) | SigLIP 400M | Gemma 2 27B | ~28B | ~60 GB |
Recommendation: For new projects in 2026, start with PaliGemma 2 (10B) at 448px. It offers significantly better reasoning than PaliGemma v1 while remaining runnable on a single RTX 3090/4090. See PaliGemma 2 on HuggingFace.
Strengths & Limitations
Strengths
- Excellent fine-tuning base: Clean architecture designed for transfer learning across vision-language tasks
- Multiple resolutions: 224/448/896px lets you trade speed for detail depending on task
- Small enough for local: 3B params fits on consumer GPUs (6-8GB VRAM)
- Versatile output: Can generate text, bounding boxes, and segmentation masks
- Strong SigLIP encoder: The vision encoder is battle-tested and produces high-quality image representations
Limitations
- Not a chatbot: Cannot have visual conversations like LLaVA or Qwen-VL — it's a transfer model
- Requires fine-tuning: PT variants give mediocre results out-of-the-box; mix variant is usable but limited
- No Ollama support: Requires Python/HuggingFace setup — not as simple as one-line install
- Gemma 2B decoder is small: Limited reasoning compared to 7B+ language models (PaliGemma 2 fixes this)
- Gemma license restrictions: Not Apache 2.0 — review Gemma Terms of Use before commercial deployment
How PaliGemma Compares

| Model | Size | RAM Required | Speed | Quality | Cost/Month |
|---|---|---|---|---|---|
| PaliGemma 3B (v1) | 3B | 6-8GB | ~20 tok/s | 68% | Free |
| PaliGemma 2 10B | 10B | ~22GB | ~12 tok/s | 78% | Free |
| Qwen2-VL 7B | 7B | ~16GB | ~15 tok/s | 80% | Free |
| LLaVA 1.6 7B | 7B | ~14GB | ~18 tok/s | 72% | Free |
Quality scores are editorial estimates for vision-language tasks. Qwen2-VL and LLaVA are general visual chat models. PaliGemma excels when fine-tuned for specific tasks but is weaker as a general VLM.
Frequently Asked Questions
Can I use PaliGemma as a visual chatbot?
Not really. PaliGemma is designed as a transfer model for specific vision tasks, not interactive visual conversation. For visual chat, use Qwen2-VL, LLaVA, or InternVL instead. PaliGemma excels when fine-tuned for a specific task like medical image classification or document OCR.
Should I use PaliGemma v1 or PaliGemma 2?
For new projects, use PaliGemma 2. The 10B variant (Gemma 2 9B decoder) offers much better reasoning while the SigLIP vision encoder remains the same. PaliGemma v1 (this page) is only worth using if you need the absolute smallest model (3B with Gemma 2B decoder) and PaliGemma 2 3B doesn't meet your needs.
Is PaliGemma available on Ollama?
No. PaliGemma requires image+text multimodal input and is not in Ollama's model library. Use HuggingFace Transformers (PaliGemmaForConditionalGeneration) or Google's big_vision library. Installation requires Python and pip.
What resolution should I use?
448px is the best default. 224px is faster but loses fine detail. 896px captures small text and fine details but uses 4x more VRAM and is much slower (4096 image tokens vs 1024). Use 896px only for OCR-heavy tasks or when you need to read small text in images.
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.