🤖AI MODEL GUIDE

PaliGemma 3B by Google

Updated: March 16, 2026

Correction Notice (March 2026)

This page previously contained fabricated institutional case studies (MIT, Vatican, NASA), fabricated specialized domain benchmarks, and incorrectly showed PaliGemma as an Ollama chat model. PaliGemma is a vision-language transfer model — it requires image+text input and is designed for fine-tuning, not interactive chat. It is not available on Ollama.

Google's PaliGemma pairs a SigLIP vision encoder with the Gemma 2B language model to create a vision-language model designed as a fine-tuning base for tasks like image captioning, visual Q&A, OCR, and object detection.

A transfer model — meant to be fine-tuned for your specific task, not used as a general chatbot.

What Is PaliGemma?

👁️

Vision + Language

Processes images and text together as multimodal input

🔧

Transfer Model

Designed for fine-tuning, not zero-shot chat

📐

3 Resolutions

224px, 448px, 896px — tradeoff speed vs detail

📜

Gemma License

Gemma Terms of Use (free, but not Apache 2.0)

PaliGemma (May 2024) is part of Google's PaLI (Pathways Language and Image) research line. It pairs a SigLIP ViT-So400m/14 vision encoder with the Gemma 2B language decoder. The “3B” refers to total parameter count (~3 billion). It was pre-trained on Google's WebLI (Web Language-Image) dataset, then released in “pt” (pretrained) and “mix” (multi-task fine-tuned) variants.

Paper: arXiv:2407.07726 (PaliGemma: A versatile 3B VLM for transfer)

Architecture: SigLIP + Gemma

PaliGemma is a two-part model: a frozen or partially-frozen vision encoder that converts images into token embeddings, and a language decoder that processes those visual tokens alongside text tokens to generate output.

Vision Encoder: SigLIP ViT-So400m/14

  • 400M parameter Vision Transformer
  • Trained with sigmoid loss (SigLIP) instead of softmax (CLIP)
  • Patch size 14 — image split into 14x14px patches
  • 224px: 256 image tokens
  • 448px: 1024 image tokens
  • 896px: 4096 image tokens

Language Decoder: Gemma 2B

  • 2B parameter autoregressive decoder
  • 18 transformer layers, 2048 hidden dim
  • 256K vocabulary (SentencePiece)
  • Receives visual tokens as prefix
  • Generates text output conditioned on image
  • RoPE positional encoding

How It Works:

Image → SigLIP encodes to visual tokens → visual tokens + text prompt concatenated → Gemma 2B decoder generates response autoregressively. The key insight: visual tokens are treated as a “soft prefix” that the language model conditions on, similar to how prefix-tuning works.
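The resolution-to-token relationship follows directly from the 14-pixel patch size: each 14x14 patch becomes one visual token, so token count is the squared number of patches per side. A quick sketch:

```python
def image_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of visual tokens SigLIP produces for a square image:
    (patches per side) squared."""
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2

for res in (224, 448, 896):
    print(res, image_token_count(res))
# 224 → 256, 448 → 1024, 896 → 4096
```

This is why doubling the resolution quadruples the visual-prefix length the Gemma decoder must attend over.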

Model Variants & Resolutions

| Variant | HuggingFace ID | Resolution | Image Tokens | Use Case |
|---|---|---|---|---|
| PT-224 (pretrained) | google/paligemma-3b-pt-224 | 224x224 | 256 | Fine-tuning base (fastest) |
| PT-448 | google/paligemma-3b-pt-448 | 448x448 | 1024 | Fine-tuning (balanced) |
| PT-896 | google/paligemma-3b-pt-896 | 896x896 | 4096 | Fine-detail tasks (OCR, small text) |
| Mix-448 (multi-task) | google/paligemma-3b-mix-448 | 448x448 | 1024 | Ready-to-use (multi-task fine-tuned) |

PT = pretrained (needs fine-tuning for best results). Mix = fine-tuned on a mixture of vision-language tasks (can be used directly). Higher resolution = more detail but slower inference and higher VRAM.

Real Benchmarks

PaliGemma is a transfer model — benchmark scores depend heavily on resolution and fine-tuning. The “mix” variant scores reflect multi-task fine-tuning, not zero-shot ability.

| Benchmark | PaliGemma 3B (mix-448) | Notes |
|---|---|---|
| MMVP (Multimodal Visual Perception) | 46.0% | Mix-448 variant |
| TextVQA | ~73-78% | Varies by resolution (higher at 896px) |
| AI2D (Diagram Understanding) | ~72% | After fine-tuning on AI2D |
| COCO Captions (CIDEr) | ~140+ | After fine-tuning on COCO |
| RefCOCO (Object Localization) | ~90% | After fine-tuning; outputs bounding boxes |

Source: arXiv:2407.07726 and HuggingFace model card. Scores are approximate and vary by training configuration. PaliGemma shines when fine-tuned for a specific task — not as a general-purpose VLM.

Hardware Requirements

| Configuration | VRAM (Inference) | VRAM (Fine-Tuning with LoRA) | Notes |
|---|---|---|---|
| FP16 (224px) | ~6 GB | ~10 GB | Smallest variant, fastest |
| FP16 (448px) | ~8 GB | ~14 GB | Best balance |
| FP16 (896px) | ~16 GB | ~24 GB+ | High-detail, needs beefy GPU |
| 4-bit quantized (448px) | ~3 GB | ~6 GB | bitsandbytes / GPTQ |

Higher resolution = more image tokens = more VRAM. The 896px variant produces 4096 image tokens, requiring significantly more memory. For most fine-tuning tasks, 448px with LoRA is the sweet spot.
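As a back-of-envelope check (a rough sketch, not a measurement): weight memory alone is roughly parameter count times bytes per parameter; activations, the KV cache, and the image-token prefix account for the rest of the table's figures.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough memory for model weights alone; activations and KV cache are extra."""
    return params_billion * bytes_per_param

# FP16: 2 bytes per parameter; 4-bit: ~0.5 bytes per parameter
print(weight_memory_gb(3, 2.0))   # 6.0 GB — the FP16 floor for a 3B model
print(weight_memory_gb(3, 0.5))   # 1.5 GB — plus overhead, roughly the ~3 GB 4-bit row
```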

System Requirements

Operating System
Ubuntu 20.04+, macOS Monterey+, Windows 11
RAM
8GB minimum (16GB recommended for fine-tuning)
Storage
10GB SSD
GPU
6GB+ VRAM recommended (runs on CPU but slow)
CPU
4+ cores

Setup with HuggingFace Transformers

Not Available on Ollama

PaliGemma is a vision-language model that requires image input. It is not available on Ollama and cannot be used as a text-only chat model. Use HuggingFace Transformers or Google's big_vision library instead.

Basic Inference (Mix variant)

The “mix” variant is pre-fine-tuned on multiple tasks and can be used directly.

pip install transformers torch pillow
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import torch

model_id = "google/paligemma-3b-mix-448"

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load an image
image = Image.open("photo.jpg")

# Ask a question about the image
prompt = "What objects are in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256)

# generate() returns prompt + answer; slice off the prompt tokens before decoding
result = processor.decode(output[0][input_len:], skip_special_tokens=True)
print(result)

Note: You may need to accept the Gemma license on HuggingFace and use huggingface-cli login before downloading.

Task-Specific Prompt Formats

PaliGemma uses specific prompt prefixes for different tasks:

| Task | Prompt Format | Output |
|---|---|---|
| Image Captioning | "caption en" | Natural language description |
| Visual Q&A | "What color is the car?" | Short answer |
| Object Detection | "detect cat" | Bounding box coordinates |
| OCR | "ocr" | Extracted text from image |
| Segmentation | "segment cat" | Segmentation mask tokens |
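For "detect" prompts, PaliGemma emits location tokens of the form `<locYYYY>` with values in 0–1023, four per box in (ymin, xmin, ymax, xmax) order on a 1024-step normalized grid. A minimal parser sketch (the example string is illustrative, not actual model output):

```python
import re

LOC = re.compile(r"<loc(\d{4})>")

def parse_detection(output: str, width: int, height: int):
    """Convert '<loc..><loc..><loc..><loc..> label' text into pixel-space boxes.

    Coordinates are normalized to a 0-1023 grid in (ymin, xmin, ymax, xmax) order.
    Returns boxes as (x1, y1, x2, y2) tuples in pixels.
    """
    coords = [int(v) for v in LOC.findall(output)]
    boxes = []
    for i in range(0, len(coords) - 3, 4):
        y1, x1, y2, x2 = coords[i:i + 4]
        boxes.append((x1 / 1024 * width, y1 / 1024 * height,
                      x2 / 1024 * width, y2 / 1024 * height))
    return boxes

# Illustrative output for "detect cat" on a 448x448 image
print(parse_detection("<loc0256><loc0128><loc0768><loc0896> cat", 448, 448))
# → [(56.0, 112.0, 392.0, 336.0)]
```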

Fine-Tuning for Custom Tasks

PaliGemma's primary value is as a fine-tuning base. The “pt” (pretrained) variants are designed to be fine-tuned on your specific dataset for tasks like medical image classification, document understanding, satellite imagery analysis, etc.

LoRA Fine-Tuning Example:

import torch
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model  # pip install peft

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-448",
    torch_dtype=torch.float16
)

# Apply LoRA to language model layers
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# Train on your image-text dataset...
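To see how small the LoRA update is, you can count the adapter parameters: each adapted projection of shape (d_out, d_in) gains two low-rank matrices, r x d_in and d_out x r. The module shapes below are illustrative round numbers, not Gemma's exact attention shapes (Gemma 2B uses multi-query attention, so v_proj is narrower than q_proj):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

# Illustrative: r=16 on 2048x2048 q_proj and v_proj across 18 decoder layers
per_layer = lora_param_count(2048, 2048, 16) * 2
total = per_layer * 18
print(total)  # ~2.4M trainable parameters vs ~3B frozen
```

In practice `model.print_trainable_parameters()` from peft reports the exact figure for your configuration.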

Good Fine-Tuning Tasks

  • Medical/scientific image classification
  • Document OCR and form extraction
  • Product image captioning for e-commerce
  • Satellite/drone imagery analysis
  • Custom object detection domains

Not Suited For

  • General visual conversation (use LLaVA, Qwen-VL)
  • Text-only tasks (use Gemma, Llama, Qwen)
  • Video understanding
  • Interactive chat applications
  • Zero-shot complex reasoning about images

PaliGemma 2 (December 2024)

Google released PaliGemma 2 in December 2024, upgrading the language decoder from Gemma 2B to Gemma 2 (available in 3B, 10B, and 28B sizes). This significantly improves reasoning capabilities while keeping the same SigLIP vision encoder.

| Model | Vision Encoder | Language Decoder | Total Params | VRAM (FP16, 448px) |
|---|---|---|---|---|
| PaliGemma (v1) | SigLIP 400M | Gemma 2B | ~3B | ~8 GB |
| PaliGemma 2 (3B) | SigLIP 400M | Gemma 2 2B | ~3B | ~8 GB |
| PaliGemma 2 (10B) | SigLIP 400M | Gemma 2 9B | ~10B | ~22 GB |
| PaliGemma 2 (28B) | SigLIP 400M | Gemma 2 27B | ~28B | ~60 GB |

Recommendation: For new projects in 2026, start with PaliGemma 2 (10B) at 448px. It offers significantly better reasoning than PaliGemma v1 while remaining runnable on a single RTX 3090/4090. See PaliGemma 2 on HuggingFace.

Strengths & Limitations

Strengths

  • Excellent fine-tuning base: Clean architecture designed for transfer learning across vision-language tasks
  • Multiple resolutions: 224/448/896px lets you trade speed for detail depending on task
  • Small enough for local: 3B params fits on consumer GPUs (6-8GB VRAM)
  • Versatile output: Can generate text, bounding boxes, and segmentation masks
  • Strong SigLIP encoder: The vision encoder is battle-tested and produces high-quality image representations

Limitations

  • Not a chatbot: Cannot have visual conversations like LLaVA or Qwen-VL — it's a transfer model
  • Requires fine-tuning: PT variants give mediocre results out-of-the-box; mix variant is usable but limited
  • No Ollama support: Requires Python/HuggingFace setup — not as simple as one-line install
  • Gemma 2B decoder is small: Limited reasoning compared to 7B+ language models (PaliGemma 2 fixes this)
  • Gemma license restrictions: Not Apache 2.0 — review Gemma Terms of Use before commercial deployment

Comparison with Other Local VLMs

| Model | Size | RAM Required | Speed | Quality | Cost/Month |
|---|---|---|---|---|---|
| PaliGemma 3B (v1) | 3B | 6-8GB | ~20 tok/s | 68% | Free |
| PaliGemma 2 10B | 10B | ~22GB | ~12 tok/s | 78% | Free |
| Qwen2-VL 7B | 7B | ~16GB | ~15 tok/s | 80% | Free |
| LLaVA 1.6 7B | 7B | ~14GB | ~18 tok/s | 72% | Free |

Quality scores are editorial estimates for vision-language tasks. Qwen2-VL and LLaVA are general visual chat models. PaliGemma excels when fine-tuned for specific tasks but is weaker as a general VLM.

Frequently Asked Questions

Can I use PaliGemma as a visual chatbot?

Not really. PaliGemma is designed as a transfer model for specific vision tasks, not interactive visual conversation. For visual chat, use Qwen2-VL, LLaVA, or InternVL instead. PaliGemma excels when fine-tuned for a specific task like medical image classification or document OCR.

Should I use PaliGemma v1 or PaliGemma 2?

For new projects, use PaliGemma 2. The 10B variant (Gemma 2 9B decoder) offers much better reasoning while the SigLIP vision encoder remains the same. PaliGemma v1 (this page) is only worth using if you need the absolute smallest model (3B with Gemma 2B decoder) and PaliGemma 2 3B doesn't meet your needs.

Is PaliGemma available on Ollama?

No. PaliGemma requires image+text multimodal input and is not in Ollama's model library. Use HuggingFace Transformers (PaliGemmaForConditionalGeneration) or Google's big_vision library. Installation requires Python and pip.

What resolution should I use?

448px is the best default. 224px is faster but loses fine detail. 896px captures small text and fine details but uses 4x more VRAM and is much slower (4096 image tokens vs 1024). Use 896px only for OCR-heavy tasks or when you need to read small text in images.




Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

📅 Published: October 28, 2025 · 🔄 Last Updated: March 16, 2026 · ✓ Manually Reviewed