Run Google Gemma Locally: Ollama Setup Guide
Published on April 10, 2026 • 24 min read
Quick Start: Gemma Running in 60 Seconds
Pull and run Gemma with two commands:
- Pull the model (2-3 minutes on broadband):
ollama pull gemma3:4b
- Start chatting:
ollama run gemma3:4b
You now have Google's latest open model running on your hardware. No API key, no usage limits.
What this guide covers:
- Every Gemma variant from 270M to 27B and which to pick
- Exact VRAM requirements for each model size and quantization
- Real performance numbers on consumer GPUs and Apple Silicon
- MLX optimization for M-series Macs
- Fine-tuning Gemma on your own data with Unsloth
- Head-to-head comparison with Phi-4 and Llama 3.2
Google's Gemma family has become one of the strongest options for local AI. The models punch well above their weight class -- Gemma 3 4B matches or beats many 7-8B models from other families on reasoning and instruction following. Google trains these on their TPU infrastructure with the same data pipeline used for Gemini, then releases them under a permissive license that allows commercial use.
If you're new to running models locally, start with our Mac local AI setup guide or check the RAM requirements guide to confirm your hardware can handle the model size you want.
Table of Contents
- The Gemma Model Family
- VRAM Requirements
- Ollama Setup Step by Step
- Performance Benchmarks
- MLX on Apple Silicon
- Quantization Options
- Multimodal Capabilities
- Fine-Tuning with Unsloth
- Gemma vs Phi-4 vs Llama 3.2
- Best Use Cases
The Gemma Model Family {#gemma-family}
Google has released three generations of Gemma. Here's the full lineup as of April 2026:
Gemma 3 (Latest)
| Variant | Parameters | Context | Modality | Release |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 32K | Text only | March 2025 |
| Gemma 3 4B | 4B | 128K | Text + Vision | March 2025 |
| Gemma 3 12B | 12B | 128K | Text + Vision | March 2025 |
| Gemma 3 27B | 27B | 128K | Text + Vision | March 2025 |
Gemma 2
| Variant | Parameters | Context | Notes |
|---|---|---|---|
| Gemma 2 2B | 2B | 8K | Efficient edge model |
| Gemma 2 9B | 9B | 8K | Strong mid-range |
| Gemma 2 27B | 27B | 8K | Top performer |
Gemma 1 and Specialized Variants
| Variant | Parameters | Purpose |
|---|---|---|
| Gemma 270M | 270M | Ultra-lightweight, edge devices |
| CodeGemma 7B | 7B | Code generation and completion |
| RecurrentGemma 2B/9B | 2B/9B | Linear attention, constant memory |
For most users, Gemma 3 4B is the sweet spot. It delivers strong reasoning, handles vision tasks, supports 128K context, and runs comfortably on 8GB hardware. If you have 16GB+, the 12B variant is a significant step up in quality.
VRAM Requirements {#vram-requirements}
These are measured VRAM numbers, not theoretical estimates. Tested with Ollama's default quantization (Q4_K_M for most sizes).
Gemma 3 VRAM Usage
| Model | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| Gemma 3 1B | 1.2GB | 1.4GB | 1.9GB | 2.8GB |
| Gemma 3 4B | 3.3GB | 3.8GB | 5.4GB | 8.6GB |
| Gemma 3 12B | 8.2GB | 9.5GB | 13.8GB | 25.2GB |
| Gemma 3 27B | 17.1GB | 19.8GB | 29.4GB | 54.8GB |
What This Means for Your Hardware
| Your Hardware | Best Gemma Model | Notes |
|---|---|---|
| 8GB GPU / 8GB Mac | Gemma 3 4B (Q4) | Tight fit, close other apps |
| 12GB GPU (RTX 3060) | Gemma 3 4B (Q8) or 12B (Q4 partial) | 4B at high quality, 12B with CPU offload |
| 16GB Mac / 16GB GPU | Gemma 3 12B (Q4) | Comfortable fit, good performance |
| 24GB GPU (RTX 4090) | Gemma 3 12B (Q8) or 27B (Q4) | 12B at peak quality, 27B with some offload |
| 32GB+ Mac | Gemma 3 27B (Q4) | Full GPU inference |
| 48GB+ GPU | Gemma 3 27B (Q8) | Maximum quality |
The Q4_K_M quantization retains roughly 97% of full-precision quality for instruction following. You lose maybe 1-2% on complex reasoning benchmarks. For most practical tasks, you will not notice the difference.
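As a rough sanity check on the table above, VRAM usage is approximately the weight memory plus a flat overhead for the KV cache and runtime buffers. The bits-per-weight figures below are approximations I'm assuming for each GGUF format (K-quants carry some metadata above their nominal bit width); treat the output as a ballpark, not a guarantee.

```python
# Rough VRAM estimate: quantized weights + a flat overhead for
# KV cache and runtime buffers. Bits-per-weight are approximate.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead_gb: float = 1.0) -> float:
    """Ballpark VRAM in GB for a model at a given quantization."""
    weight_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weight_gb + overhead_gb, 1)

for size in (4, 12, 27):
    print(f"Gemma 3 {size}B @ Q4_K_M ≈ {estimate_vram_gb(size, 'Q4_K_M')} GB")
```

For the 12B and 27B sizes this lands within 0.1GB of the measured numbers; the flat overhead overshoots for the 1B model, where the KV cache is proportionally smaller.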
Ollama Setup Step by Step {#ollama-setup}
Install Ollama (If Needed)
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start the service
ollama serve
Pull Gemma Models
# Gemma 3 - recommended variants
ollama pull gemma3:1b # 1B - ultra fast, basic tasks
ollama pull gemma3:4b # 4B - best balance (default)
ollama pull gemma3:12b # 12B - strong reasoning
ollama pull gemma3:27b # 27B - maximum quality
# Specific quantization
ollama pull gemma3:4b-q8_0 # Higher quality 4B
ollama pull gemma3:12b-q4_K_M # Fits in 16GB
# Gemma 2 (still excellent)
ollama pull gemma2:2b
ollama pull gemma2:9b
ollama pull gemma2:27b
# Code-specific
ollama pull codegemma:7b
Verify Installation
# Check model is downloaded
ollama list
# Quick test
ollama run gemma3:4b "What is the capital of France? Answer in one sentence."
# Check model details
ollama show gemma3:4b
Run with Custom Parameters
# Create a Modelfile for custom settings
cat > Modelfile << 'EOF'
FROM gemma3:4b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
SYSTEM "You are a precise, helpful assistant. Give concise answers with specific details. When you're unsure, say so."
EOF
# Create custom model
ollama create my-gemma -f Modelfile
# Run it
ollama run my-gemma
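If you'd rather call the server than use the CLI, Ollama exposes a REST API on port 11434. A minimal sketch using only the standard library — the `my-gemma` name assumes you created the custom model above, and the server must be running for the actual call:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a payload for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a token stream
        "options": {"temperature": 0.7, "num_ctx": num_ctx},
    }

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the Ollama server running):
# print(generate("my-gemma", "Summarize what a Modelfile does."))
```

Per-request `options` override the Modelfile defaults, which is handy for testing parameter changes without recreating the model.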
Performance Benchmarks {#benchmarks}
Real-world measurements from our test hardware. All numbers are tokens per second for generation (not prompt processing).
GPU Benchmarks (Gemma 3)
| Model | RTX 3060 12GB | RTX 4070 12GB | RTX 4090 24GB | RTX 5090 32GB |
|---|---|---|---|---|
| 1B Q4 | 142 tok/s | 198 tok/s | 267 tok/s | 310 tok/s |
| 4B Q4 | 52 tok/s | 78 tok/s | 118 tok/s | 145 tok/s |
| 4B Q8 | 38 tok/s | 58 tok/s | 92 tok/s | 116 tok/s |
| 12B Q4 | CPU offload | 24 tok/s* | 56 tok/s | 74 tok/s |
| 27B Q4 | -- | -- | 18 tok/s* | 32 tok/s |
*Partial GPU offload
Apple Silicon Benchmarks (Gemma 3)
| Model | M1 8GB | M2 16GB | M3 Pro 18GB | M3 Max 36GB | M4 Pro 24GB |
|---|---|---|---|---|---|
| 1B Q4 | 95 tok/s | 112 tok/s | 128 tok/s | 138 tok/s | 142 tok/s |
| 4B Q4 | 28 tok/s | 42 tok/s | 52 tok/s | 58 tok/s | 62 tok/s |
| 12B Q4 | -- | 14 tok/s | 22 tok/s | 34 tok/s | 38 tok/s |
| 27B Q4 | -- | -- | -- | 15 tok/s | 12 tok/s* |
*With limited context window
30+ tokens/second feels instant for interactive chat. Below 10 tok/s starts feeling sluggish. These numbers show Gemma 3 4B delivers a snappy experience on almost any modern hardware.
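To translate tokens per second into perceived wait time, divide the reply length by the generation rate — a quick sketch, assuming a typical chat reply of around 250 tokens:

```python
def seconds_for_response(tokens: int, tok_per_s: float) -> float:
    """Generation time only; excludes prompt processing."""
    return round(tokens / tok_per_s, 1)

# Rates from the benchmark tables above
for rate in (52, 28, 10):
    secs = seconds_for_response(250, rate)
    print(f"{rate} tok/s → {secs} s for a 250-token reply")
```

At 52 tok/s a full reply streams in under five seconds; at 10 tok/s the same reply takes 25 seconds, which is where the sluggishness sets in.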
MLX on Apple Silicon {#mlx-apple-silicon}
If you have an M-series Mac, MLX can squeeze extra performance out of Gemma. MLX is Apple's machine learning framework designed specifically for Apple Silicon's unified memory architecture.
Install MLX
pip install mlx-lm
Download and Run Gemma with MLX
# Download quantized Gemma 3 for MLX
mlx_lm.generate \
--model mlx-community/gemma-3-4b-it-4bit \
--prompt "Explain quantum computing in simple terms" \
--max-tokens 500
# Interactive chat
mlx_lm.chat --model mlx-community/gemma-3-4b-it-4bit
MLX vs Ollama Performance on Apple Silicon
| Model | Ollama (tok/s) | MLX (tok/s) | Difference |
|---|---|---|---|
| Gemma 3 4B Q4 (M3 Pro) | 52 | 64 | +23% |
| Gemma 3 12B Q4 (M3 Max) | 34 | 42 | +24% |
| Gemma 3 27B Q4 (M3 Max 64GB) | 15 | 19 | +27% |
MLX typically delivers 20-30% faster inference than Ollama on Apple Silicon. The advantage comes from tighter Metal integration and memory access patterns optimized for unified memory. The tradeoff: MLX lacks Ollama's API server, model management, and ecosystem of client apps. Use MLX when raw speed matters; use Ollama when you need an API or compatible tools like Open WebUI.
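The Difference column is simply the relative speedup, which you can verify from the two rate columns:

```python
def speedup_pct(baseline: float, faster: float) -> int:
    """Percent improvement of `faster` over `baseline`, rounded."""
    return round((faster - baseline) / baseline * 100)

# (Ollama tok/s, MLX tok/s) pairs from the table above
pairs = {
    "Gemma 3 4B Q4 (M3 Pro)": (52, 64),
    "Gemma 3 12B Q4 (M3 Max)": (34, 42),
    "Gemma 3 27B Q4 (M3 Max 64GB)": (15, 19),
}
for name, (ollama, mlx) in pairs.items():
    print(f"{name}: +{speedup_pct(ollama, mlx)}%")
```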
Converting Models for MLX
# Convert any HuggingFace model to MLX format
mlx_lm.convert \
--hf-path google/gemma-3-4b-it \
--mlx-path ./gemma-3-4b-mlx \
--quantize --q-bits 4
Quantization Options {#quantization}
Quantization reduces model precision to save memory. Here's how different levels affect Gemma 3 4B:
Quantization Quality Comparison
| Quantization | File Size | VRAM | Quality (MMLU) | Speed (RTX 4090) |
|---|---|---|---|---|
| FP16 | 8.6GB | 9.2GB | 72.1% | 82 tok/s |
| Q8_0 | 4.8GB | 5.4GB | 71.8% | 92 tok/s |
| Q6_K | 3.9GB | 4.4GB | 71.5% | 98 tok/s |
| Q5_K_M | 3.5GB | 3.8GB | 71.2% | 104 tok/s |
| Q4_K_M | 3.0GB | 3.3GB | 70.8% | 118 tok/s |
| Q4_0 | 2.6GB | 2.9GB | 69.4% | 124 tok/s |
| Q3_K_M | 2.2GB | 2.5GB | 67.9% | 128 tok/s |
| Q2_K | 1.7GB | 2.0GB | 63.2% | 132 tok/s |
Recommendation: Q4_K_M is the default for good reason. You lose about 1.3 points on MMLU compared to full precision -- barely noticeable in practice -- while cutting memory usage by 64%. Drop to Q3_K_M only if you absolutely need to fit in tight memory. Avoid Q2_K for anything beyond basic chat.
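The tradeoff in that table boils down to VRAM saved versus MMLU points lost relative to FP16 — a quick check against the numbers above:

```python
# (VRAM GB, MMLU %) for Gemma 3 4B, taken from the table above
QUANTS = {
    "FP16":   (9.2, 72.1),
    "Q8_0":   (5.4, 71.8),
    "Q4_K_M": (3.3, 70.8),
    "Q2_K":   (2.0, 63.2),
}

def tradeoff(quant: str) -> tuple[int, float]:
    """Return (% VRAM saved vs FP16, MMLU points lost)."""
    fp_vram, fp_mmlu = QUANTS["FP16"]
    vram, mmlu = QUANTS[quant]
    return round((1 - vram / fp_vram) * 100), round(fp_mmlu - mmlu, 1)

for q in ("Q8_0", "Q4_K_M", "Q2_K"):
    saved, lost = tradeoff(q)
    print(f"{q}: saves {saved}% VRAM, costs {lost} MMLU points")
```

Q4_K_M comes out as the knee of the curve: 64% memory saved for a 1.3-point loss, while Q2_K trades a further 14% of memory for almost nine points of quality.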
How to Choose
# Check available quantizations
ollama show gemma3:4b --modelfile
# Pull specific quantization
ollama pull gemma3:4b-q8_0 # Maximum quality
ollama pull gemma3:4b-q5_K_M # Good balance
ollama pull gemma3:4b-q4_K_M # Memory efficient (default)
For a deeper comparison of quantization formats, see our AWQ vs GPTQ vs GGUF comparison.
Multimodal Capabilities {#multimodal}
Gemma 3 4B, 12B, and 27B are multimodal -- they accept both text and images. This works out of the box in Ollama.
Image Analysis with Ollama
# Describe an image
ollama run gemma3:4b "What's in this image?" ./photo.jpg
# Extract text from a screenshot
ollama run gemma3:12b "Extract all text visible in this image" ./screenshot.png
# Analyze a chart
ollama run gemma3:4b "What trends does this chart show?" ./quarterly_revenue.png
Via the API
# Base64 encode an image and send to Ollama API
curl http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Describe this image in detail",
"images": ["'$(base64 -i photo.jpg)'"]
}'
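The same call from Python, building the request body with the standard library. Ollama expects raw base64 strings in the `images` array, without a `data:` URI prefix:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_path: str) -> bytes:
    """Build a JSON body for /api/generate with an embedded image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [image_b64],  # base64 string, no data: prefix
        "stream": False,
    }).encode()

# POST the returned bytes to http://localhost:11434/api/generate
# with Content-Type: application/json (server must be running).
```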
Vision Performance
Gemma 3 4B handles basic image understanding -- object identification, text extraction, simple visual Q&A. For complex image reasoning (counting objects, spatial relationships, detailed chart analysis), the 12B or 27B variants perform noticeably better.
The 1B model is text-only. If you need vision on constrained hardware, the 4B is your only Gemma option under 8GB.
Fine-Tuning with Unsloth {#fine-tuning}
Gemma models respond extremely well to fine-tuning. With QLoRA, you can fine-tune Gemma 3 4B on a GPU with just 6GB VRAM.
Install Unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
Fine-Tuning Script
from unsloth import FastLanguageModel
# Load Gemma with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/gemma-3-4b-it-bnb-4bit",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
)
# Prepare your dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_training_data.jsonl")
# Format: {"instruction": "...", "input": "...", "output": "..."}

# SFTTrainer expects a single text column, so merge the
# fields into Gemma's chat format
def format_example(example):
    return {
        "text": f"<start_of_turn>user\n{example['instruction']}\n"
                f"{example['input']}<end_of_turn>\n"
                f"<start_of_turn>model\n{example['output']}<end_of_turn>"
    }
dataset = dataset.map(format_example)

from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()
# Save the fine-tuned model
model.save_pretrained_merged("gemma-finetuned", tokenizer)
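The my_training_data.jsonl file referenced above is plain JSON Lines in the instruction/input/output shape. A sketch with hypothetical placeholder examples — substitute your own data:

```python
import json

# Hypothetical examples in the {"instruction", "input", "output"}
# shape the training script expects
examples = [
    {"instruction": "Summarize the text in one sentence.",
     "input": "Gemma 3 is a family of open models from Google, "
              "released in sizes from 1B to 27B parameters.",
     "output": "Gemma 3 is Google's open model family, spanning 1B to 27B."},
    {"instruction": "Translate to French.",
     "input": "Good morning",
     "output": "Bonjour"},
]

# One JSON object per line, no trailing commas
with open("my_training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

A few hundred high-quality examples in this format are usually enough for a noticeable behavior shift with LoRA.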
Export to Ollama
# Convert to GGUF at f16, then quantize to Q4_K_M
# (convert_hf_to_gguf.py cannot emit Q4_K_M directly)
python llama.cpp/convert_hf_to_gguf.py gemma-finetuned \
  --outtype f16 \
  --outfile gemma-finetuned-f16.gguf
llama.cpp/llama-quantize gemma-finetuned-f16.gguf gemma-finetuned.gguf Q4_K_M
# Create Ollama model
cat > Modelfile << 'EOF'
FROM ./gemma-finetuned.gguf
TEMPLATE """<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>"""
PARAMETER stop "<end_of_turn>"
EOF
ollama create my-gemma-finetuned -f Modelfile
ollama run my-gemma-finetuned
For a comprehensive fine-tuning walkthrough beyond Gemma, see our LoRA fine-tuning local guide.
Unsloth claims 2x training speed over standard HuggingFace training. In our testing with Gemma 3 4B, we measured 1.7x speedup -- still significant. Full details on the Unsloth GitHub repository.
Gemma vs Phi-4 vs Llama 3.2 {#comparison}
The three strongest open model families for local use, compared head-to-head at similar sizes:
4B Class Models
| Benchmark | Gemma 3 4B | Phi-4 Mini 3.8B | Llama 3.2 3B |
|---|---|---|---|
| MMLU | 70.8 | 68.2 | 63.4 |
| HumanEval | 58.5 | 62.1 | 48.2 |
| GSM8K | 72.3 | 74.8 | 57.5 |
| ARC-C | 68.1 | 65.9 | 59.7 |
| Context window | 128K | 128K | 128K |
| Vision | Yes | Yes | No |
| License | Gemma License | MIT | Llama License |
| VRAM (Q4) | 3.3GB | 3.0GB | 2.4GB |
12B Class Models
| Benchmark | Gemma 3 12B | Phi-4 14B | Llama 3.2 11B |
|---|---|---|---|
| MMLU | 79.2 | 78.8 | 73.6 |
| HumanEval | 68.3 | 72.6 | 62.8 |
| GSM8K | 83.1 | 85.2 | 75.4 |
| ARC-C | 76.5 | 74.3 | 70.1 |
| Context window | 128K | 16K | 128K |
| Vision | Yes | Yes | Yes |
| VRAM (Q4) | 8.2GB | 9.4GB | 7.8GB |
Key takeaways:
- Gemma 3 4B wins on general knowledge (MMLU, ARC) while being smaller than Phi-4 Mini
- Phi-4 wins on math and code (GSM8K, HumanEval) across both size classes
- Llama 3.2 is the most memory-efficient but trails on every benchmark
- Gemma 3 holds 128K context from 4B up, while Phi-4 drops to 16K at 14B -- a real advantage for document analysis
- Vision capability on the 4B model gives Gemma a unique advantage in its size class
For a broader comparison of small local models, check our small language models guide.
Best Use Cases {#use-cases}
Where Gemma Excels
Document analysis and summarization. The 128K context window combined with multimodal support means Gemma 3 can process long documents and images in a single pass. Feed it a 50-page PDF and ask for a structured summary.
Multilingual tasks. Google trained Gemma on data spanning 30+ languages. It handles translation, multilingual Q&A, and cross-lingual retrieval better than most open models its size.
Instruction following. Gemma's instruction-tuned variants follow complex, multi-step instructions with high reliability. This makes them excellent for structured output tasks like JSON generation, data extraction, and template filling.
Where Other Models Are Better
Pure coding tasks. If you write code all day, Phi-4 or Qwen2.5-Coder will serve you better. Gemma is competent at code but not a specialist.
Creative writing. Llama 3.2 and Mistral produce more varied, creative prose. Gemma tends toward factual, concise responses -- great for work, less great for fiction.
Constrained memory (<4GB). Gemma 3 1B is decent, but Phi-4 Mini at 3.8B Q2_K provides meaningfully better quality in a similar memory footprint.
Troubleshooting
Model Won't Load
# Check available memory
nvidia-smi # GPU
free -h # System RAM
# Try smaller quantization
ollama pull gemma3:4b-q4_0
# Force CPU mode if GPU memory is full
# (the variable must be set on the server process, not the client)
CUDA_VISIBLE_DEVICES="" ollama serve
Slow Generation
# Reduce context window (set it inside the session)
ollama run gemma3:4b
>>> /set parameter num_ctx 4096
# Check if model is using GPU
ollama ps # Shows GPU memory usage per model
# On Mac, verify Metal is active
system_profiler SPDisplaysDataType | grep Metal
Vision Not Working
# Only 4B, 12B, 27B support vision
# 1B is text-only
# Verify with API
curl http://localhost:11434/api/show -d '{"name": "gemma3:4b"}' | grep -i "vision"
Conclusion
Gemma 3 earns its spot as a top-tier local model family. The 4B variant delivers a rare combination: vision support, 128K context, strong benchmarks, and 3.3GB memory footprint. That's a lot of capability in a small package.
Start with ollama pull gemma3:4b and run it for a week as your daily driver. If you hit quality ceilings on complex reasoning tasks, step up to 12B. If you need peak performance for production workloads, the 27B is competitive with models twice its parameter count.
The model weights and technical documentation are available on Google's Gemma page and the Google organization on HuggingFace.
Looking for a model comparison that covers the full local AI landscape? Our best local AI models for 8GB RAM guide ranks every major family by real-world usability on consumer hardware.