OLMo 2 Local Setup Guide (2026): AI2's Fully Open 7B / 13B / 32B on Consumer GPUs
OLMo 2 is the Allen Institute for AI's November 2024 release — the only frontier-quality language model where everything is open. Weights (Apache 2.0), training data (Dolma 2, all 7T tokens), training code (OLMo-core), every intermediate checkpoint, full evaluation harness (OLMES), and post-training recipes (Tulu 3) are publicly inspectable. For research labs, regulated industries, and anyone who needs supply-chain transparency in AI, OLMo 2 is the only practical choice in 2026.
This guide covers the full OLMo 2 family (7B / 13B / 32B), setup across Ollama / vLLM / llama.cpp, the fully-open data and training pipeline, fine-tuning with QLoRA, benchmarks vs Llama 3.1 and Qwen 2.5, and when OLMo's transparency advantages outweigh the slight performance gap.
Table of Contents
- What OLMo 2 Is
- The OLMo 2 Family: 7B / 13B / 32B
- Why "Fully Open" Matters
- Hardware Requirements & Quantization
- OLMo 2 vs Llama 3.1 vs Qwen 2.5
- Ollama Setup
- llama.cpp Setup with GGUF
- vLLM Setup
- Other Runtimes: LM Studio / oobabooga
- Dolma 2 Dataset
- Tulu 3 Post-Training Recipe
- Fine-Tuning OLMo 2
- System Prompts & Sampling
- When to Pick OLMo 2 (Decision Tree)
- Real Benchmarks
- Licensing
- Troubleshooting
- FAQ
What OLMo 2 Is {#what-it-is}
OLMo 2 (allenai/OLMo-2-* on HuggingFace) is the Allen Institute for AI's second-generation Open Language Model family. Architecture: standard decoder-only transformer (modified Llama-style with RMSNorm, SwiGLU, RoPE) with 7B, 13B, and 32B parameter variants. Native context: 4096 tokens (8K via RoPE scaling). Chat template: ChatML-compatible with OLMo-specific system tokens.
Released components alongside the weights:
- Dolma 2: full 7T-token training corpus
- OLMo-core: PyTorch training code with FSDP support
- OLMES: evaluation harness with hundreds of benchmark tasks
- Tulu 3: post-training pipeline (SFT + DPO + RLVR data and recipes)
- All intermediate checkpoints: every 1B tokens during pre-training
License: Apache 2.0 for weights; ODC-BY for Dolma 2 data.
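The weights load like any Hugging Face checkpoint. A minimal sketch, assuming a recent transformers (4.47+ added the Olmo2 architecture), a CUDA GPU with roughly 16 GB of memory for the BF16 7B, and the model ID as published on the Hub:

```python
# Minimal sketch: load OLMo 2 7B Instruct with Hugging Face transformers.
# Assumes transformers >= 4.47 and ~16 GB of GPU memory for BF16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What makes OLMo 2 fully open?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.6)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Swap in the 13B or 32B ID for the larger variants; `device_map="auto"` will spill to CPU RAM if VRAM runs out, at a large speed cost.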
The OLMo 2 Family: 7B / 13B / 32B {#family}
| Variant | Parameters | Context | VRAM (BF16 / Q4) | Use |
|---|---|---|---|---|
| OLMo 2 7B Base | 7B | 4K | 14 GB / 4.5 GB | Continued pretraining, research |
| OLMo 2 7B Instruct | 7B | 4K | 14 GB / 4.5 GB | Chat, general |
| OLMo 2 13B Base | 13B | 4K | 26 GB / 8 GB | Continued pretraining |
| OLMo 2 13B Instruct | 13B | 4K | 26 GB / 8 GB | Chat, general |
| OLMo 2 32B Base | 32B | 4K | 65 GB / 20 GB | Heavy reasoning |
| OLMo 2 32B Instruct | 32B | 4K | 65 GB / 20 GB | Chat, complex tasks |
| OLMoE-1B-7B | 7B (1B active MoE) | 4K | 14 GB / 4.5 GB | Efficient MoE option |
For most users: 13B Instruct on a 16 GB card is the sweet spot. For research: pull base + intermediate checkpoints from HuggingFace.
Why "Fully Open" Matters {#fully-open}
Most "open" LLMs publish weights and a model card. They don't publish:
- The training data (Llama, Qwen, Mistral, Phi keep this proprietary)
- The training code (Llama publishes inference code; training is internal)
- Intermediate checkpoints (Llama only releases final weights)
- The evaluation harness (most labs have internal benchmarks)
- The post-training recipe (DPO/RLHF data, hyperparameters often unpublished)
OLMo 2 publishes all five. Concrete consequences:
- Audit for copyright/PII: you can grep Dolma 2 for specific text strings before deployment in regulated industries.
- Reproduce any checkpoint: train from scratch with the same data — verify no supply-chain backdoor.
- Replicate scaling laws research: every intermediate checkpoint allows training-dynamics analysis impossible with closed models.
- Modify the post-training recipe: rerun Tulu 3 with your safety policies, helpfulness criteria, or domain focus.
For 95% of consumer use (chat, code assistance, casual RAG), this transparency is a nice-to-have. For defense, healthcare, EU public sector, academic research, or anything where data provenance is legally required, it's the only viable option.
Hardware Requirements & Quantization {#requirements}
| GPU VRAM | Best OLMo 2 Variant | Throughput on RTX 4090 |
|---|---|---|
| 6 GB | 7B Q4_K_M | ~85 tok/s |
| 8-12 GB | 7B Q5/Q8 or 13B Q4 | ~75 tok/s |
| 16 GB | 13B Q5_K_M | ~60 tok/s |
| 24 GB | 13B BF16 or 32B Q4 | ~40 tok/s (32B Q4) |
| 48 GB+ | 32B BF16 / FP16 | ~25 tok/s |
CPU-only inference: 7B Q4_K_M at ~4-8 tok/s on Ryzen 7 7800X3D (DDR5-6000). Apple M2/M3/M4: 7B and 13B comfortable; 32B needs M3 Max / M4 Max with 64+ GB unified memory.
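The table's numbers follow from a simple back-of-envelope rule: weight memory is parameters times bytes per weight, plus overhead for the KV cache, activations, and the CUDA context. A rough estimator (the 1.2x overhead factor is an assumption tuned to roughly match the table at 4K context, not an official formula):

```python
# Rough VRAM estimate: weight bytes plus a fixed overhead fudge factor.
# The 1.2x overhead (KV cache, activations, CUDA context) is an assumption.
def vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Estimated VRAM in GB for a dense model at short (4K) context."""
    weight_gb = params_b * bits_per_weight / 8  # 1e9 params * bits / 8 = GB
    return round(weight_gb * overhead, 1)

print(vram_gb(7, 16))    # BF16 7B   -> 16.8 (weights alone: 14 GB)
print(vram_gb(13, 5.5))  # Q5_K_M 13B -> 10.7
print(vram_gb(32, 4.5))  # Q4_K_M 32B -> 21.6
```

Quantized GGUF files average ~4.5 bits/weight for Q4_K_M and ~5.5 for Q5_K_M, which is why 32B Q4 squeezes onto a 24 GB card but leaves little room for context.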
OLMo 2 vs Llama 3.1 vs Qwen 2.5 {#comparison}
| Benchmark | OLMo 2 13B | Llama 3.1 8B | Qwen 2.5 14B | OLMo 2 32B | Llama 3.1 70B |
|---|---|---|---|---|---|
| MMLU | 67.5 | 73.0 | 79.7 | 78.0 | 86.0 |
| MMLU-Pro | 47.0 | 48.3 | 63.7 | 58.5 | 60.4 |
| GSM8K | 79.5 | 84.5 | 90.2 | 88.7 | 95.1 |
| MATH | 23.4 | 51.9 | 80.0 | 49.0 | 68.0 |
| HumanEval | 74.0 | 72.6 | 83.5 | 80.0 | 80.5 |
| IFEval | 72.0 | 80.4 | 81.0 | 80.0 | 87.5 |
| Context length | 4K | 131K | 131K | 4K | 131K |
| License clarity | Apache 2.0 | Llama Community | Tongyi Qianwen | Apache 2.0 | Llama Community |
| Data transparency | Full Dolma 2 | None | None | Full Dolma 2 | None |
OLMo 2 13B trades blows with Llama 3.1 8B; OLMo 2 32B is competitive with Llama 3.1 70B on reasoning at half the VRAM. Long context is the main weakness.
Ollama Setup {#ollama}
ollama pull olmo2:7b
ollama pull olmo2:13b
ollama run olmo2:13b "Explain RoPE scaling in 3 sentences."
Custom Modelfile:
FROM olmo2:13b
PARAMETER num_ctx 4096
PARAMETER temperature 0.6
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a precise research assistant. Cite sources when relevant."""
ollama create my-olmo -f Modelfile
ollama run my-olmo
llama.cpp Setup with GGUF {#llamacpp}
huggingface-cli download bartowski/OLMo-2-1124-13B-Instruct-GGUF \
OLMo-2-1124-13B-Instruct-Q5_K_M.gguf \
--local-dir ./models
./llama-cli \
-m models/OLMo-2-1124-13B-Instruct-Q5_K_M.gguf \
-ngl 999 -c 4096 -fa \
--temp 0.6 --min-p 0.05 \
-p "Summarize the OLMo 2 release in 3 bullet points."
For server mode:
./llama-server -m OLMo-2-1124-13B-Instruct-Q5_K_M.gguf -ngl 999 -c 4096 --port 8080
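llama-server also exposes an OpenAI-compatible chat endpoint on the port above. A quick smoke test, assuming the server command above is running locally:

```shell
# Smoke-test the running llama-server via its OpenAI-compatible endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.6,
    "max_tokens": 64
  }'
```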
vLLM Setup {#vllm}
# BF16 13B (needs 30+ GB VRAM)
vllm serve allenai/OLMo-2-1124-13B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.92
# AWQ-INT4 (12+ GB VRAM)
vllm serve neuralmagic/OLMo-2-1124-13B-Instruct-AWQ \
--quantization awq \
--max-model-len 4096
For multi-GPU 32B serving with tensor parallelism: --tensor-parallel-size 2 on 2x RTX 3090. See vLLM Complete Setup Guide.
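vLLM serves the OpenAI chat API on port 8000 by default, so any OpenAI SDK works against it. A client sketch, assuming `pip install openai` and one of the serve commands above running:

```python
# Query a local vLLM server through its OpenAI-compatible API.
# Assumes one of the `vllm serve` commands above is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="allenai/OLMo-2-1124-13B-Instruct",  # must match the served model ID
    messages=[{"role": "user", "content": "Explain Tulu 3 in two sentences."}],
    temperature=0.6,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```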
Other Runtimes: LM Studio / oobabooga {#other-runtimes}
- LM Studio: search "OLMo 2" in model browser, choose Q5_K_M, click Load.
- oobabooga: download GGUF, place in models/, llama.cpp loader.
- KoboldCpp: ./koboldcpp --model OLMo-2-13B-Instruct-Q5_K_M.gguf --usecublas
- MLX (Apple): mlx_lm.generate --model mlx-community/OLMo-2-1124-13B-Instruct-4bit --prompt "..."
See text-generation-webui guide and KoboldCpp guide.
Dolma 2 Dataset {#dolma}
The 7-trillion-token training corpus, downloadable from allenai/dolma:
| Source | Tokens | % of Corpus |
|---|---|---|
| Filtered Common Crawl | 4.8T | 68% |
| StackExchange | 200B | 3% |
| GitHub code | 400B | 6% |
| S2ORC papers | 600B | 8% |
| Wikipedia | 100B | 1.5% |
| Books (PD) | 100B | 1.5% |
| Other | 800B | 12% |
Quality filtering: GOPHER + C4 + custom toxicity classifier. Personally identifiable information (PII) filter applied at the URL and document level.
To audit:
huggingface-cli download allenai/dolma --repo-type dataset --local-dir ./dolma
# Each shard is ~10 GB compressed JSONL
zcat dolma/shard-0001.jsonl.gz | grep "your-search-string"
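For audits beyond a single grep, the same scan is easy to script. This is a hypothetical helper, not an official Dolma tool; it assumes shards are gzipped JSONL with `text` and `id` fields (the documented Dolma record format):

```python
# Scan gzipped Dolma shards for a search string; a sketch, not an official tool.
# Assumes each shard is JSONL with "text" and "id" fields per document.
import gzip
import json
from pathlib import Path

def scan_shard(path, needle):
    """Yield (doc_id, char_offset) for every document whose text contains needle."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            idx = doc.get("text", "").find(needle)
            if idx != -1:
                yield doc.get("id", "?"), idx

# Usage: walk every downloaded shard
# for shard in Path("./dolma").glob("*.jsonl.gz"):
#     for doc_id, offset in scan_shard(shard, "your-search-string"):
#         print(shard.name, doc_id, offset)
```

At ~10 GB per shard this is I/O-bound; for corpus-scale audits, parallelize per shard or build an index once rather than re-scanning.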
Tulu 3 Post-Training Recipe {#tulu-3}
OLMo 2 Instruct is created from OLMo 2 Base via the Tulu 3 pipeline:
- SFT (supervised fine-tuning) on 939K curated conversation examples
- DPO (direct preference optimization) on 271K preference pairs
- RLVR (RL with verifiable rewards) on math/code with verifier feedback
All three datasets are public on HuggingFace (allenai/tulu-3-sft-mixture, etc.). Hyperparameters and code are in the OLMo-core repo. To replicate Tulu 3 on a different base model:
git clone https://github.com/allenai/open-instruct
cd open-instruct
# Follow tulu-3 README — supports any HF base model
See DPO / ORPO / KTO Guide for the preference-tuning fundamentals.
Fine-Tuning OLMo 2 {#fine-tuning}
QLoRA with Unsloth:
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="allenai/OLMo-2-1124-13B-Instruct",
max_seq_length=4096,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model, r=32, lora_alpha=64, lora_dropout=0,
target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)
dataset = load_dataset("json", data_files="train.jsonl", split="train")
trainer = SFTTrainer(
model=model, tokenizer=tokenizer, train_dataset=dataset,
max_seq_length=4096,
args=TrainingArguments(
per_device_train_batch_size=2, gradient_accumulation_steps=4,
warmup_ratio=0.03, num_train_epochs=3, learning_rate=2e-4,
bf16=True, logging_steps=10, output_dir="./olmo2_lora",
),
)
trainer.train()
OLMo 2 13B QLoRA on 1K examples: ~1.5 hours on RTX 4090. See QLoRA Fine-Tuning Guide.
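After training, the adapter can be saved on its own or merged into the base weights for serving. A sketch continuing the script above; the merge/GGUF helpers follow Unsloth's documented method names, but exact signatures vary by version, so treat this as a starting point rather than a fixed recipe:

```python
# Continuing the training script above: persist the LoRA adapter,
# or merge it into the base weights for deployment.
# Method names per Unsloth's docs; signatures may differ across versions.

# Adapter only (small, loadable on top of the base model with PEFT)
model.save_pretrained("olmo2_lora_adapter")
tokenizer.save_pretrained("olmo2_lora_adapter")

# Merge adapter into BF16 weights for vLLM / transformers serving
model.save_pretrained_merged(
    "olmo2_merged", tokenizer, save_method="merged_16bit"
)

# Export a quantized GGUF for llama.cpp / Ollama
model.save_pretrained_gguf(
    "olmo2_gguf", tokenizer, quantization_method="q5_k_m"
)
```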
For full fine-tuning with FSDP on a multi-GPU node, use the OLMo-core scripts directly — it's the same pipeline AI2 uses internally.
System Prompts & Sampling {#prompting}
ChatML-style template (OLMo-specific tokens):
<|system|>
You are a research assistant.
<|user|>
[user message]
<|assistant|>
Recommended sampling:
- General chat: temperature 0.6, min-p 0.05
- Code/reasoning: temperature 0.2, min-p 0.05
- Creative writing: temperature 0.8, min-p 0.05
OLMo 2 Instruct is trained with strong instruction-following data (Tulu 3) — clear, direct system prompts work best. See LLM Sampling Parameters for full sampler reference.
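The template above can be assembled mechanically. In real code, prefer the tokenizer's `apply_chat_template`, which handles special tokens exactly; this minimal string builder (the trailing-newline convention is an assumption) just makes the format concrete:

```python
# Assemble the ChatML-style prompt shown above. A sketch only: production
# code should use tokenizer.apply_chat_template for exact special tokens.
def build_prompt(messages):
    """messages: list of {"role": ..., "content": ...} dicts."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    parts.append("<|assistant|>")  # cue the model to generate its turn
    return "\n".join(parts)

prompt = build_prompt([
    {"role": "system", "content": "You are a research assistant."},
    {"role": "user", "content": "Define RLVR."},
])
print(prompt)
```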
When to Pick OLMo 2 (Decision Tree) {#decision}
Pick OLMo 2 if:
- You need Apache 2.0 license clarity (no MAU thresholds, naming, or use restrictions)
- You need data transparency (regulated industries, defense, EU public sector)
- You're doing academic research that requires reproducibility
- You want to study training dynamics with intermediate checkpoints
- You're building on a fully-open foundation for downstream open-source work
Pick Llama 3.1 / Qwen 2.5 if:
- You need 131K context (RAG over books, long docs, codebases)
- You need top-tier multilingual performance (>10 languages)
- You're optimizing pure performance per VRAM without trust constraints
Pick Phi-4 if:
- You need strong math/reasoning at small VRAM
- You need MIT license (most permissive of all)
Real Benchmarks {#benchmarks}
Single-user, RTX 4090, Q5_K_M:
| Test | OLMo 2 13B | Llama 3.1 8B | Qwen 2.5 14B |
|---|---|---|---|
| MMLU | 67.5% | 73.0% | 79.7% |
| GSM8K | 79.5% | 84.5% | 90.2% |
| HumanEval | 74.0% | 72.6% | 83.5% |
| IFEval | 72.0% | 80.4% | 81.0% |
| Inference tok/s (Q5) | 60 | 127 | 52 |
| TTFT (1K prompt) | 180 ms | 110 ms | 220 ms |
OLMo 2 13B is competitive on most tasks; Qwen 2.5 14B leads on math and code. The difference: only OLMo 2 ships its full evaluation harness (OLMES), so its numbers can be independently reproduced.
Licensing {#licensing}
OLMo 2 weights: Apache 2.0 — most permissive among research-quality models.
You can:
- Use commercially without restriction
- Modify and redistribute weights and derivatives
- Bundle into proprietary products
- Sell as paid service
- Train derivative models without restriction
- Patent improvements (with patent grant)
Dolma 2 data: ODC-BY 1.0 (Open Data Commons Attribution) — attribute and use freely.
Compare to Llama 3.1 (Meta Llama Community License with 700M MAU threshold + naming requirements), Qwen 2.5 (Tongyi Qianwen License with EU restrictions), Mistral (Apache 2.0 for some, MNPL for others). For strictest license cleanliness with no operational concerns, OLMo 2 is the safest choice in 2026.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing OLMo-specific tokens | Use Ollama Modelfile or vLLM with chat template auto-detect |
| OOM at 4K context | 32B model on 24 GB | Use Q4_K_M or split across 2 GPUs |
| Repetitive output | No min-p set | Set min-p 0.05 |
| Underperforms vs benchmark | Wrong checkpoint | Use -1124- (Nov 2024) variant — earlier OLMo 2 previews were weaker |
| Model unknown in vLLM | vLLM <0.7 | Upgrade vLLM to 0.7.2+ for native OLMo 2 support |
| Long context fails | Native 4K only | Use RoPE scaling --rope-scaling '{"type":"linear","factor":2.0}' for 8K |
FAQ {#faq}
See answers to common OLMo 2 questions below.
Sources: OLMo 2 release blog (AI2) | OLMo 2 on HuggingFace | OLMo-core training repo | Dolma 2 dataset | Tulu 3 paper (arXiv 2411.15124) | Internal benchmarks RTX 4090.