GPT-OSS: OpenAI's First Open-Source Model
OpenAI's historic open-source release. Two MoE models — 20B for local hardware, 120B for production. Apache 2.0 license. Runs on Ollama.
Overview
GPT-OSS marks a watershed moment: OpenAI's first open-weight model family since GPT-2, released under the permissive Apache 2.0 license. After years of keeping its models closed, OpenAI released two state-of-the-art models that anyone can download, modify, and deploy without restriction.
The family includes GPT-OSS 120B (117B total parameters, 5.1B active per token) for production-grade inference and GPT-OSS 20B (21B total parameters, 3.6B active per token) for local and edge deployment. Both use Mixture of Experts (MoE) architecture for efficient inference.
GPT-OSS 20B
- 21B total params, 3.6B active/token
- Runs on 16GB RAM (consumer hardware)
- Perfect for local development
- Matches o3-mini on common benchmarks
- 128K context window
GPT-OSS 120B
- 117B total params, 5.1B active/token
- Needs 80GB GPU (A100 or multi-GPU)
- Matches or exceeds o4-mini
- 97.9% on AIME 2025 (competition math)
- 128K context window
MoE Architecture Deep Dive
GPT-OSS uses a Mixture of Experts (MoE) Transformer architecture with several key innovations:
Technical Specifications
The MoE design means the model activates only a fraction of its parameters per token. GPT-OSS 120B activates 5.1B of its 117B parameters — just 4.4% — giving it the inference speed of a ~5B model with the quality of a much larger one. However, all 117B parameters must fit in memory, which is why VRAM requirements remain high.
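A quick sanity check of the sparsity arithmetic, using the parameter counts quoted above:

```python
# Active-parameter fraction per token (totals and active counts from the model card)
models = {"gpt-oss-120b": (117e9, 5.1e9), "gpt-oss-20b": (21e9, 3.6e9)}
for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

Note that the 20B model is considerably less sparse (~17% active) than the 120B (~4.4%), which is typical for smaller MoE models.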
The alternating dense and sparse attention (similar to GPT-3's approach) balances quality with efficiency. Grouped multi-query attention with a group size of 8 further reduces memory usage during inference, enabling longer context windows on the same hardware.
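To make the routing idea concrete, here is a toy top-k MoE layer. This is an illustrative sketch only: the expert count, hidden size, and k are made up and do not reflect GPT-OSS's actual configuration, and the experts are plain linear maps rather than full FFN blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, k = 8, 16, 2

router_w = rng.normal(size=(d_model, n_experts))               # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    logits = x @ router_w                                      # one score per expert
    top = np.argsort(logits)[-k:]                              # keep only the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen k
    # Only k of n_experts weight matrices are touched per token:
    # that is where the inference-speed saving comes from.
    return sum(w * (x @ experts[i]) for i, w in zip(top, weights))

y = moe_layer(rng.normal(size=(d_model,)))
print(y.shape)
```

Every token still produces a full `d_model`-sized output; sparsity only reduces which experts do the work, which is why memory requirements track total parameters while speed tracks active parameters.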
Benchmarks (GPT-OSS 120B)
Capability Profile
120B vs 20B Comparison
| Benchmark | GPT-OSS 120B | GPT-OSS 20B |
|---|---|---|
| MMLU-Pro | 90.0% | 85.3% |
| HumanEval | 88.3% | 81.7% |
| AIME 2025 | 97.9% | 98.7% |
| SWE-bench Verified | 62.4% | — |
| GPQA Diamond | 80.9% | — |
Note: 20B outperforms 120B on AIME 2025 (98.7% vs 97.9%), suggesting the smaller model may have specialized training for competition math. Source: arXiv 2508.10925.
Quick Start with Ollama
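Getting the model running takes two commands (assumes Ollama is already installed; the pull is a multi-GB download):

```shell
# Download the 20B model from the Ollama library, then start an interactive chat
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
```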
Use via OpenAI-Compatible API
Ollama exposes a Chat Completions API compatible with the OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but unused by Ollama
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Explain MoE architectures"}
    ],
)
print(response.choices[0].message.content)
```

Use with Open WebUI
For a ChatGPT-like interface, pair GPT-OSS with Open WebUI:
```bash
# Start Ollama with GPT-OSS
ollama serve &
ollama pull gpt-oss:20b

# Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
```
Hardware Requirements
GPT-OSS 20B
- 16GB of RAM minimum; runs on consumer hardware (recent GPUs or Apple Silicon Macs)
- ~12GB of VRAM at Q4_K_M quantization (see the table below)
GPT-OSS 120B
- 80GB of GPU memory: a single A100/H100 80GB, or a multi-GPU setup such as dual RTX 4090s
Quantization Options (20B)
| Quantization | VRAM | Quality Loss | Best For |
|---|---|---|---|
| Q4_K_M | ~12GB | ~3-5% | RTX 3060/4060, 16GB Macs |
| Q5_K_M | ~15GB | ~2-3% | RTX 4070 Ti, 32GB Macs |
| Q8_0 | ~16GB | ~1% | RTX 4090, 32GB+ Macs |
| FP16 | ~42GB | 0% | Multi-GPU / research |
VRAM estimates based on standard MoE quantization formulas. Actual values may vary. See our quantization guide for details.
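The table's FP16 row follows directly from a weights-only estimate (parameters × bytes per weight); quantized rows scale with bits per weight. A rough sketch, where the ~4.5 bits-per-weight figure for Q4_K_M is an approximation:

```python
# Weights-only memory estimate: params * bytes-per-parameter.
# Real usage is higher once the KV cache and activations are added.
params = 21e9  # gpt-oss-20b total parameters

fp16_gb = params * 2 / 1e9       # FP16: 2 bytes per weight
q4_gb = params * 4.5 / 8 / 1e9   # Q4_K_M: ~4.5 bits per weight (approximate)

print(f"FP16 weights:   ~{fp16_gb:.0f} GB")
print(f"Q4_K_M weights: ~{q4_gb:.0f} GB")
```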
vs Other Models
[Comparison chart: MMLU-Pro scores. All models compared are open-weight and locally runnable.]
When to Choose GPT-OSS
Best Use Cases
Local AI Development
The 20B model runs on consumer hardware and exposes an OpenAI-compatible API. Perfect for developing and testing AI applications locally before deploying to production.
Competition Math & Reasoning
97.9% on AIME 2025 makes GPT-OSS one of the strongest reasoning models. Excellent for complex analytical tasks, proofs, and multi-step problem solving.
Agentic Workflows
Strong tool-calling capabilities make GPT-OSS ideal for AI agent frameworks like CrewAI and LangGraph running fully offline.
Code Generation
88.3% HumanEval puts GPT-OSS among the top coding models. Pair it with Continue.dev for a free, private coding assistant.
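To wire the model into Continue.dev, a minimal model entry in Continue's JSON config might look like the following (field names assumed from Continue's Ollama provider format; adjust to your Continue version):

```json
{
  "models": [
    {
      "title": "GPT-OSS 20B (local)",
      "provider": "ollama",
      "model": "gpt-oss:20b"
    }
  ]
}
```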
Privacy-Sensitive Applications
Apache 2.0 license + local execution means zero data leaves your machine. Ideal for healthcare, legal, and financial applications with compliance requirements.
Edge & On-Device
The 20B model with Q4 quantization runs on 12GB of memory — feasible on phones, tablets, and embedded systems for on-device AI without connectivity.
Advanced Setup
Custom Modelfile
```
# GPT-OSS Modelfile for Ollama
FROM gpt-oss:20b

# Optimize for reasoning tasks
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768

# System prompt for coding assistant
SYSTEM """You are an expert software engineer.
Write clean, well-documented code.
Explain your reasoning step by step.
When you're unsure, say so rather than guessing."""
```
```bash
# Build and run the custom model
ollama create my-gpt-oss -f Modelfile
ollama run my-gpt-oss
```
Multi-GPU Setup (120B)
For the 120B model on dual RTX 4090s:
```bash
# Ensure both GPUs are visible
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Pull and run; Ollama splits layers across GPUs automatically
ollama run gpt-oss:120b
```
With Agents SDK
GPT-OSS supports tool calling, making it compatible with the OpenAI Agents SDK:
```python
from agents import Agent, OpenAIChatCompletionsModel, Runner
from openai import AsyncOpenAI

# Point the Agents SDK at Ollama's OpenAI-compatible endpoint
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client but unused by Ollama
)

agent = Agent(
    name="local-agent",
    instructions="You are a helpful research assistant.",
    # Route the agent's model calls through the local client
    model=OpenAIChatCompletionsModel(model="gpt-oss:20b", openai_client=client),
)

result = Runner.run_sync(agent, "Summarize the latest AI news")
print(result.final_output)
```

Sources
- OpenAI: Introducing GPT-OSS — Official announcement
- GPT-OSS Model Card (arXiv 2508.10925) — Architecture and benchmark details
- Hugging Face: openai/gpt-oss-120b — Model weights and documentation
- GitHub: openai/gpt-oss — Official repository
- Ollama: gpt-oss — Ollama model library
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.