GPT-OSS: OpenAI's First Open-Source Model
OpenAI's historic open-source release. Two MoE models — 20B for local hardware, 120B for production. Apache 2.0 license. Runs on Ollama.
Overview
GPT-OSS marks a watershed moment: OpenAI's first open-weight model family since GPT-2, released under the permissive Apache 2.0 license. After years of closed releases, OpenAI published two strong models that anyone can download, modify, and deploy, including commercially.
The family includes GPT-OSS 120B (117B total parameters, 5.1B active per token) for production-grade inference and GPT-OSS 20B (21B total parameters, 3.6B active per token) for local and edge deployment. Both use Mixture of Experts (MoE) architecture for efficient inference.
GPT-OSS 20B
- 21B total params, 3.6B active/token
- Runs on 16GB RAM (consumer hardware)
- Perfect for local development
- Matches o3-mini on common benchmarks
- 128K context window
GPT-OSS 120B
- 117B total params, 5.1B active/token
- Needs 80GB GPU (A100 or multi-GPU)
- Matches or exceeds o4-mini
- 97.9% on AIME 2025 (competition math)
- 128K context window
MoE Architecture Deep Dive
GPT-OSS uses a Mixture of Experts (MoE) Transformer architecture with several key innovations:
Technical Specifications
The MoE design means the model activates only a fraction of its parameters per token. GPT-OSS 120B activates 5.1B of its 117B parameters — just 4.4% — giving it the inference speed of a ~5B model with the quality of a much larger one. However, all 117B parameters must fit in memory, which is why VRAM requirements remain high.
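The active-parameter arithmetic above is easy to verify directly, using the published totals for both models:

```python
# Published parameter counts (from the GPT-OSS model card).
models = {
    "gpt-oss-120b": {"total_b": 117.0, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 21.0, "active_b": 3.6},
}

for name, p in models.items():
    # Fraction of weights that participate in any single forward pass.
    ratio = p["active_b"] / p["total_b"]
    print(f"{name}: {ratio:.1%} of parameters active per token")
```

The 120B model activates about 4.4% of its weights per token; the 20B model, with fewer experts to spread the load across, activates about 17%.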
The alternating dense and sparse attention (similar to GPT-3's approach) balances quality with efficiency. Grouped multi-query attention with a group size of 8 further reduces memory usage during inference, enabling longer context windows on the same hardware.
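The memory effect of grouped-query attention can be estimated with a short calculation. The layer count and head dimensions below are illustrative assumptions, not published per-layer specs; only the group size of 8 comes from the architecture description:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, stored in fp16 (2 bytes each).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical transformer shape for illustration: 36 layers,
# 64 query heads, head dimension 64, full 128K context.
n_layers, n_q_heads, head_dim, seq_len = 36, 64, 64, 128_000
group_size = 8                        # from the GPT-OSS architecture
n_kv_heads = n_q_heads // group_size  # 8 KV heads shared by 64 query heads

mha = kv_cache_bytes(n_layers, n_q_heads, head_dim, seq_len)
gqa = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len)
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB ({mha // gqa}x smaller)")
```

Whatever the exact layer shapes, the KV cache shrinks by the group size, which is what makes long contexts affordable on the same hardware.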
Benchmarks (GPT-OSS 120B)
120B vs 20B Comparison
| Benchmark | GPT-OSS 120B | GPT-OSS 20B |
|---|---|---|
| MMLU-Pro | 90.0% | 85.3% |
| HumanEval | 88.3% | 81.7% |
| AIME 2025 | 97.9% | 98.7% |
| SWE-bench Verified | 62.4% | — |
| GPQA Diamond | 80.9% | — |
Note: the 20B edges out the 120B on AIME 2025 (98.7% vs 97.9%); at these near-saturated scores the gap is small enough to fall within run-to-run variance. Source: arXiv 2508.10925.
Quick Start with Ollama
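With Ollama installed, getting the 20B model running takes two commands (model tags as listed in the Ollama library):

```shell
# Download the 20B model
ollama pull gpt-oss:20b

# Start an interactive chat session
ollama run gpt-oss:20b

# The 120B variant uses the same workflow (needs ~80GB of GPU memory):
# ollama pull gpt-oss:120b
```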
Use via OpenAI-Compatible API
Ollama exposes a Chat Completions API compatible with the OpenAI SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Explain MoE architectures"}
    ],
)

print(response.choices[0].message.content)
```

Use with Open WebUI
For a ChatGPT-like interface, pair GPT-OSS with Open WebUI:
```shell
# Start Ollama with GPT-OSS
ollama serve &
ollama pull gpt-oss:20b

# Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
```
Hardware Requirements
GPT-OSS 20B
- 16GB of RAM minimum; runs on consumer GPUs and 16GB+ Macs (see the quantization table below)
GPT-OSS 120B
- ~80GB of GPU memory: a single 80GB A100/H100 or a multi-GPU split
Quantization Options (20B)
| Quantization | VRAM | Quality Loss | Best For |
|---|---|---|---|
| Q4_K_M | ~12GB | ~3-5% | RTX 3060/4060, 16GB Macs |
| Q5_K_M | ~15GB | ~2-3% | RTX 4070 Ti, 32GB Macs |
| Q8_0 | ~16GB | ~1% | RTX 4090, 32GB+ Macs |
| FP16 | ~42GB | 0% | Multi-GPU / research |
VRAM estimates based on standard MoE quantization formulas. Actual values may vary. See our quantization guide for details.
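The table values can be sanity-checked with the standard bits-per-weight arithmetic. The effective bits-per-weight figures below are typical for these quantization formats (assumptions for illustration, not published numbers), and KV cache plus runtime overhead are excluded:

```python
TOTAL_PARAMS = 21e9  # GPT-OSS 20B

# Effective bits per weight: typical values for these formats
# (assumed for illustration, not official figures).
quants = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "FP16": 16.0}

for name, bits in quants.items():
    weight_gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{name}: ~{weight_gb:.1f} GB of weights alone")
```

Weight size is only a floor: the KV cache and runtime buffers add on top, so treat both the table and this arithmetic as estimates.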
vs Other Models
(Comparison chart: MMLU-Pro scores. All models compared are open-weight and locally runnable.)
When to Choose GPT-OSS
Best Use Cases
Local AI Development
The 20B model runs on consumer hardware and exposes an OpenAI-compatible API. Perfect for developing and testing AI applications locally before deploying to production.
Competition Math & Reasoning
97.9% on AIME 2025 makes GPT-OSS one of the strongest reasoning models. Excellent for complex analytical tasks, proofs, and multi-step problem solving.
Agentic Workflows
Strong tool-calling capabilities make GPT-OSS ideal for AI agent frameworks like CrewAI and LangGraph running fully offline.
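Tool calls go through the same OpenAI-compatible endpoint: you declare functions in the `tools` parameter and the model responds with structured call requests. The sketch below only builds the tool schema; the function name, description, and parameters are made-up examples, and wiring it into `client.chat.completions.create(..., tools=tools)` follows the standard OpenAI pattern:

```python
import json

# Hypothetical tool schema: the function name, description, and
# parameters below are illustrative, not part of the GPT-OSS release.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# The schema is plain JSON, so it can be inspected before being sent
# as the `tools` parameter of a chat completion request.
print(json.dumps(tools, indent=2))
```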
Code Generation
An 88.3% HumanEval score puts GPT-OSS among the top open coding models. Pair it with Continue.dev for a free, private coding assistant.
Privacy-Sensitive Applications
Apache 2.0 license + local execution means zero data leaves your machine. Ideal for healthcare, legal, and financial applications with compliance requirements.
Edge & On-Device
The 20B model with Q4 quantization fits in roughly 12GB of memory, bringing on-device AI without connectivity within reach of high-end tablets, mini PCs, and embedded systems.
Advanced Setup
Custom Modelfile
```
# GPT-OSS Modelfile for Ollama
FROM gpt-oss:20b

# Optimize for reasoning tasks
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768

# System prompt for coding assistant
SYSTEM """You are an expert software engineer.
Write clean, well-documented code.
Explain your reasoning step by step.
When you're unsure, say so rather than guessing."""
```

```shell
# Build and run the custom model
ollama create my-gpt-oss -f Modelfile
ollama run my-gpt-oss
```
Multi-GPU Setup (120B)
For the 120B model on dual RTX 4090s:
```shell
# Ensure both GPUs are visible
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Pull and run; Ollama splits layers across GPUs automatically
ollama run gpt-oss:120b
```
With Agents SDK
GPT-OSS supports tool calling, making it compatible with the OpenAI Agents SDK:
```python
from agents import Agent, Runner, OpenAIChatCompletionsModel
from openai import AsyncOpenAI

# Point the Agents SDK at the local Ollama endpoint
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

agent = Agent(
    name="local-agent",
    model=OpenAIChatCompletionsModel(model="gpt-oss:20b", openai_client=client),
    instructions="You are a helpful research assistant.",
)

result = Runner.run_sync(agent, "Summarize the latest AI news")
print(result.final_output)
```

Sources
- OpenAI: Introducing GPT-OSS — Official announcement
- GPT-OSS Model Card (arXiv 2508.10925) — Architecture and benchmark details
- Hugging Face: openai/gpt-oss-120b — Model weights and documentation
- GitHub: openai/gpt-oss — Official repository
- Ollama: gpt-oss — Ollama model library
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.