📅 Published: March 17, 2026 · 🔄 Last Updated: March 17, 2026 · ✓ Manually Reviewed
Tags: Apache 2.0 · MoE · OpenAI

GPT-OSS: OpenAI's First Open-Source Model

OpenAI's first open-weight release since GPT-2. Two MoE models — 20B for local hardware, 120B for production. Apache 2.0 license. Runs on Ollama.

Overall score: 92/100 (Excellent), GPT-OSS 120B

Overview

GPT-OSS marks a watershed moment: OpenAI's first open-weight model family since GPT-2, released under the permissive Apache 2.0 license. After years of keeping its frontier models closed, OpenAI has released two state-of-the-art models that anyone can download, modify, and deploy without restriction.

The family includes GPT-OSS 120B (117B total parameters, 5.1B active per token) for production-grade inference and GPT-OSS 20B (21B total parameters, 3.6B active per token) for local and edge deployment. Both use Mixture of Experts (MoE) architecture for efficient inference.

GPT-OSS 20B

  • 21B total params, 3.6B active/token
  • Runs on 16GB RAM (consumer hardware)
  • Perfect for local development
  • Matches o3-mini on common benchmarks
  • 128K context window

GPT-OSS 120B

  • 117B total params, 5.1B active/token
  • Needs 80GB GPU (A100 or multi-GPU)
  • Matches or exceeds o4-mini
  • 97.9% on AIME 2025 (competition math)
  • 128K context window
Source note: Benchmark scores from the official GPT-OSS model card (arXiv 2508.10925) and OpenAI's announcement. Independent evaluations may show different numbers.

MoE Architecture Deep Dive

GPT-OSS uses a Mixture of Experts (MoE) Transformer architecture with several key innovations:

Technical Specifications

  • Architecture: Transformer + MoE
  • Attention: Grouped multi-query (group size 8)
  • Attention pattern: Alternating dense + locally banded sparse
  • Training: RL + distillation from o3/frontier models
  • Context window: 128K tokens
  • License: Apache 2.0

The MoE design means the model activates only a fraction of its parameters per token. GPT-OSS 120B activates 5.1B of its 117B parameters — just 4.4% — giving it the inference speed of a ~5B model with the quality of a much larger one. However, all 117B parameters must fit in memory, which is why VRAM requirements remain high.
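
The active-parameter arithmetic above is easy to check:

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's parameters activated per token."""
    return active_params_b / total_params_b

# GPT-OSS 120B: 5.1B of 117B parameters active per token
print(round(active_fraction(117, 5.1) * 100, 1))  # -> 4.4 (%)

# GPT-OSS 20B: 3.6B of 21B parameters active per token
print(round(active_fraction(21, 3.6) * 100, 1))   # -> 17.1 (%)
```

Note that the 20B model activates a much larger share of its weights per token, which is typical for smaller MoE models.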

The alternating dense and sparse attention (similar to GPT-3's approach) balances quality with efficiency. Grouped multi-query attention with a group size of 8 further reduces memory usage during inference, enabling longer context windows on the same hardware.
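
The KV-cache saving from grouped multi-query attention is easy to sketch. The layer and head counts below are illustrative placeholders, not the official GPT-OSS configuration; only the group size of 8 comes from the model card:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache K and V tensors for one sequence (fp16 = 2 bytes/elem)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions: 36 layers, 64 query heads, head_dim 64, 128K context.
mha = kv_cache_bytes(36, 64, 64, 128_000)       # one KV head per query head
gqa = kv_cache_bytes(36, 64 // 8, 64, 128_000)  # group size 8 -> 8 KV heads
print(mha / gqa)  # -> 8.0: the KV cache shrinks by the group size
```

Whatever the true dimensions, the ratio holds: grouping 8 query heads per KV head cuts KV-cache memory by 8x, which is what makes 128K contexts practical on the same hardware.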

Benchmarks (GPT-OSS 120B)

Capability Profile

120B vs 20B Comparison

Benchmark            GPT-OSS 120B   GPT-OSS 20B
MMLU-Pro             90.0%          85.3%
HumanEval            88.3%          81.7%
AIME 2025            97.9%          98.7%
SWE-bench Verified   62.4%          —
GPQA Diamond         80.9%          —

Note: 20B outperforms 120B on AIME 2025 (98.7% vs 97.9%), suggesting the smaller model may have specialized training for competition math. Source: arXiv 2508.10925.

Quick Start with Ollama

Use via OpenAI-Compatible API

Ollama exposes a Chat Completions API compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Explain MoE architectures"}
    ]
)
print(response.choices[0].message.content)

Use with Open WebUI

For a ChatGPT-like interface, pair GPT-OSS with Open WebUI:

# Start Ollama with GPT-OSS
ollama serve &
ollama pull gpt-oss:20b

# Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

Hardware Requirements

GPT-OSS 20B

  • 16GB of RAM minimum (consumer hardware); ~12GB VRAM with Q4 quantization

GPT-OSS 120B

  • 80GB of GPU memory (single A100) or a multi-GPU setup (e.g. dual RTX 4090s)

Quantization Options (20B)

Quantization   VRAM    Quality Loss   Best For
Q4_K_M         ~12GB   ~3-5%          RTX 3060/4060, 16GB Macs
Q5_K_M         ~15GB   ~2-3%          RTX 4070 Ti, 32GB Macs
Q8_0           ~16GB   ~1%            RTX 4090, 32GB+ Macs
FP16           ~42GB   0%             Multi-GPU / research

VRAM estimates are rough figures: quantized weight size (total parameters × bits per weight ÷ 8) plus runtime overhead for the KV cache and activations, which grows with context length. Actual values vary. See our quantization guide for details.
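
That estimate can be sketched as a small helper. The flat 1.5GB overhead constant is an illustrative assumption, so these numbers will not match the table exactly:

```python
def estimate_vram_gb(total_params_b: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weight size plus a flat runtime overhead.

    overhead_gb is an illustrative constant; real overhead (KV cache,
    activations) grows with context length.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    return weights_gb + overhead_gb

# GPT-OSS 20B (21B total parameters):
print(round(estimate_vram_gb(21, 4.5), 1))  # Q4_K_M at ~4.5 bits/weight -> 13.3
print(round(estimate_vram_gb(21, 16), 1))   # FP16 -> 43.5
```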

vs Other Models

MMLU-Pro scores shown. All models compared are open-weight and locally runnable.

When to Choose GPT-OSS

Choose GPT-OSS 20B when: You want OpenAI-quality reasoning on consumer hardware (16GB), need Apache 2.0 for commercial use, or want the best 20B-class model available.
Choose GPT-OSS 120B when: You need near-frontier quality (matches o4-mini) without API costs, have 80GB+ VRAM, or need the best open model for agentic workflows and tool calling.
Choose Llama 3.3 70B instead when: You have 24-48GB VRAM and want a proven dense model. Llama 3.3 fits on a single RTX 4090 at Q4 and has extensive community support. See our Llama 3.3 guide.
Choose DeepSeek V3 instead when: You need the largest open MoE model (671B total parameters) for research or maximum quality. However, it requires more memory than GPT-OSS 120B.

Best Use Cases

Local AI Development

The 20B model runs on consumer hardware and exposes an OpenAI-compatible API. Perfect for developing and testing AI applications locally before deploying to production.
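
One way to keep local development and production on the same code path is to select the endpoint from the environment. The variable names below (LLM_BASE_URL, LLM_API_KEY) are illustrative, not a standard convention:

```python
import os

def llm_config(env: dict) -> tuple:
    """Choose (base_url, api_key): local Ollama by default, hosted when set."""
    base_url = env.get("LLM_BASE_URL", "http://localhost:11434/v1")
    api_key = env.get("LLM_API_KEY", "ollama")  # Ollama ignores the key
    return base_url, api_key

# Development machine (nothing set) falls back to the local endpoint:
print(llm_config({}))
# A deployment overrides both values via its environment:
print(llm_config(dict(os.environ)))
```

Pass the resulting pair to `OpenAI(base_url=..., api_key=...)` and the rest of the application code stays identical in both environments.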

Competition Math & Reasoning

97.9% on AIME 2025 makes GPT-OSS one of the strongest reasoning models. Excellent for complex analytical tasks, proofs, and multi-step problem solving.

Agentic Workflows

Strong tool-calling capabilities make GPT-OSS ideal for AI agent frameworks like CrewAI and LangGraph running fully offline.
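
A minimal sketch of the tool-calling plumbing, using an invented `get_weather` tool for illustration:

```python
import json

# Tool schema in the OpenAI Chat Completions format; the name and
# parameters here are hypothetical, not part of GPT-OSS itself.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(name: str, arguments: str) -> str:
    """Route a model tool call (name + JSON argument string) to local code."""
    args = json.loads(arguments)
    if name == "get_weather":
        return f"Sunny in {args['city']}"  # stub implementation
    raise ValueError(f"unknown tool: {name}")

# In a real loop you would pass tools=tools to
# client.chat.completions.create(...), run dispatch() on each returned
# tool call, and feed the results back as {"role": "tool", ...} messages.
print(dispatch("get_weather", '{"city": "Paris"}'))  # -> Sunny in Paris
```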

Code Generation

88.3% HumanEval puts GPT-OSS among the top coding models. Pair it with Continue.dev for a free, private coding assistant.

Privacy-Sensitive Applications

Apache 2.0 license + local execution means zero data leaves your machine. Ideal for healthcare, legal, and financial applications with compliance requirements.

Edge & On-Device

The 20B model with Q4 quantization runs in ~12GB of memory — feasible on high-end laptops, mini-PCs, and well-equipped embedded systems for on-device AI without connectivity.

Advanced Setup

Custom Modelfile

# GPT-OSS Modelfile for Ollama
FROM gpt-oss:20b

# Optimize for reasoning tasks
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768

# System prompt for coding assistant
SYSTEM """You are an expert software engineer. Write clean,
well-documented code. Explain your reasoning step by step.
When you're unsure, say so rather than guessing."""

# Build and run the custom model
ollama create my-gpt-oss -f Modelfile
ollama run my-gpt-oss

Multi-GPU Setup (120B)

For the 120B model on dual RTX 4090s:

# Ensure both GPUs are visible
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Pull and run — Ollama splits layers automatically
ollama run gpt-oss:120b

With Agents SDK

GPT-OSS supports tool calling, making it compatible with the OpenAI Agents SDK:

from agents import Agent, Runner, OpenAIChatCompletionsModel
from openai import AsyncOpenAI

# Point the SDK at the local Ollama endpoint
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

agent = Agent(
    name="local-agent",
    instructions="You are a helpful research assistant.",
    model=OpenAIChatCompletionsModel(model="gpt-oss:20b", openai_client=client),
)

result = Runner.run_sync(agent, "Summarize the latest AI news")
print(result.final_output)

Sources

  • GPT-OSS model card (arXiv 2508.10925)
  • OpenAI's GPT-OSS announcement


Written by Pattanaik Ramswarup
