Published: March 17, 2026 · Last Updated: March 17, 2026
Tags: Apache 2.0 · MoE · OpenAI

GPT-OSS: OpenAI's First Open-Weight Models Since GPT-2

OpenAI's first open-weight release since GPT-2. Two MoE models — 20B for local hardware, 120B for production. Apache 2.0 license. Runs on Ollama.

GPT-OSS 120B overall rating: 92 (Excellent)

Overview

GPT-OSS marks a watershed moment: OpenAI's first open-weight model family since GPT-2, released under the permissive Apache 2.0 license. After years of keeping its frontier models closed, OpenAI released two highly capable models that anyone can download, fine-tune, and deploy, including commercially.

The family includes GPT-OSS 120B (117B total parameters, 5.1B active per token) for production-grade inference and GPT-OSS 20B (21B total parameters, 3.6B active per token) for local and edge deployment. Both use a Mixture of Experts (MoE) architecture for efficient inference.

GPT-OSS 20B

  • 21B total params, 3.6B active/token
  • Runs on 16GB RAM (consumer hardware)
  • Perfect for local development
  • Matches o3-mini on common benchmarks
  • 128K context window

GPT-OSS 120B

  • 117B total params, 5.1B active/token
  • Needs 80GB GPU (A100 or multi-GPU)
  • Near-parity with o4-mini on core benchmarks
  • 97.9% on AIME 2025 (competition math)
  • 128K context window
Source note: Benchmark scores from the official GPT-OSS model card (arXiv 2508.10925) and OpenAI's announcement. Independent evaluations may show different numbers.

MoE Architecture Deep Dive

GPT-OSS uses a Mixture of Experts (MoE) Transformer architecture with several key innovations:

Technical Specifications

Architecture: Transformer + MoE
Attention: Grouped multi-query (group size 8)
Attention pattern: Alternating dense + locally banded sparse
Training: RL + distillation from o3/frontier
Context window: 128K tokens
License: Apache 2.0

The MoE design means the model activates only a fraction of its parameters per token. GPT-OSS 120B activates 5.1B of its 117B parameters — just 4.4% — giving it the inference speed of a ~5B model with the quality of a much larger one. However, all 117B parameters must fit in memory, which is why VRAM requirements remain high.
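
To make that arithmetic concrete, here is a quick sketch. It is a simplification: real speedups also depend on routing overhead and memory bandwidth.

def moe_profile(total_params_b, active_params_b):
    # Fraction of weights that participate in each token's forward pass
    active_fraction = active_params_b / total_params_b
    # All weights must still be resident in memory (2 bytes/param at FP16)
    weight_memory_fp16_gb = total_params_b * 2
    return active_fraction, weight_memory_fp16_gb

print(moe_profile(117, 5.1))  # (0.0436..., 234) -> ~4.4% active, 234GB at FP16
print(moe_profile(21, 3.6))   # (0.1714..., 42)  -> ~17% active, 42GB at FP16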

The alternating dense and sparse attention (similar to GPT-3's approach) balances quality with efficiency. Grouped multi-query attention with a group size of 8 further reduces memory usage during inference, enabling longer context windows on the same hardware.
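
As a rough illustration of why grouped attention matters at long context, here is a KV-cache estimate. The dimensions below are placeholder values for illustration, not GPT-OSS's published configuration, and the sketch ignores the banded sparse layers, which shrink the cache further:

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    # K and V tensors per layer: 2 * kv_heads * head_dim * context * bytes
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Hypothetical model: 36 layers, 64 query heads, head_dim 64, 128K context
print(kv_cache_gb(36, 64, 64, 131072))       # ~77.3GB if every head kept its own K/V
print(kv_cache_gb(36, 64 // 8, 64, 131072))  # ~9.7GB with group size 8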

Benchmarks


120B vs 20B Comparison

| Benchmark | GPT-OSS 120B | GPT-OSS 20B |
|---|---|---|
| MMLU-Pro | 90.0% | 85.3% |
| HumanEval | 88.3% | 81.7% |
| AIME 2025 | 97.9% | 98.7% |
| SWE-bench Verified | 62.4% | n/a |
| GPQA Diamond | 80.9% | n/a |

Note: 20B outperforms 120B on AIME 2025 (98.7% vs 97.9%), suggesting the smaller model may have specialized training for competition math. Source: arXiv 2508.10925.

Quick Start with Ollama
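
If you already have Ollama installed, two commands download the 20B model and open an interactive chat:

# Fetch the weights, then start chatting in the terminal
ollama pull gpt-oss:20b
ollama run gpt-oss:20b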

Use via OpenAI-Compatible API

Ollama exposes a Chat Completions API compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Explain MoE architectures"}
    ]
)
print(response.choices[0].message.content)
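
For interactive use you can stream tokens as they arrive; Ollama's endpoint supports the SDK's standard streaming interface:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain MoE architectures"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)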

Use with Open WebUI

For a ChatGPT-like interface, pair GPT-OSS with Open WebUI:

# Start Ollama with GPT-OSS
ollama serve &
ollama pull gpt-oss:20b

# Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main
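
Once the container is running, open http://localhost:3000 in your browser, create the initial admin account, and pick gpt-oss:20b from the model selector.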

Hardware Requirements

GPT-OSS 20B

  • 16GB of RAM or VRAM; Q4_K_M quantization fits in ~12GB (see table below)

GPT-OSS 120B

  • 80GB of VRAM: a single A100/H100, or a multi-GPU setup such as dual RTX 4090s (with some CPU offload)

Quantization Options (20B)

| Quantization | VRAM | Quality Loss | Best For |
|---|---|---|---|
| Q4_K_M | ~12GB | ~3-5% | RTX 3060/4060, 16GB Macs |
| Q5_K_M | ~15GB | ~2-3% | RTX 4070 Ti, 32GB Macs |
| Q8_0 | ~16GB | ~1% | RTX 4090, 32GB+ Macs |
| FP16 | ~42GB | 0% | Multi-GPU / research |

VRAM estimates based on standard MoE quantization formulas. Actual values may vary. See our quantization guide for details.
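
These estimates line up with a simple weights-only approximation. A minimal sketch (it ignores KV cache and runtime overhead, so treat the results as lower bounds):

def weight_memory_gb(params_b, bits_per_weight):
    # billions of params * bits per weight / 8 bits per byte = GB of raw weights
    return params_b * bits_per_weight / 8

print(weight_memory_gb(21, 4.5))  # ~11.8GB, matching the ~12GB Q4_K_M row
print(weight_memory_gb(21, 16))   # 42.0GB, matching the FP16 row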

vs Other Models

[Comparison chart: MMLU-Pro scores. All models compared are open-weight and locally runnable.]

When to Choose GPT-OSS

Choose GPT-OSS 20B when: You want OpenAI-quality reasoning on consumer hardware (16GB), need Apache 2.0 for commercial use, or want the best 20B-class model available.
Choose GPT-OSS 120B when: You need near-frontier quality (near-parity with o4-mini) without API costs, have 80GB+ VRAM, or need the best open model for agentic workflows and tool calling.
Choose Llama 3.3 70B instead when: You have 24-48GB VRAM and want a proven dense model. Llama 3.3 fits on a single RTX 4090 at Q4 and has extensive community support. See our Llama 3.3 guide.
Choose DeepSeek V3 instead when: You need the largest open MoE model (671B total parameters) for research or maximum quality, and can accept substantially higher memory requirements than GPT-OSS 120B.

Best Use Cases

Local AI Development

The 20B model runs on consumer hardware and exposes an OpenAI-compatible API. Perfect for developing and testing AI applications locally before deploying to production.
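
One pattern worth knowing: the OpenAI Python SDK reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment, so the same application code can target local Ollama in development and a hosted endpoint in production:

import os
from openai import OpenAI

# Development defaults; in production, point these env vars at your hosted endpoint
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:11434/v1")
os.environ.setdefault("OPENAI_API_KEY", "ollama")

client = OpenAI()  # picks up base URL and key from the environment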

Competition Math & Reasoning

97.9% on AIME 2025 makes GPT-OSS one of the strongest reasoning models. Excellent for complex analytical tasks, proofs, and multi-step problem solving.

Agentic Workflows

Strong tool-calling capabilities make GPT-OSS ideal for AI agent frameworks like CrewAI and LangGraph running fully offline.
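
A minimal sketch of what that looks like over the same OpenAI-compatible endpoint. The get_weather tool is a made-up example, and tool-call quality depends on the model and Ollama version:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hypothetical tool, described with the standard function-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# Instead of prose, the model returns a structured call to get_weather
print(response.choices[0].message.tool_calls)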

Code Generation

88.3% HumanEval puts GPT-OSS among the top coding models. Pair it with Continue.dev for a free, private coding assistant.

Privacy-Sensitive Applications

Apache 2.0 license + local execution means zero data leaves your machine. Ideal for healthcare, legal, and financial applications with compliance requirements.

Edge & On-Device

The 20B model with Q4 quantization runs in about 12GB of memory, which is feasible on laptops, mini-PCs, and high-memory embedded boards for on-device AI without connectivity. Phones and tablets remain out of reach for now.

Advanced Setup

Custom Modelfile

# GPT-OSS Modelfile for Ollama
FROM gpt-oss:20b

# Optimize for reasoning tasks
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768

# System prompt for coding assistant
SYSTEM """You are an expert software engineer. Write clean,
well-documented code. Explain your reasoning step by step.
When you're unsure, say so rather than guessing."""

Build and run the custom model:

ollama create my-gpt-oss -f Modelfile
ollama run my-gpt-oss
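
Ollama also serves the custom model through its OpenAI-compatible API under the name you gave it, so the earlier Python example works unchanged with model="my-gpt-oss".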

Multi-GPU Setup (120B)

For the 120B model on dual RTX 4090s (48GB combined; layers that do not fit in VRAM spill to system RAM, so expect reduced throughput):

# Ensure both GPUs are visible
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Pull and run — Ollama splits layers automatically
ollama run gpt-oss:120b

With Agents SDK

GPT-OSS supports tool calling, making it compatible with the OpenAI Agents SDK. Point the SDK at the local endpoint by wrapping the Ollama client in a chat-completions model:

from agents import Agent, Runner, OpenAIChatCompletionsModel, set_tracing_disabled
from openai import AsyncOpenAI

# Async client pointed at the local Ollama endpoint
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)
set_tracing_disabled(True)  # tracing would otherwise call the OpenAI API

agent = Agent(
    name="local-agent",
    # Wrap the local client so the agent uses it instead of the default OpenAI backend
    model=OpenAIChatCompletionsModel(model="gpt-oss:20b", openai_client=client),
    instructions="You are a helpful research assistant."
)

result = Runner.run_sync(agent, "Summarize the latest AI news")
print(result.final_output)

Sources

  • GPT-OSS model card, arXiv 2508.10925
  • OpenAI, "Introducing gpt-oss" (announcement post)


Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
