📅 Published: March 17, 2026 · 🔄 Last Updated: March 17, 2026 · ✓ Manually Reviewed
Tags: Apache 2.0 · MoE · OpenAI

GPT-OSS: OpenAI's First Open-Source Model

OpenAI's first open-weight release since GPT-2. Two MoE models — 20B for local hardware, 120B for production. Apache 2.0 license. Runs on Ollama.

Overall score: 92/100 (Excellent), GPT-OSS 120B

Overview

GPT-OSS marks a watershed moment: OpenAI's first open-weight model family since GPT-2, released under the permissive Apache 2.0 license. After years of keeping its frontier models closed, OpenAI has released two state-of-the-art models that anyone can download, modify, and deploy without restriction.

The family includes GPT-OSS 120B (117B total parameters, 5.1B active per token) for production-grade inference and GPT-OSS 20B (21B total parameters, 3.6B active per token) for local and edge deployment. Both use Mixture of Experts (MoE) architecture for efficient inference.

GPT-OSS 20B

  • 21B total params, 3.6B active/token
  • Runs on 16GB RAM (consumer hardware)
  • Perfect for local development
  • Matches o3-mini on common benchmarks
  • 128K context window

GPT-OSS 120B

  • 117B total params, 5.1B active/token
  • Needs 80GB GPU (A100 or multi-GPU)
  • Matches or exceeds o4-mini
  • 97.9% on AIME 2025 (competition math)
  • 128K context window
Source note: Benchmark scores from the official GPT-OSS model card (arXiv 2508.10925) and OpenAI's announcement. Independent evaluations may show different numbers.

MoE Architecture Deep Dive

GPT-OSS uses a Mixture of Experts (MoE) Transformer architecture with several key innovations:

Technical Specifications

  • Architecture: Transformer + MoE
  • Attention: Grouped multi-query (group size 8)
  • Attention pattern: Alternating dense + locally banded sparse
  • Training: RL + distillation from o3/frontier models
  • Context window: 128K tokens
  • License: Apache 2.0

The MoE design means the model activates only a fraction of its parameters per token. GPT-OSS 120B activates 5.1B of its 117B parameters — just 4.4% — giving it the inference speed of a ~5B model with the quality of a much larger one. However, all 117B parameters must fit in memory, which is why VRAM requirements remain high.
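
The active-parameter arithmetic above is easy to check:

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's parameters activated per token."""
    return active_params_b / total_params_b

# GPT-OSS 120B: 5.1B of 117B parameters active per token
print(round(active_fraction(117, 5.1) * 100, 1))  # -> 4.4 (%)

# GPT-OSS 20B: 3.6B of 21B parameters active per token
print(round(active_fraction(21, 3.6) * 100, 1))   # -> 17.1 (%)
```

Note that the 20B model activates a much larger share of its weights per token, which is typical for smaller MoE models.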

The alternating dense and sparse attention (similar to GPT-3's approach) balances quality with efficiency. Grouped multi-query attention with a group size of 8 further reduces memory usage during inference, enabling longer context windows on the same hardware.
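
The KV-cache saving from grouped multi-query attention is easy to sketch. The layer and head counts below are illustrative placeholders, not the official GPT-OSS configuration; only the group size of 8 comes from the model card:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache K and V tensors for one sequence (fp16 = 2 bytes/elem)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions: 36 layers, 64 query heads, head_dim 64, 128K context.
mha = kv_cache_bytes(36, 64, 64, 128_000)       # one KV head per query head
gqa = kv_cache_bytes(36, 64 // 8, 64, 128_000)  # group size 8 -> 8 KV heads
print(mha / gqa)  # -> 8.0: the KV cache shrinks by the group size
```

Whatever the true dimensions, the ratio holds: grouping 8 query heads per KV head cuts KV-cache memory by 8x, which is what makes 128K contexts practical on the same hardware.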

Benchmarks (GPT-OSS 120B)

Capability Profile

120B vs 20B Comparison

Benchmark            GPT-OSS 120B   GPT-OSS 20B
MMLU-Pro             90.0%          85.3%
HumanEval            88.3%          81.7%
AIME 2025            97.9%          98.7%
SWE-bench Verified   62.4%          —
GPQA Diamond         80.9%          —

Note: 20B outperforms 120B on AIME 2025 (98.7% vs 97.9%), suggesting the smaller model may have specialized training for competition math. Source: arXiv 2508.10925.

Quick Start with Ollama

Use via OpenAI-Compatible API

Ollama exposes a Chat Completions API compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "user", "content": "Explain MoE architectures"}
    ]
)
print(response.choices[0].message.content)

Use with Open WebUI

For a ChatGPT-like interface, pair GPT-OSS with Open WebUI:

# Start Ollama with GPT-OSS
ollama serve &
ollama pull gpt-oss:20b

# Launch Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

Hardware Requirements

GPT-OSS 20B

  • 16GB of RAM minimum (consumer hardware); ~12GB VRAM with Q4 quantization

GPT-OSS 120B

  • 80GB of GPU memory (single A100) or a multi-GPU setup (e.g. dual RTX 4090s)

Quantization Options (20B)

Quantization   VRAM    Quality Loss   Best For
Q4_K_M         ~12GB   ~3-5%          RTX 3060/4060, 16GB Macs
Q5_K_M         ~15GB   ~2-3%          RTX 4070 Ti, 32GB Macs
Q8_0           ~16GB   ~1%            RTX 4090, 32GB+ Macs
FP16           ~42GB   0%             Multi-GPU / research

VRAM estimates are rough figures: quantized weight size (total parameters × bits per weight ÷ 8) plus runtime overhead for the KV cache and activations, which grows with context length. Actual values vary. See our quantization guide for details.
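
That estimate can be sketched as a small helper. The flat 1.5GB overhead constant is an illustrative assumption, so these numbers will not match the table exactly:

```python
def estimate_vram_gb(total_params_b: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weight size plus a flat runtime overhead.

    overhead_gb is an illustrative constant; real overhead (KV cache,
    activations) grows with context length.
    """
    weights_gb = total_params_b * bits_per_weight / 8
    return weights_gb + overhead_gb

# GPT-OSS 20B (21B total parameters):
print(round(estimate_vram_gb(21, 4.5), 1))  # Q4_K_M at ~4.5 bits/weight -> 13.3
print(round(estimate_vram_gb(21, 16), 1))   # FP16 -> 43.5
```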

vs Other Models

MMLU-Pro scores shown. All models compared are open-weight and locally runnable.

When to Choose GPT-OSS

Choose GPT-OSS 20B when: You want OpenAI-quality reasoning on consumer hardware (16GB), need Apache 2.0 for commercial use, or want the best 20B-class model available.
Choose GPT-OSS 120B when: You need near-frontier quality (matches o4-mini) without API costs, have 80GB+ VRAM, or need the best open model for agentic workflows and tool calling.
Choose Llama 3.3 70B instead when: You have 24-48GB VRAM and want a proven dense model. Llama 3.3 fits on a single RTX 4090 at Q4 and has extensive community support. See our Llama 3.3 guide.
Choose DeepSeek V3 instead when: You need the largest open MoE model (671B total parameters) for research or maximum quality. However, it requires more memory than GPT-OSS 120B.

Best Use Cases

Local AI Development

The 20B model runs on consumer hardware and exposes an OpenAI-compatible API. Perfect for developing and testing AI applications locally before deploying to production.
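
One way to keep local development and production on the same code path is to select the endpoint from the environment. The variable names below (LLM_BASE_URL, LLM_API_KEY) are illustrative, not a standard convention:

```python
import os

def llm_config(env: dict) -> tuple:
    """Choose (base_url, api_key): local Ollama by default, hosted when set."""
    base_url = env.get("LLM_BASE_URL", "http://localhost:11434/v1")
    api_key = env.get("LLM_API_KEY", "ollama")  # Ollama ignores the key
    return base_url, api_key

# Development machine (nothing set) falls back to the local endpoint:
print(llm_config({}))
# A deployment overrides both values via its environment:
print(llm_config(dict(os.environ)))
```

Pass the resulting pair to `OpenAI(base_url=..., api_key=...)` and the rest of the application code stays identical in both environments.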

Competition Math & Reasoning

97.9% on AIME 2025 makes GPT-OSS one of the strongest reasoning models. Excellent for complex analytical tasks, proofs, and multi-step problem solving.

Agentic Workflows

Strong tool-calling capabilities make GPT-OSS ideal for AI agent frameworks like CrewAI and LangGraph running fully offline.
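
A minimal sketch of the tool-calling plumbing, using an invented `get_weather` tool for illustration:

```python
import json

# Tool schema in the OpenAI Chat Completions format; the name and
# parameters here are hypothetical, not part of GPT-OSS itself.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(name: str, arguments: str) -> str:
    """Route a model tool call (name + JSON argument string) to local code."""
    args = json.loads(arguments)
    if name == "get_weather":
        return f"Sunny in {args['city']}"  # stub implementation
    raise ValueError(f"unknown tool: {name}")

# In a real loop you would pass tools=tools to
# client.chat.completions.create(...), run dispatch() on each returned
# tool call, and feed the results back as {"role": "tool", ...} messages.
print(dispatch("get_weather", '{"city": "Paris"}'))  # -> Sunny in Paris
```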

Code Generation

88.3% HumanEval puts GPT-OSS among the top coding models. Pair it with Continue.dev for a free, private coding assistant.

Privacy-Sensitive Applications

Apache 2.0 license + local execution means zero data leaves your machine. Ideal for healthcare, legal, and financial applications with compliance requirements.

Edge & On-Device

The 20B model with Q4 quantization runs in ~12GB of memory — feasible on high-end laptops, mini-PCs, and well-equipped embedded systems for on-device AI without connectivity.

Advanced Setup

Custom Modelfile

# GPT-OSS Modelfile for Ollama
FROM gpt-oss:20b

# Optimize for reasoning tasks
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768

# System prompt for coding assistant
SYSTEM """You are an expert software engineer. Write clean,
well-documented code. Explain your reasoning step by step.
When you're unsure, say so rather than guessing."""

# Build and run the custom model
ollama create my-gpt-oss -f Modelfile
ollama run my-gpt-oss

Multi-GPU Setup (120B)

For the 120B model on dual RTX 4090s:

# Ensure both GPUs are visible
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Pull and run — Ollama splits layers automatically
ollama run gpt-oss:120b

With Agents SDK

GPT-OSS supports tool calling, making it compatible with the OpenAI Agents SDK:

from agents import Agent, Runner, OpenAIChatCompletionsModel
from openai import AsyncOpenAI

# Point the SDK at the local Ollama endpoint
client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

agent = Agent(
    name="local-agent",
    instructions="You are a helpful research assistant.",
    model=OpenAIChatCompletionsModel(model="gpt-oss:20b", openai_client=client),
)

result = Runner.run_sync(agent, "Summarize the latest AI news")
print(result.final_output)

Sources

  • GPT-OSS model card (arXiv 2508.10925)
  • OpenAI's GPT-OSS announcement


Written by Pattanaik Ramswarup
