
NVIDIA Nemotron 3 Ultra: The 550B Open-Weights MoE, Reviewed

Name: NVIDIA Nemotron 3 Ultra
Author: NVIDIA

NVIDIA rounded out the Nemotron 3 family at Computex 2026 with Nemotron 3 Ultra, a 550-billion-parameter Mixture-of-Experts model that activates only 55B parameters per token. It is built on a Hybrid Mamba-Attention architecture with LatentMoE routing and Multi-Token Prediction. It scores roughly 48 on the Artificial Analysis Intelligence Index, which NVIDIA positions as the leading result among US open-weights models, and the weights ship openly under the OpenMDW-1.1 license. This page covers all three tiers (Nano, Super, and Ultra, plus the multimodal Nano Omni), the architecture, the benchmark claims, and what it actually takes to run them yourself.

Published: June 20, 2026 · Last updated: June 20, 2026 · Manually reviewed

Good news for self-hosters: unlike most frontier-tier models, Nemotron 3 is open weights. Ultra's checkpoints are on Hugging Face (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16), and NVIDIA released training recipes and datasets too. See "Running Nemotron 3 locally" below for the hardware reality check.

Key takeaways

→Ultra = 550B total / 55B active — a sparse MoE, so it computes like a ~55B model but stores like a 550B one.
→Hybrid Mamba-Attention + LatentMoE + MTP — Mamba-2 for cheap long context, attention for recall, Multi-Token Prediction for native speculative decoding.
→~48 on the Artificial Analysis Intelligence Index — NVIDIA reports it as the leading US open-weights model on that index.
→1M-token context and, per a pre-release DeepInfra endpoint, 300+ tok/s in BF16.
→Open weights under OpenMDW-1.1 (a permissive Linux Foundation AI model license) — commercial use allowed.

Quick verdict

Nemotron 3 Ultra is the most capable open-weights model to come out of a US lab in this cycle, and the headline number — ~48 on the Artificial Analysis Intelligence Index — puts it well clear of the smaller open models (NVIDIA cites gpt-oss-120b at ~33 and its own Super at ~36 for context). The cleverness is in the architecture: a sparse 550B/55B-active MoE means it reasons at frontier quality while only paying the compute cost of a ~55B model per token, and the Mamba-Attention hybrid keeps the 1M-token context affordable.

The honest caveat is scale. "Open weights" does not mean "runs on your laptop": at 550B parameters, Ultra is a server-class model. The tier most people will actually self-host is Nano (30B total, ~3B active) or the multimodal Nano Omni, with Super (120B / ~12B active) for mid-range boxes. If you want a frontier-class open model you can run on a workstation, look at DeepSeek V4 or GLM-5 alongside this. NVIDIA's claim here is "leading US open weights," not necessarily the best open model globally.

Nemotron 3 Ultra specs at a glance

Vendor	NVIDIA
Ultra launch	Computex 2026 (weights open-sourced June 2026)
Total parameters	550 billion (Mixture-of-Experts)
Active parameters / token	55 billion
Architecture	Hybrid Mamba-Attention MoE · LatentMoE routing · MTP layers
Reported layout	108 layers · model dim 8,192 · 512 experts (top-22 active)
Context window	1,000,000 tokens
Pretraining scale	~20 trillion text tokens (NVIDIA-reported)
AA Intelligence Index	~48 (NVIDIA / Artificial Analysis-reported)
License	OpenMDW-1.1 (open weights, commercial use)
Local self-hostable?	Yes — but server-class hardware (see below)
Weights	`nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16` (Hugging Face)
Access	Hugging Face · NVIDIA NIM / build.nvidia.com · OpenRouter · DeepInfra

Sources: NVIDIA Newsroom "Nemotron 3 family" announcement, the NVIDIA Nemotron 3 Ultra Technical Report (for the 108-layer / 8,192-dim / 512-expert / top-22 details and the ~20T-token pretraining), and the Hugging Face model card. NVIDIA's December 2025 family press release rounds the lineup to "about 500B / up to 50B active"; the model card and marketed figures for Ultra are 550B / 55B active.

The architecture: why it's fast

Nemotron 3 Ultra is not a vanilla transformer. It stacks three ideas, each aimed at making a 550B model cheap to run:

Hybrid Mamba-Attention

Mamba-2 state-space layers carry the long sequence with sub-quadratic scaling — that's what keeps the 1M-token context from blowing up in cost — while selective attention layers preserve precise recall where it matters. The hybrid is the reason long-context throughput stays high.
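To make the scaling argument concrete, here is a toy sketch, not Mamba-2 itself. A one-dimensional linear recurrence stands in for a state-space layer, next to a count of the pairwise scores that causal attention has to compute. The function names and the decay parameters `a` and `b` are illustrative assumptions; real Mamba-2 layers use learned, input-dependent parameters and a much larger state.

```python
def ssm_scan(xs, a=0.9, b=0.1):
    """Toy linear state-space recurrence: h_t = a * h_{t-1} + b * x_t.

    Each token triggers one fixed-size state update, so total work grows
    linearly with sequence length. The values of a and b are arbitrary.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs


def ssm_steps(n_tokens):
    """Number of state updates the recurrence performs: one per token."""
    return n_tokens


def causal_attention_scores(n_tokens):
    """Query-key scores in full causal attention.

    Token t attends to tokens 0..t, giving n * (n + 1) / 2 scores in total.
    """
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        ratio = causal_attention_scores(n) / ssm_steps(n)
        print(f"{n:>9} tokens: attention does {ratio:,.1f}x more pairwise work")
```

The gap widens in proportion to sequence length, which is why a hybrid that uses attention in only some layers can keep very long contexts affordable.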

LatentMoE routing

Of 512 experts, only the top 22 fire per token. LatentMoE routes in a compressed latent space (NVIDIA describes this as trading hidden-dimension width for better accuracy per parameter), which is how Ultra stores 550B parameters while activating just 55B per forward pass.
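The general idea of top-k expert routing can be sketched in a few lines. This is generic top-k gating, not NVIDIA's LatentMoE implementation: the latent-space compression is omitted, and `route_top_k`, `moe_forward`, and the scalar "experts" are hypothetical names chosen for illustration.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum over the selected experts only; the rest never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


if __name__ == "__main__":
    # Four toy experts that just scale their input.
    experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.0, 2.0, 1.0, -1.0]
    print(route_top_k(logits, k=2))
    print(moe_forward(10.0, experts, logits, k=2))
    # With the reported layout, 22 of 512 experts run per token.
    print(f"{22 / 512:.1%} of experts active per token")
```

Every expert still has to be resident in memory, because the router can pick a different subset for each token.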

Multi-Token Prediction (MTP)

MTP heads predict several future tokens in one forward pass, giving Ultra native speculative decoding — no separate draft model required. On a pre-release DeepInfra endpoint, this helped it serve 300+ tokens/sec in BF16.
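The accept-or-correct loop behind speculative decoding can be sketched as follows. This is a simplified greedy version under stated assumptions: `target_next` is a hypothetical stand-in for the main model's next-token choice, the drafted tokens would come from the MTP heads in Ultra's case, and a real server verifies all drafted positions in one batched forward pass rather than calling the model once per token.

```python
def speculative_step(prefix, draft_tokens, target_next):
    """Verify drafted tokens against the main model's greedy choices.

    Drafted tokens are accepted while they match what the main model
    would have emitted. The first mismatch is replaced by the main
    model's token and the rest of the draft is discarded. If every
    drafted token matches, one extra token from the main model is added.
    """
    seq = list(prefix)
    emitted = []
    for tok in draft_tokens:
        expected = target_next(seq)
        if tok != expected:
            emitted.append(expected)  # correction for the first wrong guess
            return emitted
        emitted.append(tok)
        seq.append(tok)
    emitted.append(target_next(seq))  # bonus token after a fully accepted draft
    return emitted


if __name__ == "__main__":
    # Toy "model": the next token is always the previous token plus one, mod 10.
    def toy_target(seq):
        return (seq[-1] + 1) % 10

    print(speculative_step([1, 2, 3], [4, 5, 9], toy_target))  # [4, 5, 6]
    print(speculative_step([1, 2, 3], [4, 5, 6], toy_target))  # [4, 5, 6, 7]
```

In both cases the emitted tokens match what plain greedy decoding would have produced, so the speedup comes from emitting several tokens per verification step, not from changing the output.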

The full Nemotron 3 family

Nemotron 3 ships in three sizes, plus a multimodal Nano variant. The smaller you go, the more realistically you can self-host it:

Tier	Params (total / active)	When	Best for
Nano	30B / ~3B	Dec 2025	Light, efficient tasks; 1M context; most self-hostable
Nano Omni	30B / ~3B (multimodal)	~Apr 28–29, 2026	Open vision + audio + language in one model; document/video/audio agents
Super	120B / ~12B	Mar 2026	Mid-range enterprise reasoning; ~36 AA Intelligence Index
Ultra	550B / 55B	Computex 2026	Frontier reasoning + long-running agents; ~48 AA Intelligence Index

Nano Omni is the interesting one for builders: it's an open multimodal model that takes text, images, audio, video, documents, charts, and GUI screenshots as input on a single 30B-A3B hybrid MoE, and NVIDIA reports up to ~9× higher throughput than other open multimodal models. It collapses a typical vision + ASR + LLM stack into one model — handy for agents that need to "see and hear" without three separate services.

Sources: NVIDIA Newsroom and NVIDIA Technical/AI blogs (Nano Omni, Super, Ultra). Parameter figures vary slightly between NVIDIA's rounded press copy and the Hugging Face repo names; we use the repo/marketed numbers where they differ.

Benchmarks & throughput

Most headline numbers here are NVIDIA's own claims or single-source third-party measurements, so treat them as vendor-reported until independent trackers settle. Where that's the case, we've flagged it.

Metric	Nemotron 3 Ultra	Notes
Artificial Analysis Intelligence Index	~48	NVIDIA cites it as leading US open-weights; vs Super ~36, gpt-oss-120b ~33.
Throughput (BF16)	300+ tok/s	Measured on a pre-release DeepInfra endpoint — single source.
Decode-heavy speedup	~5.9×	NVIDIA-reported vs GLM-5.1 on an 8K-in / 64K-out workload.
Context window	1,000,000	Tokens; Mamba-hybrid keeps long-context cost down.
Pretraining tokens	~20T	NVIDIA-reported text-token count.

Sources: NVIDIA Newsroom / Technical Blog, Artificial Analysis Intelligence Index, MarkTechPost architecture write-up, and DeepInfra's Nemotron 3 Ultra release post. The throughput figure comes from DeepInfra's pre-release endpoint and the speedup figure from NVIDIA; verify both against independent benchmarks before relying on them.
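For a rough sense of what these figures mean in wall-clock terms, the sketch below converts a throughput into generation time for the 64K-token output of the decode-heavy workload. It is illustrative arithmetic only: it assumes 64K means 65,536 tokens, ignores prefill time for the 8K input, and borrows the 300 tok/s DeepInfra figure, which was not measured on that workload.

```python
def decode_seconds(output_tokens, tokens_per_second):
    """Time to generate output_tokens at a steady decode rate, in seconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    OUTPUT_TOKENS = 64 * 1024  # assumption: "64K" read as 65,536 tokens
    secs = decode_seconds(OUTPUT_TOKENS, 300)
    print(f"{secs:.0f} s (~{secs / 60:.1f} min)")  # 218 s (~3.6 min)
```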

Running Nemotron 3 locally — the hardware reality

This is the part the launch coverage glosses over. "Open weights" is real and valuable — you can download, fine-tune, and self-host under OpenMDW-1.1 — but the three tiers live in very different worlds:

Tier	Rough memory to serve	Realistic where
Nano (30B / 3B)	~18–24 GB at 4-bit	A single 24 GB GPU such as the RTX 4090 (or the 32 GB RTX 5090), or a 32 GB+ Apple Silicon Mac
Super (120B / 12B)	~60–80 GB at 4-bit	2× 48 GB GPUs, an H100/H200, or a 128 GB unified-memory Mac
Ultra (550B / 55B)	Hundreds of GB even quantized	Multi-GPU server / data-center node — not a workstation model

These are ballpark figures, not NVIDIA specs. MoE models activate only a slice of their weights per token, but you still have to hold all of them in memory, so the total parameter count drives the VRAM you need. For most individuals and small teams, the practical Nemotron 3 to run at home is Nano or Nano Omni. Ultra is something you rent on a NIM, DeepInfra, or OpenRouter endpoint; self-hosting it makes sense mainly when data residency demands it.
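The weights-only floor behind those ballparks is simple arithmetic: total parameters times bits per weight. The sketch below computes it for each tier, taking 1 GB as 10^9 bytes. It deliberately leaves out the KV cache, Mamba state, activations, and quantization metadata, which is why the serving estimates in the table sit above these numbers.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Uses total parameters, not active ones: every expert must be resident
    even though only some run per token.
    """
    return total_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    tiers = {"Nano": 30, "Super": 120, "Ultra": 550}  # total params, billions
    for name, params in tiers.items():
        four_bit = weight_memory_gb(params, 4)
        bf16 = weight_memory_gb(params, 16)
        print(f"{name:<5} 4-bit: {four_bit:>6.1f} GB   BF16: {bf16:>7.1f} GB")
```

Ultra comes out at 275 GB for 4-bit weights and 1,100 GB in BF16, before any runtime overhead.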

How to choose a tier

If you want…	Pick	Why
A model you can actually run on one GPU	Nemotron 3 Nano	30B/3B-active, 1M context, fits a 24 GB card at 4-bit.
Vision + audio + text in one open model	Nemotron 3 Nano Omni	Unified multimodal; replaces a vision+ASR+LLM stack.
Mid-range enterprise reasoning on prem	Nemotron 3 Super	120B / ~12B active; ~36 AA index; fits a single H100-class node.
Frontier reasoning for long-running agents	Nemotron 3 Ultra	550B/55B; ~48 AA index; rent it, self-host only if you must.
Frontier-class open model on a workstation	DeepSeek V4	Comparable open alternative with strong quant support.
Reasoning on a smaller footprint	GLM-5	MIT-licensed; lighter hardware bill than Ultra.

Run open weights on your own hardware

Nemotron 3 Ultra needs a server, but Nano and Nano Omni run on a single GPU — and so do DeepSeek V4, GLM-5, and the rest of the open-weight field. The Local AI Master deployment course walks you through quantization, serving, and fine-tuning open models locally — full data privacy, zero per-token cost.


Related models & guides

→ Nemotron-70B — NVIDIA's earlier dense open model
→ DeepSeek V4 — frontier-class open-weight alternative you can self-host
→ GLM-5 — MIT-licensed reasoning model, smaller footprint
→ Best open-source LLMs of 2026
→ Best local AI models for programming
→ Model comparisons hub

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
