
NVIDIA · Open-Weights Model

NVIDIA Nemotron 3 Ultra: The 550B Open-Weights MoE, Reviewed

NVIDIA showcased the Nemotron 3 family at Computex 2026, headlined by the launch of Nemotron 3 Ultra: a 550-billion-parameter Mixture-of-Experts model with only 55B parameters active per token, built on a Hybrid Mamba-Attention architecture with LatentMoE routing and Multi-Token Prediction. It scores roughly 48 on the Artificial Analysis Intelligence Index, which NVIDIA positions as the leading result among US open-weights models, and the weights ship openly under the OpenMDW-1.1 license. This page covers all three tiers (Nano, Super, and Ultra, plus the multimodal Nano Omni), the architecture, the benchmark claims, and what it actually takes to run them yourself.

📅 Published: June 20, 2026 · 🔄 Last Updated: June 20, 2026 · ✓ Manually Reviewed

Good news for self-hosters: unlike most frontier-tier models, Nemotron 3 ships with open weights. Ultra's checkpoints are on Hugging Face (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16), and NVIDIA also released training recipes and datasets. For the hardware reality check, see "Running Nemotron 3 locally" further down this page.

Key takeaways

  • Ultra = 550B total / 55B active — a sparse MoE, so it computes like a ~55B model but stores like a 550B one.
  • Hybrid Mamba-Attention + LatentMoE + MTP — Mamba-2 for cheap long context, attention for recall, Multi-Token Prediction for native speculative decoding.
  • ~48 on the Artificial Analysis Intelligence Index — NVIDIA reports it as the leading US open-weights model on that index.
  • 1M-token context and, per a pre-release DeepInfra endpoint, 300+ tok/s in BF16.
  • Open weights under OpenMDW-1.1 (a permissive Linux Foundation AI model license) — commercial use allowed.

Quick verdict

Nemotron 3 Ultra is the most capable open-weights model to come out of a US lab in this cycle, and the headline number — ~48 on the Artificial Analysis Intelligence Index — puts it well clear of the smaller open models (NVIDIA cites gpt-oss-120b at ~33 and its own Super at ~36 for context). The cleverness is in the architecture: a sparse 550B/55B-active MoE means it reasons at frontier quality while only paying the compute cost of a ~55B model per token, and the Mamba-Attention hybrid keeps the 1M-token context affordable.

The honest caveat is scale. "Open weights" does not mean "runs on your laptop": at 550B parameters, Ultra is a server-class model. The tier most people will actually self-host is Nano (30B total, ~3B active) or the multimodal Nano Omni, with Super (120B / ~12B active) for mid-range boxes. If you want a frontier-class open model you can run on a workstation, look at DeepSeek V4 or GLM-5 alongside this. The claim for Ultra is "leading US open-weights model", not necessarily the best open model worldwide.

Nemotron 3 Ultra specs at a glance

Vendor | NVIDIA
Family launch | Computex 2026 (Ultra weights open-sourced June 2026)
Total parameters | 550 billion (Mixture-of-Experts)
Active parameters / token | 55 billion
Architecture | Hybrid Mamba-Attention MoE · LatentMoE routing · MTP layers
Reported layout | 108 layers · model dim 8,192 · 512 experts (top-22 active)
Context window | 1,000,000 tokens
Pretraining scale | ~20 trillion text tokens (NVIDIA-reported)
AA Intelligence Index | ~48 (NVIDIA / Artificial Analysis-reported)
License | OpenMDW-1.1 (open weights, commercial use)
Local self-hostable? | Yes, but on server-class hardware (see below)
Weights | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 (Hugging Face)
Access | Hugging Face · NVIDIA NIM / build.nvidia.com · OpenRouter · DeepInfra

Sources: NVIDIA Newsroom "Nemotron 3 family" announcement, the NVIDIA Nemotron 3 Ultra Technical Report (for the 108-layer / 8,192-dim / 512-expert / top-22 details and the ~20T-token pretraining), and the Hugging Face model card. NVIDIA's December 2025 family press release rounds the lineup to "about 500B / up to 50B active"; the model card and marketed figures for Ultra are 550B / 55B active.

The architecture: why it's fast

Nemotron 3 Ultra is not a vanilla transformer. It stacks three ideas, each aimed at making a 550B model cheap to run:

Hybrid Mamba-Attention

Mamba-2 state-space layers carry the long sequence with sub-quadratic scaling — that's what keeps the 1M-token context from blowing up in cost — while selective attention layers preserve precise recall where it matters. The hybrid is the reason long-context throughput stays high.
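
To make the scaling argument concrete, here is a toy Python sketch. It is not NVIDIA's Mamba-2 implementation: it contrasts a scalar linear recurrence, which carries one fixed-size state, with causal attention, which keeps every past token around. The function names and the `decay` and `gain` values are hypothetical and chosen only for illustration.

```python
def recurrent_scan(xs, decay=0.9, gain=0.1):
    """Toy linear recurrence: h_t = decay * h_{t-1} + gain * x_t.

    Only the latest state h is kept, so memory use is constant
    regardless of sequence length. decay and gain are made-up values.
    """
    h = 0.0
    for x in xs:
        h = decay * h + gain * x
    return h


def attention_footprint(num_tokens):
    """Causal attention keeps one cache entry per past token.

    Returns (cache_entries, pairwise_scores): the cache grows linearly,
    the number of query-key scores grows quadratically.
    """
    cache_entries = num_tokens
    pairwise_scores = num_tokens * (num_tokens + 1) // 2
    return cache_entries, pairwise_scores


if __name__ == "__main__":
    print("recurrent state after 1,000 tokens:", recurrent_scan([1.0] * 1000))
    for n in (1_000, 1_000_000):
        cache, scores = attention_footprint(n)
        print(f"n={n:>9,}: cache entries={cache:,}, causal scores={scores:,}")
```

Real models multiply both costs by layer count, head count, and hidden size, so the numbers above only show the growth rates. A hybrid design limits how many layers pay the quadratic cost.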

LatentMoE routing

Of 512 experts, only the top 22 fire per token. LatentMoE routes in a compressed latent space (NVIDIA describes trading away hidden-dimension width for accuracy-per-parameter), which is how Ultra holds 550B parameters while activating just 55B per forward pass.
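
The selection step can be sketched with generic top-k gating. This is a plain softmax-over-top-k router, not LatentMoE's latent-space mechanism, which the source does not describe in enough detail to reproduce. The function name and the random logits are assumptions; only the 512-expert and top-22 figures come from the spec table.

```python
import math
import random


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs, highest score first,
    whose weights sum to 1.
    """
    # Indices of the k largest logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    NUM_EXPERTS, TOP_K = 512, 22  # figures from the spec table
    rng = random.Random(0)  # stand-in for real router scores
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = top_k_route(logits, TOP_K)
    print(f"experts used: {len(chosen)} of {NUM_EXPERTS} "
          f"({len(chosen) / NUM_EXPERTS:.1%})")
    print(f"weights sum to {sum(w for _, w in chosen):.6f}")
```

Note that 22 of 512 experts is about 4.3% of the expert pool, while the model's overall active share is 55B of 550B, or 10%. The gap is presumably the non-expert layers that run for every token, though the source does not break this down.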

Multi-Token Prediction (MTP)

MTP heads predict several future tokens in one forward pass, giving Ultra native speculative decoding — no separate draft model required. On a pre-release DeepInfra endpoint, this helped it serve 300+ tokens/sec in BF16.
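
For readers new to speculative decoding, a minimal greedy draft-and-verify step might look like the sketch below. It is a generic illustration, not NVIDIA's MTP code: `draft_next` stands in for the cheap multi-token guesses, `target_next` for the full model, and both are hypothetical toy functions over integer tokens. A real server verifies all drafted positions in one batched forward pass rather than in a Python loop.

```python
def speculative_step(prefix, draft_next, target_next, num_draft):
    """One draft-and-verify step with greedy acceptance.

    draft_next and target_next map a token list to the next token.
    Returns the tokens accepted in this step (always at least one).
    """
    # 1. Draft: cheaply propose num_draft tokens.
    proposed = []
    context = list(prefix)
    for _ in range(num_draft):
        tok = draft_next(context)
        proposed.append(tok)
        context.append(tok)

    # 2. Verify: keep drafted tokens while the target model agrees.
    accepted = []
    context = list(prefix)
    for tok in proposed:
        expected = target_next(context)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, then stop
            return accepted
        accepted.append(tok)
        context.append(tok)

    # 3. Every draft was accepted: verification also yields one bonus token.
    accepted.append(target_next(context))
    return accepted


def target_next(ctx):
    """Toy 'full model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy 'draft': agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], draft_next, target_next, num_draft=3))
    print(speculative_step([3], draft_next, target_next, num_draft=3))
```

The output always matches what the target model alone would have produced; the draft only changes how many tokens are confirmed per step, which is where the throughput gain comes from.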

The full Nemotron 3 family

Nemotron 3 ships in three sizes, plus a multimodal Nano variant. The smaller you go, the more realistically you can self-host it:

Tier | Params (total / active) | When | Best for
Nano | 30B / ~3B | Dec 2025 | Light, efficient tasks; 1M context; most self-hostable
Nano Omni | 30B / ~3B (multimodal) | ~Apr 28–29, 2026 | Open vision + audio + language in one model; document/video/audio agents
Super | 120B / ~12B | Mar 2026 | Mid-range enterprise reasoning; ~36 AA Intelligence Index
Ultra | 550B / 55B | Computex 2026 | Frontier reasoning + long-running agents; ~48 AA Intelligence Index

Nano Omni is the interesting one for builders: it's an open multimodal model that takes text, images, audio, video, documents, charts, and GUI screenshots as input on a single 30B-A3B hybrid MoE, and NVIDIA reports up to ~9× higher throughput than other open multimodal models. It collapses a typical vision + ASR + LLM stack into one model — handy for agents that need to "see and hear" without three separate services.

Sources: NVIDIA Newsroom and NVIDIA Technical/AI blogs (Nano Omni, Super, Ultra). Parameter figures vary slightly between NVIDIA's rounded press copy and the Hugging Face repo names; we use the repo/marketed numbers where they differ.

Benchmarks & throughput

Most headline numbers here are NVIDIA's own claims or single-source third-party measurements, so treat them as vendor-reported until independent trackers settle. Where that's the case, we've flagged it.

Metric | Nemotron 3 Ultra | Notes
Artificial Analysis Intelligence Index | ~48 | NVIDIA cites it as leading US open-weights; vs Super ~36, gpt-oss-120b ~33.
Throughput (BF16) | 300+ tok/s | Measured on a pre-release DeepInfra endpoint (single source).
Decode-heavy speedup | ~5.9× | NVIDIA-reported vs GLM-5.1 on an 8K-in / 64K-out workload.
Context window | 1,000,000 tokens | Mamba-hybrid keeps long-context cost down.
Pretraining tokens | ~20T | NVIDIA-reported text-token count.

Sources: NVIDIA Newsroom / Technical Blog, Artificial Analysis Intelligence Index, MarkTechPost architecture write-up, and DeepInfra's Nemotron 3 Ultra release post. The throughput figure comes from a single DeepInfra endpoint and the speedup figure from NVIDIA; verify both against independent benchmarks before relying on them.

Running Nemotron 3 locally — the hardware reality

This is the part the launch coverage glosses over. "Open weights" is real and valuable — you can download, fine-tune, and self-host under OpenMDW-1.1 — but the three tiers live in very different worlds:

Tier | Rough memory to serve | Realistic where
Nano (30B / 3B) | ~18–24 GB at 4-bit | A single 24 GB GPU (RTX 4090/5090) or a 32 GB+ Apple Silicon Mac
Super (120B / 12B) | ~60–80 GB at 4-bit | 2× 48 GB GPUs, an H100/H200, or a 128 GB unified-memory Mac
Ultra (550B / 55B) | Hundreds of GB even quantized | Multi-GPU server / data-center node; not a workstation model

These are ballpark figures, not NVIDIA specs. MoE models activate only a slice of their weights per token, but you still have to hold all of them in memory, so the total parameter count, not the active count, drives the VRAM you need. For most individuals and small teams, the practical Nemotron 3 to run at home is Nano or Nano Omni. Ultra is better rented through a NIM, DeepInfra, or OpenRouter endpoint, with self-hosting reserved for cases where data residency demands it.
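
The arithmetic behind those ballparks is simple enough to check yourself. The sketch below estimates weight memory only, as parameters times bits per weight divided by eight. It assumes 1 GB = 1e9 bytes and ignores KV cache, activations, and runtime overhead, so real requirements are higher. The function name is hypothetical; the parameter counts come from the tier table.

```python
TIERS = {  # total parameters in billions, from the tier table
    "Nano": 30,
    "Super": 120,
    "Ultra": 550,
}


def weight_memory_gb(params_billions, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for name, params in TIERS.items():
        bf16 = weight_memory_gb(params, 16)
        int4 = weight_memory_gb(params, 4)
        print(f"{name:<6} {params:>4}B  BF16 ~ {bf16:,.0f} GB  4-bit ~ {int4:,.0f} GB")
```

Nano comes out at 15 GB of weights at 4-bit, which leaves the rest of an 18–24 GB budget for cache and overhead. Ultra needs 275 GB at 4-bit and 1,100 GB in BF16 for weights alone, which is why it lands in multi-GPU server territory.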

How to choose a tier

If you want… | Pick | Why
A model you can actually run on one GPU | Nemotron 3 Nano | 30B/3B-active, 1M context, fits a 24 GB card at 4-bit.
Vision + audio + text in one open model | Nemotron 3 Nano Omni | Unified multimodal; replaces a vision+ASR+LLM stack.
Mid-range enterprise reasoning on prem | Nemotron 3 Super | 120B / ~12B active; ~36 AA index; fits a single H100-class node.
Frontier reasoning for long-running agents | Nemotron 3 Ultra | 550B/55B; ~48 AA index; rent it, self-host only if you must.
Frontier-class open model on a workstation | DeepSeek V4 | Comparable open alternative with strong quant support.
Reasoning on a smaller footprint | GLM-5 | MIT-licensed; lighter hardware bill than Ultra.



Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
