NVIDIA Nemotron 3 Ultra: The 550B Open-Weights MoE, Reviewed
NVIDIA unveiled Nemotron 3 Ultra at Computex 2026 as the flagship of the Nemotron 3 family: a 550-billion-parameter Mixture-of-Experts model with only 55B parameters active per token, built on a Hybrid Mamba-Attention architecture with LatentMoE routing and Multi-Token Prediction. It scores roughly 48 on the Artificial Analysis Intelligence Index, which NVIDIA positions as the leading score among US open-weights models, and the weights ship under the OpenMDW-1.1 license. This page covers all three tiers (Nano, Super, and Ultra, plus the multimodal Nano Omni), the architecture, the benchmark claims, and what it actually takes to run them yourself.
Good news for self-hosters: unlike most frontier-tier models, Nemotron 3 is open weights. Ultra's checkpoints are on Hugging Face (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16), and NVIDIA has released training recipes and datasets too. See "Running Nemotron 3 locally" below for the hardware reality check.
Key takeaways
- Ultra = 550B total / 55B active: a sparse MoE, so it computes like a ~55B model but stores like a 550B one.
- Hybrid Mamba-Attention + LatentMoE + MTP: Mamba-2 for cheap long context, attention for recall, Multi-Token Prediction for native speculative decoding.
- ~48 on the Artificial Analysis Intelligence Index: NVIDIA reports it as the leading US open-weights model on that index.
- 1M-token context and, per a pre-release DeepInfra endpoint, 300+ tok/s in BF16.
- Open weights under OpenMDW-1.1 (a permissive Linux Foundation AI model license): commercial use allowed.
Quick verdict
Nemotron 3 Ultra is the most capable open-weights model to come out of a US lab in this cycle, and the headline number — ~48 on the Artificial Analysis Intelligence Index — puts it well clear of the smaller open models (NVIDIA cites gpt-oss-120b at ~33 and its own Super at ~36 for context). The cleverness is in the architecture: a sparse 550B/55B-active MoE means it reasons at frontier quality while only paying the compute cost of a ~55B model per token, and the Mamba-Attention hybrid keeps the 1M-token context affordable.
The honest caveat is scale. "Open weights" does not mean "runs on your laptop": at 550B parameters, Ultra is a server-class model. The tier most people will actually self-host is Nano (30B total, ~3B active) or the multimodal Nano Omni, with Super (120B / ~12B active) for mid-range boxes. If you want a frontier-class open model you can run on a workstation, look at DeepSeek V4 or GLM-5 alongside this. NVIDIA's claim is "leading US open weights", which is not necessarily the same as the best open model globally.
Nemotron 3 Ultra specs at a glance
| Spec | Value |
|---|---|
| Vendor | NVIDIA |
| Ultra launch | Computex 2026 (weights open-sourced June 2026) |
| Total parameters | 550 billion (Mixture-of-Experts) |
| Active parameters / token | 55 billion |
| Architecture | Hybrid Mamba-Attention MoE · LatentMoE routing · MTP layers |
| Reported layout | 108 layers · model dim 8,192 · 512 experts (top-22 active) |
| Context window | 1,000,000 tokens |
| Pretraining scale | ~20 trillion text tokens (NVIDIA-reported) |
| AA Intelligence Index | ~48 (NVIDIA / Artificial Analysis-reported) |
| License | OpenMDW-1.1 (open weights, commercial use) |
| Local self-hostable? | Yes — but server-class hardware (see below) |
| Weights | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 (Hugging Face) |
| Access | Hugging Face · NVIDIA NIM / build.nvidia.com · OpenRouter · DeepInfra |
Sources: NVIDIA Newsroom "Nemotron 3 family" announcement, the NVIDIA Nemotron 3 Ultra Technical Report (for the 108-layer / 8,192-dim / 512-expert / top-22 details and the ~20T-token pretraining), and the Hugging Face model card. NVIDIA's December 2025 family press release rounds the lineup to "about 500B / up to 50B active"; the model card and marketing figures for Ultra are 550B / 55B active.
The architecture: why it's fast
Nemotron 3 Ultra is not a vanilla transformer. It stacks three ideas, each aimed at making a 550B model cheap to run:
Hybrid Mamba-Attention
Mamba-2 state-space layers carry the long sequence with sub-quadratic scaling — that's what keeps the 1M-token context from blowing up in cost — while selective attention layers preserve precise recall where it matters. The hybrid is the reason long-context throughput stays high.
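To make the scaling argument concrete, the sketch below compares how work grows with sequence length for a layer that relates every token to every earlier token (causal self-attention) and for a layer that performs one fixed-cost state update per token (a recurrent state-space scan, the family Mamba-2 belongs to). It counts abstract "token interactions" rather than FLOPs, and it ignores model width, constant factors, and kernel efficiency. The function names are illustrative and do not come from NVIDIA.

```python
def attention_interactions(n_tokens: int) -> int:
    """Causal self-attention: token i attends to itself and all earlier tokens."""
    return n_tokens * (n_tokens + 1) // 2


def state_space_updates(n_tokens: int) -> int:
    """A recurrent state-space scan performs one state update per token."""
    return n_tokens


def growth_when_context_scales(cost_fn, base_tokens: int, factor: int) -> float:
    """How many times more work a layer does when the context grows by `factor`."""
    return cost_fn(base_tokens * factor) / cost_fn(base_tokens)


if __name__ == "__main__":
    base, factor = 8_000, 125  # 8K tokens scaled up to 1M tokens
    attn = growth_when_context_scales(attention_interactions, base, factor)
    ssm = growth_when_context_scales(state_space_updates, base, factor)
    print(f"attention work grows {attn:,.0f}x, state-space work grows {ssm:,.0f}x")
```

Going from 8K to 1M tokens multiplies the state-space work by 125 but the attention work by roughly 125 squared. That gap is why a hybrid that keeps only some attention layers can make very long contexts far cheaper than an all-attention stack.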
LatentMoE routing
Of 512 experts, only the top 22 fire per token. LatentMoE routes in a compressed latent space (NVIDIA describes this as trading hidden-dimension width for better accuracy per parameter), which is how Ultra holds 550B parameters while activating just 55B per forward pass.
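As a rough illustration of the "only the top experts fire" idea, here is a minimal sketch of generic top-k expert routing: score every expert, keep the k best, and normalise their weights with a softmax. This is the textbook MoE pattern, not NVIDIA's LatentMoE implementation (the latent-space compression step is omitted), and the router scores are random placeholders. Only the 512-expert and top-22 figures come from the reported layout.

```python
import math
import random


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exp_scores.values())
    return {i: score / total for i, score in exp_scores.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    num_experts, k = 512, 22  # layout reported for Ultra
    logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]  # placeholder scores
    weights = top_k_route(logits, k)
    print(f"{len(weights)} experts selected, weights sum to {sum(weights.values()):.3f}")
    print(f"share of experts active per token: {k / num_experts:.1%}")
```

With 22 of 512 experts selected, about 4.3% of the experts run for any given token. The share of active parameters is higher (55B of 550B, or 10%), presumably because non-expert weights such as the Mamba and attention layers run for every token.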
Multi-Token Prediction (MTP)
MTP heads predict several future tokens in one forward pass, giving Ultra native speculative decoding — no separate draft model required. On a pre-release DeepInfra endpoint, this helped it serve 300+ tokens/sec in BF16.
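The payoff of drafting several tokens per pass can be estimated with a standard back-of-envelope model from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with the same probability and that everything after the first rejection is discarded. The acceptance rate and draft length are hypothetical inputs, not figures NVIDIA has published, and the extra cost of the MTP heads is ignored.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    `accept_rate` and that tokens after the first rejection are dropped.
    The pass always yields one token of its own, hence the i = 0 term.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the shape of the curve.
    for rate in (0.5, 0.8, 0.95):
        tokens = expected_tokens_per_pass(rate, draft_len=3)
        print(f"accept_rate={rate:.2f}: {tokens:.2f} tokens per pass")
```

Under these assumptions, three drafted tokens at an 80% acceptance rate yield about 2.95 tokens per forward pass instead of 1. Real speedups depend on how often the drafted tokens actually match and on the overhead of the extra heads.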
The full Nemotron 3 family
Nemotron 3 ships in three sizes, plus a multimodal Nano variant. The smaller you go, the more realistically you can self-host it:
| Tier | Params (total / active) | When | Best for |
|---|---|---|---|
| Nano | 30B / ~3B | Dec 2025 | Light, efficient tasks; 1M context; most self-hostable |
| Nano Omni | 30B / ~3B (multimodal) | ~Apr 28–29, 2026 | Open vision + audio + language in one model; document/video/audio agents |
| Super | 120B / ~12B | Mar 2026 | Mid-range enterprise reasoning; ~36 AA Intelligence Index |
| Ultra | 550B / 55B | Computex 2026 | Frontier reasoning + long-running agents; ~48 AA Intelligence Index |
Nano Omni is the interesting one for builders: it's an open multimodal model that takes text, images, audio, video, documents, charts, and GUI screenshots as input on a single 30B-A3B hybrid MoE, and NVIDIA reports up to ~9× higher throughput than other open multimodal models. It collapses a typical vision + ASR + LLM stack into one model — handy for agents that need to "see and hear" without three separate services.
Sources: NVIDIA Newsroom and NVIDIA Technical/AI blogs (Nano Omni, Super, Ultra). Parameter figures vary slightly between NVIDIA's rounded press copy and the Hugging Face repo names; we use the repo/marketed numbers where they differ.
Benchmarks & throughput
Most headline numbers here are NVIDIA's own claims or single-source third-party measurements, so treat them as vendor-reported until independent trackers confirm them. Where that's the case, we've flagged it.
| Metric | Nemotron 3 Ultra | Notes |
|---|---|---|
| Artificial Analysis Intelligence Index | ~48 | NVIDIA cites it as leading US open-weights; vs Super ~36, gpt-oss-120b ~33. |
| Throughput (BF16) | 300+ tok/s | Measured on a pre-release DeepInfra endpoint — single source. |
| Decode-heavy speedup | ~5.9× | NVIDIA-reported vs GLM-5.1 on an 8K-in / 64K-out workload. |
| Context window | 1,000,000 | Tokens; Mamba-hybrid keeps long-context cost down. |
| Pretraining tokens | ~20T | NVIDIA-reported text-token count. |
Sources: NVIDIA Newsroom / Technical Blog, Artificial Analysis Intelligence Index, MarkTechPost architecture write-up, and DeepInfra's Nemotron 3 Ultra release post. The throughput figure comes from a single DeepInfra endpoint and the speedup figure from NVIDIA; verify both against independent benchmarks before relying on them.
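Throughput numbers are easier to judge once converted to wall-clock time. The helper below is a minimal sketch that assumes a single request streaming at a steady rate, with no queueing and no prefill time. Pairing the 300 tok/s figure with a 64K-token output is purely illustrative, since those two numbers come from different sources and may not describe the same setup.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to stream `output_tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Assumption: "64K" is taken as 64,000 tokens and 300 tok/s is sustained.
    seconds = decode_seconds(64_000, 300)
    print(f"{seconds:.0f} s (~{seconds / 60:.1f} min) for a 64K-token answer")
```

At a sustained 300 tok/s, a 64,000-token response takes about 213 seconds, a little over three and a half minutes. For long-running agents that emit this much text per step, decode rate rather than prefill is what dominates the wait.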
Running Nemotron 3 locally — the hardware reality
This is the part the launch coverage glosses over. "Open weights" is real and valuable — you can download, fine-tune, and self-host under OpenMDW-1.1 — but the three tiers live in very different worlds:
| Tier | Rough memory to serve | Realistic where |
|---|---|---|
| Nano (30B / 3B) | ~18–24 GB at 4-bit | A single 24 GB GPU (RTX 4090/5090) or a 32 GB+ Apple Silicon Mac |
| Super (120B / 12B) | ~60–80 GB at 4-bit | 2× 48 GB GPUs, an H100/H200, or a 128 GB unified-memory Mac |
| Ultra (550B / 55B) | Hundreds of GB even quantized | Multi-GPU server / data-center node — not a workstation model |
These are ballpark figures, not NVIDIA specs. MoE models only activate a slice of their weights per token, but you still have to hold all of them in memory, so total parameter count, not active count, drives the VRAM you need. For most individuals and small teams, the practical Nemotron 3 to run at home is Nano or Nano Omni; Ultra is something you rent through a NIM, DeepInfra, or OpenRouter endpoint, self-hosting only when data residency demands it.
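The arithmetic behind those ballparks is simple enough to sketch: weight memory is roughly total parameters times bits per weight divided by eight. The 20% headroom factor for KV cache, recurrent state, and activations is our own assumption rather than a measured figure, and real usage varies with context length, batch size, and the quantization format.

```python
TIERS_BILLION_PARAMS = {"Nano": 30, "Super": 120, "Ultra": 550}  # total, not active


def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Weights only: parameters x bits / 8, in decimal gigabytes."""
    return total_params_billion * bits_per_weight / 8


def tiers_that_fit(memory_gb: float, bits_per_weight: float,
                   headroom: float = 1.2) -> list[str]:
    """Tiers whose weights, plus an assumed 20% headroom, fit in `memory_gb`."""
    return [
        name
        for name, params in TIERS_BILLION_PARAMS.items()
        if weight_memory_gb(params, bits_per_weight) * headroom <= memory_gb
    ]


if __name__ == "__main__":
    for name, params in TIERS_BILLION_PARAMS.items():
        four_bit = weight_memory_gb(params, 4)
        bf16 = weight_memory_gb(params, 16)
        print(f"{name}: {four_bit:.0f} GB at 4-bit, {bf16:.0f} GB at BF16")
    print("Fits in 24 GB at 4-bit:", tiers_that_fit(24, 4))
    print("Fits in 80 GB at 4-bit:", tiers_that_fit(80, 4))
```

The weights alone come to 15 GB (Nano), 60 GB (Super), and 275 GB (Ultra) at 4-bit, and Ultra needs about 1,100 GB in BF16. That lines up with the table: a 24 GB card holds Nano, an 80 GB accelerator holds Super with little to spare, and Ultra needs a multi-GPU node even when quantized.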
How to choose a tier
| If you want… | Pick | Why |
|---|---|---|
| A model you can actually run on one GPU | Nemotron 3 Nano | 30B/3B-active, 1M context, fits a 24 GB card at 4-bit. |
| Vision + audio + text in one open model | Nemotron 3 Nano Omni | Unified multimodal; replaces a vision+ASR+LLM stack. |
| Mid-range enterprise reasoning on-prem | Nemotron 3 Super | 120B / ~12B active; ~36 AA index; fits a single H100-class node. |
| Frontier reasoning for long-running agents | Nemotron 3 Ultra | 550B/55B; ~48 AA index; rent it, self-host only if you must. |
| Frontier-class open model on a workstation | DeepSeek V4 | Comparable open alternative with strong quant support. |
| Reasoning on a smaller footprint | GLM-5 | MIT-licensed; lighter hardware bill than Ultra. |
Related models & guides
- Nemotron-70B: NVIDIA's earlier dense open model
- DeepSeek V4: frontier-class open-weight alternative you can self-host
- GLM-5: MIT-licensed reasoning model, smaller footprint
- Best open-source LLMs of 2026
- Best local AI models for programming
- Model comparisons hub
Written by the Local AI Master Team