Nemotron 70B Local Setup Guide (2026): NVIDIA's RLHF-Refined Llama 3.1 70B
Llama-3.1-Nemotron-70B-Instruct is NVIDIA's October 2024 RLHF-refined fine-tune of Meta's Llama 3.1 70B. Same architecture, same parameter count — but post-trained with NVIDIA's HelpSteer2 preference dataset and a custom reward model. The result: at release it topped Arena Hard (85.0), AlpacaEval 2 LC (57.6), and MT-Bench (8.98), beating GPT-4o, Claude 3.5 Sonnet, and the base Llama 3.1 70B on chat-quality benchmarks. For agentic, conversational, and instruction-following workloads in 2026, it's still one of the strongest open-weight 70B-class models.
This guide covers the full Nemotron family (Mini-4B, 70B Instruct, 70B Reward, Nemotron-4 340B), setup across vLLM / TensorRT-LLM / Ollama, multi-GPU and quantized deployment, fine-tuning with QLoRA, and how to use HelpSteer2 + Nemotron-Reward to RLHF your own models.
Table of Contents
- What Nemotron Is
- The Nemotron Family
- HelpSteer2 + RLAIF: How NVIDIA Tuned It
- Hardware Requirements & Quantization
- Nemotron 70B vs Llama 3.1 70B vs GPT-4o
- vLLM Setup (Multi-GPU)
- TensorRT-LLM Setup (FP8 H100)
- Ollama Setup
- llama.cpp Setup with GGUF
- Quantized Variants (AWQ, GPTQ, FP8)
- Nemotron-Mini 4B for Edge / Agents
- Nemotron-Reward for Custom RLHF
- Fine-Tuning Nemotron 70B
- Function Calling & Structured Output
- System Prompts & Sampling
- Real Benchmarks
- Licensing
- Troubleshooting
What Nemotron Is {#what-it-is}
Nemotron-70B (nvidia/Llama-3.1-Nemotron-70B-Instruct-HF on HuggingFace) is a Llama 3.1 70B Instruct derivative with NVIDIA's full post-training pipeline applied. Architecture: identical to Llama 3.1 70B (decoder-only, GQA, RoPE, 131K context, 80 layers). What changed: NVIDIA replaced Meta's post-training with its own pipeline: a reward model trained on HelpSteer2, then RLHF (REINFORCE, per the model card) driven by that reward model.
License: Llama 3.1 Community License (inherits from Meta) + NVIDIA Open Model License rider. Free for commercial use under typical conditions; the 700M-MAU clause from Meta still applies.
The Nemotron Family {#family}
| Variant | Parameters | Context | Best For |
|---|---|---|---|
| Nemotron-Mini-4B-Instruct | 4B | 4K | Edge agents, fast tasks |
| Llama-3.1-Nemotron-70B-Instruct | 70B | 131K | Flagship chat / agents |
| Llama-3.1-Nemotron-70B-Reward | 70B | 131K | Reward model for RLHF |
| Nemotron-4-340B-Base | 340B | 4K | Research foundation |
| Nemotron-4-340B-Instruct | 340B | 4K | Top-tier chat (8x H100) |
| Nemotron-4-340B-Reward | 340B | 4K | Reward model (rare use) |
| Llama-Nemotron-Ultra (2025) | 253B | 131K | Reasoning successor |
For most users: 70B Instruct. For agentic workflows on a single 24GB GPU: Nemotron-Mini 4B. For research labs: 340B. Llama-Nemotron-Ultra is NVIDIA's 2025 reasoning successor — see Llama 4 Local Setup.
HelpSteer2 + RLAIF: How NVIDIA Tuned It {#training}
NVIDIA's post-training pipeline:
- Start from Llama 3.1 70B Instruct (Meta's checkpoint)
- HelpSteer2 dataset: 10,681 prompts (~21K rated responses), each response scored 0-4 on 5 attributes (helpfulness, correctness, coherence, complexity, verbosity) by 3+ annotators
- Train reward model (Llama-3.1-Nemotron-70B-Reward) on HelpSteer2
- Generate synthetic preference pairs via the model itself + reward scoring
- Iterative RLHF/RLAIF rounds (REINFORCE rather than PPO or DPO, per NVIDIA's model card) with the reward model as judge
- Output: Llama-3.1-Nemotron-70B-Instruct
Why iteration helps: each round closes the gap between the model's outputs and the reward signal. After 3-5 rounds, marginal gains diminish.
The full recipe is documented in NVIDIA's NeMo-Aligner repo and is reproducible on any Llama base.
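The loop itself is simple enough to sketch. Below is an illustrative skeleton only, not NeMo-Aligner's API: sample_responses, reward_score, and policy_update are hypothetical callables standing in for the real generation, reward-scoring, and training steps.
def preference_round(policy, prompts, sample_responses, reward_score, policy_update, n=4):
    """One synthetic-preference round: sample n candidates per prompt,
    rank them with the reward model, then train on best-vs-worst pairs."""
    pairs = []
    for prompt in prompts:
        candidates = sample_responses(policy, prompt, n)       # n sampled completions
        ranked = sorted(candidates, key=lambda r: reward_score(prompt, r))
        pairs.append((prompt, ranked[-1], ranked[0]))          # (prompt, chosen, rejected)
    return policy_update(policy, pairs)                        # REINFORCE- or DPO-style step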
Hardware Requirements & Quantization {#requirements}
| GPU Setup | Quant | Throughput |
|---|---|---|
| 1x H100 80GB | FP8 (TRT-LLM) | 100-130 tok/s |
| 1x H100 80GB | INT8 | 80-100 tok/s |
| 2x A100 80GB | BF16 | 60-80 tok/s |
| 1x A100 80GB | INT4 AWQ | 50-70 tok/s |
| 2x RTX 3090 (NVLink) | INT4 AWQ | 25-35 tok/s |
| 2x RTX 4090 (no NVLink) | INT4 AWQ | 20-30 tok/s |
| 1x RTX 4090 + 64 GB RAM | Q4_K_M GGUF (split) | 4-8 tok/s |
| CPU (128 GB RAM) | Q4_K_M GGUF | 1-3 tok/s |
For self-hosters: 2x RTX 3090 with NVLink + AWQ via vLLM is the cost-effective sweet spot. For production: H100 with TensorRT-LLM FP8.
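The INT4 recommendation falls out of simple arithmetic. A rough estimate, assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads via GQA, head dim 128) and an FP16 KV cache:
# Weights: 70B parameters at 4 bits/weight
weights_gb = 70e9 * 0.5 / 1e9            # ~35 GB (AWQ packing overhead brings it to ~38 GB)

# KV cache per token: layers * (K+V) * kv_heads * head_dim * 2 bytes (FP16)
kv_per_token = 80 * 2 * 8 * 128 * 2      # ~0.33 MB/token
kv_gb = kv_per_token * 8192 / 1e9        # ~2.7 GB at 8K context

print(weights_gb + kv_gb)                # ~38 GB + runtime overhead: fits in 2x 24 GB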
Nemotron 70B vs Llama 3.1 70B vs GPT-4o {#comparison}
| Benchmark | Nemotron 70B | Llama 3.1 70B | GPT-4o (May 2024) | Claude 3.5 Sonnet |
|---|---|---|---|---|
| Arena Hard | 85.0 | 55.7 | 79.2 | 79.3 |
| AlpacaEval 2 LC | 57.6 | 38.1 | 57.5 | 52.4 |
| MT-Bench | 8.98 | 8.78 | 8.74 | 9.10 |
| MMLU | 86.0 | 86.0 | 88.7 | 88.3 |
| MMLU-Pro | 60.4 | 60.4 | 73.3 | 75.1 |
| GSM8K | 94.5 | 95.1 | 95.8 | 96.4 |
| HumanEval | 83.5 | 80.5 | 90.2 | 92.0 |
| Context length | 131K | 131K | 128K | 200K |
Nemotron decisively wins on subjective chat benchmarks (Arena Hard, AlpacaEval 2 LC). Knowledge / math / code performance is essentially identical to base Llama 3.1 70B (post-training rarely moves these). On raw capability, GPT-4o and Claude 3.5 Sonnet still lead on MMLU-Pro and HumanEval.
vLLM Setup (Multi-GPU) {#vllm}
# 2x A100/H100 BF16
vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json
INT4 AWQ for tighter VRAM:
vllm serve neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w4a16 \
--quantization compressed-tensors \
--max-model-len 32768
For 2x RTX 3090:
vllm serve hugging-quants/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4 \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.95
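Whichever variant you serve, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal client call:
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    messages=[{"role": "user", "content": "Summarize GQA in two sentences."}],
    temperature=0.5,
)
print(resp.choices[0].message.content)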
See vLLM Complete Setup Guide.
TensorRT-LLM Setup (FP8 H100) {#tensorrt}
For maximum H100 throughput:
git clone https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM/examples/llama
# Convert HF checkpoint to TRT-LLM format
python convert_checkpoint.py \
--model_dir /models/Nemotron-70B-Instruct-HF \
--output_dir /trt_ckpt/nemotron-70b-fp8 \
--dtype bfloat16 \
--use_fp8 \
--tp_size 1
# Build engine
trtllm-build \
--checkpoint_dir /trt_ckpt/nemotron-70b-fp8 \
--output_dir /trt_engines/nemotron-70b-fp8 \
--gemm_plugin fp8 \
--max_input_len 4096 \
--max_output_len 2048 \
--max_batch_size 16
# Serve via Triton
Throughput: ~120 tok/s single-stream, 1500+ tok/s aggregate at batch 16 on a single H100. See TensorRT-LLM Setup Guide.
Ollama Setup {#ollama}
ollama pull nemotron:70b
ollama run nemotron:70b "Write a polite refusal email to a vendor missing SLAs."
Modelfile customization:
FROM nemotron:70b
PARAMETER num_ctx 16384
PARAMETER temperature 0.5
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a precise, helpful assistant. Be concise unless asked otherwise."""
Ollama auto-splits across multiple GPUs. For 2x RTX 3090 it's the simplest path.
llama.cpp Setup with GGUF {#llamacpp}
huggingface-cli download bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF \
Llama-3.1-Nemotron-70B-Instruct-HF-Q4_K_M.gguf \
--local-dir ./models
./llama-cli \
-m models/Llama-3.1-Nemotron-70B-Instruct-HF-Q4_K_M.gguf \
-ngl 999 -c 16384 -fa \
--temp 0.5 --min-p 0.05 \
-p "Plan a 5-step migration from MySQL 5.7 to PostgreSQL 16."
For split GPU + CPU:
./llama-cli -m nemotron-70b-Q4_K_M.gguf -ngl 60 -c 8192
-ngl 60 puts ~60 of 80 layers on the GPU; the rest run on CPU. Use this when a single GPU has insufficient VRAM for the full model.
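llama.cpp also ships llama-server, which serves the same weights behind an OpenAI-compatible endpoint:
./llama-server -m models/Llama-3.1-Nemotron-70B-Instruct-HF-Q4_K_M.gguf -ngl 999 -c 16384 --port 8080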
Quantized Variants (AWQ, GPTQ, FP8) {#quants}
| Quant | VRAM (70B) | Speed (vs BF16) | Quality Loss |
|---|---|---|---|
| BF16 | 140 GB | 1.0x | 0% |
| FP8 (H100) | 70 GB | 1.4x | <0.5% |
| INT8 (W8A8) | 70 GB | 1.2x | <1% |
| AWQ INT4 | 38 GB | 2.0x | 1-2% |
| GPTQ INT4 | 38 GB | 1.8x | 1-2% |
| GGUF Q5_K_M | 49 GB | 1.5x | <1% |
| GGUF Q4_K_M | 40 GB | 1.7x | 1-2% |
| GGUF Q3_K_M | 32 GB | 1.9x | 3-5% |
Recommended: AWQ INT4 for vLLM, FP8 for TensorRT-LLM on H100, Q4_K_M for llama.cpp. See AWQ vs GPTQ vs GGUF.
Nemotron-Mini 4B for Edge / Agents {#mini}
ollama run nemotron-mini
Use cases:
- Local agentic loops where 70B is too slow
- On-device function calling (4 GB VRAM)
- Embedded systems for routing / classification
- Real-time chat at 100+ tok/s
Quality is below Llama 3.1 8B on knowledge tasks but competitive on tool-use and structured output — Nemotron-Mini was specifically tuned for agent workloads.
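A routing example over Ollama's OpenAI-compatible endpoint (assumes nemotron-mini is pulled and Ollama is running on its default port):
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API at /v1
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="nemotron-mini",
    messages=[
        {"role": "system", "content": "Classify the user request as one of: billing, tech_support, sales. Reply with the label only."},
        {"role": "user", "content": "My GPU server invoice is wrong."},
    ],
    temperature=0.0,
)
print(resp.choices[0].message.content)  # expected: billing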
Nemotron-Reward for Custom RLHF {#reward}
The Llama-3.1-Nemotron-70B-Reward model scores any prompt+response on the 5 HelpSteer2 axes. Use it to:
- Train your own preference dataset: feed your model outputs through the reward model, keep top-rated.
- Filter synthetic SFT data: discard low-quality generations before fine-tuning.
- Run DPO/PPO on any base model: use Nemotron-Reward as the judge.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
reward_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Score a response: the scalar reward is the first vocab logit at the
# final position (mirrors the model card's usage)
messages = [
    {"role": "user", "content": "Explain RoPE in 2 sentences."},
    {"role": "assistant", "content": "..."},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(reward_model.device)
with torch.no_grad():
    score = reward_model(inputs).logits[0, -1, 0].item()  # higher = better
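Building on the snippet above, a minimal best-vs-worst pair builder for DPO data (score_response is our own wrapper, not an NVIDIA API):
def score_response(prompt, response):
    msgs = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    ids = tokenizer.apply_chat_template(msgs, return_tensors="pt").to(reward_model.device)
    with torch.no_grad():
        return reward_model(ids).logits[0, -1, 0].item()

candidates = ["draft A", "draft B", "draft C"]  # your model's sampled responses
ranked = sorted(candidates, key=lambda r: score_response("Explain RoPE in 2 sentences.", r))
chosen, rejected = ranked[-1], ranked[0]        # DPO-style pair: best vs worst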
Fine-Tuning Nemotron 70B {#fine-tuning}
QLoRA on 2x RTX 3090 (NVLink) with Axolotl:
base_model: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
datasets:
- path: ./train.jsonl
type: chat_template
sequence_len: 2048
sample_packing: true
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 1e-4
warmup_ratio: 0.03
optimizer: paged_adamw_8bit
bf16: auto
flash_attention: true
output_dir: ./nemotron_lora
accelerate launch --num_processes 2 -m axolotl.cli.train config.yml
Time: ~36 hours for 1K examples on 2x RTX 3090. See QLoRA Fine-Tuning Guide.
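After training, one way to test the adapter is to attach it to the 4-bit base with PEFT (a sketch; the adapter path matches output_dir in the config above):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, "./nemotron_lora")   # LoRA weights from output_dir
tokenizer = AutoTokenizer.from_pretrained(base_id)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Test prompt for the fine-tuned adapter."}],
    return_tensors="pt", add_generation_prompt=True,
).to(base.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))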
Function Calling & Structured Output {#tools}
vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json
Then OpenAI-compatible:
{
"model": "nemotron",
"messages": [...],
"tools": [{"type": "function", "function": {...}}],
"tool_choice": "auto"
}
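A concrete end-to-end call through the OpenAI Python client (get_weather is a hypothetical tool for illustration):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)  # parsed by --tool-call-parser llama3_json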
Nemotron's HelpSteer2 tuning improved tool-call argument reliability — fewer hallucinated parameters than base Llama 3.1 70B in our agentic tests. See Ollama Function Calling.
System Prompts & Sampling {#prompting}
Llama 3.1 chat template:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
[user message]<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Recommended sampling:
- Chat / instruction: temperature 0.5, min-p 0.05, top-p 0.9
- Code/reasoning: temperature 0.2, min-p 0.05
- Creative: temperature 0.8, min-p 0.05
Nemotron 70B was tuned with HelpSteer2's "verbosity" axis, so it tends toward longer, more structured responses than base Llama. Add "be concise" to the system prompt if you want shorter outputs.
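With vLLM's OpenAI-compatible server, min-p isn't part of the standard OpenAI schema, so pass it via extra_body (a vLLM-specific extension):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    messages=[{"role": "user", "content": "Explain GQA briefly."}],
    temperature=0.5,
    top_p=0.9,
    extra_body={"min_p": 0.05},   # vLLM sampling extension, not standard OpenAI
)
print(resp.choices[0].message.content)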
Real Benchmarks {#benchmarks}
Quality scores below are published benchmark numbers; throughput and TTFT were measured single-user with vLLM, AWQ INT4, on 2x RTX 3090 NVLink:
| Test | Nemotron 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| Arena Hard | 85.0% | 55.7% | 69.3% |
| MT-Bench | 8.98 | 8.78 | 8.99 |
| MMLU | 86.0% | 86.0% | 88.6% |
| HumanEval | 83.5% | 80.5% | 89.0% |
| Throughput (single user) | ~30 tok/s | ~30 tok/s | n/a (too big) |
| TTFT (1K prompt, 2x 3090) | 380 ms | 380 ms | n/a |
For chat workloads at 70B, Nemotron beats both base Llama 70B and 405B on subjective quality.
Licensing {#licensing}
Inherits Llama 3.1 Community License (Meta) plus NVIDIA Open Model License rider.
You can:
- Use commercially under 700M MAU threshold
- Modify and redistribute
- Fine-tune and build derivative models
- Publish derivatives under permissive terms (with attribution)
You cannot:
- Deploy at >700M monthly active users without separate Meta agreement
- Use to improve another LLM without naming attribution
- Violate Meta's Acceptable Use Policy
For most enterprise use: license is functionally similar to Apache 2.0 in practice. For consumer apps with unclear MAU forecasts: Apache 2.0 alternatives like OLMo 2 or Mistral may be safer.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM on 2x RTX 3090 | BF16 doesn't fit | Use AWQ INT4 quant |
| Slow tensor parallel | NVLink missing | Add NVLink bridge or use INT4 + smaller TP |
| Tool calls malformed | Wrong parser | --tool-call-parser llama3_json |
| Verbose responses | HelpSteer2 verbosity axis | Add "be concise" system prompt |
| TRT-LLM FP8 build fails | Old CUDA | Need CUDA 12.4+, TRT 10+ |
| Repetitive output | No min-p | Set min-p 0.05 |
| Tokenizer issues in Ollama | Cache stale | ollama rm nemotron && ollama pull nemotron:70b |
Sources: NVIDIA Nemotron 70B announcement | HelpSteer2 dataset | HelpSteer2 paper (arXiv 2406.08673) | NeMo-Aligner repo | Nemotron-Reward | Internal benchmarks 2x RTX 3090 + AWQ.