
Nemotron 70B Local Setup Guide (2026): NVIDIA's RLHF-Refined Llama 3.1 70B

May 2, 2026
24 min read
LocalAimaster Research Team


Llama-3.1-Nemotron-70B-Instruct is NVIDIA's October 2024 RLHF-refined fine-tune of Meta's Llama 3.1 70B. Same architecture, same parameter count — but post-trained with NVIDIA's HelpSteer2 preference dataset and a custom reward model. The result: at release it topped Arena Hard (85.0), AlpacaEval 2 LC (57.6), and MT-Bench (8.98), beating GPT-4o, Claude 3.5 Sonnet, and the base Llama 3.1 70B on chat-quality benchmarks. For agentic, conversational, and instruction-following workloads in 2026, it's still one of the strongest open-weight 70B-class models.

This guide covers the full Nemotron family (Mini-4B, 70B Instruct, 70B Reward, Nemotron-4 340B), setup across vLLM / TensorRT-LLM / Ollama, multi-GPU and quantized deployment, fine-tuning with QLoRA, and how to use HelpSteer2 + Nemotron-Reward to RLHF your own models.

Table of Contents

  1. What Nemotron Is
  2. The Nemotron Family
  3. HelpSteer2 + RLAIF: How NVIDIA Tuned It
  4. Hardware Requirements & Quantization
  5. Nemotron 70B vs Llama 3.1 70B vs GPT-4o
  6. vLLM Setup (Multi-GPU)
  7. TensorRT-LLM Setup (FP8 H100)
  8. Ollama Setup
  9. llama.cpp Setup with GGUF
  10. Quantized Variants (AWQ, GPTQ, FP8)
  11. Nemotron-Mini 4B for Edge / Agents
  12. Nemotron-Reward for Custom RLHF
  13. Fine-Tuning Nemotron 70B
  14. Function Calling & Structured Output
  15. System Prompts & Sampling
  16. Real Benchmarks
  17. Licensing
  18. Troubleshooting
  19. FAQ

What Nemotron Is {#what-it-is}

Nemotron-70B (nvidia/Llama-3.1-Nemotron-70B-Instruct-HF on HuggingFace) is a Llama 3.1 70B Instruct derivative with NVIDIA's full post-training pipeline applied. Architecture: identical to Llama 3.1 70B (decoder-only, GQA, RoPE, 131K context, 80 layers). What changed: NVIDIA replaced Meta's RLHF with their own HelpSteer2 + Nemotron-Reward iterative DPO process.

License: Llama 3.1 Community License (inherited from Meta) plus the NVIDIA Open Model License rider. Commercial use is permitted, but Meta's 700M monthly-active-user clause still applies.


The Nemotron Family {#family}

| Variant | Parameters | Context | Best For |
|---|---|---|---|
| Nemotron-Mini-4B-Instruct | 4B | 4K | Edge agents, fast tasks |
| Llama-3.1-Nemotron-70B-Instruct | 70B | 131K | Flagship chat / agents |
| Llama-3.1-Nemotron-70B-Reward | 70B | 131K | Reward model for RLHF |
| Nemotron-4-340B-Base | 340B | 4K | Research foundation |
| Nemotron-4-340B-Instruct | 340B | 4K | Top-tier chat (8x H100) |
| Nemotron-4-340B-Reward | 340B | 4K | Reward model (rare use) |
| Llama-Nemotron-Ultra (2025) | 253B | 131K | Reasoning successor |

For most users: 70B Instruct. For agentic workflows on a single 24GB GPU: Nemotron-Mini 4B. For research labs: 340B. Llama-Nemotron-Ultra is NVIDIA's 2025 reasoning successor — see Llama 4 Local Setup.


HelpSteer2 + RLAIF: How NVIDIA Tuned It {#training}

NVIDIA's post-training pipeline:

  1. Start from Llama 3.1 70B Instruct (Meta's checkpoint)
  2. HelpSteer2 dataset — 10,681 conversations, each rated 0-4 on 5 attributes (helpfulness, correctness, coherence, complexity, verbosity) by 3+ annotators
  3. Train reward model (Llama-3.1-Nemotron-70B-Reward) on HelpSteer2
  4. Generate synthetic preference pairs via the model itself + reward scoring
  5. Iterative DPO/RLAIF rounds with the reward model as judge
  6. Output: Llama-3.1-Nemotron-70B-Instruct

Why iteration helps: each round closes the gap between the model's outputs and the reward signal. After 3-5 rounds, marginal gains diminish.

The full recipe is documented in NVIDIA's NeMo-Aligner repo and is reproducible on any Llama base.
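Steps 4-5 above can be sketched in a few lines. This is an illustrative sketch, not NVIDIA's actual pipeline: `score` is a placeholder standing in for a real reward-model forward pass, and the pair layout simply matches the `prompt`/`chosen`/`rejected` format common to DPO trainers.

```python
# Sketch of steps 4-5: score candidate responses with a reward model,
# then keep the best/worst as a (chosen, rejected) DPO preference pair.

def score(prompt: str, response: str) -> float:
    # Placeholder reward: prefers longer, non-empty answers. In the real
    # pipeline this would be a forward pass through Nemotron-Reward.
    return float(len(response.split()))

def build_preference_pair(prompt: str, candidates: list[str]) -> dict:
    """Rank candidates by reward and emit one DPO pair for this prompt."""
    ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

pair = build_preference_pair(
    "Explain GQA briefly.",
    ["Grouped-query attention shares KV heads across query heads.",
     "It is attention."],
)
print(pair["chosen"])
```

Each RLAIF round regenerates candidates with the current policy, rebuilds pairs like this, and runs another DPO pass.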



Hardware Requirements & Quantization {#requirements}

| GPU Setup | Quant | Throughput |
|---|---|---|
| 1x H100 80GB | FP8 (TRT-LLM) | 100-130 tok/s |
| 1x H100 80GB | INT8 | 80-100 tok/s |
| 2x A100 80GB | BF16 | 60-80 tok/s |
| 1x A100 80GB | INT4 AWQ | 50-70 tok/s |
| 2x RTX 3090 (NVLink) | INT4 AWQ | 25-35 tok/s |
| 2x RTX 4090 (no NVLink) | INT4 AWQ | 20-30 tok/s |
| 1x RTX 4090 + 64 GB RAM | Q4_K_M GGUF (split) | 4-8 tok/s |
| CPU (128 GB RAM) | Q4_K_M GGUF | 1-3 tok/s |

For self-hosters: 2x RTX 3090 with NVLink + AWQ via vLLM is the cost-effective sweet spot. For production: H100 with TensorRT-LLM FP8.
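The VRAM figures in the table follow from simple arithmetic: parameter count times bits per weight. A quick sanity-check sketch (the ~4.5 effective bits for INT4 AWQ is an assumption that accounts for group scales; KV cache and activations add a few GB on top, so treat these as lower bounds):

```python
# Back-of-envelope weight footprint behind the hardware table above.
# Excludes KV cache and activation memory.

def weight_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(f"BF16: {weight_vram_gb(70, 16):.0f} GB")   # matches the 140 GB figure
print(f"FP8:  {weight_vram_gb(70, 8):.0f} GB")
print(f"AWQ:  {weight_vram_gb(70, 4.5):.0f} GB")  # ~4.5 effective bits incl. scales
```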


Nemotron 70B vs Llama 3.1 70B vs GPT-4o {#comparison}

| Benchmark | Nemotron 70B | Llama 3.1 70B | GPT-4o (May 2024) | Claude 3.5 Sonnet |
|---|---|---|---|---|
| Arena Hard | 85.0 | 55.7 | 79.2 | 79.3 |
| AlpacaEval 2 LC | 57.6 | 38.1 | 57.5 | 52.4 |
| MT-Bench | 8.98 | 8.78 | 8.74 | 9.10 |
| MMLU | 86.0 | 86.0 | 88.7 | 88.3 |
| MMLU-Pro | 60.4 | 60.4 | 73.3 | 75.1 |
| GSM8K | 94.5 | 95.1 | 95.8 | 96.4 |
| HumanEval | 83.5 | 80.5 | 90.2 | 92.0 |
| Context length | 131K | 131K | 128K | 200K |

Nemotron decisively wins on subjective chat benchmarks (Arena Hard, AlpacaEval 2 LC). Knowledge / math / code performance is essentially identical to base Llama 3.1 70B (post-training rarely moves these). For pure raw capability ceiling, GPT-4o / Claude 3.5 Sonnet are still ahead on MMLU-Pro and HumanEval.


vLLM Setup (Multi-GPU) {#vllm}

# 2x A100/H100 BF16
vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json

INT4 AWQ for tighter VRAM:

vllm serve neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w4a16 \
    --quantization compressed-tensors \
    --max-model-len 32768

For 2x RTX 3090:

vllm serve hugging-quants/Llama-3.1-Nemotron-70B-Instruct-AWQ-INT4 \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95

See vLLM Complete Setup Guide.
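Once the server is up, it speaks the OpenAI chat-completions protocol. A minimal standard-library sketch (endpoint and port are vLLM's defaults; the actual HTTP call is commented out since it needs a running server):

```python
import json
# import urllib.request  # uncomment to call a running vLLM server

# Request payload for vLLM's OpenAI-compatible endpoint
# (default: http://localhost:8000/v1/chat/completions).
payload = {
    "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    "messages": [{"role": "user", "content": "Summarize GQA in one sentence."}],
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 256,
}
body = json.dumps(payload).encode()
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
print(payload["model"])
```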


TensorRT-LLM Setup (FP8 H100) {#tensorrt}

For maximum H100 throughput:

git clone https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM/examples/llama

# Convert HF checkpoint to TRT-LLM format
python convert_checkpoint.py \
    --model_dir /models/Nemotron-70B-Instruct-HF \
    --output_dir /trt_ckpt/nemotron-70b-fp8 \
    --dtype bfloat16 \
    --use_fp8 \
    --tp_size 1

# Build engine
trtllm-build \
    --checkpoint_dir /trt_ckpt/nemotron-70b-fp8 \
    --output_dir /trt_engines/nemotron-70b-fp8 \
    --gemm_plugin fp8 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --max_batch_size 16

# Serve the built engine via Triton Inference Server (tensorrtllm_backend)

Throughput: ~120 tok/s single-stream, 1500+ tok/s aggregate at batch 16 on a single H100. See TensorRT-LLM Setup Guide.


Ollama Setup {#ollama}

ollama pull nemotron:70b
ollama run nemotron:70b "Write a polite refusal email to a vendor missing SLAs."

Modelfile customization:

FROM nemotron:70b
PARAMETER num_ctx 16384
PARAMETER temperature 0.5
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a precise, helpful assistant. Be concise unless asked otherwise."""

Ollama auto-splits across multiple GPUs. For 2x RTX 3090 it's the simplest path.


llama.cpp Setup with GGUF {#llamacpp}

huggingface-cli download bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF \
    Llama-3.1-Nemotron-70B-Instruct-HF-Q4_K_M.gguf \
    --local-dir ./models

./llama-cli \
    -m models/Llama-3.1-Nemotron-70B-Instruct-HF-Q4_K_M.gguf \
    -ngl 999 -c 16384 -fa \
    --temp 0.5 --min-p 0.05 \
    -p "Plan a 5-step migration from MySQL 5.7 to PostgreSQL 16."

For split GPU + CPU:

./llama-cli -m nemotron-70b-Q4_K_M.gguf -ngl 60 -c 8192

-ngl 60 puts 60 of the 80 layers on the GPU and leaves the rest on the CPU. Use this when a single GPU has insufficient VRAM.
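Picking `-ngl` is simple division: free VRAM over per-layer weight size. A rough helper, assuming ~40 GB for 70B Q4_K_M spread over 80 layers and a small reserve for KV cache and runtime buffers (all figures approximate):

```python
# Back-of-envelope -ngl picker for llama.cpp GPU offload.

def pick_ngl(free_vram_gb: float, model_gb: float = 40.0,
             n_layers: int = 80, reserve_gb: float = 2.0) -> int:
    """Layers that fit on GPU after reserving VRAM for KV cache/buffers."""
    per_layer = model_gb / n_layers          # ~0.5 GB/layer for Q4_K_M 70B
    fit = int((free_vram_gb - reserve_gb) / per_layer)
    return max(0, min(n_layers, fit))

print(pick_ngl(24.0))   # single RTX 4090
print(pick_ngl(48.0))   # 2x RTX 3090 worth of VRAM
```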


Quantized Variants (AWQ, GPTQ, FP8) {#quants}

| Quant | VRAM (70B) | Speed (vs BF16) | Quality Loss |
|---|---|---|---|
| BF16 | 140 GB | 1.0x | 0% |
| FP8 (H100) | 70 GB | 1.4x | <0.5% |
| INT8 (W8A8) | 70 GB | 1.2x | <1% |
| AWQ INT4 | 38 GB | 2.0x | 1-2% |
| GPTQ INT4 | 38 GB | 1.8x | 1-2% |
| GGUF Q5_K_M | 49 GB | 1.5x | <1% |
| GGUF Q4_K_M | 40 GB | 1.7x | 1-2% |
| GGUF Q3_K_M | 32 GB | 1.9x | 3-5% |

Recommended: AWQ INT4 for vLLM, FP8 for TensorRT-LLM on H100, Q4_K_M for llama.cpp. See AWQ vs GPTQ vs GGUF.


Nemotron-Mini 4B for Edge / Agents {#mini}

ollama run nemotron-mini

Use cases:

  • Local agentic loops where 70B is too slow
  • On-device function calling (4 GB VRAM)
  • Embedded systems for routing / classification
  • Real-time chat at 100+ tok/s

Quality is below Llama 3.1 8B on knowledge tasks but competitive on tool-use and structured output — Nemotron-Mini was specifically tuned for agent workloads.
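A common pattern is pairing the two models: Mini routes and handles cheap tool calls, 70B handles long-form work. A hypothetical router sketch — `TOOL_HINTS` and the word-count threshold are made-up heuristics; a real deployment might use Nemotron-Mini itself as the classifier:

```python
# Toy request router: send short tool-style queries to Nemotron-Mini,
# everything else to the 70B flagship. Heuristics are illustrative only.

TOOL_HINTS = ("call", "lookup", "search", "weather", "convert")

def route(query: str) -> str:
    q = query.lower()
    if any(h in q for h in TOOL_HINTS) and len(q.split()) < 20:
        return "nemotron-mini"   # fast on-device function calling
    return "nemotron:70b"        # long-form reasoning / chat

print(route("search flights to Oslo"))
print(route("Write a detailed migration plan for our MySQL cluster."))
```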


Nemotron-Reward for Custom RLHF {#reward}

The Llama-3.1-Nemotron-70B-Reward model scores any prompt+response on the 5 HelpSteer2 axes. Use it to:

  1. Train your own preference dataset: feed your model outputs through the reward model, keep top-rated.
  2. Filter synthetic SFT data: discard low-quality generations before fine-tuning.
  3. Run DPO/PPO on any base model: use Nemotron-Reward as the judge.

from transformers import AutoModelForCausalLM, AutoTokenizer

reward_model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Reward-HF",
    torch_dtype="bfloat16", device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Reward-HF")

# Score a response
messages = [
    {"role": "user", "content": "Explain RoPE in 2 sentences."},
    {"role": "assistant", "content": "..."}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
# Per the model card, the scalar reward is read from the generated logits
out = reward_model.generate(inputs, max_new_tokens=1,
                            return_dict_in_generate=True, output_scores=True)
score = out.scores[0][0][0].item()  # higher = better

See DPO / ORPO / KTO Guide.
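Pattern 2 above (filtering synthetic SFT data) reduces to a threshold over reward scores. A sketch with a stub scorer — `reward` here is a placeholder you would replace with a real Nemotron-Reward call, and the threshold is an assumption to tune on your own data:

```python
# Filter synthetic SFT samples by reward score before fine-tuning.

def reward(sample: dict) -> float:
    # Stub: penalizes empty or one-word answers. Replace with a
    # reward-model forward pass in practice.
    return float(len(sample["response"].split()))

def filter_sft(samples: list[dict], threshold: float = 3.0) -> list[dict]:
    """Keep only samples whose reward clears the threshold."""
    return [s for s in samples if reward(s) >= threshold]

data = [
    {"prompt": "What is RoPE?",
     "response": "Rotary position embeddings encode positions via rotations."},
    {"prompt": "What is RoPE?", "response": "idk"},
]
kept = filter_sft(data)
print(len(kept))
```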


Fine-Tuning Nemotron 70B {#fine-tuning}

QLoRA on 2x RTX 3090 (NVLink) with Axolotl:

base_model: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: ./train.jsonl
    type: chat_template

sequence_len: 2048
sample_packing: true
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 1e-4
warmup_ratio: 0.03
optimizer: paged_adamw_8bit
bf16: auto
flash_attention: true
output_dir: ./nemotron_lora

Launch training:

accelerate launch --num_processes 2 -m axolotl.cli.train config.yml

Time: ~36 hours for 1K examples on 2x RTX 3090. See QLoRA Fine-Tuning Guide.
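The step count implied by the config above is simple arithmetic: with micro_batch_size 1, gradient_accumulation_steps 8, and 2 GPUs, the effective batch is 16 sequences per optimizer step. A quick estimate (sample packing merges short examples into one sequence, so real step counts are lower — treat this as an upper bound):

```python
# Optimizer steps implied by the Axolotl config above (ignoring packing).

def train_steps(n_examples: int, epochs: int, micro_bs: int,
                grad_accum: int, n_gpus: int) -> int:
    effective_batch = micro_bs * grad_accum * n_gpus
    return (n_examples * epochs) // effective_batch

print(train_steps(1000, 3, 1, 8, 2))
```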


Function Calling & Structured Output {#tools}

vllm serve nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --tensor-parallel-size 2 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json

Then OpenAI-compatible:

{
  "model": "nemotron",
  "messages": [...],
  "tools": [{"type": "function", "function": {...}}],
  "tool_choice": "auto"
}

Nemotron's HelpSteer2 tuning improved tool-call argument reliability — fewer hallucinated parameters than base Llama 3.1 70B in our agentic tests. See Ollama Function Calling.
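Building the request body in code makes the shape concrete. The `get_weather` schema below is a made-up example; any JSON-Schema function definition plugs into the same structure:

```python
import json

# Build an OpenAI-style tool-calling request for the vLLM server above.
def make_tool_request(user_msg: str) -> dict:
    return {
        "model": "nemotron",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",           # hypothetical example tool
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }

print(json.dumps(make_tool_request("Weather in Oslo?"), indent=2)[:80])
```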


System Prompts & Sampling {#prompting}

Llama 3.1 chat template:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

[user message]<|eot_id|><|start_header_id|>assistant<|end_header_id|>
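The template above, applied by hand — purely for illustrating the exact token layout; in practice `tokenizer.apply_chat_template` does this for you:

```python
# Manually assemble a Llama 3.1 chat prompt (single-turn, system + user).

def llama31_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

p = llama31_prompt("You are a helpful assistant.", "Hi")
print(p.count("<|eot_id|>"))
```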

Recommended sampling:

  • Chat / instruction: temperature 0.5, min-p 0.05, top-p 0.9
  • Code/reasoning: temperature 0.2, min-p 0.05
  • Creative: temperature 0.8, min-p 0.05

Nemotron 70B was tuned with HelpSteer2's "verbosity" axis — it tends toward longer, more structured responses than base Llama. Add "be concise" to system prompt if you want shorter outputs.
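What min-p 0.05 actually does: drop any token whose probability is below 5% of the top token's probability, then renormalize. A pure-Python sketch of the filter (real samplers work on logits, but the probability form is equivalent):

```python
# Min-p filtering over a toy probability distribution.

def min_p_filter(probs: list[float], min_p: float = 0.05) -> list[float]:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    cutoff = max(probs) * min_p
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

filtered = min_p_filter([0.6, 0.3, 0.05, 0.02, 0.03])
print(sum(1 for p in filtered if p > 0))  # tokens that survive the cutoff
```

Unlike top-p, the cutoff scales with the model's confidence: a peaked distribution prunes aggressively, a flat one keeps more candidates.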


Real Benchmarks {#benchmarks}

Single-user, vLLM, AWQ INT4, 2x RTX 3090 NVLink:

| Test | Nemotron 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| Arena Hard | 85.0% | 55.7% | 69.3% |
| MT-Bench | 8.98 | 8.78 | 8.99 |
| MMLU | 86.0% | 86.0% | 88.6% |
| HumanEval | 83.5% | 80.5% | 89.0% |
| Throughput (single user) | ~30 tok/s | ~30 tok/s | n/a (too big) |
| TTFT (1K prompt, 2x 3090) | 380 ms | 380 ms | n/a |

For chat workloads at 70B, Nemotron beats both base Llama 70B and 405B on subjective quality.


Licensing {#licensing}

Inherits Llama 3.1 Community License (Meta) plus NVIDIA Open Model License rider.

You can:

  • Use commercially under 700M MAU threshold
  • Modify and redistribute
  • Fine-tune and build derivative models
  • Publish derivatives under permissive terms (with attribution)

You cannot:

  • Deploy at >700M monthly active users without separate Meta agreement
  • Train another LLM on its outputs without complying with Meta's naming/attribution requirements
  • Violate Meta's Acceptable Use Policy

For most enterprise use: license is functionally similar to Apache 2.0 in practice. For consumer apps with unclear MAU forecasts: Apache 2.0 alternatives like OLMo 2 or Mistral may be safer.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| OOM on 2x RTX 3090 | BF16 doesn't fit | Use AWQ INT4 quant |
| Slow tensor parallel | NVLink missing | Add NVLink bridge or use INT4 + smaller TP |
| Tool calls malformed | Wrong parser | --tool-call-parser llama3_json |
| Verbose responses | HelpSteer2 verbosity axis | Add "be concise" system prompt |
| TRT-LLM FP8 build fails | Old CUDA | Needs CUDA 12.4+, TRT 10+ |
| Repetitive output | No min-p | Set min-p 0.05 |
| Tokenizer issues in Ollama | Stale cache | ollama rm nemotron && ollama pull nemotron:70b |

FAQ {#faq}

See answers to common Nemotron 70B questions below.


Sources: NVIDIA Nemotron 70B announcement | HelpSteer2 dataset | HelpSteer2 paper (arXiv 2406.08673) | NeMo-Aligner repo | Nemotron-Reward | Internal benchmarks 2x RTX 3090 + AWQ.

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor
📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Go from reading about AI to building with AI

10 structured courses. Hands-on projects. Runs on your machine. Start free.

Free Tools & Calculators