Mixture of Experts Architecture

Nous Hermes 2 Mixtral

Technical Analysis of MoE Implementation

A technical examination of Nous Research's DPO-fine-tuned Mixtral 8x7B model, featuring Mixture of Experts sparse activation, 32K context, and instruction-following capabilities.

  • Total Parameters: 46.7B
  • Active Parameters: 12.9B
  • Experts per Token: 2/8
  • License: Apache 2.0

Technical Specifications

Architecture details and benchmark performance for Nous Hermes 2 Mixtral

Model Architecture

  • Base Model: Mixtral 8x7B (Mistral AI)
  • Total Parameters: 46.7 billion
  • Active Parameters: ~12.9 billion per token
  • Expert Networks: 8 feed-forward networks
  • Experts per Token: 2 active (router-selected)
  • Context Window: 32,768 tokens
  • Fine-tuning: SFT + DPO (Nous Research)
  • License: Apache 2.0
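The "2 active (router-selected)" mechanism can be illustrated in a few lines. This is a simplified sketch of top-2 gating, not Mixtral's actual implementation: a learned router produces one logit per expert, the two highest-scoring experts are selected, and their softmax weights are renormalized to sum to 1 (equivalent to a softmax over just the two selected logits).

```python
import math

def top2_route(router_logits):
    """Select the two highest-scoring experts and renormalize their
    softmax weights, sketching Mixtral-style sparse top-2 gating."""
    # Softmax over all expert logits (max-subtracted for stability)
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the two most probable experts
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Renormalize so the two selected weights sum to 1
    denom = probs[top2[0]] + probs[top2[1]]
    weights = [probs[i] / denom for i in top2]
    return top2, weights

# One router decision for a single token over 8 experts
experts, weights = top2_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
# experts -> [1, 4]: only these two FFNs run for this token
```

Because only the two selected feed-forward networks execute, each token touches ~12.9B of the 46.7B parameters.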

Reported Benchmarks

Sources: HuggingFace model card, Open LLM Leaderboard

  • MMLU (Massive Multitask): ~70.6%
  • HellaSwag: ~84.4%
  • ARC-Challenge: ~66.4%
  • TruthfulQA: ~55.8%
  • GSM8K (Math): ~61.2%
  • HumanEval (Coding): ~48.3%

Benchmark scores are approximate and vary by evaluation methodology.

Training Methodology

Fine-tuning Approach

  1. Supervised Fine-Tuning (SFT) on curated instruction data
  2. Direct Preference Optimization (DPO) for alignment
  3. Multi-turn conversation fine-tuning
  4. Instruction-following dataset curation

Key Improvements Over Base

  • Better instruction following
  • Improved multi-turn conversation
  • More consistent output formatting
  • Reduced refusals on benign queries

Cost Analysis Calculator

Compare operational costs between local deployment and cloud AI services

Mixtral MoE Efficiency Calculator

[Interactive calculator: compares GPT-4 Turbo and Claude 3 Opus API costs against local Mixtral deployment and shows total savings]
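The calculator's arithmetic reduces to a few lines. This sketch uses illustrative per-million-token prices and a hypothetical monthly workload; substitute current API pricing, your GPU's power draw, and your electricity rate:

```python
def monthly_api_cost(tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """API cost in dollars for a month's traffic, given per-1M-token prices."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

def local_electricity_cost(hours, watts, price_per_kwh):
    """Electricity cost of running a local GPU for the given hours."""
    return hours * watts / 1000 * price_per_kwh

# Hypothetical workload: 30M input + 10M output tokens per month
api = monthly_api_cost(30e6, 10e6, 10.0, 30.0)  # illustrative $10/$30 per 1M tokens
local = local_electricity_cost(200, 450, 0.15)  # 200 h on a ~450 W GPU at $0.15/kWh
savings = api - local
```

Local deployment trades a fixed hardware cost for near-zero marginal cost per token, which is why savings scale with volume.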

Hardware Requirements

VRAM and system requirements for different quantization levels

Q4_K_M (Recommended)

  • VRAM: ~24GB
  • System RAM: 32GB
  • GPU Examples: RTX 3090, RTX 4090
  • Model Size: ~24GB
  • Speed (RTX 4090): ~20-30 t/s

Q8_0

  • VRAM: ~48GB
  • System RAM: 64GB
  • GPU Examples: 2x RTX 4090, A6000
  • Model Size: ~48GB
  • Speed (2x 4090): ~15-25 t/s

FP16 (Full Precision)

  • VRAM: ~90GB
  • System RAM: 128GB
  • GPU Examples: 2x A100 80GB
  • Model Size: ~90GB
  • Speed (2x A100): ~30-50 t/s
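The model sizes above follow from bits-per-weight arithmetic. A rough sketch, assuming ~4.5 and ~8.5 effective bits/weight for Q4_K_M and Q8_0 respectively (approximations; real GGUF files keep some tensors at higher precision and add metadata, so actual files run slightly larger):

```python
def model_size_gb(total_params_billions, bits_per_weight):
    """Rough VRAM/disk footprint: parameter count times bits per weight."""
    return total_params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 46.7B total parameters at each quantization level
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{model_size_gb(46.7, bits):.0f} GB")
```

Note the KV cache for a long 32K context adds several more GB on top of the weights.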

Apple Silicon Compatibility

Minimum Configuration

  • M2 Max with 32GB unified memory
  • Q4_K_M quantization required
  • Performance: ~10-15 tokens/sec
  • May require partial CPU offloading

Recommended Configuration

  • M2 Ultra 64GB+ or M3 Max 48GB+
  • Q4_K_M or Q5_K_M quantization
  • Performance: ~15-25 tokens/sec
  • Full GPU acceleration via Metal

MoE models are memory-bandwidth-bound. Apple Silicon's unified memory helps, but all 46.7B parameters must fit in memory even though only ~12.9B activate per token.
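That bandwidth bound can be quantified: each decoded token must stream the active expert weights from memory at least once, so memory bandwidth divided by active-weight bytes gives a hard ceiling on single-stream tokens/sec. A sketch using the RTX 4090's ~1008 GB/s spec and an assumed ~4.5 bits/weight for Q4_K_M:

```python
def decode_ceiling_tps(bandwidth_gb_s, active_params_billions, bits_per_weight):
    """Upper bound on decode tokens/sec for a memory-bandwidth-bound model."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~12.9B active parameters at ~4.5 bits/weight on a ~1008 GB/s GPU
ceiling = decode_ceiling_tps(1008, 12.9, 4.5)
```

Real-world speeds (~20-30 t/s on an RTX 4090) sit well below this ceiling because of KV-cache reads, dequantization, and compute overhead, but the formula explains why sparse activation helps: the dense 46.7B figure would appear in the denominator otherwise.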

Use Cases and Applications

Where Nous Hermes 2 Mixtral excels in real-world deployment

Enterprise Applications

  • Internal knowledge base chatbots
  • Code generation and documentation
  • Data analysis and reporting
  • Customer service automation
  • Technical support systems
  • Content creation workflows

Research and Development

  • Academic research assistance
  • Literature review and synthesis
  • Hypothesis generation
  • Data interpretation
  • Experimental design
  • Technical writing assistance

Development Tools

  • Code completion and review
  • Bug detection and fixing
  • API documentation generation
  • Test case generation
  • Refactoring assistance
  • Architecture design advice

Installation and Deployment

Step-by-step guide for deploying Nous Hermes 2 Mixtral locally

Deploy in Minutes

  • Difficulty: Beginner
  • Setup Time: 5 minutes
  • Cost: $0 (open-source)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Nous Hermes 2 Mixtral
ollama pull nous-hermes2-mixtral

# Start chatting
ollama run nous-hermes2-mixtral
Note on hardware requirements: Minimum 24GB VRAM (RTX 3090/4090) for the Q4-quantized version; Apple Silicon needs an M2 Max with 32GB+ or an M3 Max. The full FP16 model requires ~90GB, so quantization is essential on consumer hardware.

Key Takeaways

Strengths

  • Efficient MoE Architecture: 46.7B total parameters with only ~12.9B active per token provides a strong quality-per-FLOP ratio
  • DPO Fine-tuning: Direct Preference Optimization improves instruction following and conversation quality over base Mixtral
  • Free Local Deployment: Apache 2.0 license with no API costs; run locally with full data privacy
  • 32K Context Window: Long context for document analysis and extended conversations

Limitations

  • High VRAM Requirement: Even quantized, needs 24GB+ VRAM; all parameters must fit in memory despite sparse activation
  • Superseded by Newer Models: Llama 3 70B, Qwen 2.5 72B, and Mixtral 8x22B offer better quality at similar or better efficiency
  • Memory Bandwidth Bottleneck: MoE models are heavily memory-bandwidth-bound, limiting throughput on consumer hardware
  • Moderate Coding Performance: HumanEval ~48% trails specialized coding models like CodeLlama and DeepSeek Coder


