Mixture of Experts (MoE) Explained: How DeepSeek & Llama 4 Work
MoE Quick Summary
What is Mixture of Experts?
Mixture of Experts (MoE) is an architecture that achieves better performance without proportionally increasing compute by using sparse activation—only some parts of the model process each input.
The Core Idea
Input Token → Router → Select Top-K Experts → Process → Combine Outputs
                    ↓
   [Expert 1]  [Expert 2]  [Expert 3]  ...  [Expert N]
       ✓           ✓           ✗                ✗
    (active)    (active)   (inactive)       (inactive)
Instead of one massive network, MoE uses:
- Multiple expert networks (typically 8-64 smaller networks)
- A router that decides which experts to use
- Sparse activation where only top-K experts (usually 2) process each token
Dense vs MoE: The Key Difference
Dense Model (Traditional)
- All parameters active for every token
- A 70B dense model runs all 70B parameters for every token
- Quality scales with size, but so does cost
MoE Model (Sparse)
- Only the top-K experts active per token
- 671B total parameters, 37B active (DeepSeek V3's numbers) = roughly the per-token compute of a 37B model
- Capacity of a 671B model at the per-token cost of a 37B model
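To make the arithmetic concrete, here is a rough per-token compute comparison, a minimal sketch using the common ~2 FLOPs-per-parameter approximation (the exact figures depend on sequence length, attention, and architecture details):

# Rough per-token compute using the ~2 FLOPs-per-parameter rule of thumb
def flops_per_token(active_params_billion: float) -> float:
    return 2 * active_params_billion * 1e9

dense_70b = flops_per_token(70)    # dense: all 70B parameters touched per token
moe_671b = flops_per_token(37)     # MoE: only the 37B active parameters touched

print(f"Dense 70B:     {dense_70b:.1e} FLOPs/token")
print(f"MoE 671B/37B:  {moe_671b:.1e} FLOPs/token")
# ~10x the total parameters of the dense model, at roughly half the per-token compute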
Real-World Example: Mixtral 8x7B
| Metric | Mixtral 8x7B (MoE) | 47B Dense Equivalent |
|---|---|---|
| Total Parameters | 47B | 47B |
| Active Parameters | 13B | 47B |
| Inference Speed | Like a 13B model | Like a 47B model |
| Quality | Near 47B | 47B |
| VRAM Needed (Q4) | ~28GB | ~28GB |
Result: Mixtral achieves near-47B quality at 13B speed.
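A common point of confusion: 8 × 7B is 56B, not 47B. Only the feed-forward (FFN) blocks are replicated per expert, while the attention layers and embeddings are shared, so the totals work out roughly as follows (a back-of-the-envelope sketch with approximate parameter counts, not official figures):

# Approximate parameter accounting for Mixtral 8x7B (rough numbers)
shared = 1.6e9          # attention + embeddings, shared across all experts
ffn_per_expert = 5.6e9  # feed-forward block, replicated once per expert
num_experts, top_k = 8, 2

total = shared + num_experts * ffn_per_expert   # ~46.4B -> the "47B" total
active = shared + top_k * ffn_per_expert        # ~12.8B -> the "13B" active
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")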
Major MoE Models in 2026
| Model | Total Params | Active Params | Experts | VRAM (Q4) |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 8 | 28GB |
| Mixtral 8x22B | 141B | 39B | 8 | 85GB |
| DeepSeek V3 | 671B | 37B | 256 | 24GB* |
| DeepSeek R1 | 671B | 37B | 256 | 24GB* |
| Llama 4 Scout | 109B | 17B | 16 | 12GB |
| Llama 4 Maverick | 400B | 17B | 128 | 24GB |
*Approximate GPU footprint with aggressive quantization and inactive expert weights offloaded to system RAM; the full 671B weights do not fit in 24GB of VRAM on their own (see the estimate below)
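As a rule of thumb, you can estimate the size of the full weights from the total parameter count and the quantization bit-width, as in this minimal sketch (real runtimes add overhead for the KV cache, activations, and framework buffers):

def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    # Size of the weights alone, ignoring KV cache and runtime overhead
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(47, 4))    # Mixtral 8x7B at Q4  -> ~23.5 GB
print(weight_memory_gb(671, 4))   # DeepSeek V3 at Q4   -> ~335 GB (hence the offloading)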
How Expert Routing Works
The Router Network
# Simplified router logic (top-k gating)
import torch
import torch.nn.functional as F

def route(token_embedding, router_weights, k=2):
    # Router scores each expert: one logit per expert, then softmax
    scores = F.softmax(token_embedding @ router_weights, dim=-1)
    # Select the top-K experts for this token
    top_k_weights, top_k_indices = torch.topk(scores, k=k)
    # Renormalize the selected weights so they sum to 1
    top_k_weights = top_k_weights / top_k_weights.sum()
    return top_k_indices, top_k_weights
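For example, with a toy embedding size of 16 and 8 experts (illustrative numbers only):

torch.manual_seed(0)
token_embedding = torch.randn(16)      # one token's hidden state
router_weights = torch.randn(16, 8)    # maps hidden state -> 8 expert logits
indices, weights = route(token_embedding, router_weights, k=2)
print(indices, weights)                # two expert ids, weights summing to 1.0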
Combining Expert Outputs
def forward(token, experts, router):
    # Ask the router which experts should handle this token
    indices, weights = router(token)
    output = 0
    for idx, weight in zip(indices, weights):
        # Only the selected experts run; the rest stay idle for this token
        expert_output = experts[idx](token)
        # Weighted sum of the active experts' outputs
        output += weight * expert_output
    return output
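Putting the pieces together, here is a minimal, self-contained MoE layer in PyTorch. This is a toy sketch with made-up sizes (d_model=64, 8 experts, top-2), not the implementation used by Mixtral or DeepSeek:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k MoE layer: a linear router plus a list of small MLP experts."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, token):                        # token: (d_model,)
        scores = F.softmax(self.router(token), dim=-1)
        top_w, top_i = torch.topk(scores, self.k)    # pick the top-k experts
        top_w = top_w / top_w.sum()                  # renormalize their weights
        out = torch.zeros_like(token)
        for w, i in zip(top_w, top_i):
            out = out + w * self.experts[i](token)   # only k experts actually run
        return out

layer = MoELayer()
print(layer(torch.randn(64)).shape)  # torch.Size([64])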
Advantages of MoE
1. Better Quality per FLOP
- 671B capacity with 37B compute
- Learn more specialized representations
- Each expert can focus on different tasks
2. Faster Inference
- Only 2 experts active per token (typically)
- Several times faster than a dense model of the same total capacity, since compute scales with active rather than total parameters
- Better for interactive applications
3. Scaling Efficiency
- Add experts without proportional compute increase
- Easier to scale to trillion parameters
- Better hardware utilization
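To see why this scales well, here is the arithmetic for a hypothetical model as the expert count grows (made-up sizes; per-token compute depends only on the top-K, not on the number of experts):

# Hypothetical sizes: 2B shared parameters, 1B per expert, top-2 routing
shared, per_expert, k = 2e9, 1e9, 2
for num_experts in (8, 64, 256):
    total = shared + num_experts * per_expert
    active = shared + k * per_expert
    print(f"{num_experts:3d} experts: total={total / 1e9:.0f}B, active={active / 1e9:.0f}B")
# Total capacity grows with the expert count; per-token compute stays flat at 4B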
Disadvantages of MoE
1. Higher Total VRAM
- All experts must fit in memory
- A 671B MoE model still stores weights for all 671B parameters
- Quantization helps, but the footprint stays far larger than a dense model with the same active parameter count
2. Load Balancing Challenges
- Some experts may be overused, others underused
- Requires auxiliary load-balancing losses during training (see the sketch after this list)
- Can reduce effective capacity
3. Training Complexity
- More hyperparameters to tune
- Expert collapse is possible, where the router learns to send most tokens to a few experts
- Harder to debug and optimize
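As an illustration of what such an auxiliary loss looks like, here is a minimal sketch in the style of the Switch Transformer load-balancing loss; it is a simplified stand-in, not the exact formulation used by any particular model:

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_indices, num_experts):
    # Encourages tokens to spread evenly across experts
    probs = F.softmax(router_logits, dim=-1)          # (num_tokens, num_experts)
    # f_i: fraction of tokens whose top-1 expert is i
    counts = torch.bincount(top1_indices, minlength=num_experts).float()
    f = counts / top1_indices.numel()
    # P_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1/num_experts each)
    return num_experts * torch.sum(f * p)

# Usage: add this, scaled by a small coefficient, to the main training loss
logits = torch.randn(32, 8)          # 32 tokens, 8 experts
top1 = logits.argmax(dim=-1)
print(load_balancing_loss(logits, top1, 8))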
VRAM Considerations for Local Use
What Fits in VRAM
| GPU | VRAM | Best MoE Model |
|---|---|---|
| RTX 4070 Ti | 16GB | Mixtral 8x7B Q4 |
| RTX 4090 | 24GB | DeepSeek V3 Q4, Llama 4 Maverick Q4 (inactive experts offloaded to system RAM) |
| RTX 5090 | 32GB | Llama 4 Maverick Q5 (inactive experts offloaded to system RAM) |
| Dual 4090 | 48GB | Mixtral 8x22B Q4 |
Why MoE Models Need Full VRAM
Even though only 2 experts are active for any given token, all expert weights must be held in memory because:
- Different tokens route to different experts
- Can't predict which experts will be needed
- Swapping experts from disk would be too slow
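A quick simulation makes the point: even with only 2 of 8 experts active per token, a modest batch of tokens touches every expert (toy numbers and random routing, purely for illustration):

import torch

torch.manual_seed(0)
num_experts, k, num_tokens, d_model = 8, 2, 256, 64
router = torch.randn(d_model, num_experts)
tokens = torch.randn(num_tokens, d_model)

scores = tokens @ router                  # (num_tokens, num_experts)
top_k = scores.topk(k, dim=-1).indices    # each token's chosen experts
used = torch.unique(top_k)
print(f"{used.numel()} of {num_experts} experts needed for {num_tokens} tokens")
# Typically prints "8 of 8" -- so every expert's weights must stay resident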
MoE in Practice
Running Mixtral Locally
# Pull the model with Ollama (install Ollama first from ollama.com)
ollama pull mixtral:8x7b
# Run it; expert routing is handled automatically by the runtime
ollama run mixtral:8x7b
Running DeepSeek V3
# DeepSeek's MoE is highly optimized, but even quantized the full
# 671B weights need several hundred GB of memory
ollama run deepseek-v3
Checking Expert Activation
# Some frameworks expose expert-usage statistics; the call below is
# illustrative pseudocode rather than a specific library's API
model.get_expert_usage()  # shows which experts activated most often
The Future of MoE
MoE is becoming the dominant architecture because:
- Frontier labs use it: GPT-4 (rumored), Gemini, DeepSeek all use MoE
- Efficiency scales: Can reach trillion parameters affordably
- Hardware catching up: Better support in CUDA, MLX, etc.
- Research advances: Better routing, load balancing, training
Key Takeaways
- MoE = Multiple experts, sparse activation
- Active params << Total params (e.g., 37B active in 671B model)
- Speed of small model, quality of large model
- Need VRAM for all params but compute only for active
- Dominant architecture for frontier models in 2025-2026
Next Steps
- Run DeepSeek V3 (MoE model) locally
- Try Llama 4 (MoE architecture)
- Compare models by architecture
- Choose hardware for MoE models
MoE represents the most significant architectural shift in LLMs since transformers. Understanding it helps you choose models wisely and optimize your local AI setup.