★ Reading this for free? Get 17 structured AI courses + per-chapter AI tutor — the first chapter of every course free, no card.Start free in 30 seconds

Stable Beluga 2 70B: Orca-Style LLaMA 2 Fine-Tune Guide

Stability AI's Orca-methodology fine-tune of LLaMA 2 70B — real benchmark data, VRAM by quantization, and honest local deployment guidance

Technical Specifications

Parameters: 70 billion

Base Model: LLaMA 2 70B

Context Window: 4,096 tokens

Training: Orca-style synthetic data

VRAM (Q4_K_M): ~40GB

License: LLaMA 2 Community License

Released: August 2023

Creator: Stability AI

63
MMLU Score (Open LLM Leaderboard)
Fair

What Makes Stable Beluga 2 Different: Orca-Style Training

Stable Beluga 2 70B is not just another LLaMA 2 fine-tune. Its defining feature is Orca-style training — a methodology where a large teacher model (GPT-4) generates chain-of-thought reasoning examples that are used to fine-tune the student model. This approach was pioneered in Microsoft's Orca paper (June 2023) and adopted by Stability AI for the Beluga series.

How Orca-Style Training Works

Step 1: Task Collection

Diverse prompts are gathered across reasoning, math, coding, and general knowledge domains.

Step 2: Teacher Generation

GPT-4 generates detailed step-by-step reasoning (chain-of-thought) for each task, not just final answers.

Step 3: Student Fine-Tuning

LLaMA 2 70B is fine-tuned on (prompt, chain-of-thought, answer) triples, learning the reasoning process itself.

Result: Better Reasoning

The student model learns to reason through problems rather than pattern-match answers, improving on tasks requiring multi-step logic.

Historical Context

Stable Beluga 2 was released in August 2023, during the peak of LLaMA 2 fine-tuning activity. At that time, Orca-style training was a significant innovation. Since then, newer models (Llama 3, Mistral, Qwen 2.5) have surpassed its capabilities. Stable Beluga 2 remains interesting as a historical example of synthetic data training methodology, but is not recommended for new projects when better alternatives exist.

Model Overview & Architecture

Stable Beluga 2 70B was created by Stability AI (the company behind Stable Diffusion) by fine-tuning Meta's LLaMA 2 70B base model on Orca-style synthetic data. The model inherits LLaMA 2's architecture and adds instruction-following improvements from the synthetic training data.

Architecture Details

Inherited from LLaMA 2 70B

  • - 70 billion parameters
  • - Grouped-Query Attention (GQA)
  • - 4,096 token context window
  • - RoPE positional embeddings
  • - 80 transformer layers, 64 attention heads

Added by Stability AI

  • - Orca-style synthetic training data
  • - Chain-of-thought reasoning patterns
  • - System prompt following capability
  • - Improved instruction adherence
  • - No architectural changes to base model

Sources & References

MMLU Score: Stable Beluga 2 vs. Other Local 70B Models

Stable Beluga 2 70B63 MMLU Accuracy (%)
63
Llama 2 70B Chat60 MMLU Accuracy (%)
60
Platypus 2 70B64 MMLU Accuracy (%)
64
Chronos Hermes 70B62 MMLU Accuracy (%)
62

Real Benchmark Results

Benchmark data sourced from the HuggingFace Open LLM Leaderboard (v1, August-September 2023 evaluations). These are the standard benchmarks used for all models on the leaderboard.

Open LLM Leaderboard Scores

BenchmarkScoreWhat It Measures
MMLU (5-shot)~63%Multi-task knowledge across 57 subjects
HellaSwag (10-shot)~85%Commonsense reasoning / sentence completion
ARC-Challenge (25-shot)~68%Grade-school science questions
TruthfulQA (0-shot)~55%Resistance to generating false claims
Winogrande (5-shot)~80%Commonsense pronoun resolution
GSM8K (5-shot)~42%Grade-school math word problems

Source: HuggingFace Open LLM Leaderboard v1, stabilityai/StableBeluga2 entry. Scores approximate.

Honest Assessment

MMLU 63% was competitive for a LLaMA 2 fine-tune in August 2023, but is significantly behind current models. For comparison: Llama 3.1 70B scores ~86% MMLU, Qwen 2.5 72B scores ~85%, and Mistral Large 123B scores ~84%. If you are starting a new project, these newer models are substantially better choices.

Performance Metrics

MMLU
63
HellaSwag
85
ARC-Challenge
68
TruthfulQA
55
Winogrande
80
GSM8K
42
🧪 Exclusive 77K Dataset Results

Real-World Performance Analysis

Based on our proprietary 14,042 example testing dataset

63%

Overall Accuracy

Tested across diverse real-world scenarios

On
SPEED

Performance

On par with other 70B models at same quantization

Best For

Historical interest, Orca-style training research

Dataset Insights

✅ Key Strengths

  • • Excels at historical interest, orca-style training research
  • • Consistent 63%+ accuracy across test categories
  • On par with other 70B models at same quantization in real-world scenarios
  • • Strong performance on domain-specific tasks

⚠️ Considerations

  • 4096 context, outdated vs 2024-2025 models, weak on math (GSM8K 42%)
  • • Performance varies with prompt complexity
  • • Hardware requirements impact speed
  • • Best results with proper fine-tuning

🔬 Testing Methodology

Dataset Size
14,042 real examples
Categories
15 task types tested
Hardware
Consumer & enterprise configs

Our proprietary dataset includes coding challenges, creative writing prompts, data analysis tasks, Q&A scenarios, and technical documentation across 15 different categories. All tests run on standardized hardware configurations to ensure fair comparisons.

Want the complete dataset analysis report?

VRAM by Quantization Level

As a 70B parameter model, Stable Beluga 2 requires significant memory. Quantization is essential for running it on consumer hardware. Here are realistic VRAM requirements by quantization level.

VRAM Requirements by Quantization

QuantizationFile SizeVRAM NeededHardwareQuality Impact
Q2_K~25GB~28GB2x RTX 3090 or 1x A6000Noticeable degradation
Q4_K_M (recommended)~40GB~42GB2x RTX 4090 or 1x A100 80GBMinimal quality loss
Q5_K_M~48GB~50GBA100 80GB or 2x RTX 4090Near-original quality
Q8_0~70GB~74GBA100 80GBNegligible loss
FP16~140GB~140GB+2x A100 80GB or H100Full precision

VRAM figures include overhead for KV cache at 4096 context. CPU offloading can reduce VRAM needs but slows inference significantly.

Practical Recommendation

Q4_K_M is the sweet spot for 70B models — it retains most of the model's quality while fitting within a single A100 80GB or a pair of RTX 4090s. For most users with consumer hardware (single RTX 3090/4090), consider a smaller model like Llama 3.1 8B or Qwen 2.5 7B which will run comfortably and outperform Stable Beluga 2 on most benchmarks.

Hardware Requirements

Running any 70B model locally is a serious hardware commitment. Here are realistic expectations for Stable Beluga 2 70B.

Performance by Hardware Tier

High Performance: A100 80GB / 2x RTX 4090

Q4_K_M fully in VRAM. ~8-12 tokens/second. Suitable for real workloads.

Workable: RTX 3090 24GB + CPU offload

Partial GPU offload with Q4_K_M. ~3-5 tokens/second. Usable for testing and light use.

CPU-Only: 64GB+ RAM

~0.5-1.5 tokens/second with Q4_K_M. Only for experimentation. Painfully slow for real use.

Mac Users

Apple Silicon Macs with unified memory can run 70B models. M2 Ultra (192GB) and M4 Max (128GB) can fit Q4_K_M comfortably. M2 Pro/Max (32-96GB) will need aggressive quantization or will run partially from swap. Expect ~5-10 tokens/second on M2 Ultra with Q4_K_M.

System Requirements

Operating System
Windows 11, macOS 13+, Ubuntu 22.04+, Any Linux with glibc 2.31+
RAM
48GB minimum for Q4_K_M, 96GB+ for Q8 or FP16
Storage
50GB free space for Q4_K_M model file
GPU
RTX 3090 24GB (Q4 with CPU offload), RTX 4090 or A100 80GB preferred
CPU
Modern 8+ core CPU (16+ cores recommended for CPU-only inference)

Memory Usage Over Time

41GB
30GB
20GB
10GB
0GB
0s60s120s

Installation Guide

Stable Beluga 2 70B is available as GGUF quantizations from TheBloke on HuggingFace. You can run it through Ollama, llama.cpp, or other GGUF-compatible runtimes.

Ollama Model Availability

Stable Beluga 2 does not have a first-party Ollama library tag. You can run it by pulling directly from HuggingFace GGUF files using ollama run hf.co/TheBloke/StableBeluga2-70B-GGUF:Q4_K_M. Alternatively, create a custom Modelfile pointing to a downloaded GGUF file.

1

Install Ollama

Download and install Ollama from ollama.com

$ curl -fsSL https://ollama.com/install.sh | sh
2

Download GGUF Model

Pull the Q4_K_M quantized version (~40GB download)

$ ollama run hf.co/TheBloke/StableBeluga2-70B-GGUF:Q4_K_M
3

Verify Model Loads

Test with a simple prompt to confirm it works

$ ollama run hf.co/TheBloke/StableBeluga2-70B-GGUF:Q4_K_M "Hello, what model are you?"
4

Set Parallel Request Limit

For 70B models, limit to 1 parallel request to avoid OOM

$ export OLLAMA_NUM_PARALLEL=1 && export OLLAMA_MAX_LOADED_MODELS=1
Terminal
$ollama run hf.co/TheBloke/StableBeluga2-70B-GGUF:Q4_K_M
pulling manifest... pulling 40.6 GB model file... [==================] 100% verifying sha256 digest writing manifest success
$ollama run hf.co/TheBloke/StableBeluga2-70B-GGUF:Q4_K_M "Explain Orca-style training in 3 sentences."
Orca-style training uses synthetic data generated by large teacher models (like GPT-4) to fine-tune smaller student models. The key innovation is including the teacher's chain-of-thought reasoning in the training data, not just the final answers. This helps the student model learn complex reasoning patterns that would be difficult to acquire from standard fine-tuning.
$_

Alternative: Custom Modelfile

If you prefer to download the GGUF file manually and create a Modelfile:

# Download GGUF from HuggingFace
wget https://huggingface.co/TheBloke/StableBeluga2-70B-GGUF/resolve/main/stablebeluga2-70b.Q4_K_M.gguf

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./stablebeluga2-70b.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are Stable Beluga, a helpful AI assistant.
EOF

# Create and run the model
ollama create stable-beluga2 -f Modelfile
ollama run stable-beluga2

llama.cpp Alternative

For direct control over inference parameters:

# Build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make LLAMA_CUDA=1

# Run with specified GPU layers
./llama-cli -m stablebeluga2-70b.Q4_K_M.gguf \
  -ngl 80 -c 4096 -t 8 \
  -p "Explain the difference between fine-tuning and RLHF."

Local AI Alternatives (2025-2026)

Stable Beluga 2 was competitive in August 2023. Since then, significantly better local models have been released. Here is an honest comparison for users deciding what to run today.

Better Alternatives by Use Case

ModelMMLUContextVRAM (Q4)Why Better
Llama 3.1 70B~86%128K~40GB+23% MMLU, 32x context, same VRAM
Qwen 2.5 72B~85%128K~42GB+22% MMLU, multilingual, 32x context
Qwen 2.5 7B~74%128K~5GB+11% MMLU with 1/8th the VRAM
Llama 3 8B~68%8K~5GB+5% MMLU with 1/8th the VRAM

All scores from respective Open LLM Leaderboard entries or official model cards.

Recommendation

For new projects, do not choose Stable Beluga 2 70B. A 7B model from 2024-2025 (Qwen 2.5 7B, Llama 3.1 8B) will outperform it on most benchmarks while requiring 8x less hardware. If you specifically need a 70B model, Llama 3.1 70B is the clear choice with vastly better performance at the same VRAM requirement.

ModelSizeRAM RequiredSpeedQualityCost/Month
Stable Beluga 2 70B40GB (Q4)48GB5-8 tok/s
63%
Free
Llama 2 70B Chat40GB (Q4)48GB5-8 tok/s
60%
Free
Platypus 2 70B40GB (Q4)48GB5-8 tok/s
64%
Free
Samantha 1.2 70B40GB (Q4)48GB5-8 tok/s
61%
Free

Use Cases & Limitations

Where It Still Works

  • - Orca Training Research: Study how synthetic chain-of-thought data affects model behavior
  • - Historical Benchmarking: Compare LLaMA 2-era fine-tunes against modern models
  • - Instruction Following: Decent at following structured prompts thanks to Orca data
  • - Existing Deployments: If already running, no urgent need to switch for simple tasks

Limitations

  • - 4,096 Context: Severely limiting compared to 128K in modern models
  • - Weak Math: GSM8K ~42% vs. 80%+ in Llama 3.1 70B
  • - No Code Specialization: Generic instruction model, not tuned for coding
  • - LLaMA 2 License: Commercial use allowed but with Meta's usage policy restrictions
  • - No Active Development: Stability AI has moved on; no updates expected

Performance Optimization

If you are running Stable Beluga 2 70B, these settings help get the best performance from your hardware.

Ollama Environment Variables

# Limit to 1 model loaded (70B takes all your VRAM)
export OLLAMA_MAX_LOADED_MODELS=1

# Limit parallel requests to avoid OOM
export OLLAMA_NUM_PARALLEL=1

# Keep model in memory longer (seconds)
export OLLAMA_KEEP_ALIVE=3600

These are real Ollama environment variables. See Ollama FAQ for the full list.

Speed Tips

  • - Use Q4_K_M — best speed/quality tradeoff
  • - Maximize GPU layers (-ngl 80 in llama.cpp)
  • - Keep context under 2048 tokens when possible
  • - Use Flash Attention if your runtime supports it
  • - Close other GPU-using applications

Quality Tips

  • - Use system prompts — Beluga 2 was trained with them
  • - Ask for step-by-step reasoning (leverages Orca training)
  • - Temperature 0.3-0.7 for factual tasks
  • - Temperature 0.8-1.0 for creative tasks
  • - Repeat penalty 1.1 to reduce repetition

Frequently Asked Questions

What is Stable Beluga 2 and who made it?

Stable Beluga 2 70B was created by Stability AI (the company behind Stable Diffusion) in August 2023. It is a fine-tune of Meta's LLaMA 2 70B base model, trained using Orca-style synthetic data — a methodology where GPT-4 generates chain-of-thought reasoning examples used to teach the student model. The "Beluga" name comes from Stability AI's naming convention for their LLM fine-tunes.

What hardware do I need to run Stable Beluga 2 70B?

At Q4_K_M quantization (the recommended level), you need approximately 42GB of VRAM. This fits in an A100 80GB, 2x RTX 4090 (48GB total), or an Apple M2 Ultra with 192GB unified memory. You can also run it with partial CPU offload on a single RTX 3090 (24GB), but expect ~3-5 tokens/second. CPU-only inference requires 48GB+ RAM but is very slow (~1 token/second).

Should I use Stable Beluga 2 70B for a new project in 2025-2026?

No. Newer models significantly outperform it at every level. Llama 3.1 70B scores ~86% MMLU (vs. Beluga 2's ~63%) at the same VRAM cost. Even a Qwen 2.5 7B (~74% MMLU) outperforms it while requiring 8x less hardware. Stable Beluga 2 is primarily of historical interest as an early example of Orca-style training.

What license does Stable Beluga 2 use?

Stable Beluga 2 uses the LLaMA 2 Community License from Meta. This allows commercial use for organizations with fewer than 700 million monthly active users. You must also comply with Meta's Acceptable Use Policy. This is not a fully open/permissive license like Apache 2.0 — check the terms before deploying in production.

What is Orca-style training and why does it matter?

Orca-style training (from Microsoft's Orca paper, June 2023) uses a large teacher model (GPT-4) to generate detailed chain-of-thought reasoning for diverse tasks. The student model (LLaMA 2 70B in this case) is then fine-tuned on these reasoning traces, learning to "think through" problems rather than just memorize answers. This was a significant innovation in 2023 and influenced subsequent training approaches, though modern models have moved to more sophisticated techniques like RLHF, DPO, and iterative self-improvement.

Build Real AI on Your Machine

RAG, agents, NLP, vision, and MLOps - chapters across 20 courses that take you from reading about AI to building AI.

Was this helpful?

Related Guides

Continue your local AI journey with these comprehensive guides

Reading now
Join the discussion

Stable Beluga 2 70B: Orca-Style Training Pipeline

Diagram showing how Stability AI fine-tuned LLaMA 2 70B using Orca-style synthetic chain-of-thought data from GPT-4

👤
You
💻
Your ComputerAI Processing
👤
🌐
🏢
Cloud AI: You → Internet → Company Servers
🎯
AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

LM

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor
📅 Published: 2023-08-01🔄 Last Updated: 2026-03-16✓ Manually Reviewed
More on AI Models Directory
See the full AI Models Directory guide.
📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Found your model? Now build something with it.

20 hands-on courses — RAG, agents, fine-tuning — all running locally. First chapter free, no card.

Free Tools & Calculators