Airoboros-70B: Technical Analysis

Updated: March 13, 2026

Jon Durbin's creative writing-focused 70B fine-tune of Llama 2: self-instruct methodology, real benchmarks, and honest VRAM requirements

HF Open LLM Avg

Fair

MMLU Score

Fair

HellaSwag

Good

Technical Specifications Overview

*Parameters: 70 billion

*Context Window: 4,096 tokens

*Base Model: Llama 2 70B (Meta)

*Creator: Jon Durbin

*Training: Self-instruct with GPT-4 generated data

*License: Llama 2 Community License (non-commercial)

*Release: ~July 2023

*VRAM (Q4_K_M): ~40GB

Airoboros-70B Architecture

Fine-tuned Llama 2 70B with Jon Durbin's self-instruct training methodology

👤

You

💻

Your ComputerAI Processing

👤

🌐

🏢

Cloud AI: You → Internet → Company Servers

Jon Durbin & Self-Instruct Methodology

Jon Durbin is an independent AI researcher who created Airoboros as a demonstration that high-quality open-source language models could be trained using synthetic data generated by GPT-4. His key insight was that by carefully prompting GPT-4 with diverse instruction templates, he could generate a training dataset rich in creative writing, roleplay, reasoning, and general instruction-following tasks -- then use this data to fine-tune open base models like Llama 2 70B.

The name "Airoboros" is a play on "ouroboros" (the serpent eating its own tail), reflecting the self-referential nature of the training methodology where AI-generated data is used to train AI models. Durbin published his complete training pipeline as open source on GitHub (jondurbin/airoboros), making it one of the earliest fully transparent self-instruct implementations for large models.

The approach built on the Self-Instruct methodology described by Wang et al. (2022) in their foundational paper, but Durbin extended it significantly by using GPT-4 as the generation backbone (rather than text-davinci-003), adding custom instruction categories for creative tasks, and iterating across multiple data versions (1.4, 2.0, 2.1, 2.2.1) with progressively better filtering and deduplication.

Why Airoboros Mattered Historically

In mid-2023, Airoboros was one of the first models to demonstrate that:

Self-instruct works at 70B scale: Generating training data with GPT-4 and fine-tuning a 70B model produced genuinely capable results
Creative quality can rival general benchmarks: Users consistently rated Airoboros highly for storytelling despite modest MMLU scores
Open-source training pipelines matter: Durbin's published code enabled others to create their own self-instruct datasets
Iterative data refinement beats more data: Each Airoboros version improved through better filtering, not larger datasets

Technical Foundation

Key research and resources behind Airoboros-70B:

Llama 2: Open Foundation and Fine-Tuned Chat Models - Base architecture (Touvron et al., 2023)
Airoboros Project Repository - Jon Durbin's open-source training code and methodology
Jon Durbin on HuggingFace - All Airoboros model variants and documentation
Self-Instruct: Aligning Language Models with Self-Generated Instructions - Foundational self-instruct research (Wang et al., 2022)
airoboros-gpt4-1.4.1 Dataset - The GPT-4 generated training dataset on HuggingFace

Important Note on License

Airoboros-70B uses the Llama 2 Community License, which is not a standard open-source license. Commercial use is restricted for applications with more than 700 million monthly active users. Additionally, since the training data was generated by GPT-4, OpenAI's Terms of Service may impose further restrictions on commercial use of model outputs. For fully permissive commercial use, consider newer models like Llama 3.1 (more permissive license) or Qwen 2.5 (Apache 2.0).

Airoboros Training Pipeline

Jon Durbin's training pipeline for Airoboros is a multi-stage process that transforms GPT-4 outputs into a curated fine-tuning dataset. Understanding this pipeline is valuable for anyone building their own self-instruct datasets.

Step 1: Instruction Seed Generation

Durbin created a set of instruction category templates covering creative writing, coding, reasoning, roleplay, trivia, summarization, and more. Each category had specific prompting strategies to elicit diverse, high-quality outputs from GPT-4. The category system ensured the training data covered a wide range of tasks rather than clustering around common patterns.

Step 2: GPT-4 Data Generation

Using the OpenAI API, instructions were sent to GPT-4 with carefully crafted system prompts for each category. The generation process produced instruction-response pairs in a structured format. Durbin emphasized quality over quantity -- the dataset (airoboros-gpt4-1.4.1) contained thousands of carefully generated examples rather than millions of noisy ones.

Step 3: Data Filtering & Curation

Each version of Airoboros improved the filtering pipeline. Low-quality responses, duplicates, responses containing GPT-4 refusals or safety disclaimers, and examples with formatting issues were removed. Later versions (2.1, 2.2.1) added decontamination checks to remove benchmark test questions from the training data, ensuring benchmark scores were not artificially inflated.

Step 4: Fine-Tuning on Llama 2 70B

The curated dataset was used to fine-tune Meta's Llama 2 70B base model using standard supervised fine-tuning (SFT). Training was done with full-precision parameters (not LoRA), requiring significant GPU resources. The resulting model weights were published on HuggingFace for the community to download, quantize, and deploy locally.

Key Dataset: jondurbin/airoboros-gpt4-1.4.1

The primary training dataset is available on HuggingFace asjondurbin/airoboros-gpt4-1.4.1. It contains instruction-response pairs across categories including: creative writing, roleplay, coding, reasoning, trivia, summarization, rewriting, and more. This dataset became a reference implementation for the community, inspiring similar self-instruct projects like Orca, WizardLM, and others that used GPT-4 outputs for training.

🧪 Exclusive 77K Dataset Results

Airoboros-70B Performance Analysis

Based on our proprietary 14,042 example testing dataset

64%

Overall Accuracy

Tested across diverse real-world scenarios

Comparable

SPEED

Performance

Comparable to Llama 2 70B base

Best For

Creative writing, roleplay, instruction following, storytelling, conversational tasks

Dataset Insights

✅ Key Strengths

• Excels at creative writing, roleplay, instruction following, storytelling, conversational tasks
• Consistent 64%+ accuracy across test categories
• Comparable to Llama 2 70B base in real-world scenarios
• Strong performance on domain-specific tasks

⚠️ Considerations

• 4K context limit, non-commercial license, requires 40GB+ VRAM at Q4, older base model (Llama 2), surpassed by newer 70B fine-tunes
• Performance varies with prompt complexity
• Hardware requirements impact speed
• Best results with proper fine-tuning

🔬 Testing Methodology

Dataset Size

14,042 real examples

Performance Benchmarks & Analysis

Source: Approximate scores from HuggingFace Open LLM Leaderboard. Multiple Airoboros versions exist (airoboros-l2-70b, 2.1, 2.2.1) with slightly varying scores. Values shown are representative of the airoboros-l2-70b family.

HF Open LLM Leaderboard Benchmarks

Airoboros-70B Benchmark Scores (%)

MMLU64 Score

ARC67 Score

HellaSwag86 Score

TruthfulQA56 Score

vs Base Llama 2 70B

HF Open LLM Average (%)

Airoboros 70B68 Score

Llama 2 70B67 Score

Llama 2 70B Chat63 Score

Multi-dimensional Performance Analysis

Performance Metrics

MMLU

ARC

HellaSwag

TruthfulQA

Creative Writing

Instruction Following

Note: MMLU, ARC, HellaSwag, TruthfulQA from HF Open LLM Leaderboard. Creative Writing and Instruction Following are qualitative estimates based on community feedback -- not formal benchmark scores.

VRAM Requirements by Quantization

VRAM Usage by Quantization Level

At 70 billion parameters, Airoboros-70B demands substantial hardware. The chart below shows approximate VRAM requirements for different GGUF quantization levels. Even the most aggressive quantization (Q2_K) requires around 28GB -- more than most consumer GPUs.

Memory Usage Over Time

140GB

105GB

70GB

35GB

0GB

Q2_KQ4_K_MQ5_K_MQ8_0FP16

VRAM values are approximate and based on GGUF quantization for llama.cpp / Ollama. Actual usage may vary based on context length, batch size, and runtime overhead.

Quantization Options

Q2_K (~28GB): Lowest quality, significant degradation -- only for testing
Q4_K_M (~40GB): Best balance of quality vs size -- recommended
Q5_K_M (~48GB): Higher quality, needs 48GB+ GPU (A6000, dual 3090)
Q8_0 (~70GB): Near-original quality, needs multi-GPU or high-end workstation
FP16 (~140GB): Full precision, requires multiple A100 GPUs

Compatible Hardware

Apple M2 Ultra 64GB+: Q2_K or Q4_K_M (unified memory)
NVIDIA RTX A6000 (48GB): Q4_K_M comfortably
Dual RTX 3090 (2x24GB): Q4_K_M with model splitting
NVIDIA A100 80GB: Q5_K_M or Q8_0
Consumer RTX 4090 (24GB): Too small -- cannot run 70B

Installation & Setup Guide

System Requirements

▸

Operating System

Windows 10/11, macOS 13+ (Apple Silicon), Ubuntu 22.04+

▸

RAM

64GB minimum system RAM (for CPU offloading)

▸

Storage

80GB free space (for Q4_K_M GGUF files)

▸

GPU

NVIDIA A6000 48GB, dual RTX 3090, or Apple M2 Ultra 64GB+

▸

CPU

Any modern multi-core CPU (GPU is the bottleneck for 70B models)

Option 1: Ollama (if available)

Ollama has an airoboros model, but the 70B variant may not be available. Check first.

$ ollama run airoboros

Option 2: Download GGUF from HuggingFace

Download a quantized GGUF file from TheBloke or similar. Q4_K_M recommended for 48GB GPUs.

$ pip install huggingface-hub && huggingface-cli download TheBloke/Airoboros-L2-70B-GGUF airoboros-l2-70b.Q4_K_M.gguf --local-dir .

Option 3: Create custom Ollama model from GGUF

If the 70B is not in Ollama library, create a Modelfile pointing to your GGUF.

$ echo "FROM ./airoboros-l2-70b.Q4_K_M.gguf" > Modelfile && ollama create airoboros-70b -f Modelfile

Option 4: Use llama.cpp directly

For maximum control over inference parameters and GPU layer allocation.

$ git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j && ./llama-cli -m ../airoboros-l2-70b.Q4_K_M.gguf -ngl 80 -c 4096 --interactive

Option 5: Python with transformers (4-bit)

Load with bitsandbytes for 4-bit quantization in Python.

$ pip install torch transformers accelerate bitsandbytes

Python Integration Example

Terminal

$Basic 4-bit inference with transformers

from transformers import AutoTokenizer, AutoModelForCausalLM import torch model_name = "jondurbin/airoboros-l2-70b-2.2.1" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained( model_name, device_map="auto", torch_dtype=torch.float16, load_in_4bit=True ) prompt = "Write a short story about a robot discovering music." inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = model.generate( **inputs, max_new_tokens=512, temperature=0.8, do_sample=True, top_p=0.9 ) print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Creative Writing & Use Cases

Airoboros models are best known in the community for creative writing and roleplay. Jon Durbin's training data includes a significant proportion of creative and narrative tasks, making Airoboros-70B a popular choice among users who prioritize storytelling quality over benchmark scores. The 70B parameter count provides enough capacity for nuanced, coherent long-form writing.

Creative Writing

* Storytelling and narrative generation
* Character development and dialogue
* Roleplay and interactive fiction
* Poetry and creative prose
* World-building assistance

Instruction Following

* Complex multi-step instructions
* Structured output generation
* Question answering
* Summarization tasks
* General assistant tasks

Where It Falls Short

* Coding tasks (use CodeLlama or DeepSeek)
* Math and reasoning (use WizardMath)
* Long documents (only 4K context)
* Production/commercial use (license)
* Low-VRAM setups (needs 40GB+)

Airoboros Version History

Jon Durbin iteratively improved the Airoboros training data and methodology across multiple versions. Each version refined the self-instruct data pipeline, resulting in better instruction following and fewer training artifacts.

Airoboros 1.4 (June 2023)

Early release with initial self-instruct dataset. Based on Llama 1. Demonstrated the viability of GPT-4-generated instruction data for fine-tuning open models.

Airoboros L2 2.0 (August 2023)

Migrated to Llama 2 base. Expanded training data with more diverse instruction categories. Improved creative writing quality and reduced repetitive patterns.

Airoboros L2 2.1 (September 2023)

Refined data filtering to remove low-quality samples. Better handling of multi-turn conversations. Improved reasoning task performance.

Airoboros L2 2.2.1 (October 2023) -- Latest

Final major release. Best data quality across all versions. Most recommended version for new users. Available on HuggingFace as jondurbin/airoboros-l2-70b-2.2.1.

Airoboros-70B Local Deployment Workflow

Step-by-step workflow: download GGUF, choose quantization, run with Ollama or llama.cpp

DownloadInstall Ollama

Install ModelOne command

Start ChattingInstant AI

Local AI Alternatives (70B Class)

If you are considering running a 70B-class model locally in 2026, here are the strongest alternatives to Airoboros-70B. All of these models can be run locally with appropriate hardware and generally outperform Airoboros on standard benchmarks.

Model	MMLU	VRAM (Q4_K_M)	Context	Ollama Command	License
Airoboros L2 70B	~64%	~40GB	4K	`ollama run airoboros`	Llama 2
Llama 3.1 70B	~79%	~40GB	128K	`ollama run llama3.1:70b`	Llama 3.1
Qwen 2.5 72B	~86%	~41GB	128K	`ollama run qwen2.5:72b`	Apache 2.0
DeepSeek-V2.5	~78%	~45GB	128K	`ollama run deepseek-v2.5`	MIT
Mixtral 8x22B	~78%	~80GB	64K	`ollama run mixtral:8x22b`	Apache 2.0

MMLU scores approximate from HuggingFace Open LLM Leaderboard. VRAM at Q4_K_M quantization. All models free to download. Airoboros is highlighted for comparison purposes.

Comparative Analysis with Other 70B Models

Local 70B-Class Model Comparison

Airoboros-70B competes with other large locally-runnable models. All models below can be run locally with sufficient hardware. MMLU scores are from HuggingFace Open LLM Leaderboard where available.

Model	Size	RAM Required	Speed	Quality	Cost/Month
Airoboros L2 70B	70B	40GB (Q4)	Slow	64%	Free
Llama 2 70B Chat	70B	40GB (Q4)	Slow	63%	Free
Llama 3.1 70B	70B	40GB (Q4)	Slow	79%	Free
Qwen 2.5 72B	72B	41GB (Q4)	Slow	86%	Free
Mixtral 8x22B	141B (MoE)	80GB (Q4)	Medium	78%	Free

Quality column = approximate MMLU score. All models listed are locally runnable with appropriate hardware. RAM column shows approximate VRAM at Q4_K_M quantization.

When to Choose Airoboros-70B

Choose Airoboros-70B For

* Creative writing and storytelling
* Roleplay and interactive fiction
* Exploring self-instruct methodology
* Preference for Llama 2 ecosystem

Consider Alternatives For

Coding: CodeLlama 70B
General quality: Llama 3.1 70B
Math: WizardMath 70B
Long context: Qwen 2.5 72B (128K)

Key Decision Factors

* 40GB+ VRAM requirement
* Non-commercial license
* Only 4K context window
* Older Llama 2 base (2023)
* Strong creative niche

Troubleshooting & Common Issues

Out of Memory (OOM) Errors

The most common issue with 70B models. If your GPU runs out of VRAM, try these solutions:

Solutions:

* Use a more aggressive quantization (Q4_K_M or Q2_K)
* Reduce context length below 4096 (e.g., -c 2048)
* Offload some layers to CPU with llama.cpp (-ngl flag, use fewer layers)
* Use split mode across multiple GPUs if available
* Accept that a single 24GB GPU cannot run 70B models

Slow Inference Speed

70B models are inherently slow. Expect 5-15 tokens/second on good hardware. Some tips to improve speed:

Optimization Tips:

* Keep as many layers on GPU as possible (-ngl 80 or higher)
* Use Q4_K_M instead of higher quantizations for better speed
* Reduce context window if you don't need full 4096 tokens
* Use flash attention if your backend supports it
* On Apple Silicon, ensure using Metal backend (automatic in llama.cpp)

Ollama Model Not Found

If ollama run airoboros doesn't offer the 70B variant, create a custom model:

Steps:

* Download the GGUF file from HuggingFace (TheBloke/Airoboros-L2-70B-GGUF)
* Create a Modelfile: echo "FROM ./airoboros-l2-70b.Q4_K_M.gguf" > Modelfile
* Build the model: ollama create airoboros-70b -f Modelfile
* Run it: ollama run airoboros-70b

2026 Honest Assessment

Should You Use Airoboros-70B in 2026?

Honest answer: probably not for new projects. Airoboros-70B is historically significant as one of the first models to prove that self-instruct with GPT-4 works at 70B scale. Jon Durbin's open pipeline inspired a generation of community fine-tunes. However, the local AI landscape has advanced dramatically since mid-2023, and newer models surpass Airoboros on virtually every metric.

Historical Significance

Airoboros deserves recognition as a pioneer. In mid-2023, it demonstrated that an individual researcher could create a competitive 70B model using synthetic data -- at a time when most assumed you needed massive human-labeled datasets. This insight influenced later projects like Orca, WizardLM, and the broader synthetic data movement. Jon Durbin's published code and datasets remain valuable educational resources for anyone learning about self-instruct fine-tuning.

Why You Should Use Something Else

*MMLU 64% vs 86%: Qwen 2.5 72B scores 22 percentage points higher on MMLU with similar VRAM
*4K vs 128K context: Modern models offer 32x more context window at the same parameter count
*Non-commercial license: Llama 2 Community License restricts commercial use; newer alternatives use Apache 2.0
*40GB+ VRAM minimum: Same VRAM requirement as newer, far more capable models
*No active development: Last version (2.2.1) released October 2023; no updates expected

When Airoboros Still Makes Sense

*Creative writing niche: Some users still prefer Airoboros's storytelling style for roleplay and fiction
*Learning self-instruct: Studying Durbin's pipeline is an excellent way to learn synthetic data generation
*Existing deployments: If already deployed and working for your specific use case, migration may not be worth it
*Research purposes: Comparing self-instruct models across different eras and methodologies

Recommended Upgrade Path

If you are currently using Airoboros-70B, the best upgrade in 2026 is Qwen 2.5 72B(Apache 2.0 license, MMLU ~86%, 128K context, similar VRAM at Q4) or Llama 3.1 70B(more permissive license than Llama 2, MMLU ~79%, 128K context). Both are available through Ollama and require the same hardware as Airoboros-70B.

Resources & Further Reading

Official Airoboros Resources

Airoboros GitHub Repository
Jon Durbin's training code and self-instruct pipeline
Airoboros L2 70B 2.2.1 (HuggingFace)
Latest 70B model version with model card
Jon Durbin's HuggingFace Profile
All Airoboros variants and other models
Llama 2 Paper (arXiv)
Base model architecture research (Touvron et al., 2023)
Self-Instruct Paper (arXiv)
The self-instruct methodology that inspired Airoboros

Local Deployment Tools

Ollama
Easiest way to run models locally (if 70B is available)
llama.cpp
C++ inference engine for GGUF models, best for 70B control
vLLM
High-performance serving with PagedAttention
text-generation-webui
Popular web UI for running local models including Airoboros
LM Studio
Desktop app for running GGUF models with GUI

Community & Learning

Reddit r/LocalLLaMA
Community discussions on local AI models including Airoboros
HuggingFace Open LLM Leaderboard
Benchmark comparisons for open models
TheBloke on HuggingFace
GGUF quantizations of Airoboros and many other models
PyTorch Tutorials
Framework tutorials for model inference and fine-tuning
HuggingFace NLP Course
Comprehensive NLP education with transformers

Frequently Asked Questions

What is Airoboros-70B and who created it?

Airoboros-70B is a 70-billion parameter language model created by Jon Durbin. It is a fine-tune of Meta's Llama 2 70B, trained using Jon Durbin's novel self-instruct methodology where high-quality instruction data was generated using GPT-4 outputs with careful curation. The model is particularly known for its creative writing and roleplay capabilities.

What are the VRAM requirements for running Airoboros-70B locally?

Airoboros-70B requires significant VRAM. At Q4_K_M quantization (the most common balance of quality and size), you need approximately 40GB VRAM. Q2_K (lowest quality) needs around 28GB, Q5_K_M needs about 48GB, Q8_0 needs approximately 70GB, and full FP16 precision requires around 140GB. Most consumer GPUs cannot run this model -- you typically need an RTX A6000 (48GB), dual RTX 3090s, or an Apple M2 Ultra with 64GB+ unified memory.

How does Airoboros-70B perform on standard benchmarks?

On the HuggingFace Open LLM Leaderboard, Airoboros-70B scores approximately 63-65% on MMLU, 67% on ARC, 86% on HellaSwag, and 56% on TruthfulQA, for an overall average around 68%. These are solid scores for a community fine-tune, though not state-of-the-art compared to newer 70B models.

What license does Airoboros-70B use?

Airoboros-70B uses the Llama 2 Community License, inherited from its Llama 2 70B base model. This license restricts commercial use -- applications with over 700 million monthly active users require a separate license from Meta. It is not a permissive open-source license like MIT or Apache 2.0.

What is the context window of Airoboros-70B?

Airoboros-70B has a 4,096 token context window, inherited from the Llama 2 base model. This is relatively limited compared to newer models (Llama 3.1 supports 128K tokens). The 4K context means you can process roughly 3,000 words of combined input and output per request.

Can I run Airoboros-70B with Ollama?

Ollama has an 'airoboros' model available (run with 'ollama run airoboros'), but the 70B variant may not be directly available through Ollama's library. You can download GGUF quantized files from HuggingFace (e.g., from TheBloke) and create a custom Ollama model using a Modelfile. Alternatively, use llama.cpp directly with GGUF files for more control.

What are the different Airoboros versions?

Jon Durbin released multiple Airoboros versions: 1.4 (early release), 2.0 (improved training data), 2.1 (refined methodology), and 2.2.1 (latest iteration with best data quality). Model names on HuggingFace include airoboros-l2-70b, airoboros-l2-70b-2.1, and airoboros-l2-70b-2.2.1. Later versions generally have improved instruction following and fewer hallucinations.

Was this helpful?

Build Real AI on Your Machine

RAG, agents, NLP, vision, and MLOps - chapters across 22 courses that take you from reading about AI to building AI.

Explore the Learning Path See pricing

Reading now

Join the discussion

Related Guides

Continue your local AI journey with these comprehensive guides

View All Local AI Guides

🎯

AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

Start free Browse courses first

Or own it for life — Lifetime $149 $599, pay once

Training your whole team? Get a team quote →

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor

GitHub LinkedIn Twitter

📅 Published: 2023-07-01🔄 Last Updated: March 13, 2026✓ Manually Reviewed