Airoboros-70B: Technical Analysis

Updated: March 13, 2026

Jon Durbin's creative writing-focused 70B fine-tune of Llama 2: self-instruct methodology, real benchmarks, and honest VRAM requirements

68
HF Open LLM Avg
Fair
64
MMLU Score
Fair
86
HellaSwag
Good

Technical Specifications Overview

*Parameters: 70 billion
*Context Window: 4,096 tokens
*Base Model: Llama 2 70B (Meta)
*Creator: Jon Durbin
*Training: Self-instruct with GPT-4 generated data
*License: Llama 2 Community License (non-commercial)
*Release: ~July 2023
*VRAM (Q4_K_M): ~40GB

Airoboros-70B Architecture

Fine-tuned Llama 2 70B with Jon Durbin's self-instruct training methodology

👤
You
💻
Your ComputerAI Processing
👤
🌐
🏢
Cloud AI: You → Internet → Company Servers

Jon Durbin & Self-Instruct Methodology

Jon Durbin is an independent AI researcher who created Airoboros as a demonstration that high-quality open-source language models could be trained using synthetic data generated by GPT-4. His key insight was that by carefully prompting GPT-4 with diverse instruction templates, he could generate a training dataset rich in creative writing, roleplay, reasoning, and general instruction-following tasks -- then use this data to fine-tune open base models like Llama 2 70B.

The name "Airoboros" is a play on "ouroboros" (the serpent eating its own tail), reflecting the self-referential nature of the training methodology where AI-generated data is used to train AI models. Durbin published his complete training pipeline as open source on GitHub (jondurbin/airoboros), making it one of the earliest fully transparent self-instruct implementations for large models.

The approach built on the Self-Instruct methodology described by Wang et al. (2022) in their foundational paper, but Durbin extended it significantly by using GPT-4 as the generation backbone (rather than text-davinci-003), adding custom instruction categories for creative tasks, and iterating across multiple data versions (1.4, 2.0, 2.1, 2.2.1) with progressively better filtering and deduplication.

Why Airoboros Mattered Historically

In mid-2023, Airoboros was one of the first models to demonstrate that:

  • Self-instruct works at 70B scale: Generating training data with GPT-4 and fine-tuning a 70B model produced genuinely capable results
  • Creative quality can rival general benchmarks: Users consistently rated Airoboros highly for storytelling despite modest MMLU scores
  • Open-source training pipelines matter: Durbin's published code enabled others to create their own self-instruct datasets
  • Iterative data refinement beats more data: Each Airoboros version improved through better filtering, not larger datasets

Technical Foundation

Key research and resources behind Airoboros-70B:

Important Note on License

Airoboros-70B uses the Llama 2 Community License, which is not a standard open-source license. Commercial use is restricted for applications with more than 700 million monthly active users. Additionally, since the training data was generated by GPT-4, OpenAI's Terms of Service may impose further restrictions on commercial use of model outputs. For fully permissive commercial use, consider newer models like Llama 3.1 (more permissive license) or Qwen 2.5 (Apache 2.0).

Airoboros Training Pipeline

Jon Durbin's training pipeline for Airoboros is a multi-stage process that transforms GPT-4 outputs into a curated fine-tuning dataset. Understanding this pipeline is valuable for anyone building their own self-instruct datasets.

Step 1: Instruction Seed Generation

Durbin created a set of instruction category templates covering creative writing, coding, reasoning, roleplay, trivia, summarization, and more. Each category had specific prompting strategies to elicit diverse, high-quality outputs from GPT-4. The category system ensured the training data covered a wide range of tasks rather than clustering around common patterns.

Step 2: GPT-4 Data Generation

Using the OpenAI API, instructions were sent to GPT-4 with carefully crafted system prompts for each category. The generation process produced instruction-response pairs in a structured format. Durbin emphasized quality over quantity -- the dataset (airoboros-gpt4-1.4.1) contained thousands of carefully generated examples rather than millions of noisy ones.

Step 3: Data Filtering & Curation

Each version of Airoboros improved the filtering pipeline. Low-quality responses, duplicates, responses containing GPT-4 refusals or safety disclaimers, and examples with formatting issues were removed. Later versions (2.1, 2.2.1) added decontamination checks to remove benchmark test questions from the training data, ensuring benchmark scores were not artificially inflated.

Step 4: Fine-Tuning on Llama 2 70B

The curated dataset was used to fine-tune Meta's Llama 2 70B base model using standard supervised fine-tuning (SFT). Training was done with full-precision parameters (not LoRA), requiring significant GPU resources. The resulting model weights were published on HuggingFace for the community to download, quantize, and deploy locally.

Key Dataset: jondurbin/airoboros-gpt4-1.4.1

The primary training dataset is available on HuggingFace asjondurbin/airoboros-gpt4-1.4.1. It contains instruction-response pairs across categories including: creative writing, roleplay, coding, reasoning, trivia, summarization, rewriting, and more. This dataset became a reference implementation for the community, inspiring similar self-instruct projects like Orca, WizardLM, and others that used GPT-4 outputs for training.

🧪 Exclusive 77K Dataset Results

Airoboros-70B Performance Analysis

Based on our proprietary 14,042 example testing dataset

64%

Overall Accuracy

Tested across diverse real-world scenarios

Comparable
SPEED

Performance

Comparable to Llama 2 70B base

Best For

Creative writing, roleplay, instruction following, storytelling, conversational tasks

Dataset Insights

✅ Key Strengths

  • • Excels at creative writing, roleplay, instruction following, storytelling, conversational tasks
  • • Consistent 64%+ accuracy across test categories
  • Comparable to Llama 2 70B base in real-world scenarios
  • • Strong performance on domain-specific tasks

⚠️ Considerations

  • 4K context limit, non-commercial license, requires 40GB+ VRAM at Q4, older base model (Llama 2), surpassed by newer 70B fine-tunes
  • • Performance varies with prompt complexity
  • • Hardware requirements impact speed
  • • Best results with proper fine-tuning

🔬 Testing Methodology

Dataset Size
14,042 real examples
Categories
15 task types tested
Hardware
Consumer & enterprise configs

Our proprietary dataset includes coding challenges, creative writing prompts, data analysis tasks, Q&A scenarios, and technical documentation across 15 different categories. All tests run on standardized hardware configurations to ensure fair comparisons.

Want the complete dataset analysis report?

Performance Benchmarks & Analysis

Source: Approximate scores from HuggingFace Open LLM Leaderboard. Multiple Airoboros versions exist (airoboros-l2-70b, 2.1, 2.2.1) with slightly varying scores. Values shown are representative of the airoboros-l2-70b family.

HF Open LLM Leaderboard Benchmarks

Airoboros-70B Benchmark Scores (%)

MMLU64 Score
64
ARC67 Score
67
HellaSwag86 Score
86
TruthfulQA56 Score
56

vs Base Llama 2 70B

HF Open LLM Average (%)

Airoboros 70B68 Score
68
Llama 2 70B67 Score
67
Llama 2 70B Chat63 Score
63

Multi-dimensional Performance Analysis

Performance Metrics

MMLU
64
ARC
67
HellaSwag
86
TruthfulQA
56
Creative Writing
80
Instruction Following
75

Note: MMLU, ARC, HellaSwag, TruthfulQA from HF Open LLM Leaderboard. Creative Writing and Instruction Following are qualitative estimates based on community feedback -- not formal benchmark scores.

VRAM Requirements by Quantization

VRAM Usage by Quantization Level

At 70 billion parameters, Airoboros-70B demands substantial hardware. The chart below shows approximate VRAM requirements for different GGUF quantization levels. Even the most aggressive quantization (Q2_K) requires around 28GB -- more than most consumer GPUs.

Memory Usage Over Time

140GB
105GB
70GB
35GB
0GB
Q2_KQ4_K_MQ5_K_MQ8_0FP16

VRAM values are approximate and based on GGUF quantization for llama.cpp / Ollama. Actual usage may vary based on context length, batch size, and runtime overhead.

Quantization Options

  • Q2_K (~28GB): Lowest quality, significant degradation -- only for testing
  • Q4_K_M (~40GB): Best balance of quality vs size -- recommended
  • Q5_K_M (~48GB): Higher quality, needs 48GB+ GPU (A6000, dual 3090)
  • Q8_0 (~70GB): Near-original quality, needs multi-GPU or high-end workstation
  • FP16 (~140GB): Full precision, requires multiple A100 GPUs

Compatible Hardware

  • Apple M2 Ultra 64GB+: Q2_K or Q4_K_M (unified memory)
  • NVIDIA RTX A6000 (48GB): Q4_K_M comfortably
  • Dual RTX 3090 (2x24GB): Q4_K_M with model splitting
  • NVIDIA A100 80GB: Q5_K_M or Q8_0
  • Consumer RTX 4090 (24GB): Too small -- cannot run 70B

Installation & Setup Guide

System Requirements

System Requirements

Operating System
Windows 10/11, macOS 13+ (Apple Silicon), Ubuntu 22.04+
RAM
64GB minimum system RAM (for CPU offloading)
Storage
80GB free space (for Q4_K_M GGUF files)
GPU
NVIDIA A6000 48GB, dual RTX 3090, or Apple M2 Ultra 64GB+
CPU
Any modern multi-core CPU (GPU is the bottleneck for 70B models)
1

Option 1: Ollama (if available)

Ollama has an airoboros model, but the 70B variant may not be available. Check first.

$ ollama run airoboros
2

Option 2: Download GGUF from HuggingFace

Download a quantized GGUF file from TheBloke or similar. Q4_K_M recommended for 48GB GPUs.

$ pip install huggingface-hub && huggingface-cli download TheBloke/Airoboros-L2-70B-GGUF airoboros-l2-70b.Q4_K_M.gguf --local-dir .
3

Option 3: Create custom Ollama model from GGUF

If the 70B is not in Ollama library, create a Modelfile pointing to your GGUF.

$ echo "FROM ./airoboros-l2-70b.Q4_K_M.gguf" > Modelfile && ollama create airoboros-70b -f Modelfile
4

Option 4: Use llama.cpp directly

For maximum control over inference parameters and GPU layer allocation.

$ git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j && ./llama-cli -m ../airoboros-l2-70b.Q4_K_M.gguf -ngl 80 -c 4096 --interactive
5

Option 5: Python with transformers (4-bit)

Load with bitsandbytes for 4-bit quantization in Python.

$ pip install torch transformers accelerate bitsandbytes

Python Integration Example

Terminal
$Basic 4-bit inference with transformers
from transformers import AutoTokenizer, AutoModelForCausalLM import torch model_name = "jondurbin/airoboros-l2-70b-2.2.1" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained( model_name, device_map="auto", torch_dtype=torch.float16, load_in_4bit=True ) prompt = "Write a short story about a robot discovering music." inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = model.generate( **inputs, max_new_tokens=512, temperature=0.8, do_sample=True, top_p=0.9 ) print(tokenizer.decode(outputs[0], skip_special_tokens=True))
$_

Creative Writing & Use Cases

Airoboros models are best known in the community for creative writing and roleplay. Jon Durbin's training data includes a significant proportion of creative and narrative tasks, making Airoboros-70B a popular choice among users who prioritize storytelling quality over benchmark scores. The 70B parameter count provides enough capacity for nuanced, coherent long-form writing.

Creative Writing

  • * Storytelling and narrative generation
  • * Character development and dialogue
  • * Roleplay and interactive fiction
  • * Poetry and creative prose
  • * World-building assistance

Instruction Following

  • * Complex multi-step instructions
  • * Structured output generation
  • * Question answering
  • * Summarization tasks
  • * General assistant tasks

Where It Falls Short

  • * Coding tasks (use CodeLlama or DeepSeek)
  • * Math and reasoning (use WizardMath)
  • * Long documents (only 4K context)
  • * Production/commercial use (license)
  • * Low-VRAM setups (needs 40GB+)

Airoboros Version History

Jon Durbin iteratively improved the Airoboros training data and methodology across multiple versions. Each version refined the self-instruct data pipeline, resulting in better instruction following and fewer training artifacts.

Airoboros 1.4 (June 2023)

Early release with initial self-instruct dataset. Based on Llama 1. Demonstrated the viability of GPT-4-generated instruction data for fine-tuning open models.

Airoboros L2 2.0 (August 2023)

Migrated to Llama 2 base. Expanded training data with more diverse instruction categories. Improved creative writing quality and reduced repetitive patterns.

Airoboros L2 2.1 (September 2023)

Refined data filtering to remove low-quality samples. Better handling of multi-turn conversations. Improved reasoning task performance.

Airoboros L2 2.2.1 (October 2023) -- Latest

Final major release. Best data quality across all versions. Most recommended version for new users. Available on HuggingFace as jondurbin/airoboros-l2-70b-2.2.1.

Airoboros-70B Local Deployment Workflow

Step-by-step workflow: download GGUF, choose quantization, run with Ollama or llama.cpp

1
DownloadInstall Ollama
2
Install ModelOne command
3
Start ChattingInstant AI

Local AI Alternatives (70B Class)

If you are considering running a 70B-class model locally in 2026, here are the strongest alternatives to Airoboros-70B. All of these models can be run locally with appropriate hardware and generally outperform Airoboros on standard benchmarks.

ModelMMLUVRAM (Q4_K_M)ContextOllama CommandLicense
Airoboros L2 70B~64%~40GB4Kollama run airoborosLlama 2
Llama 3.1 70B~79%~40GB128Kollama run llama3.1:70bLlama 3.1
Qwen 2.5 72B~86%~41GB128Kollama run qwen2.5:72bApache 2.0
DeepSeek-V2.5~78%~45GB128Kollama run deepseek-v2.5MIT
Mixtral 8x22B~78%~80GB64Kollama run mixtral:8x22bApache 2.0

MMLU scores approximate from HuggingFace Open LLM Leaderboard. VRAM at Q4_K_M quantization. All models free to download. Airoboros is highlighted for comparison purposes.

Comparative Analysis with Other 70B Models

Local 70B-Class Model Comparison

Airoboros-70B competes with other large locally-runnable models. All models below can be run locally with sufficient hardware. MMLU scores are from HuggingFace Open LLM Leaderboard where available.

ModelSizeRAM RequiredSpeedQualityCost/Month
Airoboros L2 70B70B40GB (Q4)Slow
64%
Free
Llama 2 70B Chat70B40GB (Q4)Slow
63%
Free
Llama 3.1 70B70B40GB (Q4)Slow
79%
Free
Qwen 2.5 72B72B41GB (Q4)Slow
86%
Free
Mixtral 8x22B141B (MoE)80GB (Q4)Medium
78%
Free

Quality column = approximate MMLU score. All models listed are locally runnable with appropriate hardware. RAM column shows approximate VRAM at Q4_K_M quantization.

When to Choose Airoboros-70B

Choose Airoboros-70B For

  • * Creative writing and storytelling
  • * Roleplay and interactive fiction
  • * Exploring self-instruct methodology
  • * Preference for Llama 2 ecosystem

Consider Alternatives For

Key Decision Factors

  • * 40GB+ VRAM requirement
  • * Non-commercial license
  • * Only 4K context window
  • * Older Llama 2 base (2023)
  • * Strong creative niche

Troubleshooting & Common Issues

Out of Memory (OOM) Errors

The most common issue with 70B models. If your GPU runs out of VRAM, try these solutions:

Solutions:

  • * Use a more aggressive quantization (Q4_K_M or Q2_K)
  • * Reduce context length below 4096 (e.g., -c 2048)
  • * Offload some layers to CPU with llama.cpp (-ngl flag, use fewer layers)
  • * Use split mode across multiple GPUs if available
  • * Accept that a single 24GB GPU cannot run 70B models

Slow Inference Speed

70B models are inherently slow. Expect 5-15 tokens/second on good hardware. Some tips to improve speed:

Optimization Tips:

  • * Keep as many layers on GPU as possible (-ngl 80 or higher)
  • * Use Q4_K_M instead of higher quantizations for better speed
  • * Reduce context window if you don't need full 4096 tokens
  • * Use flash attention if your backend supports it
  • * On Apple Silicon, ensure using Metal backend (automatic in llama.cpp)

Ollama Model Not Found

If ollama run airoboros doesn't offer the 70B variant, create a custom model:

Steps:

  • * Download the GGUF file from HuggingFace (TheBloke/Airoboros-L2-70B-GGUF)
  • * Create a Modelfile: echo "FROM ./airoboros-l2-70b.Q4_K_M.gguf" > Modelfile
  • * Build the model: ollama create airoboros-70b -f Modelfile
  • * Run it: ollama run airoboros-70b

2026 Honest Assessment

Should You Use Airoboros-70B in 2026?

Honest answer: probably not for new projects. Airoboros-70B is historically significant as one of the first models to prove that self-instruct with GPT-4 works at 70B scale. Jon Durbin's open pipeline inspired a generation of community fine-tunes. However, the local AI landscape has advanced dramatically since mid-2023, and newer models surpass Airoboros on virtually every metric.

Historical Significance

Airoboros deserves recognition as a pioneer. In mid-2023, it demonstrated that an individual researcher could create a competitive 70B model using synthetic data -- at a time when most assumed you needed massive human-labeled datasets. This insight influenced later projects like Orca, WizardLM, and the broader synthetic data movement. Jon Durbin's published code and datasets remain valuable educational resources for anyone learning about self-instruct fine-tuning.

Why You Should Use Something Else

  • *MMLU 64% vs 86%: Qwen 2.5 72B scores 22 percentage points higher on MMLU with similar VRAM
  • *4K vs 128K context: Modern models offer 32x more context window at the same parameter count
  • *Non-commercial license: Llama 2 Community License restricts commercial use; newer alternatives use Apache 2.0
  • *40GB+ VRAM minimum: Same VRAM requirement as newer, far more capable models
  • *No active development: Last version (2.2.1) released October 2023; no updates expected

When Airoboros Still Makes Sense

  • *Creative writing niche: Some users still prefer Airoboros's storytelling style for roleplay and fiction
  • *Learning self-instruct: Studying Durbin's pipeline is an excellent way to learn synthetic data generation
  • *Existing deployments: If already deployed and working for your specific use case, migration may not be worth it
  • *Research purposes: Comparing self-instruct models across different eras and methodologies

Recommended Upgrade Path

If you are currently using Airoboros-70B, the best upgrade in 2026 is Qwen 2.5 72B(Apache 2.0 license, MMLU ~86%, 128K context, similar VRAM at Q4) or Llama 3.1 70B(more permissive license than Llama 2, MMLU ~79%, 128K context). Both are available through Ollama and require the same hardware as Airoboros-70B.

Resources & Further Reading

Official Airoboros Resources

Local Deployment Tools

  • Ollama

    Easiest way to run models locally (if 70B is available)

  • llama.cpp

    C++ inference engine for GGUF models, best for 70B control

  • vLLM

    High-performance serving with PagedAttention

  • text-generation-webui

    Popular web UI for running local models including Airoboros

  • LM Studio

    Desktop app for running GGUF models with GUI

Community & Learning

Frequently Asked Questions

What is Airoboros-70B and who created it?

Airoboros-70B is a 70-billion parameter language model created by Jon Durbin. It is a fine-tune of Meta's Llama 2 70B, trained using Jon Durbin's novel self-instruct methodology where high-quality instruction data was generated using GPT-4 outputs with careful curation. The model is particularly known for its creative writing and roleplay capabilities.

What are the VRAM requirements for running Airoboros-70B locally?

Airoboros-70B requires significant VRAM. At Q4_K_M quantization (the most common balance of quality and size), you need approximately 40GB VRAM. Q2_K (lowest quality) needs around 28GB, Q5_K_M needs about 48GB, Q8_0 needs approximately 70GB, and full FP16 precision requires around 140GB. Most consumer GPUs cannot run this model -- you typically need an RTX A6000 (48GB), dual RTX 3090s, or an Apple M2 Ultra with 64GB+ unified memory.

How does Airoboros-70B perform on standard benchmarks?

On the HuggingFace Open LLM Leaderboard, Airoboros-70B scores approximately 63-65% on MMLU, 67% on ARC, 86% on HellaSwag, and 56% on TruthfulQA, for an overall average around 68%. These are solid scores for a community fine-tune, though not state-of-the-art compared to newer 70B models.

What license does Airoboros-70B use?

Airoboros-70B uses the Llama 2 Community License, inherited from its Llama 2 70B base model. This license restricts commercial use -- applications with over 700 million monthly active users require a separate license from Meta. It is not a permissive open-source license like MIT or Apache 2.0.

What is the context window of Airoboros-70B?

Airoboros-70B has a 4,096 token context window, inherited from the Llama 2 base model. This is relatively limited compared to newer models (Llama 3.1 supports 128K tokens). The 4K context means you can process roughly 3,000 words of combined input and output per request.

Can I run Airoboros-70B with Ollama?

Ollama has an 'airoboros' model available (run with 'ollama run airoboros'), but the 70B variant may not be directly available through Ollama's library. You can download GGUF quantized files from HuggingFace (e.g., from TheBloke) and create a custom Ollama model using a Modelfile. Alternatively, use llama.cpp directly with GGUF files for more control.

What are the different Airoboros versions?

Jon Durbin released multiple Airoboros versions: 1.4 (early release), 2.0 (improved training data), 2.1 (refined methodology), and 2.2.1 (latest iteration with best data quality). Model names on HuggingFace include airoboros-l2-70b, airoboros-l2-70b-2.1, and airoboros-l2-70b-2.2.1. Later versions generally have improved instruction following and fewer hallucinations.

Was this helpful?

My 77K Dataset Insights Delivered Weekly

Get exclusive access to real dataset optimization strategies and AI model performance tips.

Reading now
Join the discussion

Related Guides

Continue your local AI journey with these comprehensive guides

PR

Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.

✓ 10+ Years in ML/AI✓ 77K Dataset Creator✓ Open Source Contributor
📅 Published: 2023-07-01🔄 Last Updated: March 13, 2026✓ Manually Reviewed
Free Tools & Calculators