Airoboros L2-70B: LLaMA 2 Self-Instruct Fine-Tune

Published: August 15, 2023 | Updated: March 13, 2026

Jon Durbin's GPT-4 self-instruct fine-tune on LLaMA 2 70B. Real benchmarks, VRAM requirements, and honest assessment for local deployment.

MMLU (Open LLM Leaderboard): 64 (Fair)
Parameters: 70B (Good)
Context: 4096 tokens (Poor)

Technical Specifications Overview

  • Parameters: 70 billion (same as LLaMA 2 70B base)
  • Context Window: 4,096 tokens
  • Architecture: LLaMA 2 transformer (GQA, RoPE)
  • Training Method: GPT-4 self-instruct (Airoboros methodology)
  • License: LLaMA 2 Community License (commercial use with restrictions for 700M+ MAU)
  • Creator: Jon Durbin
  • Released: August 2023

L2 vs Original Airoboros: What Changed

The "L2" in Airoboros L2-70B means it is fine-tuned on LLaMA 2 rather than the original LLaMA 1. Jon Durbin released the original Airoboros-70B on LLaMA 1 65B, then upgraded to the LLaMA 2 base when Meta released it in July 2023. The L2 version brings several concrete improvements from the better base model:

| Feature | Airoboros-70B (LLaMA 1) | Airoboros L2-70B (LLaMA 2) |
| --- | --- | --- |
| Base Model | LLaMA 1 65B | LLaMA 2 70B |
| Base Training Data | 1.4 trillion tokens | 2 trillion tokens |
| Context Window | 2,048 tokens | 4,096 tokens |
| Attention | Multi-Head Attention | Grouped-Query Attention (GQA) |
| License | Non-commercial (LLaMA 1) | LLaMA 2 Community (commercial OK under 700M MAU) |
| Fine-Tune Method | GPT-4 self-instruct | GPT-4 self-instruct (same methodology) |

Key takeaway: If you are choosing between Airoboros-70B and Airoboros L2-70B, always pick the L2 version. The LLaMA 2 base is strictly better: more training data, longer context, GQA for faster inference, and a commercial-friendly license.
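One way to see why GQA matters for inference is to estimate the key/value cache, which grows with the number of key/value heads. The sketch below is a rough calculation, not a measurement. The architecture constants (80 layers, 64 query heads, 8 key/value heads, head dimension 128) are taken from the published LLaMA 2 70B configuration rather than from this guide, and the "MHA" figure is a hypothetical version of the same model without GQA.

```python
# Llama 2 70B architecture values (assumed from the published model config).
N_LAYERS = 80
N_HEADS = 64       # query heads
N_KV_HEADS = 8     # key/value heads under GQA
HEAD_DIM = 128     # 8192 hidden size / 64 heads


def kv_cache_bytes(n_tokens: int, n_kv_heads: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence (FP16 by default)."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * N_LAYERS * n_kv_heads * HEAD_DIM * bytes_per_value
    return n_tokens * per_token


gqa = kv_cache_bytes(4096, N_KV_HEADS)
mha = kv_cache_bytes(4096, N_HEADS)  # hypothetical: same model without GQA
print(f"GQA: {gqa / 2**30:.2f} GiB, MHA: {mha / 2**30:.2f} GiB")
```

Under these assumptions a full 4,096-token context costs about 1.25 GiB of cache with GQA, against about 10 GiB if every query head had its own key/value head.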

Research Background & Training Method

Airoboros uses a self-instruct methodology: Jon Durbin generated synthetic instruction-response pairs using GPT-4, then fine-tuned the LLaMA 2 base model on this data. This approach, inspired by the Self-Instruct paper (Wang et al., 2022), produces models that follow complex instructions well despite relatively small fine-tuning datasets.

The Airoboros training set includes diverse task types: creative writing, coding, math, logic puzzles, roleplay scenarios, and multi-step reasoning. This breadth gives the model flexibility across use cases, though it scores below models fine-tuned with RLHF (like Llama 2 Chat) on safety-oriented benchmarks.

Performance Benchmarks

Benchmark source: Scores below are from the HuggingFace Open LLM Leaderboard (accessed 2024). Airoboros L2-70B was competitive at launch (August 2023) but has been surpassed by newer 70B-class models.

MMLU Comparison (Local 70B Models)

| Model | MMLU Score (%) |
| --- | --- |
| Llama 3 70B | 79 |
| Qwen 2.5 72B | 85 |
| Llama 2 70B (base) | 69 |
| Airoboros L2-70B | 64 |

Open LLM Leaderboard Tasks

Airoboros L2-70B Benchmark Scores (%)

| Benchmark | Score (%) |
| --- | --- |
| ARC Challenge | 67 |
| HellaSwag | 85 |
| MMLU | 64 |
| TruthfulQA | 54 |

Strengths & Weaknesses Profile

Performance Metrics

| Capability | Relative score |
| --- | --- |
| Instruction Following | 70 |
| Creative Writing | 78 |
| Code Generation | 55 |
| Mathematical Tasks | 50 |
| Reading Comprehension | 72 |
| Roleplay / Fiction | 80 |

Note: Radar values are approximate relative assessments based on community feedback and benchmark data, not absolute scores. Airoboros L2-70B is known in the community for strong creative writing and roleplay capabilities.

VRAM Requirements by Quantization

| Quantization | Model Size | VRAM Required | Compatible Hardware | Quality Impact |
| --- | --- | --- | --- | --- |
| FP16 | ~140 GB | ~140 GB | 2x A100 80GB, multi-GPU clusters | No loss |
| Q8_0 | ~70 GB | ~72 GB | A100 80GB, 3x RTX 3090/4090 | Minimal loss |
| Q5_K_M | ~48 GB | ~50 GB | 2x RTX 3090/4090 or A6000 48GB (48 GB total, so expect partial CPU offload) | Minor loss |
| Q4_K_M (recommended) | ~40 GB | ~42 GB | 2x RTX 3090/4090, A6000 48GB | Acceptable loss |
| Q3_K_M | ~33 GB | ~35 GB | RTX 4090 + partial offload | Noticeable loss |
| Q2_K | ~26 GB | ~28 GB | RTX 4090 24GB + CPU offload | Significant loss |

VRAM estimates include model weights plus KV cache overhead for typical inference. Actual usage depends on context length and batch size. For CPU-only inference (llama.cpp, Ollama CPU mode), the Q4_K_M version needs ~48GB system RAM.
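The VRAM column above can be turned into a quick lookup for a given set of GPUs. This is a minimal sketch that uses only the approximate figures from the table. It assumes the model can be split evenly across cards and treats total VRAM as one pool, which real multi-GPU setups only approximate. The function name `best_quant_for` is made up for this example.

```python
from typing import List, Optional

# (quantization, approx. VRAM in GB) copied from the table, best quality first.
QUANT_VRAM_GB = [
    ("FP16", 140),
    ("Q8_0", 72),
    ("Q5_K_M", 50),
    ("Q4_K_M", 42),
    ("Q3_K_M", 35),
    ("Q2_K", 28),
]


def best_quant_for(gpu_vram_gb: List[float]) -> Optional[str]:
    """Return the highest-quality quantization whose VRAM estimate fits."""
    total = sum(gpu_vram_gb)
    for name, required in QUANT_VRAM_GB:
        if required <= total:
            return name
    return None  # nothing fits fully in VRAM; CPU offload would be needed


print(best_quant_for([24, 24]))  # two 24 GB cards
print(best_quant_for([24]))      # one 24 GB card
```

With two 24 GB cards the lookup lands on Q4_K_M. With a single 24 GB card it returns `None`, which matches the table's note that even Q2_K needs CPU offload on that hardware.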

Memory Usage During Inference (Q4_K_M)

[Figure: memory usage over time, y-axis 0 to 42 GB, x-axis 0 to 120 s. Typical VRAM profile for Q4_K_M quantization with 4096-token context on a dual-GPU setup.]

Installation & Setup Guide

System Requirements

  • Operating System: Windows 10/11, macOS 12+ (Apple Silicon recommended), Ubuntu 20.04+
  • RAM: 48GB minimum (CPU mode), 32GB+ system RAM alongside GPU
  • Storage: 50GB free space for Q4_K_M quantized model
  • GPU: 40GB+ VRAM: 2x RTX 3090/4090, A6000 48GB, or A100 80GB
  • CPU: Modern 8+ core CPU. Apple M1 Ultra/M2 Ultra viable for CPU inference.

Recommended: Ollama (Simplest)

# Install Ollama and run Airoboros L2-70B

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Airoboros
ollama run airoboros

# The model will download (~40GB for default quantization)
# Once loaded, you can start chatting directly
>>> Write a short story about a time traveler
[Model generates creative fiction response...]

# To serve as an API:
# ollama serve
# Then query:
curl http://localhost:11434/api/generate -d '{"model":"airoboros","prompt":"Hello"}'
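The same API call can be made from Python with only the standard library. The sketch below assumes an Ollama server is already listening on the default port and that the model tag is `airoboros`, as used in this guide. It sets `"stream": false` so the server returns a single JSON object whose `response` field holds the generated text. The helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(prompt: str, model: str = "airoboros") -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(prompt: str, model: str = "airoboros", timeout: float = 600.0) -> str:
    """Send one prompt to a running Ollama server and return the text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Requires a running Ollama server with the model pulled:
# print(generate("Write a two-line poem about local inference."))
```

The long default timeout is deliberate, since a 70B model can take minutes to load and answer on modest hardware.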

Alternative: llama.cpp

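# Run with llama.cpp for more control

# Clone and build llama.cpp
# Recent releases build with CMake; drop -DGGML_CUDA=ON on macOS, where Metal is used by default
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Download GGUF from HuggingFace (community quantizations)
# e.g., from TheBloke/airoboros-l2-70B-GGUF

# Run with GPU offloading (adjust -ngl for your VRAM)
# Older checkouts were built with make and named this binary ./main
./build/bin/llama-cli -m airoboros-l2-70b.Q4_K_M.gguf \
  -ngl 80 \
  -c 4096 \
  -p "Write a detailed analysis of..." \
  --temp 0.7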

Python (Transformers + bitsandbytes)

# Python inference with 4-bit quantization (requires bitsandbytes)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Repo ID as given in this guide; check the exact versioned name on HuggingFace
model_name = "jondurbin/airoboros-l2-70b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    # Passing load_in_4bit directly is deprecated; use a quantization config
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

# Airoboros uses a specific prompt format
prompt = """A chat between a curious user and an assistant.
The assistant gives helpful, detailed, and polite answers.
USER: Explain the difference between LLaMA 1 and LLaMA 2.
ASSISTANT:"""

# With device_map="auto", send inputs to the device that holds the first shard
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
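The Transformers example asks for up to 512 new tokens, and the prompt and completion share the same 4,096-token window. A small guard like the one below can clamp the request before calling `generate`. It is only a sketch: the function name is made up for this example, and in practice the prompt token count would come from the tokenizer (for instance, the length of `inputs["input_ids"][0]`).

```python
CONTEXT_WINDOW = 4096  # tokens, per the specifications in this guide


def max_new_tokens_allowed(prompt_tokens: int, requested: int) -> int:
    """Clamp a requested completion length so prompt + output fits the window."""
    if prompt_tokens >= CONTEXT_WINDOW:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, CONTEXT_WINDOW - prompt_tokens)


print(max_new_tokens_allowed(prompt_tokens=3800, requested=512))
```

For a 3,800-token prompt this leaves room for 296 new tokens rather than the 512 requested.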

Use Cases & Strengths

Airoboros L2-70B is best known in the open-source community for creative writing, roleplay, and fiction. Its GPT-4 self-instruct training gives it a distinctive writing style. For coding or math-heavy tasks, newer models perform significantly better.

Creative Writing (Strong)

  • Fiction and storytelling
  • Character dialogue
  • Worldbuilding
  • Roleplay scenarios
  • Poetry and prose

Instruction Following (Good)

  • Multi-step task completion
  • Structured output generation
  • Research summarization
  • Document drafting
  • Q&A with context

Limitations (Weaker Areas)

  • Math and reasoning (newer models better)
  • Code generation (use CodeLlama or Qwen Coder)
  • Factual accuracy (no RLHF safety training)
  • Long context (only 4K tokens)
  • Multilingual (English-focused)

Local 70B Alternatives (2026)

Airoboros L2-70B was released in August 2023. Since then, several stronger 70B-class models have become available for local deployment. Unless you specifically need Airoboros's creative writing style, consider these newer alternatives:

| Model | MMLU | Context | VRAM (Q4) | Best For | Ollama Command |
| --- | --- | --- | --- | --- | --- |
| Airoboros L2-70B | ~64% | 4K | ~40 GB | Creative writing, roleplay | ollama run airoboros |
| Llama 3 70B | ~79% | 8K | ~40 GB | General purpose | ollama run llama3:70b |
| Llama 3.1 70B | ~82% | 128K | ~40 GB | Long context, general | ollama run llama3.1:70b |
| Qwen 2.5 72B | ~85% | 128K | ~42 GB | Multilingual, reasoning | ollama run qwen2.5:72b |
| Mixtral 8x22B | ~77% | 64K | ~80 GB | Code, multilingual | ollama run mixtral:8x22b |
| Nemotron 70B | ~83% | 4K | ~40 GB | Instruction following | ollama run nemotron:70b |

Comparative Analysis

Local 70B Models Comparison

All models below run locally. Scores reflect MMLU performance from the Open LLM Leaderboard.

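| Model | Size | RAM Required | Speed | Quality (MMLU) | Cost/Month |
| --- | --- | --- | --- | --- | --- |
| Airoboros L2-70B | 70B | ~40GB (Q4) | Medium | 64% | Free |
| Llama 2 70B Chat | 70B | ~40GB (Q4) | Medium | 69% | Free |
| Llama 3 70B | 70B | ~40GB (Q4) | Medium | 79% | Free |
| Qwen 2.5 72B | 72B | ~42GB (Q4) | Medium | 85% | Free |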

When to Choose Airoboros L2-70B

Good Choice If...

  • You need strong creative writing / fiction output
  • You want a model known for roleplay capabilities
  • You prefer GPT-4 self-instruct style responses
  • You are already familiar with the Airoboros prompt format

Better Alternatives If...

  • You need strong reasoning or math (use Qwen 2.5 72B)
  • You need code generation (use Qwen 2.5 Coder or CodeLlama)
  • You need long context (use Llama 3.1 70B with 128K)
  • You need the best general-purpose 70B (use Llama 3 70B or Qwen 2.5 72B)
  • You need multilingual support (use Qwen 2.5 72B)

Troubleshooting & Common Issues

Out of Memory (OOM) Errors

70B models are demanding. If you hit OOM errors, try these steps:

  • Use a smaller quantization: switch from Q5_K_M to Q4_K_M or Q3_K_M
  • Reduce context length: set -c 2048 instead of 4096
  • Offload layers to CPU: use -ngl 40 (fewer GPU layers) in llama.cpp
  • Close other GPU applications before loading the model
  • Consider a smaller model: Llama 3 8B fits in 8GB VRAM and outperforms Airoboros L2-70B on many benchmarks
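To pick a starting value for -ngl instead of guessing, the layer count can be estimated from the VRAM budget. This is a rough sketch under stated assumptions: 80 transformer blocks (from the LLaMA 2 70B configuration, matching the -ngl 80 used earlier), equal-sized layers, a ~40 GB Q4_K_M file, and a 2 GB reserve that mirrors the gap between model size and VRAM in the quantization table. The function name is made up for this example, and the result is a first guess to tune, not a guarantee.

```python
N_LAYERS = 80  # transformer blocks in Llama 2 70B (assumed from the model config)


def layers_that_fit(vram_gb: float, model_gb: float = 40.0, reserve_gb: float = 2.0) -> int:
    """Rough number of layers to pass to -ngl for a given VRAM budget."""
    per_layer_gb = model_gb / N_LAYERS        # assumes equal-sized layers
    usable = max(vram_gb - reserve_gb, 0.0)   # keep room for KV cache and buffers
    return min(N_LAYERS, int(usable / per_layer_gb))


print(layers_that_fit(24))  # single 24 GB card
print(layers_that_fit(48))  # two 24 GB cards
```

Under these assumptions a single 24 GB card holds about 44 layers, close to the -ngl 40 suggested above, while 48 GB of VRAM holds all 80.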

Slow Generation Speed

70B models are inherently slower than smaller models. Typical speeds:

  • GPU (2x RTX 4090, Q4_K_M): ~10-15 tokens/second
  • GPU (A100 80GB, Q4_K_M): ~20-30 tokens/second
  • CPU only (64GB RAM): ~1-3 tokens/second
  • Apple M2 Ultra: ~5-8 tokens/second

If speed is critical, consider Llama 3 8B or Mistral 7B Instruct which run at 50-100+ tokens/second on a single GPU.
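To translate those token rates into waiting time, divide the length of the reply by the rate. The sketch below does that for a 512-token reply using the ranges listed above. It covers decoding only and ignores prompt processing and model load time, so real waits will be somewhat longer.

```python
# Token rates copied from the list above: (low, high) tokens per second.
SPEEDS = {
    "2x RTX 4090, Q4_K_M": (10, 15),
    "A100 80GB, Q4_K_M": (20, 30),
    "CPU only (64GB RAM)": (1, 3),
    "Apple M2 Ultra": (5, 8),
}


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Decode time only; prompt processing and model loading are ignored."""
    return n_tokens / tokens_per_second


for setup, (low, high) in SPEEDS.items():
    fastest = seconds_to_generate(512, high)
    slowest = seconds_to_generate(512, low)
    print(f"{setup}: {fastest:.0f} to {slowest:.0f} s for 512 tokens")
```

On the dual RTX 4090 figures this works out to roughly 34 to 51 seconds per 512-token reply, and several minutes on CPU only.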

Wrong Prompt Format

Airoboros uses a specific chat template. Using the wrong format degrades quality:

# Correct Airoboros prompt format:
A chat between a curious user and an assistant.
The assistant gives helpful, detailed, and polite answers.
USER: [your question here]
ASSISTANT:

Ollama handles this automatically. If using llama.cpp or Transformers directly, make sure to use this format.
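When calling llama.cpp or Transformers directly, a small helper keeps the template consistent across requests. This is a minimal sketch of the format shown above, not an official Airoboros utility. The function name and the `sep` parameter are assumptions for this example: the template is displayed on separate lines here, so the default separator is a newline, but check the model card for the release you downloaded in case it expects single spaces instead.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an assistant.\n"
    "The assistant gives helpful, detailed, and polite answers."
)


def build_prompt(user_message: str, system: str = SYSTEM_PROMPT, sep: str = "\n") -> str:
    """Wrap a user message in the USER/ASSISTANT template shown above."""
    return f"{system}{sep}USER: {user_message.strip()}{sep}ASSISTANT:"


print(build_prompt("Explain the difference between LLaMA 1 and LLaMA 2."))
```

The returned string ends with `ASSISTANT:` so the model continues with its answer. Pass it as the `-p` argument in llama.cpp or as `prompt` in the Transformers example.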

Frequently Asked Questions

What is the difference between Airoboros L2-70B and Airoboros-70B?

Airoboros L2-70B is fine-tuned on LLaMA 2 70B (the 'L2' stands for LLaMA 2), while the original Airoboros-70B was fine-tuned on LLaMA 1 65B. The L2 version benefits from LLaMA 2's larger base training set (2 trillion tokens vs 1.4 trillion), doubled context window (4096 vs 2048 tokens), and a more permissive commercial license. Both use Jon Durbin's GPT-4 self-instruct training methodology.

What are the hardware requirements for running Airoboros L2-70B locally?

At full FP16 precision, Airoboros L2-70B requires ~140GB VRAM, which is impractical for most users. With Q4_K_M quantization (recommended), the weights take ~40GB and inference needs ~42GB VRAM once the KV cache is included, fitting on dual RTX 3090/4090 or a single A100 80GB. For CPU-only inference, you need at least 48GB system RAM for the Q4 quantized version. An NVMe SSD is strongly recommended for model loading speed.

How does Airoboros L2-70B perform on benchmarks?

On the HuggingFace Open LLM Leaderboard, Airoboros L2-70B scores approximately 64-66% on MMLU (Massive Multitask Language Understanding). It performs well on creative writing and instruction-following tasks due to its GPT-4 self-instruct training, but scores below newer 70B models like Llama 3 70B (~79% MMLU). It remains a good option for creative and roleplay use cases.

Can I run Airoboros L2-70B with Ollama?

Yes. Run 'ollama run airoboros' to pull and run the model. Ollama downloads a pre-quantized GGUF build, so there is no quantization step to run yourself. The default tag is typically a 4-bit build such as Q4_K_M, requiring approximately 40GB VRAM. For CPU-only mode, ensure you have at least 48GB system RAM. Performance will be slower on CPU compared to GPU inference.

Is Airoboros L2-70B still worth using in 2026?

Airoboros L2-70B (released August 2023) has been surpassed by newer models in raw benchmark performance. Llama 3 70B, Qwen 2.5 72B, and Mixtral 8x22B all score significantly higher on MMLU and reasoning benchmarks. However, Airoboros L2-70B retains a following for creative writing and roleplay tasks where its GPT-4 self-instruct training produces distinctive output. For general-purpose use, newer models are recommended.

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

Published: 2023-08-15 | Last Updated: 2026-03-16 | Manually Reviewed