Airoboros L2-70B: LLaMA 2 Self-Instruct Fine-Tune

Published: August 15, 2023 | Updated: March 13, 2026

Jon Durbin's GPT-4 self-instruct fine-tune on LLaMA 2 70B. Real benchmarks, VRAM requirements, and honest assessment for local deployment.

At a glance:

- MMLU (Open LLM Leaderboard): 64 (Fair)
- Parameters: 70B (Good)
- Context: 4,096 tokens (Poor)

Technical Specifications Overview

- Parameters: 70 billion (same as LLaMA 2 70B base)
- Context Window: 4,096 tokens
- Architecture: LLaMA 2 transformer (GQA, RoPE)
- Training Method: GPT-4 self-instruct (Airoboros methodology)
- License: LLaMA 2 Community License (commercial use with restrictions for 700M+ MAU)
- Creator: Jon Durbin
- Released: August 2023

L2 vs Original Airoboros: What Changed

The "L2" in Airoboros L2-70B means it is fine-tuned on LLaMA 2 rather than the original LLaMA 1. Jon Durbin released the original Airoboros-70B on LLaMA 1 65B, then upgraded to the LLaMA 2 base when Meta released it in July 2023. The L2 version brings several concrete improvements from the better base model:

| Feature | Airoboros-70B (LLaMA 1) | Airoboros L2-70B (LLaMA 2) |
|---|---|---|
| Base Model | LLaMA 1 65B | LLaMA 2 70B |
| Base Training Data | 1.4 trillion tokens | 2 trillion tokens |
| Context Window | 2,048 tokens | 4,096 tokens |
| Attention | Multi-Head Attention | Grouped-Query Attention (GQA) |
| License | Non-commercial (LLaMA 1) | LLaMA 2 Community (commercial OK under 700M MAU) |
| Fine-Tune Method | GPT-4 self-instruct | GPT-4 self-instruct (same methodology) |

Key takeaway: If you are choosing between Airoboros-70B and Airoboros L2-70B, always pick the L2 version. The LLaMA 2 base is strictly better: more training data, longer context, GQA for faster inference, and a commercial-friendly license.
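The GQA advantage has a direct memory consequence: the KV cache stores only 8 key/value heads instead of one per attention head. A quick back-of-the-envelope comparison, using the published model shapes (80 layers, head dimension 128, fp16 cache) as assumptions:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context: int, dtype_bytes: int = 2) -> int:
    # K and V caches: 2 tensors * layers * tokens * kv_heads * head_dim * bytes
    return 2 * n_layers * context * n_kv_heads * head_dim * dtype_bytes

# LLaMA 1 65B: multi-head attention, 64 KV heads, 2,048-token context
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, context=2048)

# LLaMA 2 70B: grouped-query attention, 8 KV heads, 4,096-token context
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context=4096)

print(f"MHA (65B, 2K ctx): {mha / 1e9:.2f} GB")
print(f"GQA (70B, 4K ctx): {gqa / 1e9:.2f} GB")
```

Even with the context window doubled, the L2 model's KV cache comes out roughly four times smaller, which is a large part of why GQA inference is faster and cheaper on memory.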

Research Background & Training Method

Airoboros uses a self-instruct methodology: Jon Durbin generated synthetic instruction-response pairs using GPT-4, then fine-tuned the LLaMA 2 base model on this data. This approach, inspired by the Self-Instruct paper (Wang et al., 2022), produces models that follow complex instructions well despite relatively small fine-tuning datasets.

The Airoboros training set includes diverse task types: creative writing, coding, math, logic puzzles, roleplay scenarios, and multi-step reasoning. This breadth gives the model flexibility across use cases, though it scores below models fine-tuned with RLHF (like Llama 2 Chat) on safety-oriented benchmarks.
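To make the methodology concrete, here is a toy sketch of a self-instruct loop. The seed tasks and the stubbed `call_teacher_model` are hypothetical stand-ins (the real pipeline calls GPT-4 with Jon Durbin's own seeds, prompts, and filters); only the overall shape, sampling seeds, prompting a teacher model, filtering, and collecting pairs, follows the Self-Instruct approach:

```python
import json
import random

# Hypothetical seed tasks that bootstrap generation
# (not Jon Durbin's actual seed set).
SEED_TASKS = [
    "Write a limerick about debugging.",
    "Explain recursion to a ten-year-old.",
]

def call_teacher_model(prompt: str) -> str:
    """Placeholder for the GPT-4 API call used in self-instruct.
    Echoes a canned pair so this sketch runs offline."""
    return json.dumps({
        "instruction": "Summarize the plot of a mystery novel.",
        "response": "A detective unravels a web of clues...",
    })

def generate_pairs(n: int, seed: int = 0) -> list:
    random.seed(seed)
    pairs = []
    for _ in range(n):
        few_shot = random.sample(SEED_TASKS, k=1)
        prompt = (
            "You are generating training data. Given example tasks:\n"
            + "\n".join(few_shot)
            + "\nProduce one new instruction and an ideal response as JSON."
        )
        pair = json.loads(call_teacher_model(prompt))
        # Basic filtering: drop exact duplicates of the seeds.
        if pair["instruction"] not in SEED_TASKS:
            pairs.append(pair)
    return pairs

dataset = generate_pairs(3)
print(len(dataset))  # 3 synthetic instruction-response pairs
```

The resulting instruction-response pairs become the fine-tuning corpus; the quality of the teacher model and the filtering step largely determine the quality of the tuned model.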

Performance Benchmarks

Benchmark source: Scores below are from the HuggingFace Open LLM Leaderboard (accessed 2024). Airoboros L2-70B was competitive at launch (August 2023) but has been surpassed by newer 70B-class models.

MMLU Comparison (Local 70B Models)

MMLU Score (%):

- Llama 3 70B: 79
- Qwen 2.5 72B: 85
- Llama 2 70B (base): 69
- Airoboros L2-70B: 64

Open LLM Leaderboard Tasks

Airoboros L2-70B Benchmark Scores (%):

- ARC Challenge: 67
- HellaSwag: 85
- MMLU: 64
- TruthfulQA: 54

Strengths & Weaknesses Profile

Performance Metrics

- Instruction Following: 70
- Creative Writing: 78
- Code Generation: 55
- Mathematical Tasks: 50
- Reading Comprehension: 72
- Roleplay / Fiction: 80

Note: Radar values are approximate relative assessments based on community feedback and benchmark data, not absolute scores. Airoboros L2-70B is known in the community for strong creative writing and roleplay capabilities.

VRAM Requirements by Quantization

| Quantization | Model Size | VRAM Required | Compatible Hardware | Quality Impact |
|---|---|---|---|---|
| FP16 | ~140 GB | ~140 GB | 2x A100 80GB, multi-GPU clusters | No loss |
| Q8_0 | ~70 GB | ~72 GB | A100 80GB, 3x RTX 3090/4090 | Minimal loss |
| Q5_K_M | ~48 GB | ~50 GB | 2x RTX 3090/4090, A6000 48GB | Minor loss |
| Q4_K_M (recommended) | ~40 GB | ~42 GB | 2x RTX 3090/4090, A6000 48GB | Acceptable loss |
| Q3_K_M | ~33 GB | ~35 GB | RTX 4090 + partial offload | Noticeable loss |
| Q2_K | ~26 GB | ~28 GB | RTX 4090 24GB + CPU offload | Significant loss |

VRAM estimates include model weights plus KV cache overhead for typical inference. Actual usage depends on context length and batch size. For CPU-only inference (llama.cpp, Ollama CPU mode), the Q4_K_M version needs ~48GB system RAM.
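The table's VRAM figures can be approximated from first principles: quantized weight bytes plus the fp16 KV cache for a full 4,096-token context. A rough estimator follows; the 4.6 effective bits/weight for Q4_K_M and the 1 GB overhead term are assumptions, not exact values:

```python
def estimate_vram_gb(
    params_b: float = 70.0,         # parameters, in billions
    bits_per_weight: float = 4.6,   # effective bits for Q4_K_M (assumption)
    n_layers: int = 80,             # LLaMA 2 70B
    n_kv_heads: int = 8,            # GQA
    head_dim: int = 128,
    context: int = 4096,
    kv_bytes: int = 2,              # fp16 KV cache
    overhead_gb: float = 1.0,       # CUDA context, scratch buffers (rough)
) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8                       # bytes
    kv_cache = 2 * n_layers * context * n_kv_heads * head_dim * kv_bytes # bytes
    return (weights + kv_cache) / 1e9 + overhead_gb

print(f"Q4_K_M estimate: {estimate_vram_gb():.1f} GB")
```

Under these assumptions the estimate lands near the ~42 GB figure in the table; swapping in 16 bits/weight reproduces the ~140 GB FP16 requirement.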

Memory Usage During Inference (Q4_K_M)

[Chart: memory usage over time; y-axis 0-42 GB, x-axis 0-120 s]

Typical VRAM profile for Q4_K_M quantization with 4096-token context on a dual-GPU setup.

Installation & Setup Guide

System Requirements


- Operating System: Windows 10/11, macOS 12+ (Apple Silicon recommended), Ubuntu 20.04+
- RAM: 48GB minimum (CPU mode), 32GB+ system RAM alongside GPU
- Storage: 50GB free space for Q4_K_M quantized model
- GPU: 40GB+ VRAM (2x RTX 3090/4090, A6000 48GB, or A100 80GB)
- CPU: Modern 8+ core CPU. Apple M1 Ultra/M2 Ultra viable for CPU inference.

Recommended: Ollama (Simplest)

```shell
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Airoboros (~40GB download for the default quantization)
ollama run airoboros

# Once loaded, you can start chatting directly:
# >>> Write a short story about a time traveler
# [Model generates creative fiction response...]

# To serve as an API:
ollama serve
# Then query:
curl http://localhost:11434/api/generate -d '{"model":"airoboros","prompt":"Hello"}'
```
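The curl one-liner can also be wrapped in a small Python client. This sketch assumes Ollama's `/api/generate` endpoint with the `"stream"` and `"response"` JSON fields as documented for recent Ollama versions; the guarded main block only produces output against a live local server:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    # stream=False asks Ollama for one JSON object instead of NDJSON chunks
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def parse_response(raw: bytes) -> str:
    # The non-streaming reply carries the full completion in "response"
    return json.loads(raw)["response"]

if __name__ == "__main__":
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request("airoboros", "Hello"),
        headers={"Content-Type": "application/json"},
    )
    try:
        # Requires `ollama serve` running locally; fails fast otherwise.
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(parse_response(resp.read()))
    except OSError as exc:
        print(f"Ollama not reachable: {exc}")
```

For chat-style multi-turn use, Ollama also exposes `/api/chat`, but the generate endpoint shown here matches the curl example above.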

Alternative: llama.cpp

```shell
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

# Download a GGUF from HuggingFace (community quantizations),
# e.g. from TheBloke/airoboros-l2-70B-GGUF

# Run with GPU offloading (adjust -ngl for your VRAM)
./main -m airoboros-l2-70b.Q4_K_M.gguf \
  -ngl 80 \
  -c 4096 \
  -p "Write a detailed analysis of..." \
  --temp 0.7
```

Python (Transformers + bitsandbytes)

```python
# Python inference with 4-bit quantization (requires bitsandbytes)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "jondurbin/airoboros-l2-70b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    load_in_4bit=True,  # Requires bitsandbytes
)

# Airoboros uses a specific prompt format
prompt = """A chat between a curious user and an assistant.
The assistant gives helpful, detailed, and polite answers.
USER: Explain the difference between LLaMA 1 and LLaMA 2.
ASSISTANT:"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Use Cases & Strengths

Airoboros L2-70B is best known in the open-source community for creative writing, roleplay, and fiction. Its GPT-4 self-instruct training gives it a distinctive writing style. For coding or math-heavy tasks, newer models perform significantly better.

Creative Writing (Strong)

- Fiction and storytelling
- Character dialogue
- Worldbuilding
- Roleplay scenarios
- Poetry and prose

Instruction Following (Good)

- Multi-step task completion
- Structured output generation
- Research summarization
- Document drafting
- Q&A with context

Limitations (Weaker Areas)

- Math and reasoning (newer models better)
- Code generation (use CodeLlama or Qwen Coder)
- Factual accuracy (no RLHF safety training)
- Long context (only 4K tokens)
- Multilingual (English-focused)

Local 70B Alternatives (2026)

Airoboros L2-70B was released in August 2023. Since then, several stronger 70B-class models have become available for local deployment. Unless you specifically need Airoboros's creative writing style, consider these newer alternatives:

| Model | MMLU | Context | VRAM (Q4) | Best For | Ollama |
|---|---|---|---|---|---|
| Airoboros L2-70B | ~64% | 4K | ~40 GB | Creative writing, roleplay | ollama run airoboros |
| Llama 3 70B | ~79% | 8K | ~40 GB | General purpose | ollama run llama3:70b |
| Llama 3.1 70B | ~82% | 128K | ~40 GB | Long context, general | ollama run llama3.1:70b |
| Qwen 2.5 72B | ~85% | 128K | ~42 GB | Multilingual, reasoning | ollama run qwen2.5:72b |
| Mixtral 8x22B | ~77% | 64K | ~80 GB | Code, multilingual | ollama run mixtral:8x22b |
| Nemotron 70B | ~83% | 4K | ~40 GB | Instruction following | ollama run nemotron:70b |

Comparative Analysis

Local 70B Models Comparison

All models below run locally. Scores reflect MMLU performance from the Open LLM Leaderboard.

| Model | Size | RAM Required | Speed | Quality (MMLU) | Cost/Month |
|---|---|---|---|---|---|
| Airoboros L2-70B | 70B | ~40GB (Q4) | Medium | 64% | Free |
| Llama 2 70B Chat | 70B | ~40GB (Q4) | Medium | 69% | Free |
| Llama 3 70B | 70B | ~40GB (Q4) | Medium | 79% | Free |
| Qwen 2.5 72B | 72B | ~42GB (Q4) | Medium | 85% | Free |

When to Choose Airoboros L2-70B

Good Choice If...

- You need strong creative writing / fiction output
- You want a model known for roleplay capabilities
- You prefer GPT-4 self-instruct style responses
- You are already familiar with the Airoboros prompt format

Better Alternatives If...

- You need strong reasoning or math (use Qwen 2.5 72B)
- You need code generation (use Qwen 2.5 Coder or CodeLlama)
- You need long context (use Llama 3.1 70B with 128K)
- You need the best general-purpose 70B (use Llama 3 70B or Qwen 2.5 72B)
- You need multilingual support (use Qwen 2.5 72B)

Troubleshooting & Common Issues

Out of Memory (OOM) Errors

70B models are demanding. If you hit OOM errors, try these steps:

- Use a smaller quantization: switch from Q5_K_M to Q4_K_M or Q3_K_M
- Reduce context length: set -c 2048 instead of 4096
- Offload layers to CPU: use -ngl 40 (fewer GPU layers) in llama.cpp
- Close other GPU applications before loading the model
- Consider a smaller model: Llama 3 8B fits in 8GB VRAM and outperforms Airoboros L2-70B on many benchmarks
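The -ngl tuning step can be turned into a rough formula: divide usable VRAM by the approximate per-layer weight size. The 3 GB reserve for KV cache and CUDA overhead is an assumption; adjust it for your setup:

```python
def suggest_ngl(vram_gb: float, model_gb: float = 40.0,
                n_layers: int = 80, reserve_gb: float = 3.0) -> int:
    """Rough layers-on-GPU estimate for llama.cpp's -ngl flag.
    Assumes weights are spread evenly across layers."""
    per_layer_gb = model_gb / n_layers          # ~0.5 GB/layer for Q4_K_M 70B
    usable = max(vram_gb - reserve_gb, 0.0)     # leave room for KV cache etc.
    return min(n_layers, int(usable / per_layer_gb))

print(suggest_ngl(24))  # single RTX 4090 -> 42
print(suggest_ngl(48))  # dual 3090/4090 or A6000 -> 80 (all layers)
```

If the suggested value still OOMs in practice, lower it in steps of 4-8; embedding and output layers are not perfectly uniform in size.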

Slow Generation Speed

70B models are inherently slower than smaller models. Typical speeds:

- GPU (2x RTX 4090, Q4_K_M): ~10-15 tokens/second
- GPU (A100 80GB, Q4_K_M): ~20-30 tokens/second
- CPU only (64GB RAM): ~1-3 tokens/second
- Apple M2 Ultra: ~5-8 tokens/second

If speed is critical, consider Llama 3 8B or Mistral 7B Instruct which run at 50-100+ tokens/second on a single GPU.
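These speeds follow from token generation being memory-bandwidth-bound: each generated token streams essentially the whole quantized model through the memory bus. A rough estimator; the 0.5 efficiency factor is an assumption covering kernel overhead and cross-GPU traffic:

```python
def rough_tokens_per_sec(bandwidth_gbps: float, model_gb: float = 40.0,
                         efficiency: float = 0.5) -> float:
    """Decode-speed ceiling: tokens/s ~ memory bandwidth / model size.
    efficiency discounts the theoretical bound (assumption)."""
    return bandwidth_gbps / model_gb * efficiency

print(round(rough_tokens_per_sec(1008), 1))  # RTX 4090-class bandwidth -> 12.6
print(round(rough_tokens_per_sec(2039), 1))  # A100 80GB HBM2e -> 25.5
```

The outputs land inside the 10-15 t/s and 20-30 t/s ranges quoted above, which is why smaller models (fewer gigabytes to stream per token) are so much faster on the same hardware.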

Wrong Prompt Format

Airoboros uses a specific chat template. Using the wrong format degrades quality:

Correct Airoboros prompt format:

```
A chat between a curious user and an assistant.
The assistant gives helpful, detailed, and polite answers.
USER: [your question here]
ASSISTANT:
```

Ollama handles this automatically. If using llama.cpp or Transformers directly, make sure to use this format.
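If you build prompts in your own code, a tiny helper keeps the template consistent. This simply renders the USER:/ASSISTANT: layout shown above; the function and constant names are illustrative:

```python
SYSTEM = ("A chat between a curious user and an assistant.\n"
          "The assistant gives helpful, detailed, and polite answers.")

def airoboros_prompt(user_message: str, system: str = SYSTEM) -> str:
    """Render the Airoboros USER:/ASSISTANT: chat template.
    Generation should start immediately after the trailing 'ASSISTANT:'."""
    return f"{system}\nUSER: {user_message}\nASSISTANT:"

print(airoboros_prompt("Explain grouped-query attention."))
```

Pass the returned string straight to the tokenizer or to llama.cpp's -p flag; do not append a space or newline after "ASSISTANT:".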

Frequently Asked Questions

What is the difference between Airoboros L2-70B and Airoboros-70B?

Airoboros L2-70B is fine-tuned on LLaMA 2 70B (the 'L2' stands for LLaMA 2), while the original Airoboros-70B was fine-tuned on the first LLaMA 1 65B. The L2 version benefits from LLaMA 2's improved base training (2 trillion tokens vs 1.4 trillion), doubled context window (4096 vs 2048 tokens), and a more permissive commercial license. Both use Jon Durbin's GPT-4 self-instruct training methodology.

What are the hardware requirements for running Airoboros L2-70B locally?

At full FP16 precision, Airoboros L2-70B requires ~140GB VRAM — impractical for most users. With Q4_K_M quantization (recommended), it needs ~40GB VRAM, fitting on dual RTX 3090/4090 or a single A100 80GB. For CPU-only inference, you need at least 48GB system RAM for the Q4 quantized version. An NVMe SSD is strongly recommended for model loading speed.

How does Airoboros L2-70B perform on benchmarks?

On the HuggingFace Open LLM Leaderboard, Airoboros L2-70B scores approximately 64-66% on MMLU (Massive Multitask Language Understanding). It performs well on creative writing and instruction-following tasks due to its GPT-4 self-instruct training, but scores below newer 70B models like Llama 3 70B (~79% MMLU). It remains a good option for creative and roleplay use cases.

Can I run Airoboros L2-70B with Ollama?

Yes. Run 'ollama run airoboros' to pull and run the model. Ollama handles quantization automatically. The default GGUF quantization typically uses Q4_K_M, requiring approximately 40GB VRAM. For CPU-only mode, ensure you have at least 48GB system RAM. Performance will be slower on CPU compared to GPU inference.

Is Airoboros L2-70B still worth using in 2026?

Airoboros L2-70B (released August 2023) has been surpassed by newer models in raw benchmark performance. Llama 3 70B, Qwen 2.5 72B, and Mixtral 8x22B all score significantly higher on MMLU and reasoning benchmarks. However, Airoboros L2-70B retains a following for creative writing and roleplay tasks where its GPT-4 self-instruct training produces distinctive output. For general-purpose use, newer models are recommended.

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
