Airoboros L2-70B: LLaMA 2 Self-Instruct Fine-Tune
Published: August 15, 2023 | Updated: March 13, 2026
Jon Durbin's GPT-4 self-instruct fine-tune on LLaMA 2 70B. Real benchmarks, VRAM requirements, and honest assessment for local deployment.
Technical Specifications Overview
L2 vs Original Airoboros: What Changed
The "L2" in Airoboros L2-70B means it is fine-tuned on LLaMA 2 rather than the original LLaMA 1. Jon Durbin released the original Airoboros-70B on LLaMA 1 65B, then upgraded to the LLaMA 2 base when Meta released it in July 2023. The L2 version brings several concrete improvements from the better base model:
| Feature | Airoboros-70B (LLaMA 1) | Airoboros L2-70B (LLaMA 2) |
|---|---|---|
| Base Model | LLaMA 1 65B | LLaMA 2 70B |
| Base Training Data | 1.4 trillion tokens | 2 trillion tokens |
| Context Window | 2,048 tokens | 4,096 tokens |
| Attention | Multi-Head Attention | Grouped-Query Attention (GQA) |
| License | Non-commercial (LLaMA 1) | LLaMA 2 Community (commercial OK under 700M MAU) |
| Fine-Tune Method | GPT-4 self-instruct | GPT-4 self-instruct (same methodology) |
Key takeaway: If you are choosing between Airoboros-70B and Airoboros L2-70B, always pick the L2 version. The LLaMA 2 base is strictly better: more training data, longer context, GQA for faster inference, and a commercial-friendly license.
Research Background & Training Method
Airoboros uses a self-instruct methodology: Jon Durbin generated synthetic instruction-response pairs using GPT-4, then fine-tuned the LLaMA 2 base model on this data. This approach, inspired by the Self-Instruct paper (Wang et al., 2022), produces models that follow complex instructions well despite relatively small fine-tuning datasets.
The Airoboros training set includes diverse task types: creative writing, coding, math, logic puzzles, roleplay scenarios, and multi-step reasoning. This breadth gives the model flexibility across use cases, though it scores below models fine-tuned with RLHF (like Llama 2 Chat) on safety-oriented benchmarks.
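The core self-instruct loop can be sketched in a few lines. This is a minimal illustration, not the Airoboros pipeline itself: the `teacher` callable stands in for GPT-4 API calls, and the real training code uses far more elaborate prompting, deduplication, and quality filtering. All names here (`self_instruct_round`, `teacher`) are hypothetical.

```python
import random

def self_instruct_round(seed_tasks, teacher, n_new=3):
    """One synthetic-data round: show the teacher a few seed instructions,
    ask it for a novel instruction plus a reference response, and keep
    pairs that pass a trivial novelty/non-empty filter."""
    pairs = []
    for _ in range(n_new):
        shots = random.sample(seed_tasks, k=min(2, len(seed_tasks)))
        prompt = ("Write one new instruction unlike these, then answer it:\n"
                  + "\n".join(shots))
        instruction, response = teacher(prompt)
        if instruction not in seed_tasks and response.strip():
            pairs.append({"instruction": instruction, "response": response})
            seed_tasks.append(instruction)  # grow the pool for later rounds
    return pairs
```

Fine-tuning then runs standard supervised training over the accumulated `{"instruction", "response"}` pairs.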
Sources & References
- Airoboros L2-70B on Hugging Face — Official model card
- Airoboros GitHub Repository — Training code and methodology
- Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023) — Base model paper
- Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022) — Training methodology foundation
Performance Benchmarks
Benchmark source: Scores below are from the HuggingFace Open LLM Leaderboard (accessed 2024). Airoboros L2-70B was competitive at launch (August 2023) but has been surpassed by newer 70B-class models.
MMLU Comparison (Local 70B Models)
[Charts: MMLU scores (%) for local 70B models; Airoboros L2-70B scores (%) across Open LLM Leaderboard tasks]
Strengths & Weaknesses Profile
[Radar chart: relative strengths across task categories]
Note: Radar values are approximate relative assessments based on community feedback and benchmark data, not absolute scores. Airoboros L2-70B is known in the community for strong creative writing and roleplay capabilities.
VRAM Requirements by Quantization
| Quantization | Model Size | VRAM Required | Compatible Hardware | Quality Impact |
|---|---|---|---|---|
| FP16 | ~140 GB | ~140 GB | 2x A100 80GB, multi-GPU clusters | No loss |
| Q8_0 | ~70 GB | ~72 GB | A100 80GB, 3x RTX 3090/4090 | Minimal loss |
| Q5_K_M | ~48 GB | ~50 GB | 2x RTX 3090/4090, A6000 48GB | Minor loss |
| Q4_K_M (recommended) | ~40 GB | ~42 GB | 2x RTX 3090/4090, A6000 48GB | Acceptable loss |
| Q3_K_M | ~33 GB | ~35 GB | RTX 4090 + partial offload | Noticeable loss |
| Q2_K | ~26 GB | ~28 GB | RTX 4090 24GB + CPU offload | Significant loss |
VRAM estimates include model weights plus KV cache overhead for typical inference. Actual usage depends on context length and batch size. For CPU-only inference (llama.cpp, Ollama CPU mode), the Q4_K_M version needs ~48GB system RAM.
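The figures in the table follow from simple arithmetic: weight storage is parameters × bits-per-weight, and the FP16 KV cache follows from LLaMA 2 70B's published architecture (80 layers, grouped-query attention with 8 KV heads, head dimension 128). A rough sketch, treating Q4_K_M as ~4.8 bits/weight (an approximation; actual GGUF files vary slightly):

```python
def weights_gb(n_params, bits_per_weight):
    """Back-of-envelope weight storage: params x bits / 8, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(context_len, n_layers=80, n_kv_heads=8,
                head_dim=128, bytes_per_elt=2):
    """FP16 KV cache for LLaMA 2 70B's GQA layout: K and V entries
    per token, per layer, across the KV heads."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * context_len / 1e9

print(f"FP16 weights:   {weights_gb(70e9, 16):.0f} GB")   # ~140 GB
print(f"Q4_K_M weights: {weights_gb(70e9, 4.8):.0f} GB")  # ~42 GB
print(f"KV cache @4096: {kv_cache_gb(4096):.2f} GB")      # ~1.3 GB
```

The KV cache is small here precisely because of GQA: with full multi-head attention (64 KV heads, as in LLaMA 1) it would be eight times larger.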
Memory Usage During Inference (Q4_K_M)
Typical VRAM profile for Q4_K_M quantization with 4096-token context on a dual-GPU setup.
Installation & Setup Guide
System Requirements
At Q4_K_M you need ~40GB of VRAM (e.g. dual RTX 3090/4090 or a single A6000 48GB) or, for CPU-only inference, at least 48GB of system RAM. An NVMe SSD is strongly recommended for model loading.
Recommended: Ollama (Simplest)
Install Ollama, then pull and start the model with `ollama run airoboros`. Ollama downloads a Q4-class GGUF by default and applies the correct prompt template automatically; budget ~40GB VRAM for GPU inference or ~48GB system RAM for CPU-only mode.
Alternative: llama.cpp
Download a GGUF quantization from Hugging Face and run it with llama.cpp, setting the context length with `-c` (e.g. `-c 4096`) and the number of GPU-offloaded layers with `-ngl`. See the troubleshooting section below for memory-saving flag values.
Python (Transformers + bitsandbytes)
For programmatic use, load the model through Hugging Face Transformers with a 4-bit bitsandbytes quantization config (`load_in_4bit`); this keeps VRAM needs in the same ~40GB range as the Q4 GGUF files.
Use Cases & Strengths
Airoboros L2-70B is best known in the open-source community for creative writing, roleplay, and fiction. Its GPT-4 self-instruct training gives it a distinctive writing style. For coding or math-heavy tasks, newer models perform significantly better.
Creative Writing (Strong)
- Fiction and storytelling
- Character dialogue
- Worldbuilding
- Roleplay scenarios
- Poetry and prose
Instruction Following (Good)
- Multi-step task completion
- Structured output generation
- Research summarization
- Document drafting
- Q&A with context
Limitations (Weaker Areas)
- Math and reasoning (newer models better)
- Code generation (use CodeLlama or Qwen Coder)
- Factual accuracy (no RLHF safety training)
- Long context (only 4K tokens)
- Multilingual (English-focused)
Local 70B Alternatives (2026)
Airoboros L2-70B was released in August 2023. Since then, several stronger 70B-class models have become available for local deployment. Unless you specifically need Airoboros's creative writing style, consider these newer alternatives:
| Model | MMLU | Context | VRAM (Q4) | Best For | Ollama |
|---|---|---|---|---|---|
| Airoboros L2-70B | ~64% | 4K | ~40 GB | Creative writing, roleplay | ollama run airoboros |
| Llama 3 70B | ~79% | 8K | ~40 GB | General purpose | ollama run llama3:70b |
| Llama 3.1 70B | ~82% | 128K | ~40 GB | Long context, general | ollama run llama3.1:70b |
| Qwen 2.5 72B | ~85% | 128K | ~42 GB | Multilingual, reasoning | ollama run qwen2.5:72b |
| Mixtral 8x22B | ~77% | 64K | ~80 GB | Code, multilingual | ollama run mixtral:8x22b |
| Nemotron 70B | ~83% | 128K | ~40 GB | Instruction following | ollama run nemotron:70b |
Comparative Analysis
Local 70B Models Comparison
All models below run locally. Scores reflect MMLU performance from the Open LLM Leaderboard.
| Model | Size | RAM Required | Speed | MMLU | Cost/Month |
|---|---|---|---|---|---|
| Airoboros L2-70B | 70B | ~40GB (Q4) | Medium | 64% | Free |
| Llama 2 70B Chat | 70B | ~40GB (Q4) | Medium | 69% | Free |
| Llama 3 70B | 70B | ~40GB (Q4) | Medium | 79% | Free |
| Qwen 2.5 72B | 72B | ~42GB (Q4) | Medium | 85% | Free |
When to Choose Airoboros L2-70B
Good Choice If...
- You need strong creative writing / fiction output
- You want a model known for roleplay capabilities
- You prefer GPT-4 self-instruct style responses
- You are already familiar with the Airoboros prompt format
Better Alternatives If...
- You need strong reasoning or math (use Qwen 2.5 72B)
- You need code generation (use Qwen 2.5 Coder or CodeLlama)
- You need long context (use Llama 3.1 70B with 128K)
- You need the best general-purpose 70B (use Llama 3 70B or Qwen 2.5 72B)
- You need multilingual support (use Qwen 2.5 72B)
Troubleshooting & Common Issues
Out of Memory (OOM) Errors
70B models are demanding. If you hit OOM errors, try these steps:
- Use a smaller quantization: switch from Q5_K_M to Q4_K_M or Q3_K_M
- Reduce context length: set `-c 2048` instead of 4096
- Offload layers to CPU: use a lower `-ngl` value (fewer GPU layers) in llama.cpp
- Close other GPU applications before loading the model
- Consider a smaller model: Llama 3 8B fits in 8GB VRAM and outperforms Airoboros L2-70B on many benchmarks
Slow Generation Speed
70B models are inherently slower than smaller models. Typical speeds:
- GPU (2x RTX 4090, Q4_K_M): ~10-15 tokens/second
- GPU (A100 80GB, Q4_K_M): ~20-30 tokens/second
- CPU only (64GB RAM): ~1-3 tokens/second
- Apple M2 Ultra: ~5-8 tokens/second

If speed is critical, consider Llama 3 8B or Mistral 7B Instruct, which run at 50-100+ tokens/second on a single GPU.
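To translate throughput into wall-clock time, divide reply length by tokens/second. The speeds below are representative midpoints of the ranges above, not measurements:

```python
def generation_seconds(n_tokens, tokens_per_second):
    """Wall-clock time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second

# A 500-token reply at representative speeds:
for label, tps in [("2x RTX 4090 (Q4_K_M)", 12),
                   ("A100 80GB (Q4_K_M)", 25),
                   ("CPU only (64GB RAM)", 2),
                   ("7B/8B model, one GPU", 75)]:
    print(f"{label}: {generation_seconds(500, tps):.0f} s")
```

At ~2 tokens/second on CPU, a 500-token answer takes over four minutes, which is why GPU offload matters so much for 70B models.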
Wrong Prompt Format
Airoboros uses a specific chat template. Using the wrong format degrades quality:
Correct Airoboros prompt format:

```
A chat between a curious user and an assistant. The assistant gives helpful, detailed, and polite answers. USER: [your question here] ASSISTANT:
```
Ollama handles this automatically. If using llama.cpp or Transformers directly, make sure to use this format.
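When driving the model directly, a small helper keeps the template consistent. The function name `airoboros_prompt` is illustrative; the string itself is the format shown above:

```python
SYSTEM = ("A chat between a curious user and an assistant. "
          "The assistant gives helpful, detailed, and polite answers.")

def airoboros_prompt(user_message, system=SYSTEM):
    """Assemble a single-turn prompt in the Airoboros USER:/ASSISTANT: format."""
    return f"{system} USER: {user_message} ASSISTANT:"

print(airoboros_prompt("Explain grouped-query attention in one paragraph."))
```

Pass the resulting string as the raw prompt to llama.cpp or `model.generate`; the model's reply is everything generated after the trailing `ASSISTANT:`.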
Frequently Asked Questions
What is the difference between Airoboros L2-70B and Airoboros-70B?
Airoboros L2-70B is fine-tuned on LLaMA 2 70B (the 'L2' stands for LLaMA 2), while the original Airoboros-70B was fine-tuned on LLaMA 1 65B. The L2 version benefits from LLaMA 2's improved base training (2 trillion tokens vs 1.4 trillion), doubled context window (4096 vs 2048 tokens), and a more permissive commercial license. Both use Jon Durbin's GPT-4 self-instruct training methodology.
What are the hardware requirements for running Airoboros L2-70B locally?
At full FP16 precision, Airoboros L2-70B requires ~140GB VRAM — impractical for most users. With Q4_K_M quantization (recommended), it needs ~40GB VRAM, fitting on dual RTX 3090/4090 or a single A100 80GB. For CPU-only inference, you need at least 48GB system RAM for the Q4 quantized version. An NVMe SSD is strongly recommended for model loading speed.
How does Airoboros L2-70B perform on benchmarks?
On the HuggingFace Open LLM Leaderboard, Airoboros L2-70B scores approximately 64-66% on MMLU (Massive Multitask Language Understanding). It performs well on creative writing and instruction-following tasks due to its GPT-4 self-instruct training, but scores below newer 70B models like Llama 3 70B (~79% MMLU). It remains a good option for creative and roleplay use cases.
Can I run Airoboros L2-70B with Ollama?
Yes. Run 'ollama run airoboros' to pull and run the model. Ollama handles quantization automatically. The default GGUF quantization typically uses Q4_K_M, requiring approximately 40GB VRAM. For CPU-only mode, ensure you have at least 48GB system RAM. Performance will be slower on CPU compared to GPU inference.
Is Airoboros L2-70B still worth using in 2026?
Airoboros L2-70B (released August 2023) has been surpassed by newer models in raw benchmark performance. Llama 3 70B, Qwen 2.5 72B, and Mixtral 8x22B all score significantly higher on MMLU and reasoning benchmarks. However, Airoboros L2-70B retains a following for creative writing and roleplay tasks where its GPT-4 self-instruct training produces distinctive output. For general-purpose use, newer models are recommended.
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.