Alpaca 7B
Stanford's $600 Instruction-Tuning Pioneer
Updated: March 16, 2026
Historical Context
Alpaca 7B (March 2023) is historically significant as the model that proved instruction-tuning could be done cheaply on open-source base models. It is not recommended for production use in 2026 — modern 7B models like Qwen 2.5 7B and Mistral 7B Instruct dramatically outperform it. This page covers Alpaca's methodology, real capabilities, and historical importance.
Why Alpaca Matters in AI History
Before March 2023, instruction-following AI was effectively locked behind proprietary APIs. OpenAI's ChatGPT (released November 2022) dominated, Google had only just announced Bard, and running anything comparable locally was considered impossible without a massive compute budget.
Stanford's Alpaca project changed this perception overnight. By fine-tuning Meta's LLaMA 7B on just 52,000 instruction-output pairs — generated for roughly $500 using OpenAI's text-davinci-003 API, with a total project cost of about $600 — the Stanford team demonstrated that a 7B parameter model could produce surprisingly coherent instruction-following behavior.
The Stanford team's own blind evaluation found Alpaca 7B performed comparably to text-davinci-003 on their test set, winning 45% of comparisons while losing 45% and tying 10%. This wasn't GPT-4-level performance, but it proved the concept: cheap instruction-tuning on open base models could produce usable AI assistants. Within weeks, projects like Vicuna, Koala, and Dolly followed.
Technical Architecture
Base Model: LLaMA 7B
- Architecture: Transformer decoder-only (GPT-style)
- Parameters: 6.74B (commonly rounded to 7B)
- Hidden Size: 4096
- Layers: 32 transformer blocks
- Attention Heads: 32
- Vocabulary: 32,000 tokens (SentencePiece)
- Context Length: 2048 tokens
- Pre-training Data: ~1T tokens (CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange)
Source: Touvron et al., "LLaMA: Open and Efficient Foundation Language Models" (arXiv:2302.13971)
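As a sanity check, the 6.74B parameter count can be reproduced from the table above with simple arithmetic. The sketch below additionally assumes LLaMA 7B's SwiGLU feed-forward intermediate size of 11008 and untied input/output embeddings, both taken from the released model config rather than the table:
vocab, hidden, layers, intermediate = 32_000, 4096, 32, 11008  # intermediate size: assumption, see above

embed = vocab * hidden              # input token embeddings
lm_head = vocab * hidden            # output projection (untied in LLaMA)
attn = 4 * hidden * hidden          # Q, K, V, O projections per layer
mlp = 3 * hidden * intermediate     # gate, up, down projections (SwiGLU) per layer
norms = 2 * hidden                  # two RMSNorm weight vectors per layer

total = embed + lm_head + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
print(f"~{total / 1e9:.2f}B parameters")  # -> ~6.74B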
Alpaca Fine-tuning
- Method: Full supervised fine-tuning (SFT)
- Dataset: 52,002 instruction-output pairs
- Data Generation: text-davinci-003 API (~$500)
- Seed Tasks: 175 hand-written instruction-output pairs
- Framework: HuggingFace Transformers
- Hardware: 4x A100 80GB GPUs
- Training Time: ~3 hours
- Total Cost: ~$600 (API + compute)
Source: Stanford CRFM Alpaca blog post (crfm.stanford.edu, March 13, 2023)
Key Architectural Limitations
- No RLHF: Alpaca only used supervised fine-tuning, not reinforcement learning from human feedback. This means it can generate harmful or incorrect content more readily than RLHF-trained models.
- 2048 context: Inherited from LLaMA 1, this is very short by 2026 standards (modern models support 32K-128K+ tokens).
- Single-turn only: The training data consisted of single instruction-response pairs, not multi-turn conversations. Alpaca struggles with dialogue continuity.
- Distillation artifacts: Because training data came from text-davinci-003, Alpaca inherited some of that model's biases and failure modes, plus introduced new ones from the distillation process.
Self-Instruct Training Methodology
Alpaca's training data was generated using a modified version of the Self-Instruct framework (Wang et al., 2022, arXiv:2212.10560). This was the key innovation — instead of paying humans to write thousands of instruction-output pairs, Stanford used an existing capable model to generate them.
How the 52K Dataset Was Created
175 Seed Tasks
Stanford researchers hand-wrote 175 diverse instruction-output pairs covering tasks like brainstorming, classification, rewriting, coding, and open-ended generation.
Prompt text-davinci-003
Using 3 seed examples as in-context demonstrations, text-davinci-003 was prompted to generate new instructions and corresponding outputs. Each API call cost ~$0.01.
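A minimal sketch of this step, with the API call elided (text-davinci-003 has long been retired) and a simplified version of the repo's actual generation prompt; seed_tasks here is a hypothetical stand-in for the repo's seed_tasks.jsonl:
import random

# Hypothetical stand-in for the 175 hand-written pairs in seed_tasks.jsonl
seed_tasks = [
    {"instruction": "Classify the sentiment of this tweet.", "output": "Negative"},
    {"instruction": "Rewrite this sentence in formal English.", "output": "..."},
    {"instruction": "Give three tips for staying healthy.", "output": "..."},
]

def build_generation_prompt(seed_tasks, num_demos=3):
    """Assemble a few-shot prompt asking the model to continue with new tasks."""
    demos = random.sample(seed_tasks, num_demos)
    lines = ["Come up with a series of diverse task instructions and outputs:\n"]
    for i, task in enumerate(demos, 1):
        lines.append(f"{i}. Instruction: {task['instruction']}\n   Output: {task['output']}")
    lines.append(f"{num_demos + 1}. Instruction:")  # the model continues from here
    return "\n".join(lines)

print(build_generation_prompt(seed_tasks))
# The returned prompt would then be sent to the completions API, and each
# response parsed into new (instruction, output) pairs.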
Filter and Deduplicate
Generated pairs were filtered for quality and deduplicated using ROUGE-L similarity, yielding 52,002 unique instruction-output pairs.
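A sketch of this filter using the rouge_score package; the 0.7 similarity threshold is an assumption for illustration:
from rouge_score import rouge_scorer

# Purely lexical comparison; no stemming
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def deduplicate(instructions, threshold=0.7):
    """Keep an instruction only if it is sufficiently dissimilar to every kept one."""
    kept = []
    for candidate in instructions:
        if all(scorer.score(prev, candidate)["rougeL"].fmeasure < threshold
               for prev in kept):
            kept.append(candidate)
    return kept

print(deduplicate(["Give three tips for staying healthy.",
                   "Give three tips for staying healthy and fit.",
                   "Translate this sentence into French."]))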
Fine-tune LLaMA 7B
Standard supervised fine-tuning on the 52K pairs. Learning rate 2e-5, batch size 128, 3 epochs. Total training: ~3 hours on 4x A100 GPUs.
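A hedged sketch of an equivalent setup with the plain HuggingFace Trainer; the actual repo uses its own train.py with FSDP, so the per-device batch split below is just one way to reach the stated effective batch size of 128, and the warmup ratio is likewise an assumption. Dataset preparation is elided:
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # community re-upload of LLaMA 7B
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# 4 GPUs x 4 sequences per device x 8 accumulation steps = 128 effective batch
args = TrainingArguments(
    output_dir="alpaca-7b-sft",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    warmup_ratio=0.03,   # assumption
    bf16=True,
    logging_steps=10,
)

# train_dataset: the 52K pairs, rendered with the prompt template shown below
# and tokenized; that preparation step is elided here.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()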
Training Data Format
{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet with plenty of fruits, vegetables, and whole grains.\n2. Exercise regularly, aiming for at least 30 minutes of moderate activity most days.\n3. Get adequate sleep, typically 7-9 hours per night."
}
The dataset includes both instruction-only and instruction+input formats. The full dataset is available at github.com/tatsu-lab/stanford_alpaca.
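Because a record may carry an empty or non-empty input field, a small helper selects between two prompt templates; the strings below mirror the templates in the Alpaca repo:
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

def format_example(example: dict) -> str:
    """Render one dataset record into Alpaca's prompt format."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(**example)
    return PROMPT_NO_INPUT.format(instruction=example["instruction"])

print(format_example({"instruction": "Give three tips for staying healthy.", "input": ""}))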
Honest Performance Assessment
No Standard Benchmarks Available
Alpaca was released before the Open LLM Leaderboard standardized model evaluation. Stanford evaluated Alpaca through blind pairwise comparisons with text-davinci-003 rather than reporting MMLU, HellaSwag, or other standard benchmark scores. The numbers below come from Stanford's own evaluation and from the LLaMA paper's published scores for the base model.
Stanford's Blind Evaluation
The Stanford team conducted blind comparisons between Alpaca 7B and text-davinci-003 on 252 test instructions. As noted above, Alpaca won roughly 45% of comparisons, lost 45%, and tied the remaining 10%, effectively reaching parity with its teacher model.
Source: Stanford CRFM Alpaca announcement, March 13, 2023 (crfm.stanford.edu/2023/03/13/alpaca.html). Note: The Stanford team acknowledged this evaluation was limited — text-davinci-003 itself was not state-of-the-art by early 2023 (GPT-4 had just been announced).
LLaMA 7B Base Model Benchmarks (for context)
Alpaca shares LLaMA 7B's knowledge — fine-tuning improved instruction-following format but didn't significantly change factual knowledge.
| Benchmark | LLaMA 7B | LLaMA 13B | GPT-3 175B |
|---|---|---|---|
| MMLU (5-shot) | 35.1% | 46.9% | 43.9% |
| HellaSwag | 76.1% | 79.2% | 78.9% |
| ARC (Challenge) | 47.6% | 52.7% | 51.4% |
| WinoGrande | 70.1% | 72.8% | 70.2% |
| TruthfulQA | 33.0% | 34.8% | — |
Source: Touvron et al., arXiv:2302.13971, Tables 3-9. LLaMA 7B matched GPT-3 175B on several benchmarks despite being 25x smaller.
What Alpaca Does Well
- Simple instruction following (rewrite, summarize, classify)
- Basic creative writing and brainstorming
- Formatting responses to user requests
- Short factual Q&A (within LLaMA 7B's knowledge)
- Demonstrating the instruction-tuning concept
Where Alpaca Fails
- Multi-turn conversations (not trained for dialogue)
- Complex reasoning and math (LLaMA 7B limitation)
- Code generation (minimal code in training data)
- Long-form content (2048 token context limit)
- Safety — no RLHF means it can be easily prompted to generate harmful content
- Hallucination — frequently fabricates facts confidently
VRAM Requirements by Quantization
Since Alpaca shares LLaMA 7B's architecture, VRAM requirements are identical to other LLaMA 7B variants. Community GGUF quantizations are available on HuggingFace.
| Quantization | File Size | VRAM Required | Quality Impact | Best For |
|---|---|---|---|---|
| Q4_0 | ~3.8 GB | ~4.5 GB | Noticeable loss | 8GB GPU / CPU-only machines |
| Q4_K_M | ~4.1 GB | ~5.0 GB | Good balance | Recommended for most users |
| Q5_K_M | ~4.8 GB | ~5.5 GB | Minimal loss | 12GB+ GPU |
| Q8_0 | ~7.2 GB | ~8.0 GB | Near-lossless | 16GB+ GPU |
| FP16 | ~13.5 GB | ~14.5 GB | Full precision | 24GB+ GPU (research only) |
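The file sizes in this table follow from simple bits-per-weight arithmetic. A rough estimator is sketched below; the effective bits-per-weight values are approximations that fold in quantization scale overhead:
PARAMS = 6.74e9  # LLaMA 7B parameter count

def approx_size_gb(bits_per_weight):
    """Back-of-envelope GGUF file size: parameters x bits / 8, in GB."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("Q4_0", 4.5), ("Q4_K_M", 4.85), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{approx_size_gb(bits):.1f} GB")
# VRAM adds roughly 0.5-1 GB on top of the file size for the KV cache
# and activations at the full 2048-token context.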
Running Alpaca in 2026
Availability Note
Stanford took down the original Alpaca weights after legal concerns about LLaMA 1's license. The Alpaca dataset (52K instruction-output pairs) is still available on GitHub. Community-made GGUF quantizations can be found on HuggingFace, but there is no official Ollama model. For practical use, we recommend modern alternatives instead (see below).
Using llama.cpp (if you have GGUF weights)
# Build llama.cpp (the old Makefile build has been removed; use CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run with a community GGUF file
./build/bin/llama-cli -m alpaca-7b.Q4_K_M.gguf \
  -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:" \
  -n 256 --temp 0.7
Alpaca uses a specific prompt format with "### Instruction:" and "### Response:" markers. Using the wrong format will produce poor results.
Using Transformers (if you have HF weights)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "chavinlo/alpaca-native" # Community upload
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
prompt = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
Explain the difference between supervised and unsupervised learning.
### Response:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Requires ~14GB VRAM for FP16. Use load_in_4bit=True with bitsandbytes for ~5GB VRAM.
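A sketch of that 4-bit path via the transformers BitsAndBytesConfig API; actual VRAM use varies with context length:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in fp16 after dequantization
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the common default
)

model = AutoModelForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("chavinlo/alpaca-native")
# Generation then works exactly as in the FP16 example above, at roughly 5GB VRAM.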
The 2023 Instruction-Tuning Wave
Alpaca triggered an explosion of instruction-tuned open models within weeks of its release: one $600 experiment catalyzed an entire movement.
By mid-2023, the basic Self-Instruct approach was already being superseded. Vicuna showed that real conversation data (from ShareGPT) produced better chat models than synthetic instructions. WizardLM showed that evolving instruction complexity matters more than dataset size. By the time LLaMA 2 arrived in July 2023 with an open license and built-in RLHF, the Alpaca-era approach was largely obsolete — but it had proven the concept that unlocked everything that followed.
License Restrictions
Dual License Problems
Alpaca has two separate license issues that make it unsuitable for commercial use:
1. LLaMA 1 License (Meta)
LLaMA 1 was released under a non-commercial research license. Any derivative model (including Alpaca) inherits this restriction. This is why Stanford eventually took down the weights.
2. OpenAI Terms of Service
The 52K training examples were generated by text-davinci-003. OpenAI's Terms of Service prohibit using API outputs to train models that compete with OpenAI. This creates an additional legal gray area even for research use.
For commercial projects: Use models with clear open licenses instead — Llama 3.x (Meta Community License), Qwen 2.5 (Apache 2.0), or Mistral (Apache 2.0).
Modern Alternatives (2026)
If you're looking for a local instruction-following model, these modern options dramatically outperform Alpaca 7B on every metric while being easier to run:
| Model | Size | MMLU | Context | License | Ollama |
|---|---|---|---|---|---|
| Alpaca 7B (2023) | 7B | ~35%* | 2K | Non-commercial | Not available |
| Qwen 2.5 7B Instruct | 7B | ~74% | 128K | Apache 2.0 | ollama run qwen2.5:7b |
| Mistral 7B Instruct v0.3 | 7B | ~63% | 32K | Apache 2.0 | ollama run mistral |
| Llama 3.2 3B Instruct | 3B | ~63% | 128K | Meta Community | ollama run llama3.2:3b |
| Gemma 2 9B Instruct | 9B | ~72% | 8K | Gemma ToU | ollama run gemma2:9b |
*Alpaca MMLU estimated from LLaMA 7B base (35.1%). Instruction tuning typically doesn't improve MMLU significantly. Modern 7B models have doubled this score through better pre-training data and techniques.
Frequently Asked Questions
Is Alpaca 7B still worth using in 2026?
For practical use, no. Modern models like Qwen 2.5 7B score ~74% on MMLU vs Alpaca's ~35%, support 128K context vs 2K, have proper RLHF safety training, and are available with commercial licenses. Alpaca is valuable as a learning tool for understanding instruction-tuning methodology.
Can I run Alpaca on Ollama?
Alpaca is not available as an official Ollama model. You can use community GGUF files with llama.cpp directly. However, for instruction-following tasks, ollama run qwen2.5:7b or ollama run mistral will give you dramatically better results with zero setup friction.
Why did Stanford take down the Alpaca weights?
Two legal concerns: (1) LLaMA 1's non-commercial research license restricted derivative distribution, and (2) OpenAI's Terms of Service prohibit using API outputs to train competing models. The 52K instruction dataset is still available on GitHub at tatsu-lab/stanford_alpaca.
What was Alpaca's actual impact on AI?
Alpaca demonstrated that instruction-tuning a small open model on synthetic data could produce usable AI assistants at trivial cost. This directly inspired Vicuna, Koala, WizardLM, Dolly, and dozens of other projects. The Self-Instruct methodology it popularized remains influential, though modern approaches use RLHF, DPO, and larger/better training datasets.
Sources & References
- Stanford CRFM: "Alpaca: A Strong, Replicable Instruction-Following Model" — Official project announcement (March 13, 2023)
- github.com/tatsu-lab/stanford_alpaca — Source code, training data, and methodology
- arXiv:2302.13971 — "LLaMA: Open and Efficient Foundation Language Models" — Touvron et al., 2023 (base model paper)
- arXiv:2212.10560 — "Self-Instruct: Aligning LMs with Self-Generated Instructions" — Wang et al., 2022 (training methodology)
- arXiv:2203.02155 — "Training Language Models to Follow Instructions with Human Feedback" — Ouyang et al., 2022 (InstructGPT, context for instruction-tuning)
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.