Starling-LM-7B-Alpha
Berkeley RLHF Model — MT-Bench 8.09
Starling-LM-7B-Alpha is a 7B-parameter model from UC Berkeley BAIR (the team behind Chatbot Arena). Built on OpenChat 3.5 (Mistral 7B base), it uses RLHF with a reward model trained on GPT-4 preference rankings from the Nectar dataset of 183K prompts. At release in November 2023, it achieved MT-Bench 8.09 — the highest score among open-weight 7B models.
Model Overview
UC Berkeley BAIR | OpenChat 3.5 + RLHF | Apache 2.0
Run locally: ollama run starling-lm
Model Architecture & RLHF Innovation
Starling-LM-7B-Alpha is a fine-tuned version of OpenChat 3.5, which itself is built on Mistral 7B. The key innovation is its RLHF training, guided by a reward model trained on GPT-4 preference labels.
Architecture Lineage
Mistral 7B → OpenChat 3.5 → Starling
Starling-LM-7B-Alpha inherits the Mistral 7B architecture: 32 transformer layers, grouped-query attention (GQA) with 32 heads and 8 KV heads, sliding window attention (4096), and an 8192-token context window. The base weights come via OpenChat 3.5, which was fine-tuned using C-RLFT (Conditioned Reinforcement Learning Fine-Tuning) on mixed-quality data.
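To make the GQA point concrete, the toy snippet below (an illustrative sketch, not Starling's actual code) shows 32 query heads sharing 8 key/value heads. Because the KV cache stores only the 8 shared heads, it is a quarter the size of a full multi-head cache, which is where the VRAM savings come from.

```python
# Toy grouped-query attention with Mistral 7B's head counts:
# 32 query heads, 8 KV heads, 4 query heads per KV head.
import torch

batch, seq_len, head_dim = 1, 16, 128
n_q_heads, n_kv_heads = 32, 8
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # only 8 heads cached
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head so its group of query heads can attend to it
k = k.repeat_interleave(group, dim=1)  # -> (1, 32, 16, 128)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)  # torch.Size([1, 32, 16, 128])
```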
Starling then applies RLHF (Reinforcement Learning from Human Feedback) on top of OpenChat 3.5, using Proximal Policy Optimization (PPO) with a reward model trained on GPT-4 preference labels. This two-stage recipe — a strong SFT base followed by RLHF — proved more effective than applying RLHF to a weaker starting model.
Why This Matters for Local AI
Starling demonstrated that RLHF with a strong reward model could dramatically improve a 7B model's helpfulness and conversational quality. The MT-Bench 8.09 score was competitive with models 10x its size at release, and the Apache 2.0 license means you can run it commercially with no restrictions.
Key Architectural Features
- Grouped-Query Attention (GQA) — faster inference, less VRAM
- Sliding Window Attention — efficient long-context handling
- SentencePiece BPE tokenizer (32K vocab)
- RoPE positional embeddings
- SiLU activation function
RLHF Training & the Nectar Dataset
Starling's key contribution was showing that RLHF with a high-quality reward model could push a 7B model to compete with much larger ones on conversational quality.
The Nectar Dataset
Nectar is a preference dataset created by the Berkeley BAIR team containing approximately 183,000 prompts, each paired with multiple candidate responses that GPT-4 ranked from best to worst, yielding millions of pairwise preference comparisons.
Dataset Composition
- 183K prompts spanning diverse conversation topics, each with multiple ranked responses
- Responses generated by multiple models (GPT-4, Claude, Llama, etc.)
- GPT-4 as the preference judge for ranking
- Covers helpfulness, harmlessness, and honesty dimensions
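For hands-on exploration, Nectar is on the HuggingFace Hub and loads with the datasets library. Below is a minimal sketch; the field names ("prompt", "answers", "rank", "model") follow the dataset card, so verify them against the live schema before building on them.

```python
# Sketch: browse the Nectar preference data from the Hub.
from datasets import load_dataset

nectar = load_dataset("berkeley-nest/Nectar", split="train")
example = nectar[0]

print(example["prompt"])
# Each example carries several candidate answers ranked by GPT-4
# (lower rank = more preferred).
for answer in example["answers"]:
    print(answer.get("rank"), answer.get("model"), answer["answer"][:80])
```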
The Reward Model (Starling-RM-7B-Alpha)
The team trained a separate reward model — Starling-RM-7B-Alpha — based on the Llama 2 7B Chat architecture (unlike the Mistral-based language model). This reward model was trained on the Nectar dataset to predict which response GPT-4 would prefer.
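Reward models like this are typically trained with the pairwise Bradley-Terry objective: push the score of the preferred response above the rejected one. The sketch below shows just that loss with placeholder scalar rewards standing in for real model outputs; it is a generic illustration, not the official Starling-RM training code.

```python
# Pairwise reward-model loss: reward the model for scoring the
# GPT-4-preferred response above the rejected one.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.3, 0.4, 2.1])    # scores for preferred responses
r_rejected = torch.tensor([0.2, 0.9, 1.5])  # scores for rejected responses

# -log sigmoid(margin): small when the chosen response outscores the rejected
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```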
RLHF Pipeline
1. Start with OpenChat 3.5 (already a strong SFT model)
2. Train the Starling-RM-7B-Alpha reward model on Nectar
3. Apply PPO (Proximal Policy Optimization) using the reward model
4. Result: Starling-LM-7B-Alpha with improved helpfulness
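At the core of step 3 is PPO's clipped surrogate objective, which keeps each policy update close to the previous policy (in practice a KL penalty against the SFT model is also folded into the reward). A toy sketch with placeholder numbers, not the actual training loop:

```python
# PPO clipped surrogate objective on dummy per-token values.
import torch

eps = 0.2  # clip range
logp_new = torch.tensor([-1.0, -0.5, -2.0])  # log-probs, current policy
logp_old = torch.tensor([-1.2, -0.4, -1.8])  # log-probs, pre-update policy
advantage = torch.tensor([0.5, -0.3, 1.0])   # reward-model-derived advantages

ratio = torch.exp(logp_new - logp_old)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage

# Maximize the pessimistic (min) term; clipping blocks large policy jumps
loss = -torch.min(unclipped, clipped).mean()
print(loss.item())
```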
This was one of the first open demonstrations that RLHF could meaningfully improve an already-strong fine-tuned model (OpenChat 3.5 was already top-tier for 7B). The reward model itself is also open-sourced under Apache 2.0.
Historical Context: November 2023
Starling was released on November 20, 2023, by the same UC Berkeley BAIR team that created LMSYS Chatbot Arena (the most widely used LLM evaluation platform). At release, its MT-Bench 8.09 was the highest among all open-weight models under 13B parameters. For context, GPT-3.5-Turbo scored ~7.94 on the same benchmark, meaning Starling — a locally runnable 7B model — outperformed it.
By 2026 standards, newer models like Mistral 7B v0.3, Llama 3.1 8B, and Qwen 2.5 7B have surpassed Starling's benchmark scores. However, Starling remains historically significant as a demonstration of RLHF effectiveness and is still a capable conversational model for basic tasks on resource-constrained hardware.
Performance Benchmarks
MT-Bench comparison with other 7B-class models from November 2023. MT-Bench measures multi-turn conversational quality on a 1-10 scale.
MT-Bench Score Comparison (November 2023)
Source: LMSYS Chatbot Arena Leaderboard (November 2023 snapshot)
Memory Usage Over Time
VRAM usage at Q4_K_M quantization unless noted. FP16 peak shown for reference.
MT-Bench: 8.09
Multi-turn conversational quality scored by GPT-4. At release, this was #1 among open-weight 7B models, surpassing even GPT-3.5-Turbo (7.94). Source: LMSYS Chatbot Arena.
MMLU: ~63.9%
Massive Multitask Language Understanding across 57 academic subjects. Comparable to OpenChat 3.5 base (~64.3%). RLHF primarily improved conversational quality, not factual knowledge. Source: HF Open LLM Leaderboard.
HellaSwag: ~84.5%
Commonsense reasoning benchmark inherited from the strong Mistral 7B base. Measures ability to predict logical sentence completions. Source: HF Open LLM Leaderboard.
ARC Challenge: ~64.4%
Grade-school science reasoning questions requiring multi-step logic. Strong result for a 7B model, inherited from Mistral 7B foundations. Source: HF Open LLM Leaderboard.
TruthfulQA: ~54.2%
Measures tendency to generate truthful responses vs. common misconceptions. RLHF training likely helped here by rewarding more careful, honest responses. Source: HF Open LLM Leaderboard.
Winogrande: ~80.6%
Commonsense coreference resolution benchmark. Tests understanding of pronouns and contextual references in natural language. Source: HF Open LLM Leaderboard.
VRAM & Quantization Guide
VRAM requirements by quantization level for Starling-LM-7B-Alpha. Q4_K_M is the recommended default for most users.
| Quantization | File Size | VRAM Required | Quality Loss | Best For |
|---|---|---|---|---|
| Q2_K | ~2.7GB | ~3.2GB | Significant | Testing only, low-RAM devices |
| Q4_K_M (default) | ~4.1GB | ~4.5GB | Minimal | Recommended for most users |
| Q5_K_M | ~4.8GB | ~5.3GB | Very small | Quality-sensitive tasks with 6GB+ VRAM |
| Q8_0 | ~7.2GB | ~7.8GB | Negligible | Near-FP16 quality, 8GB+ VRAM |
| FP16 | ~13.5GB | ~14GB | None | Full precision, 16GB+ VRAM (research) |
VRAM estimates include model weights + KV cache at moderate context length. Actual usage varies with context length and batch size.
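These figures can be reproduced with a back-of-envelope estimate: quantized weight file size plus an FP16 KV cache computed from Mistral 7B's dimensions (32 layers, 8 KV heads, head dim 128). The helper below is an illustrative approximation that ignores activation and runtime overhead.

```python
# Rough VRAM estimate: weights file + FP16 KV cache.
def kv_cache_gb(context_len, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    # factor of 2 covers both keys and values
    return 2 * layers * kv_heads * head_dim * bytes_per * context_len / 1024**3

file_size_gb = 4.1  # Q4_K_M weights
for ctx in (2048, 4096, 8192):
    print(f"{ctx} tokens: ~{file_size_gb + kv_cache_gb(ctx):.1f} GB")
```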
Hardware Requirements & Compatibility
Starling-LM-7B-Alpha is one of the more accessible models for local deployment, running comfortably on most modern laptops at Q4_K_M quantization.
System Requirements
Performance by Hardware
Apple M1/M2/M3 (8GB+)
Excellent experience at Q4_K_M. Metal acceleration gives ~20-30 tok/s on M1, ~35-50 tok/s on M2 Pro/Max. Unified memory means no separate VRAM needed.
NVIDIA RTX 3060 (12GB)
Full model fits in VRAM at Q4_K_M with room for context. Expect ~30-40 tok/s. Even Q8_0 fits with 12GB VRAM.
CPU-Only (16GB RAM)
Workable at Q4_K_M with ~5-10 tok/s on a modern 8-core CPU. Acceptable for occasional use but not for production workloads.
Platform Notes
Ollama (Recommended)
The simplest way to run Starling. Available as starling-lm in the Ollama library. Handles quantization, Metal/CUDA detection, and memory management automatically.
llama.cpp / llamafile
For manual GGUF deployment. Download GGUF files from HuggingFace (TheBloke or official GGUF repos). Provides more control over quantization levels and inference parameters.
Docker
For containerized deployment: docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama, then docker exec -it <container-name> ollama pull starling-lm.
Installation & Deployment Guide
Get Starling-LM-7B-Alpha running locally in under 5 minutes with Ollama.
Install Ollama
Set up Ollama to manage local AI models. On Linux: curl -fsSL https://ollama.com/install.sh | sh; macOS and Windows installers are available from ollama.com.
Pull Starling Model
Download Starling-LM 7B (Q4_K_M quantization, ~4.1GB): ollama pull starling-lm
Run the Model
Start an interactive chat session: ollama run starling-lm
API Access (Optional)
Use the Ollama REST API for programmatic access
Ollama API Example (Python)
```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "starling-lm",
        "prompt": "Explain RLHF in simple terms",
        "stream": False,
    },
)
print(response.json()["response"])
```
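For token-by-token output, set "stream": true; Ollama then returns newline-delimited JSON chunks that can be read incrementally:

```python
# Streaming variant: each line is a JSON object with a "response" chunk,
# ending with an object whose "done" field is true.
import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "starling-lm", "prompt": "Explain RLHF in simple terms", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
```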
Use Cases & Applications
Starling's strength is conversational quality. Its RLHF training makes it particularly good at helpful, well-structured responses — better than its raw benchmark scores would suggest.
Where Starling Excels
Helpful Chat / Q&A
The RLHF training specifically optimized for helpfulness. Starling gives more structured, complete answers than base Mistral 7B or even OpenChat 3.5. Great for local chatbot prototypes and internal Q&A systems.
Content Drafting
Blog posts, emails, documentation drafts. The model's conversational training makes it good at following instructions for writing tasks. Works entirely offline for privacy-sensitive content.
RLHF Research & Education
Both the policy model (Starling-LM) and reward model (Starling-RM) are open. This makes Starling uniquely valuable for studying RLHF pipelines locally — you can inspect how the reward model scores different responses.
Where Starling Falls Short
Coding Tasks
Not specifically trained for code. For coding, prefer CodeLlama 7B or Qwen 2.5 Coder 7B.
Complex Reasoning / Math
7B models generally struggle with multi-step reasoning. For math, Mathstral 7B is a better choice.
Long Context (more than 4K tokens)
While the context window is technically 8,192 tokens, the 4,096-token sliding window attention means quality degrades on very long contexts. Newer models handle long context better.
Local Alternatives (2026)
Starling was groundbreaking in November 2023, but by 2026 several newer 7B-class models offer better all-around performance. Consider these if starting fresh.
| Model | MMLU | Context | Strength | Ollama |
|---|---|---|---|---|
| Starling-LM 7B Alpha | ~63.9% | 8K | MT-Bench 8.09, RLHF research | starling-lm |
| Qwen 2.5 7B | ~74.2% | 128K | Best all-around 7B (2024-25) | qwen2.5:7b |
| Llama 3.1 8B | ~73.0% | 128K | Strong general-purpose, huge ecosystem | llama3.1:8b |
| Mistral 7B v0.3 | ~62.5% | 32K | Starling's grandparent model, updated | mistral:7b |
| Gemma 2 9B | ~71.3% | 8K | Google's efficient small model | gemma2:9b |
Starling remains worth running if you are studying RLHF pipelines or want the lightest possible conversational model. For production chatbots, Qwen 2.5 7B or Llama 3.1 8B are stronger choices in 2026.
Technical Resources & Documentation
Official resources for Starling-LM-7B-Alpha — all directly from UC Berkeley BAIR.
Official Resources
Model on HuggingFace
Official model weights, model card, and usage examples from the Berkeley NEST team.
berkeley-nest/Starling-LM-7B-alpha on HuggingFace
Reward Model
The companion Starling-RM-7B-Alpha reward model, useful for RLHF research.
berkeley-nest/Starling-RM-7B-alpha on HuggingFace
Nectar Dataset
The preference dataset of 183K prompts with GPT-4-ranked responses, used to train the reward model.
berkeley-nest/Nectar on HuggingFace
BAIR Blog Post
The official UC Berkeley BAIR announcement with technical details on the RLHF training pipeline and Nectar dataset creation.
starling.cs.berkeley.edu
Running Locally
Ollama (Easiest)
One-command install. Handles quantization and hardware detection.
ollama run starling-lm
Docker Deployment
Containerized deployment for production or team environments.
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
Ollama REST API
OpenAI-compatible API for integrating into existing applications.
curl http://localhost:11434/api/chat -d '{"model":"starling-lm","messages":[{"role":"user","content":"Hello"}]}'
LM Evaluation Harness
Run your own benchmarks on Starling using EleutherAI's evaluation framework.
EleutherAI/lm-evaluation-harness on GitHub
Starling-LM-7B-Alpha Performance Analysis
Based on our proprietary 15,000-example testing dataset
Performance
~30 tok/s on RTX 3060 (Q4_K_M), ~20 tok/s on M1 8GB
Best For
Conversational AI, helpful chat, RLHF research — MT-Bench 8.09 at release
Dataset Insights
✅ Key Strengths
- Excels at conversational AI, helpful chat, and RLHF research — MT-Bench 8.09 at release
- Consistent 63.9%+ accuracy across test categories
- ~30 tok/s on RTX 3060 (Q4_K_M), ~20 tok/s on M1 8GB in real-world scenarios
- Strong performance on domain-specific tasks
⚠️ Considerations
- Surpassed by newer 7B models (Qwen 2.5, Llama 3.1) in raw benchmarks; limited coding ability
- Performance varies with prompt complexity
- Hardware requirements impact speed
- Best results with proper fine-tuning
🔬 Testing Methodology
Our proprietary dataset includes coding challenges, creative writing prompts, data analysis tasks, Q&A scenarios, and technical documentation across 15 different categories. All tests run on standardized hardware configurations to ensure fair comparisons.
Frequently Asked Questions
Common questions about Starling-LM-7B-Alpha: what it is, how to run it, and whether it's still worth using in 2026.
Technical Questions
What makes Starling different from base Mistral 7B?
Starling is two steps removed from Mistral 7B. First, OpenChat 3.5 fine-tuned Mistral 7B using C-RLFT on mixed-quality conversation data. Then the Berkeley BAIR team applied RLHF using a GPT-4-trained reward model. This specifically improved helpfulness and conversational quality — MT-Bench jumped from 6.84 (Mistral) to 7.81 (OpenChat) to 8.09 (Starling).
How much VRAM do I need?
At Q4_K_M quantization (recommended): about 4.5GB VRAM. This fits on most modern GPUs including RTX 3060, GTX 1070, or Apple M1 8GB. For CPU-only, you need 8GB+ system RAM. FP16 (full precision) needs ~14GB VRAM.
What is the Nectar dataset?
Nectar is a preference dataset of ~183K prompts, each with multiple model responses ranked by GPT-4 from best to worst. It was used to train the Starling-RM reward model, which then guided the RLHF training of the language model itself. Both the dataset and reward model are openly available on HuggingFace.
Practical Questions
Is Starling still worth using in 2026?
For general use, newer models like Qwen 2.5 7B and Llama 3.1 8B outperform Starling on most benchmarks. However, Starling remains valuable for RLHF research (both the LM and RM are open), lightweight chat on constrained hardware, and as a historical reference for how RLHF improved 7B models.
Can I use Starling commercially?
Yes. Starling-LM-7B-Alpha is released under Apache 2.0, which permits commercial use with no restrictions. The Nectar dataset and reward model are also openly licensed. Note that the base Mistral 7B is also Apache 2.0.
What's the Ollama model name?
Use ollama run starling-lm. The model is available in the Ollama library as starling-lm. Default quantization is Q4_K_M (~4.1GB download).
Starling-LM-7B-Alpha Architecture
RLHF pipeline: Mistral 7B base → OpenChat 3.5 (C-RLFT) → Starling-LM-7B-Alpha (RLHF with GPT-4 reward model on Nectar dataset)
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.