OPEN-SOURCE 30B LLM — APACHE 2.0

MPT-30B by MosaicML

Pioneering open-source LLM: 30B parameters with ALiBi attention for context-length generalization, trained on 1 trillion tokens with a fully permissive Apache 2.0 license

By Local AI Master · Released June 22, 2023 · Updated March 13, 2026

Honest Assessment Up Front

MPT-30B was a historically important model when released in June 2023 — one of the first high-quality open-source 30B models with a permissive license. Its ALiBi attention mechanism was genuinely innovative and influenced later models. However, MPT-30B has been substantially surpassed by newer open-source models (Llama 3, Qwen 2.5, Mistral) in both benchmarks and real-world quality.

We cover MPT-30B for its historical significance and because it remains relevant for understanding ALiBi attention. For new projects in 2026, see our alternatives section.

What Is MPT-30B?

30B

Parameters

Decoder-only transformer

8,192

Context Tokens

Extendable via ALiBi

1T

Training Tokens

Mixed web + code data

Apache 2.0

License

Full commercial use

MPT-30B Architecture Details

Model Specifications

  • Architecture: Decoder-only transformer
  • Layers: 48 transformer blocks
  • Attention heads: 64 (head dimension 112)
  • Hidden dimension: 7,168
  • Vocabulary: 50,432 tokens
  • Positional encoding: ALiBi (no learned positions)
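As a sanity check on these specifications, a back-of-the-envelope parameter count lands close to 30B. This is a sketch, not the exact published count: it assumes the standard 4x FFN expansion and ignores biases and LayerNorm weights, which contribute well under 1% of the total.

```python
# Rough parameter count from the specs above (decoder-only transformer).
d, layers, vocab = 7168, 48, 50432

attn = 4 * d * d           # Q, K, V, and output projections per layer
ffn = 2 * d * (4 * d)      # up- and down-projections, assuming 4x expansion
embed = vocab * d          # token embeddings; ALiBi means no position table

total = layers * (attn + ffn) + embed
print(f"{total / 1e9:.1f}B parameters")  # prints "30.0B parameters"
```

Note that the `embed` term has no positional-embedding counterpart: ALiBi removes that table entirely, as discussed below.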

Training Details

  • Training data: 1 trillion tokens
  • Training framework: MosaicML Composer
  • Hardware: MosaicML platform, on NVIDIA A100 and H100 GPUs (one of the first LLMs trained on H100s)
  • Flash Attention: Used during training
  • Activation: GELU
  • Normalization: Low-precision LayerNorm

MosaicML and Databricks Background

MPT-30B was developed by MosaicML, a company focused on making training large language models more efficient and accessible. In June 2023, Databricks acquired MosaicML for $1.3 billion, making it one of the largest AI acquisitions of 2023.

The MPT Family

MPT-7B (May 2023): First model in the series. 7B parameters, Apache 2.0. Proved MosaicML's training approach worked.
MPT-30B (June 2023): The flagship model. 30B parameters, same architecture scaled up. Multiple variants released (base, instruct, chat).
DBRX (March 2024): Databricks' successor after the acquisition. 132B MoE model that significantly surpassed MPT-30B.

Why MPT-30B Mattered

  • Apache 2.0 license: At release, most competitive models had restrictive licenses (Llama 1 was non-commercial). MPT-30B was one of the most capable truly open models.
  • Training efficiency: MosaicML demonstrated that with good infrastructure (their Composer framework), you could train competitive models more cost-effectively.
  • ALiBi attention: Popularized this positional encoding approach, which avoids learned position embeddings and generalizes to longer contexts.
  • Reproducibility: Published training details and data composition, advancing open-source LLM development.

ALiBi: The Key Architecture Innovation

ALiBi (Attention with Linear Biases) is the most technically interesting feature of MPT-30B. Instead of using learned positional embeddings (like GPT-2/3) or sinusoidal encodings (like the original Transformer), ALiBi adds a linear distance-based penalty directly to the attention scores.

How ALiBi Works

The Core Idea

In standard attention, positional information is added to token embeddings before computing attention. ALiBi instead modifies the attention computation itself by subtracting a penalty proportional to the distance between query and key positions:

attention(q_i, k_j) = softmax_j( q_i * k_j / sqrt(d) - m * |i - j| )

where m is a head-specific slope drawn from the geometric series 2^(-8/n), 2^(-16/n), ... (for n heads), and |i - j| is the distance between the query and key positions.

Source: Press et al., "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ICLR 2022, arXiv:2108.12409)
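To make the formula concrete, here is a minimal NumPy sketch of the ALiBi bias matrix. It is illustrative only: it uses 8 heads, where the geometric-series slopes are exact; non-power-of-two head counts use an interpolation scheme described in the paper.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Geometric series 2^(-8/n), 2^(-16/n), ...; exact when n_heads
    # is a power of two (Press et al., 2021).
    ratio = 2.0 ** (-8.0 / n_heads)
    return ratio ** np.arange(1, n_heads + 1)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    # Per-head bias -m * |i - j|, added to the attention scores
    # before the causal mask and softmax.
    pos = np.arange(seq_len)
    dist = np.abs(pos[None, :] - pos[:, None])            # (seq, seq)
    return -alibi_slopes(n_heads)[:, None, None] * dist   # (heads, seq, seq)

bias = alibi_bias(seq_len=4, n_heads=8)
# Head 0 has slope 0.5, so attending 3 tokens back costs -1.5 in score.
```

Because the penalty depends only on relative distance, the same function works for any sequence length at inference time, which is what enables (imperfect) extrapolation beyond the training window.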

Advantages of ALiBi

  • No learned parameters: Positional encoding adds zero trainable parameters
  • Context extrapolation: Can generalize to longer sequences than seen during training (with quality degradation)
  • Memory efficiency: No position embedding table to store
  • Simpler architecture: One fewer learned component in the model

Limitations of ALiBi

  • Not unlimited context: Quality degrades beyond training length (8K). "Infinite context" claims are misleading.
  • Newer alternatives exist: RoPE (used in Llama, Qwen) has become the dominant positional encoding, partly because it handles long contexts better with techniques like YaRN/NTK scaling.
  • Linear bias assumption: The fixed linear decay may not capture all positional relationships optimally.

Context Length Reality Check

MPT-30B was trained with an 8,192 token context window. While ALiBi theoretically allows extrapolation to longer sequences, in practice quality drops noticeably beyond the training length. The MPT-30B model card recommends staying within 8K tokens for reliable results.

For comparison, modern models like Llama 3.1 (128K context), Qwen 2.5 (128K context), and Mistral (32K context) were actually trained on long sequences and handle them natively.

Real Benchmark Performance

Benchmark Correction

Some sources cite MPT-30B MMLU at 50.6% (base model, 5-shot) from the HuggingFace Open LLM Leaderboard. MosaicML's own blog post reported higher numbers for the instruct variant. The numbers below are from public leaderboard data and MosaicML's published results.

MMLU Accuracy (5-shot) — MPT-30B vs Contemporaries and Modern Models

MPT-30B (Base): 50.6%
MPT-30B-Instruct: 52.2%
Llama 2 13B: 55.7%
Falcon 40B: 55.4%
Llama 3 8B: 66.6%
Qwen 2.5 32B: 83.3%

MPT-30B Benchmark Results

| Benchmark | Score |
|---|---|
| MMLU (5-shot) | ~50.6% |
| HellaSwag (10-shot) | ~82.6% |
| ARC-Challenge (25-shot) | ~58.5% |
| TruthfulQA (0-shot) | ~38.7% |
| Winogrande (5-shot) | ~78.6% |

Source: HuggingFace Open LLM Leaderboard (v1), MosaicML blog

Context: Where MPT-30B Stood in 2023

When released in June 2023, MPT-30B was competitive with other open models of its size class:

  • Comparable to: Falcon 40B (~55% MMLU), LLaMA 30B (~58% MMLU)
  • Advantage: Apache 2.0 license (LLaMA was non-commercial, Falcon had its own restrictive license initially)
  • By late 2023: Llama 2 70B (68% MMLU) and Mixtral 8x7B (70.6% MMLU) surpassed it
  • By 2024-2025: Llama 3 8B alone (66.6% MMLU) beats MPT-30B while using far less VRAM

VRAM Requirements by Quantization

MPT-30B Memory Usage

| Quantization | File Size | VRAM (GPU) | RAM (CPU) | Quality Impact | Hardware |
|---|---|---|---|---|---|
| Q4_K_M | ~17 GB | ~18 GB | ~20 GB | Moderate loss | RTX 4090 (24GB) |
| Q5_K_M | ~20 GB | ~22 GB | ~24 GB | Minor loss | RTX 4090 or 2x RTX 3090 |
| Q8_0 | ~31 GB | ~33 GB | ~35 GB | Negligible loss | 2x RTX 3090/4090 |
| FP16 | ~60 GB | ~62 GB | ~64 GB | No loss | A100 80GB or 3x RTX 3090 |

Note: VRAM numbers include overhead for KV cache at 8K context. Actual usage varies by context length and batch size. CPU-only inference is possible but slow (~2-5 tok/s).
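The table's figures can be roughly reproduced with a simple estimate: weight memory is parameters times bits-per-weight, plus a KV cache that grows linearly with context. The sketch below makes two loudly labeled assumptions: the bits-per-weight values (e.g. ~4.5 for Q4_K_M) are approximations inferred from the file sizes above, and the cache is assumed to be unquantized fp16, so the result overshoots runtimes that quantize the cache or run shorter contexts.

```python
def mpt30b_memory_gb(bits_per_weight: float,
                     n_params: float = 30e9,
                     ctx: int = 8192,
                     n_layers: int = 48,
                     d_model: int = 7168,
                     kv_bytes: int = 2) -> float:
    """Rough memory estimate in GB: weights + KV cache (no activations)."""
    weights = n_params * bits_per_weight / 8
    # K and V, one d_model-sized vector each, per layer per position
    kv_cache = 2 * n_layers * ctx * d_model * kv_bytes
    return (weights + kv_cache) / 1e9

# Approximate bits-per-weight for common quantizations (assumed values)
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{mpt30b_memory_gb(bits):.0f} GB at full 8K context")
```

At full 8K context the fp16 KV cache alone is ~11 GB; the smaller overheads in the table are consistent with shorter default context windows (Ollama defaults to 2K, where the cache shrinks to under 3 GB).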

Practical Recommendation

The Q4_K_M quantization fits on a single RTX 4090 (24GB VRAM) and is the most practical option for most users. However, given that a Llama 3 8B model at full Q8 quantization delivers better benchmark scores and fits in 10GB VRAM, MPT-30B is not the most efficient choice for new projects.

Installation and Running Locally

System Requirements

  • Operating System: Windows 10/11, macOS 12+, Ubuntu 20.04+
  • RAM: 24GB minimum (Q4_K_M), 64GB for FP16
  • Storage: 20-60GB depending on quantization
  • GPU: RTX 4090 (24GB) for Q4_K_M, A100 for FP16
  • CPU: 8+ cores for CPU-only inference (very slow)
1. Install Ollama. Download and install the Ollama runtime:

   $ curl -fsSL https://ollama.com/install.sh | sh

2. Pull MPT-30B. Download the model weights (check available tags on ollama.com):

   $ ollama pull mpt:30b

3. Run MPT-30B. Start an interactive chat session:

   $ ollama run mpt:30b

4. Set Context Size (Optional). Adjust the context window via a Modelfile if needed (printf is used rather than echo so the \n is interpreted portably):

   $ printf 'FROM mpt:30b\nPARAMETER num_ctx 4096\n' > Modelfile && ollama create mpt-custom -f Modelfile

Ollama Availability Note

MPT-30B availability on Ollama may be limited compared to more popular models like Llama 3 or Mistral. Check ollama.com/library for the latest available tags. The model is always available on Hugging Face at mosaicml/mpt-30b for use with llama.cpp or Hugging Face Transformers.

Terminal

$ ollama run mpt:30b
pulling manifest...
pulling 8f4e1e2a5b3c... 100% |████████████████████| 17 GB
verifying sha256 digest...
writing manifest...
success

>>> What is ALiBi attention and how does MPT-30B use it?

ALiBi (Attention with Linear Biases) is a method for encoding positional
information in transformer models. Instead of adding positional embeddings
to the input, ALiBi modifies the attention scores by subtracting a linear
penalty based on the distance between tokens. Each attention head uses a
different slope value, creating a geometric series of distance sensitivities.
MPT-30B uses ALiBi across all 64 attention heads, which means the model has
no learned positional embeddings. This makes the model architecture simpler
and theoretically allows it to generalize to longer sequences than it was
trained on.

>>>
$ _

Alternative: Using llama.cpp

For more control over quantization and inference parameters, you can use llama.cpp directly with GGUF-converted MPT-30B weights from TheBloke or other community contributors on Hugging Face:

# Clone and build llama.cpp
$ git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make

# Download a GGUF quantization into the current directory (example, check Hugging Face for the latest)
$ huggingface-cli download TheBloke/mpt-30B-GGUF mpt-30b.Q4_K_M.gguf --local-dir .

# Run inference
$ ./main -m mpt-30b.Q4_K_M.gguf -n 512 -p "Explain ALiBi attention:"

MPT-30B vs Other 30B-Class Models

| Model | Size | RAM Required | Speed | Quality (MMLU) | Cost/Month |
|---|---|---|---|---|---|
| MPT-30B | 30B | ~18GB (Q4) | ~8-15 tok/s | ~51% | Free (Apache 2.0) |
| Llama 2 13B | 13B | ~8GB (Q4) | ~20-30 tok/s | ~56% | Free (Meta License) |
| Falcon 40B | 40B | ~24GB (Q4) | ~5-10 tok/s | ~55% | Free (Apache 2.0) |
| Qwen 2.5 32B | 32B | ~20GB (Q4) | ~10-18 tok/s | ~83% | Free (Apache 2.0) |

Comparison Context

When MPT-30B Made Sense (2023)

  • One of the only permissively licensed 30B+ models
  • ALiBi attention was genuinely novel
  • Demonstrated cost-effective large-scale training
  • Competitive with other models of the era

Why It's Hard to Recommend in 2026

  • Llama 3 8B beats it on MMLU (66.6% vs 50.6%) with 1/4 the VRAM
  • Qwen 2.5 7B beats it (68.4% MMLU) with even less VRAM
  • Mistral 7B beats it (62.5% MMLU) at 7B parameters
  • Community and tooling support has largely moved to newer model families

Honest Assessment: Strengths and Limitations

Genuine Strengths

  • Apache 2.0 license: Still one of the most permissive licenses. No usage restrictions, no derivative work requirements.
  • ALiBi architecture: Genuinely interesting positional encoding approach. Worth studying for understanding transformer design choices.
  • Training transparency: MosaicML published details about training data, compute, and methodology.
  • Historical significance: Proved that non-Meta/Google/OpenAI orgs could train competitive large models.
  • Flash Attention: Early adopter of Flash Attention during training, demonstrating its benefits at scale.

Real Limitations

  • Outdated performance: ~50% MMLU is well below modern 7B models. You get less quality for more VRAM.
  • 8K context limit: Despite ALiBi, the practical context is 8K tokens. Modern models offer 32K-128K natively.
  • Limited fine-tuning ecosystem: Few community fine-tunes compared to Llama or Mistral families.
  • No instruction-following updates: The instruct variant is basic compared to modern RLHF/DPO-tuned models.
  • Tooling support declining: Ollama and other tools prioritize newer model architectures.

When to Still Consider MPT-30B

There are narrow use cases where MPT-30B may still be relevant:

  • License-sensitive deployments: If you specifically need Apache 2.0 and cannot use Meta or Mistral licenses (though Qwen 2.5 is also Apache 2.0, and Gemma 2 ships under Google's own relatively permissive terms).
  • Research/education: Understanding ALiBi attention, studying model architecture evolution, or comparing training approaches.
  • Existing integrations: If you already have MPT-30B deployed and fine-tuned for a specific use case, migration may not be worth the effort.

Local AI Alternatives in 2026

If you're looking for a model to run locally in 2026, these options deliver better performance per VRAM dollar than MPT-30B:

Best for 8GB VRAM

Llama 3 8B

66.6% MMLU | 8K context (128K with Llama 3.1) | Meta License

Beats MPT-30B on all benchmarks at 1/4 the VRAM.

ollama run llama3

Qwen 2.5 7B

68.4% MMLU | 128K context | Apache 2.0

Same license as MPT-30B, vastly better performance.

ollama run qwen2.5:7b

Best for 16GB VRAM

Qwen 2.5 14B

79.9% MMLU | 128K context | Apache 2.0

Blows past MPT-30B at half the VRAM.

ollama run qwen2.5:14b

Mistral Nemo 12B

68.0% MMLU | 128K context | Apache 2.0

Compact, fast, permissive license.

ollama run mistral-nemo

Best for 24GB VRAM

Qwen 2.5 32B (Q4)

83.3% MMLU | 128K context | Apache 2.0

Same VRAM as MPT-30B Q4, 30+ points higher MMLU.

ollama run qwen2.5:32b

Gemma 2 27B (Q4)

75.2% MMLU | 8K context | Gemma Terms

Google's strong 27B model. Different license terms.

ollama run gemma2:27b



Written by Pattanaik Ramswarup, creator of Local AI Master.
