Local AI Model Database (2026): VRAM, Speed & Specs for 30+ Models
This is a comprehensive, sourced reference for 30+ of the most popular local (open-weight) AI models in 2026 — with parameter count, context window, minimum Q4 VRAM, approximate tokens/sec on an RTX 3090 and RTX 4090, license, and best use. The short version: an 8B model needs roughly 5 GB of VRAM at Q4 and runs on almost anything; a 14B-class model needs ~9-10 GB and fits a 12-16 GB card; a 32B dense model needs ~20 GB and wants a 24 GB card; and 70B-and-up models need 40 GB+, so they either spill to system RAM or want two 24 GB GPUs. Mixture-of-Experts (MoE) models such as Qwen3-30B-A3B and gpt-oss-20b run as fast as much smaller models because only a few billion parameters activate per token — but you still need enough memory to hold all the weights. Every spec below is cited to a primary source; the tokens/sec numbers are approximate single-GPU estimates, not a controlled benchmark.
How to read this database
- Params — total parameters for dense models. For Mixture-of-Experts (MoE) models we list total / active (e.g. 30.5B total / 3.3B active): total decides memory, active decides speed.
- Context — the maximum context window from the official model card. Several models hit their largest context only with extrapolation (YaRN), noted inline.
- Min VRAM (Q4) — approximate memory to hold the weights at ~Q4_K_M quantization, plus a small KV-cache allowance. Long context, larger batches, or higher quants need more. For exact numbers against your GPU, use the VRAM calculator.
- tok/s (3090 / 4090) — approximate generation speed for a short prompt with the model fully offloaded to a single GPU via llama.cpp/Ollama. Treat as ballpark; real numbers vary with quant, context, driver and prompt. “Partial” means the model does not fully fit and some layers run on CPU, which slows things sharply.
- License — the redistribution license. “Apache 2.0” and “MIT” are the most permissive; community/terms licenses (Llama, Gemma, NVIDIA Open Model) allow commercial use with conditions; Codestral (MNPL) and Mistral Large (MRL) are non-commercial — read before shipping.
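As a rough illustration of how the Min VRAM (Q4) column can be approximated, the sketch below multiplies parameter count by an assumed average of about 4.8 bits per weight for Q4_K_M and adds a small fixed KV-cache allowance. Both defaults and the function name `estimate_q4_vram_gb` are assumptions made for this example, not the exact method behind the table, so expect the results to land near the listed figures rather than match them.

```python
def estimate_q4_vram_gb(params_billion: float,
                        bits_per_weight: float = 4.8,   # assumed average for ~Q4_K_M
                        kv_allowance_gb: float = 0.5) -> float:  # assumed small KV allowance
    """Rough VRAM in GB: quantized weights plus a small fixed KV-cache allowance."""
    # billions of params * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_allowance_gb


if __name__ == "__main__":
    # Parameter counts as listed in the table; for MoE models pass the TOTAL count.
    for name, params in [("Llama 3.1 8B", 8.03), ("Qwen3 14B", 14.0), ("Qwen3 32B", 32.8)]:
        print(f"{name}: ~{estimate_q4_vram_gb(params):.1f} GB")
```

Raising `kv_allowance_gb` is the simplest way to model longer context. Higher quants can be approximated by raising `bits_per_weight`.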
The local AI model database (30+ models)
Grouped roughly by family, from smallest to largest within each. Names link to our full per-model pages where available.
| Model | Family | Params | Context | Min VRAM (Q4) | ~tok/s 3090 | ~tok/s 4090 | License | Best use |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2 1B | Meta Llama | 1.23B dense | 128K | ~1.5 GB | ~120+ | ~150+ | Llama 3.2 Community | On-device, edge, fast drafting |
| Llama 3.2 3B | Meta Llama | 3.21B dense | 128K | ~2.5 GB | ~90+ | ~110+ | Llama 3.2 Community | Lightweight chat, low-VRAM laptops |
| Llama 3.1 8B | Meta Llama | 8.03B dense | 128K | ~5 GB | ~55-70 | ~80-100 | Llama 3.1 Community | General-purpose 8GB workhorse |
| Llama 3.3 70B | Meta Llama | 70.6B dense | 128K | ~40 GB | ~6-9 (partial) | ~8-12 (partial) | Llama 3.3 Community | 405B-class quality, needs 2x24GB |
| Llama 3.1 405B | Meta Llama | 405B dense | 128K | ~230 GB | N/A (multi-node) | N/A (multi-node) | Llama 3.1 Community | Frontier dense, server clusters only |
| Qwen3 4B | Qwen3 (Alibaba) | 4B dense | 128K (32K native + YaRN) | ~3 GB | ~80+ | ~100+ | Apache 2.0 | Small reasoning model, hybrid thinking |
| Qwen3 8B | Qwen3 (Alibaba) | 8B dense | 128K (32K native + YaRN) | ~5 GB | ~55-70 | ~80-100 | Apache 2.0 | Strong 8GB all-rounder w/ reasoning |
| Qwen3 14B | Qwen3 (Alibaba) | 14B dense | 128K (32K native + YaRN) | ~9.5 GB | ~38-45 | ~55-65 | Apache 2.0 | Best dense model that fits 12-16GB |
| Qwen3 32B | Qwen3 (Alibaba) | 32.8B dense | 128K (32K native + YaRN) | ~20 GB | ~18-24 | ~28-36 | Apache 2.0 | Top dense quality on a single 24GB card |
| Qwen3 30B-A3B | Qwen3 (Alibaba) | 30.5B total / 3.3B active (MoE) | 128K (32K native + YaRN) | ~18 GB | ~55-75 | ~80-110 | Apache 2.0 | 32B-class smarts at 8B-class speed |
| Qwen3 235B-A22B | Qwen3 (Alibaba) | 235B total / 22B active (MoE) | 256K (Instruct-2507) | ~130 GB | N/A (multi-GPU) | N/A (multi-GPU) | Apache 2.0 | Flagship open MoE, big rigs / Mac 256GB |
| Qwen2.5-Coder 7B | Qwen Coder | 7.6B dense | 128K | ~5 GB | ~55-70 | ~80-100 | Apache 2.0 | Fast IDE autocomplete (FIM), 8GB |
| Qwen2.5-Coder 14B | Qwen Coder | 14.7B dense | 128K | ~9.5 GB | ~38-45 | ~55-65 | Apache 2.0 | Best coder that fits a 16GB GPU |
| Qwen2.5-Coder 32B | Qwen Coder | 32.5B dense | 128K | ~20 GB | ~18-24 | ~28-36 | Apache 2.0 | 92.7% HumanEval, single 24GB card |
| Qwen3-Coder 30B-A3B | Qwen Coder | 30.5B total / 3.3B active (MoE) | 256K (1M w/ YaRN) | ~18 GB | ~55-75 | ~80-110 | Apache 2.0 | Agentic coding, repo-scale context |
| Qwen3-Coder 480B-A35B | Qwen Coder | 480B total / 35B active (MoE) | 256K (1M w/ YaRN) | ~270 GB | N/A (multi-node) | N/A (multi-node) | Apache 2.0 | Sonnet-class open agentic coder |
| DeepSeek-R1-Distill-Qwen 7B | DeepSeek | 7B dense (Qwen base) | 128K | ~5 GB | ~55-70 | ~80-100 | MIT | Cheap local reasoning, 8GB |
| DeepSeek-R1-Distill-Llama 8B | DeepSeek | 8B dense (Llama base) | 128K | ~5 GB | ~55-70 | ~80-100 | MIT | Reasoning chains on a 3060 |
| DeepSeek-R1-Distill-Llama 70B | DeepSeek | 70B dense (Llama base) | 128K | ~40 GB | ~6-9 (partial) | ~8-12 (partial) | MIT | Strongest distilled reasoner, 2x24GB |
| DeepSeek-Coder-V2-Lite 16B | DeepSeek | 16B total / 2.4B active (MoE) | 128K | ~10.5 GB | ~50+ | ~70+ | DeepSeek License (commercial OK) | Fast MoE coder, low latency |
| DeepSeek-V3 / R1 671B | DeepSeek | 671B total / 37B active (MoE) | 128K | ~400 GB | N/A (multi-node) | N/A (multi-node) | MIT | Frontier open MoE, server clusters |
| Gemma 3 270M | Google Gemma | 0.27B dense | 32K | ~0.5 GB | ~200+ | ~250+ | Gemma Terms | Tiny on-device tasks, fine-tuning base |
| Gemma 3 4B | Google Gemma | 4B dense (multimodal) | 128K | ~3 GB | ~80+ | ~100+ | Gemma Terms | Vision + text on a laptop |
| Gemma 3 12B | Google Gemma | 12B dense (multimodal) | 128K | ~8 GB | ~42-50 | ~60-72 | Gemma Terms | Multimodal all-rounder, 12GB |
| Gemma 3 27B | Google Gemma | 27B dense (multimodal) | 128K | ~16 GB | ~22-28 | ~32-42 | Gemma Terms | Top single-GPU multimodal, 24GB |
| Mistral 7B | Mistral | 7.25B dense | 32K | ~5 GB | ~60-75 | ~85-105 | Apache 2.0 | Classic lightweight baseline |
| Mistral NeMo 12B | Mistral | 12B dense | 128K | ~8 GB | ~42-50 | ~60-72 | Apache 2.0 | Multilingual 12GB workhorse |
| Mistral Small 3.2 24B | Mistral | 24B dense (multimodal) | 128K | ~14 GB | ~26-32 | ~38-48 | Apache 2.0 | Best Apache-licensed 24B, fits 24GB |
| Codestral 22B | Mistral | 22B dense | 32K | ~13 GB | ~28-34 | ~40-50 | MNPL (non-commercial) | Coding/FIM — research/personal only |
| Devstral Small 24B | Mistral | 24B dense | 128K+ | ~14 GB | ~26-32 | ~38-48 | Apache 2.0 | Agentic SWE-bench coder, single 24GB |
| Mistral Large 2 123B | Mistral | 123B dense | 128K | ~70 GB | N/A (multi-GPU) | N/A (multi-GPU) | MRL (non-commercial) | Flagship dense, multi-GPU rigs |
| Mixtral 8x7B | Mistral | 46.7B total / 12.9B active (MoE) | 32K | ~26 GB | ~30-40 (partial) | ~45-55 (partial) | Apache 2.0 | Classic open MoE, needs 24GB+ |
| Phi-4-mini 3.8B | Microsoft Phi | 3.84B dense | 128K | ~2.8 GB | ~85+ | ~110+ | MIT | Tiny reasoning model, edge |
| Phi-4 14B | Microsoft Phi | 14B dense | 16K | ~9 GB | ~40-48 | ~58-68 | MIT | Reasoning/math-first 14B all-rounder |
| gpt-oss-20b | OpenAI gpt-oss | 21B total / 3.6B active (MoE) | 128K | ~14 GB (MXFP4 native) | ~45-60 | ~65-90 | Apache 2.0 | o3-mini-class reasoning, 16GB GPU |
| gpt-oss-120b | OpenAI gpt-oss | 117B total / 5.1B active (MoE) | 128K | ~65 GB (MXFP4 native) | N/A (needs 80GB) | N/A (needs 80GB) | Apache 2.0 | o4-mini-class on one 80GB GPU |
| Kimi K2 | Moonshot Kimi | 1T total / 32B active (MoE) | 256K | ~600 GB | N/A (multi-node) | N/A (multi-node) | Modified MIT | Frontier open agentic model, clusters |
| GLM-4.6 | Z.ai GLM | 357B total / 32B active (MoE) | 200K | ~200 GB | N/A (multi-node) | N/A (multi-node) | MIT | Top open coding/agent MoE, big rigs |
| GLM-4.5-Air | Z.ai GLM | 106B total / 12B active (MoE) | 128K | ~62 GB | N/A (multi-GPU) | N/A (multi-GPU) | MIT | Smaller GLM MoE, 2-4 GPU rigs |
| Llama-3.1-Nemotron-Nano 8B | NVIDIA Nemotron | 8B dense (Llama base) | 128K | ~5 GB | ~55-70 | ~80-100 | NVIDIA Open Model | Tuned reasoning + tool-calling, 8GB |
| Llama-3.3-Nemotron-Super 49B | NVIDIA Nemotron | 49B dense (Llama base) | 128K | ~30 GB | ~10-14 (partial) | ~14-18 (partial) | NVIDIA Open Model | 70B-class accuracy, fits ~2x24GB |
| MiniMax M2 | MiniMax | ~230B total / ~10B active (MoE) | 200K+ | ~130 GB | N/A (multi-node) | N/A (multi-node) | MIT | Long-context agentic MoE, clusters |
VRAM = weights at ~Q4_K_M + small KV-cache allowance; add headroom for long context. tok/s are approximate single-GPU llama.cpp/Ollama estimates for a short prompt, not a controlled benchmark. MoE rows show total / active params: total drives memory, active drives speed.
GPU reference table — what each card runs
VRAM capacity is the hard wall for local models; memory bandwidth then sets how fast they run. Prices are approximate US street prices as of mid-2026 and move constantly — check current listings before buying. To match a specific card to a specific model, use our which-GPU-to-buy tool or the can-I-run-it checker.
| GPU | VRAM | Approx price | Mem bandwidth | Runs comfortably | Notes |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB GDDR6 | ~$280-320 (new) | 360 GB/s | 8B-14B at Q4 | Best cheap entry card; runs Llama 3.1 8B and 14B-class Q4 fully on GPU. |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | ~$450-500 (new) | 288 GB/s | 14B Q4-Q8, 24B Q4 | More VRAM than a 3060 but narrow bus — capacity, not speed. Good for bigger models. |
| RTX 5060 Ti 16GB | 16 GB GDDR7 | ~$550-600 (new) | 448 GB/s | 14B Q4-Q8, 24B Q4 | GDDR7 bandwidth makes it noticeably faster than the 4060 Ti 16GB at the same VRAM. |
| Tesla P40 24GB | 24 GB GDDR5 | ~$150-250 (used) | 347 GB/s | 14B-32B Q4 | Cheapest 24GB by far (~$7/GB). Pascal-era, ~3x slower than a 3090; needs a fan + cooling shroud. |
| RTX 3090 24GB | 24 GB GDDR6X | ~$800-1,000 (used) | 936 GB/s | 32B Q4-Q5, 70B partial | Best used VRAM-per-dollar for a real card; the local-LLM enthusiast default. |
| RTX 4090 24GB | 24 GB GDDR6X | ~$1,600-2,000 | 1,008 GB/s | 32B Q4-Q5, 70B partial | Fastest single 24GB card; same models as a 3090 but markedly higher throughput. |
| RTX 5090 32GB | 32 GB GDDR7 | ~$2,000 MSRP (street higher) | 1,792 GB/s | 32B Q5-Q6, 49B Q4 (tight), 70B partial | Most VRAM + bandwidth on a single consumer card; a 70B at Q4 (~40 GB) still spills to system RAM, but fewer layers land on the CPU than with a 24GB card. |
Prices are approximate US street estimates (mid-2026) and fluctuate; bandwidth from manufacturer specs. “Runs comfortably” assumes Q4-Q5 quants fully on the GPU with room for context.
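To make the link between memory bandwidth and generation speed concrete, here is a minimal sketch of a bandwidth-bound ceiling. It assumes every active weight is read from VRAM once per generated token, at an assumed 4.8 bits per weight. The function name `decode_ceiling_tok_s` is hypothetical. The result is an upper bound only: real throughput, including the tok/s ranges in the model table, sits well below it because of compute, KV-cache reads and backend overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float = 4.8) -> float:  # assumed ~Q4_K_M average
    """Theoretical upper bound on tokens/sec for bandwidth-bound decoding."""
    # GB of weights that must be streamed from VRAM for each generated token
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Bandwidth figures from the GPU table; 32.8B is the Qwen3 32B dense row.
    for gpu, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
        print(f"{gpu}: ceiling ~{decode_ceiling_tok_s(bw, 32.8):.0f} tok/s for a 32.8B dense model")
```

For MoE models, pass the active parameter count rather than the total, since only the active experts are read per token.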
VRAM by model class (the quick rule)
| Model class | ~Q4 VRAM | Fits on |
|---|---|---|
| 1B-3B | ~1.5-2.5 GB | Any GPU, most laptops, phones |
| 7B-8B | ~5 GB | 8 GB GPU, 16 GB Mac |
| 12B-14B | ~8-10 GB | 12 GB GPU (3060), 16 GB Mac |
| 22B-32B (dense) | ~14-20 GB | 24 GB GPU (3090/4090) |
| 49B-70B (dense) | ~30-42 GB | 2x 24 GB, or 1x 48 GB+ |
| 100B+ MoE | 62 GB-600 GB | Multi-GPU rigs, 80 GB cards, big Macs |
The honest caveat for MoE models: a model like Qwen3-235B-A22B only activates 22B params per token (so it generates at roughly 22B-class speed), but you must still hold all 235B params in memory. That is why “runs fast” and “fits in your VRAM” are two different questions for MoE — check both. The quantization calculator shows how each quant level (Q4/Q5/Q6/Q8) changes the footprint.
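The two MoE questions can be kept separate in code. The sketch below is illustrative only: the `Model` class, the `check` helper and the 4.8 bits-per-weight figure are assumptions for this example, and the weight sizes it produces are rough estimates that will not exactly match the table. The "fits" answer uses total parameters; the per-token read size, a proxy for speed, uses active parameters.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_b: float   # total parameters in billions (drives memory)
    active_b: float  # parameters active per token in billions (drives speed)


def q4_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size in GB of params_b billion weights at ~Q4 (assumed 4.8 bits each)."""
    return params_b * bits_per_weight / 8


def check(model: Model, vram_gb: float) -> dict:
    """Answer 'does it fit?' and 'how much is read per token?' separately."""
    weights_gb = q4_gb(model.total_b)
    return {
        "fits": weights_gb <= vram_gb,                       # memory question: total params
        "weights_gb": round(weights_gb, 1),
        "read_per_token_gb": round(q4_gb(model.active_b), 1),  # speed question: active params
    }


if __name__ == "__main__":
    for m in [Model("Qwen3 30B-A3B", 30.5, 3.3), Model("Qwen3 235B-A22B", 235.0, 22.0)]:
        print(m.name, check(m, vram_gb=24))
```

For a dense model, set `active_b` equal to `total_b`, which is why dense models have a single answer to both questions.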
📋 Cite or embed this table
This database is meant to be referenced. If you are writing about local AI and want to link the numbers, cite this page as the source:
“Local AI Model Database (2026), Local AI Master — https://localaimaster.com/local-ai-model-database.”
Want an embeddable / interactive version (sortable by VRAM, filterable by GPU) for your own site or workflow? We are building shareable widgets and calculators — start with the live model leaderboard and the VRAM calculator.
Methodology & honest caveats
- Specs are primary-sourced. Parameter counts, context windows and licenses come from each model's official card, technical report, or vendor blog (see Sources). Where a family has multiple sizes we list the ones people actually run locally.
- tok/s are estimates, not benchmarks. We list approximate single-GPU figures for a short prompt with the model fully offloaded. Your numbers will differ with quant, context length, batch size, backend (llama.cpp vs vLLM vs Ollama), driver and prompt. We err toward conservative ranges.
- VRAM is the Q4 weight footprint plus a small KV allowance. Real usage climbs with context length. A 14B model that is ~9.5 GB of weights can need 11-12 GB once you load a long prompt.
- MoE memory ≠ MoE speed. Total params decide whether it fits; active params decide how fast it generates. We show both for every MoE row.
- Licenses change and have conditions. “Commercial OK” here is a summary, not legal advice — read the actual license (especially Codestral MNPL and Mistral Large MRL, which are non-commercial).
- This page is updated as models ship. The local landscape moves monthly; we revise the table rather than spawn dated duplicates.
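The point about VRAM climbing with context can be illustrated with the standard KV-cache size formula: two tensors (K and V) per layer, each holding one vector per KV head per token. The architecture numbers below (40 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumed, illustrative values for a 14B-class model with grouped-query attention, not figures taken from this page; check the model's own config before relying on them.

```python
def kv_cache_gb(context_tokens: int,
                n_layers: int,
                n_kv_heads: int,
                head_dim: int,
                bytes_per_value: int = 2) -> float:  # 2 bytes = 16-bit cache (assumed)
    """Approximate KV-cache size in GB for a given context length."""
    # Leading 2 = one K tensor and one V tensor per layer
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


if __name__ == "__main__":
    weights_gb = 9.5  # the ~9.5 GB Q4 weight footprint quoted for a 14B model
    for ctx in (2_048, 8_192, 16_384):
        # Illustrative architecture values (assumptions, see lead-in)
        kv = kv_cache_gb(ctx, n_layers=40, n_kv_heads=8, head_dim=128)
        print(f"{ctx:>6} tokens: KV ~{kv:.2f} GB, total ~{weights_gb + kv:.1f} GB")
```

Backends that quantize the KV cache can be modelled by lowering `bytes_per_value`, which shrinks the cache proportionally.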
Go deeper
Best Local AI Models — Complete Guide
The full walkthrough: picking, installing and running local models from scratch.
Best Local AI Models for Programming
Coding-specific rankings with HumanEval scores and IDE setup.
Best GPUs for AI
Deeper GPU buying analysis across budgets and workloads.
AI Hardware Requirements Guide
CPU, RAM, VRAM and storage for running models locally.
Best Local AI Models for 8GB RAM
What actually runs on modest hardware, ranked.
Complete Ollama Guide
Install Ollama and pull any model in this database in minutes.
Best Local AI Coding Models
The coding leaderboard, ranked by capability and VRAM.
Model Recommender
Tell us your GPU and use case; get a matched model in seconds.
Sources
Every parameter count, context window and license below is taken from each model's official documentation. tok/s and VRAM are derived estimates as described in Methodology.
- Meta — Llama 3.1 release (8B / 70B / 405B, 128K context)
- Meta — Llama 3.3 70B model card
- Qwen — Qwen3 Technical Report (sizes, context, Apache 2.0)
- Qwen — Qwen3-Coder blog (480B-A35B / 30B-A3B, 256K)
- Qwen — Qwen2.5-Coder GitHub
- DeepSeek — DeepSeek-V3 Technical Report (671B / 37B active)
- DeepSeek — DeepSeek-R1 GitHub (distills, MIT)
- Google — Gemma 3 model card (1B-27B, 128K, multimodal)
- Mistral — Mistral Small 3 / 3.2 (24B, Apache 2.0)
- Mistral — Mistral NeMo 12B (Apache 2.0, 128K)
- Microsoft — Phi-4 (14B, MIT); Phi context windows per official model cards
- OpenAI — Introducing gpt-oss (120b/20b, Apache 2.0, MXFP4)
- OpenAI — gpt-oss-120b model card (active params, 128K)
- Moonshot — Kimi K2 model card (1T / 32B active, 256K)
- Z.ai — GLM-4.6 model card (357B / 32B active, 200K, MIT)
- NVIDIA — Llama Nemotron paper (Nano 8B / Super 49B, 128K)
- NVIDIA / BIZON — Best GPU for local LLM 2026 (pricing, VRAM/$)
Written by the Local AI Master Team
The team behind Local AI Master
We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.