Local AI Model Database (2026): VRAM, Speed & Specs for 30+ Models

📅 Published: June 20, 2026 · 🔄 Last Updated: June 20, 2026 · ✓ Manually Reviewed

This is a comprehensive, sourced reference for 30+ of the most popular local (open-weight) AI models in 2026 — with parameter count, context window, minimum Q4 VRAM, approximate tokens/sec on an RTX 3090 and RTX 4090, license, and best use. The short version: an 8B model needs roughly 5 GB of VRAM at Q4 and runs on almost anything; a 14B-class model needs ~9-10 GB and fits a 12-16 GB card; a 32B dense model needs ~20 GB and wants a 24 GB card; and 70B-and-up models need 40 GB+, so they either spill to system RAM or want two 24 GB GPUs. Mixture-of-Experts (MoE) models such as Qwen3-30B-A3B and gpt-oss-20b run as fast as much smaller models because only a few billion parameters activate per token — but you still need enough memory to hold all the weights. Every spec below is cited to a primary source; the tokens/sec numbers are approximate single-GPU estimates, not a controlled benchmark.

How to read this database

Params — total parameters for dense models. For Mixture-of-Experts (MoE) models we list total / active (e.g. 30.5B total / 3.3B active): total decides memory, active decides speed.
Context — the maximum context window from the official model card. Several models hit their largest context only with extrapolation (YaRN), noted inline.
Min VRAM (Q4) — approximate memory to hold the weights at ~Q4_K_M quantization, plus a small KV-cache allowance. Long context, larger batches, or higher quants need more. For exact numbers against your GPU, use the VRAM calculator.
tok/s (3090 / 4090) — approximate generation speed for a short prompt with the model fully offloaded to a single GPU via llama.cpp/Ollama. Treat as ballpark; real numbers vary with quant, context, driver and prompt. “Partial” means the model does not fully fit and some layers run on CPU, which slows things sharply.
License — the redistribution license. “Apache 2.0” and “MIT” are the most permissive; community/terms licenses (Llama, Gemma, NVIDIA Open Model) allow commercial use with conditions; Codestral (MNPL) and Mistral Large (MRL) are non-commercial — read before shipping.
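The Min VRAM (Q4) column can be approximated with simple arithmetic. The sketch below is an illustration under stated assumptions, not the method used to build the table: the function name, the default of 4.5 bits per weight, and the flat 0.5 GB KV-cache allowance are all assumed values. Real Q4_K_M files often average somewhat more bits per weight, so treat the result as a lower-end estimate.

```python
def estimate_q4_vram_gb(params_billion: float,
                        bits_per_weight: float = 4.5,
                        kv_allowance_gb: float = 0.5) -> float:
    """Rough VRAM estimate: quantized weights plus a flat KV-cache allowance.

    bits_per_weight and kv_allowance_gb are assumptions for illustration,
    not values taken from the database.
    """
    # 1e9 params * bits per weight / 8 bits per byte = size in GB
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_allowance_gb


if __name__ == "__main__":
    # Parameter counts are the ones listed in the table.
    for name, params in [("Llama 3.1 8B", 8.03),
                         ("Qwen3 32B", 32.8),
                         ("Llama 3.3 70B", 70.6)]:
        print(f"{name}: ~{estimate_q4_vram_gb(params):.1f} GB")
```

For MoE models, pass the total parameter count, since every expert has to be resident in memory.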

The local AI model database (30+ models)

Grouped roughly by family, from smallest to largest within each. Names link to our full per-model pages where available.

Model	Family	Params	Context	Min VRAM (Q4)	~tok/s 3090	~tok/s 4090	License	Best use
Llama 3.2 1B	Meta Llama	1.23B dense	128K	~1.5 GB	~120+	~150+	Llama 3.2 Community	On-device, edge, fast drafting
Llama 3.2 3B	Meta Llama	3.21B dense	128K	~2.5 GB	~90+	~110+	Llama 3.2 Community	Lightweight chat, low-VRAM laptops
Llama 3.1 8B	Meta Llama	8.03B dense	128K	~5 GB	~55-70	~80-100	Llama 3.1 Community	General-purpose 8GB workhorse
Llama 3.3 70B	Meta Llama	70.6B dense	128K	~40 GB	~6-9 (partial)	~8-12 (partial)	Llama 3.3 Community	405B-class quality, needs 2x24GB
Llama 3.1 405B	Meta Llama	405B dense	128K	~230 GB	N/A (multi-node)	N/A (multi-node)	Llama 3.1 Community	Frontier dense, server clusters only
Qwen3 4B	Qwen3 (Alibaba)	4B dense	128K (32K native + YaRN)	~3 GB	~80+	~100+	Apache 2.0	Small reasoning model, hybrid thinking
Qwen3 8B	Qwen3 (Alibaba)	8B dense	128K (32K native + YaRN)	~5 GB	~55-70	~80-100	Apache 2.0	Strong 8GB all-rounder w/ reasoning
Qwen3 14B	Qwen3 (Alibaba)	14B dense	128K (32K native + YaRN)	~9.5 GB	~38-45	~55-65	Apache 2.0	Best dense model that fits 12-16GB
Qwen3 32B	Qwen3 (Alibaba)	32.8B dense	128K (32K native + YaRN)	~20 GB	~18-24	~28-36	Apache 2.0	Top dense quality on a single 24GB card
Qwen3 30B-A3B	Qwen3 (Alibaba)	30.5B total / 3.3B active (MoE)	128K (32K native + YaRN)	~18 GB	~55-75	~80-110	Apache 2.0	32B-class smarts at 8B-class speed
Qwen3 235B-A22B	Qwen3 (Alibaba)	235B total / 22B active (MoE)	256K (Instruct-2507)	~130 GB	N/A (multi-GPU)	N/A (multi-GPU)	Apache 2.0	Flagship open MoE, big rigs / Mac 256GB
Qwen2.5-Coder 7B	Qwen Coder	7.6B dense	128K	~5 GB	~55-70	~80-100	Apache 2.0	Fast IDE autocomplete (FIM), 8GB
Qwen2.5-Coder 14B	Qwen Coder	14.7B dense	128K	~9.5 GB	~38-45	~55-65	Apache 2.0	Best coder that fits a 16GB GPU
Qwen2.5-Coder 32B	Qwen Coder	32.5B dense	128K	~20 GB	~18-24	~28-36	Apache 2.0	92.7% HumanEval, single 24GB card
Qwen3-Coder 30B-A3B	Qwen Coder	30.5B total / 3.3B active (MoE)	256K (1M w/ YaRN)	~18 GB	~55-75	~80-110	Apache 2.0	Agentic coding, repo-scale context
Qwen3-Coder 480B-A35B	Qwen Coder	480B total / 35B active (MoE)	256K (1M w/ YaRN)	~270 GB	N/A (multi-node)	N/A (multi-node)	Apache 2.0	Sonnet-class open agentic coder
DeepSeek-R1-Distill-Qwen 7B	DeepSeek	7B dense (Qwen base)	128K	~5 GB	~55-70	~80-100	MIT	Cheap local reasoning, 8GB
DeepSeek-R1-Distill-Llama 8B	DeepSeek	8B dense (Llama base)	128K	~5 GB	~55-70	~80-100	MIT	Reasoning chains on a 3060
DeepSeek-R1-Distill-Llama 70B	DeepSeek	70B dense (Llama base)	128K	~40 GB	~6-9 (partial)	~8-12 (partial)	MIT	Strongest distilled reasoner, 2x24GB
DeepSeek-Coder-V2-Lite 16B	DeepSeek	16B total / 2.4B active (MoE)	128K	~10.5 GB	~50+	~70+	DeepSeek License (commercial OK)	Fast MoE coder, low latency
DeepSeek-V3 / R1 671B	DeepSeek	671B total / 37B active (MoE)	128K	~400 GB	N/A (multi-node)	N/A (multi-node)	MIT	Frontier open MoE, server clusters
Gemma 3 270M	Google Gemma	0.27B dense	32K	~0.5 GB	~200+	~250+	Gemma Terms	Tiny on-device tasks, fine-tuning base
Gemma 3 4B	Google Gemma	4B dense (multimodal)	128K	~3 GB	~80+	~100+	Gemma Terms	Vision + text on a laptop
Gemma 3 12B	Google Gemma	12B dense (multimodal)	128K	~8 GB	~42-50	~60-72	Gemma Terms	Multimodal all-rounder, 12GB
Gemma 3 27B	Google Gemma	27B dense (multimodal)	128K	~16 GB	~22-28	~32-42	Gemma Terms	Top single-GPU multimodal, 24GB
Mistral 7B	Mistral	7.25B dense	32K	~5 GB	~60-75	~85-105	Apache 2.0	Classic lightweight baseline
Mistral NeMo 12B	Mistral	12B dense	128K	~8 GB	~42-50	~60-72	Apache 2.0	Multilingual 12GB workhorse
Mistral Small 3.2 24B	Mistral	24B dense (multimodal)	128K	~14 GB	~26-32	~38-48	Apache 2.0	Best Apache-licensed 24B, fits 24GB
Codestral 22B	Mistral	22B dense	32K	~13 GB	~28-34	~40-50	MNPL (non-commercial)	Coding/FIM — research/personal only
Devstral Small 24B	Mistral	24B dense	128K+	~14 GB	~26-32	~38-48	Apache 2.0	Agentic SWE-bench coder, single 24GB
Mistral Large 2 123B	Mistral	123B dense	128K	~70 GB	N/A (multi-GPU)	N/A (multi-GPU)	MRL (non-commercial)	Flagship dense, multi-GPU rigs
Mixtral 8x7B	Mistral	46.7B total / 12.9B active (MoE)	32K	~26 GB	~30-40 (partial)	~45-55 (partial)	Apache 2.0	Classic open MoE, needs 24GB+
Phi-4-mini 3.8B	Microsoft Phi	3.84B dense	128K	~2.8 GB	~85+	~110+	MIT	Tiny reasoning model, edge
Phi-4 14B	Microsoft Phi	14B dense	16K	~9 GB	~40-48	~58-68	MIT	Reasoning/math-first 14B all-rounder
gpt-oss-20b	OpenAI gpt-oss	21B total / 3.6B active (MoE)	128K	~14 GB (MXFP4 native)	~45-60	~65-90	Apache 2.0	o3-mini-class reasoning, 16GB GPU
gpt-oss-120b	OpenAI gpt-oss	117B total / 5.1B active (MoE)	128K	~65 GB (MXFP4 native)	N/A (needs 80GB)	N/A (needs 80GB)	Apache 2.0	o4-mini-class on one 80GB GPU
Kimi K2	Moonshot Kimi	1T total / 32B active (MoE)	256K	~600 GB	N/A (multi-node)	N/A (multi-node)	Modified MIT	Frontier open agentic model, clusters
GLM-4.6	Z.ai GLM	357B total / 32B active (MoE)	200K	~200 GB	N/A (multi-node)	N/A (multi-node)	MIT	Top open coding/agent MoE, big rigs
GLM-4.5-Air	Z.ai GLM	106B total / 12B active (MoE)	128K	~62 GB	N/A (multi-GPU)	N/A (multi-GPU)	MIT	Smaller GLM MoE, 2-4 GPU rigs
Llama-3.1-Nemotron-Nano 8B	NVIDIA Nemotron	8B dense (Llama base)	128K	~5 GB	~55-70	~80-100	NVIDIA Open Model	Tuned reasoning + tool-calling, 8GB
Llama-3.3-Nemotron-Super 49B	NVIDIA Nemotron	49B dense (Llama base)	128K	~30 GB	~10-14 (partial)	~14-18 (partial)	NVIDIA Open Model	70B-class accuracy, fits ~2x24GB
MiniMax M2	MiniMax	~230B total / ~10B active (MoE)	200K+	~130 GB	N/A (multi-node)	N/A (multi-node)	MIT	Long-context agentic MoE, clusters

VRAM = weights at ~Q4_K_M + small KV-cache allowance; add headroom for long context. tok/s are approximate single-GPU llama.cpp/Ollama estimates for a short prompt, not a controlled benchmark. MoE rows show total / active params: total drives memory, active drives speed.
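If you copy the table into a script, a filter by available VRAM takes only a few lines. The following is a minimal sketch, assuming a tab-separated excerpt with three of the table's columns; the helper names and the 1 GB headroom default are hypothetical choices, not part of this database.

```python
import csv
import io
import re

# Excerpt of the table above: three columns, values copied as listed.
ROWS = [
    ["Model", "Min VRAM (Q4)", "License"],
    ["Llama 3.1 8B", "~5 GB", "Llama 3.1 Community"],
    ["Qwen3 14B", "~9.5 GB", "Apache 2.0"],
    ["gpt-oss-20b", "~14 GB (MXFP4 native)", "Apache 2.0"],
    ["Qwen3 32B", "~20 GB", "Apache 2.0"],
    ["Llama 3.3 70B", "~40 GB", "Llama 3.3 Community"],
]
TABLE_TSV = "\n".join("\t".join(row) for row in ROWS)


def parse_vram_gb(cell: str) -> float:
    """Pull the numeric GB figure out of a cell such as '~9.5 GB'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*GB", cell)
    if match is None:
        raise ValueError(f"no GB figure in {cell!r}")
    return float(match.group(1))


def models_that_fit(tsv_text: str, gpu_vram_gb: float,
                    headroom_gb: float = 1.0) -> list[str]:
    """Return model names whose Q4 footprint plus headroom fits the GPU."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fitting = []
    for row in reader:
        if parse_vram_gb(row["Min VRAM (Q4)"]) + headroom_gb <= gpu_vram_gb:
            fitting.append(row["Model"])
    return fitting


if __name__ == "__main__":
    for vram in (12, 16, 24):
        print(f"{vram} GB: {models_that_fit(TABLE_TSV, vram)}")
```

The headroom parameter stands in for the extra context memory the note above warns about; raise it if you plan to use long prompts.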

GPU reference table — what each card runs

VRAM capacity is the hard wall for local models; memory bandwidth then sets how fast they run. Prices are approximate US street prices as of mid-2026 and move constantly — check current listings before buying. To match a specific card to a specific model, use our which-GPU-to-buy tool or the can-I-run-it checker.

GPU	VRAM	Approx price	Mem bandwidth	Runs comfortably	Notes
RTX 3060 12GB	12 GB GDDR6	~$280-320 (new)	360 GB/s	8B-14B at Q4	Best cheap entry card; runs Llama 3.1 8B and 14B-class Q4 fully on GPU.
RTX 4060 Ti 16GB	16 GB GDDR6	~$450-500 (new)	288 GB/s	14B Q4-Q8, 24B Q4	More VRAM than a 3060 but narrow bus — capacity, not speed. Good for bigger models.
RTX 5060 Ti 16GB	16 GB GDDR7	~$550-600 (new)	448 GB/s	14B Q4-Q8, 24B Q4	GDDR7 bandwidth makes it noticeably faster than the 4060 Ti 16GB at the same VRAM.
Tesla P40 24GB	24 GB GDDR5	~$150-250 (used)	347 GB/s	14B-32B Q4	Cheapest 24GB by far (~$7/GB). Pascal-era, ~3x slower than a 3090; needs a fan + cooling shroud.
RTX 3090 24GB	24 GB GDDR6X	~$800-1,000 (used)	936 GB/s	32B Q4-Q5, 70B partial	Best used VRAM-per-dollar for a real card; the local-LLM enthusiast default.
RTX 4090 24GB	24 GB GDDR6X	~$1,600-2,000	1,008 GB/s	32B Q5-Q8, 70B partial	Fastest single 24GB card; same models as a 3090 but markedly higher throughput.
RTX 5090 32GB	32 GB GDDR7	~$2,000 MSRP (street higher)	1,792 GB/s	32B Q8, 49B Q4 (tight), 70B partial	Most VRAM + bandwidth on a single consumer card; a 70B at Q4 (~40 GB) still exceeds 32 GB, but far fewer layers spill to CPU than on a 24GB card.

Prices are approximate US street estimates (mid-2026) and fluctuate; bandwidth from manufacturer specs. “Runs comfortably” assumes Q4-Q5 quants fully on the GPU with room for context.
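One way to see why bandwidth sets speed: generating a token requires reading roughly all active weights from memory once, so bandwidth divided by weight size gives a theoretical ceiling on tokens per second. The sketch below uses bandwidth figures from the table; the function name and the 4.5 bits-per-weight default are assumptions. Real throughput, like the tok/s ranges in the model table, lands well below this ceiling because of compute, KV-cache reads and software overhead.

```python
# Memory bandwidth in GB/s, copied from the GPU table above.
GPU_BANDWIDTH_GB_S = {
    "RTX 3060 12GB": 360,
    "RTX 4060 Ti 16GB": 288,
    "RTX 3090 24GB": 936,
    "RTX 4090 24GB": 1008,
    "RTX 5090 32GB": 1792,
}


def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float = 4.5) -> float:
    """Upper bound on tokens/sec if decoding were purely bandwidth-bound.

    bits_per_weight is an assumed average for a Q4-style quant.
    """
    # GB of weights that must be read for each generated token
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Llama 3.1 8B (8.03B dense) on each card.
    for gpu, bandwidth in GPU_BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bandwidth, 8.03)
        print(f"{gpu}: ceiling ~{ceiling:.0f} tok/s")
```

The ratio between two cards' ceilings is a more useful signal than the absolute number: it explains why a 4060 Ti 16GB holds bigger models than a 3060 yet does not generate faster.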

VRAM by model class (the quick rule)

Model class	~Q4 VRAM	Fits on
1B-3B	~1.5-2.5 GB	Any GPU, most laptops, phones
7B-8B	~5 GB	8 GB GPU, 16 GB Mac
12B-14B	~8-10 GB	12 GB GPU (3060), 16 GB Mac
22B-32B (dense)	~13-20 GB	24 GB GPU (3090/4090)
49B-70B (dense)	~30-42 GB	2x 24 GB, or 1x 48 GB+
100B+ MoE	~62-600 GB	Multi-GPU rigs, 80 GB cards, big Macs

The honest caveat for MoE models: a model like Qwen3-235B-A22B only activates 22B params per token (so it generates at roughly 22B-class speed), but you must still hold all 235B params in memory. That is why “runs fast” and “fits in your VRAM” are two different questions for MoE — check both. The quantization calculator shows how each quant level (Q4/Q5/Q6/Q8) changes the footprint.
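The two questions can be kept separate in code. This is a hedged sketch: the `MoEModel` class, the `moe_check` function and the 4.5 bits-per-weight default are hypothetical, and the memory figure covers weights only, without KV cache.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    name: str
    total_params_b: float   # decides whether the model fits
    active_params_b: float  # decides how fast it generates


def moe_check(model: MoEModel, memory_gb: float,
              bits_per_weight: float = 4.5) -> dict:
    """Answer 'does it fit?' and 'what speed class is it?' independently."""
    # All experts must be resident, so memory is sized on total params.
    weights_gb = model.total_params_b * bits_per_weight / 8
    return {
        "weights_gb": weights_gb,
        "fits": weights_gb <= memory_gb,
        # Only the active params are read per token, so speed tracks
        # a dense model of roughly this size.
        "speed_class_b": model.active_params_b,
    }


if __name__ == "__main__":
    flagship = MoEModel("Qwen3-235B-A22B", 235, 22)
    small = MoEModel("Qwen3-30B-A3B", 30.5, 3.3)
    for model, memory in [(flagship, 24), (flagship, 256), (small, 24)]:
        print(model.name, f"with {memory} GB:", moe_check(model, memory))
```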

📋 Cite or embed this table

This database is meant to be referenced. If you are writing about local AI and want to link the numbers, cite this page as the source:

“Local AI Model Database (2026), Local AI Master — https://localaimaster.com/local-ai-model-database.”

Want an embeddable / interactive version (sortable by VRAM, filterable by GPU) for your own site or workflow? We are building shareable widgets and calculators — start with the live model leaderboard and the VRAM calculator.

Methodology & honest caveats

Specs are primary-sourced. Parameter counts, context windows and licenses come from each model's official card, technical report, or vendor blog (see Sources). Where a family has multiple sizes we list the ones people actually run locally.
tok/s are estimates, not benchmarks. We list approximate single-GPU figures for a short prompt with the model fully offloaded. Your numbers will differ with quant, context length, batch size, backend (llama.cpp vs vLLM vs Ollama), driver and prompt. We err toward conservative ranges.
VRAM is the Q4 weight footprint plus a small KV allowance. Real usage climbs with context length. A 14B model that is ~9.5 GB of weights can need 11-12 GB once you load a long prompt.
MoE memory ≠ MoE speed. Total params decide whether it fits; active params decide how fast it generates. We show both for every MoE row.
Licenses change and have conditions. “Commercial OK” here is a summary, not legal advice — read the actual license (especially Codestral MNPL and Mistral Large MRL, which are non-commercial).
This page is updated as models ship. The local landscape moves monthly; we revise the table rather than spawn dated duplicates.
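The point above about VRAM climbing with context length follows from how the KV cache is sized: one key and one value vector per layer, per KV head, per token. The sketch below assumes an fp16 cache and a hypothetical 14B-class shape (40 layers, 8 KV heads, head dimension 128). Those architecture numbers are illustrative assumptions, not taken from this database, so check the model's config before relying on them.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for a given context length.

    bytes_per_value=2 assumes an fp16 cache; a quantized cache is smaller.
    """
    # Factor 2: one key tensor plus one value tensor per layer
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    # Hypothetical 14B-class shape, for illustration only.
    for ctx in (2048, 16384, 32768):
        size = kv_cache_gb(ctx, n_layers=40, n_kv_heads=8, head_dim=128)
        print(f"{ctx} tokens: ~{size:.2f} GB of KV cache")
```

Under these assumptions a 16K-token prompt adds a couple of gigabytes on top of the weights, which is in line with the 9.5 GB to 11-12 GB jump described above.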

Go deeper

Best Local AI Models — Complete Guide

The full walkthrough: picking, installing and running local models from scratch.

Best Local AI Models for Programming

Coding-specific rankings with HumanEval scores and IDE setup.

Best GPUs for AI

Deeper GPU buying analysis across budgets and workloads.

AI Hardware Requirements Guide

CPU, RAM, VRAM and storage for running models locally.

Best Local AI Models for 8GB RAM

What actually runs on modest hardware, ranked.

Complete Ollama Guide

Install Ollama and pull any model in this database in minutes.

Best Local AI Coding Models

The coding leaderboard, ranked by capability and VRAM.

Model Recommender

Tell us your GPU and use case; get a matched model in seconds.

Sources

Every parameter count, context window and license is taken from each model's official documentation. tok/s and VRAM are derived estimates as described in Methodology.

Written by the Local AI Master Team

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.