Local AI Model Database (2026): VRAM, Speed & Specs for 30+ Models

📅 Published: June 20, 2026 · 🔄 Last Updated: June 20, 2026 · ✓ Manually Reviewed

This is a comprehensive, sourced reference for 30+ of the most popular local (open-weight) AI models in 2026 — with parameter count, context window, minimum Q4 VRAM, approximate tokens/sec on an RTX 3090 and RTX 4090, license, and best use. The short version: an 8B model needs roughly 5 GB of VRAM at Q4 and runs on almost anything; a 14B-class model needs ~9-10 GB and fits a 12-16 GB card; a 32B dense model needs ~20 GB and wants a 24 GB card; and 70B-and-up models need 40 GB+, so they either spill to system RAM or want two 24 GB GPUs. Mixture-of-Experts (MoE) models such as Qwen3-30B-A3B and gpt-oss-20b run as fast as much smaller models because only a few billion parameters activate per token — but you still need enough memory to hold all the weights. Every spec below is cited to a primary source; the tokens/sec numbers are approximate single-GPU estimates, not a controlled benchmark.

How to read this database

  • Params — total parameters for dense models. For Mixture-of-Experts (MoE) models we list total / active (e.g. 30.5B total / 3.3B active): total decides memory, active decides speed.
  • Context — the maximum context window from the official model card. Several models hit their largest context only with extrapolation (YaRN), noted inline.
  • Min VRAM (Q4) — approximate memory to hold the weights at ~Q4_K_M quantization, plus a small KV-cache allowance. Long context, larger batches, or higher quants need more. For exact numbers against your GPU, use the VRAM calculator.
  • tok/s (3090 / 4090) — approximate generation speed for a short prompt with the model fully offloaded to a single GPU via llama.cpp/Ollama. Treat as ballpark; real numbers vary with quant, context, driver and prompt. “Partial” means the model does not fully fit and some layers run on CPU, which slows things sharply.
  • License — the redistribution license. “Apache 2.0” and “MIT” are the most permissive; community/terms licenses (Llama, Gemma, NVIDIA Open Model) allow commercial use with conditions; Codestral (MNPL) and Mistral Large (MRL) are non-commercial — read before shipping.
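
The Min VRAM (Q4) column can be approximated from the parameter count alone. Below is a minimal sketch, assuming Q4_K_M averages about 4.85 bits per weight and that a flat 0.5 GB covers a short-context KV cache. Both constants and the function name `estimate_q4_vram_gb` are our assumptions for illustration, not figures published on this page.

```python
def estimate_q4_vram_gb(params_billion: float,
                        bits_per_weight: float = 4.85,   # assumed Q4_K_M average
                        kv_allowance_gb: float = 0.5) -> float:  # assumed short-context allowance
    """Rough VRAM (GB) to hold quantized weights plus a small KV cache."""
    # billions of params * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + kv_allowance_gb, 1)


# Parameter counts taken from the Params column of the database.
for name, params in [("Llama 3.1 8B", 8.03), ("Qwen3 14B", 14.0),
                     ("Qwen3 32B", 32.8), ("Llama 3.3 70B", 70.6)]:
    print(f"{name}: ~{estimate_q4_vram_gb(params)} GB")
```

For Llama 3.1 8B this gives about 5.4 GB, in line with the ~5 GB listed. Larger models drift a few GB from the table because the real bits-per-weight varies by architecture and quant mix, so treat the result as a first guess.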

The local AI model database (30+ models)

Grouped roughly by family, from smallest to largest within each. Names link to our full per-model pages where available.

| Model | Family | Params | Context | Min VRAM (Q4) | ~tok/s 3090 | ~tok/s 4090 | License | Best use |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2 1B | Meta Llama | 1.23B dense | 128K | ~1.5 GB | ~120+ | ~150+ | Llama 3.2 Community | On-device, edge, fast drafting |
| Llama 3.2 3B | Meta Llama | 3.21B dense | 128K | ~2.5 GB | ~90+ | ~110+ | Llama 3.2 Community | Lightweight chat, low-VRAM laptops |
| Llama 3.1 8B | Meta Llama | 8.03B dense | 128K | ~5 GB | ~55-70 | ~80-100 | Llama 3.1 Community | General-purpose 8GB workhorse |
| Llama 3.3 70B | Meta Llama | 70.6B dense | 128K | ~40 GB | ~6-9 (partial) | ~8-12 (partial) | Llama 3.3 Community | 405B-class quality, needs 2x24GB |
| Llama 3.1 405B | Meta Llama | 405B dense | 128K | ~230 GB | N/A (multi-node) | N/A (multi-node) | Llama 3.1 Community | Frontier dense, server clusters only |
| Qwen3 4B | Qwen3 (Alibaba) | 4B dense | 128K (32K native + YaRN) | ~3 GB | ~80+ | ~100+ | Apache 2.0 | Small reasoning model, hybrid thinking |
| Qwen3 8B | Qwen3 (Alibaba) | 8B dense | 128K (32K native + YaRN) | ~5 GB | ~55-70 | ~80-100 | Apache 2.0 | Strong 8GB all-rounder w/ reasoning |
| Qwen3 14B | Qwen3 (Alibaba) | 14B dense | 128K (32K native + YaRN) | ~9.5 GB | ~38-45 | ~55-65 | Apache 2.0 | Best dense model that fits 12-16GB |
| Qwen3 32B | Qwen3 (Alibaba) | 32.8B dense | 128K (32K native + YaRN) | ~20 GB | ~18-24 | ~28-36 | Apache 2.0 | Top dense quality on a single 24GB card |
| Qwen3 30B-A3B | Qwen3 (Alibaba) | 30.5B total / 3.3B active (MoE) | 128K (32K native + YaRN) | ~18 GB | ~55-75 | ~80-110 | Apache 2.0 | 32B-class smarts at 8B-class speed |
| Qwen3 235B-A22B | Qwen3 (Alibaba) | 235B total / 22B active (MoE) | 256K (Instruct-2507) | ~130 GB | N/A (multi-GPU) | N/A (multi-GPU) | Apache 2.0 | Flagship open MoE, big rigs / Mac 256GB |
| Qwen2.5-Coder 7B | Qwen Coder | 7.6B dense | 128K | ~5 GB | ~55-70 | ~80-100 | Apache 2.0 | Fast IDE autocomplete (FIM), 8GB |
| Qwen2.5-Coder 14B | Qwen Coder | 14.7B dense | 128K | ~9.5 GB | ~38-45 | ~55-65 | Apache 2.0 | Best coder that fits a 16GB GPU |
| Qwen2.5-Coder 32B | Qwen Coder | 32.5B dense | 128K | ~20 GB | ~18-24 | ~28-36 | Apache 2.0 | 92.7% HumanEval, single 24GB card |
| Qwen3-Coder 30B-A3B | Qwen Coder | 30.5B total / 3.3B active (MoE) | 256K (1M w/ YaRN) | ~18 GB | ~55-75 | ~80-110 | Apache 2.0 | Agentic coding, repo-scale context |
| Qwen3-Coder 480B-A35B | Qwen Coder | 480B total / 35B active (MoE) | 256K (1M w/ YaRN) | ~270 GB | N/A (multi-node) | N/A (multi-node) | Apache 2.0 | Sonnet-class open agentic coder |
| DeepSeek-R1-Distill-Qwen 7B | DeepSeek | 7B dense (Qwen base) | 128K | ~5 GB | ~55-70 | ~80-100 | MIT | Cheap local reasoning, 8GB |
| DeepSeek-R1-Distill-Llama 8B | DeepSeek | 8B dense (Llama base) | 128K | ~5 GB | ~55-70 | ~80-100 | MIT | Reasoning chains on a 3060 |
| DeepSeek-R1-Distill-Llama 70B | DeepSeek | 70B dense (Llama base) | 128K | ~40 GB | ~6-9 (partial) | ~8-12 (partial) | MIT | Strongest distilled reasoner, 2x24GB |
| DeepSeek-Coder-V2-Lite 16B | DeepSeek | 16B total / 2.4B active (MoE) | 128K | ~10.5 GB | ~50+ | ~70+ | DeepSeek License (commercial OK) | Fast MoE coder, low latency |
| DeepSeek-V3 / R1 671B | DeepSeek | 671B total / 37B active (MoE) | 128K | ~400 GB | N/A (multi-node) | N/A (multi-node) | MIT | Frontier open MoE, server clusters |
| Gemma 3 270M | Google Gemma | 0.27B dense | 32K | ~0.5 GB | ~200+ | ~250+ | Gemma Terms | Tiny on-device tasks, fine-tuning base |
| Gemma 3 4B | Google Gemma | 4B dense (multimodal) | 128K | ~3 GB | ~80+ | ~100+ | Gemma Terms | Vision + text on a laptop |
| Gemma 3 12B | Google Gemma | 12B dense (multimodal) | 128K | ~8 GB | ~42-50 | ~60-72 | Gemma Terms | Multimodal all-rounder, 12GB |
| Gemma 3 27B | Google Gemma | 27B dense (multimodal) | 128K | ~16 GB | ~22-28 | ~32-42 | Gemma Terms | Top single-GPU multimodal, 24GB |
| Mistral 7B | Mistral | 7.25B dense | 32K | ~5 GB | ~60-75 | ~85-105 | Apache 2.0 | Classic lightweight baseline |
| Mistral NeMo 12B | Mistral | 12B dense | 128K | ~8 GB | ~42-50 | ~60-72 | Apache 2.0 | Multilingual 12GB workhorse |
| Mistral Small 3.2 24B | Mistral | 24B dense (multimodal) | 128K | ~14 GB | ~26-32 | ~38-48 | Apache 2.0 | Best Apache-licensed 24B, fits 24GB |
| Codestral 22B | Mistral | 22B dense | 32K | ~13 GB | ~28-34 | ~40-50 | MNPL (non-commercial) | Coding/FIM — research/personal only |
| Devstral Small 24B | Mistral | 24B dense | 128K+ | ~14 GB | ~26-32 | ~38-48 | Apache 2.0 | Agentic SWE-bench coder, single 24GB |
| Mistral Large 2 123B | Mistral | 123B dense | 128K | ~70 GB | N/A (multi-GPU) | N/A (multi-GPU) | MRL (non-commercial) | Flagship dense, multi-GPU rigs |
| Mixtral 8x7B | Mistral | 46.7B total / 12.9B active (MoE) | 32K | ~26 GB | ~30-40 (partial) | ~45-55 (partial) | Apache 2.0 | Classic open MoE, needs 24GB+ |
| Phi-4-mini 3.8B | Microsoft Phi | 3.84B dense | 128K | ~2.8 GB | ~85+ | ~110+ | MIT | Tiny reasoning model, edge |
| Phi-4 14B | Microsoft Phi | 14B dense | 16K | ~9 GB | ~40-48 | ~58-68 | MIT | Reasoning/math-first 14B all-rounder |
| gpt-oss-20b | OpenAI gpt-oss | 21B total / 3.6B active (MoE) | 128K | ~14 GB (MXFP4 native) | ~45-60 | ~65-90 | Apache 2.0 | o3-mini-class reasoning, 16GB GPU |
| gpt-oss-120b | OpenAI gpt-oss | 117B total / 5.1B active (MoE) | 128K | ~65 GB (MXFP4 native) | N/A (needs 80GB) | N/A (needs 80GB) | Apache 2.0 | o4-mini-class on one 80GB GPU |
| Kimi K2 | Moonshot Kimi | 1T total / 32B active (MoE) | 256K | ~600 GB | N/A (multi-node) | N/A (multi-node) | Modified MIT | Frontier open agentic model, clusters |
| GLM-4.6 | Z.ai GLM | 357B total / 32B active (MoE) | 200K | ~200 GB | N/A (multi-node) | N/A (multi-node) | MIT | Top open coding/agent MoE, big rigs |
| GLM-4.5-Air | Z.ai GLM | 106B total / 12B active (MoE) | 128K | ~62 GB | N/A (multi-GPU) | N/A (multi-GPU) | MIT | Smaller GLM MoE, 2-4 GPU rigs |
| Llama-3.1-Nemotron-Nano 8B | NVIDIA Nemotron | 8B dense (Llama base) | 128K | ~5 GB | ~55-70 | ~80-100 | NVIDIA Open Model | Tuned reasoning + tool-calling, 8GB |
| Llama-3.3-Nemotron-Super 49B | NVIDIA Nemotron | 49B dense (Llama base) | 128K | ~30 GB | ~10-14 (partial) | ~14-18 (partial) | NVIDIA Open Model | 70B-class accuracy, fits ~2x24GB |
| MiniMax M2 | MiniMax | ~230B total / ~10B active (MoE) | 200K+ | ~130 GB | N/A (multi-node) | N/A (multi-node) | MIT | Long-context agentic MoE, clusters |

VRAM = weights at ~Q4_K_M + small KV-cache allowance; add headroom for long context. tok/s are approximate single-GPU llama.cpp/Ollama estimates for a short prompt, not a controlled benchmark. MoE rows show total / active params: total drives memory, active drives speed.
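
To turn the table into a shortlist for a given card, compare each row's Min VRAM (Q4) against the card's capacity with some room left for context. Here is a small sketch using a handful of rows from the table; the 1.5 GB headroom default and the names `Model` and `models_that_fit` are hypothetical choices for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    min_vram_gb: float  # value from the "Min VRAM (Q4)" column


# A few rows copied from the database above.
MODELS = [
    Model("Llama 3.1 8B", 5),
    Model("Qwen3 14B", 9.5),
    Model("gpt-oss-20b", 14),
    Model("Qwen3 30B-A3B", 18),
    Model("Qwen3 32B", 20),
    Model("Llama 3.3 70B", 40),
]


def models_that_fit(vram_gb: float, headroom_gb: float = 1.5) -> list[str]:
    """Names of models whose Q4 footprint plus headroom fits in vram_gb."""
    # headroom_gb is an assumed allowance for context growth, not a measured figure.
    return [m.name for m in MODELS if m.min_vram_gb + headroom_gb <= vram_gb]


print(models_that_fit(12))  # 12 GB card, e.g. RTX 3060
print(models_that_fit(24))  # 24 GB card, e.g. RTX 3090 / 4090
```

With these rows, a 12 GB budget keeps the 8B and 14B entries and a 24 GB budget keeps everything except the 70B. Raise `headroom_gb` if you plan to run long prompts.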

GPU reference table — what each card runs

VRAM capacity is the hard wall for local models; memory bandwidth then sets how fast they run. Prices are approximate US street prices as of mid-2026 and move constantly — check current listings before buying. To match a specific card to a specific model, use our which-GPU-to-buy tool or the can-I-run-it checker.

| GPU | VRAM | Approx. price | Mem. bandwidth | Runs comfortably | Notes |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB GDDR6 | ~$280-320 (new) | 360 GB/s | 8B-14B at Q4 | Best cheap entry card; runs Llama 3.1 8B and 14B-class Q4 fully on GPU. |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | ~$450-500 (new) | 288 GB/s | 14B Q4-Q8, 24B Q4 | More VRAM than a 3060 but narrow bus — capacity, not speed. Good for bigger models. |
| RTX 5060 Ti 16GB | 16 GB GDDR7 | ~$550-600 (new) | 448 GB/s | 14B Q4-Q8, 24B Q4 | GDDR7 bandwidth makes it noticeably faster than the 4060 Ti 16GB at the same VRAM. |
| Tesla P40 24GB | 24 GB GDDR5 | ~$150-250 (used) | 347 GB/s | 14B-32B Q4 | Cheapest 24GB by far (~$7/GB). Pascal-era, ~3x slower than a 3090; needs a fan + cooling shroud. |
| RTX 3090 24GB | 24 GB GDDR6X | ~$800-1,000 (used) | 936 GB/s | 32B Q4-Q5, 70B partial | Best used VRAM-per-dollar for a real card; the local-LLM enthusiast default. |
| RTX 4090 24GB | 24 GB GDDR6X | ~$1,600-2,000 | 1,008 GB/s | 32B Q5-Q8, 70B partial | Fastest single 24GB card; same models as a 3090 but markedly higher throughput. |
| RTX 5090 32GB | 32 GB GDDR7 | ~$2,000 MSRP (street higher) | 1,792 GB/s | 32B Q8, 49B Q4 (tight), 70B partial | Most VRAM + bandwidth on a single consumer card; fits a 49B at Q4 where 24GB cards spill, but a 70B at Q4 (~40 GB) still needs partial offload. |

Prices are approximate US street estimates (mid-2026) and fluctuate; bandwidth from manufacturer specs. “Runs comfortably” assumes Q4-Q5 quants fully on the GPU with room for context.
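
The link between memory bandwidth and generation speed can be made concrete with a back-of-envelope ceiling: each generated token requires reading roughly all active weights once, so tokens/sec cannot exceed bandwidth divided by the bytes read per token. This is a sketch under that simplification, again assuming about 4.85 bits per weight for Q4_K_M. The function name `decode_ceiling_tok_s` is hypothetical, and the result is an upper bound rather than a prediction.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float = 4.85) -> float:  # assumed Q4_K_M average
    """Upper bound on tokens/sec if decoding were purely memory-bandwidth bound."""
    # Gigabytes of weights that must be streamed from VRAM for each token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures (GB/s) from the GPU reference table.
GPUS = {"RTX 3060 12GB": 360, "RTX 3090 24GB": 936, "RTX 4090 24GB": 1008}

for gpu, bandwidth in GPUS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, 8.03)  # Llama 3.1 8B, dense
    print(f"{gpu}: 8B Q4 ceiling ~{ceiling:.0f} tok/s")
```

For Llama 3.1 8B on an RTX 3090 the ceiling comes out around 190 tok/s, while the database lists ~55-70. The gap is expected: KV-cache reads, kernel overhead and sampling all eat into the theoretical limit, so the ceiling is most useful for comparing cards against each other.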

VRAM by model class (the quick rule)

| Model class | ~Q4 VRAM | Fits on |
|---|---|---|
| 1B-3B | ~1.5-2.5 GB | Any GPU, most laptops, phones |
| 7B-8B | ~5 GB | 8 GB GPU, 16 GB Mac |
| 12B-14B | ~8-10 GB | 12 GB GPU (3060), 16 GB Mac |
| 22B-32B (dense) | ~14-20 GB | 24 GB GPU (3090/4090) |
| 49B-70B (dense) | ~30-42 GB | 2x 24 GB, or 1x 48 GB+ |
| 100B+ MoE | ~62-600 GB | Multi-GPU rigs, 80 GB cards, big Macs |

The honest caveat for MoE models: a model like Qwen3-235B-A22B only activates 22B params per token (so it generates at roughly 22B-class speed), but you must still hold all 235B params in memory. That is why “runs fast” and “fits in your VRAM” are two different questions for MoE — check both. The quantization calculator shows how each quant level (Q4/Q5/Q6/Q8) changes the footprint.
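
The two MoE questions can be checked separately in a few lines. The sketch below sizes memory from total parameters and the per-token read from active parameters, assuming about 4.85 bits per weight at Q4. That constant and the names `MoE`, `q4_gb` and `check` are illustrative assumptions, so the outputs will not match the table exactly.

```python
from dataclasses import dataclass

BITS_PER_WEIGHT = 4.85  # assumed Q4_K_M average, not a figure from this page


@dataclass
class MoE:
    name: str
    total_b: float   # total params in billions: decides whether it fits
    active_b: float  # active params per token in billions: decides speed


def q4_gb(params_b: float) -> float:
    """Approximate Q4 size in GB for a given parameter count in billions."""
    return params_b * BITS_PER_WEIGHT / 8


def check(model: MoE, vram_gb: float) -> dict:
    """Answer 'does it fit?' and 'how much is read per token?' independently."""
    return {
        "fits": q4_gb(model.total_b) <= vram_gb,
        "weights_gb": round(q4_gb(model.total_b), 1),
        "read_per_token_gb": round(q4_gb(model.active_b), 1),
    }


print(check(MoE("Qwen3 30B-A3B", 30.5, 3.3), vram_gb=24))
print(check(MoE("Qwen3 235B-A22B", 235, 22), vram_gb=24))
```

On a 24 GB card the 30B-A3B model fits and reads only about 2 GB of weights per token, which is why it generates like a much smaller model. The 235B-A22B model reads a 22B-class amount per token but fails the fit check by a wide margin.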

📋 Cite or embed this table

This database is meant to be referenced. If you are writing about local AI and want to link the numbers, cite this page as the source:

“Local AI Model Database (2026), Local AI Master — https://localaimaster.com/local-ai-model-database.”

Want an embeddable / interactive version (sortable by VRAM, filterable by GPU) for your own site or workflow? We are building shareable widgets and calculators — start with the live model leaderboard and the VRAM calculator.

Methodology & honest caveats

  • Specs are primary-sourced. Parameter counts, context windows and licenses come from each model's official card, technical report, or vendor blog (see Sources). Where a family has multiple sizes, we list the ones people actually run locally.
  • tok/s are estimates, not benchmarks. We list approximate single-GPU figures for a short prompt with the model fully offloaded. Your numbers will differ with quant, context length, batch size, backend (llama.cpp vs vLLM vs Ollama), driver and prompt. We err toward conservative ranges.
  • VRAM is the Q4 weight footprint plus a small KV allowance. Real usage climbs with context length. A 14B model that is ~9.5 GB of weights can need 11-12 GB once you load a long prompt.
  • MoE memory ≠ MoE speed. Total params decide whether it fits; active params decide how fast it generates. We show both for every MoE row.
  • Licenses change and have conditions. “Commercial OK” here is a summary, not legal advice — read the actual license (especially Codestral MNPL and Mistral Large MRL, which are non-commercial).
  • This page is updated as models ship. The local landscape moves monthly; we revise the table rather than spawn dated duplicates.
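
The context-length caveat above follows from how the KV cache grows: every token stores one key and one value vector per layer per KV head. Here is a rough sketch of that arithmetic. The layer count, KV-head count and head dimension are assumed values for a 14B-class model with grouped-query attention (check the model's own config for real numbers), the cache is assumed to be fp16, and the table's ~9.5 GB is treated as the weight footprint.

```python
def kv_cache_gb(context_tokens: int,
                n_layers: int,
                n_kv_heads: int,
                head_dim: int,
                bytes_per_value: int = 2) -> float:  # 2 bytes = fp16 cache (assumed)
    """Size of the key/value cache for one sequence, in GB (1e9 bytes)."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


# Assumed shape for a 14B-class GQA model; not taken from this page.
ASSUMED_14B = dict(n_layers=40, n_kv_heads=8, head_dim=128)
WEIGHTS_GB = 9.5  # Qwen3 14B row, Min VRAM (Q4)

for ctx in (2_048, 8_192, 16_384):
    total = WEIGHTS_GB + kv_cache_gb(ctx, **ASSUMED_14B)
    print(f"{ctx:>6} tokens: ~{total:.1f} GB")
```

Under these assumptions the cache adds roughly 2.7 GB at 16K tokens, which is consistent with a ~9.5 GB model needing 11-12 GB once a long prompt is loaded. Backends that quantize the KV cache will use less.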

Sources

Every parameter count, context window and license in this database is taken from each model's official documentation. tok/s and VRAM are derived estimates as described in Methodology.


Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
