Free Tool · No Signup

Local AI Model Size Picker: 7B, 14B, 32B or 70B?

Pick what you'll use the model for and how much VRAM you have. The tool returns the right parameter size (7B / 8B / 14B / 32B / 70B), a specific model you can actually pull today, the quantization to use (like Q4_K_M), and whether it comfortably fits your card — using real VRAM-at-Q4 math, not guesswork.

📅 Published: June 20, 2026🔄 Last Updated: June 20, 2026✓ Manually Reviewed

1 · What will you mainly use it for?

How the size picker works

Two inputs decide everything: your use case and your VRAM. The use case chooses which model family is best (a coding model for coding, a vision model for images), and the VRAM decides how big a version of it you can actually run. We size each parameter tier by its real Q4_K_M weight footprint and check it against your card's usable memory.

Param size	~Q4_K_M weights	Comfortable on
7B	~5 GB	8 GB VRAM (or CPU)
8B	~6 GB	8-12 GB VRAM
14B	~9 GB	12-16 GB VRAM
32B	~20 GB	24 GB VRAM
70B	~42 GB	48 GB+ / 2× 24 GB

The reserved-VRAM step matters. A 32B model at Q4_K_M is about 20 GB of weights, which looks like it fits a 24 GB RTX 4090 — and it does, but with only ~3 GB left for the KV-cache, so you keep the context modest. A 14B at ~9 GB on the same card leaves loads of room for a long context. The picker tells you which situation you're in instead of letting you discover it with an out-of-memory crash.

Worked examples

Coding · 12 GB (RTX 4070/3060)

→ 14B: Qwen2.5 Coder 14B at Q4_K_M (~9 GB). Clearly beats a 7B on real refactors and still leaves room for context. See the best 14B coding models.

Coding · 24 GB (RTX 4090/3090)

→ 32B: Qwen2.5 Coder 32B at Q4_K_M (~20 GB). The local-coding benchmark king on a single card. More on sizing in 7B vs 14B vs 32B vs 70B for coding.

General chat · 8 GB (RTX 4060)

→ 8B: Llama 3.1 8B at Q4_K_M (~6 GB). The everywhere-runs default with a huge ecosystem.

Reasoning · 32 GB+ (RTX 5090 / 2× cards)

→ 70B if you have 48 GB+, else 32B: Llama 3.3 70B (~42 GB) is the best open-weight reasoning you can self-host.

Vision · 8 GB

→ 7B: Qwen2-VL 7B at Q4_K_M (~6 GB). Excellent OCR and document reading on a small card.

Anything · CPU only

→ Stay at 7B-8B and expect ~3-12 tokens/sec. Bigger models technically run via CPU offload but get painfully slow for everyday use.

Want the exact VRAM number for a specific model and context length rather than a tier estimate? Run it through the VRAM Calculator, then come back here to confirm the size is the right call for your use case.

Frequently asked questions

How do you calculate whether a model fits my VRAM?

We use the real Q4_K_M weight size for each parameter tier — about 5 GB for 7B, 6 GB for 8B, 9 GB for 14B, 20 GB for 32B and 42 GB for 70B — then subtract that from your usable VRAM (we reserve roughly 1.5 GB of the card for your OS and desktop). Whatever is left is your budget for the KV-cache and context window. The tool steps down to the largest size that leaves you a safe margin.

Why does the tool reserve VRAM instead of using the full number on the box?

Your GPU never hands 100% of its memory to the model. The OS, your desktop, the browser and the model runtime all hold some. On top of that, the KV-cache grows with your context length, so a model that technically "fits" at 0 context can run out of memory at 16K tokens. Reserving a slice and reporting the leftover headroom is how you avoid out-of-memory crashes mid-generation.

Is bigger always better?

No. A 14B at Q4 usually beats a 7B, and a 32B usually beats a 14B — but the gains shrink while the VRAM cost and the slowdown grow. For general chat a well-tuned 8B is genuinely fine for most people. For coding and reasoning the jump to 14B and then 32B is where quality really shows up. Past 70B you are into multi-GPU territory and diminishing returns for local use.

What if nothing fits — can I still run a bigger model?

Yes, two ways. First, a more aggressive quant: Q4_K_S or IQ4_XS shave a little size (and a little quality) off the Q4_K_M number. Second, CPU offload: tools like Ollama and llama.cpp can push some layers into system RAM, which lets a too-big model run, just slower. When your pick is a tight fit the tool flags it and suggests a lighter quant or a smaller fallback model.

Does this work for Apple Silicon / unified memory?

Roughly, yes. On a Mac the unified memory acts as the VRAM budget, so pick the bucket closest to your usable unified RAM (leave headroom for macOS). The same weight-size math applies. The main difference is that Macs share memory with the whole system, so be a little more conservative than a dedicated GPU of the same number.

Want help running the model you picked?

Local AI Master walks you through pulling, quantizing and serving open-weight models — Ollama, llama.cpp and vLLM, KV-cache and context tuning, OpenAI-compatible serving. Real, working setups, not theory.

See the deployment course →

Related tools & resources

→ VRAM Calculator — exact VRAM for any model + quant + context length
→ AI Model Finder — match your full hardware setup to a model
→ 7B vs 14B vs 32B vs 70B for coding — how size affects coding quality
→ Best 14B coding models — the value tier, ranked
→ All AI models — full database with VRAM specs

🎯

AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

Start free Browse courses first

Or own it for life — Lifetime $149 $599, pay once

Training your whole team? Get a team quote →

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor

GitHub LinkedIn Twitter