Free Tool · No Signup
Local AI Model Size Picker: 7B, 14B, 32B or 70B?
Pick what you'll use the model for and how much VRAM you have. The tool returns the right parameter size (7B / 8B / 14B / 32B / 70B), a specific model you can actually pull today, the quantization to use (like Q4_K_M), and whether it comfortably fits your card — using real VRAM-at-Q4 math, not guesswork.
How the size picker works
Two inputs decide everything: your use case and your VRAM. The use case chooses which model family is best (a coding model for coding, a vision model for images), and the VRAM decides how big a version of it you can actually run. We size each parameter tier by its real Q4_K_M weight footprint and check it against your card's usable memory.
| Param size | ~Q4_K_M weights | Comfortable on |
|---|---|---|
| 7B | ~5 GB | 8 GB VRAM (or CPU) |
| 8B | ~6 GB | 8-12 GB VRAM |
| 14B | ~9 GB | 12-16 GB VRAM |
| 32B | ~20 GB | 24 GB VRAM |
| 70B | ~42 GB | 48 GB+ / 2× 24 GB |
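The fit check described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the weight figures are copied from the table, while `RESERVED_GB` and `pick_tier` are hypothetical names, and the 1 GB reserve is an assumed value.

```python
# Q4_K_M weight footprints in GB, copied from the table above.
Q4_WEIGHTS_GB = {"7B": 5, "8B": 6, "14B": 9, "32B": 20, "70B": 42}

# Hypothetical reserve held back from the card's total VRAM (assumption).
RESERVED_GB = 1.0


def pick_tier(vram_gb, reserved_gb=RESERVED_GB):
    """Return (tier, headroom_gb) for the largest tier whose weights fit, or None."""
    usable = vram_gb - reserved_gb
    best = None
    # The dict is ordered smallest to largest, so the last match is the biggest fit.
    for tier, weights in Q4_WEIGHTS_GB.items():
        if weights <= usable:
            best = (tier, usable - weights)
    return best


for vram in (8, 12, 24, 48):
    print(vram, pick_tier(vram))
```

The headroom value is what is left for the KV-cache and runtime buffers, so a small number there means you should plan for a shorter context even though the weights fit.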
Reserving VRAM matters: the picker sizes against your card's usable memory, not the number on the box. A 32B model at Q4_K_M is about 20 GB of weights, which looks like it fits a 24 GB RTX 4090. It does, but only ~3 GB is left for the KV-cache, so you have to keep the context modest. A 14B at ~9 GB on the same card leaves plenty of room for a long context. The picker tells you which situation you're in instead of letting you discover it with an out-of-memory crash.
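To get a feel for what "~3 GB left for the KV-cache" buys you, here is a rough sketch of the standard KV-cache size estimate. The layer count, KV-head count and head dimension below are assumed values for a 32B-class model with grouped-query attention; check your model's own config before relying on them. The function names are hypothetical, and the estimate ignores compute buffers, so real usable context will be somewhat lower.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (hence the factor of 2) for every layer, per token.

    bytes_per_value=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_context_tokens(free_gib, bytes_per_token):
    """How many tokens of KV-cache fit in the given headroom (in GiB)."""
    return int(free_gib * 1024**3) // bytes_per_token


# Assumed architecture figures for a 32B-class model (not taken from this page).
per_token = kv_cache_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
print(per_token)                         # 262144 bytes, i.e. 256 KiB per token
print(max_context_tokens(3, per_token))  # 12288 tokens in 3 GiB of headroom
```

Under those assumptions, 3 GiB holds roughly 12K tokens of fp16 cache, which is why the 32B-on-24-GB case calls for a modest context.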
Worked examples
Coding · 12 GB (RTX 4070/3060)
→ 14B: Qwen2.5 Coder 14B at Q4_K_M (~9 GB). Clearly beats a 7B on real refactors and still leaves room for context. See the best 14B coding models.
Coding · 24 GB (RTX 4090/3090)
→ 32B: Qwen2.5 Coder 32B at Q4_K_M (~20 GB). The local-coding benchmark king on a single card. More on sizing in 7B vs 14B vs 32B vs 70B for coding.
General chat · 8 GB (RTX 4060)
→ 8B: Llama 3.1 8B at Q4_K_M (~6 GB). The everywhere-runs default with a huge ecosystem.
Reasoning · 32 GB+ (RTX 5090 / 2× cards)
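→ 70B if you have 48 GB+, otherwise 32B. Llama 3.3 70B at Q4_K_M (~42 GB) is the best open-weight reasoning model you can self-host, but its weights alone exceed a single 32 GB card, so below 48 GB the 32B tier is the one that fits.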
Vision · 8 GB
→ 7B: Qwen2-VL 7B at Q4_K_M (~6 GB). Excellent OCR and document reading on a small card.
Anything · CPU only
→ Stay at 7B-8B and expect ~3-12 tokens/sec. Bigger models technically run via CPU offload but get painfully slow for everyday use.
Want the exact VRAM number for a specific model and context length rather than a tier estimate? Run it through the VRAM Calculator, then come back here to confirm the size is the right call for your use case.
Frequently asked questions
How do you calculate whether a model fits my VRAM?
Why does the tool reserve VRAM instead of using the full number on the box?
Is bigger always better?
What if nothing fits — can I still run a bigger model?
Does this work for Apple Silicon / unified memory?
Want help running the model you picked?
Local AI Master walks you through pulling, quantizing and serving open-weight models — Ollama, llama.cpp and vLLM, KV-cache and context tuning, OpenAI-compatible serving. Real, working setups, not theory.
See the deployment course →
Related tools & resources
- → VRAM Calculator — exact VRAM for any model + quant + context length
- → AI Model Finder — match your full hardware setup to a model
- → 7B vs 14B vs 32B vs 70B for coding — how size affects coding quality
- → Best 14B coding models — the value tier, ranked
- → All AI models — full database with VRAM specs
Written by the Local AI Master Team
The team behind Local AI Master
We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.