Phi-4 Local Setup Guide (2026): Microsoft's 14B Reasoning Model on 12GB GPUs
Phi-4 is Microsoft's December 2024 release of their efficient-LLM family — a 14.7B-parameter model trained on heavily curated synthetic and academic data rather than raw web crawl. The result: math, reasoning, and code performance that beats Llama 3.1 70B in some benchmarks while running on a 12 GB consumer GPU. MIT licensed for unrestricted commercial use. For users who want strong reasoning capability without the VRAM cost of 70B-class models, Phi-4 is one of the best 14B-class options in 2026.
This guide covers the full Phi-4 family (base 14B, mini 3.8B, multimodal 5.6B), setup across Ollama / vLLM / llama.cpp, prompting techniques for reasoning workloads, fine-tuning with QLoRA / Axolotl / Unsloth, and detailed benchmarks vs Llama 3.1 8B / Qwen 2.5 14B / DeepSeek-R1.
Table of Contents
- What Phi-4 Is
- The Phi-4 Family: Base, Mini, Multimodal
- What Makes Phi-4 Different (Synthetic Data)
- Hardware Requirements & Quantization
- Phi-4 vs Llama 3.1 8B vs Qwen 2.5 14B vs DeepSeek-R1
- Ollama Setup
- llama.cpp Setup with GGUF
- vLLM Setup
- LM Studio / oobabooga / KoboldCpp
- Phi-4-mini for Edge Devices
- Phi-4-multimodal for Vision + Audio
- System Prompts & Sampling
- Reasoning Tasks: Math, Code, Logic
- Fine-Tuning with QLoRA
- Function Calling and Structured Output
- Tuning Recipes by GPU
- Real Benchmarks
- Licensing
- Troubleshooting
- FAQ
What Phi-4 Is {#what-it-is}
Phi-4 (microsoft/phi-4 on HuggingFace) is the December 2024 release of Microsoft's Phi family. Architecture: standard decoder-only transformer with 14.7B parameters, 16K context, ChatML chat template. License: MIT — most permissive of recent major releases.
The Phi family thesis: small models trained on carefully curated data can out-reason much larger models. Phi-4 is the strongest demonstration of that thesis to date.
The Phi-4 Family: Base, Mini, Multimodal {#family}
| Variant | Parameters | Context | VRAM (BF16 / Q4) | Use |
|---|---|---|---|---|
| Phi-4 (base 14B) | 14.7B | 16K | 30 GB / 9 GB | Reasoning, math, code |
| Phi-4-mini (3.8B) | 3.8B | 128K | 8 GB / 2.5 GB | Edge, mobile, real-time |
| Phi-4-multimodal (5.6B) | 5.6B | 128K | 12 GB / 4 GB | Vision + audio + text |
| Phi-4-mini-instruct | 3.8B | 128K | 8 GB / 2.5 GB | Mini for chat |
| Phi-3.5-MoE (legacy) | 41.9B total (6.6B active) | 128K | 80 GB / 24 GB | MoE option |
For most users the base 14B Phi-4 is the right starting point.
What Makes Phi-4 Different (Synthetic Data) {#synthetic-data}
Most LLMs train on web crawl (Common Crawl, Reddit, books). The Phi family trains primarily on:
- Synthetic textbooks generated by larger models with verification
- Academic papers with chain-of-thought reasoning
- Curated code from filtered GitHub
- Math problem sets with verified solutions
- Synthetic dialogues for instruction tuning
The training corpus is smaller (~10T tokens vs Llama 3.1's 15T) but far more information-dense. The result: better reasoning per parameter, at the cost of raw knowledge breadth (less factual world coverage than Llama).
Hardware Requirements & Quantization {#requirements}
| GPU VRAM | Phi-4 Quant | Throughput on RTX 4090 |
|---|---|---|
| 8 GB | Q3_K_M (low quality) | ~85 tok/s |
| 10 GB | Q4_K_S | ~80 tok/s |
| 12 GB | Q4_K_M / Q5_K_S | ~75 tok/s |
| 16 GB | Q5_K_M / Q6_K | ~70 tok/s |
| 24 GB | Q8_0 / FP16 | ~60 tok/s |
For most 12-16 GB consumer GPUs: Q5_K_M is the sweet spot. For RTX 4090 24 GB: Q8_0 if you want highest fidelity, or use the freed VRAM for longer context.
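A quick way to sanity-check these tiers: multiply the parameter count by the quant's average bits per weight. A rough sketch in Python (the bits-per-weight values are approximate averages for llama.cpp K-quants, and real usage adds KV cache and runtime overhead on top):
# Rough weight-only VRAM estimate: parameters * bits-per-weight / 8
# Bits-per-weight values are approximate averages for llama.cpp K-quants
PARAMS = 14.7e9
QUANTS = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}
for name, bpw in QUANTS.items():
    weight_gb = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{weight_gb:.1f} GB for weights, plus KV cache and overhead")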
Phi-4 vs Llama 3.1 8B vs Qwen 2.5 14B vs DeepSeek-R1 {#comparison}
| Benchmark | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 14B | Llama 3.1 70B | DeepSeek-R1 |
|---|---|---|---|---|---|
| MMLU | 84.8 | 73.0 | 79.7 | 86.0 | 90.8 |
| GSM8K (math) | 95.6 | 84.5 | 90.2 | 95.1 | 97.3 |
| MATH | 80.4 | 51.9 | 80.0 | 68.0 | 97.3 |
| HumanEval (code) | 82.6 | 72.6 | 83.5 | 80.5 | 96.2 |
| MMLU-Pro | 70.4 | 48.3 | 63.7 | 60.4 | 84.0 |
| AIME 2024 | 12.0 | 6.7 | 14.0 | 23.3 | 79.8 |
| Throughput (RTX 4090, Q5) | ~70 tok/s | ~127 tok/s | ~52 tok/s | ~8 tok/s (offload) | varies |
| Context length | 16K | 131K | 131K | 131K | 64K |
For everyday math/code/reasoning at fast inference speed: Phi-4 wins on quality-per-VRAM. For competition math: DeepSeek-R1. For long-context: Llama / Qwen. For code specifically: Qwen 2.5 Coder 14B.
Ollama Setup {#ollama}
# Pull Phi-4 14B
ollama pull phi4
# Or specific quant
ollama pull phi4:14b-q5_K_M
# Run
ollama run phi4 "Solve: integral of sin(x)*e^x dx, show work."
Ollama Modelfile customization:
FROM phi4
PARAMETER num_ctx 16384
PARAMETER temperature 0.4
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a precise reasoning assistant. Show your work step by step."""
llama.cpp Setup with GGUF {#llamacpp}
# Download Q5_K_M from bartowski's quants
huggingface-cli download bartowski/phi-4-GGUF \
phi-4-Q5_K_M.gguf \
--local-dir ./models
./llama-cli \
-m models/phi-4-Q5_K_M.gguf \
-ngl 999 \
-c 16384 \
-fa \
--temp 0.4 \
--min-p 0.05 \
-p "Solve this step by step: ..."
For server mode:
./llama-server -m phi-4-Q5_K_M.gguf -ngl 999 -c 16384 --port 8080
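The server exposes an HTTP API you can call from any language. A minimal sketch against the native /completion endpoint (the prompt and token budget are placeholders):
# Send a completion request to llama-server's native HTTP API
import requests
payload = {
    "prompt": "Solve this step by step: differentiate x^3 * ln(x).",
    "n_predict": 512,        # max tokens to generate
    "temperature": 0.4,
    "min_p": 0.05,
}
resp = requests.post("http://localhost:8080/completion", json=payload, timeout=300)
print(resp.json()["content"])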
vLLM Setup {#vllm}
# BF16 (needs 32+ GB VRAM)
vllm serve microsoft/phi-4 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92
# AWQ-INT4 (12+ GB VRAM)
vllm serve casperhansen/phi-4-awq \
--quantization awq \
--max-model-len 16384
For high-throughput serving, see vLLM Complete Setup Guide.
LM Studio / oobabooga / KoboldCpp {#other-runtimes}
All major desktop UIs support Phi-4 as of January 2025:
- LM Studio: search "phi-4" in the model browser, download Q5_K_M, click Load
- oobabooga: download GGUF, place in models/, llama.cpp loader
- KoboldCpp:
./koboldcpp --model phi-4-Q5_K_M.gguf --usecublas
See text-generation-webui guide and KoboldCpp guide.
Phi-4-mini for Edge Devices {#mini}
Phi-4-mini (3.8B) is designed for tight resource budgets:
ollama run phi4-mini
Use cases:
- Mobile via MLC-LLM (Android / iOS): 18-25 tok/s on Snapdragon 8 Gen 3
- Browser via WebLLM: 12-18 tok/s on RTX 4070
- Raspberry Pi 5: 4-6 tok/s for offline assistants (see the llama-cpp-python sketch below)
- Embedded systems for on-device reasoning
128K context window is unusually long for a 3.8B model — useful for RAG over long documents on edge devices.
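For CPU-only boxes like a Raspberry Pi, llama-cpp-python gives you the same GGUF weights behind a small Python API. A minimal sketch (the GGUF filename and thread count are placeholders; adjust to your download and hardware):
# Run Phi-4-mini from Python via llama-cpp-python (CPU-only example)
from llama_cpp import Llama
llm = Llama(
    model_path="phi-4-mini-instruct-Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,
    n_threads=4,        # match your CPU core count
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the key idea of QLoRA in two sentences."}],
    temperature=0.4,
)
print(out["choices"][0]["message"]["content"])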
Phi-4-multimodal for Vision + Audio {#multimodal}
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)
# Vision input
image = Image.open("example.jpg")  # replace with your image path
inputs = processor(
    images=[image],
    text="<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>",
    return_tensors="pt",
).to("cuda")
# Audio input works the same way with an <|audio_1|> tag
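To decode a response from those inputs, a minimal generation sketch (exact generate kwargs may differ; check the Phi-4-multimodal model card on Hugging Face):
# Generate and decode a reply from the processed vision inputs above
output_ids = model.generate(**inputs, max_new_tokens=256)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])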
For pure vision use cases, Llama 3.2 11B Vision and Qwen 2-VL 7B are stronger; Phi-4-multimodal's strength is doing all three modalities in one small model.
System Prompts & Sampling {#prompting}
ChatML format:
<|im_start|>system
You are a precise reasoning assistant.
<|im_end|>
<|im_start|>user
[user message]
<|im_end|>
<|im_start|>assistant
Recommended sampling:
- Reasoning / math / code: temperature 0.0-0.3, min-p 0.05, no top-k
- Chat / general: temperature 0.5-0.7, min-p 0.05
- Creative writing: temperature 0.8-1.0, min-p 0.05
See LLM Sampling Parameters for full sampler reference.
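If you drive the model from scripts, it can help to keep these recommendations as named presets. A small sketch (the three task types mirror the list above):
# Sampling presets for Phi-4, mirroring the recommendations above
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.2, "min_p": 0.05},  # math, code, logic
    "chat": {"temperature": 0.6, "min_p": 0.05},       # general conversation
    "creative": {"temperature": 0.9, "min_p": 0.05},   # writing, brainstorming
}
def sampling_for(task: str) -> dict:
    """Return sampling options for a task type, defaulting to chat."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["chat"])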
Reasoning Tasks: Math, Code, Logic {#reasoning}
For best reasoning output, use chain-of-thought prompting:
Solve this step by step. Show your work.
Problem: ...
Solution:
Or system-level CoT:
<|im_start|>system
For every problem, work through the solution step by step before providing the final answer.
<|im_end|>
Phi-4 was trained to produce CoT-style reasoning natively; it usually does this without explicit prompting.
For competition-grade math (AIME, Olympiad), use DeepSeek-R1-Distill variants — they outperform Phi-4 on hard problems with test-time compute.
Fine-Tuning with QLoRA {#fine-tuning}
Using Unsloth (fastest):
from unsloth import FastLanguageModel
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/phi-4-bnb-4bit",
max_seq_length=4096,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=32, lora_alpha=32, lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
# Train on your dataset
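The training step itself typically goes through TRL's SFTTrainer. A minimal sketch (the dataset path, text field, and hyperparameters are placeholders, and exact argument placement varies slightly across TRL versions):
# Sketch of the training step with TRL's SFTTrainer; values are illustrative
from trl import SFTConfig, SFTTrainer
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",            # column holding the formatted prompts
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="phi4-qlora-out",
    ),
)
trainer.train()
model.save_pretrained("phi4-qlora-adapter")   # saves only the LoRA adapter weights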
On RTX 4090: ~2-4 hours for 1K-example QLoRA fine-tune. MIT license means fine-tuned derivatives can be redistributed without restriction.
See LoRA Fine-Tuning Local Guide.
Function Calling and Structured Output {#tools}
Phi-4 supports OpenAI-style tool calling. Pass a tools array through the OpenAI-compatible API; with vLLM, launch the server with --enable-auto-tool-choice and a matching --tool-call-parser:
{
"model": "phi-4",
"messages": [...],
"tools": [{"type": "function", "function": {...}}],
"tool_choice": "auto"
}
For JSON mode constrained output, use response_format:
{"response_format": {"type": "json_schema", "json_schema": {...}}}
vLLM uses xgrammar for guaranteed schema-valid output.
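A minimal tool-calling request through the OpenAI-compatible endpoint looks like this (the server URL, tool name, and schema are illustrative placeholders):
# Sketch: OpenAI-style tool calling against a local Phi-4 server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # vLLM default port
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="microsoft/phi-4",  # model name as served by your runtime
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)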
Tuning Recipes by GPU {#tuning}
RTX 3060 / 4060 (12 GB)
ollama run phi4:14b-q4_K_M
# Or via Modelfile: Q4_K_M, num_ctx 8192, temperature 0.4
RTX 4070 / 4080 (16 GB)
Q5_K_M, num_ctx 16384.
RTX 4090 / 5090 (24-32 GB)
Q8_0 or FP16, full 16K context, batch multiple users via vLLM.
Mac M2 / M3 / M4
Q4_K_M or Q5_K_M via Ollama Metal — works on 16 GB Mac comfortably.
Real Benchmarks {#benchmarks}
Single-user, RTX 4090, Q5_K_M:
| Test | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 14B |
|---|---|---|---|
| GSM8K | 95.6% | 84.5% | 90.2% |
| MATH | 80.4% | 51.9% | 80.0% |
| HumanEval | 82.6% | 72.6% | 83.5% |
| MMLU | 84.8% | 73.0% | 79.7% |
| Inference tok/s (Q5) | 70 | 127 | 52 |
Phi-4 is slower per token than Llama 3.1 8B but produces correct answers more often, so effective throughput on hard tasks (correct answers per minute rather than tokens per second) is higher.
Licensing {#licensing}
MIT license — most permissive among recent major model releases. You can:
- Use commercially without restriction
- Modify and redistribute
- Bundle into proprietary products
- Sell as paid service
- Train derivative models without restriction
Compare to Llama 3.1 (Meta Llama Community License, with a monthly-active-user threshold) and Qwen 2.5 (mostly Apache 2.0, but the 3B and 72B variants ship under the more restrictive Qwen license). For the strictest commercial cleanliness, Phi-4 is the safest choice.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing ChatML tokens | Use Modelfile / config that loads phi-4 template |
| Repetitive output | No min-p | Set min-p 0.05 |
| Verbose / over-explains | No system prompt | Add "be concise" to system |
| OOM | Context too long | 16K max for Phi-4; lower if needed |
| Slow on Mac | M1 base | Use Phi-4-mini instead |
| Worse than Llama on chat | Phi-4 is reasoning-tuned | Use Llama for chat, Phi-4 for math/code |
FAQ {#faq}
See answers to common Phi-4 questions below.
Sources: Phi-4 on Hugging Face | Phi-4 technical report | Microsoft Research | bartowski quants | Internal benchmarks RTX 4090, M4 Max.