Models

Phi-4 Local Setup Guide (2026): Microsoft's 14B Reasoning Model on 12GB GPUs

May 1, 2026
22 min read
LocalAimaster Research Team

Phi-4 is Microsoft's December 2024 release of their efficient-LLM family — a 14.7B-parameter model trained on heavily curated synthetic and academic data rather than raw web crawl. The result: math, reasoning, and code performance that beats Llama 3.1 70B in some benchmarks while running on a 12 GB consumer GPU. MIT licensed for unrestricted commercial use. For users who want strong reasoning capability without the VRAM cost of 70B-class models, Phi-4 is one of the best 14B-class options in 2026.

This guide covers the full Phi-4 family (base 14B, mini 3.8B, multimodal 5.6B), setup across Ollama / vLLM / llama.cpp, prompting techniques for reasoning workloads, fine-tuning with QLoRA / Axolotl / Unsloth, and detailed benchmarks vs Llama 3.1 8B / Qwen 2.5 14B / DeepSeek-R1.

Table of Contents

  1. What Phi-4 Is
  2. The Phi-4 Family: Base, Mini, Multimodal
  3. What Makes Phi-4 Different (Synthetic Data)
  4. Hardware Requirements & Quantization
  5. Phi-4 vs Llama 3.1 8B vs Qwen 2.5 14B vs DeepSeek-R1
  6. Ollama Setup
  7. llama.cpp Setup with GGUF
  8. vLLM Setup
  9. LM Studio / oobabooga / KoboldCpp
  10. Phi-4-mini for Edge Devices
  11. Phi-4-multimodal for Vision + Audio
  12. System Prompts & Sampling
  13. Reasoning Tasks: Math, Code, Logic
  14. Fine-Tuning with QLoRA
  15. Function Calling and Structured Output
  16. Tuning Recipes by GPU
  17. Real Benchmarks
  18. Licensing
  19. Troubleshooting
  20. FAQ


What Phi-4 Is {#what-it-is}

Phi-4 (microsoft/phi-4 on HuggingFace) is the December 2024 release of Microsoft's Phi family. Architecture: standard decoder-only transformer with 14.7B parameters, 16K context, ChatML chat template. License: MIT — most permissive of recent major releases.

The Phi family thesis: small, well-trained models on curated data beat larger models on reasoning. Phi-4 is the best demonstration to date.


The Phi-4 Family: Base, Mini, Multimodal {#family}

| Variant | Parameters | Context | VRAM (BF16 / Q4) | Use |
|---|---|---|---|---|
| Phi-4 (base 14B) | 14.7B | 16K | 30 GB / 9 GB | Reasoning, math, code |
| Phi-4-mini | 3.8B | 128K | 8 GB / 2.5 GB | Edge, mobile, real-time |
| Phi-4-multimodal | 5.6B | 128K | 12 GB / 4 GB | Vision + audio + text |
| Phi-4-mini-instruct | 3.8B | 128K | 8 GB / 2.5 GB | Mini tuned for chat |
| Phi-3.5-MoE (legacy) | 41.9B total (6.6B active, 16 experts) | 128K | 80 GB / 24 GB | MoE option |

For most users the base 14B Phi-4 is the right starting point.


What Makes Phi-4 Different (Synthetic Data) {#synthetic-data}

Most LLMs train on web crawl (Common Crawl, Reddit, books). Phi family trains primarily on:

  • Synthetic textbooks generated by larger models with verification
  • Academic papers with chain-of-thought reasoning
  • Curated code from filtered GitHub
  • Math problem sets with verified solutions
  • Synthetic dialogues for instruction tuning

Smaller training corpus (~10T tokens vs Llama 3.1's 15T) but much higher data density. Result: better reasoning per parameter, weaker on raw knowledge breadth (less factual world coverage than Llama).



Hardware Requirements & Quantization {#requirements}

| GPU VRAM | Phi-4 quant | Throughput (RTX 4090) |
|---|---|---|
| 8 GB | Q3_K_M (low quality) | ~85 tok/s |
| 10 GB | Q4_K_S | ~80 tok/s |
| 12 GB | Q4_K_M / Q5_K_S | ~75 tok/s |
| 16 GB | Q5_K_M / Q6_K | ~70 tok/s |
| 24 GB | Q8_0 / FP16 | ~60 tok/s |

For most 12-16 GB consumer GPUs: Q5_K_M is the sweet spot. For RTX 4090 24 GB: Q8_0 if you want highest fidelity, or use the freed VRAM for longer context.
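The table above is mostly simple arithmetic: quantized weights take roughly parameter-count × bits-per-weight, and the KV cache and runtime overhead sit on top. A rough sketch (the bits-per-weight figures are approximate averages for each GGUF quant type, not exact):

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of quantized weights, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Approximate average bits-per-weight for common GGUF quants
QUANT_BPW = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

for quant, bpw in QUANT_BPW.items():
    print(f"Phi-4 14.7B @ {quant}: ~{weights_gib(14.7, bpw):.1f} GiB + KV cache/overhead")
```

Q5_K_M comes out near 10 GiB of weights, which is why it fits 12-16 GB cards with room left for the KV cache.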


Phi-4 vs Llama 3.1 8B vs Qwen 2.5 14B vs DeepSeek-R1 {#comparison}

| Benchmark | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 14B | Llama 3.1 70B | DeepSeek-R1 |
|---|---|---|---|---|---|
| MMLU | 84.8 | 73.0 | 79.7 | 86.0 | 90.8 |
| GSM8K (math) | 95.6 | 84.5 | 90.2 | 95.1 | 97.3 |
| MATH | 80.4 | 51.9 | 80.0 | 68.0 | 97.3 |
| HumanEval (code) | 82.6 | 72.6 | 83.5 | 80.5 | 96.2 |
| MMLU-Pro | 70.4 | 48.3 | 63.7 | 60.4 | 84.0 |
| AIME 2024 | 12.0 | 6.7 | 14.0 | 23.3 | 79.8 |
| Throughput (RTX 4090, Q5) | ~70 tok/s | ~127 tok/s | ~52 tok/s | ~8 tok/s (offload) | varies |
| Context length | 16K | 131K | 131K | 131K | 64K |

For everyday math/code/reasoning at fast inference speed: Phi-4 wins on quality-per-VRAM. For competition math: DeepSeek-R1. For long-context: Llama / Qwen. For code specifically: Qwen 2.5 Coder 14B.


Ollama Setup {#ollama}

# Pull Phi-4 14B
ollama pull phi4

# Or specific quant
ollama pull phi4:14b-q5_K_M

# Run
ollama run phi4 "Solve: integral of sin(x)*e^x dx, show work."

Ollama Modelfile customization:

FROM phi4
PARAMETER num_ctx 16384
PARAMETER temperature 0.4
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a precise reasoning assistant. Show your work step by step."""
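The same settings from the Modelfile can also be sent per-request through Ollama's REST API (`/api/chat`). A minimal stdlib-only sketch, assuming Ollama is running on its default port; the helper names are ours, not part of the Ollama API:

```python
import json
import urllib.request

def build_chat_request(prompt: str) -> dict:
    """Mirror the Modelfile settings in a per-request options block."""
    return {
        "model": "phi4",
        "messages": [
            {"role": "system", "content": "You are a precise reasoning assistant. Show your work step by step."},
            {"role": "user", "content": prompt},
        ],
        "options": {"num_ctx": 16384, "temperature": 0.4, "min_p": 0.05, "repeat_penalty": 1.05},
        "stream": False,
    }

def ask(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one chat turn to a local Ollama server and return the reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Per-request `options` override the Modelfile defaults, which is handy for switching between reasoning and chat sampling without rebuilding the model.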

llama.cpp Setup with GGUF {#llamacpp}

# Download Q5_K_M from bartowski's quants
huggingface-cli download bartowski/phi-4-GGUF \
    phi-4-Q5_K_M.gguf \
    --local-dir ./models

./llama-cli \
    -m models/phi-4-Q5_K_M.gguf \
    -ngl 999 \
    -c 16384 \
    -fa \
    --temp 0.4 \
    --min-p 0.05 \
    -p "Solve this step by step: ..."

For server mode:

./llama-server -m phi-4-Q5_K_M.gguf -ngl 999 -c 16384 --port 8080

vLLM Setup {#vllm}

# BF16 (needs 32+ GB VRAM)
vllm serve microsoft/phi-4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92

# AWQ-INT4 (12+ GB VRAM)
vllm serve casperhansen/phi-4-awq \
    --quantization awq \
    --max-model-len 16384

For high-throughput serving, see vLLM Complete Setup Guide.
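Once vLLM is serving, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (helper names are illustrative; assumes the server started above is listening on localhost:8000):

```python
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": "microsoft/phi-4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 1024,
    }

def complete(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST to /v1/chat/completions and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same request shape works against `llama-server` from the previous section; only the base URL and model name change.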


LM Studio / oobabooga / KoboldCpp {#other-runtimes}

All major desktop UIs support Phi-4 as of January 2025:

  • LM Studio: search "phi-4" in the model browser, download Q5_K_M, click Load
  • oobabooga: download GGUF, place in models/, llama.cpp loader
  • KoboldCpp: ./koboldcpp --model phi-4-Q5_K_M.gguf --usecublas

See text-generation-webui guide and KoboldCpp guide.


Phi-4-mini for Edge Devices {#mini}

Phi-4-mini (3.8B) is designed for tight resource budgets:

ollama run phi4-mini

Use cases:

  • Mobile via MLC-LLM (Android / iOS): 18-25 tok/s on Snapdragon 8 Gen 3
  • Browser via WebLLM: 12-18 tok/s on RTX 4070
  • Raspberry Pi 5: 4-6 tok/s for offline assistants
  • Embedded systems for on-device reasoning

128K context window is unusually long for a 3.8B model — useful for RAG over long documents on edge devices.
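For edge RAG, the practical question is how many retrieved chunks fit in that 128K window while leaving headroom for the question and the answer. A hypothetical greedy packer, using a crude 4-characters-per-token estimate (swap in a real tokenizer for accuracy):

```python
def pack_context(chunks, budget_tokens=120_000, est_tokens=lambda s: len(s) // 4):
    """Greedily pack retrieved chunks into a token budget.
    Returns (packed_chunks, tokens_used). The budget leaves ~8K of the
    128K window free for the prompt template and the generated answer."""
    packed, used = [], 0
    for chunk in chunks:
        t = est_tokens(chunk)
        if used + t > budget_tokens:
            break
        packed.append(chunk)
        used += t
    return packed, used
```

Greedy packing keeps the highest-ranked chunks (assuming the retriever returns them in relevance order) and drops the tail once the budget is spent.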


Phi-4-multimodal for Vision + Audio {#multimodal}

from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)

# Vision input
inputs = processor(
    images=[image],
    text="<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>",
    return_tensors="pt",
).to("cuda")

# Generate, then decode only the newly produced tokens
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

# Audio input works the same way with an <|audio_1|> tag

For pure vision use cases, Llama 3.2 11B Vision and Qwen2-VL 7B are stronger; Phi-4-multimodal's strength is handling all three modalities in one small model.


System Prompts & Sampling {#prompting}

Phi-4 uses a ChatML-style template with one quirk: the role name is followed by an <|im_sep|> token rather than a newline:

<|im_start|>system<|im_sep|>
You are a precise reasoning assistant.<|im_end|>
<|im_start|>user<|im_sep|>
[user message]<|im_end|>
<|im_start|>assistant<|im_sep|>

Recommended sampling:

  • Reasoning / math / code: temperature 0.0-0.3, min-p 0.05, no top-k
  • Chat / general: temperature 0.5-0.7, min-p 0.05
  • Creative writing: temperature 0.8-1.0, min-p 0.05

See LLM Sampling Parameters for full sampler reference.
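The presets above can be captured as Ollama-style option dictionaries (a hypothetical helper; the values follow the ranges listed, picking the middle of each):

```python
# Midpoints of the recommended ranges above, as Ollama "options" dicts
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.2, "min_p": 0.05, "top_k": 0},  # top_k 0 disables top-k
    "chat":      {"temperature": 0.6, "min_p": 0.05},
    "creative":  {"temperature": 0.9, "min_p": 0.05},
}

def options_for(task: str) -> dict:
    """Return a copy so callers can tweak without mutating the preset."""
    return dict(SAMPLING_PRESETS[task])
```

Keeping presets in one place makes it easy to A/B different temperatures for the same prompt.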


Reasoning Tasks: Math, Code, Logic {#reasoning}

For best reasoning output, use chain-of-thought prompting:

Solve this step by step. Show your work.

Problem: ...

Solution:

Or system-level CoT:

<|im_start|>system
For every problem, work through the solution step by step before providing the final answer.
<|im_end|>

Phi-4 was trained to produce CoT-style reasoning natively; it usually does this without explicit prompting.

For competition-grade math (AIME, Olympiad), use DeepSeek-R1-Distill variants — they outperform Phi-4 on hard problems with test-time compute.


Fine-Tuning with QLoRA {#fine-tuning}

Using Unsloth (fastest):

from unsloth import FastLanguageModel
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=32, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Train on your dataset with TRL's SFTTrainer
# (argument names vary slightly across TRL versions; newer versions use SFTConfig)
from trl import SFTTrainer
from transformers import TrainingArguments

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="text", max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2, gradient_accumulation_steps=4,
        num_train_epochs=1, learning_rate=2e-4, bf16=True,
        output_dir="phi4-qlora",
    ),
)
trainer.train()

On RTX 4090: ~2-4 hours for 1K-example QLoRA fine-tune. MIT license means fine-tuned derivatives can be redistributed without restriction.

See LoRA Fine-Tuning Local Guide.


Function Calling and Structured Output {#tools}

Phi-4 supports OpenAI-style tool calling. With vLLM, start the server with --enable-auto-tool-choice plus a --tool-call-parser; with Ollama, pass tools directly through its OpenAI-compatible endpoint:

{
  "model": "phi-4",
  "messages": [...],
  "tools": [{"type": "function", "function": {...}}],
  "tool_choice": "auto"
}

For JSON mode constrained output, use response_format:

{"response_format": {"type": "json_schema", "json_schema": {...}}}

vLLM uses xgrammar for guaranteed schema-valid output.
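A sketch of building a json_schema request and defensively validating the reply client-side (the schema and helper names are illustrative; the payload shape follows the OpenAI response_format convention shown above):

```python
import json

# Hypothetical schema: force the model to return an answer plus a confidence score
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}

def build_request(messages: list) -> dict:
    """Chat payload with a json_schema response_format attached."""
    return {
        "model": "phi-4",
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": ANSWER_SCHEMA},
        },
    }

def parse_reply(raw: str) -> dict:
    """Parse the model's JSON and double-check required keys client-side."""
    data = json.loads(raw)
    missing = [k for k in ANSWER_SCHEMA["required"] if k not in data]
    if missing:
        raise ValueError(f"schema violation, missing keys: {missing}")
    return data
```

Even with server-side grammar constraints, a cheap client-side check like `parse_reply` catches truncated or malformed responses before they reach downstream code.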


Tuning Recipes by GPU {#tuning}

RTX 3060 / 4060 (12 GB)

ollama run phi4:14b-q4_K_M
# Or via Modelfile: Q4_K_M, num_ctx 8192, temperature 0.4

RTX 4070 / 4080 (16 GB)

Q5_K_M, num_ctx 16384.

RTX 4090 / 5090 (24-32 GB)

Q8_0 or FP16, full 16K context, batch multiple users via vLLM.

Mac M2 / M3 / M4

Q4_K_M or Q5_K_M via Ollama Metal — works on 16 GB Mac comfortably.


Real Benchmarks {#benchmarks}

Single-user, RTX 4090, Q5_K_M:

| Test | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 14B |
|---|---|---|---|
| GSM8K | 95.6% | 84.5% | 90.2% |
| MATH | 80.4% | 51.9% | 80.0% |
| HumanEval | 82.6% | 72.6% | 83.5% |
| MMLU | 84.8% | 73.0% | 79.7% |
| Inference tok/s (Q5) | 70 | 127 | 52 |

Phi-4 is slower per token than Llama 8B but answers correctly more often. On hard tasks where a wrong answer costs you a manual check and a retry, that higher first-pass accuracy can matter more than raw token speed.


Licensing {#licensing}

MIT license — most permissive among recent major model releases. You can:

  • Use commercially without restriction
  • Modify and redistribute
  • Bundle into proprietary products
  • Sell as paid service
  • Train derivative models without restriction

Compare to Llama 3.1 (Meta Llama Community License, with a monthly-active-user threshold) and Qwen 2.5 (mostly Apache 2.0, though some sizes such as 3B and 72B ship under the more restrictive Qwen license). For the cleanest commercial position, Phi-4's MIT license is the safest choice.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing ChatML tokens | Use a Modelfile / config that loads the phi-4 template |
| Repetitive output | No min-p | Set min-p 0.05 |
| Verbose / over-explains | No system prompt | Add "be concise" to the system prompt |
| OOM | Context too long | 16K max for Phi-4; lower if needed |
| Slow on Mac | M1 base model | Use Phi-4-mini instead |
| Worse than Llama on chat | Phi-4 is reasoning-tuned | Use Llama for chat, Phi-4 for math/code |



Sources: Phi-4 on Hugging Face | Phi-4 technical report | Microsoft Research | bartowski quants | Internal benchmarks RTX 4090, M4 Max.
