Phi-4 Local Setup Guide (2026): Microsoft's 14B Reasoning Model on 12GB GPUs
Phi-4 is Microsoft's December 2024 release of their efficient-LLM family — a 14.7B-parameter model trained on heavily curated synthetic and academic data rather than raw web crawl. The result: math, reasoning, and code performance that beats Llama 3.1 70B in some benchmarks while running on a 12 GB consumer GPU. MIT licensed for unrestricted commercial use. For users who want strong reasoning capability without the VRAM cost of 70B-class models, Phi-4 is one of the best 14B-class options in 2026.
This guide covers the full Phi-4 family (base 14B, mini 3.8B, multimodal 5.6B), setup across Ollama / vLLM / llama.cpp, prompting techniques for reasoning workloads, fine-tuning with QLoRA / Axolotl / Unsloth, and detailed benchmarks vs Llama 3.1 8B / Qwen 2.5 14B / DeepSeek-R1.
Table of Contents
- What Phi-4 Is
- The Phi-4 Family: Base, Mini, Multimodal
- What Makes Phi-4 Different (Synthetic Data)
- Hardware Requirements & Quantization
- Phi-4 vs Llama 3.1 8B vs Qwen 2.5 14B vs DeepSeek-R1
- Ollama Setup
- llama.cpp Setup with GGUF
- vLLM Setup
- LM Studio / oobabooga / KoboldCpp
- Phi-4-mini for Edge Devices
- Phi-4-multimodal for Vision + Audio
- System Prompts & Sampling
- Reasoning Tasks: Math, Code, Logic
- Fine-Tuning with QLoRA
- Function Calling and Structured Output
- Tuning Recipes by GPU
- Real Benchmarks
- Licensing
- Troubleshooting
- FAQ
What Phi-4 Is {#what-it-is}
Phi-4 (microsoft/phi-4 on HuggingFace) is the December 2024 release of Microsoft's Phi family. Architecture: standard decoder-only transformer with 14.7B parameters, 16K context, ChatML chat template. License: MIT — most permissive of recent major releases.
The Phi family thesis: small models trained on carefully curated data can out-reason much larger models. Phi-4 is the strongest demonstration of that thesis to date.
The Phi-4 Family: Base, Mini, Multimodal {#family}
| Variant | Parameters | Context | VRAM (BF16 / Q4) | Use |
|---|---|---|---|---|
| Phi-4 (base 14B) | 14.7B | 16K | 30 GB / 9 GB | Reasoning, math, code |
| Phi-4-mini (3.8B) | 3.8B | 128K | 8 GB / 2.5 GB | Edge, mobile, real-time |
| Phi-4-multimodal (5.6B) | 5.6B | 128K | 12 GB / 4 GB | Vision + audio + text |
| Phi-4-mini-instruct | 3.8B | 128K | 8 GB / 2.5 GB | Mini for chat |
| Phi-3.5-MoE (legacy) | 41.9B total (6.6B active) | 128K | 80 GB / 24 GB | MoE option |
For most users the base 14B Phi-4 is the right starting point.
What Makes Phi-4 Different (Synthetic Data) {#synthetic-data}
Most LLMs train on web crawl (Common Crawl, Reddit, books). The Phi family trains primarily on:
- Synthetic textbooks generated by larger models with verification
- Academic papers with chain-of-thought reasoning
- Curated code from filtered GitHub
- Math problem sets with verified solutions
- Synthetic dialogues for instruction tuning
The training corpus is smaller (~10T tokens vs Llama 3.1's 15T) but far more information-dense. The result: better reasoning per parameter, at the cost of raw knowledge breadth (less factual world coverage than Llama).
Hardware Requirements & Quantization {#requirements}
| GPU VRAM | Phi-4 Quant | Throughput on RTX 4090 |
|---|---|---|
| 8 GB | Q3_K_M (low quality) | ~85 tok/s |
| 10 GB | Q4_K_S | ~80 tok/s |
| 12 GB | Q4_K_M / Q5_K_S | ~75 tok/s |
| 16 GB | Q5_K_M / Q6_K | ~70 tok/s |
| 24 GB | Q8_0 / FP16 | ~60 tok/s |
For most 12-16 GB consumer GPUs: Q5_K_M is the sweet spot. For RTX 4090 24 GB: Q8_0 if you want highest fidelity, or use the freed VRAM for longer context.
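A quick way to sanity-check these tiers: multiply the parameter count by the quant's average bits per weight. A rough sketch in Python (the bits-per-weight values are approximate averages for llama.cpp K-quants, and real usage adds KV cache and runtime overhead on top):
# Rough weight-only VRAM estimate: parameters * bits-per-weight / 8
# Bits-per-weight values are approximate averages for llama.cpp K-quants
PARAMS = 14.7e9
QUANTS = {"Q3_K_M": 3.9, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}
for name, bpw in QUANTS.items():
    weight_gb = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{weight_gb:.1f} GB for weights, plus KV cache and overhead")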
Phi-4 vs Llama 3.1 8B vs Qwen 2.5 14B vs DeepSeek-R1 {#comparison}
| Benchmark | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 14B | Llama 3.1 70B | DeepSeek-R1 |
|---|---|---|---|---|---|
| MMLU | 84.8 | 73.0 | 79.7 | 86.0 | 90.8 |
| GSM8K (math) | 95.6 | 84.5 | 90.2 | 95.1 | 97.3 |
| MATH | 80.4 | 51.9 | 80.0 | 68.0 | 97.3 |
| HumanEval (code) | 82.6 | 72.6 | 83.5 | 80.5 | 96.2 |
| MMLU-Pro | 70.4 | 48.3 | 63.7 | 60.4 | 84.0 |
| AIME 2024 | 12.0 | 6.7 | 14.0 | 23.3 | 79.8 |
| Throughput (RTX 4090, Q5) | ~70 tok/s | ~127 tok/s | ~52 tok/s | ~8 tok/s (offload) | varies |
| Context length | 16K | 131K | 131K | 131K | 64K |
For everyday math/code/reasoning at fast inference speed: Phi-4 wins on quality-per-VRAM. For competition math: DeepSeek-R1. For long-context: Llama / Qwen. For code specifically: Qwen 2.5 Coder 14B.
Ollama Setup {#ollama}
# Pull Phi-4 14B
ollama pull phi4
# Or specific quant
ollama pull phi4:14b-q5_K_M
# Run
ollama run phi4 "Solve: integral of sin(x)*e^x dx, show work."
Ollama Modelfile customization:
FROM phi4
PARAMETER num_ctx 16384
PARAMETER temperature 0.4
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a precise reasoning assistant. Show your work step by step."""
llama.cpp Setup with GGUF {#llamacpp}
# Download Q5_K_M from bartowski's quants
huggingface-cli download bartowski/phi-4-GGUF \
phi-4-Q5_K_M.gguf \
--local-dir ./models
./llama-cli \
-m models/phi-4-Q5_K_M.gguf \
-ngl 999 \
-c 16384 \
-fa \
--temp 0.4 \
--min-p 0.05 \
-p "Solve this step by step: ..."
For server mode:
./llama-server -m phi-4-Q5_K_M.gguf -ngl 999 -c 16384 --port 8080
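The server exposes an HTTP API you can call from any language. A minimal sketch against the native /completion endpoint (the prompt and token budget are placeholders):
# Send a completion request to llama-server's native HTTP API
import requests
payload = {
    "prompt": "Solve this step by step: differentiate x^3 * ln(x).",
    "n_predict": 512,        # max tokens to generate
    "temperature": 0.4,
    "min_p": 0.05,
}
resp = requests.post("http://localhost:8080/completion", json=payload, timeout=300)
print(resp.json()["content"])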
vLLM Setup {#vllm}
# BF16 (needs 32+ GB VRAM)
vllm serve microsoft/phi-4 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92
# AWQ-INT4 (12+ GB VRAM)
vllm serve casperhansen/phi-4-awq \
--quantization awq \
--max-model-len 16384
For high-throughput serving, see vLLM Complete Setup Guide.
LM Studio / oobabooga / KoboldCpp {#other-runtimes}
All major desktop UIs support Phi-4 as of January 2025:
- LM Studio: search "phi-4" in the model browser, download Q5_K_M, click Load
- oobabooga: download GGUF, place in models/, llama.cpp loader
- KoboldCpp:
./koboldcpp --model phi-4-Q5_K_M.gguf --usecublas
See text-generation-webui guide and KoboldCpp guide.
Phi-4-mini for Edge Devices {#mini}
Phi-4-mini (3.8B) is designed for tight resource budgets:
ollama run phi4-mini
Use cases:
- Mobile via MLC-LLM (Android / iOS): 18-25 tok/s on Snapdragon 8 Gen 3
- Browser via WebLLM: 12-18 tok/s on RTX 4070
- Raspberry Pi 5: 4-6 tok/s for offline assistants (see the llama-cpp-python sketch below)
- Embedded systems for on-device reasoning
128K context window is unusually long for a 3.8B model — useful for RAG over long documents on edge devices.
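For CPU-only boxes like a Raspberry Pi, llama-cpp-python gives you the same GGUF weights behind a small Python API. A minimal sketch (the GGUF filename and thread count are placeholders; adjust to your download and hardware):
# Run Phi-4-mini from Python via llama-cpp-python (CPU-only example)
from llama_cpp import Llama
llm = Llama(
    model_path="phi-4-mini-instruct-Q4_K_M.gguf",  # placeholder filename
    n_ctx=8192,
    n_threads=4,        # match your CPU core count
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the key idea of QLoRA in two sentences."}],
    temperature=0.4,
)
print(out["choices"][0]["message"]["content"])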
Phi-4-multimodal for Vision + Audio {#multimodal}
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)
# Vision input
image = Image.open("example.jpg")  # replace with your image path
inputs = processor(
    images=[image],
    text="<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>",
    return_tensors="pt",
).to("cuda")
# Audio input works the same way with an <|audio_1|> tag
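To decode a response from those inputs, a minimal generation sketch (exact generate kwargs may differ; check the Phi-4-multimodal model card on Hugging Face):
# Generate and decode a reply from the processed vision inputs above
output_ids = model.generate(**inputs, max_new_tokens=256)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])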
For pure vision use cases, Llama 3.2 11B Vision and Qwen 2-VL 7B are stronger; Phi-4-multimodal's strength is doing all three modalities in one small model.
System Prompts & Sampling {#prompting}
ChatML format:
<|im_start|>system
You are a precise reasoning assistant.
<|im_end|>
<|im_start|>user
[user message]
<|im_end|>
<|im_start|>assistant
Recommended sampling:
- Reasoning / math / code: temperature 0.0-0.3, min-p 0.05, no top-k
- Chat / general: temperature 0.5-0.7, min-p 0.05
- Creative writing: temperature 0.8-1.0, min-p 0.05
See LLM Sampling Parameters for full sampler reference.
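If you drive the model from scripts, it can help to keep these recommendations as named presets. A small sketch (the three task types mirror the list above):
# Sampling presets for Phi-4, mirroring the recommendations above
SAMPLING_PRESETS = {
    "reasoning": {"temperature": 0.2, "min_p": 0.05},  # math, code, logic
    "chat": {"temperature": 0.6, "min_p": 0.05},       # general conversation
    "creative": {"temperature": 0.9, "min_p": 0.05},   # writing, brainstorming
}
def sampling_for(task: str) -> dict:
    """Return sampling options for a task type, defaulting to chat."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["chat"])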
Reasoning Tasks: Math, Code, Logic {#reasoning}
For best reasoning output, use chain-of-thought prompting:
Solve this step by step. Show your work.
Problem: ...
Solution:
Or system-level CoT:
<|im_start|>system
For every problem, work through the solution step by step before providing the final answer.
<|im_end|>
Phi-4 was trained to produce CoT-style reasoning natively; it usually does this without explicit prompting.
For competition-grade math (AIME, Olympiad), use DeepSeek-R1-Distill variants — they outperform Phi-4 on hard problems with test-time compute.
Fine-Tuning with QLoRA {#fine-tuning}
Using Unsloth (fastest):
from unsloth import FastLanguageModel
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/phi-4-bnb-4bit",
max_seq_length=4096,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=32, lora_alpha=32, lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
# Train on your dataset
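The training step itself typically goes through TRL's SFTTrainer. A minimal sketch (the dataset path, text field, and hyperparameters are placeholders, and exact argument placement varies slightly across TRL versions):
# Sketch of the training step with TRL's SFTTrainer; values are illustrative
from trl import SFTConfig, SFTTrainer
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",            # column holding the formatted prompts
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="phi4-qlora-out",
    ),
)
trainer.train()
model.save_pretrained("phi4-qlora-adapter")   # saves only the LoRA adapter weights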
On RTX 4090: ~2-4 hours for 1K-example QLoRA fine-tune. MIT license means fine-tuned derivatives can be redistributed without restriction.
See LoRA Fine-Tuning Local Guide.
Function Calling and Structured Output {#tools}
Phi-4 supports OpenAI-style tool calling. Pass a tools array through the OpenAI-compatible API; with vLLM, launch the server with --enable-auto-tool-choice and a matching --tool-call-parser:
{
"model": "phi-4",
"messages": [...],
"tools": [{"type": "function", "function": {...}}],
"tool_choice": "auto"
}
For JSON mode constrained output, use response_format:
{"response_format": {"type": "json_schema", "json_schema": {...}}}
vLLM uses xgrammar for guaranteed schema-valid output.
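A minimal tool-calling request through the OpenAI-compatible endpoint looks like this (the server URL, tool name, and schema are illustrative placeholders):
# Sketch: OpenAI-style tool calling against a local Phi-4 server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # vLLM default port
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="microsoft/phi-4",  # model name as served by your runtime
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)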
Tuning Recipes by GPU {#tuning}
RTX 3060 / 4060 (12 GB)
ollama run phi4:14b-q4_K_M
# Or via Modelfile: Q4_K_M, num_ctx 8192, temperature 0.4
RTX 4070 / 4080 (16 GB)
Q5_K_M, num_ctx 16384.
RTX 4090 / 5090 (24-32 GB)
Q8_0 or FP16, full 16K context, batch multiple users via vLLM.
Mac M2 / M3 / M4
Q4_K_M or Q5_K_M via Ollama Metal — works on 16 GB Mac comfortably.
Real Benchmarks {#benchmarks}
Single-user, RTX 4090, Q5_K_M:
| Test | Phi-4 14B | Llama 3.1 8B | Qwen 2.5 14B |
|---|---|---|---|
| GSM8K | 95.6% | 84.5% | 90.2% |
| MATH | 80.4% | 51.9% | 80.0% |
| HumanEval | 82.6% | 72.6% | 83.5% |
| MMLU | 84.8% | 73.0% | 79.7% |
| Inference tok/s (Q5) | 70 | 127 | 52 |
Phi-4 is slower per token than Llama 3.1 8B but produces correct answers more often, so effective throughput on hard tasks (correct answers per minute rather than tokens per second) is higher.
Licensing {#licensing}
MIT license — most permissive among recent major model releases. You can:
- Use commercially without restriction
- Modify and redistribute
- Bundle into proprietary products
- Sell as paid service
- Train derivative models without restriction
Compare to Llama 3.1 (Meta Llama Community License, with a monthly-active-user threshold) and Qwen 2.5 (mostly Apache 2.0, but the 3B and 72B variants ship under the more restrictive Qwen license). For the strictest commercial cleanliness, Phi-4 is the safest choice.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing ChatML tokens | Use Modelfile / config that loads phi-4 template |
| Repetitive output | No min-p | Set min-p 0.05 |
| Verbose / over-explains | No system prompt | Add "be concise" to system |
| OOM | Context too long | 16K max for Phi-4; lower if needed |
| Slow on Mac | M1 base | Use Phi-4-mini instead |
| Worse than Llama on chat | Phi-4 is reasoning-tuned | Use Llama for chat, Phi-4 for math/code |
FAQ {#faq}
See answers to common Phi-4 questions below.
Sources: Phi-4 on Hugging Face | Phi-4 technical report | Microsoft Research | bartowski quants | Internal benchmarks RTX 4090, M4 Max.