Mistral Small 3 Local Setup Guide (2026): 24B Apache-Licensed Workhorse
Mistral Small 3 is Mistral AI's answer to "what does the right open-weights default look like for production?" 24B parameters, Apache 2.0 license, low-latency tuning for agentic workloads, native function calling, and competitive performance against Llama 3.3 70B at much lower VRAM cost. For commercial deployments where licensing cleanliness matters and performance must beat the 7B-13B class without paying 70B inference costs, Mistral Small 3 hits the sweet spot.
This guide covers everything: setup across Ollama / vLLM / llama.cpp, the Apache 2.0 license advantages, prompting and tool calling, fine-tuning workflows, multimodal Mistral Small 3.1, and detailed benchmarks against Llama 3.3 70B / Phi-4 / Qwen 2.5 14B.
Table of Contents
- What Mistral Small 3 Is
- Apache 2.0 License Implications
- Mistral Small 3 vs Llama 3.3 70B vs Phi-4
- Hardware Requirements
- Ollama Setup
- llama.cpp Setup
- vLLM Setup
- Chat Template and Sampling
- Function Calling / Tool Use
- Mistral Small 3.1 Multimodal
- Multilingual Support
- Fine-Tuning
- Speculative Decoding with Mistral Small
- Real Benchmarks
- Tuning Recipes
- Production Deployment Patterns
- Troubleshooting
- FAQ
What Mistral Small 3 Is {#what-it-is}
Mistral Small 3 (mistralai/Mistral-Small-Instruct-2501 on HuggingFace, January 2025 release) is a 24B-parameter decoder-only transformer trained for low-latency instruction following and tool use. 32K context. Apache 2.0 license.
Mistral's positioning: "small enough for production cost, smart enough for agentic workloads, free enough to deploy anywhere."
Apache 2.0 License Implications {#license}
Apache 2.0 is the most commercial-friendly open-source license. You can:
- Use commercially with no end-user attribution requirement (you only need to retain the license text and copyright notices when redistributing)
- Modify and redistribute
- Bundle into proprietary closed-source products
- Sell as a paid service / API
- Fine-tune and distribute derivatives without copyleft
Compare this to the Llama 3 family's Meta Llama Community License (an MAU threshold for very large companies, plus attribution requirements) and to the Qwen family, where some sizes ship under the Tongyi Qianwen license rather than Apache 2.0 (commercial use allowed, but with extra conditions). For pure commercial cleanliness, Apache 2.0 wins.
Mistral Small 3 vs Llama 3.3 70B vs Phi-4 {#comparison}
| Benchmark | Mistral Small 3 24B | Llama 3.3 70B | Phi-4 14B | Qwen 2.5 14B |
|---|---|---|---|---|
| MMLU | 81.0 | 86.0 | 84.8 | 79.7 |
| MMLU-Pro | 66.3 | 60.4 | 70.4 | 63.7 |
| GSM8K | 90.0 | 95.1 | 95.6 | 90.2 |
| HumanEval | 84.8 | 80.5 | 82.6 | 83.5 |
| Arena Hard | 87.3 | 86.0 | 75.4 | 71.3 |
| Tok/s on RTX 4090 (Q5) | 45 | 7 (offload) | 70 | 52 |
| Context length | 32K | 131K | 16K | 131K |
| License | Apache 2.0 | Llama Community | MIT | Tongyi |
For commercial chat and agent workloads that need fast inference, Mistral Small 3 wins on the combination of Apache 2.0 licensing, speed, and Arena Hard score.
Hardware Requirements {#requirements}
| GPU | Quantization | Tok/s (approx.) |
|---|---|---|
| RTX 3060 12 GB | Q4_K_M | ~30 |
| RTX 4070 16 GB | Q4_K_M / Q5_K_S | ~35 |
| RTX 4080 / 5070 Ti 16 GB | Q5_K_M | ~40 |
| RTX 3090 / 4090 / 7900 XTX 24 GB | Q5_K_M / Q6_K | ~45 |
| RTX 5090 32 GB | Q8_0 | ~55 |
| Pro W7900 / A6000 48 GB | BF16 | ~30 |
| Mac M3 / M4 Pro 32 GB | Q5_K_M | ~25 |
| Mac M4 Max 64 GB | Q8_0 | ~30 |
For most consumer GPUs in 2026: Q5_K_M is the right default.
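As a rough rule of thumb, weight memory ≈ parameters × bits-per-weight ÷ 8: 24B at Q5_K_M (~5.5 bits effective) is about 16.5 GB and at Q4_K_M (~4.8 bits) about 14.5 GB, before adding roughly 1-3 GB of KV cache depending on context length. That is why 24 GB cards run Q5_K_M comfortably at 32K context while 12-16 GB cards drop to Q4_K_M, a shorter context, or partial CPU offload.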
Ollama Setup {#ollama}
ollama run mistral-small
Modelfile customization:
FROM mistral-small
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
SYSTEM """You are a helpful assistant."""
For tool calling, Ollama (v0.3+) exposes /api/chat with native tool support — see the Ollama Tool Calling Guide.
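A minimal sketch of a native tool call against the local Ollama API, assuming the default port 11434 (the get_weather schema is illustrative, not a real service):
import json, requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, wire up your own implementation
        "description": "Get current weather for a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    },
}]

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "mistral-small",
    "messages": [{"role": "user", "content": "What's the weather in Paris right now?"}],
    "tools": tools,
    "stream": False,
})
# If the model decides to call a tool, the reply carries message.tool_calls
print(json.dumps(resp.json()["message"].get("tool_calls", []), indent=2))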
llama.cpp Setup {#llamacpp}
huggingface-cli download bartowski/Mistral-Small-Instruct-2501-GGUF \
Mistral-Small-Instruct-2501-Q5_K_M.gguf \
--local-dir ./models
./llama-cli \
-m models/Mistral-Small-Instruct-2501-Q5_K_M.gguf \
-ngl 999 \
-c 32768 \
-fa \
--chat-template mistral \
--temp 0.7 \
--min-p 0.05
For server mode:
./llama-server -m models/Mistral-Small-Instruct-2501-Q5_K_M.gguf -ngl 999 -c 32768 --port 8080
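llama-server also speaks the OpenAI chat-completions protocol on that port; a quick smoke test (the model field is largely cosmetic since the server answers with whatever GGUF it loaded):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-small", "messages": [{"role": "user", "content": "Say hello in French."}], "temperature": 0.7, "max_tokens": 64}'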
vLLM Setup {#vllm}
# BF16 (48GB+ VRAM)
vllm serve mistralai/Mistral-Small-Instruct-2501 \
--max-model-len 32768
# AWQ-INT4 (12+ GB VRAM)
vllm serve casperhansen/mistral-small-2501-awq \
--quantization awq \
--max-model-len 32768 \
--enable-prefix-caching
For tool calling enablement: --enable-auto-tool-choice --tool-call-parser mistral.
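Once the server is up, any OpenAI-compatible client can talk to it; a minimal sketch with the openai Python package (the model argument must match the name vllm serve registered, port 8000 is the vLLM default):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-Instruct-2501",
    messages=[{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}],
    temperature=0.7,
    max_tokens=200,
)
print(resp.choices[0].message.content)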
Chat Template and Sampling {#chat-template}
Mistral chat template:
<s>[INST] {system + user} [/INST] {assistant}</s>
Or with an explicit system prompt (the V7 template used by the 2501 release):
<s>[SYSTEM_PROMPT] {system} [/SYSTEM_PROMPT] [INST] {user} [/INST]
Recommended sampling:
- General chat: temperature 0.7, min-p 0.05
- Code: temperature 0.2, min-p 0.05
- Reasoning: temperature 0.3, min-p 0.05
- Creative: temperature 1.0, min-p 0.05, dry_multiplier 0.8
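These presets can also be set per request; for example, the code preset through Ollama's API, where options takes the same names as the Modelfile parameters:
curl http://localhost:11434/api/chat -d '{
  "model": "mistral-small",
  "messages": [{"role": "user", "content": "Write a binary search in Python."}],
  "options": {"temperature": 0.2, "min_p": 0.05},
  "stream": false
}'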
Function Calling / Tool Use {#tool-calling}
OpenAI-compatible:
{
"model": "mistral-small",
"messages": [...],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {"type": "object", "properties": {"city": {"type": "string"}}}
}
}],
"tool_choice": "auto"
}
Mistral Small 3 was specifically tuned for low-latency tool calling — typical round-trip on RTX 4090 is <2 seconds for a 5-tool agent loop. For agentic workloads, see AI Agents Local Guide.
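A sketch of that agent loop with the openai client against a local OpenAI-compatible endpoint (base_url assumes the vLLM server above; for Ollama use http://localhost:11434/v1 and model "mistral-small"). get_weather is a stand-in for your own function:
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def get_weather(city: str) -> dict:
    # Placeholder for a real lookup
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lyon?"}]
resp = client.chat.completions.create(model="mistral-small", messages=messages, tools=tools, tool_choice="auto")
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool call in the history
    for call in msg.tool_calls:
        result = get_weather(**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
    final = client.chat.completions.create(model="mistral-small", messages=messages, tools=tools)
    print(final.choices[0].message.content)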
Mistral Small 3.1 Multimodal {#multimodal}
Mistral Small 3.1 (March 2025) adds vision:
# vLLM with vision input
{
"model": "mistralai/Mistral-Small-3.1-Instruct",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
}
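The same request from Python, with the image base64-encoded on the fly (assumes the vLLM server from the earlier section and a local photo.jpg):
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-3.1-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    max_tokens=300,
)
print(resp.choices[0].message.content)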
Same Apache 2.0 license. For vision-heavy use cases, Qwen 3 VL is stronger; for occasional image input on a permissive license, Mistral Small 3.1 is the right choice.
Multilingual Support {#multilingual}
Native support for English, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Chinese, Japanese. Compared to Llama 3.x: Mistral Small 3 has noticeably stronger French and Italian (Mistral is a French company). For multilingual production: solid choice.
Fine-Tuning {#fine-tuning}
QLoRA via Unsloth:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/Mistral-Small-Instruct-2501-bnb-4bit",
max_seq_length=4096,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model, r=32, lora_alpha=32,
target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)
# Train on your dataset
On RTX 4090: ~4-8 hours for 1K-example QLoRA. Apache 2.0 license means derivative weights can be redistributed under any license you choose.
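A minimal training-loop sketch to complete the snippet above, following the usual Unsloth + trl pattern (assumes your dataset has a fully formatted "text" column; exact keyword names can shift between trl versions):
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # your own data

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # column holding the fully formatted prompts
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        output_dir="mistral-small-qlora",
    ),
)
trainer.train()
model.save_pretrained("mistral-small-qlora")  # saves the LoRA adapters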
Speculative Decoding with Mistral Small {#speculative}
Pair Mistral Small 3 24B target with Mistral 7B v0.3 draft (same tokenizer family):
vllm serve mistralai/Mistral-Small-Instruct-2501 \
--speculative-model mistralai/Mistral-7B-Instruct-v0.3 \
--num-speculative-tokens 5
Expected speedup: ~1.7-2.0x at single-user batch size 1. See CUDA Optimization.
Real Benchmarks {#benchmarks}
RTX 4090 Q5_K_M, single user, 4K context:
| Test | Mistral Small 3 | Llama 3.3 70B (offload) | Phi-4 |
|---|---|---|---|
| Tok/s | 45 | 7 | 70 |
| Arena Hard | 87.3 | 86.0 | 75.4 |
| HumanEval | 84.8 | 80.5 | 82.6 |
| GSM8K | 90.0 | 95.1 | 95.6 |
| MMLU | 81.0 | 86.0 | 84.8 |
| Tool calling latency | 1.5s | 8s (offload) | 2.5s |
For chat and agent workloads that need fast single-GPU inference, Mistral Small 3 is hard to beat as of mid-2026.
Tuning Recipes {#tuning}
RTX 4090 / 7900 XTX 24 GB
ollama:
model: mistral-small:24b-instruct-2501-q5_K_M
num_ctx: 32768
flash_attn: true
RTX 4080 16 GB
quantization: Q4_K_M
context: 16384
Apple M4 Max 64 GB
quantization: Q8_0
context: 32768
Production multi-user (vLLM)
vllm serve casperhansen/mistral-small-2501-awq \
--quantization awq \
--max-model-len 32768 \
--enable-prefix-caching --enable-chunked-prefill \
--max-num-seqs 64 \
--kv-cache-dtype fp8_e4m3
Production Deployment Patterns {#production}
For Mistral Small 3 in production:
- vLLM with AWQ on RTX 4090 / A100 — ~1500 tok/s aggregate at 16 concurrent users
- TensorRT-LLM if you need lowest single-stream latency for tool calling
- LiteLLM gateway in front for per-user keys / rate limits (config sketch below)
- Langfuse for tracing
- Open WebUI for chat UI
See vLLM Complete Setup and Ollama Production Deployment.
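A minimal LiteLLM proxy config sketch for the gateway pattern in that list (model names, ports, and the master key are placeholders):
# config.yaml — run with: litellm --config config.yaml
model_list:
  - model_name: mistral-small
    litellm_params:
      model: openai/casperhansen/mistral-small-2501-awq   # route to the OpenAI-compatible vLLM backend
      api_base: http://localhost:8000/v1                   # the vLLM server from above
      api_key: "none"
general_settings:
  master_key: sk-replace-me   # used to mint per-user keys; clients authenticate against the proxy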
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Wrong chat format | Missing [INST] template | Use --chat-template mistral |
| Tool calls malformed | Tool parser missing | vLLM: --tool-call-parser mistral |
| Repetitive output | No min-p | Set min-p 0.05 |
| OOM | Q5 too tight | Drop to Q4_K_M |
| Multilingual quality variance | Lower-resource language | Use English; or fine-tune for target lang |
FAQ {#faq}
See answers to common Mistral Small 3 questions below.
Sources: Mistral Small Instruct on HF | Mistral AI announcement | bartowski quants | Internal benchmarks RTX 4090, M4 Max.