DeepSeek V3 Local Setup Guide (2026): 671B MoE on Workstations and Multi-GPU Rigs
DeepSeek V3 is the model that proved frontier-class LLMs don't require frontier-class budgets. 671B total parameters, 37B activated per token, trained on 14.8T tokens for ~$5.5M — and it matches or beats GPT-4o and Claude 3.5 Sonnet on many benchmarks. The technical breakthroughs: Multi-Head Latent Attention (MLA), DeepSeekMoE with auxiliary-loss-free balancing, native FP8 training, and multi-token prediction. Released under a near-permissive license. For self-hosters with serious GPU budgets, Mac Studio M3 Ultra owners, or anyone who wants to inspect a frontier model's architecture and weights, V3 is the essential 2026 model.
This guide covers what you actually need to run DeepSeek V3 locally — hardware reality, vLLM / SGLang / TensorRT-LLM / llama.cpp setup, the MLA + MoE architecture, distilled variants for consumer GPUs, fine-tuning strategies, and detailed benchmarks vs the closed-source frontier.
Table of Contents
- What DeepSeek V3 Is
- Architecture: MoE + MLA + MTP + FP8
- Hardware Reality Check
- DeepSeek V3 vs GPT-4o vs Claude 3.5 vs Llama 3.1 405B
- SGLang Setup (Recommended)
- vLLM Setup (Multi-GPU)
- TensorRT-LLM Setup
- llama.cpp + GGUF for Mac / CPU
- Quantization Options (FP8, INT8, GGUF)
- Distilled Variants for Consumer GPUs
- Mac Studio M3 Ultra Path
- Fine-Tuning Strategy
- System Prompts & Sampling
- Real Benchmarks
- DeepSeek V3 vs R1
- Licensing
- Troubleshooting
- FAQ
What DeepSeek V3 Is {#what-it-is}
DeepSeek V3 (deepseek-ai/DeepSeek-V3 on HuggingFace) is the December 26, 2024 release of DeepSeek AI's flagship MoE language model. Architecture: 671B total parameters across 256 routed experts + 1 shared expert per layer, with 37B parameters activated per token. 61 transformer layers. 128K native context window. Native FP8 training (one of the first frontier models trained natively in FP8 instead of BF16).
Variants:
- DeepSeek-V3-Base — pre-trained foundation (no instruction tuning)
- DeepSeek-V3 — instruct-tuned chat variant (most users want this)
- DeepSeek-V3-0324 (March 2025 update) — incremental quality improvements
License: weights under DeepSeek Model License (commercial use allowed, with some restrictions); training/inference code under MIT.
Architecture: MoE + MLA + MTP + FP8 {#architecture}
Four key innovations:
Mixture of Experts (DeepSeekMoE)
- 256 routed experts + 1 shared expert per layer
- Top-8 routing: 8 routed experts active per token, plus the always-on shared expert
- Auxiliary-loss-free load balancing (replaces the traditional aux loss with a per-expert routing bias; see the sketch below)
- Result: better expert utilization without quality cost
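To make the routing concrete, here is a minimal PyTorch-style sketch of top-8 selection with bias-based balancing. It follows the paper's description (sigmoid affinities, a per-expert bias used only for selection, a sign-based bias update), but all names, shapes, and the step size are illustrative, not DeepSeek's implementation:
import torch

def route_tokens(hidden, router_weight, expert_bias, top_k=8):
    # hidden: [tokens, d_model], router_weight: [n_experts, d_model], expert_bias: [n_experts]
    scores = torch.sigmoid(hidden @ router_weight.t())          # affinity per token per routed expert
    _, expert_idx = (scores + expert_bias).topk(top_k, dim=-1)  # bias decides WHICH experts fire...
    gate = scores.gather(-1, expert_idx)                        # ...but gate weights use the raw scores
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return expert_idx, gate                                     # dispatch tokens to these experts

def update_bias(expert_bias, tokens_per_expert, step=1e-3):
    # Aux-loss-free balancing: push the bias down for overloaded experts and up for
    # underloaded ones, instead of adding a load-balancing loss to the training objective.
    sign = (tokens_per_expert.float() > tokens_per_expert.float().mean()).float() * 2 - 1
    return expert_bias - step * sign
The shared expert bypasses this routing entirely and sees every token.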
Multi-Head Latent Attention (MLA)
- Compresses K, V into a 512-dim latent representation
- KV cache for 128K context: ~12 GB (vs 60+ GB for standard MHA; rough arithmetic below)
- Negligible quality loss vs full MHA
- Critical for long-context serving
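A rough back-of-the-envelope check of those cache numbers, assuming a BF16 cache, the 512-dim latent plus a 64-dim decoupled RoPE key per token per layer, and 61 layers; serving engines add paging and allocator overhead on top, which is roughly where the ~12 GB figure lands:
# Rough KV-cache sizing for 128K context (illustrative arithmetic only).
layers, tokens, bytes_per_elem = 61, 128 * 1024, 2           # BF16 = 2 bytes per element
mla_dims = 512 + 64                                           # latent KV + decoupled RoPE key
mla_gb = mla_dims * layers * tokens * bytes_per_elem / 1e9
print(f"MLA cache:  ~{mla_gb:.0f} GB")                        # ~9 GB before serving overhead
mha_dims = 128 * 128 * 2                                      # 128 heads x 128 head_dim, K and V
mha_gb = mha_dims * layers * tokens * bytes_per_elem / 1e9
print(f"Full MHA:   ~{mha_gb:.0f} GB")                        # hundreds of GB at the same width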
Multi-Token Prediction (MTP)
- Auxiliary training objective: each position also predicts one additional future token beyond the next token
- Improves data efficiency during pretraining
- At inference, can be used for speculative decoding (~1.8x speedup)
FP8 Native Training
- One of the first frontier-scale models trained primarily in FP8
- ~2x throughput vs BF16 on H100
- Mixed-precision strategy: FP8 for compute-heavy ops, BF16 for accumulation
- Total training cost: ~$5.5M (vs $50-100M+ for comparable closed models)
Hardware Reality Check {#hardware}
| Setup | Quant | Throughput | Cost (used market) |
|---|---|---|---|
| 8x H100 80GB (NVLink) | FP8 native | 60-80 tok/s single | $200K+ |
| 16x H100 80GB | BF16 | 80-100 tok/s | $400K+ |
| 8x A100 80GB | INT8 | 30-50 tok/s | $80K+ |
| 4x H200 141GB | FP8 | 50-70 tok/s | $120K+ |
| Mac Studio M3 Ultra 512GB | Q4_K_M GGUF | 10-15 tok/s | $10K |
| Server 768GB DDR5 (CPU only) | Q4_K_M GGUF | 2-4 tok/s | $8K |
| 2x RTX 3090 + 256GB RAM (offload) | Q3_K_M GGUF | 1-2 tok/s | $4K |
For most self-hosters, the realistic options are: (a) Mac Studio M3 Ultra 512GB at ~$10K — runs Q4_K_M comfortably for single-user workloads; (b) distilled variants on standard GPUs; (c) the DeepSeek API at $0.14/M input tokens; or (d) provisioned cloud H100 clusters for production.
DeepSeek V3 vs GPT-4o vs Claude 3.5 vs Llama 3.1 405B {#comparison}
| Benchmark | DeepSeek V3 | GPT-4o (May 2024) | Claude 3.5 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU | 88.5 | 88.7 | 88.3 | 88.6 |
| MMLU-Pro | 75.9 | 73.3 | 75.1 | 73.3 |
| GPQA Diamond | 59.1 | 49.9 | 65.0 | 51.1 |
| MATH-500 | 90.2 | 76.6 | 78.3 | 73.8 |
| GSM8K | 89.3 | 92.0 | 92.3 | 89.0 |
| HumanEval | 82.6 | 90.2 | 92.0 | 89.0 |
| LiveCodeBench | 40.5 | 36.4 | 36.3 | 35.7 |
| MMLU (Chinese) | 89.1 | 81.4 | 85.4 | 80.0 |
| Context length | 128K | 128K | 200K | 131K |
| Cost / M input tokens | $0.14 | $2.50 | $3.00 | n/a |
DeepSeek V3 wins on knowledge density (MMLU-Pro), math (MATH-500), live coding benchmarks, and Chinese — and is dramatically cheaper to serve. GPT-4o / Claude still lead on raw HumanEval and some chat benchmarks.
SGLang Setup (Recommended) {#sglang}
SGLang has the most optimized DeepSeek V3 implementation — native MLA kernels, MoE-aware scheduling.
pip install --upgrade sglang
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--trust-remote-code \
--enable-torch-compile \
--port 30000
OpenAI-compatible:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "default", "messages": [{"role":"user","content":"Hello"}]}'
For 4x H200 with FP8: --tp 4 --quantization fp8 --kv-cache-dtype fp8_e5m2. Throughput: 60-90 tok/s single-user, 1500+ tok/s aggregate at batch 32.
vLLM Setup (Multi-GPU) {#vllm}
pip install "vllm>=0.7"
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 1 \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--enable-prefix-caching
For FP8 (H100):
vllm serve deepseek-ai/DeepSeek-V3 \
--quantization fp8 \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8_e5m2 \
--max-model-len 65536
vLLM 0.7+ added DeepSeek V3 native support including MLA fast paths. See vLLM Complete Setup Guide.
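For batch jobs you can skip the HTTP server and use vLLM's offline Python API instead; a minimal sketch under the same 8-GPU assumption, with an illustrative prompt and sampling values:
from vllm import LLM, SamplingParams
# Offline engine: same tensor-parallel layout as the serve command above.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,
    trust_remote_code=True,
    max_model_len=65536,
    gpu_memory_utilization=0.9,
)
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
# chat() applies the model's own chat template before generating.
outputs = llm.chat(
    [{"role": "user", "content": "Summarize DeepSeekMoE in two sentences."}],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)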
TensorRT-LLM Setup {#tensorrt}
TRT-LLM 0.16+ supports DeepSeek V3 with FP8 plugins:
git clone -b v0.16.0 https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM/examples/deepseek_v3
# Convert
python convert_checkpoint.py \
--model_dir /models/DeepSeek-V3 \
--output_dir /trt_ckpt/deepseek-v3-fp8 \
--use_fp8 \
--tp_size 8
# Build engine
trtllm-build \
--checkpoint_dir /trt_ckpt/deepseek-v3-fp8 \
--output_dir /trt_engines/deepseek-v3-fp8 \
--gemm_plugin fp8 \
--moe_plugin fp8 \
--max_input_len 32768 \
--max_output_len 4096
Best throughput on H100/H200 clusters but more complex setup than SGLang. See TensorRT-LLM Setup.
llama.cpp + GGUF for Mac / CPU {#llamacpp}
Unsloth produced GGUF quants of the full 671B model:
# Q4_K_M ~340 GB — needs huge disk + RAM
huggingface-cli download unsloth/DeepSeek-V3-GGUF \
DeepSeek-V3-Q4_K_M-00001-of-00009.gguf \
DeepSeek-V3-Q4_K_M-00002-of-00009.gguf \
... \
--local-dir ./models
./llama-cli \
-m models/DeepSeek-V3-Q4_K_M-00001-of-00009.gguf \
-ngl 999 \
-c 16384 \
--temp 0.6 --min-p 0.05 \
-p "Explain MLA in 3 sentences."
For Mac Studio M3 Ultra 512GB:
./llama-cli -m DeepSeek-V3-Q4_K_M.gguf -ngl 999 -c 32768
# 10-15 tok/s, single-user
llama.cpp added MLA support in early 2025; make sure your build is newer than February 2025 for V3 compatibility.
Quantization Options (FP8, INT8, GGUF) {#quants}
| Quant | Size | VRAM | Quality Loss | Best Engine |
|---|---|---|---|---|
| BF16 | 1342 GB | 1500 GB | 0% | vLLM, SGLang (16 GPUs) |
| FP8 native | 671 GB | 750 GB | <0.5% | SGLang, TRT-LLM, vLLM (8 H100) |
| INT8 W8A8 | 671 GB | 750 GB | <1% | vLLM (8 A100) |
| INT4 AWQ | 360 GB | 400 GB | 1-2% | SGLang, vLLM (8 A100/4 H100) |
| GGUF Q5_K_M | 470 GB | 500 GB | <1% | llama.cpp (Mac, CPU) |
| GGUF Q4_K_M | 340 GB | 380 GB | 1-2% | llama.cpp (Mac, CPU) |
| GGUF Q3_K_M | 280 GB | 320 GB | 3-5% | llama.cpp (extreme low VRAM) |
| GGUF Q2_K | 230 GB | 280 GB | 5-10% | Last resort |
For most production: FP8 native via SGLang on 8x H100. For Mac: Q4_K_M GGUF. See AWQ vs GPTQ vs GGUF.
Distilled Variants for Consumer GPUs {#distilled}
For self-hosters who can't run full V3, the distilled R1/V3-derived variants are the practical path:
| Variant | Base | VRAM (Q4) | Quality vs V3 |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5 GB | ~40% of V3/R1 |
| R1-Distill-Qwen-7B | Qwen 2.5 | 5 GB | ~55% |
| R1-Distill-Llama-8B | Llama 3.1 | 5 GB | ~55% |
| R1-Distill-Qwen-14B | Qwen 2.5 | 9 GB | ~70% |
| R1-Distill-Qwen-32B | Qwen 2.5 | 20 GB | ~85% |
| R1-Distill-Llama-70B | Llama 3.1 | 40 GB | ~92% |
Setup is identical to base Qwen / Llama serving — no MoE complexity. For 95% of self-hosted reasoning workloads on consumer GPUs, the 32B distilled variant on a single RTX 4090 is the right answer. See DeepSeek R1 Local Setup.
Mac Studio M3 Ultra Path {#mac-studio}
The Mac Studio M3 Ultra with 512 GB unified memory is currently the cheapest single-machine way to run full DeepSeek V3:
brew install llama.cpp
huggingface-cli download unsloth/DeepSeek-V3-GGUF DeepSeek-V3-Q4_K_M-* --local-dir ~/models
llama-server \
-m ~/models/DeepSeek-V3-Q4_K_M-00001-of-00009.gguf \
-ngl 999 \
-c 32768 \
--port 8080
Real-world performance: 10-15 tok/s single-user, 8K-32K usable context. Power: ~140W idle, ~280W peak — far more efficient than 8x H100 (~3500W). For solo developers / researchers / small studios who want frontier-model access without cloud costs, the M3 Ultra is the most practical 2026 setup. See Apple Silicon AI Buying Guide.
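llama-server exposes an OpenAI-compatible endpoint, so the same client code you would point at a GPU cluster works against the Mac; a minimal sketch (the port matches the command above, and the api_key is a placeholder):
from openai import OpenAI
# llama-server from the command above listens on port 8080.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deepseek-v3",   # name is informational; the loaded GGUF is used regardless
    messages=[{"role": "user", "content": "Give me three uses for a 512GB Mac Studio."}],
    temperature=0.6,
)
print(resp.choices[0].message.content)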
Fine-Tuning Strategy {#fine-tuning}
Full fine-tuning of V3 671B is impractical for self-hosters. Realistic options:
Option 1: Distilled Variant Fine-Tuning
Use DeepSeek-R1-Distill-Qwen-32B as base, do standard QLoRA on your data. ~85% of V3-class reasoning at 1/20 the compute cost. See QLoRA Fine-Tuning Guide.
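A minimal QLoRA sketch for the 32B distill, assuming the transformers + peft + bitsandbytes + trl stack; the dataset file, column layout, and hyperparameters are illustrative, not a tuned recipe:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
# 4-bit NF4 load so the 32B base fits on a single large GPU (48 GB is comfortable, 24 GB is tight).
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
# LoRA adapters on attention and MLP projections; rank and dropout are illustrative.
lora = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"])
# Hypothetical dataset file with a "text" column; adjust dataset_text_field for your schema.
dataset = load_dataset("json", data_files="my_domain_data.jsonl", split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="r1-32b-qlora", dataset_text_field="text",
                   per_device_train_batch_size=1, gradient_accumulation_steps=8,
                   num_train_epochs=1, bf16=True),
)
trainer.train()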
Option 2: API + Prompt Caching
DeepSeek API supports prompt caching — repeated system prompts and few-shot examples cost ~$0.014/M tokens (10% of fresh price). For most domain adaptation: prompt engineering + caching beats fine-tuning.
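The DeepSeek API is OpenAI-compatible, and caching kicks in automatically when a request prefix repeats, so the main trick is keeping the long static part of the prompt byte-identical across calls. A minimal sketch (the system prompt content is illustrative):
from openai import OpenAI
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_API_KEY")
# Long, static instructions + few-shot examples: keep these byte-identical across
# requests so the repeated prefix is billed at the cached rate.
SYSTEM = "You are a contracts analyst. ... (long domain instructions and worked examples)"
def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="deepseek-chat",            # DeepSeek V3 behind the API
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": question}],
        temperature=0.3,
    )
    return resp.choices[0].message.content
print(ask("Summarize the termination-clause risks in the attached excerpt."))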
Option 3: Continued Pretraining (research labs)
Take V3-Base and run continued pretraining on your domain corpus with FSDP across 32+ H100s. Total cost: $50K-500K depending on data size. Only justified if you're building a vertical AI product.
System Prompts & Sampling {#prompting}
Chat template:
<|begin▁of▁sentence|>[optional system message]<|User|>[user message]<|Assistant|>[assistant reply]<|end▁of▁sentence|>
Most engines auto-handle this from the tokenizer config — don't construct manually.
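If you need the exact rendered string anyway (for logging or token counting), pull it from the tokenizer rather than hand-assembling tags; a short sketch with the Hugging Face tokenizer:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain MLA in 3 sentences."},
]
# Renders the official chat template, including the generation prompt for the assistant turn.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)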
Recommended sampling:
- Chat / general: temperature 0.6, top-p 0.95
- Code/reasoning: temperature 0.3, top-p 0.95
- Creative writing: temperature 0.8
DeepSeek V3 follows instructions cleanly; verbose system prompts work. For best results in agentic loops, give detailed step-by-step instructions in the system prompt rather than relying on the model to infer them.
Real Benchmarks {#benchmarks}
8x H100 80GB cluster, FP8 via SGLang:
| Workload | Throughput |
|---|---|
| Single-user (1 conversation) | 65 tok/s |
| Batch 8 concurrent | 480 tok/s aggregate |
| Batch 32 concurrent | 1850 tok/s aggregate |
| TTFT (1K input prompt) | 280 ms |
| TTFT (32K input prompt) | 1.2 s |
Mac Studio M3 Ultra 512GB, Q4_K_M GGUF:
| Workload | Throughput |
|---|---|
| Single-user, 4K context | 14 tok/s |
| Single-user, 32K context | 9 tok/s |
| TTFT (1K input prompt) | 4.2 s |
| Power draw | ~280W peak |
DeepSeek V3 vs R1 {#v3-vs-r1}
| Aspect | V3 | R1 |
|---|---|---|
| Training | Standard SFT + RLHF | RL with verifiable rewards |
| Output | Direct response | Long thinking + response |
| AIME 2024 | 39.2% | 79.8% |
| GPQA | 59.1% | 71.5% |
| LiveCodeBench | 40.5% | 65.9% |
| Speed | Fast (no thinking) | Slow (thinking tokens) |
| Best for | Chat, agents, general | Hard reasoning, math, code |
For mixed workloads: route easy questions to V3, hard reasoning to R1. For most self-hosters: distilled R1-32B / R1-70B captures most of R1's value at consumer-GPU cost.
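A trivial sketch of that routing split (the keyword heuristic and thresholds are purely illustrative; in practice you would use a cheap classifier or let callers tag requests):
# Hypothetical helper: pick a model based on a crude difficulty heuristic.
REASONING_HINTS = ("prove", "step by step", "derive", "optimal", "debug", "complexity")
def pick_model(prompt: str) -> str:
    hard = any(hint in prompt.lower() for hint in REASONING_HINTS) or len(prompt) > 2000
    return "deepseek-reasoner" if hard else "deepseek-chat"   # R1 vs V3 on the DeepSeek API
print(pick_model("What's the capital of France?"))             # deepseek-chat
print(pick_model("Prove this scheduling policy is optimal."))  # deepseek-reasoner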
Licensing {#licensing}
Code: MIT license — fully unrestricted.
Weights: DeepSeek Model License (similar in spirit to Llama Community License but without MAU threshold). You can:
- Use commercially without per-user limits
- Modify and redistribute
- Bundle into proprietary products
- Train derivative models
Review the use restrictions in the full license; they primarily prohibit military use, CSAM generation, and other narrow categories that are standard in modern model licenses.
For most commercial deployments: license is functionally close to Apache 2.0. For maximum legal cleanliness: OLMo 2 (Apache 2.0) or Mistral (Apache 2.0 variants) are unambiguous.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM with 8x H100 BF16 | BF16 needs 16x H100 | Use FP8 native via SGLang/vLLM |
| Slow MoE routing | Old vLLM/SGLang | Upgrade to vLLM 0.7+ / SGLang latest |
| Wrong chat format | Custom template | Use --trust-remote-code to load DeepSeek tokenizer |
| MLA not engaged | llama.cpp pre-Feb 2025 | Build llama.cpp from latest main |
| TRT-LLM build fails | TRT < 0.16 | Upgrade TensorRT-LLM to 0.16+ |
| Mac M3 Ultra OOM | macOS RAM limits | sudo sysctl iogpu.wired_limit_mb=458752 to allow 448 GB GPU use |
| Tool calls malformed | DeepSeek tool format | Use --tool-call-parser deepseek_v3 in vLLM |
FAQ {#faq}
See answers to common DeepSeek V3 questions below.
Sources: DeepSeek V3 paper (arXiv 2412.19437) | DeepSeek V3 on HuggingFace | DeepSeek R1 paper | SGLang DeepSeek V3 docs | Unsloth DeepSeek V3 GGUF | Internal benchmarks 8x H100 cluster + Mac Studio M3 Ultra.