DeepSeek · Open-Weight · MIT Licensed
DeepSeek V4: Run a Frontier-Class Model on Your Own Hardware
DeepSeek V4 shipped on April 24, 2026, in two sizes — V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B / 13B active). Both are MIT licensed, both have 1-million-token context windows, and both can be downloaded from Hugging Face and run on your own hardware. This is the closest open-weight equivalent to closed frontier models like Gemini 3.1 Pro. Below: hardware requirements, Ollama/vLLM/llama.cpp setup, and how V4 actually compares to Claude Sonnet 5 and GPT-5.5.
Why this matters: DeepSeek V4 is the only open-weight model with a 1M-token context window that ships under MIT license. No closed competitor offers self-hosting; no other open-weight model matches the spec. If you need frontier capabilities without sending data to OpenAI, Anthropic, or Google, this is the answer.
Key takeaways
- → Two sizes: V4-Pro 1.6T/49B active for serious infra; V4-Flash 284B/13B active for prosumer hardware.
- → 1M context — the only open-weight model that matches Gemini 3.1 Pro on context length.
- → MIT licensed — unlimited commercial use, no royalties, no usage caps.
- → 82.6% SWE-Bench Verified (V4-Pro) — within 10 points of Claude Sonnet 5.
- → Runs on Ollama, llama.cpp, vLLM — V4-Flash works on 4× RTX 5090 at Q4.
Quick verdict
DeepSeek V4-Flash is the new default for serious self-hosters. If you have a multi-GPU rig — even a consumer 4× RTX 5090 build — V4-Flash gives you 1M context, MIT licensing, and within ~10% of frontier closed-model quality at zero per-token cost.
V4-Pro is for research labs and infrastructure providers. The 1.6T MoE needs 8× H100 minimum; most teams get more value from running V4-Flash and reaching for Claude Sonnet 5 / GPT-5.5 via API for the hardest 5-10% of problems.
Specs at a glance
| Property | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1,650 billion (MoE) | 284 billion (MoE) |
| Active parameters | 49B per token | 13B per token |
| Experts per layer | 256 (top-8 routed) | 64 (top-8 routed) |
| Context window | 1,000,000 tokens | 1,000,000 tokens |
| License | MIT | MIT |
| Storage (BF16) | ~3.2 TB | ~570 GB |
| Storage (Q4_K_M) | ~800 GB | ~150 GB |
| Hugging Face | deepseek-ai/DeepSeek-V4-Pro | deepseek-ai/DeepSeek-V4-Flash |
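The "top-8 routed" figure explains the gap between total and active parameters: a router picks 8 of the 256 (Pro) or 64 (Flash) experts in each layer for every token, so only a small slice of the weights does work on any given token. A toy sketch of that routing step — illustrative only, with made-up dimensions rather than V4's actual architecture:
import torch

def moe_layer(x, router_w, experts, k=8):
    """Toy top-k MoE routing: each token is processed by only k experts.
    x: (tokens, d_model), router_w: (d_model, n_experts), experts: list of callables."""
    gate = torch.softmax(x @ router_w, dim=-1)              # (tokens, n_experts)
    weights, chosen = torch.topk(gate, k, dim=-1)           # top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the k picked

    out = torch.zeros_like(x)
    for slot in range(k):
        for e in chosen[:, slot].unique():                  # dispatch tokens to expert e
            mask = chosen[:, slot] == e
            out[mask] += weights[mask, slot, None] * experts[int(e)](x[mask])
    return out

# 64 tiny "experts"; only 8 of them touch each token
experts = [torch.nn.Linear(512, 512) for _ in range(64)]
y = moe_layer(torch.randn(4, 512), torch.randn(512, 64), experts)
Only the chosen experts' weights are multiplied against each token, which is how a 1.65T-parameter model can do roughly 49B parameters' worth of compute per token. The full weight set still has to be resident, though, so storage and VRAM needs follow the total parameter count.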
V4-Pro vs V4-Flash: pick by hardware
DeepSeek V4-Flash
- ✓ Fits on 4× RTX 5090 at Q4 (~$10K consumer rig)
- ✓ Or 2× H100 / 1× B200 (~$25K prosumer)
- ✓ 78.4% SWE-Bench Verified — strong coding
- ✓ 1M context with stable long-context recall
- ✓ Recommended starting point for self-hosters
DeepSeek V4-Pro
- ✓ Needs 8× H100 / 4× B200 minimum (~$200K+)
- ✓ 82.6% SWE-Bench Verified — closer to Sonnet 5
- ✓ Better long-context recall above 500K tokens
- ✓ Best for AI providers / research labs
- ✓ Most users don't need this — start with Flash
Hardware requirements
| Configuration | Hardware | Quantization | Tokens/sec (estimated) |
|---|---|---|---|
| V4-Flash budget | 4× RTX 5090 (32GB each, 128GB total) | Q4_K_M (~150 GB) | 25-40 tok/s |
| V4-Flash sweet spot | 2× H100 80GB (160GB total) | Q5_K_M or Q6 | 60-90 tok/s |
| V4-Flash production | 1× B200 180GB | BF16 (~570 GB) or FP8 | 100-180 tok/s |
| V4-Pro minimum | 8× H100 80GB | Q4_K_M (~800 GB) | 40-65 tok/s |
| V4-Pro production | 4× B200 180GB | FP8 / BF16 | 90-150 tok/s |
Tokens/sec assumes single-user inference at 8K context. With KV-cache offloading, throughput drops 30-50% once you exceed VRAM. For multi-user serving, vLLM's continuous batching boosts aggregate throughput 5-15× depending on workload.
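If your rig isn't in the table, a back-of-envelope check is: parameter count × bits per weight ÷ 8, plus headroom for KV cache (which grows with context length) and activations. A rough sketch — the bits-per-weight values are approximations, and real GGUF/FP8 checkpoints mix tensor precisions, so expect the published file sizes above to differ somewhat:
# Rough weight-footprint estimate in GB: params (billions) * bits per weight / 8.
# Bits-per-weight figures are approximate; real checkpoint files mix precisions.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def weight_gb(params_billion: float, quant: str) -> float:
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for name, params in (("V4-Flash", 284), ("V4-Pro", 1650)):
    for quant in ("Q4_K_M", "Q5_K_M", "FP8", "BF16"):
        print(f"{name:8s} {quant:7s} ~{weight_gb(params, quant):5.0f} GB weights + KV cache")
Compare the result against your aggregate VRAM; once weights plus KV cache spill into system RAM, throughput drops sharply, as noted above.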
Local setup
Option 1 · Ollama (easiest)
Best for: getting V4-Flash running in 10 minutes. Uses GGUF backend (llama.cpp under the hood).
# Install Ollama (skip if you have it)
curl -fsSL https://ollama.com/install.sh | sh
# Pull the Q4 quantization (~150 GB)
ollama pull deepseek-v4-flash:13b-q4
# Run it
ollama run deepseek-v4-flash
# Or expose an API endpoint
ollama serve
# curl http://localhost:11434/api/generate -d '{"model":"deepseek-v4-flash","prompt":"hello"}'
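If you'd rather call the endpoint from code than curl, here's a minimal sketch against Ollama's chat API, assuming the deepseek-v4-flash tag pulled above (Ollama also exposes an OpenAI-compatible endpoint at /v1 if you prefer that client):
import requests

# Non-streaming chat call against the local Ollama server (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": "Summarize what an MoE model is."}],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])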
Option 2 · llama.cpp (max performance, manual)
Best for: when you need exact quantization control or are on Apple Silicon / non-CUDA hardware.
# Build llama.cpp with CUDA support (CMake; the old Makefile build has been removed)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Download GGUF (precomputed by community on HF)
huggingface-cli download bartowski/DeepSeek-V4-Flash-GGUF \
DeepSeek-V4-Flash-Q4_K_M.gguf \
--local-dir models
# Serve an OpenAI-compatible API on :8080
# -ngl 99 offloads all layers to the GPU; lower -c if you are VRAM-limited
./build/bin/llama-server \
  -m models/DeepSeek-V4-Flash-Q4_K_M.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080
Option 3 · vLLM (production / multi-user)
Best for: serving many users. Continuous batching gives 5-15× aggregate throughput vs Ollama/llama.cpp.
pip install vllm
# Serve V4-Flash on 2× H100 with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 2 \
--max-model-len 1000000 \
--gpu-memory-utilization 0.92 \
--port 8000
# OpenAI-compatible: use any OpenAI client with base_url=http://...:8000/v1
For full production deployment patterns including multi-GPU sharding, KV-cache management, and high-throughput serving, see our Local AI Deployment course — full GitHub repo included.
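Whichever backend you run, the quickest smoke test is to point the standard OpenAI Python client at the local endpoint. A minimal sketch against the vLLM server above — for Ollama use base_url http://localhost:11434/v1, for llama-server http://localhost:8080/v1, and set the model name to whatever your server reports:
from openai import OpenAI

# vLLM ignores the API key by default, but the client requires a non-empty string
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",   # model name as served by vLLM
    messages=[{"role": "user", "content": "Write a haiku about self-hosting."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)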
Benchmarks
| Benchmark | V4-Pro | V4-Flash | Claude Sonnet 5 | GPT-5.5 | DeepSeek V3.1 |
|---|---|---|---|---|---|
| SWE-Bench Verified | 82.6% | 78.4% | 92.4% | 85.1% | 68.4% |
| MMLU-Pro | 86.3% | 83.8% | 87.9% | 90.1% | 81.4% |
| GPQA Diamond | 81.4% | 76.9% | 85.7% | 86.0% | 71.2% |
| AIME 2025 | 88.7% | 82.4% | 91.5% | 95.2% | 79.6% |
| ARC-AGI-2 | 59.8% | 52.4% | 68.4% | 71.3% | 48.7% |
| Aider polyglot (coding) | 79.3% | 74.1% | 87.1% | 81.4% | 66.8% |
Sources: DeepSeek V4 technical report (Apr 2026), SWE-Bench Verified leaderboard, Artificial Analysis, Aider public benchmarks.
When to pick DeepSeek V4
- ✓ You need frontier-class capability without sending data to a third party.
- ✓ You want predictable monthly costs (one-time hardware vs per-token API).
- ✓ You need 1M-token context and don't want to pay Gemini 3.1 Pro's API rate.
- ✓ You're building a product where the model is core IP and MIT licensing matters.
When to use a closed model instead
- → You need the absolute best coding quality → Claude Sonnet 5.
- → Hard math / reasoning where every percent counts → GPT-5.5.
- → You don't have multi-GPU hardware and don't want to manage infra → use any closed-API model.
Frequently asked questions
What is DeepSeek V4?
How much VRAM do I need to run DeepSeek V4 locally?
How do I install DeepSeek V4 with Ollama?
How does DeepSeek V4 compare to Claude Sonnet 5 and GPT-5.5?
Why MIT license matters for DeepSeek V4
V4-Pro vs V4-Flash: which should I download?
Can DeepSeek V4 replace ChatGPT or Claude in my workflow?
How does DeepSeek V4 do MoE routing?
Run DeepSeek V4 in production
Local AI Master's deployment course covers full V4 production setup — multi-GPU sharding, KV-cache management, vLLM tuning, and OpenAI-compatible serving. Real production code, full GitHub repo.
See the deployment course →
Related models
- → DeepSeek V3 vs V3.1 — predecessors, still solid for coding
- → GLM-5 — 745B/44B MoE, MIT license, smaller hardware footprint
- → Qwen3-Coder-Next — best self-hostable coding model
- → Kimi K2.6 — open-weight 1T MoE, Moonshot
- → Gemini 3.1 Pro — closest closed alternative on context length
- → Best AI models May 2026: complete comparison