DeepSeek · Open-Weight · MIT Licensed
DeepSeek V4: Run a Frontier-Class Model on Your Own Hardware
DeepSeek V4 shipped on April 24, 2026, in two sizes — V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B / 13B active). Both are MIT licensed, both have 1-million-token context windows, and both can be downloaded from Hugging Face and run on your own hardware. This is the closest open-weight equivalent to closed frontier models like Gemini 3.1 Pro. Below: hardware requirements, Ollama/vLLM/llama.cpp setup, and how V4 actually compares to Claude Sonnet 5 and GPT-5.5.
Why this matters: DeepSeek V4 is the only open-weight model with a 1M-token context window that ships under MIT license. No closed competitor offers self-hosting; no other open-weight model matches the spec. If you need frontier capabilities without sending data to OpenAI, Anthropic, or Google, this is the answer.
Key takeaways
- → Two sizes: V4-Pro 1.6T/49B active for serious infra; V4-Flash 284B/13B active for prosumer hardware.
- → 1M context — the only open-weight model that matches Gemini 3.1 Pro on context length.
- → MIT licensed — unlimited commercial use, no royalties, no usage caps.
- → 82.6% SWE-Bench Verified (V4-Pro) — within 10 points of Claude Sonnet 5.
- → Runs on Ollama, llama.cpp, vLLM — V4-Flash works on 4× RTX 5090 at Q4.
Quick verdict
DeepSeek V4-Flash is the new default for serious self-hosters. If you have a multi-GPU rig — even a consumer 4× RTX 5090 build — V4-Flash gives you 1M context, MIT licensing, and within ~10% of frontier closed-model quality at zero per-token cost.
V4-Pro is for research labs and infrastructure providers. The 1.6T MoE needs 8× H100 minimum; most teams get more value from running V4-Flash and reaching for Claude Sonnet 5 / GPT-5.5 via API for the hardest 5-10% of problems.
Specs at a glance
| Property | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1,650 billion (MoE) | 284 billion (MoE) |
| Active parameters | 49B per token | 13B per token |
| Experts per layer | 256 (top-8 routed) | 64 (top-8 routed) |
| Context window | 1,000,000 tokens | 1,000,000 tokens |
| License | MIT | MIT |
| Storage (BF16) | ~3.2 TB | ~570 GB |
| Storage (Q4_K_M) | ~800 GB | ~150 GB |
| Hugging Face | deepseek-ai/DeepSeek-V4-Pro | deepseek-ai/DeepSeek-V4-Flash |
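The "top-8 routed" figure explains the gap between total and active parameters: a router picks 8 of the 256 (Pro) or 64 (Flash) experts in each layer for every token, so only a small slice of the weights does work on any given token. A toy sketch of that routing step — illustrative only, with made-up dimensions rather than V4's actual architecture:
import torch

def moe_layer(x, router_w, experts, k=8):
    """Toy top-k MoE routing: each token is processed by only k experts.
    x: (tokens, d_model), router_w: (d_model, n_experts), experts: list of callables."""
    gate = torch.softmax(x @ router_w, dim=-1)              # (tokens, n_experts)
    weights, chosen = torch.topk(gate, k, dim=-1)           # top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the k picked

    out = torch.zeros_like(x)
    for slot in range(k):
        for e in chosen[:, slot].unique():                  # dispatch tokens to expert e
            mask = chosen[:, slot] == e
            out[mask] += weights[mask, slot, None] * experts[int(e)](x[mask])
    return out

# 64 tiny "experts"; only 8 of them touch each token
experts = [torch.nn.Linear(512, 512) for _ in range(64)]
y = moe_layer(torch.randn(4, 512), torch.randn(512, 64), experts)
Only the chosen experts' weights are multiplied against each token, which is how a 1.65T-parameter model can do roughly 49B parameters' worth of compute per token. The full weight set still has to be resident, though, so storage and VRAM needs follow the total parameter count.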
V4-Pro vs V4-Flash: pick by hardware
DeepSeek V4-Flash
- ✓ Fits on 4× RTX 5090 at Q4 (~$10K consumer rig)
- ✓ Or 2× H100 / 1× B200 (~$25K prosumer)
- ✓ 78.4% SWE-Bench Verified — strong coding
- ✓ 1M context with stable long-context recall
- ✓ Recommended starting point for self-hosters
DeepSeek V4-Pro
- ✓ Needs 8× H100 / 4× B200 minimum (~$200K+)
- ✓ 82.6% SWE-Bench Verified — closer to Sonnet 5
- ✓ Better long-context recall above 500K tokens
- ✓ Best for AI providers / research labs
- ✓ Most users don't need this — start with Flash
Hardware requirements
| Configuration | Hardware | Quantization | Tokens/sec (estimated) |
|---|---|---|---|
| V4-Flash budget | 4× RTX 5090 (32GB each, 128GB total) | Q4_K_M (~150 GB) | 25-40 tok/s |
| V4-Flash sweet spot | 2× H100 80GB (160GB total) | Q5_K_M or Q6 | 60-90 tok/s |
| V4-Flash production | 1× B200 180GB | BF16 (~570 GB) or FP8 | 100-180 tok/s |
| V4-Pro minimum | 8× H100 80GB | Q4_K_M (~800 GB) | 40-65 tok/s |
| V4-Pro production | 4× B200 180GB | FP8 / BF16 | 90-150 tok/s |
Tokens/sec assumes single-user inference at 8K context. With KV-cache offloading, throughput drops 30-50% once you exceed VRAM. For multi-user serving, vLLM's continuous batching boosts aggregate throughput 5-15× depending on workload.
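If your rig isn't in the table, a back-of-envelope check is: parameter count × bits per weight ÷ 8, plus headroom for KV cache (which grows with context length) and activations. A rough sketch — the bits-per-weight values are approximations, and real GGUF/FP8 checkpoints mix tensor precisions, so expect the published file sizes above to differ somewhat:
# Rough weight-footprint estimate in GB: params (billions) * bits per weight / 8.
# Bits-per-weight figures are approximate; real checkpoint files mix precisions.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def weight_gb(params_billion: float, quant: str) -> float:
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for name, params in (("V4-Flash", 284), ("V4-Pro", 1650)):
    for quant in ("Q4_K_M", "Q5_K_M", "FP8", "BF16"):
        print(f"{name:8s} {quant:7s} ~{weight_gb(params, quant):5.0f} GB weights + KV cache")
Compare the result against your aggregate VRAM; once weights plus KV cache spill into system RAM, throughput drops sharply, as noted above.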
Local setup
Option 1 · Ollama (easiest)
Best for: getting V4-Flash running in 10 minutes. Uses GGUF backend (llama.cpp under the hood).
# Install Ollama (skip if you have it)
curl -fsSL https://ollama.com/install.sh | sh
# Pull the Q4 quantization (~150 GB)
ollama pull deepseek-v4-flash:13b-q4
# Run it
ollama run deepseek-v4-flash
# Or expose an API endpoint
ollama serve
# curl http://localhost:11434/api/generate -d '{"model":"deepseek-v4-flash","prompt":"hello"}'
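If you'd rather call the endpoint from code than curl, here's a minimal sketch against Ollama's chat API, assuming the deepseek-v4-flash tag pulled above (Ollama also exposes an OpenAI-compatible endpoint at /v1 if you prefer that client):
import requests

# Non-streaming chat call against the local Ollama server (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": "Summarize what an MoE model is."}],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])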
Option 2 · llama.cpp (max performance, manual)
Best for: when you need exact quantization control or are on Apple Silicon / non-CUDA hardware.
# Build llama.cpp with CUDA support (CMake; the old Makefile build has been removed)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Download GGUF (precomputed by community on HF)
huggingface-cli download bartowski/DeepSeek-V4-Flash-GGUF \
DeepSeek-V4-Flash-Q4_K_M.gguf \
--local-dir models
# Serve an OpenAI-compatible API on :8080
# -ngl 99 offloads all layers to the GPU; lower -c if you are VRAM-limited
./build/bin/llama-server \
  -m models/DeepSeek-V4-Flash-Q4_K_M.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080
Option 3 · vLLM (production / multi-user)
Best for: serving many users. Continuous batching gives 5-15× aggregate throughput vs Ollama/llama.cpp.
pip install vllm
# Serve V4-Flash on 2× H100 with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 2 \
--max-model-len 1000000 \
--gpu-memory-utilization 0.92 \
--port 8000
# OpenAI-compatible: use any OpenAI client with base_url=http://...:8000/v1
For full production deployment patterns including multi-GPU sharding, KV-cache management, and high-throughput serving, see our Local AI Deployment course — full GitHub repo included.
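Whichever backend you run, the quickest smoke test is to point the standard OpenAI Python client at the local endpoint. A minimal sketch against the vLLM server above — for Ollama use base_url http://localhost:11434/v1, for llama-server http://localhost:8080/v1, and set the model name to whatever your server reports:
from openai import OpenAI

# vLLM ignores the API key by default, but the client requires a non-empty string
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",   # model name as served by vLLM
    messages=[{"role": "user", "content": "Write a haiku about self-hosting."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)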
Benchmarks
| Benchmark | V4-Pro | V4-Flash | Claude Sonnet 5 | GPT-5.5 | DeepSeek V3.1 |
|---|---|---|---|---|---|
| SWE-Bench Verified | 82.6% | 78.4% | 92.4% | 85.1% | 68.4% |
| MMLU-Pro | 86.3% | 83.8% | 87.9% | 90.1% | 81.4% |
| GPQA Diamond | 81.4% | 76.9% | 85.7% | 86.0% | 71.2% |
| AIME 2025 | 88.7% | 82.4% | 91.5% | 95.2% | 79.6% |
| ARC-AGI-2 | 59.8% | 52.4% | 68.4% | 71.3% | 48.7% |
| Aider polyglot (coding) | 79.3% | 74.1% | 87.1% | 81.4% | 66.8% |
Sources: DeepSeek V4 technical report (Apr 2026), SWE-Bench Verified leaderboard, Artificial Analysis, Aider public benchmarks.
When to pick DeepSeek V4
- ✓ You need frontier-class capability without sending data to a third party.
- ✓ You want predictable monthly costs (one-time hardware vs per-token API).
- ✓ You need 1M-token context and don't want to pay Gemini 3.1 Pro's API rate.
- ✓ You're building a product where the model is core IP and MIT licensing matters.
When to use a closed model instead
- → You need the absolute best coding quality → Claude Sonnet 5.
- → Hard math / reasoning where every percent counts → GPT-5.5.
- → You don't have multi-GPU hardware and don't want to manage infra → use any closed-API model.
Frequently asked questions
What is DeepSeek V4?
How much VRAM do I need to run DeepSeek V4 locally?
How do I install DeepSeek V4 with Ollama?
How does DeepSeek V4 compare to Claude Sonnet 5 and GPT-5.5?
Why MIT license matters for DeepSeek V4
V4-Pro vs V4-Flash: which should I download?
Can DeepSeek V4 replace ChatGPT or Claude in my workflow?
How does DeepSeek V4 do MoE routing?
Run DeepSeek V4 in production
Local AI Master's deployment course covers full V4 production setup — multi-GPU sharding, KV-cache management, vLLM tuning, and OpenAI-compatible serving. Real production code, full GitHub repo.
See the deployment course →
Related models
- → DeepSeek V3 vs V3.1 — predecessors, still solid for coding
- → GLM-5 — 745B/44B MoE, MIT license, smaller hardware footprint
- → Qwen3-Coder-Next — best self-hostable coding model
- → Kimi K2.6 — open-weight 1T MoE, Moonshot
- → Gemini 3.1 Pro — closest closed alternative on context length
- → Best AI models May 2026: complete comparison