
DeepSeek · Open-Weight · MIT Licensed

DeepSeek V4: Run a Frontier-Class Model on Your Own Hardware

DeepSeek V4 shipped April 24, 2026 in two sizes — V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B / 13B active). Both are MIT licensed, both have 1-million-token context windows, and both can be downloaded from Hugging Face and run on your own hardware. This is the closest open-weight equivalent to closed frontier models like Gemini 3.1 Pro. Below: hardware requirements, Ollama/vLLM/llama.cpp setup, and how V4 actually compares to Claude Sonnet 5 and GPT-5.5.

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed

Why this matters: DeepSeek V4 is the only open-weight model with a 1M-token context window that ships under MIT license. No closed competitor offers self-hosting; no other open-weight model matches the spec. If you need frontier capabilities without sending data to OpenAI, Anthropic, or Google, this is the answer.

Key takeaways

  • Two sizes: V4-Pro 1.6T/49B active for serious infra; V4-Flash 284B/13B active for prosumer hardware.
  • 1M context — only open-weight model that matches Gemini 3.1 Pro on context length.
  • MIT licensed — unlimited commercial use, no royalties, no usage caps.
  • 82.6% SWE-Bench Verified (V4-Pro) — within 10 points of Claude Sonnet 5.
  • Runs on Ollama, llama.cpp, vLLM — V4-Flash works on 4× RTX 5090 at Q4.

Quick verdict

DeepSeek V4-Flash is the new default for serious self-hosters. If you have a multi-GPU rig — even a consumer 4× RTX 5090 build — V4-Flash gives you 1M context, MIT licensing, and within ~10% of frontier closed-model quality at zero per-token cost.

V4-Pro is for research labs and infrastructure providers. The 1.6T MoE needs 8× H100 minimum; most teams get more value from running V4-Flash and reaching for Claude Sonnet 5 / GPT-5.5 via API for the hardest 5-10% of problems.

Specs at a glance

Property           | V4-Pro                       | V4-Flash
-------------------|------------------------------|------------------------------
Total parameters   | 1,650 billion (MoE)          | 284 billion (MoE)
Active parameters  | 49B per token                | 13B per token
Experts per layer  | 256 (top-8 routed)           | 64 (top-8 routed)
Context window     | 1,000,000 tokens             | 1,000,000 tokens
License            | MIT                          | MIT
Storage (BF16)     | ~3.2 TB                      | ~570 GB
Storage (Q4_K_M)   | ~800 GB                      | ~150 GB
Hugging Face       | deepseek-ai/DeepSeek-V4-Pro  | deepseek-ai/DeepSeek-V4-Flash

V4-Pro vs V4-Flash: pick by hardware

DeepSeek V4-Flash

  • ✓ Fits on 4× RTX 5090 at Q4 (~$10K consumer rig)
  • ✓ Or 2× H100 / 1× B200 (~$25K prosumer)
  • ✓ 78.4% SWE-Bench Verified — strong coding
  • ✓ 1M context with stable long-context recall
  • ✓ Recommended starting point for self-hosters

DeepSeek V4-Pro

  • ✓ Needs 8× H100 / 4× B200 minimum (~$200K+)
  • ✓ 82.6% SWE-Bench Verified — closer to Sonnet 5
  • ✓ Better long-context recall above 500K tokens
  • ✓ Best for AI providers / research labs
  • ✓ Most users don't need this — start with Flash

Hardware requirements

Configuration        | Hardware                                | Quantization           | Tokens/sec (estimated)
---------------------|-----------------------------------------|------------------------|-----------------------
V4-Flash budget      | 4× RTX 5090 (32 GB each, 128 GB total)  | Q4_K_M (~150 GB)       | 25-40 tok/s
V4-Flash sweet spot  | 2× H100 80 GB (160 GB total)            | Q5_K_M or Q6           | 60-90 tok/s
V4-Flash production  | 1× B200 180 GB                          | BF16 (~570 GB) or FP8  | 100-180 tok/s
V4-Pro minimum       | 8× H100 80 GB                           | Q4_K_M (~800 GB)       | 40-65 tok/s
V4-Pro production    | 4× B200 180 GB                          | FP8 / BF16             | 90-150 tok/s

Tokens/sec assumes single-user inference at 8K context. With KV-cache offloading, throughput drops 30-50% once you exceed VRAM. For multi-user serving, vLLM's continuous batching boosts aggregate throughput 5-15× depending on workload.

Local setup

Option 1 · Ollama (easiest)

Best for: getting V4-Flash running in 10 minutes. Uses GGUF backend (llama.cpp under the hood).

# Install Ollama (skip if you have it)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the Q4 quantization (~150 GB)
ollama pull deepseek-v4-flash:13b-q4

# Run it
ollama run deepseek-v4-flash

# Or expose an API endpoint
ollama serve
# curl http://localhost:11434/api/generate -d '{"model":"deepseek-v4-flash","prompt":"hello"}'
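
Ollama also exposes an OpenAI-compatible endpoint on the same port, so you can script against it in Python instead of curl. A minimal sketch, assuming `ollama serve` is running locally, the `deepseek-v4-flash` tag pulled above, and `pip install openai` done first:

# Query Ollama's OpenAI-compatible endpoint (the client requires an api_key, but it is ignored locally)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain KV-cache offloading in two sentences."}],
)
print(resp.choices[0].message.content)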

Option 2 · llama.cpp (max performance, manual)

Best for: exact quantization control, or Apple Silicon and other non-CUDA hardware (llama.cpp also ships Metal, ROCm, and CPU backends; the example below shows a CUDA build).

# Build llama.cpp with CUDA support (the project builds with CMake)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

# Download GGUF (precomputed by community on HF)
huggingface-cli download bartowski/DeepSeek-V4-Flash-GGUF \
  DeepSeek-V4-Flash-Q4_K_M.gguf \
  --local-dir models

# Serve OpenAI-compatible API on :8080
# -ngl 99 offloads all layers to GPU; lower -c if you are VRAM-limited
./build/bin/llama-server \
  -m models/DeepSeek-V4-Flash-Q4_K_M.gguf \
  -ngl 99 \
  -c 32768 \
  --host 0.0.0.0 --port 8080

Option 3 · vLLM (production / multi-user)

Best for: serving many users. Continuous batching gives 5-15× aggregate throughput vs Ollama/llama.cpp.

pip install vllm

# Serve V4-Flash on 2× H100 with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --max-model-len 1000000 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# OpenAI-compatible: use any OpenAI client with base_url=http://...:8000/v1
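
Because vLLM speaks the OpenAI protocol, client code only needs a base_url override. A minimal streaming sketch, assuming the server above is up on localhost:8000 and serving the V4-Flash model id:

# Stream a chat completion from the vLLM server started above
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key unless --api-key is set

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Review this function for off-by-one errors: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)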

For full production deployment patterns including multi-GPU sharding, KV-cache management, and high-throughput serving, see our Local AI Deployment course — full GitHub repo included.

Benchmarks

Benchmark                | V4-Pro | V4-Flash | Claude Sonnet 5 | GPT-5.5 | DeepSeek V3.1
-------------------------|--------|----------|-----------------|---------|--------------
SWE-Bench Verified       | 82.6%  | 78.4%    | 92.4%           | 85.1%   | 68.4%
MMLU-Pro                 | 86.3%  | 83.8%    | 87.9%           | 90.1%   | 81.4%
GPQA Diamond             | 81.4%  | 76.9%    | 85.7%           | 86.0%   | 71.2%
AIME 2025                | 88.7%  | 82.4%    | 91.5%           | 95.2%   | 79.6%
ARC-AGI-2                | 59.8%  | 52.4%    | 68.4%           | 71.3%   | 48.7%
Aider polyglot (coding)  | 79.3%  | 74.1%    | 87.1%           | 81.4%   | 66.8%

Sources: DeepSeek V4 technical report (Apr 2026), SWE-Bench Verified leaderboard, Artificial Analysis, Aider public benchmarks.

When to pick DeepSeek V4

  • You need frontier-class capability without sending data to a third party.
  • You want predictable monthly costs (one-time hardware vs per-token API).
  • You need 1M-token context and don't want to pay Gemini 3.1 Pro's API rate.
  • You're building a product where the model is core IP and MIT licensing matters.

When to use a closed model instead

  • You need the absolute best coding quality → Claude Sonnet 5.
  • Hard math / reasoning where every percent counts → GPT-5.5.
  • You don't have multi-GPU hardware and don't want to manage infra → use any closed-API model.

Frequently asked questions

What is DeepSeek V4?
DeepSeek V4 is the April 24, 2026 successor to DeepSeek V3.1, shipped in two sizes: V4-Pro (1.6 trillion total parameters with 49 billion active per forward pass) and V4-Flash (284 billion total / 13 billion active). Both are Mixture-of-Experts models with 1-million-token context windows, MIT licensed for unrestricted commercial use, and downloadable from Hugging Face. They're the closest open-weight alternative to closed frontier models like Gemini 3.1 Pro and GPT-5.5.
How much VRAM do I need to run DeepSeek V4 locally?
DeepSeek V4-Pro requires ~3.2 TB of storage for full BF16 weights and 8× H100 (80 GB each) or 4× B200 (180 GB) for production inference. Q4 quantization brings it to ~800 GB and 4× H100 / 2× B200. V4-Flash is much more accessible: ~570 GB BF16, ~150 GB Q4, runs on 2× H100 or 4× RTX 5090 (32 GB each) at Q4. For consumer hardware, V4-Flash quantized to Q3 or with CPU offloading can run on a single high-end workstation with 256 GB RAM and one RTX 5090, though tokens-per-second will be modest.
How do I install DeepSeek V4 with Ollama?
Ollama supports V4-Flash out of the box (V4-Pro needs custom GGUF builds). Install Ollama, pull the model, and run: `ollama pull deepseek-v4-flash:13b-q4` then `ollama run deepseek-v4-flash`. The default Q4 quant fits in ~150 GB and runs on multi-GPU consumer rigs. For V4-Pro, build GGUF weights from the HF release using `convert_hf_to_gguf.py` from llama.cpp and serve with `llama-server -m deepseek-v4-pro.Q4_K_M.gguf -ngl 99`. Full setup walkthrough is in our local-AI deployment course.
How does DeepSeek V4 compare to Claude Sonnet 5 and GPT-5.5?
On most benchmarks DeepSeek V4-Pro lands within 5-10% of the closed frontier leaders. Examples: SWE-Bench Verified — V4-Pro 82.6%, Sonnet 5 92.4%, GPT-5.5 85.1%. ARC-AGI-2 — V4-Pro 59.8%, Sonnet 5 68.4%, GPT-5.5 71.3%. MMLU-Pro — V4-Pro 86.3%, Sonnet 5 87.9%, GPT-5.5 90.1%. The gap is real but small; for many production workloads it's well worth the privacy, cost, and offline-operation benefits of self-hosting. V4-Pro also matches Gemini 3.1 Pro on context length (1M tokens) — no other open-weight model has that.
Why MIT license matters for DeepSeek V4
MIT is the most permissive open-source license — you can use DeepSeek V4 commercially with no restrictions, no royalties, no usage caps, no acceptable-use policies, and no required disclosure of derivatives. Compare this to Llama 4 (modified license with usage thresholds), Gemma 3 (Google's "Gemma Terms"), or proprietary models (you can't self-host at all). For startups and enterprises, MIT means you can fine-tune, distill, embed in products, or rebrand without legal review. This is why DeepSeek consistently leads downloads on Hugging Face within 48 hours of release.
V4-Pro vs V4-Flash: which should I download?
V4-Flash (284B/13B active) is the right choice for ~95% of self-hosters. It runs on 2× H100 or 4× RTX 5090 at Q4 quantization, scores 78.4% on SWE-Bench Verified (within 4-5 points of V4-Pro), and has the same 1M context. V4-Pro is for serious infrastructure: research labs, AI providers, or teams with 8× H100 / 4× B200 minimum. V4-Pro adds 2-4% on most benchmarks and significantly better long-context recall above 500K tokens. For most users: start with V4-Flash, upgrade to V4-Pro only if you have the hardware and you're hitting V4-Flash's ceiling on real workloads.
Can DeepSeek V4 replace ChatGPT or Claude in my workflow?
For coding, research, and content tasks, yes — most users running V4-Flash report it handles 70-85% of what they'd previously use ChatGPT or Claude for. The 15-30% gap is the hardest problems: novel algorithm design, multi-file refactors with subtle dependencies, and ambiguous specifications where the model needs to reason carefully. For those cases, most teams keep a Claude Sonnet 5 or GPT-5.5 API account for the hard 20% and use V4 locally for the routine 80%. Typical cost savings: 60-85% versus pure API usage, plus full data privacy.
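
That hybrid setup can be a thin routing layer, since both the local server and the closed APIs speak the OpenAI protocol. A rough sketch, not a recommended implementation: the endpoints, model ids, and the caller-supplied `hard` flag are all illustrative placeholders.

# Hypothetical hybrid router: local V4-Flash for routine work, a closed model for hard cases
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # self-hosted V4-Flash (vLLM, as above)
cloud = OpenAI()                                                      # closed frontier API via OPENAI_API_KEY

def complete(prompt: str, hard: bool = False) -> str:
    # Escalate only when the caller flags a hard problem; any smarter heuristic slots in here.
    if hard:
        client, model = cloud, "gpt-5.5"  # placeholder model id
    else:
        client, model = local, "deepseek-ai/DeepSeek-V4-Flash"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("Rename variable x to user_count in this snippet: ..."))   # routine -> local V4
print(complete("Design a cache-invalidation scheme for ...", hard=True))  # hard -> cloud API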
How does DeepSeek V4 do MoE routing?
DeepSeek V4-Pro has 256 experts per layer; V4-Flash has 64. Each token activates the top-8 experts based on a learned router network, contributing to the 49B (Pro) or 13B (Flash) active-parameter count. Compared to V3, V4 introduces "shared experts" — a small set of always-active experts that handle common token patterns, reducing the cold-start cost of routing decisions. Load balancing uses an auxiliary loss similar to V3's, plus a new "expert lock" mechanism that prevents collapse during long-context inference. Net result: ~30% better throughput on multi-turn agentic workloads vs V3.1.
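
The routing scheme described here (a learned router picking the top-8 of N experts per token, plus a set of always-active shared experts) is easy to sketch at the tensor level. The toy below is illustrative only, not DeepSeek's implementation: the dimensions, shared-expert count, and per-token loop are assumptions, and the load-balancing loss and expert-lock mechanism are omitted.

# Toy top-k MoE layer with shared experts (illustrative sketch, not DeepSeek's code)
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=1024, n_experts=64, top_k=8, n_shared=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # learned router
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.shared = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_shared)])
        self.top_k = top_k

    def forward(self, x):                                # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)       # routing probabilities over experts
        weights, idx = scores.topk(self.top_k, dim=-1)   # top-8 experts per token
        shared_out = sum(s(x) for s in self.shared)      # shared experts run on every token
        routed = []
        for t in range(x.size(0)):                       # naive per-token dispatch; real kernels batch this
            routed.append(sum(w * self.experts[int(e)](x[t]) for w, e in zip(weights[t], idx[t])))
        return shared_out + torch.stack(routed)

x = torch.randn(4, 1024)                                 # 4 tokens
print(ToyMoELayer()(x).shape)                            # torch.Size([4, 1024])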

Run DeepSeek V4 in production

Local AI Master's deployment course covers full V4 production setup — multi-GPU sharding, KV-cache management, vLLM tuning, and OpenAI-compatible serving. Real production code, full GitHub repo.

See the deployment course →
