★ Reading this for free? Get 20 structured AI courses + per-chapter AI tutor — the first chapter of every course free, no card.Start free in 30 seconds

DeepSeek · Open-Weight · MIT Licensed

DeepSeek V4: Run a Frontier-Class Model on Your Own Hardware

DeepSeek V4 shipped April 24, 2026 in two sizes — V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B / 13B active). Both are MIT licensed, both have 1-million-token context windows, and both can be downloaded from Hugging Face and run on your own hardware. This is the closest open-weight equivalent to closed frontier models like Gemini 3.1 Pro. It's also the rare frontier model you can rentcheaply — DeepSeek's own API runs V4-Flash at $0.14/$0.28 and V4-Pro at $0.435/$0.87 per million tokens (roughly 28-34× cheaper output than GPT-5.5 or Claude Opus 4.7). Below: hardware requirements, Ollama/vLLM/llama.cpp setup, the cloud-API pricing, and how V4 actually compares to Claude Sonnet 4.6 and GPT-5.5.

📅 Published: May 9, 2026🔄 Last Updated: May 9, 2026✓ Manually Reviewed

Why this matters: DeepSeek V4 is the only open-weight model with a 1M-token context window that ships under MIT license. No closed competitor offers self-hosting; no other open-weight model matches the spec. If you need frontier capabilities without sending data to OpenAI, Anthropic, or Google, this is the answer.

Key takeaways

  • Two sizes: V4-Pro 1.6T/49B active for serious infra; V4-Flash 284B/13B active for prosumer hardware.
  • 1M context — only open-weight model that matches Gemini 3.1 Pro on context length.
  • MIT licensed — unlimited commercial use, no royalties, no usage caps.
  • 82.6% SWE-Bench Verified (V4-Pro) — within a few points of GPT-5.5, and ahead of Claude Sonnet 4.6's 79.6%.
  • Runs on Ollama, llama.cpp, vLLM — V4-Flash works on 4× RTX 5090 at Q4.

Quick verdict

DeepSeek V4-Flash is the new default for serious self-hosters. If you have a multi-GPU rig — even a consumer 4× RTX 5090 build — V4-Flash gives you 1M context, MIT licensing, and within ~10% of frontier closed-model quality at zero per-token cost.

V4-Pro is for research labs and infrastructure providers. The 1.6T MoE needs 8× H100 minimum; most teams get more value from running V4-Flash and reaching for Claude Sonnet 4.6 / GPT-5.5 via API for the hardest 5-10% of problems.

Specs at a glance

PropertyV4-ProV4-Flash
Total parameters1,650 billion (MoE)284 billion (MoE)
Active parameters49B per token13B per token
Experts per layer256 (top-8 routed)64 (top-8 routed)
Context window1,000,000 tokens1,000,000 tokens
LicenseMITMIT
Storage (BF16)~3.2 TB~570 GB
Storage (Q4_K_M)~800 GB~150 GB
Hugging Facedeepseek-ai/DeepSeek-V4-Prodeepseek-ai/DeepSeek-V4-Flash

V4-Pro vs V4-Flash: pick by hardware

DeepSeek V4-Flash

  • ✓ Fits on 4× RTX 5090 at Q4 (~$10K consumer rig)
  • ✓ Or 2× H100 / 1× B200 (~$25K prosumer)
  • ✓ 78.4% SWE-Bench Verified — strong coding
  • ✓ 1M context with stable long-context recall
  • ✓ Recommended starting point for self-hosters

DeepSeek V4-Pro

  • ✓ Needs 8× H100 / 4× B200 minimum (~$200K+)
  • ✓ 82.6% SWE-Bench Verified — ahead of Sonnet 4.6, near GPT-5.5
  • ✓ Better long-context recall above 500K tokens
  • ✓ Best for AI providers / research labs
  • ✓ Most users don't need this — start with Flash

Hardware requirements

ConfigurationHardwareQuantizationTokens/sec (estimated)
V4-Flash budget4× RTX 5090 (32GB each, 128GB total)Q4_K_M (~150 GB)25-40 tok/s
V4-Flash sweet spot2× H100 80GB (160GB total)Q5_K_M or Q660-90 tok/s
V4-Flash production1× B200 180GBBF16 (~570 GB) or FP8100-180 tok/s
V4-Pro minimum8× H100 80GBQ4_K_M (~800 GB)40-65 tok/s
V4-Pro production4× B200 180GBFP8 / BF1690-150 tok/s

Tokens/sec assumes single-user inference at 8K context. With KV-cache offloading, throughput drops 30-50% once you exceed VRAM. For multi-user serving, vLLM's continuous batching boosts aggregate throughput 5-15× depending on workload.

Local setup

Option 1 · Ollama (easiest)

Best for: getting V4-Flash running in 10 minutes. Uses GGUF backend (llama.cpp under the hood).

# Install Ollama (skip if you have it)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the Q4 quantization (~150 GB)
ollama pull deepseek-v4-flash:13b-q4

# Run it
ollama run deepseek-v4-flash

# Or expose an API endpoint
ollama serve
# curl http://localhost:11434/api/generate -d '{"model":"deepseek-v4-flash","prompt":"hello"}'

Option 2 · llama.cpp (max performance, manual)

Best for: when you need exact quantization control or are on Apple Silicon / non-CUDA hardware.

# Build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
make GGML_CUDA=1 -j

# Download GGUF (precomputed by community on HF)
huggingface-cli download bartowski/DeepSeek-V4-Flash-GGUF \
  DeepSeek-V4-Flash-Q4_K_M.gguf \
  --local-dir models

# Serve OpenAI-compatible API on :8080
./llama-server \
  -m models/DeepSeek-V4-Flash-Q4_K_M.gguf \
  -ngl 99 \        # offload all layers to GPU
  -c 32768 \       # context (set lower if VRAM-limited)
  --host 0.0.0.0 --port 8080

Option 3 · vLLM (production / multi-user)

Best for: serving many users. Continuous batching gives 5-15× aggregate throughput vs Ollama/llama.cpp.

pip install vllm

# Serve V4-Flash on 2× H100 with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --max-model-len 1000000 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# OpenAI-compatible: use any OpenAI client with base_url=http://...:8000/v1

For full production deployment patterns including multi-GPU sharding, KV-cache management, and high-throughput serving, see our Local AI Deployment course — full GitHub repo included.

Benchmarks

BenchmarkV4-ProV4-FlashClaude Sonnet 4.6GPT-5.5DeepSeek V3.1
SWE-Bench Verified82.6%78.4%79.6%85.1%68.4%
MMLU-Pro86.3%83.8%87.9%90.1%81.4%
GPQA Diamond81.4%76.9%85.7%86.0%71.2%
AIME 202588.7%82.4%91.5%95.2%79.6%
ARC-AGI-259.8%52.4%68.4%71.3%48.7%
Aider polyglot (coding)79.3%74.1%87.1%81.4%66.8%

Sources: DeepSeek V4 technical report (Apr 2026), SWE-Bench Verified leaderboard, Artificial Analysis, Aider public benchmarks. · DeepSeek-V4-Pro on Hugging Face

DeepSeek V4 cloud API pricing

Everything above is about self-hosting — but V4 is the rare frontier model you can also rent from the vendor for almost nothing. As of June 2026, DeepSeek's official API prices V4 in two tiers, both with the full 1M context included at no surcharge:

DeepSeek API (per 1M tokens)Input (cache miss)Input (cache hit)Output
V4-Flash$0.14~$0.0028$0.28
V4-Pro$0.435~$0.0036$0.87

V4-Pro's $0.435 / $0.87 is the 75%-off launch price DeepSeek made permanent on May 22, 2026 (original list was $1.74 / $3.48). Context caching is automatic — cache hits drop input to roughly 1/50th of the cache-miss rate, so repeated-prefix workloads (RAG, long system prompts, agentic loops) get even cheaper. Third-party hosts vary: OpenRouter, for example, has listed V4-Flash around $0.09 / $0.18, but those are reseller rates, not DeepSeek's own.

One correction worth flagging: there was never a “DeepSeek R2.” V4 isthe 2026 release. DeepSeek folded reasoning into the main line — V4-Pro and V4-Flash both ship with a thinking/non-thinking mode — so if you're waiting on R2, you're waiting on a model that was never shipped.

Self-host vs cheap API: which is actually cheaper?

Because V4 is both downloadable and rentable for pennies, the usual “self-host to escape API bills” math is different here than it is for closed models. The API is so cheap that, for most individuals and small teams, renting V4 beats buying GPUs on pure cost — self-hosting wins on privacy, control, and offline operation, not necessarily on dollars.

Path to running V4Up-frontMarginal costBest when
DeepSeek API (V4-Flash)$0$0.14 / $0.28 per MtokLow/spiky volume, no privacy requirement, fastest start
DeepSeek API (V4-Pro)$0$0.435 / $0.87 per MtokHardest tasks without owning 8× H100
Self-host V4-Flash~$10K (4× RTX 5090) up to ~$25KPower + amortized hardwareData can't leave your network, steady high volume, offline
Self-host V4-Pro~$200K+ (8× H100 / 4× B200)Power + amortized hardwareAI providers / research labs serving V4-Pro at scale

Rough break-even: a $10K V4-Flash rig only beats the V4-Flash API once you're burning well over a billion output tokens before the hardware is obsolete — so for most people the API is the cheaperand easier option, and self-hosting earns its keep through privacy, latency control, and no third-party dependency rather than raw price. The same workload on GPT-5.5 ($30 output) or Claude Opus 4.7 ($25 output) costs roughly 28-34× more per output token than V4-Pro — which is exactly why V4 is the default “cheap frontier” API even for teams that never plan to self-host.

When to pick DeepSeek V4

  • You need frontier-class capability without sending data to a third party.
  • You want predictable monthly costs (one-time hardware vs per-token API).
  • You need 1M-token context and don't want to pay Gemini 3.1 Pro's API rate.
  • You're building a product where the model is core IP and MIT licensing matters.

When to use a closed model instead

  • You need the absolute best coding quality → GPT-5.5 (top SWE-Bench score).
  • You want frontier-class quality at the lowest price → Claude Sonnet 4.6 ($3/$15 per Mtok, 1M context — best value of the closed flagships).
  • Hard math / reasoning where every percent counts → GPT-5.5.
  • You don't have multi-GPU hardware and don't want to manage infra → use any closed-API model.

Frequently asked questions

What is DeepSeek V4?
DeepSeek V4 is the April 24, 2026 successor to DeepSeek V3.1, shipped in two sizes: V4-Pro (1.6 trillion total parameters with 49 billion active per forward pass) and V4-Flash (284 billion total / 13 billion active). Both are Mixture-of-Experts models with 1-million-token context windows, MIT licensed for unrestricted commercial use, and downloadable from Hugging Face. They're the closest open-weight alternative to closed frontier models like Gemini 3.1 Pro and GPT-5.5.
How much VRAM do I need to run DeepSeek V4 locally?
DeepSeek V4-Pro requires ~3.2 TB of storage for full BF16 weights and 8× H100 (80 GB each) or 4× B200 (180 GB) for production inference. Q4 quantization brings it to ~800 GB and 4× H100 / 2× B200. V4-Flash is much more accessible: ~570 GB BF16, ~150 GB Q4, runs on 2× H100 or 4× RTX 5090 (32 GB each) at Q4. For consumer hardware, V4-Flash quantized to Q3 or with CPU offloading can run on a single high-end workstation with 256 GB RAM and one RTX 5090, though tokens-per-second will be modest.
How do I install DeepSeek V4 with Ollama?
Ollama supports V4-Flash out of the box (V4-Pro needs custom GGUF builds). Install Ollama, pull the model, and run: `ollama pull deepseek-v4-flash:13b-q4` then `ollama run deepseek-v4-flash`. The default Q4 quant fits in ~150 GB and runs on multi-GPU consumer rigs. For V4-Pro, build GGUF weights from the HF release using `convert-hf-to-gguf.py` from llama.cpp and serve with `llama-server -m deepseek-v4-pro.Q4_K_M.gguf -ngl 99`. Full setup walkthrough is in our local-AI deployment course.
How does DeepSeek V4 compare to Claude Sonnet 4.6 and GPT-5.5?
On most benchmarks DeepSeek V4-Pro lands within a few points of the closed frontier leaders, and on raw coding it actually edges some of them. Examples: SWE-Bench Verified — V4-Pro 82.6%, Sonnet 4.6 79.6%, GPT-5.5 85.1%. ARC-AGI-2 — V4-Pro 59.8%, Sonnet 4.6 68.4%, GPT-5.5 71.3%. MMLU-Pro — V4-Pro 86.3%, Sonnet 4.6 87.9%, GPT-5.5 90.1%. Note Sonnet 4.6 isn't the top coder here — its real strength is value (near-flagship quality at roughly 1/8 the price of the top models, $3/$15, with a 1M context). The gap to the leaders is real but small; for many production workloads V4 is well worth the privacy, cost, and offline-operation benefits of self-hosting. V4-Pro also matches Gemini 3.1 Pro on context length (1M tokens) — no other open-weight model has that.
Why MIT license matters for DeepSeek V4
MIT is the most permissive open-source license — you can use DeepSeek V4 commercially with no restrictions, no royalties, no usage caps, no acceptable-use policies, and no required disclosure of derivatives. Compare this to Llama 4 (modified license with usage thresholds), Gemma 3 (Google's "Gemma Terms"), or proprietary models (you can't self-host at all). For startups and enterprises, MIT means you can fine-tune, distill, embed in products, or rebrand without legal review. This is why DeepSeek consistently leads downloads on Hugging Face within 48 hours of release.
V4-Pro vs V4-Flash: which should I download?
V4-Flash (284B/13B active) is the right choice for ~95% of self-hosters. It runs on 2× H100 or 4× RTX 5090 at Q4 quantization, scores 78.4% on SWE-Bench Verified (within 4-5 points of V4-Pro), and has the same 1M context. V4-Pro is for serious infrastructure: research labs, AI providers, or teams with 8× H100 / 4× B200 minimum. V4-Pro adds 2-4% on most benchmarks and significantly better long-context recall above 500K tokens. For most users: start with V4-Flash, upgrade to V4-Pro only if you have the hardware and you're hitting V4-Flash's ceiling on real workloads.
Can DeepSeek V4 replace ChatGPT or Claude in my workflow?
For coding, research, and content tasks, yes — most users running V4-Flash report it handles 70-85% of what they'd previously use ChatGPT or Claude for. The 15-30% gap is the hardest problems: novel algorithm design, multi-file refactors with subtle dependencies, and ambiguous specifications where the model needs to reason carefully. For those cases, most teams keep a Claude Sonnet 4.6 or GPT-5.5 API account for the hard 20% and use V4 locally for the routine 80%. Typical cost savings: 60-85% versus pure API usage, plus full data privacy.
How much does the DeepSeek V4 cloud API cost?
V4 is unusually cheap to rent. As of June 2026, DeepSeek's official API charges V4-Flash $0.14 input / $0.28 output per million tokens, and V4-Pro $0.435 input / $0.87 output per million tokens. (V4-Pro's rate is the 75%-off launch price DeepSeek made permanent on May 22, 2026 — original list was $1.74 / $3.48.) Context caching cuts cache-hit input to roughly $0.0028/M on Flash and $0.0036/M on Pro, and the full 1M context is included at no surcharge. For comparison, GPT-5.5 standard is $5 / $30 and Claude Opus 4.7 is $5 / $25 — V4-Pro's $0.87 output is roughly 28-34× cheaper than those closed frontier models.
Did DeepSeek R2 ever release?
No. DeepSeek R2 was never shipped — V4 is the actual 2026 release. After R1, DeepSeek folded reasoning into the main line rather than putting out a standalone "R2," so V4-Pro and V4-Flash (April 24, 2026) both include a thinking/non-thinking mode and there is no R2 in DeepSeek's API model list. Any "R2 specs" you see circulating are leaks and speculation, not an official product.
How does DeepSeek V4 do MoE routing?
DeepSeek V4-Pro has 256 experts per layer; V4-Flash has 64. Each token activates the top-8 experts based on a learned router network, contributing to the 49B (Pro) or 13B (Flash) active-parameter count. Compared to V3, V4 introduces "shared experts" — a small set of always-active experts that handle common token patterns, reducing the cold-start cost of routing decisions. Load balancing uses an auxiliary loss similar to V3's, plus a new "expert lock" mechanism that prevents collapse during long-context inference. Net result: ~30% better throughput on multi-turn agentic workloads vs V3.1.

Run DeepSeek V4 in production

Local AI Master's deployment course covers full V4 production setup — multi-GPU sharding, KV-cache management, vLLM tuning, and OpenAI-compatible serving. Real production code, full GitHub repo.

See the deployment course →

Related models

🎯
AI Learning Path

Go from reading about AI to building with AI

20 structured courses. Hands-on projects. Runs on your machine. Start free.

Or own it for life — Lifetime $149 $599, pay once
LM

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor
More on AI Models Directory
See the full AI Models Directory guide.
📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Found your model? Now build something with it.

20 hands-on courses — RAG, agents, fine-tuning — all running locally. First chapter free, no card.

Or own it for life — Lifetime $149 $599, pay once
Free Tools & Calculators