Alibaba Qwen · Open-Weight · Apache 2.0
Qwen3-Coder-Next: The Best Self-Hostable Coding Model
Qwen3-Coder-Next is the strongest coding-specialized open-weight model available right now. 80 billion total parameters, only 3 billion active per token (MoE), 256K-token context, 70.6% on SWE-Bench Verified, and Apache 2.0 licensed for unrestricted commercial use. Runs on a single H100 or 2× RTX 5090 at Q4 quantization. This is the realistic local alternative to Claude Sonnet 5 in Cursor — without sending your code to Anthropic.
Why this matters: 70.6% SWE-Bench Verified on a model you can run on a single GPU is unprecedented. Most production coding tasks fall below the threshold where the gap to Claude Sonnet 5 (92.4%) is visible — meaning for routine work, Qwen3-Coder-Next is now “good enough”. Privacy, cost, and offline-operation benefits do the rest.
Key takeaways
- →80B total / 3B active MoE — fast inference (80-120 tok/s on a single H100) at near-frontier quality.
- →256K context — fits most repositories, full PR diffs, and long API specs.
- →70.6% SWE-Bench Verified — best coding score for any open-weight model.
- →Apache 2.0 — unlimited commercial use, no royalties.
- →Drop-in for Cursor / Aider / Continue — OpenAI-compatible API via Ollama or vLLM.
Quick verdict
Qwen3-Coder-Next is the right starting point for any developer who wants serious local AI coding without sending code to a third party. It runs on a single H100 (or 2× consumer RTX 5090), handles 75-80% of what Claude Sonnet 5 does, and costs nothing per token after the hardware investment.
Where it loses: hardest reasoning, novel algorithms, multi-file refactors with deep dependencies. For those, you keep a Claude Sonnet 5 or GPT-5.5 API account on standby and route the hard 10-20% of tasks there. Net cost reduction vs pure API: 60-80%.
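In practice that routing can be a thin wrapper around two clients. Below is a minimal sketch, assuming an Ollama endpoint on :11434 and the anthropic Python SDK for escalation; the is_hard() heuristic and the claude-sonnet-5 model id are placeholders to adapt to your own workload, not tested values.
# Minimal hybrid router: local Qwen3-Coder-Next for routine tasks,
# frontier API for the hardest ones. is_hard() is a placeholder
# heuristic; tune it (or use explicit flags) for your workload.
from openai import OpenAI
from anthropic import Anthropic

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def is_hard(task: str) -> bool:
    # Hypothetical rule: very long prompts or explicit markers escalate.
    return len(task) > 20_000 or "[escalate]" in task

def complete(task: str) -> str:
    if is_hard(task):
        msg = Anthropic().messages.create(
            model="claude-sonnet-5",  # placeholder model id
            max_tokens=4096,
            messages=[{"role": "user", "content": task}],
        )
        return msg.content[0].text
    resp = local.chat.completions.create(
        model="qwen3-coder-next",
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content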
Specs at a glance
| Spec | Value |
|---|---|
| Vendor | Alibaba Qwen team |
| Architecture | Mixture-of-Experts (top-8 routing) |
| Total parameters | 80 billion |
| Active parameters | 3 billion per token |
| Context window | 256,000 tokens |
| License | Apache 2.0 |
| SWE-Bench Verified | 70.6% |
| Storage (BF16) | ~160 GB |
| Storage (Q4_K_M) | ~52 GB |
| Hugging Face | Qwen/Qwen3-Coder-Next |
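The storage figures follow from a simple rule of thumb: file size ≈ parameters × bits per weight ÷ 8, plus a few GB of overhead for embeddings and metadata. A quick sanity check (the 4.8 bits/weight for Q4_K_M is an approximation, not an official figure):
# Back-of-envelope model file sizes. 4.8 bits/weight for Q4_K_M is
# an approximation (mixed 4/6-bit blocks plus scales), not official.
params = 80e9
for name, bits in [("BF16", 16), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
# BF16 ~160 GB, Q5_K_M ~57 GB, Q4_K_M ~48 GB (+ overhead ≈ the 52 GB above)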
Hardware requirements
| Hardware | Quantization | Approx tokens/sec | Notes |
|---|---|---|---|
| 2× RTX 5090 (32GB each) | Q4_K_M | 45-65 tok/s | Sweet spot for consumer build (~$8K total) |
| 1× H100 80GB | Q5_K_M | 80-120 tok/s | Best single-GPU; cleanest deployment |
| 1× M3 Ultra (96GB unified) | Q5_K_M | 25-45 tok/s | Apple Silicon; quiet, minimal power and cooling |
| 2× RTX 4090 (24GB each) | Q4_K_M | 35-50 tok/s | Tight VRAM; reduce context to 128K |
| 1× B200 180GB | BF16 | 150-220 tok/s | Production multi-user serving |
| CPU-only (Threadripper/EPYC + 256GB RAM) | Q4_K_M | 5-12 tok/s | Possible but slow; not for interactive use |
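To see where your own box lands in this table, time a generation against whichever server you start in the next section. The sketch below assumes the vLLM endpoint on :8000 (for Ollama, swap in port 11434 and the shorter model name); it is a rough probe, not a rigorous benchmark.
# Rough tokens/sec probe against an OpenAI-compatible local server.
# Wall time includes prompt processing, so short prompts give the
# fairest read on decode speed.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
t0 = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Write quicksort in Python."}],
    max_tokens=512,
)
dt = time.perf_counter() - t0
print(f"{resp.usage.completion_tokens / dt:.1f} tok/s")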
Local setup
Ollama (10-minute install)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Qwen3-Coder-Next at Q4 (52 GB download)
ollama pull qwen3-coder-next
# Test it interactively
ollama run qwen3-coder-next "Write a Python function that detects palindromes"
# Or expose it as an OpenAI-compatible API
ollama serve # listens on :11434
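Before wiring up an editor, it is worth a quick scripted check against Ollama's OpenAI-compatible endpoint. A minimal streaming example (streaming is what editor integrations use in practice; Ollama ignores the API key, but the client requires a non-empty string):
# Streaming check against Ollama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
stream = client.chat.completions.create(
    model="qwen3-coder-next",
    messages=[{"role": "user",
               "content": "Write a Python function that detects palindromes."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)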
vLLM (production / multi-user)
pip install vllm
# Serve on a single H100, 256K context, OpenAI-compatible
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-Coder-Next \
--max-model-len 262144 \
--gpu-memory-utilization 0.92 \
--port 8000
# Test from any OpenAI client:
#   client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
#   client.chat.completions.create(model="Qwen/Qwen3-Coder-Next", messages=[...])
llama.cpp (max control)
# Build with CUDA
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
make GGML_CUDA=1 -j
# Download community GGUF
huggingface-cli download bartowski/Qwen3-Coder-Next-GGUF \
Qwen3-Coder-Next-Q4_K_M.gguf --local-dir models
# Serve
# Serve (-c 65536 = 64K context; raise toward 262144 if VRAM allows)
./llama-server -m models/Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 99 -c 65536 --host 0.0.0.0 --port 8080
Cursor / Continue / Aider integration
Once Ollama or vLLM is serving on :11434 or :8000, you can use Qwen3-Coder-Next inside any major AI coding tool. The trick: every tool supports an “OpenAI-compatible custom endpoint”.
Cursor
Settings → Models → “OpenAI API Key” section → enable “Override OpenAI Base URL”.
Base URL: http://localhost:11434/v1
API Key: dummy
Model name: qwen3-coder-next
Continue.dev
Edit ~/.continue/config.json:
{
"models": [{
"title": "Qwen3-Coder-Next",
"provider": "openai",
"model": "qwen3-coder-next",
"apiBase": "http://localhost:11434/v1",
"apiKey": "dummy"
}]
}
Aider (CLI)
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=dummy
aider --model openai/qwen3-coder-next your-file.py
Coding benchmarks
| Benchmark | Qwen3-Coder-Next | DeepSeek V4-Flash | Claude Sonnet 5 | GPT-5.5 |
|---|---|---|---|---|
| SWE-Bench Verified | 70.6% | 78.4% | 92.4% | 85.1% |
| LiveCodeBench | 68.4% | 67.2% | 79.8% | 76.3% |
| Aider polyglot | 71.2% | 74.1% | 87.1% | 81.4% |
| HumanEval | 88.4% | 90.2% | 95.8% | 94.2% |
| MBPP | 82.3% | 84.7% | 93.6% | 91.8% |
Sources: Qwen3-Coder-Next model card on Hugging Face, SWE-Bench Verified leaderboard, Aider public results. Local benchmarks confirmed on H100 + vLLM agent harness.
When to pick Qwen3-Coder-Next
- ✓You write code professionally and want a self-hosted Cursor/Aider setup.
- ✓Your code or IP cannot leave your network (regulated industries, defense, IP-sensitive).
- ✓You have or can buy a single H100 / 2× RTX 5090 / M3 Ultra Mac Studio.
- ✓You want a hybrid setup: local for routine work, API for hardest 10-20%.
- ✓You're building a coding-AI product and need an Apache-2.0-licensed model in the loop.
When to use something else
- →You need absolute peak coding quality → Claude Sonnet 5.
- →Single high-end consumer GPU with no MoE complexity → Qwen3.6-27B (dense 27B, slightly lower coding score but simpler deployment).
- →You need 1M context for whole-monorepo work → DeepSeek V4 or Gemini 3.1 Pro API.
- →Mixed coding + general work, more hardware → DeepSeek V4-Flash.
Frequently asked questions
What is Qwen3-Coder-Next?
How much VRAM does Qwen3-Coder-Next need?
How do I install Qwen3-Coder-Next with Ollama?
Qwen3-Coder-Next vs Claude Sonnet 5: how much worse is it?
What is the 256K context window good for?
Can I use Qwen3-Coder-Next with Cursor / Continue.dev / Aider?
Why Apache 2.0 license matters
Qwen3-Coder-Next vs DeepSeek V4-Flash for coding
Build a complete local coding stack
The Local AI Master deployment course walks through running Qwen3-Coder-Next in production with vLLM, integrating it into Cursor and Aider, and hybrid routing for the hardest tasks. Real production code, full GitHub repo.
See the deployment course →
Related models
- → Qwen3.6-27B — dense 27B that beats its own MoE sibling on agentic coding
- → DeepSeek V4 — open-weight frontier alternative, 1M context
- → Qwen3-Coder — predecessor; smaller, simpler, less capable
- → Claude Sonnet 5 — what you reach for on hardest 10-20% of tasks
- → Best local AI coding models — full comparison
- → Best AI models May 2026 — pillar comparison