
Alibaba Qwen · Open-Weight · Apache 2.0

Qwen3-Coder-Next: The Best Self-Hostable Coding Model

Qwen3-Coder-Next is the strongest coding-specialized open-weight model available right now. 80 billion total parameters, only 3 billion active per token (MoE), 256K-token context, 70.6% on SWE-Bench Verified, and Apache 2.0 licensed for unrestricted commercial use. Runs on a single H100 or 2× RTX 5090 at Q4 quantization. This is the realistic local alternative to Claude Sonnet 5 in Cursor — without sending your code to Anthropic.

📅 Published: May 9, 2026 · 🔄 Last Updated: May 9, 2026 · ✓ Manually Reviewed

Why this matters: 70.6% SWE-Bench Verified on a model you can run on a single GPU is unprecedented. Most production coding tasks fall below the threshold where the gap to Claude Sonnet 5 (92.4%) is visible — meaning for routine work, Qwen3-Coder-Next is now “good enough”. Privacy, cost, and offline-operation benefits do the rest.

Key takeaways

  • 80B/3B active MoE — fast inference (60-120 tok/s on H100) at near-frontier quality.
  • 256K context — fits most repositories, full PR diffs, and long API specs.
  • 70.6% SWE-Bench Verified — best score among coding-specialized open-weight models.
  • Apache 2.0 — unlimited commercial use, no royalties.
  • Drop-in for Cursor / Aider / Continue — OpenAI-compatible API via Ollama or vLLM.

Quick verdict

Qwen3-Coder-Next is the right starting point for any developer who wants serious local AI coding without sending code to a third party. It runs on a single H100 (or 2× consumer RTX 5090), handles 75-80% of what Claude Sonnet 5 does, and costs nothing per token after the hardware investment.

Where it loses: hardest reasoning, novel algorithms, multi-file refactors with deep dependencies. For those, you keep a Claude Sonnet 5 or GPT-5.5 API account on standby and route the hard 10-20% of tasks there. Net cost reduction vs pure API: 60-80%.
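Because local servers and frontier APIs all speak the OpenAI protocol, that hybrid routing is a few lines of glue. A minimal Python sketch, assuming Ollama serving locally on :11434; the is_hard heuristic and the gpt-5.5 model name are placeholders to adapt to your workload:

# Hybrid routing sketch: local Qwen3-Coder-Next by default, frontier API
# only when a task looks hard. is_hard() is a placeholder heuristic.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="dummy")
frontier = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_hard(task: str) -> bool:
    # Placeholder: very long or cross-cutting prompts go to the frontier model.
    return len(task) > 8000 or "refactor across" in task.lower()

def complete(task: str) -> str:
    client, model = (frontier, "gpt-5.5") if is_hard(task) else (local, "qwen3-coder-next")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content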

Specs at a glance

Vendor: Alibaba Qwen team
Architecture: Mixture-of-Experts (top-8 routing)
Total parameters: 80 billion
Active parameters: 3 billion per token
Context window: 256,000 tokens
License: Apache 2.0
SWE-Bench Verified: 70.6%
Storage (BF16): ~160 GB
Storage (Q4_K_M): ~52 GB
Hugging Face: Qwen/Qwen3-Coder-Next
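Those storage figures are just parameters × bits per weight. A back-of-envelope check (assuming Q4_K_M averages about 4.8 bits per parameter; real GGUF files carry some overhead, which is why ~48 GB lands at ~52 GB on disk):

# Rough model size: parameters * bits-per-weight / 8 bytes.
PARAMS = 80e9  # 80B total parameters

for name, bits in [("BF16", 16.0), ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:7s} ~{gb:.0f} GB")  # BF16 -> ~160 GB, Q4_K_M -> ~48 GB + overhead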

Hardware requirements

Hardware | Quantization | Approx tokens/sec | Notes
2× RTX 5090 (32 GB each) | Q4_K_M | 45-65 tok/s | Sweet spot for consumer build (~$8K total)
1× H100 80 GB | Q5_K_M | 80-120 tok/s | Best single-GPU; cleanest deployment
1× M3 Ultra (96 GB unified) | Q5_K_M | 25-45 tok/s | Apple Silicon; quiet, no cooling needed
2× RTX 4090 (24 GB each) | Q4_K_M | 35-50 tok/s | Tight VRAM; reduce context to 128K
1× B200 180 GB | BF16 | 150-220 tok/s | Production multi-user serving
CPU-only (Threadripper/EPYC + 256 GB RAM) | Q4_K_M | 5-12 tok/s | Possible but slow; not for interactive use

Local setup

Ollama (10-minute install)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3-Coder-Next at Q4 (52 GB download)
ollama pull qwen3-coder-next

# Test it interactively
ollama run qwen3-coder-next "Write a Python function that detects palindromes"

# Or expose it as an OpenAI-compatible API
ollama serve  # listens on :11434
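To verify the endpoint, here is a minimal test with the official openai Python package, which works unchanged against Ollama's /v1 route:

# Query Ollama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="qwen3-coder-next",
    messages=[{"role": "user", "content": "Write a Python function that detects palindromes"}],
)
print(resp.choices[0].message.content)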

vLLM (production / multi-user)

pip install vllm

# Serve on a single H100, 256K context, OpenAI-compatible
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Coder-Next \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# Test from any OpenAI client:
# client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
# client.chat.completions.create(model="Qwen/Qwen3-Coder-Next", messages=[...])

llama.cpp (max control)

# Build with CUDA (llama.cpp builds via CMake)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

# Download community GGUF
huggingface-cli download bartowski/Qwen3-Coder-Next-GGUF \
  Qwen3-Coder-Next-Q4_K_M.gguf --local-dir models

# Serve
./build/bin/llama-server -m models/Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 99 -c 65536 --host 0.0.0.0 --port 8080
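Before pointing tools at it, a quick sanity check (a sketch using the requests package; llama-server exposes /health plus an OpenAI-style /v1/chat/completions):

# Sanity-check the llama-server instance started above.
import requests

# /health returns 200 once the model has finished loading.
print(requests.get("http://localhost:8080/health").status_code)

# llama-server speaks the OpenAI chat format; the model field is effectively
# ignored since the server hosts exactly one model.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "qwen3-coder-next",
          "messages": [{"role": "user", "content": "Say hello in one word."}]},
)
print(resp.json()["choices"][0]["message"]["content"])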

Cursor / Continue / Aider integration

Once Ollama or vLLM is serving on :11434 or :8000, you can use Qwen3-Coder-Next inside any major AI coding tool. The trick: every tool supports an “OpenAI-compatible custom endpoint”.

Cursor

Settings → Models → “OpenAI API Key” section → enable “Override OpenAI Base URL”.

Base URL: http://localhost:11434/v1
API Key: dummy
Model name: qwen3-coder-next

Continue.dev

Edit ~/.continue/config.json:

{
  "models": [{
    "title": "Qwen3-Coder-Next",
    "provider": "openai",
    "model": "qwen3-coder-next",
    "apiBase": "http://localhost:11434/v1",
    "apiKey": "dummy"
  }]
}

Aider (CLI)

export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=dummy
aider --model openai/qwen3-coder-next your-file.py

Coding benchmarks

Benchmark | Qwen3-Coder-Next | DeepSeek V4-Flash | Claude Sonnet 5 | GPT-5.5
SWE-Bench Verified | 70.6% | 78.4% | 92.4% | 85.1%
LiveCodeBench | 68.4% | 67.2% | 79.8% | 76.3%
Aider polyglot | 71.2% | 74.1% | 87.1% | 81.4%
HumanEval | 88.4% | 90.2% | 95.8% | 94.2%
MBPP | 82.3% | 84.7% | 93.6% | 91.8%

Sources: Qwen3-Coder-Next model card on Hugging Face, SWE-Bench Verified leaderboard, Aider public results. Local benchmarks confirmed on H100 + vLLM agent harness.

When to pick Qwen3-Coder-Next

  • You write code professionally and want a self-hosted Cursor/Aider setup.
  • Your code or IP cannot leave your network (regulated industries, defense, IP-sensitive).
  • You have or can buy a single H100 / 2× RTX 5090 / M3 Ultra Mac Studio.
  • You want a hybrid setup: local for routine work, API for hardest 10-20%.
  • You're building a coding-AI product and need an Apache-2.0-licensed model in the loop.

When to use something else

  • You need absolute peak coding quality → Claude Sonnet 5.
  • Single high-end consumer GPU with no MoE complexity → Qwen3.6-27B (dense 27B, slightly lower coding score but simpler deployment).
  • You need 1M context for whole-monorepo work → DeepSeek V4 or Gemini 3.1 Pro API.
  • Mixed coding + general work, more hardware → DeepSeek V4-Flash.

Frequently asked questions

What is Qwen3-Coder-Next?
Qwen3-Coder-Next is Alibaba Qwen team's flagship coding model — a Mixture-of-Experts architecture with 80 billion total parameters and 3 billion active per forward pass. It has a 256,000-token context window, scores 70.6% on SWE-Bench Verified, and ships under Apache 2.0 license (fully open for commercial use). Released early 2026, it's the best self-hostable coding model available right now and the closest open-weight alternative to Claude Sonnet 5 and GPT-5.5 for code-heavy workloads.
How much VRAM does Qwen3-Coder-Next need?
At Q4_K_M quantization (the most common balance of quality and size), Qwen3-Coder-Next needs ~52 GB of VRAM. That fits in 2× RTX 5090 (32 GB each), 1× H100 (80 GB), 1× A100 80GB, 2× RTX 4090 (24 GB each — tight, may need to reduce context), or 1× M3 Ultra Mac Studio with 96 GB unified memory. Q5_K_M is ~64 GB, Q6 is ~75 GB, BF16 is ~160 GB. Because only 3B parameters are active per token, inference is fast — typically 60-120 tokens/second on a single H100.
How do I install Qwen3-Coder-Next with Ollama?
Install Ollama, then run: `ollama pull qwen3-coder-next` (defaults to Q4 quantization). Start it with `ollama run qwen3-coder-next`. To expose it as an OpenAI-compatible API for Cursor, Continue.dev, or Aider: run `ollama serve` and point your tool at `http://localhost:11434/v1` with model name `qwen3-coder-next`. Cursor specifically: Settings → Models → Override OpenAI Base URL to your Ollama endpoint, and you can use Qwen3-Coder-Next exactly like Claude Sonnet 5.
Qwen3-Coder-Next vs Claude Sonnet 5: how much worse is it?
On SWE-Bench Verified: Qwen3-Coder-Next scores 70.6%, Claude Sonnet 5 scores 92.4%. That's a 22-point gap. In practical terms, Qwen3-Coder-Next handles ~75-80% of the same coding tasks Sonnet 5 does — it gets the routine refactors, bug fixes, function implementations, and documentation correct. Where Sonnet 5 still wins decisively: novel algorithm design, multi-file refactors with subtle dependencies, ambiguous specs, and any task requiring careful step-by-step reasoning. The pragmatic pattern most production teams use: Qwen3-Coder-Next locally for the routine 80%, Sonnet 5 via API for the hard 20%. Cost reduction: typically 60-80%.
What is the 256K context window good for?
A 256,000-token context window holds roughly 200,000 lines of code or 400 pages of documentation. Practical uses: load an entire small-to-medium repo (most are under 100K LOC) and ask whole-codebase questions; refactor across many files in one prompt; analyze a complete API spec + multiple implementations together; or feed a long PR diff with full file context. For most engineering work, 256K is more than enough. If you need 1M context (whole monorepo), Gemini 3.1 Pro (closed API) or DeepSeek V4 (open weight) are the alternatives.
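A rough way to check whether a repo fits, using the common ~4 characters-per-token approximation (the exact count depends on Qwen's tokenizer, so treat this as an estimate):

# Estimate whether a codebase fits in the 256K window.
from pathlib import Path

CHARS_PER_TOKEN = 4        # crude average for source code; tokenizer-dependent
CONTEXT = 256_000

chars = sum(
    len(p.read_text(errors="ignore"))
    for p in Path(".").rglob("*.py")  # widen the glob for other languages
)
tokens = chars / CHARS_PER_TOKEN
print(f"~{tokens:,.0f} tokens -> {'fits in' if tokens < CONTEXT else 'exceeds'} 256K")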
Can I use Qwen3-Coder-Next with Cursor / Continue.dev / Aider?
Yes — all three support custom OpenAI-compatible endpoints. Cursor: Settings → Models → "Override OpenAI Base URL" → `http://localhost:11434/v1` (Ollama) and add model `qwen3-coder-next`. Continue.dev: edit `~/.continue/config.json` and add a model entry pointing to your local endpoint. Aider: pass `--model openai/qwen3-coder-next` and `--openai-api-base http://localhost:11434/v1`. All three work indistinguishably from Claude/GPT integration once configured. For best agent-loop performance, use vLLM instead of Ollama — continuous batching gives 5-15× higher aggregate throughput when you have multiple Cursor/Aider sessions running.
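To see continuous batching pay off, fire a handful of concurrent requests at the vLLM endpoint; a sketch using the async openai client against the server from the setup above:

# Rough aggregate-throughput probe against local vLLM.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def one(i: int) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3-Coder-Next",
        messages=[{"role": "user", "content": f"Write unit test #{i} for a stack class"}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(n: int = 8) -> None:
    t0 = time.perf_counter()
    tokens = await asyncio.gather(*(one(i) for i in range(n)))
    dt = time.perf_counter() - t0
    print(f"{sum(tokens)} tokens in {dt:.1f}s -> {sum(tokens)/dt:.0f} tok/s aggregate")

asyncio.run(main())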
Why Apache 2.0 license matters
Apache 2.0 is a permissive open-source license — you can use Qwen3-Coder-Next commercially, modify it, redistribute it, embed it in products, and fine-tune it without paying royalties or asking permission. The only requirements are preserving copyright notices and disclosing significant changes. This is the same license as Apache HTTP Server, Kubernetes, and TensorFlow. Compared to Llama 4 (modified license with usage thresholds and attribution requirements), Apache 2.0 means you can build a product on Qwen3-Coder-Next without lawyers reviewing the agreement.
Qwen3-Coder-Next vs DeepSeek V4-Flash for coding
Qwen3-Coder-Next is coding-specialized; DeepSeek V4-Flash is general-purpose. The coding benchmarks are split: V4-Flash leads on SWE-Bench Verified (78.4% vs 70.6%) and Aider polyglot (74.1% vs 71.2%), while Qwen3-Coder-Next edges ahead on LiveCodeBench (68.4% vs 67.2%). Hardware: Qwen3-Coder-Next is much smaller (~52 GB Q4 vs ~150 GB Q4 for V4-Flash) — runs on 1× H100 vs 2× H100. For pure coding workloads on a single high-end GPU, pick Qwen3-Coder-Next. For mixed coding + research + general work, pick V4-Flash if you have the hardware.

Build a complete local coding stack

The Local AI Master deployment course walks through running Qwen3-Coder-Next in production with vLLM, integrating it into Cursor and Aider, and hybrid routing for the hardest tasks. Real production code, full GitHub repo.

See the deployment course →
