Copilot+ PC vs RTX GPU for Local AI (2026): NPU or GPU?
For actually running local LLMs in 2026, a discrete RTX GPU beats a Copilot+ PC's NPU — and it is not close. Copilot+ NPUs (Snapdragon X at 45 TOPS, Snapdragon X2 at 80 TOPS, Intel Lunar Lake at 48 TOPS, AMD Strix Point at 50 TOPS) are real AI accelerators, but most local LLM tools (Ollama, llama.cpp) still run on the GPU or CPU, not the NPU. On a Snapdragon X Elite, Ollama runs CPU-only and an 8B model lands around 5-10 tokens/sec; a used RTX 3090 runs the same model at roughly 100 tokens/sec because token generation is bound by memory bandwidth (a 3090 has ~936 GB/s versus a Copilot+ laptop's ~135-152 GB/s of shared LPDDR5x). Buy a Copilot+ PC for all-day battery and lightweight built-in AI features; buy or build an RTX machine if running 7B-70B local models fast is the actual goal.
This guide compares Copilot+ PC NPUs against a discrete RTX GPU for one specific job — running local large language models — with verified specs, real throughput numbers, and an honest recommendation.
What is a Copilot+ PC, and what is its NPU?
A Copilot+ PC is Microsoft's certification for laptops with a dedicated NPU (Neural Processing Unit) rated at 40+ TOPS (trillion operations per second), paired with at least 16GB of RAM and 256GB of storage. The NPU is a fixed-function AI accelerator that sits next to the CPU and GPU on the same chip, designed to run small AI models at very low power — think live captions, background blur, image cleanup, and Windows' on-device Copilot features.
As of mid-2026 the main Copilot+ chips are:
- Qualcomm Snapdragon X Elite / X Plus — Arm-based, 45 TOPS Hexagon NPU.
- Qualcomm Snapdragon X2 Elite / X2 Plus — the 2026 refresh, 80 TOPS Hexagon NPU.
- Intel Core Ultra 200V "Lunar Lake" — x86, 48 TOPS NPU.
- AMD Ryzen AI 300 "Strix Point" — x86, 50 TOPS XDNA 2 NPU.
That TOPS number sounds enormous next to a phone, but TOPS measures peak low-precision throughput for small models — it is not the metric that decides how fast a multi-billion-parameter LLM generates text. The thing that decides that is memory bandwidth, and that is where the comparison gets interesting.
NPU vs discrete GPU: spec-for-spec
A Copilot+ NPU and a discrete RTX GPU are built for very different jobs. The NPU optimizes for performance per watt; the GPU optimizes for raw parallel throughput and bandwidth. For LLM token generation, bandwidth is king.
| Spec | Copilot+ NPU (e.g. Snapdragon X2) | RTX 3090 (discrete) | RTX 4090 (discrete) |
|---|---|---|---|
| AI rating | ~80 TOPS (NPU) | ~285 TFLOPS tensor (peak) | ~660+ TFLOPS tensor (peak) |
| Memory | Shared LPDDR5x (system RAM) | 24GB dedicated GDDR6X | 24GB dedicated GDDR6X |
| Memory bandwidth | ~152 GB/s (shared) | ~936 GB/s | ~1,008 GB/s |
| Typical power draw | ~10-30W (whole laptop) | 350W (card) | 450W (card) |
| Runs Ollama/llama.cpp on the accelerator? | Mostly no — CPU/iGPU path | Yes — full CUDA | Yes — full CUDA |
| Max practical local model | ~7B-8B (slow) | up to ~70B (tight quant) | up to ~70B (tight quant) |
The row that decides everything is memory bandwidth. A Copilot+ laptop shares one ~135-152 GB/s pool of LPDDR5x between the CPU, GPU, and NPU (135 GB/s on the first-gen Snapdragon X Elite, ~152 GB/s on the 2026 Snapdragon X2 Elite). A discrete RTX card has its own dedicated 24GB at ~900-1,000 GB/s, roughly 6-7x more bandwidth feeding the chip. Since LLM text generation streams the entire model's weights through memory for every single token, that bandwidth gap translates almost directly into a throughput gap of at least 6x. The gap measured in practice is wider still (closer to 10-20x), which suggests the CPU path on a Copilot+ laptop does not make full use of even the bandwidth it has.
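The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every token streams all the weights once, tokens/sec cannot exceed bandwidth divided by model size. The sketch below uses the bandwidth figures quoted in this article; the 4.5 bits-per-weight figure is an assumption (a round number for a 4-bit quant), and the function name is ours, not part of any library.

```python
# Rough upper bound on decode speed: each generated token streams the full
# set of weights through memory once, so tok/s <= bandwidth / model size.
# This ignores compute, KV-cache traffic, and runtime overhead.

def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens per second."""
    return bandwidth_gb_s / model_size_gb

# ASSUMPTION: an 8B model at ~4.5 bits per weight is about 4.5 GB of weights.
MODEL_GB = 8e9 * 4.5 / 8 / 1e9

# Bandwidth figures (GB/s) as quoted in this article.
systems = {
    "Snapdragon X Elite (shared LPDDR5x)": 135,
    "Snapdragon X2 Elite (shared LPDDR5x)": 152,
    "RTX 3090 (GDDR6X)": 936,
    "RTX 4090 (GDDR6X)": 1008,
}

ceilings = {name: decode_ceiling_tok_s(bw, MODEL_GB) for name, bw in systems.items()}

for name, tok_s in ceilings.items():
    print(f"{name}: at most ~{tok_s:.0f} tok/s")
```

Under these assumptions the 3090's ceiling is a little over 200 tok/s and the first-gen Snapdragon's is about 30 tok/s. Real throughput lands below the ceiling on both, but the ratio between the two ceilings is the same ~7x as the ratio between the bandwidth figures.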
Can an NPU actually run a local LLM today?
Here is the part the marketing skips. Most local LLM tooling does not use the NPU at all. As of mid-2026:
- Ollama has Arm64 Windows builds for Snapdragon, but GPU/NPU acceleration is not exposed — on a Snapdragon X Elite, everything runs CPU-only through the Ollama CLI.
- llama.cpp runs on these machines but again leans on CPU/iGPU; reliable, production-grade NPU offload is not there for mainstream use.
- Intel is furthest along: its IPEX-LLM library can target the Lunar Lake NPU and integrates with llama.cpp/Ollama on Windows, but it is a separate stack, model-limited, and far from plug-and-play.
- Qualcomm's own QNN runtime can push an LLM onto the Hexagon NPU, but you are using Qualcomm-converted models through a separate SDK, not just ollama run.
So when you "run a local model" on a Copilot+ PC with the normal tools, you are almost always running it on the CPU, not the headline NPU. The NPU is busy doing what it was actually designed for: small, fixed AI tasks like Windows Studio Effects, live captions, and Recall — not 8-billion-parameter chatbots.
First-hand note: in our testing, a discrete RTX 3090 (24GB) runs an 8B model at roughly 100 tokens/sec at Q4_K_M through Ollama, no special setup — just ollama run. Independent reports put an 8B model on Snapdragon X Elite at around 5 tokens/sec on the CPU path, and even with hand-tuned QNN/NPU acceleration on a Qualcomm-converted 8B model, throughput lands near 10 tokens/sec. That is below the ~20 tok/s "feels like real-time typing" threshold — usable for short answers, painful for anything long.
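To check numbers like these on your own machine, one option is to read the timing fields Ollama returns from its local HTTP API. This is a minimal sketch, assuming an Ollama server on the default port 11434 and a non-streaming /api/generate request; the helper names are ours. The response reports eval_count (generated tokens) and eval_duration (nanoseconds), which together give decode speed.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Decode speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Send one non-streaming request to a local Ollama server; return tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling benchmark("llama3.1:8b", "Explain memory bandwidth in one paragraph.") returns a single tok/s figure; the model tag here is only an example, so substitute whatever ollama list shows locally. Run it a few times and discard the first result, since the first call also pays the model load time.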
Why does the GPU win so decisively on tokens per second?
It comes down to how LLM inference works. There are two phases:
- Prefill (prompt processing) — your whole prompt is read in parallel. This is compute-bound, and it is the one phase where a high-TOPS NPU can look good.
- Decode (generating the answer) — tokens come out one at a time, and each token requires streaming all of the model's weights from memory. This is memory-bandwidth-bound.
Almost all of the time you "feel" in a chat is the decode phase, and decode speed is gated by bandwidth. A discrete RTX GPU has ~900-1,000 GB/s of dedicated VRAM bandwidth; a Copilot+ laptop has ~135-152 GB/s of shared system memory. Roughly double the bandwidth and you roughly double the tokens per second, so the bandwidth gap alone accounts for about a 6x difference. The rest of the ~5-10 tok/s vs ~100 tok/s difference we see in practice most likely comes from the laptop running on the CPU path, which does not make full use of the bandwidth it has.
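The two phases can be put into a simple timing model: a chat turn takes prompt tokens divided by prefill speed, plus output tokens divided by decode speed. The sketch below is illustrative only. The decode rates are the ones reported in this article, while the prefill rates are hypothetical placeholders (they are not measurements), chosen just to show how the split works.

```python
def chat_latency(prompt_tokens: int, output_tokens: int,
                 prefill_tok_s: float, decode_tok_s: float) -> dict:
    """Split one chat turn into prefill and decode time, in seconds."""
    prefill_s = prompt_tokens / prefill_tok_s
    decode_s = output_tokens / decode_tok_s
    total_s = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": total_s,
        "decode_share": decode_s / total_s,
    }


# Decode rates come from this article. Prefill rates are HYPOTHETICAL
# placeholders, not measurements.
laptop = chat_latency(500, 400, prefill_tok_s=200, decode_tok_s=10)
gpu = chat_latency(500, 400, prefill_tok_s=2000, decode_tok_s=100)

for name, result in (("Copilot+ CPU path", laptop), ("RTX 3090", gpu)):
    print(f"{name}: {result['total_s']:.2f}s total, "
          f"{result['decode_share']:.0%} of it in decode")
```

With a 500-token prompt and a 400-token answer, decode accounts for well over 90% of the wait on both machines under these assumptions, which is why a faster prefill engine alone does little for the chat experience.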
The NPU's big TOPS number helps with prefill and with tiny always-on models, but it does nothing to widen the narrow memory pipe that decode depends on. That is why a $1,800 Copilot+ laptop can lose badly to a ~$900 used 3090 at the one job of running a chatbot fast.
| Workload | Copilot+ NPU/CPU | Discrete RTX GPU | Winner |
|---|---|---|---|
| 8B model chat (tok/s) | ~5-10 | ~100 | RTX (decisive) |
| Loading a 32B-70B model | Won't fit / unusable | Yes, with quant | RTX |
| All-day unplugged battery | Excellent | N/A (desktop) / poor | Copilot+ |
| Background AI (captions, blur) | Excellent, near-zero battery | Wasteful | Copilot+ |
| Image generation (SD/FLUX) | Slow | Fast | RTX |
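Whether a model "fits" is mostly arithmetic: weights take roughly parameters times bits per weight divided by 8, plus some headroom for the KV cache and runtime buffers. The sketch below assumes a flat 2 GB of headroom, which is a simplification (real overhead grows with context length), and the function names are ours.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an ASSUMED fixed overhead fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


# 24 GB matches the RTX 3090/4090; 16 GB is the Copilot+ minimum RAM,
# which the OS and other apps also have to share.
for params, bits in ((8, 4.5), (32, 4.5), (70, 4.5), (70, 2.5)):
    size = weights_gb(params, bits)
    print(f"{params}B @ {bits} bpw = {size:.1f} GB | "
          f"24 GB card: {fits(params, bits, 24)} | "
          f"16 GB laptop: {fits(params, bits, 16)}")
```

Under these assumptions a 32B model at ~4.5 bits per weight fits on a 24GB card but not in 16GB, and a 70B model only squeezes into 24GB at around 2.5 bits per weight, which is what "tight quant" means in the table above.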
Where the NPU genuinely wins: battery and power
None of this means the NPU is pointless — it is excellent at the job it was built for. The whole point of an NPU is performance per watt. It can run small models continuously while sipping power, which is exactly what you want for features that run all day: live captions, noise suppression, eye-contact correction, background blur, and on-device Copilot tasks. Research consistently shows NPUs delivering comparable small-model results at less than half the power of doing the same work on a GPU.
That efficiency is why Copilot+ laptops get genuinely outstanding battery life. A discrete RTX GPU under an LLM load pulls far more: 350-450W for the desktop cards in the table above, and a laptop RTX GPU can still draw 100W or more and drain a battery in well under an hour. An NPU doing its intended lightweight AI barely registers. So if your priority is a thin, silent, all-day laptop that does helpful background AI, the Copilot+ NPU is the right tool; it just is not a local-LLM workhorse.
The honest framing: NPU and discrete GPU are not really competitors. The NPU is for efficient, always-on, small AI. The discrete GPU is for fast, heavy, on-demand AI like running 7B-70B LLMs. They optimize for opposite ends of the trade-off.
So which should you actually buy for local AI?
Match the hardware to what you really plan to do:
- You mainly want to run local LLMs (7B-70B) fast — chat, coding, RAG, agents. Get a discrete RTX GPU. A used RTX 3090 (24GB, ~900 GB/s) is the value pick; it runs the same model 10x+ faster than any current Copilot+ NPU and fits far bigger models. A Copilot+ PC will frustrate you here.
- You want a thin, light, all-day laptop with built-in AI features and occasional small-model dabbling. A Copilot+ PC is great — outstanding battery, silent, and the NPU shines on Windows' on-device AI. Just keep LLM expectations to small (3B-8B) models at modest speeds.
- You want both portability and real local-LLM speed. Today that usually means a discrete-GPU laptop (or an eGPU / desktop), accepting worse battery — or running models on phones for light tasks where an NPU-class mobile chip already does a lot.
- You only have a Copilot+ PC and want to try local AI anyway. You can — install Ollama and run a 3B-8B model on the CPU. It works, it is just slow, and you are not using the NPU.
The blunt summary: Copilot+ NPUs are a battery story, not a local-LLM-speed story. For running real local models fast in 2026, a discrete RTX GPU still wins comfortably, and it usually costs less than a flagship Copilot+ laptop.
Key Takeaways
- Copilot+ NPUs are real but specialized. Snapdragon X (45 TOPS), Snapdragon X2 (80 TOPS), Lunar Lake (48 TOPS), and Strix Point (50 TOPS) are built for efficient, always-on small AI — not big-model throughput.
- Most local LLM tools don't use the NPU. On a Snapdragon X Elite, Ollama runs CPU-only; reliable NPU offload for mainstream stacks isn't here yet (Intel's IPEX-LLM is the closest, and it's a separate, model-limited path).
- Bandwidth decides token speed. A discrete RTX has ~900-1,000 GB/s of dedicated VRAM vs a Copilot+ laptop's ~135-152 GB/s shared memory. That is about 6x on paper, and the measured gap is wider: ~5-10 tok/s vs ~100 tok/s on an 8B model.
- The NPU wins on power. It runs lightweight AI at a fraction of a GPU's wattage, which is why Copilot+ laptops have such great battery life.
- Recommendation: buy a discrete RTX GPU to run local LLMs fast; buy a Copilot+ PC for portability, battery, and built-in Windows AI. They solve opposite problems.
Next Steps
- Pick the right card with our full ranked guide: Best GPUs for Local AI, from the RTX 3060 up through the latest flagships, with tested tok/s.
- See why the value champion holds up in RTX 3090 for local AI — the best 24GB GPU per dollar for running local models.
- Want truly portable local AI? Read how to run an LLM on your phone — where mobile NPUs and small models actually shine.
- Not sure which GPU fits your models and budget? Use our Which GPU to buy interactive picker.
For the official requirements and capabilities, see Microsoft's Copilot+ PC NPU developer guide. The open-source llama.cpp project is the easiest way to benchmark CPU, GPU, and (where supported) NPU paths on your own hardware.