For actually running local LLMs in 2026, a discrete RTX GPU beats a Copilot+ PC's NPU — and it is not close. Copilot+ NPUs (Snapdragon X at 45 TOPS, Snapdragon X2 at 80 TOPS, Intel Lunar Lake at 48 TOPS, AMD Strix Point at 50 TOPS) are real AI accelerators, but most local LLM tools (Ollama, llama.cpp) still run on the GPU or CPU, not the NPU. On a Snapdragon X Elite, Ollama runs CPU-only and an 8B model lands around 5-10 tokens/sec; a used RTX 3090 runs the same model at roughly 100 tokens/sec because token generation is bound by memory bandwidth (a 3090 has ~936 GB/s versus a Copilot+ laptop's ~135-152 GB/s of shared LPDDR5x). Buy a Copilot+ PC for all-day battery and lightweight built-in AI features; buy or build an RTX machine if running 7B-70B local models fast is the actual goal.

This guide compares Copilot+ PC NPUs against a discrete RTX GPU for one specific job — running local large language models — with verified specs, real throughput numbers, and an honest recommendation.

What is a Copilot+ PC, and what is its NPU?

A Copilot+ PC is Microsoft's certification for laptops with a dedicated NPU (Neural Processing Unit) rated at 40+ TOPS (trillions of operations per second), paired with at least 16GB of RAM and 256GB of storage. The NPU is a specialized AI accelerator that sits next to the CPU and GPU on the same chip, designed to run small AI models at very low power: live captions, background blur, image cleanup, and Windows' on-device Copilot features.

As of mid-2026 the main Copilot+ chips are:

Qualcomm Snapdragon X Elite / X Plus — Arm-based, 45 TOPS Hexagon NPU.
Qualcomm Snapdragon X2 Elite / X2 Plus — the 2026 refresh, 80 TOPS Hexagon NPU.
Intel Core Ultra 200V "Lunar Lake" — x86, 48 TOPS NPU.
AMD Ryzen AI 300 "Strix Point" — x86, 50 TOPS XDNA 2 NPU.

That TOPS number sounds enormous next to a phone, but TOPS measures peak low-precision throughput for small models — it is not the metric that decides how fast a multi-billion-parameter LLM generates text. The thing that decides that is memory bandwidth, and that is where the comparison gets interesting.

NPU vs discrete GPU: spec-for-spec

A Copilot+ NPU and a discrete RTX GPU are built for very different jobs. The NPU optimizes for performance per watt; the GPU optimizes for raw parallel throughput and bandwidth. For LLM token generation, bandwidth is king.

Spec	Copilot+ NPU (e.g. Snapdragon X2)	RTX 3090 (discrete)	RTX 4090 (discrete)
AI rating	~80 TOPS (NPU)	~285 TFLOPS tensor (FP16, with sparsity)	~660 TFLOPS tensor (FP16, with sparsity)
Memory	Shared LPDDR5x (system RAM)	24GB dedicated GDDR6X	24GB dedicated GDDR6X
Memory bandwidth	~152 GB/s (shared)	~936 GB/s	~1,008 GB/s
Typical power draw	~10-30W (whole laptop)	350W (card)	450W (card)
Runs Ollama/llama.cpp on the accelerator?	Mostly no — CPU/iGPU path	Yes — full CUDA	Yes — full CUDA
Max practical local model	~7B-8B (slow)	up to ~70B (tight quant)	up to ~70B (tight quant)

The row that decides everything is memory bandwidth. A Copilot+ laptop shares one ~135-152 GB/s pool of LPDDR5x between the CPU, GPU, and NPU (135 GB/s on the first-gen Snapdragon X Elite, ~152 GB/s on the 2026 Snapdragon X2 Elite). A discrete RTX card has its own dedicated 24GB at ~936-1,008 GB/s, roughly 6-7x more bandwidth feeding the chip. Since LLM text generation streams the entire model's weights through memory for every single token, bandwidth caps tokens per second, so the 6-7x bandwidth gap carries over into throughput. The measured gap is wider still (~5-10 tok/s vs ~100 tok/s on an 8B model), which suggests the laptop's CPU path falls further short of its bandwidth ceiling than the GPU does.
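The bandwidth argument can be turned into a back-of-the-envelope ceiling. The sketch below is only an estimate under one stated assumption: an 8B model at 4-bit quantization occupies roughly 5 GB (that figure is ours, not a measured value). The bandwidth numbers are the ones quoted in this article, and the function name is illustrative.

```python
# Rough upper bound on decode speed: every generated token streams the full
# set of weights once, so tokens/sec cannot exceed bandwidth / model size.
# This ignores KV-cache traffic and software overhead, so real numbers are lower.


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens per second."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: an 8B model at 4-bit quantization is roughly 5 GB of weights.
MODEL_GB = 5.0

# Memory bandwidth figures quoted in this article (GB/s).
MACHINES = {
    "Snapdragon X Elite (shared LPDDR5x)": 135.0,
    "Snapdragon X2 Elite (shared LPDDR5x)": 152.0,
    "RTX 3090 (GDDR6X)": 936.0,
    "RTX 4090 (GDDR6X)": 1008.0,
}

for name, bandwidth in MACHINES.items():
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Under that 5 GB assumption the ceilings come out near 27 and 30 tok/s for the two laptops and near 187 and 202 tok/s for the two cards. The throughput figures reported in this article all sit below their respective ceilings, which is what a bandwidth-bound model predicts.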

Can an NPU actually run a local LLM today?

Here is the part the marketing skips. Most local LLM tooling does not use the NPU at all. As of mid-2026:

Ollama has Arm64 Windows builds for Snapdragon, but GPU/NPU acceleration is not exposed — on a Snapdragon X Elite, everything runs CPU-only through the Ollama CLI.
llama.cpp runs on these machines but again leans on CPU/iGPU; reliable, production-grade NPU offload is not there for mainstream use.
Intel is furthest along: its IPEX-LLM library can target the Lunar Lake NPU and integrates with llama.cpp/Ollama on Windows, but it is a separate stack, model-limited, and far from plug-and-play.
Qualcomm's own QNN runtime can push an LLM onto the Hexagon NPU — but you are using Qualcomm-converted models through a separate SDK, not just ollama run.

So when you "run a local model" on a Copilot+ PC with the normal tools, you are almost always running it on the CPU, not the headline NPU. The NPU is busy doing what it was actually designed for: small, fixed AI tasks like Windows Studio Effects, live captions, and Recall — not 8-billion-parameter chatbots.

First-hand note: in our testing, a discrete RTX 3090 (24GB) runs an 8B model at roughly 100 tokens/sec at Q4_K_M through Ollama, no special setup — just ollama run. Independent reports put an 8B model on Snapdragon X Elite at around 5 tokens/sec on the CPU path, and even with hand-tuned QNN/NPU acceleration on a Qualcomm-converted 8B model, throughput lands near 10 tokens/sec. That is below the ~20 tok/s "feels like real-time typing" threshold — usable for short answers, painful for anything long.
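To check numbers like these on your own machine, one option is to read the timing fields that Ollama returns from its local HTTP API. The sketch below assumes an Ollama server on its default port (11434) and a non-streaming call to /api/generate, whose response reports eval_count (generated tokens) and eval_duration (nanoseconds). The helper names and the model tag in the comment are illustrative, not part of our test setup.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Decode speed from an Ollama generate response (eval_duration is in ns)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def measure(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation against a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))


# Example call (needs a running server and a pulled model; the tag is an assumption):
# print(measure("llama3.1:8b", "Explain memory bandwidth in one paragraph."))
```

Only the decode phase is counted here, since eval_count and eval_duration exclude prompt processing. A longer generation gives a steadier figure than a one-line answer.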

Why does the GPU win so decisively on tokens per second?

It comes down to how LLM inference works. There are two phases:

Prefill (prompt processing) — your whole prompt is read in parallel. This is compute-bound, and it is the one phase where a high-TOPS NPU can look good.
Decode (generating the answer) — tokens come out one at a time, and each token requires streaming all of the model's weights from memory. This is memory-bandwidth-bound.

Almost all of the time you "feel" in a chat is the decode phase, and decode speed is gated by bandwidth. A discrete RTX GPU has ~900-1,000 GB/s of dedicated VRAM bandwidth; a Copilot+ laptop has ~135-152 GB/s of shared system memory. Roughly double the bandwidth and you roughly double the tokens per second, so the 6-7x bandwidth gap accounts for most of the ~5-10 tok/s vs ~100 tok/s difference we see in practice; the remainder is consistent with the normal tools running on the laptop's CPU rather than on an accelerator.
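The two phases can be put side by side with a simple latency model. This is a sketch, not a benchmark: it uses the common approximation of about 2 FLOPs per parameter per token for prefill, and the sustained compute figure passed in is a hypothetical placeholder rather than a spec of any real chip. The ~5 GB weight size for a 4-bit 8B model is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    compute_tflops: float   # sustained compute; hypothetical, not a peak spec
    bandwidth_gb_s: float   # memory bandwidth in GB/s


def prefill_seconds(dev: Device, params_billion: float, prompt_tokens: int) -> float:
    """Compute-bound phase: about 2 FLOPs per parameter per prompt token."""
    flops = 2.0 * params_billion * 1e9 * prompt_tokens
    return flops / (dev.compute_tflops * 1e12)


def decode_seconds(dev: Device, model_gb: float, output_tokens: int) -> float:
    """Bandwidth-bound phase: the full weights are read once per output token."""
    return output_tokens * model_gb / dev.bandwidth_gb_s


# Illustrative devices; the compute figures are placeholders for the sketch.
laptop = Device("Copilot+ laptop", compute_tflops=10.0, bandwidth_gb_s=135.0)
desktop = Device("RTX 3090 desktop", compute_tflops=50.0, bandwidth_gb_s=936.0)

for dev in (laptop, desktop):
    prefill = prefill_seconds(dev, params_billion=8.0, prompt_tokens=500)
    decode = decode_seconds(dev, model_gb=5.0, output_tokens=500)
    print(f"{dev.name}: prefill ~{prefill:.2f}s, decode ~{decode:.1f}s")
```

With these placeholder inputs, decode takes far longer than prefill on both devices, so raising compute alone barely changes the total wait. Changing bandwidth_gb_s moves the total almost proportionally.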

The NPU's big TOPS number helps with prefill and with tiny always-on models, but it does nothing to widen the narrow memory pipe that decode depends on. That is why a $1,800 Copilot+ laptop can lose badly to a ~$900 used 3090 at the one job of running a chatbot fast.

Workload	Copilot+ NPU/CPU	Discrete RTX GPU	Winner
8B model chat (tok/s)	~5-10	~100	RTX (decisive)
Loading a 32B-70B model	Won't fit / unusable	Yes, with quant	RTX
All-day unplugged battery	Excellent	N/A (desktop) / poor	Copilot+
Background AI (captions, blur)	Excellent, near-zero battery	Wasteful	Copilot+
Image generation (SD/FLUX)	Slow	Fast	RTX
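Whether a model loads at all is mostly arithmetic on weight size. The sketch below estimates weight memory as parameters times bits per weight, plus a flat allowance for KV cache and runtime buffers. The bits-per-weight values (about 4.8 for a Q4_K_M-class quant, about 2.4 for a very tight quant) and the 2 GB overhead are our assumptions for illustration, not measured figures.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus a flat overhead allowance fit in memory_gb."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


VRAM_GB = 24.0  # RTX 3090 / RTX 4090

# (label, parameters in billions, assumed bits per weight)
CASES = [
    ("8B at ~4.8 bpw", 8.0, 4.8),
    ("32B at ~4.8 bpw", 32.0, 4.8),
    ("70B at ~4.8 bpw", 70.0, 4.8),
    ("70B at ~2.4 bpw", 70.0, 2.4),
]

for label, params, bpw in CASES:
    size = weights_gb(params, bpw)
    verdict = "fits" if fits(params, bpw, VRAM_GB) else "does not fit"
    print(f"{label}: ~{size:.1f} GB of weights, {verdict} in {VRAM_GB:.0f} GB")
```

Under these assumptions a 70B model only fits a 24GB card at a very aggressive quantization, which is why the tables above qualify the 70B entries with "tight quant".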

Where the NPU genuinely wins: battery and power

None of this means the NPU is pointless — it is excellent at the job it was built for. The whole point of an NPU is performance per watt. It can run small models continuously while sipping power, which is exactly what you want for features that run all day: live captions, noise suppression, eye-contact correction, background blur, and on-device Copilot tasks. Research consistently shows NPUs delivering comparable small-model results at less than half the power of doing the same work on a GPU.

That efficiency is why Copilot+ laptops get genuinely outstanding battery life. A discrete RTX laptop GPU under an LLM load can pull 100W or more (desktop cards such as the 3090 and 4090 draw 350-450W) and drain a battery in well under an hour; an NPU doing its intended lightweight AI barely registers. So if your priority is a thin, silent, all-day laptop that does helpful background AI, the Copilot+ NPU is the right tool. It just is not a local-LLM workhorse.
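The battery claim is simple division. The sketch below uses a hypothetical 70 Wh battery, a 150 W draw for a discrete-GPU laptop under load, and a 10 W draw for a Copilot+ laptop doing light work (the low end of the ~10-30W range in the spec table). All three inputs are assumptions chosen for illustration.

```python
def runtime_hours(battery_wh: float, draw_watts: float) -> float:
    """Idealized runtime: battery capacity divided by average power draw."""
    if draw_watts <= 0:
        raise ValueError("draw_watts must be positive")
    return battery_wh / draw_watts


BATTERY_WH = 70.0  # hypothetical laptop battery

gpu_laptop = runtime_hours(BATTERY_WH, 150.0)   # discrete GPU under LLM load
copilot_pc = runtime_hours(BATTERY_WH, 10.0)    # light, NPU-assisted workload

print(f"Discrete-GPU laptop: ~{gpu_laptop * 60:.0f} minutes")
print(f"Copilot+ laptop:     ~{copilot_pc:.0f} hours")
```

With these inputs the GPU laptop lasts about 28 minutes and the Copilot+ laptop about 7 hours. Real machines throttle on battery and draw varies with the workload, so treat the result as an order-of-magnitude comparison.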

The honest framing: NPU and discrete GPU are not really competitors. The NPU is for efficient, always-on, small AI. The discrete GPU is for fast, heavy, on-demand AI like running 7B-70B LLMs. They optimize for opposite ends of the trade-off.

So which should you actually buy for local AI?

Match the hardware to what you really plan to do:

You mainly want to run local LLMs (7B-70B) fast for chat, coding, RAG, or agents. Get a discrete RTX GPU. A used RTX 3090 (24GB, ~936 GB/s) is the value pick; it runs the same model 10x+ faster than any current Copilot+ NPU and fits far bigger models. A Copilot+ PC will frustrate you here.
You want a thin, light, all-day laptop with built-in AI features and occasional small-model dabbling. A Copilot+ PC is great — outstanding battery, silent, and the NPU shines on Windows' on-device AI. Just keep LLM expectations to small (3B-8B) models at modest speeds.
You want both portability and real local-LLM speed. Today that usually means a discrete-GPU laptop (or an eGPU / desktop), accepting worse battery life. For light tasks, a phone running a small model is another option, since NPU-class mobile chips already handle those well.
You only have a Copilot+ PC and want to try local AI anyway. You can — install Ollama and run a 3B-8B model on the CPU. It works, it is just slow, and you are not using the NPU.

The blunt summary: Copilot+ NPUs are a battery story, not a local-LLM-speed story. For running real local models fast in 2026, a discrete RTX GPU still wins comfortably, and it usually costs less than a flagship Copilot+ laptop.

Key Takeaways

Copilot+ NPUs are real but specialized. Snapdragon X (45 TOPS), Snapdragon X2 (80 TOPS), Lunar Lake (48 TOPS), and Strix Point (50 TOPS) are built for efficient, always-on small AI — not big-model throughput.
Most local LLM tools don't use the NPU. On a Snapdragon X Elite, Ollama runs CPU-only; reliable NPU offload for mainstream stacks isn't here yet (Intel's IPEX-LLM is the closest, and it's a separate, model-limited path).
Bandwidth decides token speed. A discrete RTX has ~900-1,000 GB/s of dedicated VRAM vs a Copilot+ laptop's ~135-152 GB/s shared memory. That is roughly a 6-7x gap in bandwidth and a 10x or larger gap in practice: ~5-10 tok/s vs ~100 tok/s on an 8B model.
The NPU wins on power. It runs lightweight AI at a fraction of a GPU's wattage, which is why Copilot+ laptops have such great battery life.
Recommendation: buy a discrete RTX GPU to run local LLMs fast; buy a Copilot+ PC for portability, battery, and built-in Windows AI. They solve opposite problems.

Next Steps

Pick the right card with our full ranked guide: Best GPUs for Local AI, from the RTX 3060 up through the latest flagships, with tested tok/s.
See why the value champion holds up in RTX 3090 for local AI — the best 24GB GPU per dollar for running local models.
Want truly portable local AI? Read how to run an LLM on your phone — where mobile NPUs and small models actually shine.
Not sure which GPU fits your models and budget? Use our Which GPU to buy interactive picker.

For the official requirements and capabilities, Microsoft documents the Copilot+ PC NPU developer guide, and the open-source llama.cpp project is the easiest way to benchmark CPU, GPU, and (where supported) NPU paths on your own hardware.

Copilot+ PC vs RTX GPU for Local AI (2026): NPU or GPU?

Written by the Local AI Master Team
