
Copilot+ PC vs RTX GPU for Local AI (2026): NPU or GPU?

June 20, 2026
11 min
Local AI Master Research Team

For actually running local LLMs in 2026, a discrete RTX GPU beats a Copilot+ PC's NPU — and it is not close. Copilot+ NPUs (Snapdragon X at 45 TOPS, Snapdragon X2 at 80 TOPS, Intel Lunar Lake at 48 TOPS, AMD Strix Point at 50 TOPS) are real AI accelerators, but most local LLM tools (Ollama, llama.cpp) still run on the GPU or CPU, not the NPU. On a Snapdragon X Elite, Ollama runs CPU-only and an 8B model lands around 5-10 tokens/sec; a used RTX 3090 runs the same model at roughly 100 tokens/sec because token generation is bound by memory bandwidth (a 3090 has ~936 GB/s versus a Copilot+ laptop's ~135-152 GB/s of shared LPDDR5x). Buy a Copilot+ PC for all-day battery and lightweight built-in AI features; buy or build an RTX machine if running 7B-70B local models fast is the actual goal.

This guide compares Copilot+ PC NPUs against a discrete RTX GPU for one specific job — running local large language models — with verified specs, real throughput numbers, and an honest recommendation.

What is a Copilot+ PC, and what is its NPU?

A Copilot+ PC is Microsoft's certification for laptops with a dedicated NPU (Neural Processing Unit) rated at 40+ TOPS (trillion operations per second), paired with at least 16GB of RAM and 256GB of storage. The NPU is a fixed-function AI accelerator that sits next to the CPU and GPU on the same chip, designed to run small AI models at very low power — think live captions, background blur, image cleanup, and Windows' on-device Copilot features.

As of mid-2026 the main Copilot+ chips are:

  • Qualcomm Snapdragon X Elite / X Plus — Arm-based, 45 TOPS Hexagon NPU.
  • Qualcomm Snapdragon X2 Elite / X2 Plus — the 2026 refresh, 80 TOPS Hexagon NPU.
  • Intel Core Ultra 200V "Lunar Lake" — x86, 48 TOPS NPU.
  • AMD Ryzen AI 300 "Strix Point" — x86, 50 TOPS XDNA 2 NPU.

That TOPS number sounds enormous next to a phone, but TOPS measures peak low-precision throughput for small models — it is not the metric that decides how fast a multi-billion-parameter LLM generates text. The thing that decides that is memory bandwidth, and that is where the comparison gets interesting.

NPU vs discrete GPU: spec-for-spec

A Copilot+ NPU and a discrete RTX GPU are built for very different jobs. The NPU optimizes for performance per watt; the GPU optimizes for raw parallel throughput and bandwidth. For LLM token generation, bandwidth is king.

| Spec | Copilot+ NPU (e.g. Snapdragon X2) | RTX 3090 (discrete) | RTX 4090 (discrete) |
| --- | --- | --- | --- |
| AI rating | ~80 TOPS (NPU) | ~285 TFLOPS tensor | ~660+ TFLOPS tensor |
| Memory | Shared LPDDR5x (system RAM) | 24GB dedicated GDDR6X | 24GB dedicated GDDR6X |
| Memory bandwidth | ~152 GB/s (shared) | ~936 GB/s | ~1,008 GB/s |
| Typical power draw | ~10-30W (whole laptop) | 350W (card) | 450W (card) |
| Runs Ollama/llama.cpp on the accelerator? | Mostly no (CPU/iGPU path) | Yes, full CUDA | Yes, full CUDA |
| Max practical local model | ~7B-8B (slow) | up to ~70B (tight quant) | up to ~70B (tight quant) |

The row that decides everything is memory bandwidth. A Copilot+ laptop shares one pool of LPDDR5x between the CPU, GPU, and NPU: about 135 GB/s on the first-gen Snapdragon X Elite and about 152 GB/s on the 2026 Snapdragon X2 Elite. A discrete RTX card has its own dedicated 24GB at ~936-1,008 GB/s, roughly 6-7x more bandwidth feeding the chip. Since LLM text generation streams the entire model's weights through memory for every single token, that bandwidth gap puts a floor of roughly 6x under the throughput gap.
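To make the bandwidth argument concrete, here is a minimal sketch of the decode ceiling it implies: tokens per second can be no higher than memory bandwidth divided by the bytes of weights read per token. The bandwidth figures are the ones quoted in this article. The 4.5 bits-per-weight figure for a 4-bit quantized 8B model is a round assumption, and the function and variable names are illustrative only.

```python
# Rough upper bound on decode speed: every generated token streams all
# model weights through memory once, so tok/s <= bandwidth / weight bytes.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling on tokens/sec (ignores KV cache and overheads)."""
    return bandwidth_gb_s / weights_gb

# Assumption: 8B parameters at ~4.5 bits per weight is about 4.5 GB of weights.
WEIGHTS_GB = 8e9 * 4.5 / 8 / 1e9

# Bandwidth figures (GB/s) as quoted in the article.
systems = {
    "Snapdragon X Elite (shared LPDDR5x)": 135,
    "Snapdragon X2 Elite (shared LPDDR5x)": 152,
    "RTX 3090 (GDDR6X)": 936,
    "RTX 4090 (GDDR6X)": 1008,
}

ceilings = {name: decode_ceiling_tok_s(bw, WEIGHTS_GB) for name, bw in systems.items()}

for name, tok_s in ceilings.items():
    print(f"{name:40s} ceiling ~{tok_s:4.0f} tok/s")
```

These are ceilings, not predictions. Real throughput lands below them on every platform, and how far below depends on the software path, not only on the memory.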

Can an NPU actually run a local LLM today?

Here is the part the marketing skips. Most local LLM tooling does not use the NPU at all. As of mid-2026:

  • Ollama has Arm64 Windows builds for Snapdragon, but GPU/NPU acceleration is not exposed — on a Snapdragon X Elite, everything runs CPU-only through the Ollama CLI.
  • llama.cpp runs on these machines but again leans on CPU/iGPU; reliable, production-grade NPU offload is not there for mainstream use.
  • Intel is furthest along: its IPEX-LLM library can target the Lunar Lake NPU and integrates with llama.cpp/Ollama on Windows, but it is a separate stack, model-limited, and far from plug-and-play.
  • Qualcomm's own QNN runtime can push an LLM onto the Hexagon NPU — but you are using Qualcomm-converted models through a separate SDK, not just ollama run.

So when you "run a local model" on a Copilot+ PC with the normal tools, you are almost always running it on the CPU, not the headline NPU. The NPU is busy doing what it was actually designed for: small, fixed AI tasks like Windows Studio Effects, live captions, and Recall — not 8-billion-parameter chatbots.
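If you want to check where a model actually landed on your own machine, one option is to ask a running Ollama server. The sketch below assumes the server's /api/ps endpoint reports `size` and `size_vram` for each loaded model, and treats a `size_vram` of zero as CPU-only. The helper names and the sample entries are made up for illustration, and nothing here detects NPU use.

```python
import json
import urllib.request

def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model resident in GPU memory (0.0 means CPU-only)."""
    size = model_entry.get("size", 0)
    size_vram = model_entry.get("size_vram", 0)
    return size_vram / size if size else 0.0

def loaded_models(host: str = "http://localhost:11434") -> list:
    """Ask a running Ollama server which models are currently loaded."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp).get("models", [])

def report(models: list) -> list:
    """One human-readable line per loaded model."""
    lines = []
    for m in models:
        frac = gpu_fraction(m)
        where = "CPU-only" if frac == 0 else f"{frac:.0%} on GPU"
        lines.append(f"{m.get('name', '?')}: {where}")
    return lines

# Hypothetical entries shaped like the /api/ps response, so this runs offline.
sample = [
    {"name": "example-8b (laptop)", "size": 5_000_000_000, "size_vram": 0},
    {"name": "example-8b (RTX box)", "size": 5_000_000_000, "size_vram": 5_000_000_000},
]
sample_report = report(sample)
print("\n".join(sample_report))

# Against a live server: print("\n".join(report(loaded_models())))
```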

First-hand note: in our testing, a discrete RTX 3090 (24GB) runs an 8B model at roughly 100 tokens/sec at Q4_K_M through Ollama, no special setup — just ollama run. Independent reports put an 8B model on Snapdragon X Elite at around 5 tokens/sec on the CPU path, and even with hand-tuned QNN/NPU acceleration on a Qualcomm-converted 8B model, throughput lands near 10 tokens/sec. That is below the ~20 tok/s "feels like real-time typing" threshold — usable for short answers, painful for anything long.
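To reproduce numbers like these on your own hardware, a small script can read the timing fields Ollama returns with a non-streaming /api/generate response. This sketch assumes the response carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The function names, the ~20 tok/s constant taken from the paragraph above, and the sample response are illustrative, not measured data.

```python
import json
import urllib.request

REALTIME_TOK_S = 20.0  # the "feels like real-time typing" threshold from the text

def decode_tok_s(response: dict) -> float:
    """Decode speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def feels_realtime(response: dict) -> bool:
    """True when decode speed meets the ~20 tok/s threshold."""
    return decode_tok_s(response) >= REALTIME_TOK_S

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Run one non-streaming generation against a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Made-up response fields, so the arithmetic can be checked offline.
sample = {"eval_count": 200, "eval_duration": 2_000_000_000}
print(f"{decode_tok_s(sample):.1f} tok/s, real-time: {feels_realtime(sample)}")

# Against a live server (model name is whatever you have pulled):
# print(decode_tok_s(generate("your-model", "Explain memory bandwidth.")))
```

Run the same prompt a few times and ignore the first result, since the first call also pays for loading the model.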

Why does the GPU win so decisively on tokens per second?

It comes down to how LLM inference works. There are two phases:

  1. Prefill (prompt processing) — your whole prompt is read in parallel. This is compute-bound, and it is the one phase where a high-TOPS NPU can look good.
  2. Decode (generating the answer) — tokens come out one at a time, and each token requires streaming all of the model's weights from memory. This is memory-bandwidth-bound.

Almost all of the time you "feel" in a chat is the decode phase, and decode speed is gated by bandwidth. A discrete RTX GPU has ~900-1,000 GB/s of dedicated VRAM bandwidth; a Copilot+ laptop has ~135-152 GB/s of shared system memory. Roughly double the bandwidth and you roughly double the tokens per second, so the 6x bandwidth gap alone predicts about a 6x speed gap. The ~5-10 tok/s vs ~100 tok/s difference we see in practice is wider than that, which suggests the CPU path on a Copilot+ laptop also gets less out of the bandwidth it has.
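The two phases can be put into a back-of-the-envelope latency model. This is an idealized sketch, not a benchmark: it assumes about 2 operations per parameter per prompt token for prefill, one full read of the weights per generated token for decode, and a sustained compute rate of 20 trillion operations per second that is purely hypothetical. The 135 GB/s bandwidth comes from the article, and the 4.5 GB weight size is an assumed figure for a 4-bit quantized 8B model.

```python
def prefill_seconds(params: float, prompt_tokens: int, ops_per_s: float) -> float:
    """Compute-bound phase: about 2 operations per parameter per prompt token."""
    return 2 * params * prompt_tokens / ops_per_s

def decode_seconds(weight_bytes: float, output_tokens: int, bytes_per_s: float) -> float:
    """Bandwidth-bound phase: all weights are read once per generated token."""
    return output_tokens * weight_bytes / bytes_per_s

PARAMS = 8e9           # 8B-parameter model
WEIGHT_BYTES = 4.5e9   # assumed size of the quantized weights
PROMPT_TOKENS = 500    # hypothetical prompt length
OUTPUT_TOKENS = 500    # hypothetical answer length
OPS_PER_S = 20e12      # hypothetical sustained compute rate
BYTES_PER_S = 135e9    # shared LPDDR5x bandwidth quoted in the article

prefill = prefill_seconds(PARAMS, PROMPT_TOKENS, OPS_PER_S)
decode = decode_seconds(WEIGHT_BYTES, OUTPUT_TOKENS, BYTES_PER_S)
decode_share = decode / (prefill + decode)

# What-if: double the compute versus double the bandwidth.
faster_compute_total = prefill_seconds(PARAMS, PROMPT_TOKENS, 2 * OPS_PER_S) + decode
faster_memory_total = prefill + decode_seconds(WEIGHT_BYTES, OUTPUT_TOKENS, 2 * BYTES_PER_S)

print(f"prefill {prefill:.2f}s, decode {decode:.2f}s, decode share {decode_share:.0%}")
print(f"2x compute   -> total {faster_compute_total:.2f}s")
print(f"2x bandwidth -> total {faster_memory_total:.2f}s")
```

Under these assumptions, doubling compute barely moves the total, while doubling bandwidth nearly halves it.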

The NPU's big TOPS number helps with prefill and with tiny always-on models, but it does nothing to widen the narrow memory pipe that decode depends on. That is why a $1,800 Copilot+ laptop can lose badly to a ~$900 used 3090 at the one job of running a chatbot fast.

| Workload | Copilot+ NPU/CPU | Discrete RTX GPU | Winner |
| --- | --- | --- | --- |
| 8B model chat (tok/s) | ~5-10 | ~100 | RTX (decisive) |
| Loading a 32B-70B model | Won't fit / unusable | Yes, with quant | RTX |
| All-day unplugged battery | Excellent | N/A (desktop) / poor | Copilot+ |
| Background AI (captions, blur) | Excellent, near-zero battery | Wasteful | Copilot+ |
| Image generation (SD/FLUX) | Slow | Fast | RTX |
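The "will it fit" question in the table is mostly arithmetic: weight size is parameter count times bits per weight. The sketch below assumes a flat 2 GB allowance for KV cache and runtime overhead, which is a placeholder rather than a measured value. The bits-per-weight values are illustrative round numbers, and the 16 GB case uses the Copilot+ minimum RAM as if all of it were available to the model.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params * bits / 8."""
    return params_billion * bits_per_weight / 8

def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in memory."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb

cases = [
    # (label, params in billions, bits per weight, memory in GB)
    ("8B @ 4.5 bpw on 24 GB RTX", 8, 4.5, 24),
    ("32B @ 4.5 bpw on 24 GB RTX", 32, 4.5, 24),
    ("70B @ 4.5 bpw on 24 GB RTX", 70, 4.5, 24),
    ("70B @ 2.5 bpw on 24 GB RTX", 70, 2.5, 24),
    ("32B @ 4.5 bpw on 16 GB shared RAM", 32, 4.5, 16),
]

results = {label: fits(p, b, mem) for label, p, b, mem in cases}

for label, p, b, mem in cases:
    verdict = "fits" if results[label] else "does not fit"
    print(f"{label:36s} {weights_gb(p, b):6.2f} GB weights -> {verdict}")
```

This is why the table says "tight quant" for 70B on a 24GB card: at these assumptions the weights have to be squeezed to roughly 2.5 bits each before they fit in VRAM.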

Where the NPU genuinely wins: battery and power

None of this means the NPU is pointless — it is excellent at the job it was built for. The whole point of an NPU is performance per watt. It can run small models continuously while sipping power, which is exactly what you want for features that run all day: live captions, noise suppression, eye-contact correction, background blur, and on-device Copilot tasks. Research consistently shows NPUs delivering comparable small-model results at less than half the power of doing the same work on a GPU.

That efficiency is why Copilot+ laptops get genuinely outstanding battery life. A discrete RTX laptop GPU under an LLM load can pull well over 100W (desktop cards reach 350-450W) and drain a battery in well under an hour; an NPU doing its intended lightweight AI barely registers. So if your priority is a thin, silent, all-day laptop that does helpful background AI, the Copilot+ NPU is the right tool. It just is not a local-LLM workhorse.
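The battery claim is simple division: capacity in watt-hours over draw in watts. The sketch below uses a hypothetical 70 Wh battery and an assumed 150W draw for a discrete laptop GPU under load. The 10W and 30W figures are the whole-laptop range from the spec table earlier. Many laptops cap GPU power on battery, so treat this as illustration only.

```python
def runtime_hours(battery_wh: float, draw_watts: float) -> float:
    """Hours until a full battery is empty at a constant power draw."""
    return battery_wh / draw_watts

BATTERY_WH = 70.0  # hypothetical laptop battery capacity

scenarios = {
    "Background AI on the NPU, ~10 W whole laptop": 10.0,
    "Heavier mixed use, ~30 W whole laptop": 30.0,
    "Discrete laptop GPU under LLM load, ~150 W (assumed)": 150.0,
}

runtimes = {label: runtime_hours(BATTERY_WH, watts) for label, watts in scenarios.items()}

for label, hours in runtimes.items():
    print(f"{label:54s} ~{hours:4.1f} h ({hours * 60:.0f} min)")
```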

The honest framing: NPU and discrete GPU are not really competitors. The NPU is for efficient, always-on, small AI. The discrete GPU is for fast, heavy, on-demand AI like running 7B-70B LLMs. They optimize for opposite ends of the trade-off.

So which should you actually buy for local AI?

Match the hardware to what you really plan to do:

  • You mainly want to run local LLMs (7B-70B) fast — chat, coding, RAG, agents. Get a discrete RTX GPU. A used RTX 3090 (24GB, ~900 GB/s) is the value pick; it runs the same model 10x+ faster than any current Copilot+ NPU and fits far bigger models. A Copilot+ PC will frustrate you here.
  • You want a thin, light, all-day laptop with built-in AI features and occasional small-model dabbling. A Copilot+ PC is great — outstanding battery, silent, and the NPU shines on Windows' on-device AI. Just keep LLM expectations to small (3B-8B) models at modest speeds.
  • You want both portability and real local-LLM speed. Today that usually means a discrete-GPU laptop (or a thin laptop paired with an eGPU or a desktop) and accepting worse battery life. For light tasks, a phone is another option, since NPU-class mobile chips already handle small models well.
  • You only have a Copilot+ PC and want to try local AI anyway. You can — install Ollama and run a 3B-8B model on the CPU. It works, it is just slow, and you are not using the NPU.

The blunt summary: Copilot+ NPUs are a battery story, not a local-LLM-speed story. For running real local models fast in 2026, a discrete RTX GPU still wins comfortably, and it usually costs less than a flagship Copilot+ laptop.

Key Takeaways

  1. Copilot+ NPUs are real but specialized. Snapdragon X (45 TOPS), Snapdragon X2 (80 TOPS), Lunar Lake (48 TOPS), and Strix Point (50 TOPS) are built for efficient, always-on small AI — not big-model throughput.
  2. Most local LLM tools don't use the NPU. On a Snapdragon X Elite, Ollama runs CPU-only; reliable NPU offload for mainstream stacks isn't here yet (Intel's IPEX-LLM is the closest, and it's a separate, model-limited path).
  3. Bandwidth decides token speed. A discrete RTX has ~900-1,000 GB/s of dedicated VRAM vs a Copilot+ laptop's ~135-152 GB/s shared memory. That is about 6x on paper, and the measured gap is wider: ~5-10 tok/s vs ~100 tok/s on an 8B model.
  4. The NPU wins on power. It runs lightweight AI at a fraction of a GPU's wattage, which is why Copilot+ laptops have such great battery life.
  5. Recommendation: buy a discrete RTX GPU to run local LLMs fast; buy a Copilot+ PC for portability, battery, and built-in Windows AI. They solve opposite problems.

Next Steps

  • Pick the right card with our full ranked guide: Best GPUs for Local AI, from the RTX 3060 up through the latest flagships, with tested tok/s.
  • See why the value champion holds up in RTX 3090 for local AI — the best 24GB GPU per dollar for running local models.
  • Want truly portable local AI? Read how to run an LLM on your phone — where mobile NPUs and small models actually shine.
  • Not sure which GPU fits your models and budget? Use our Which GPU to buy interactive picker.

For the official requirements and capabilities, see Microsoft's Copilot+ PC NPU developer guide. The open-source llama.cpp project is the easiest way to benchmark CPU, GPU, and (where supported) NPU paths on your own hardware.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

Published: June 20, 2026 | Last Updated: June 20, 2026 | Manually Reviewed
