The NVIDIA Tesla P40 is a 2016 datacenter card with 24GB of VRAM that sells used for roughly $150-$300, with the most common listings around $260-$330 shipped as of June 2026. That makes it the cheapest path to 24GB for local AI by a wide margin, and its dollars-per-gigabyte is unbeatable. The catch is speed and setup: it is a Pascal card with no Tensor Cores, no usable FP16 (half precision runs at roughly 1/64 of FP32, so you stick to INT8 and GGUF Q4/Q5 quants), no Flash Attention support, and notably slow prompt processing. It also needs a cooling shroud plus a fan and an EPS-to-PCIe power adapter; it does not plug in like a gaming GPU. Verdict: great $/GB, poor speed. The P40 is excellent for hobbyists who want to load big models cheaply and can tolerate slow generation. If you can stretch the budget, a used RTX 3090 (also 24GB) is far faster and far less hassle.

This guide covers the verified specs, the real used price, exactly what the P40 runs and how slowly, and the cooling/power/driver gotchas nobody mentions until your card is already in the mail.

What is the Tesla P40, and why is it so cheap?

The Tesla P40 is an NVIDIA datacenter inference accelerator launched in September 2016, built on the Pascal architecture (the same generation as the GTX 1080). It was designed to run inference workloads in servers, not to drive a monitor — it has no video output at all. Years later, fleets of these cards were retired and dumped onto the secondary market, which is why a card that once cost thousands now sells for the price of a budget gaming GPU.

For local AI, the reason people care is one number: 24GB of VRAM. In local LLM work, VRAM decides which models you can load at all. A 24GB card holds a 7B or 13B model comfortably, a 32B-34B model at a 4-bit quant, and lets you avoid the painful CPU offloading that cripples smaller cards. Getting 24GB any other way means a used RTX 3090 (~$850+) or a much pricier new card — so the P40 is, on paper, an absurd bargain.

The reason it is not an obvious win is everything below the VRAM line: it is old, slow at the things that have gotten fast on modern cards, and physically awkward to install. The rest of this guide is about whether that trade is worth it for you.

How much does a Tesla P40 actually cost in 2026?

Used P40 pricing in mid-2026 sits in a wide band depending on seller, condition, and whether a cooling kit is bundled. Across eBay listings and price trackers, used cards typically run roughly $150 to $300. The tracked average is higher, around $479, because "brand new / old stock" listings skew it upward, and the most common real used listings land around $260-$330 shipped (many from overseas sellers).

Item	Typical cost (Jun 2026)	Notes
Tesla P40 card (used)	~$150-$300	Average ~$479 incl. new-old-stock; common used ~$260-$330
Cooling shroud + fan	~$10-$30	Required — the card has no fan of its own
EPS-to-PCIe power adapter	~$8-$15	Card uses a CPU/EPS-style 8-pin, not a standard PCIe plug
All-in to run one	~$170-$345	Add a PSU with headroom if yours is tight

So the headline "24GB for ~$200" is real, but budget another ~$20-$45 for the cooling and power bits that make it actually usable. Even fully kitted, it is the cheapest 24GB in local AI — that is the entire appeal.
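
The all-in arithmetic can be sketched in a few lines. This is a minimal illustration that uses only the estimated ranges from the table above; the function and variable names are made up for the example, and the sums land close to the table's rounded figures rather than matching them to the dollar.

```python
# Figures are this guide's June 2026 estimates, not live prices.
CARD_USD = (150, 300)       # used Tesla P40, low/high
SHROUD_FAN_USD = (10, 30)   # cooling shroud + fan
ADAPTER_USD = (8, 15)       # EPS-to-PCIe power adapter
VRAM_GB = 24


def all_in_cost(*parts):
    """Sum the low ends and the high ends of several (low, high) ranges."""
    low = sum(p[0] for p in parts)
    high = sum(p[1] for p in parts)
    return low, high


def usd_per_gb(cost_range, vram_gb=VRAM_GB):
    """Convert a (low, high) dollar range into dollars per GB of VRAM."""
    return tuple(round(cost / vram_gb, 2) for cost in cost_range)


total = all_in_cost(CARD_USD, SHROUD_FAN_USD, ADAPTER_USD)
print("all-in USD (low, high):", total)
print("USD per GB (low, high):", usd_per_gb(total))
```

Swapping in a different card's price range and VRAM size gives a like-for-like $/GB comparison.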

What are the real specs (and the real caveats)?

The spec sheet is where the P40's age shows. The headline figures look fine; the footnotes are the story.

Spec	Tesla P40	What it means for local AI
Released	Sept 2016 (Pascal)	Predates Tensor Cores and Flash Attention
VRAM	24GB GDDR5	The whole reason to buy it
Memory bus / bandwidth	384-bit / ~346 GB/s	Decent, but ~2.7x below a 3090's ~936 GB/s
CUDA cores	3,840	No Tensor Cores at all
FP32 compute	~11.8 TFLOPS (~12)	Fine on paper
FP16 (half) compute	~0.18 TFLOPS (~1/64 of FP32)	FP16 is effectively unusable — avoid it
INT8 throughput	~47 TOPS	The card's intended strength (inference)
Tensor Cores	None	No FP16/BF16 acceleration, no Flash Attention
TDP (power)	250W	Needs an EPS-style 8-pin, not PCIe
Cooling	Passive (server airflow)	No fan — you add one
Video output	None	Pure compute card; needs a second GPU/iGPU for display

The single most important footnote: FP16 on the P40 runs at roughly 1/64 the rate of FP32 (about 183 GFLOPS half precision). On modern cards FP16/BF16 is the fast path; on the P40 it is a trap. The card also emulates BF16 through FP32, which is roughly 21% slower than native. In practice this means you run GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0) through llama.cpp / Ollama, which use INT8/INT4 paths the P40 handles well — and you simply never touch FP16 workflows. Within that lane the P40 is a perfectly capable inference card; step outside it and performance falls off a cliff.

What can a Tesla P40 run, and how slowly?

Because it has 24GB, the P40 can load the same model sizes a 3090 can. The difference is how fast tokens come out. The P40's ~346 GB/s of memory bandwidth is the main limiter for token generation, which is memory-bandwidth bound, so it generates meaningfully slower than a 3090 with ~936 GB/s. Community single-card reports generally land well below modern 24GB cards.
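
To make "bandwidth bound" concrete, here is a rough sketch of the usual rule of thumb: each generated token has to read roughly all of the model's weights once, so memory bandwidth divided by model size gives an upper bound on tok/s. The Q4 size per billion parameters is an assumption scaled from this guide's "70B at Q4 is roughly 39GB" figure, and the estimate ignores compute, KV cache reads, and software overhead, so real speeds will be lower.

```python
# Rough ceiling for bandwidth-bound token generation:
# every new token reads (roughly) all model weights once.
Q4_GB_PER_BILLION_PARAMS = 39 / 70  # assumption: scaled from the 70B ~39GB figure

BANDWIDTH_GB_S = {"Tesla P40": 346, "RTX 3090": 936}


def q4_size_gb(billions_of_params):
    """Approximate Q4 model size in GB for a given parameter count."""
    return billions_of_params * Q4_GB_PER_BILLION_PARAMS


def ceiling_tok_per_s(bandwidth_gb_s, model_gb):
    """Upper bound on tokens per second if weight reads were the only cost."""
    return bandwidth_gb_s / model_gb


for name, bandwidth in BANDWIDTH_GB_S.items():
    for size in (7, 13, 32):
        ceiling = ceiling_tok_per_s(bandwidth, q4_size_gb(size))
        print(f"{name}: {size}B Q4 ceiling ~{ceiling:.0f} tok/s")
```

The community numbers in this guide sit well below these ceilings on both cards, which is expected: the bound only says what bandwidth alone would allow.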

These are approximate, community-reported single-card Q4 generation speeds. Treat them as ballpark — exact numbers vary with quant, context length, llama.cpp build, and cooling. Anything precise enough to quote to two decimals on a card this old is usually a misconfigured benchmark:

Model (Q4 quant)	Fits in 24GB?	Approx. single-P40 tok/s	Feels like
7B-8B (Llama/Qwen/Mistral)	Yes, easily	~15-25	Usable, clearly slower than a 3090
13B-14B	Yes	~8-12	Readable, but you wait
32B-34B	Yes (tight)	~4-7	Patience required
70B	No (single card)	n/a on one P40	Needs 2+ cards; still slow

A 70B model at Q4 is roughly 39GB, so it does not fit on a single 24GB P40 — you would need two cards (48GB), and even then dense-70B generation across PCIe 3.0 P40s is slow enough that most people give up on it. The realistic sweet spot for one P40 is 7B-14B chat and coding, plus the option to load a 32B model when you do not mind waiting.
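
The fit check above is simple enough to sketch. This assumes Q4 size scales linearly from the 70B ~39GB figure, and it adds a hypothetical 2GB allowance for KV cache and buffers; the real overhead depends on context length and runtime, so treat the allowance as a placeholder.

```python
import math

VRAM_PER_CARD_GB = 24
Q4_GB_PER_BILLION_PARAMS = 39 / 70  # assumption: scaled from the 70B ~39GB figure
OVERHEAD_GB = 2.0                    # hypothetical allowance for KV cache and buffers


def cards_needed(billions_of_params, overhead_gb=OVERHEAD_GB):
    """Smallest number of 24GB cards whose combined VRAM holds weights plus overhead."""
    needed_gb = billions_of_params * Q4_GB_PER_BILLION_PARAMS + overhead_gb
    return math.ceil(needed_gb / VRAM_PER_CARD_GB)


for size in (7, 14, 34, 70):
    print(f"{size}B at Q4: {cards_needed(size)} x 24GB card(s)")
```

Splitting a model across cards is not perfectly efficient in practice, so a result of 2 means "two cards at minimum", not "two cards comfortably".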

The honest read: the P40 lets you run models a small card cannot, but it does not make them fast. If "it works at all, cheaply" is the goal, it delivers. If you want a snappy, real-time chat experience on bigger models, this is not the card.

Why is prompt processing so slow on the P40?

Token generation (one token at a time) is bandwidth-bound, and the P40's bandwidth, while dated, is acceptable. Prompt processing — ingesting your context before it starts answering — is different: it processes many tokens in parallel and leans on raw compute and on Flash Attention, a memory-efficient attention kernel.

Flash Attention, in its widely used FlashAttention-2 form, requires Ampere or newer (RTX 30-series and up), so the Pascal-based P40 cannot use it. When you feed the P40 a long prompt (a big document for RAG, a large codebase context, a long chat history), it has to grind through attention the slow way. The result is a noticeable delay before the first token appears, and that delay grows with context length far faster than it would on a 3090.

For short prompts this barely matters. For long-context work (document Q&A, agent loops with big system prompts, large code context), the missing Flash Attention is one of the P40's most painful real-world limitations and a major reason it feels sluggish beyond what the raw tok/s numbers suggest.
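
One way to see why long prompts hurt is to count attention scores. The sketch below uses textbook attention, where a prompt of n tokens produces n x n query-key scores per head, and shows how much memory those scores would take if a layer held them all at once. The head count and FP32 score size are assumptions for illustration, the causal mask is ignored, and real runtimes process prompts in batches, so this shows the growth pattern rather than what llama.cpp actually allocates.

```python
HEADS = 32            # hypothetical head count, for illustration only
BYTES_PER_SCORE = 4   # assumption: scores held in FP32


def attention_scores(prompt_tokens, heads=HEADS):
    """Number of query-key scores one layer computes for a full prompt."""
    return heads * prompt_tokens * prompt_tokens


def naive_score_matrix_mib(prompt_tokens, heads=HEADS):
    """Memory needed to hold all of one layer's scores at once, in MiB."""
    return attention_scores(prompt_tokens, heads) * BYTES_PER_SCORE / 2**20


for n in (512, 2048, 8192):
    print(f"{n} tokens: {naive_score_matrix_mib(n):,.0f} MiB of scores per layer")
```

Quadrupling the prompt length multiplies the score count by sixteen, which is the quadratic growth that memory-efficient attention kernels are designed to keep out of slow memory.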

What is the cooling, power and driver hassle?

The P40 is a server card, and it assumes a server chassis. Three things trip up first-time buyers:

Cooling. The P40 is passively cooled — it has a heatsink but no fan, because datacenter servers blast air through it. In a desktop case it will overheat and throttle (or worse) without help. You need a 3D-printed or off-the-shelf blower shroud plus a high-static-pressure fan (often a 40mm or 97mm blower), which is a ~$10-$30 add-on and a small assembly project.
Power. The P40 draws up to 250W through an EPS-style 8-pin connector (the kind used for CPU power), not a standard PCIe 8-pin. You need an EPS-to-dual-PCIe adapter (or a PSU CPU cable wired correctly). Plugging a regular PCIe cable into it is a known way to damage the card — verify the pinout before powering on.
Drivers and no display. It has no video output, so you need a second GPU or integrated graphics to drive your monitor. On Windows you install the NVIDIA datacenter/Tesla driver and may need to choose between the TCC and WDDM driver modes. On Linux it generally "just works" with the standard NVIDIA driver and CUDA, which is why most P40 builders run Linux.

None of this is hard, but it is real work and real research before the card is even usable. A modern gaming GPU you unbox and plug in; a P40 is a small project. If that sounds fun, the P40 rewards you with cheap 24GB. If it sounds like a headache, it is a fair reason to pay more for a 3090.
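
Once the card is installed, a quick check confirms that the driver sees it and that your cooling is coping. The sketch below wraps `nvidia-smi` with its `--query-gpu` CSV output; the helper names are hypothetical, and the parser assumes every field comes back numeric, which may not hold if the driver reports a value as unavailable.

```python
import subprocess

QUERY_FIELDS = "name,memory.total,temperature.gpu,power.draw"


def parse_gpu_line(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    name, mem_mib, temp_c, power_w = [part.strip() for part in line.split(",")]
    return {
        "name": name,
        "memory_mib": int(mem_mib),
        "temp_c": int(temp_c),
        "power_w": float(power_w),
    }


def read_gpus():
    """Run nvidia-smi and return one dict per detected GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `read_gpus()` in a loop while a model is generating is a simple way to watch whether the temperature keeps climbing under the shroud and fan you fitted.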

Should you buy a P40 or a used RTX 3090?

This is the real decision, because both are 24GB cards aimed at the same goal. The P40 wins on price-per-gigabyte; the 3090 wins on almost everything else.

Factor	Tesla P40	Used RTX 3090
VRAM	24GB	24GB
Used price (Jun 2026)	~$150-$300	~$850-$1,050
Memory bandwidth	~346 GB/s	~936 GB/s
LLM generation speed	Slow (~15-25 tok/s on 7B)	Fast (~95-110 tok/s on 7B)
Flash Attention	No (Pascal)	Yes (Ampere)
Usable FP16 / Tensor Cores	No	Yes
Prompt processing	Slow	Fast
Cooling / power	DIY shroud + EPS adapter	Plug-and-play
Image generation (SD/FLUX)	Poor	Strong
Best for	Cheapest possible 24GB	Best all-round 24GB value

The pattern is clear: a 3090 is roughly 4-5x faster at LLM generation, supports Flash Attention and FP16, handles image generation well, and installs like a normal GPU — but costs 3-5x more. If your budget can absorb a used 3090, it is the better card by a wide margin and the one we recommend for most people. The P40 makes sense specifically when the budget is tight, the goal is "load big models cheaply," and slow is acceptable.
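
The speed and price multiples can be checked against the table's own ranges. This sketch compares range midpoints, which is a simplifying assumption; the figures are this guide's approximate June 2026 numbers, and the names are made up for the example.

```python
P40 = {"price": (150, 300), "tok_s_7b": (15, 25)}
RTX3090 = {"price": (850, 1050), "tok_s_7b": (95, 110)}


def midpoint(value_range):
    """Middle of a (low, high) range."""
    return (value_range[0] + value_range[1]) / 2


def ratio_of_midpoints(a, b):
    """How many times larger range a's midpoint is than range b's."""
    return midpoint(a) / midpoint(b)


def usd_per_tok_s(card):
    """Dollars paid per token per second of 7B generation speed."""
    return midpoint(card["price"]) / midpoint(card["tok_s_7b"])


speedup = ratio_of_midpoints(RTX3090["tok_s_7b"], P40["tok_s_7b"])
price_ratio = ratio_of_midpoints(RTX3090["price"], P40["price"])
print(f"3090 vs P40: ~{speedup:.1f}x the speed for ~{price_ratio:.1f}x the price")
print(f"USD per tok/s: P40 ~{usd_per_tok_s(P40):.2f}, 3090 ~{usd_per_tok_s(RTX3090):.2f}")
```

On these midpoints the 3090 costs more in total but slightly less per unit of generation speed, which fits the split above: the P40 wins on dollars per gigabyte, not on dollars per token per second.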

Who is the Tesla P40 actually right for?

Buy a P40 if most of this is you:

You want 24GB for the absolute minimum money. Nothing else gets you here for ~$200.
You are okay with slow. You will read at a relaxed pace, not watch tokens stream instantly.
You enjoy the build. Shroud, fan, power adapter, Linux drivers — this is a feature, not a bug, for tinkerers.
Your workload is short-prompt chat/coding on 7B-14B models, not long-context RAG or image generation.
You want to experiment with bigger models (32B, or multi-card 70B) cheaply, and do not need them fast.

Skip the P40 and get a 3090 (or newer) if:

You value your time more than the ~$600 difference.
You do long-context RAG, agents, or coding with large context — the missing Flash Attention will hurt.
You generate images or video (Stable Diffusion, FLUX) — the P40 is poor here.
You want a card that just plugs in and works without a cooling project.

Key Takeaways

Best $/GB in local AI, full stop. A used P40 puts 24GB of VRAM in your machine for roughly $150-$300, plus ~$20-$45 for cooling and a power adapter.
Poor speed is the price you pay. Pascal architecture, no Tensor Cores, ~346 GB/s bandwidth: expect ~15-25 tok/s on a 7B, far behind a 3090's ~95-110.
No usable FP16, no Flash Attention. FP16 runs at ~1/64 of FP32, so you live in GGUF Q4/Q5/Q8 land — and long-context prompt processing is notably slow.
It is a project, not a plug-in. Passive server cooling needs an added shroud and fan, power is an EPS-style 8-pin (not PCIe), and there is no video output, so you need a second GPU for display.
3090 if you can afford it. A used RTX 3090 is the same 24GB but ~4-5x faster and hassle-free for ~3-5x the price. The P40 is the value pick only when budget is the hard constraint and slow is acceptable.

Next Steps

Compare the obvious step up: RTX 3090 for local AI — the best all-round 24GB value, with verified tok/s and pricing.
See where the P40 sits in the full lineup in Best GPUs for Local AI, from budget cards up to the 5090.
Want a genuinely large-model build instead of one slow card? Read the cheapest 70B build: dual 3090 vs 5090.
Not sure which card fits your models and budget? Use our Which GPU to buy interactive picker.

For the original specs straight from the source, NVIDIA's Tesla P40 datasheet (PDF) lists the official numbers, and the open-source llama.cpp project is the easiest way to run GGUF quants on a P40 and benchmark it yourself with consistent settings.
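
For benchmarking with consistent settings, one option is to build the same `llama-bench` command every time instead of retyping it. The sketch below only assembles the command; the model path is hypothetical, the default values are arbitrary choices for illustration, and it assumes a llama.cpp build whose `llama-bench` binary is on your PATH.

```python
import shlex


def llama_bench_command(model_path, prompt_tokens=512, gen_tokens=128,
                        gpu_layers=99, repetitions=5):
    """Build a llama-bench invocation with fixed settings for repeatable runs."""
    return [
        "llama-bench",
        "-m", model_path,           # GGUF model file
        "-p", str(prompt_tokens),   # prompt-processing test length
        "-n", str(gen_tokens),      # token-generation test length
        "-ngl", str(gpu_layers),    # layers to offload to the GPU
        "-r", str(repetitions),     # repetitions per test
    ]


cmd = llama_bench_command("models/example-7b-q4_k_m.gguf")  # hypothetical path
print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd)` to execute it, and keep the prompt and generation lengths fixed when comparing quants or cards so the numbers stay comparable.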

Tesla P40 for Local LLMs (2026): 24GB for ~$200, Worth It?

Written by the Local AI Master Team
