Self-Hosted AI Cost Calculator: The TCO Most Spreadsheets Get Wrong
Published April 23, 2026 - 24 min read
When a CFO asks "how much does self-hosted AI actually cost?", the typical answer is "well, the GPU was $1,600 and electricity is cheap." That answer is missing four cost categories, three risk lines, and at least one comma. Real total cost of ownership for a production AI box is closer to four times the sticker price of the card you put in it - and yet, for the right token volumes, self-hosted still wins by a 5x margin against cloud APIs.
This guide is the calculator I use when a customer pings me at 11pm asking "should we keep paying OpenAI or buy a 4090?" It uses real component prices, real power numbers measured at the wall with a Kill-A-Watt meter, and real cloud API rates as of April 2026. Every assumption is exposed so you can swap your own numbers in.
Quick Start: The 60-Second TCO Estimate
If you only need a back-of-the-envelope:
Annual TCO = Hardware/3 + Power_kW * Rate_per_kWh * 8760 * Utilization + 0.15 * Hardware
For an RTX 4090 box ($2,800 total system, 480W avg under load, $0.14/kWh, 35% utilization):
= $2,800/3 + 0.480 * 0.14 * 8760 * 0.35 + 0.15 * $2,800
= $933 + $206 + $420
= $1,559/year
If you push 600 million tokens/year through that box (achievable with a Llama 3.1 70B Q4 model running 18 hours a day at 30 tok/s with two parallel slots), your unit cost is $0.0026 per 1K tokens ($2.60/M) - vs $5.00/M (input) + $15/M (output) for GPT-4o, or roughly $11.50/M blended. Self-hosted is roughly 4-5x cheaper at scale.
The catch is "at scale." Below 100M tokens/year, cloud usually wins. The rest of this guide shows you exactly where the crossover happens for your workload.
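If you'd rather sanity-check this in code than on paper, here's a minimal Python sketch of the same quick-start formula. The function name and defaults are mine, not a published tool - swap in your own numbers:

```python
def quick_tco(hardware_usd: float, avg_kw: float, rate_kwh: float,
              utilization: float, amort_years: float = 3.0,
              ops_risk_factor: float = 0.15) -> float:
    """Annual TCO: straight-line amortization + duty-cycled power
    + a flat 15% of hardware for ops and risk."""
    amortization = hardware_usd / amort_years
    power = avg_kw * rate_kwh * 8760 * utilization
    ops_risk = ops_risk_factor * hardware_usd
    return amortization + power + ops_risk

tco = quick_tco(hardware_usd=2800, avg_kw=0.480, rate_kwh=0.14, utilization=0.35)
print(f"${tco:,.0f}/year")                                       # ~$1,559/year
print(f"${tco / 600e6 * 1e6:.2f} per 1M tokens at 600M tok/yr")  # ~$2.60/M
```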
Table of Contents
- The Five Cost Categories Everyone Forgets
- Hardware Tiers and Real Prices
- Power, Cooling, and Rack Space
- Throughput Benchmarks That Drive Unit Economics
- The Self-Hosted vs Cloud Crossover
- Worked Examples: 4 Real Workloads
- The Calculator Formula
- Hidden Risks and Insurance
- Pitfalls in Naive TCO Models
The Five Cost Categories Everyone Forgets {#cost-categories}
Most blog posts compare a $1,600 GPU sticker price to OpenAI's API rate and declare victory. That's not TCO. Real TCO includes:
| Category | Typical Annual Cost (single 4090 box) | Why It Matters |
|---|---|---|
| Hardware amortization | $930 | 3-year straight line on a $2,800 system |
| Power | $200-500 | Continuous if always-on; varies 5x with rate |
| Cooling and rack | $80-300 | Often forgotten in home or small office |
| Ops and maintenance | $400-1,200 | The #1 hidden cost |
| Risk and insurance | $100-300 | GPU theft, lightning, drive failure |
| Total | $1,710-3,230 | What you should actually budget |
The ops cost is where most calculations fall apart. A self-hosted Ollama box does not run itself: someone updates the binary, rotates keys, monitors logs, and fixes things the night a model file gets corrupted by an unclean shutdown. The table's $400-1,200 range assumes a mostly hands-off single box; if you actually need 4 hours/month of skilled ops time at $80/hr fully loaded, that's $3,840/year - more than the hardware. For home users this is "your weekend hours, which are free." For businesses it's a real line item.
Hardware Tiers and Real Prices {#hardware-tiers}
April 2026 retail pricing (US, including taxes for new components, eBay average for used):
Entry Tier - Single User, 7B-13B Models
| Component | Price |
|---|---|
| RTX 3060 12GB (used) | $220 |
| Used Dell OptiPlex 7080 (i7, 32GB RAM, 1TB NVMe) | $280 |
| PCIe riser cable | $25 |
| 850W PSU | $90 |
| Total | $615 |
Power: 240W total under load, 60W idle. Tokens/sec on Llama 3.1 8B Q4: ~45.
Mid Tier - Team of 5-10, 70B Q4 Models
| Component | Price |
|---|---|
| RTX 4090 (new) | $1,599 |
| Workstation chassis | $700 |
| 1500W PSU | $260 |
| 128GB DDR5 ECC | $480 |
| 2TB NVMe Gen4 | $140 |
| Threadripper 7960X | $1,500 |
| Total | $4,679 |
Power: 480W avg, 120W idle. Tokens/sec on Llama 3.1 70B Q4: 28-35 with two parallel slots.
Pro Tier - Multi-User, 70B-405B Models
| Component | Price |
|---|---|
| 2x RTX 4090 (new) | $3,200 |
| Server chassis (4U) | $900 |
| 2000W redundant PSU | $600 |
| 256GB DDR5 ECC | $920 |
| 4TB NVMe Gen4 | $260 |
| Threadripper PRO 7975WX | $4,000 |
| Total | $9,880 |

Note: NVLink is not supported on the 4090 - use tensor parallelism (TP) via vLLM to split large models across both cards.
Power: 920W avg under load, 200W idle. Tokens/sec on Llama 3.1 70B Q5: 55-70.
Enterprise Tier - Real Production
| Component | Price |
|---|---|
| 8x H100 SXM5 server | $240,000+ (often leased) |
| Or: 8x L40S build | $80,000-$110,000 |
| Rack, networking, UPS | $15,000 |
| Total CapEx | $95,000-$255,000 |
For most readers, mid tier is the answer. Used GPUs change the math significantly - see our used GPU AI buying guide for the trade-offs of dropping to a 3090 instead of a 4090.
Power, Cooling, and Rack Space {#power-cooling}
I've measured every box I run with a Kill-A-Watt P4400 plugged directly between the system and the wall. Numbers below are wall-side, not GPU TDP.
| System | Idle (W) | Inference avg (W) | Inference peak (W) |
|---|---|---|---|
| RTX 3060 box | 55 | 240 | 310 |
| RTX 4090 box (single GPU) | 120 | 480 | 720 |
| 2x RTX 4090 box | 200 | 920 | 1,380 |
| 8x L40S server | 850 | 4,200 | 5,800 |
| 8x H100 SXM5 | 1,400 | 6,800 | 8,500 |
US residential power averages $0.16/kWh, commercial $0.12/kWh, data center colo $0.08-0.10/kWh fully bundled. Rates vary 4x by state. EIA monthly retail rates are the canonical source.
Cooling adds 25-40% to the GPU power bill in summer if the room isn't already air-conditioned. A single 4090 puts out about 1,640 BTU/hr - enough to noticeably warm a 100 sq ft room.
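To translate wall watts into heat load, a tiny sketch - the 3.412 BTU/hr-per-watt conversion is exact; the 30% cooling adder below is my midpoint of the 25-40% range above:

```python
def heat_btu_per_hr(avg_watts: float) -> float:
    return avg_watts * 3.412  # 1 W = 3.412 BTU/hr

def cooling_adder_usd(annual_power_cost_usd: float, adder: float = 0.30) -> float:
    return annual_power_cost_usd * adder  # midpoint of the 25-40% range

print(f"{heat_btu_per_hr(480):,.0f} BTU/hr")  # ~1,638 for a 4090 under load
```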
Quick formula:
Annual_Power_Cost = Avg_Watts * Hours_Per_Year * Rate_Per_kWh / 1000
For a 4090 box at 480W under load and 120W idle, on 24/7 with 70% of hours under load (driven by user requests):
= (0.7 * 480 + 0.3 * 120) * 8760 * 0.16 / 1000
= 372 * 8760 * 0.16 / 1000
= $521/year
This is one of the most miscalculated lines in TCO spreadsheets. People assume 24/7 max-TDP draw and overshoot 3x, or they assume idle and undershoot 3x.
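Here's a minimal duty-cycle-aware version of the power formula in Python - function and parameter names are mine; feed it measured wall watts, not GPU TDP:

```python
def annual_power_cost(load_w: float, idle_w: float,
                      load_fraction: float, rate_kwh: float) -> float:
    """Annual electricity cost for an always-on box that spends
    `load_fraction` of hours at load wattage and the rest at idle."""
    avg_w = load_fraction * load_w + (1 - load_fraction) * idle_w
    return avg_w * 8760 * rate_kwh / 1000

# Single-4090 box: 480W under load, 120W idle, 70% load hours, $0.16/kWh
print(f"${annual_power_cost(480, 120, 0.70, 0.16):,.0f}/year")  # ~$521
```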
Throughput Benchmarks That Drive Unit Economics {#throughput}
Tokens per dollar is the only metric that matters in the end. Real measured throughput on Ollama 0.6 with default settings (vLLM where noted in the table):
| Hardware | Model | Quant | Tok/sec (single) | Tok/sec (4 parallel) |
|---|---|---|---|---|
| RTX 3060 12GB | Llama 3.1 8B | Q4_K_M | 48 | 110 |
| RTX 4090 24GB | Llama 3.1 8B | Q4_K_M | 165 | 410 |
| RTX 4090 24GB | Llama 3.1 70B | Q4_K_M | 32 | 78 |
| 2x RTX 4090 (vLLM TP) | Llama 3.1 70B | FP16 | 60 | 220 |
| 8x L40S (vLLM) | Llama 3.1 70B | FP16 | 290 | 1,800 |
| 8x H100 SXM5 (vLLM) | Llama 3.1 70B | FP16 | 480 | 3,400 |
| 8x H100 SXM5 (vLLM) | Llama 3.1 405B | FP8 | 95 | 580 |
Annual throughput at realistic utilization (an 8-hour daily peak window, 40% busy within it):
Tokens/year = Tok/sec * 8 hr * 3600 * 365 * 0.40
For a 4090 doing 78 tok/sec on Llama 70B Q4 with parallelism: ~328M tokens/year. At our $1,710 mid-case TCO, unit cost is $5.21 per 1M tokens - well below GPT-4o-class API pricing, though not below the cheapest managed Llama hosts (more on that in the next section).
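The same throughput-to-unit-cost arithmetic as a small Python sketch, using this article's utilization model (names and defaults are illustrative):

```python
def tokens_per_year(tok_per_sec: float, peak_hours: float = 8,
                    busy_fraction: float = 0.40) -> float:
    """8-hour daily peak window, 40% busy within it."""
    return tok_per_sec * peak_hours * 3600 * 365 * busy_fraction

def cost_per_million(annual_tco_usd: float, tokens_year: float) -> float:
    return annual_tco_usd / tokens_year * 1e6

tokens = tokens_per_year(78)               # 4090, 70B Q4, 4 parallel slots
print(f"{tokens/1e6:.0f}M tokens/year")    # ~328M
print(f"${cost_per_million(1710, tokens):.2f}/M tokens")  # ~$5.21
```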
For more on choosing the right size, see our Ollama RAM/VRAM model master table.
The Self-Hosted vs Cloud Crossover {#crossover}
Cloud API list prices (April 2026, blended in/out at 30/70):
| Provider | Model | Blended $/M tokens |
|---|---|---|
| OpenAI | GPT-4o | $11.50 |
| OpenAI | GPT-4o mini | $0.45 |
| Anthropic | Claude Sonnet 4.5 | $9.00 |
| Anthropic | Claude Haiku 4.5 | $1.20 |
| AWS Bedrock | Llama 3.1 70B | $0.99 |
| Together AI | Llama 3.1 70B | $0.88 |
| Self-hosted | Llama 3.1 70B Q4 | $5.21 (mid case) |
| Self-hosted | Llama 3.1 70B Q4 | $1.80 (high util) |
Naive read: managed Llama hosting on Bedrock or Together is cheaper than your home 4090. That's true at low volume. The math flips when:
- You need data sovereignty (no third party touches the prompt)
- You're paying enterprise effective rates, not public list (committed enterprise Bedrock deployments, with private endpoints and support tiers, can run up to 5x list)
- Your workload spikes are bursty and you'd otherwise pay reserved capacity 24/7
- You need fine-tuned weights kept private
- Latency matters and you can colocate inference next to your app
Crossover for the 4090 mid-tier vs Together's $0.88/M:
Break_even_volume = Annual_TCO / Cloud_rate_per_token
= $1,710 / ($0.88 / 1M)
= 1.94B tokens/year
That's 5.3M tokens/day, achievable in a small product but not on a single user's chat. Hence the rule: if you're under 100M tokens/year, cloud usually wins on raw cost. Above 1B tokens/year, self-hosted typically wins by 3-5x. In between, it's a strategy question, not a math question.
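A minimal sketch of the crossover math against the list rates in the table above (function name is mine):

```python
def break_even_tokens(annual_tco_usd: float, cloud_usd_per_million: float) -> float:
    """Annual token volume at which self-hosted TCO equals the cloud bill."""
    return annual_tco_usd / cloud_usd_per_million * 1e6

for provider, rate in [("Together Llama 70B", 0.88),
                       ("Bedrock Llama 70B", 0.99),
                       ("GPT-4o blended", 11.50)]:
    vol = break_even_tokens(1710, rate)
    print(f"{provider:20s} {vol/1e9:5.2f}B tokens/year ({vol/365/1e6:.1f}M/day)")
# Together: ~1.94B/year (~5.3M/day); GPT-4o: ~0.15B/year (~0.4M/day)
```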
Worked Examples: 4 Real Workloads {#worked-examples}
Workload A: Solo developer, code completion via Continue.dev
- 200K tokens/day, 5 days/week = 52M tokens/year
- Best fit: existing M-series Mac running Llama 3.1 8B locally
- Cloud equivalent: Claude Haiku 4.5 = ~$62/year
- Self-hosted: $0 incremental (uses laptop you already own)
- Verdict: self-hosted wins (laptop is sunk cost)
Workload B: 25-person legal team, contract review
- 8M tokens/user/year * 25 = 200M tokens/year
- Privacy mandatory (HIPAA-adjacent data handling)
- Best fit: 4090 mid-tier box with auth and audit logging
- Cloud equivalent (Bedrock 70B with PrivateLink): ~$2,200/year
- Self-hosted TCO: ~$2,400/year in hardware amortization, power, and risk, plus a few hours/month of ops absorbed by existing IT staff (at $80/hr that line grows fast if it isn't absorbed)
- Verdict: rough parity on hard costs; self-hosted wins on data sovereignty and per-user log isolation
Workload C: 50,000 MAU SaaS app, light AI assist
- ~3B tokens/year
- Latency-sensitive (chat UX)
- Best fit: 2x 4090 colo'd box + cloud burst overflow
- Cloud-only equivalent (GPT-4o mini): $1,350/year, but with vendor lock and rate limits
- Self-hosted TCO: $5,800/year, supports up to 5B tokens
- Verdict: hybrid - self-host base load, burst to cloud for peaks. ~$3,200/year combined
Workload D: Bank with 200 internal users, 70B-class assistant
- ~12B tokens/year, regulated data
- Best fit: on-prem 8x L40S server, two for HA
- Cloud-equivalent (Azure OpenAI Llama 70B with private endpoint): $11,000/year + compliance overhead
- Self-hosted TCO: ~$32,000/year per server in hardware amortization and power, plus 0.5 FTE of ops on top
- Verdict: self-hosted wins on compliance, audit, and predictable spend, even though raw $/token looks higher
For a deep dive on the specific compliance side, see our SOC 2 self-hosted AI guide and the GDPR-compliant local AI guide.
The Calculator Formula {#calculator}
A complete TCO formula you can drop into a spreadsheet:
TCO_annual =
(Hardware_total / Amortization_years) # capex
+ (Avg_W * 8760 * Util * Rate_kWh / 1000) # power
+ (0.30 * Power_cost) # cooling
+ (Rack_monthly * 12 if colo else 0) # rack
+ (Ops_hours_year * Hourly_rate) # ops
+ (0.05 * Hardware_total) # insurance/risk
+ (Internet_monthly * 12) # bandwidth (if dedicated)
Cost_per_1k_tokens = TCO_annual / (Tokens_per_year / 1000)
For sensible defaults: Amortization_years = 3, Util = 0.35 (single-user) to 0.7 (team), Hourly_rate = $80, Rate_kWh = $0.14.
I keep this in a Google Sheet at the office; the same logic powers our (upcoming) interactive calculator on this site. Plug in your hardware list, power rate, and expected token volume, and you'll see exactly where you sit on the crossover curve.
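If you'd rather script it than spreadsheet it, here's the same logic as a Python function. Every default mirrors the "sensible defaults" above; the function names and the worked example at the bottom are mine, and all inputs are assumptions you should override with measured numbers:

```python
def tco_annual(hardware_total: float, avg_w: float, util: float,
               rate_kwh: float = 0.14, amort_years: float = 3,
               cooling_factor: float = 0.30, rack_monthly: float = 0,
               ops_hours_year: float = 48, hourly_rate: float = 80,
               risk_factor: float = 0.05, internet_monthly: float = 0) -> float:
    power = avg_w * 8760 * util * rate_kwh / 1000
    return (hardware_total / amort_years      # capex
            + power                           # power
            + cooling_factor * power          # cooling
            + rack_monthly * 12               # rack (0 if not colo)
            + ops_hours_year * hourly_rate    # ops
            + risk_factor * hardware_total    # insurance/risk
            + internet_monthly * 12)          # bandwidth (if dedicated)

def cost_per_1k_tokens(tco: float, tokens_per_year: float) -> float:
    return tco / (tokens_per_year / 1000)

# Mid-tier 4090 build, team-level utilization, 4 hr/month paid ops
tco = tco_annual(hardware_total=4679, avg_w=480, util=0.7)
print(f"${tco:,.0f}/year, ${cost_per_1k_tokens(tco, 328e6):.4f}/1K tokens")
```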
Hidden Risks and Insurance {#hidden-risks}
What naive TCO models ignore:
1. Drive failure - A failed NVMe takes models with it. Recovery: $30 disk + 4 hr re-pull at saturated bandwidth. Budget $50/year amortized.
2. PSU surge - A bad PSU can take a $1,600 GPU with it. Whole-home surge protection is $200 one-time and prevents this. Renter? Get a UPS.
3. Theft - A 4090 is now stolen-bike-tier loot. Renter's insurance covers it but ask explicitly about computer hardware riders.
4. Driver regressions - NVIDIA driver bugs have broken production inference for weeks at a time. Pin driver versions and keep the previous known-good one staged.
5. Cooling failure - A summer AC outage can throttle or kill a 4090 in an hour. Ambient temp monitoring with PagerDuty alerts is cheap insurance.
6. Compliance audit findings - Self-hosting shifts compliance burden from vendor to you. SOC 2 readiness is roughly $25K first year, $10K/year recurring for a small team. Often this is offset by removing a vendor's BAA cost - but plan for it explicitly.
7. Talent risk - If only one person knows how to restart Ollama after a kernel upgrade, the bus factor is 1. Document everything; pair on ops.
Pitfalls in Naive TCO Models {#pitfalls}
Mistakes I see in cost decks that get people fired:
- Comparing GPU sticker price to API price - Ignores power, ops, risk. Off by 3-4x.
- Assuming 100% utilization - Real utilization on internal tools is 5-15% of available hours.
- Ignoring ops cost when self-hosting at home - Hobbyist labor is "free" until you burn out.
- Using GPU TDP instead of measured wall power - Off by 1.3x easily; PSU efficiency varies.
- Forgetting bandwidth costs - First-time pulls of large models can be 100GB+. Comcast caps and overage fees apply.
- Ignoring cloud commit discounts - Enterprise OpenAI/Anthropic deals are routinely 20-50% off list. Get the actual price.
- Not modeling embedding costs separately - RAG pipelines have different unit economics; embeddings are usually 100x cheaper than generation but high volume.
- Pretending fine-tuning is "free" on cloud - Cloud fine-tune costs are often 100x inference and frequently the surprise that flips the decision.
Frequently Asked Questions
When does self-hosted AI become cheaper than OpenAI?
Roughly when you push more than 100-150M tokens/year of Llama 3.1 70B-class workload, assuming the mid-tier 4090 build at $1,710 annual TCO. Below that, cloud almost always wins on raw $/token, even before factoring in convenience. Above 1B tokens/year, self-hosted is typically 3-5x cheaper.
Is electricity really a meaningful cost?
For a single 4090 box running 70% of the time at $0.14/kWh, you're looking at roughly $500/year. That's modest in a corporate context but the second-largest line for home users after hardware. In high-rate states like California ($0.32/kWh) or Hawaii ($0.42/kWh), power can equal the GPU amortization.
How do I calculate cost-per-token for my Ollama setup?
Run `ollama run llama3.1:70b "Generate 200 tokens about anything"` while watching `nvidia-smi dmon`. Note the average watts and tokens/sec. Divide the hourly power cost (watts/1000 times your $/kWh rate) by tokens generated per hour to get the power-only $/1K tokens. Then layer in amortized hardware and ops time using the formula above.
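As a concrete sketch of that conversion (an illustrative helper, not an Ollama API):

```python
def power_cost_per_1k_tokens(watts: float, rate_kwh: float,
                             tok_per_sec: float) -> float:
    """Power-only cost per 1K generated tokens from wall watts and tok/s."""
    usd_per_hour = watts / 1000 * rate_kwh
    tokens_per_hour = tok_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1000

# 480W at the wall, $0.14/kWh, 32 tok/s on 70B Q4
print(f"${power_cost_per_1k_tokens(480, 0.14, 32):.5f}/1K tokens")  # ~$0.00058
```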
Should I buy used or new GPUs for AI?
Used 3090s at $700-900 are the highest tokens/dollar value in 2026 if you can find clean ones. Used 4090s carry a meaningful failure-rate premium because they were heavily used for Stable Diffusion. New 4090s ship clean and have warranty. Our used GPU AI buying guide has the field-tested checklist.
Does running AI 24/7 affect GPU lifespan?
Continuous inference is gentler on GPUs than gaming because clocks and temperatures stay flat instead of spiking. VRAM is DRAM and does not wear out from writes the way flash does; the bigger wear items are fans, thermal paste, and the NVMe drive absorbing constant model swaps. In practice, GPUs running steady inference 24/7 often outlast desktop gaming cards.
What about Apple Silicon for self-hosted AI?
M2 Ultra Mac Studios punch well above their watt-class for inference. A 192GB M2 Ultra can run Llama 3.1 70B at ~16 tok/sec while drawing 100W total - roughly a fifth the power of a 4090 box under load. They lose on raw throughput but win on tokens-per-watt and silence. Great for small teams.
How do I budget for ops time on self-hosted AI?
Plan 4-8 hours/month for a single-host deployment, 16-32 hours/month for multi-host or HA. At fully loaded $80/hr, that's $3,840-$30,720/year. This is the line where many small teams find self-hosted "wins" only if the engineer was doing ops anyway.
Are there hidden cloud savings I'm missing in the comparison?
Two big ones: (1) provisioned throughput discounts (Bedrock, Anthropic, Azure) can drop list prices 30-50%, but you commit to capacity 24/7 even when idle. (2) Caching - if 30% of your prompts are repeats, OpenAI's prompt caching slashes input costs by 50%. Self-hosted inherently caches and reuses KV state, but cloud caching is now competitive for text-heavy workloads.
Bottom Line
Self-hosted AI is not "free." It's "predictable." The CFO question isn't "what's the unit cost today" - it's "where do I want my marginal-cost curve to slope?" Cloud APIs give you a flat curve at a higher base. Self-hosted gives you a steep capex up front, near-zero marginal cost after that, and full control over the data path.
Run the numbers with your real volumes, real power rate, and real ops capacity. The crossover for most production AI workloads sits between 200M and 2B tokens/year. Below that, write a check to OpenAI and sleep well. Above that, the spreadsheet starts shouting at you to buy a 4090.
And whichever side you land on, please use measured numbers - not vendor marketing decks - to make the call.