Self-Hosted AI Cost Calculator: The TCO Most Spreadsheets Get Wrong
Published April 23, 2026 - 24 min read
When a CFO asks "how much does self-hosted AI actually cost?", the typical answer is "well, the GPU was $1,600 and electricity is cheap." That answer is missing four cost categories, three risk lines, and at least one comma. Real total cost of ownership for a production AI box is closer to four times the sticker price of the card you put in it - and yet, for the right token volumes, self-hosted still wins by a 5x margin against cloud APIs.
This guide is the calculator I use when a customer pings me at 11pm asking "should we keep paying OpenAI or buy a 4090?" It uses real component prices, real power numbers measured at the wall with a Kill-A-Watt meter, and real cloud API rates as of April 2026. Every assumption is exposed so you can swap your own numbers in.
Quick Start: The 60-Second TCO Estimate
If you only need a back-of-the-envelope:
Annual TCO = Hardware/3 + Power_kW * Rate_per_kWh * 8760 * Utilization + 0.15 * Hardware
For an RTX 4090 box ($2,800 total system, 480W avg under load, $0.14/kWh, 35% utilization):
= $2,800/3 + 0.480 * 0.14 * 8760 * 0.35 + 0.15 * $2,800
= $933 + $206 + $420
= $1,559/year
If you push 600 million tokens/year through that box (achievable with a Llama 3.1 70B Q4 model running 18 hours a day at 30 tok/s with two parallel slots), your unit cost is $0.0026 per 1K tokens ($2.60/M) - vs $5.00/M (input) + $15/M (output) for GPT-4o, or roughly $11.50/M blended. Self-hosted is roughly 4-5x cheaper at scale.
The catch is "at scale." Below 100M tokens/year, cloud usually wins. The rest of this guide shows you exactly where the crossover happens for your workload.
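If you'd rather sanity-check this in code than on paper, here's a minimal Python sketch of the same quick-start formula. The function name and defaults are mine, not a published tool - swap in your own numbers:

```python
def quick_tco(hardware_usd: float, avg_kw: float, rate_kwh: float,
              utilization: float, amort_years: float = 3.0,
              ops_risk_factor: float = 0.15) -> float:
    """Annual TCO: straight-line amortization + duty-cycled power
    + a flat 15% of hardware for ops and risk."""
    amortization = hardware_usd / amort_years
    power = avg_kw * rate_kwh * 8760 * utilization
    ops_risk = ops_risk_factor * hardware_usd
    return amortization + power + ops_risk

tco = quick_tco(hardware_usd=2800, avg_kw=0.480, rate_kwh=0.14, utilization=0.35)
print(f"${tco:,.0f}/year")                                       # ~$1,559/year
print(f"${tco / 600e6 * 1e6:.2f} per 1M tokens at 600M tok/yr")  # ~$2.60/M
```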
Table of Contents
- The Five Cost Categories Everyone Forgets
- Hardware Tiers and Real Prices
- Power, Cooling, and Rack Space
- Throughput Benchmarks That Drive Unit Economics
- The Self-Hosted vs Cloud Crossover
- Worked Examples: 4 Real Workloads
- The Calculator Formula
- Hidden Risks and Insurance
- Pitfalls in Naive TCO Models
The Five Cost Categories Everyone Forgets {#cost-categories}
Most blog posts compare a $1,600 GPU sticker price to OpenAI's API rate and declare victory. That's not TCO. Real TCO includes:
| Category | Typical Annual Cost (single 4090 box) | Why It Matters |
|---|---|---|
| Hardware amortization | $930 | 3-year straight line on a $2,800 system |
| Power | $200-500 | Continuous if always-on; varies 5x with rate |
| Cooling and rack | $80-300 | Often forgotten in home or small office |
| Ops and maintenance | $400-1,200 | The #1 hidden cost |
| Risk and insurance | $100-300 | GPU theft, lightning, drive failure |
| Total | $1,710-3,230 | What you should actually budget |
The ops cost is where most calculations fall apart. A self-hosted Ollama box does not run itself: someone updates the binary, rotates keys, monitors logs, and fixes things the night a model file gets corrupted by an unclean shutdown. The table's $400-1,200 range assumes a mostly hands-off single box; if you actually need 4 hours/month of skilled ops time at $80/hr fully loaded, that's $3,840/year - more than the hardware. For home users this is "your weekend hours, which are free." For businesses it's a real line item.
Hardware Tiers and Real Prices {#hardware-tiers}
April 2026 retail pricing (US, including taxes for new components, eBay average for used):
Entry Tier - Single User, 7B-13B Models
| Component | Price |
|---|---|
| RTX 3060 12GB (used) | $220 |
| Used Dell OptiPlex 7080 (i7, 32GB RAM, 1TB NVMe) | $280 |
| PCIe riser cable | $25 |
| 850W PSU | $90 |
| Total | $615 |
Power: 240W total under load, 60W idle. Tokens/sec on Llama 3.1 8B Q4: ~45.
Mid Tier - Team of 5-10, 70B Q4 Models
| Component | Price |
|---|---|
| RTX 4090 (new) | $1,599 |
| Workstation chassis | $700 |
| 1500W PSU | $260 |
| 128GB DDR5 ECC | $480 |
| 2TB NVMe Gen4 | $140 |
| Threadripper 7960X | $1,500 |
| Total | $4,679 |
Power: 480W avg, 120W idle. Tokens/sec on Llama 3.1 70B Q4: 28-35 with two parallel slots.
Pro Tier - Multi-User, 70B-405B Models
| Component | Price |
|---|---|
| 2x RTX 4090 (new) | $3,200 |
| Server chassis (4U) | $900 |
| 2000W redundant PSU | $600 |
| 256GB DDR5 ECC | $920 |
| 4TB NVMe Gen4 | $260 |
| Threadripper PRO 7975WX | $4,000 |
| Total | $9,880 |

Note: NVLink is not supported on the 4090 - use tensor parallelism (TP) via vLLM to split large models across both cards.
Power: 920W avg under load, 200W idle. Tokens/sec on Llama 3.1 70B Q5: 55-70.
Enterprise Tier - Real Production
| Component | Price |
|---|---|
| 8x H100 SXM5 server | $240,000+ (often leased) |
| Or: 8x L40S build | $80,000-$110,000 |
| Rack, networking, UPS | $15,000 |
| Total CapEx | $95,000-$255,000 |
For most readers, mid tier is the answer. Used GPUs change the math significantly - see our used GPU AI buying guide for the trade-offs of dropping to a 3090 instead of a 4090.
Power, Cooling, and Rack Space {#power-cooling}
I've measured every box I run with a Kill-A-Watt P4400 plugged directly between the system and the wall. Numbers below are wall-side, not GPU TDP.
| System | Idle (W) | Inference avg (W) | Inference peak (W) |
|---|---|---|---|
| RTX 3060 box | 55 | 240 | 310 |
| RTX 4090 box (single GPU) | 120 | 480 | 720 |
| 2x RTX 4090 box | 200 | 920 | 1,380 |
| 8x L40S server | 850 | 4,200 | 5,800 |
| 8x H100 SXM5 | 1,400 | 6,800 | 8,500 |
US residential power averages $0.16/kWh, commercial $0.12/kWh, data center colo $0.08-0.10/kWh fully bundled. Rates vary 4x by state. EIA monthly retail rates are the canonical source.
Cooling adds 25-40% to the GPU power bill in summer if the room isn't already air-conditioned. A single 4090 puts out about 1,640 BTU/hr - enough to noticeably warm a 100 sq ft room.
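To translate wall watts into heat load, a tiny sketch - the 3.412 BTU/hr-per-watt conversion is exact; the 30% cooling adder below is my midpoint of the 25-40% range above:

```python
def heat_btu_per_hr(avg_watts: float) -> float:
    return avg_watts * 3.412  # 1 W = 3.412 BTU/hr

def cooling_adder_usd(annual_power_cost_usd: float, adder: float = 0.30) -> float:
    return annual_power_cost_usd * adder  # midpoint of the 25-40% range

print(f"{heat_btu_per_hr(480):,.0f} BTU/hr")  # ~1,638 for a 4090 under load
```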
Quick formula:
Annual_Power_Cost = Avg_Watts * Hours_Per_Year * Rate_Per_kWh / 1000
For a 4090 box at 480W under load and 120W idle, on 24/7 with 70% of hours under load (driven by user requests):
= (0.7 * 480 + 0.3 * 120) * 8760 * 0.16 / 1000
= 372 * 8760 * 0.16 / 1000
= $521/year
This is one of the most miscalculated lines in TCO spreadsheets. People assume 24/7 max-TDP draw and overshoot 3x, or they assume idle and undershoot 3x.
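Here's a minimal duty-cycle-aware version of the power formula in Python - function and parameter names are mine; feed it measured wall watts, not GPU TDP:

```python
def annual_power_cost(load_w: float, idle_w: float,
                      load_fraction: float, rate_kwh: float) -> float:
    """Annual electricity cost for an always-on box that spends
    `load_fraction` of hours at load wattage and the rest at idle."""
    avg_w = load_fraction * load_w + (1 - load_fraction) * idle_w
    return avg_w * 8760 * rate_kwh / 1000

# Single-4090 box: 480W under load, 120W idle, 70% load hours, $0.16/kWh
print(f"${annual_power_cost(480, 120, 0.70, 0.16):,.0f}/year")  # ~$521
```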
Throughput Benchmarks That Drive Unit Economics {#throughput}
Tokens per dollar is the only metric that matters in the end. Real measured throughput on Ollama 0.6 with default settings (vLLM where noted in the table):
| Hardware | Model | Quant | Tok/sec (single) | Tok/sec (4 parallel) |
|---|---|---|---|---|
| RTX 3060 12GB | Llama 3.1 8B | Q4_K_M | 48 | 110 |
| RTX 4090 24GB | Llama 3.1 8B | Q4_K_M | 165 | 410 |
| RTX 4090 24GB | Llama 3.1 70B | Q4_K_M | 32 | 78 |
| 2x RTX 4090 (vLLM TP) | Llama 3.1 70B | FP16 | 60 | 220 |
| 8x L40S (vLLM) | Llama 3.1 70B | FP16 | 290 | 1,800 |
| 8x H100 SXM5 (vLLM) | Llama 3.1 70B | FP16 | 480 | 3,400 |
| 8x H100 SXM5 (vLLM) | Llama 3.1 405B | FP8 | 95 | 580 |
Annual throughput at realistic utilization (an 8-hour daily peak window, 40% busy within it):
Tokens/year = Tok/sec * 8 hr * 3600 * 365 * 0.40
For a 4090 doing 78 tok/sec on Llama 70B Q4 with parallelism: ~328M tokens/year. At our $1,710 mid-case TCO, unit cost is $5.21 per 1M tokens - well below GPT-4o-class API pricing, though not below the cheapest managed Llama hosts (more on that in the next section).
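The same throughput-to-unit-cost arithmetic as a small Python sketch, using this article's utilization model (names and defaults are illustrative):

```python
def tokens_per_year(tok_per_sec: float, peak_hours: float = 8,
                    busy_fraction: float = 0.40) -> float:
    """8-hour daily peak window, 40% busy within it."""
    return tok_per_sec * peak_hours * 3600 * 365 * busy_fraction

def cost_per_million(annual_tco_usd: float, tokens_year: float) -> float:
    return annual_tco_usd / tokens_year * 1e6

tokens = tokens_per_year(78)               # 4090, 70B Q4, 4 parallel slots
print(f"{tokens/1e6:.0f}M tokens/year")    # ~328M
print(f"${cost_per_million(1710, tokens):.2f}/M tokens")  # ~$5.21
```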
For more on choosing the right size, see our Ollama RAM/VRAM model master table.
The Self-Hosted vs Cloud Crossover {#crossover}
Cloud API list prices (April 2026, blended in/out at 30/70):
| Provider | Model | Blended $/M tokens |
|---|---|---|
| OpenAI | GPT-4o | $11.50 |
| OpenAI | GPT-4o mini | $0.45 |
| Anthropic | Claude Sonnet 4.5 | $9.00 |
| Anthropic | Claude Haiku 4.5 | $1.20 |
| AWS Bedrock | Llama 3.1 70B | $0.99 |
| Together AI | Llama 3.1 70B | $0.88 |
| Self-hosted | Llama 3.1 70B Q4 | $5.21 (mid case) |
| Self-hosted | Llama 3.1 70B Q4 | $1.80 (high util) |
Naive read: managed Llama hosting on Bedrock or Together is cheaper than your home 4090. That's true at low volume. The math flips when:
- You need data sovereignty (no third party touches the prompt)
- You're paying enterprise effective rates, not public list (committed enterprise Bedrock deployments, with private endpoints and support tiers, can run up to 5x list)
- Your workload spikes are bursty and you'd otherwise pay reserved capacity 24/7
- You need fine-tuned weights kept private
- Latency matters and you can colocate inference next to your app
Crossover for the 4090 mid-tier vs Together's $0.88/M:
Break_even_volume = Annual_TCO / Cloud_rate_per_token
= $1,710 / ($0.88 / 1M)
= 1.94B tokens/year
That's 5.3M tokens/day, achievable in a small product but not on a single user's chat. Hence the rule: if you're under 100M tokens/year, cloud usually wins on raw cost. Above 1B tokens/year, self-hosted typically wins by 3-5x. In between, it's a strategy question, not a math question.
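A minimal sketch of the crossover math against the list rates in the table above (function name is mine):

```python
def break_even_tokens(annual_tco_usd: float, cloud_usd_per_million: float) -> float:
    """Annual token volume at which self-hosted TCO equals the cloud bill."""
    return annual_tco_usd / cloud_usd_per_million * 1e6

for provider, rate in [("Together Llama 70B", 0.88),
                       ("Bedrock Llama 70B", 0.99),
                       ("GPT-4o blended", 11.50)]:
    vol = break_even_tokens(1710, rate)
    print(f"{provider:20s} {vol/1e9:5.2f}B tokens/year ({vol/365/1e6:.1f}M/day)")
# Together: ~1.94B/year (~5.3M/day); GPT-4o: ~0.15B/year (~0.4M/day)
```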
Worked Examples: 4 Real Workloads {#worked-examples}
Workload A: Solo developer, code completion via Continue.dev
- 200K tokens/day, 5 days/week = 52M tokens/year
- Best fit: existing M-series Mac running Llama 3.1 8B locally
- Cloud equivalent: Claude Haiku 4.5 = ~$62/year
- Self-hosted: $0 incremental (uses laptop you already own)
- Verdict: self-hosted wins (laptop is sunk cost)
Workload B: 25-person legal team, contract review
- 8M tokens/user/year * 25 = 200M tokens/year
- Privacy mandatory (HIPAA-adjacent data handling)
- Best fit: 4090 mid-tier box with auth and audit logging
- Cloud equivalent (Bedrock 70B with PrivateLink): ~$2,200/year
- Self-hosted TCO: ~$2,400/year in hardware amortization, power, and risk, plus a few hours/month of ops absorbed by existing IT staff (at $80/hr that line grows fast if it isn't absorbed)
- Verdict: rough parity on hard costs; self-hosted wins on data sovereignty and per-user log isolation
Workload C: 50,000 MAU SaaS app, light AI assist
- ~3B tokens/year
- Latency-sensitive (chat UX)
- Best fit: 2x 4090 colo'd box + cloud burst overflow
- Cloud-only equivalent (GPT-4o mini): $1,350/year, but with vendor lock and rate limits
- Self-hosted TCO: $5,800/year, supports up to 5B tokens
- Verdict: hybrid - self-host base load, burst to cloud for peaks. ~$3,200/year combined
Workload D: Bank with 200 internal users, 70B-class assistant
- ~12B tokens/year, regulated data
- Best fit: on-prem 8x L40S server, two for HA
- Cloud-equivalent (Azure OpenAI Llama 70B with private endpoint): $11,000/year + compliance overhead
- Self-hosted TCO: ~$32,000/year per server in hardware amortization and power, plus 0.5 FTE of ops on top
- Verdict: self-hosted wins on compliance, audit, and predictable spend, even though raw $/token looks higher
For a deep dive on the specific compliance side, see our SOC 2 self-hosted AI guide and the GDPR-compliant local AI guide.
The Calculator Formula {#calculator}
A complete TCO formula you can drop into a spreadsheet:
TCO_annual =
(Hardware_total / Amortization_years) # capex
+ (Avg_W * 8760 * Util * Rate_kWh / 1000) # power
+ (0.30 * Power_cost) # cooling
+ (Rack_monthly * 12 if colo else 0) # rack
+ (Ops_hours_year * Hourly_rate) # ops
+ (0.05 * Hardware_total) # insurance/risk
+ (Internet_monthly * 12) # bandwidth (if dedicated)
Cost_per_1k_tokens = TCO_annual / (Tokens_per_year / 1000)
For sensible defaults: Amortization_years = 3, Util = 0.35 (single-user) to 0.7 (team), Hourly_rate = $80, Rate_kWh = $0.14.
I keep this in a Google Sheet at the office; the same logic powers our (upcoming) interactive calculator on this site. Plug in your hardware list, power rate, and expected token volume, and you'll see exactly where you sit on the crossover curve.
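If you'd rather script it than spreadsheet it, here's the same logic as a Python function. Every default mirrors the "sensible defaults" above; the function names and the worked example at the bottom are mine, and all inputs are assumptions you should override with measured numbers:

```python
def tco_annual(hardware_total: float, avg_w: float, util: float,
               rate_kwh: float = 0.14, amort_years: float = 3,
               cooling_factor: float = 0.30, rack_monthly: float = 0,
               ops_hours_year: float = 48, hourly_rate: float = 80,
               risk_factor: float = 0.05, internet_monthly: float = 0) -> float:
    power = avg_w * 8760 * util * rate_kwh / 1000
    return (hardware_total / amort_years      # capex
            + power                           # power
            + cooling_factor * power          # cooling
            + rack_monthly * 12               # rack (0 if not colo)
            + ops_hours_year * hourly_rate    # ops
            + risk_factor * hardware_total    # insurance/risk
            + internet_monthly * 12)          # bandwidth (if dedicated)

def cost_per_1k_tokens(tco: float, tokens_per_year: float) -> float:
    return tco / (tokens_per_year / 1000)

# Mid-tier 4090 build, team-level utilization, 4 hr/month paid ops
tco = tco_annual(hardware_total=4679, avg_w=480, util=0.7)
print(f"${tco:,.0f}/year, ${cost_per_1k_tokens(tco, 328e6):.4f}/1K tokens")
```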
Hidden Risks and Insurance {#hidden-risks}
What naive TCO models ignore:
1. Drive failure - A failed NVMe takes models with it. Recovery: $30 disk + 4 hr re-pull at saturated bandwidth. Budget $50/year amortized.
2. PSU surge - A bad PSU can take a $1,600 GPU with it. Whole-home surge protection is $200 one-time and prevents this. Renter? Get a UPS.
3. Theft - A 4090 is now stolen-bike-tier loot. Renter's insurance covers it but ask explicitly about computer hardware riders.
4. Driver regressions - NVIDIA driver bugs have broken production inference for weeks at a time. Pin driver versions and keep the previous known-good one staged.
5. Cooling failure - A summer AC outage can throttle or kill a 4090 in an hour. Ambient temp monitoring with PagerDuty alerts is cheap insurance.
6. Compliance audit findings - Self-hosting shifts compliance burden from vendor to you. SOC 2 readiness is roughly $25K first year, $10K/year recurring for a small team. Often this is offset by removing a vendor's BAA cost - but plan for it explicitly.
7. Talent risk - If only one person knows how to restart Ollama after a kernel upgrade, the bus factor is 1. Document everything; pair on ops.
Pitfalls in Naive TCO Models {#pitfalls}
Mistakes I see in cost decks that get people fired:
- Comparing GPU sticker price to API price - Ignores power, ops, risk. Off by 3-4x.
- Assuming 100% utilization - Real utilization on internal tools is 5-15% of available hours.
- Ignoring ops cost when self-hosting at home - Hobbyist labor is "free" until you burn out.
- Using GPU TDP instead of measured wall power - Off by 1.3x easily; PSU efficiency varies.
- Forgetting bandwidth costs - First-time pulls of large models can be 100GB+. Comcast caps and overage fees apply.
- Ignoring cloud commit discounts - Enterprise OpenAI/Anthropic deals are routinely 20-50% off list. Get the actual price.
- Not modeling embedding costs separately - RAG pipelines have different unit economics; embeddings are usually 100x cheaper than generation but high volume.
- Pretending fine-tuning is "free" on cloud - Cloud fine-tune costs are often 100x inference and frequently the surprise that flips the decision.
Frequently Asked Questions
When does self-hosted AI become cheaper than OpenAI?
Roughly when you push more than 100-150M tokens/year of Llama 3.1 70B-class workload, assuming the mid-tier 4090 build at $1,710 annual TCO. Below that, cloud almost always wins on raw $/token, even before factoring in convenience. Above 1B tokens/year, self-hosted is typically 3-5x cheaper.
Is electricity really a meaningful cost?
For a single 4090 box running 70% of the time at $0.14/kWh, you're looking at roughly $500/year. That's modest in a corporate context but the second-largest line for home users after hardware. In high-rate states like California ($0.32/kWh) or Hawaii ($0.42/kWh), power can equal the GPU amortization.
How do I calculate cost-per-token for my Ollama setup?
Run `ollama run llama3.1:70b "Generate 200 tokens about anything"` while watching `nvidia-smi dmon`. Note the average watts and tokens/sec. Divide the hourly power cost (watts/1000 times your $/kWh rate) by tokens generated per hour to get the power-only $/1K tokens. Then layer in amortized hardware and ops time using the formula above.
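As a concrete sketch of that conversion (an illustrative helper, not an Ollama API):

```python
def power_cost_per_1k_tokens(watts: float, rate_kwh: float,
                             tok_per_sec: float) -> float:
    """Power-only cost per 1K generated tokens from wall watts and tok/s."""
    usd_per_hour = watts / 1000 * rate_kwh
    tokens_per_hour = tok_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1000

# 480W at the wall, $0.14/kWh, 32 tok/s on 70B Q4
print(f"${power_cost_per_1k_tokens(480, 0.14, 32):.5f}/1K tokens")  # ~$0.00058
```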
Should I buy used or new GPUs for AI?
Used 3090s at $700-900 are the highest tokens/dollar value in 2026 if you can find clean ones. Used 4090s carry a meaningful failure-rate premium because they were heavily used for Stable Diffusion. New 4090s ship clean and have warranty. Our used GPU AI buying guide has the field-tested checklist.
Does running AI 24/7 affect GPU lifespan?
Continuous inference is gentler on GPUs than gaming because clocks and temperatures stay flat instead of spiking. VRAM is DRAM and does not wear out from writes the way flash does; the bigger wear items are fans, thermal paste, and the NVMe drive absorbing constant model swaps. In practice, GPUs running steady inference 24/7 often outlast desktop gaming cards.
What about Apple Silicon for self-hosted AI?
M2 Ultra Mac Studios punch well above their watt-class for inference. A 192GB M2 Ultra can run Llama 3.1 70B at ~16 tok/sec while drawing 100W total - roughly a fifth the power of a 4090 box under load. They lose on raw throughput but win on tokens-per-watt and silence. Great for small teams.
How do I budget for ops time on self-hosted AI?
Plan 4-8 hours/month for a single-host deployment, 16-32 hours/month for multi-host or HA. At fully loaded $80/hr, that's $3,840-$30,720/year. This is the line where many small teams find self-hosted "wins" only if the engineer was doing ops anyway.
Are there hidden cloud savings I'm missing in the comparison?
Two big ones: (1) provisioned throughput discounts (Bedrock, Anthropic, Azure) can drop list prices 30-50%, but you commit to capacity 24/7 even when idle. (2) Caching - if 30% of your prompts are repeats, OpenAI's prompt caching slashes input costs by 50%. Self-hosted inherently caches and reuses KV state, but cloud caching is now competitive for text-heavy workloads.
Bottom Line
Self-hosted AI is not "free." It's "predictable." The CFO question isn't "what's the unit cost today" - it's "where do I want my marginal-cost curve to slope?" Cloud APIs give you a flat curve at a higher base. Self-hosted gives you a steep capex up front, near-zero marginal cost after that, and full control over the data path.
Run the numbers with your real volumes, real power rate, and real ops capacity. The crossover for most production AI workloads sits between 200M and 2B tokens/year. Below that, write a check to OpenAI and sleep well. Above that, the spreadsheet starts shouting at you to buy a 4090.
And whichever side you land on, please use measured numbers - not vendor marketing decks - to make the call.