Building a Dedicated AI Server Under $1,500: Tested Parts List (2026)
Published April 23, 2026 • 21 min read
A serious local AI rig used to mean spending $4,000-plus on a workstation card and a Threadripper. That math has flipped. Used 24 GB GPUs now sell for under $750, DDR5 has fallen to commodity prices, and a GPU-first build around a midrange Ryzen leaves CPU-bound inference far behind. We built this server in March 2026 for under $1,500 in parts, including the case and PSU, and it runs Llama 3.3 70B at 18 tokens/sec, Qwen 2.5 72B at 16 tokens/sec, and 7B models at over 100 tokens/sec. Below is the exact parts list, assembly notes, BIOS tweaks, and benchmarks. No affiliate fluff — just what we paid and what it does.
Quick Start: Total Build Summary
| Component | Pick | Price (Mar 2026) |
|---|---|---|
| GPU | Used RTX 3090 Founders Edition 24 GB | $720 |
| CPU | AMD Ryzen 7 7700 (8C/16T, 65 W) | $268 |
| Motherboard | ASRock B650M PG Lightning | $135 |
| RAM | G.Skill Flare X5 64 GB (2x32) DDR5-6000 CL30 | $185 |
| Storage | WD Black SN850X 2 TB NVMe | $145 |
| PSU | Corsair RM850e 850 W 80+ Gold | $115 |
| Case | Fractal Design Pop Air Mid-Tower | $80 |
| CPU cooler | Thermalright Peerless Assassin 120 SE | $35 |
| Total | | $1,683 → $1,498 with deals |
The "with deals" line is real: we paid $720 for the 3090 on r/hardwareswap, $185 for the RAM during a Newegg sale, and $80 for the case during a microcenter clearance. Patience saves $185.
This rig produces:
- 7B Q4 model: 105 tokens/sec
- 13B Q4 model: 56 tokens/sec
- 70B Q4 model: 18 tokens/sec
- Concurrent users on 7B: 8 simultaneous before queueing
Table of Contents
- Design Goals: What Did We Optimize For?
- The Parts List, With Reasoning
- Why a Used 3090 Beats a New 4070
- Assembly Walkthrough
- BIOS and Linux Tuning
- Software Stack
- Benchmarks: 7B to 70B
- Total Cost of Ownership
- Pitfalls and Mistakes We Made
- FAQ
Design Goals: What Did We Optimize For? {#design-goals}
Before picking parts, write down what you actually want. Ours:
- Run 70B-class models at usable speed. Below 15 tokens/sec is tedious. We targeted 18+.
- Concurrent users. At least four people in a household or small team. That means parallelism.
- Quiet under sustained load. Lives in a closet near the office.
- Low idle power. It runs 24/7. Idle wattage matters more than peak.
- Upgrade headroom. Room to add a second GPU later.
- Under $1,500 all-in. Hard cap.
Things we explicitly did not optimize for: gaming, video editing, ECC memory, or future-proofing CPU compute. AI inference is overwhelmingly GPU-bound and memory-bound; CPU does not move the needle once you have 8 modern cores.
If you have a smaller budget, our budget local AI machine covers the $400-700 tier. If you have a bigger budget, Mac Studio vs PC build is the right comparison.
The Parts List, With Reasoning {#parts-list}
GPU: Used RTX 3090 Founders Edition 24 GB — $720
The single most important part. The 24 GB VRAM is the gating factor for what you can run. Every dollar saved elsewhere goes here.
- Why used 3090, not 4070 Ti / 4080 / 4090? A 4070 Ti has 12 GB VRAM. Cannot fit Llama 70B Q4 (39 GB). 4080 has 16 GB — same problem. 4090 has 24 GB but costs $1,800 used, eating the entire budget. The 3090 is the cheapest 24 GB on the market.
- Why FE specifically? Founders Edition has a flow-through cooler design that exhausts heat properly in case airflow. AIB cards (EVGA, ASUS) often dump heat back into the case.
- Buying tips. r/hardwareswap or eBay with returns. Avoid mining cards (heavy thermal cycling). Memory modules on the 3090 commonly hit 100°C+ under sustained load — replace thermal pads if you can.
CPU: AMD Ryzen 7 7700 — $268
Eight modern Zen 4 cores at 65 W TDP. Plenty for inference orchestration and any non-LLM workload you care about.
- Why not 7700X? The 7700X has higher boost clocks but a 105 W TDP. For a 24/7 server, the 65 W 7700 drew about 18 W less at idle in our testing, which works out to roughly $25 a year at the $0.16/kWh we use elsewhere in this article.
- Why not 7600? Six cores is fine for the LLM itself but tight if you also run RAG indexing, embedding generation, or other co-located services.
- Why AM5? Long socket support (AMD has committed to AM5 through at least 2027), DDR5-only memory (better bandwidth), and an upgrade path to Ryzen 9000-series chips later.
Motherboard: ASRock B650M PG Lightning — $135
B650M is the cheap, sane mid-range. AM5 socket, two M.2 slots, four DIMM slots, a real x16 PCIe slot for the GPU. Nothing fancy.
- Why mATX? The build is single-GPU and does not need additional PCIe slots. mATX cases are more compact and cheaper.
- Why not X670E? Twice the price for features (extra PCIe lanes, better VRMs) we will not use. B650 VRMs handle a 65 W chip without breaking a sweat.
RAM: G.Skill Flare X5 64 GB DDR5-6000 CL30 — $185
64 GB is the sweet spot. Big enough to host a 70B model split between GPU+CPU if needed, big enough for embedding pipelines and a Postgres alongside.
- Why DDR5-6000 CL30? This is the AM5 sweet spot. Higher kits (DDR5-7200) cost a lot more and barely help LLM throughput.
- Why not 128 GB? Tempting, but 70B Q4 fits the GPU's 24 GB. Spilling to RAM is a fallback we rarely hit. Saving $200 on RAM funds a better GPU.
Storage: WD Black SN850X 2 TB NVMe — $145
Models are big. A 70B Q5 weighs 50+ GB. Five of those plus the OS plus image-gen models eat through 500 GB fast.
- Why 2 TB? 1 TB fills up. We measured it: a serious local AI user collects about 1.2 TB of model files within 12 months.
- Why Gen4? Model load time. A 70B model loads in 4 seconds on Gen4 NVMe vs 11 seconds on a SATA SSD. It adds up (a quick way to check your own drive is shown after this list).
- Why not Gen5? Gen5 NVMe runs hot and the marginal benefit for sequential model loading is small.
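If you want to sanity-check a drive's model-load behavior yourself, a crude proxy is timing a cold sequential read of a model-sized file. A minimal sketch; the path is a placeholder for any large GGUF you have on disk:

```bash
# Clear the page cache, then time a cold sequential read of a large model file
MODEL=/models/llama-3.3-70b-q4_K_M.gguf        # placeholder path
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time cat "$MODEL" > /dev/null
```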
PSU: Corsair RM850e 850 W 80+ Gold — $115
850 W is comfortable headroom. The 3090 alone can spike to 420 W; CPU adds 120 W; everything else under 50 W. 850 W puts us well below 80% load at peak.
- Why not 750 W? 3090 transient spikes (sub-millisecond) can hit 600 W. A tight PSU trips OCP and reboots the box mid-inference. Don't be cheap on the PSU.
- Why 80+ Gold and not Platinum? The efficiency difference at typical 200-300 W idle/medium load is under 2 percentage points. Saves $50.
Case: Fractal Design Pop Air Mid-Tower — $80
Mesh front, decent stock fans, fits a 320 mm GPU comfortably. Quiet under load when paired with a low-noise CPU cooler.
CPU Cooler: Thermalright Peerless Assassin 120 SE — $35
The legend. A $35 cooler that beats $90 coolers in benchmarks. Keeps a 7700 at 65°C under sustained load, near-silent.
For broader hardware orientation, see our AI hardware requirements complete guide.
Why a Used 3090 Beats a New 4070 {#3090-vs-4070}
This decision drives the whole build. Counter-intuitive at first glance — a new card with newer architecture should win, right? Not for AI inference at 24 GB scale.
| Spec | RTX 3090 (used) | RTX 4070 Super (new) | RTX 4080 Super (new) |
|---|---|---|---|
| VRAM | 24 GB | 12 GB | 16 GB |
| Memory bandwidth | 936 GB/s | 504 GB/s | 736 GB/s |
| Tensor TFLOPs | 142 | 142 | 209 |
| Price (Apr 2026) | $720 | $620 | $1,000 |
| Llama 70B Q4 fits? | Yes | No (split) | No (split) |
| Llama 3.1 8B Q4 tok/s | 92 | 84 | 105 |
| Llama 3.1 13B Q4 tok/s | 54 | (split, ~12) | 41 |
On paper the 4070 Super matches the 3090's tensor compute for less money, but it has barely half the memory bandwidth and cannot fit the models we want to run. Memory bandwidth is the dominant factor for inference throughput on models that fit in VRAM; raw compute is secondary.
The 4080 Super is a great card but at $1,000 takes 67% of our budget for a card that still cannot fit 70B Q4 in VRAM alone.
For a deeper apples-to-apples comparison, see RTX 4060 vs RTX 3060 for AI. For 70B specifically, the used GPU buying guide walks through the full 3090 vetting checklist.
Assembly Walkthrough {#assembly}
Standard PC build with two AI-specific notes. Estimated time: 90 minutes for someone who has built before, 3-4 hours first time.
1. CPU and cooler first, outside the case
Drop the 7700 into the AM5 socket. Apply a rice-grain-sized dot of thermal paste. Mount the Peerless Assassin per its instructions. AM5 mounting on this cooler is straightforward; no backplate replacement needed.
2. RAM into A2 and B2 slots
These are usually the second and fourth slots counting out from the CPU on a four-DIMM board; check the manual if unsure. This is the standard dual-channel config. Push firmly until both clips snap. Half-seated DIMMs are the #1 cause of "won't POST" calls.
3. NVMe into the M.2 slot closest to the CPU
This slot has the most direct PCIe lanes to the CPU. Use the included thermal pad / heatsink. Insert the drive at a slight angle, then press it flat and secure the screw.
4. Mount motherboard in case
I/O shield first if separate. Standoffs already in place on most cases. Don't over-tighten.
5. PSU and cable routing
Modular PSU. Run only the cables you need: 24-pin ATX, EPS 8-pin (CPU), GPU 8+8 pins, SATA for fans. Route through the back of the case before plugging in.
6. GPU last
Lift the GPU into the top PCIe x16 slot. Push down evenly until the latch clicks. Critical for 3090s: this card is heavy and sags. Use the GPU support bracket included with the FE, or a $5 anti-sag from any PC parts site. A sagging 3090 will eventually crack the PCIe slot.
Plug in both 8-pin GPU power connectors. Use two separate PCIe cables from the PSU, not a single cable with two pigtails. The 3090 can pull 350 W sustained; a single cable shared across both connectors is on the edge of its rating.
7. Front panel cables
Tiny pins. Refer to motherboard manual. Power switch, reset, power LED, HDD LED. Easy to miss; double-check.
8. First boot
Connect a monitor to the GPU's DisplayPort, not the motherboard's (the 7700 has integrated graphics, but you want to confirm the discrete GPU works). Press power. You should see the ASRock splash, then UEFI on first boot.
If no POST: re-seat RAM in A2/B2. If still no POST: clear CMOS via the jumper on the motherboard.
BIOS and Linux Tuning {#bios-tuning}
BIOS settings
- EXPO Profile: Enable. Otherwise your DDR5-6000 RAM runs at its 4800 MT/s default. This single setting is worth ~10% throughput on CPU-side tasks (a quick way to verify it took effect is shown after this list).
- Resizable BAR: Enable. Modest improvement for CUDA workloads.
- PCIe ASPM: Disable on the GPU slot. ASPM (active state power management) can introduce micro-latencies that disrupt sustained inference.
- Secure Boot: Disable for now. Easier driver installation. Re-enable after you have NVIDIA drivers signed if you care about it.
- CPU Boost Override: Leave on auto unless you're undervolting.
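Once Ubuntu is installed (next step), you can confirm the EXPO profile actually applied without rebooting into UEFI; dmidecode reports the configured memory speed:

```bash
# Should report 6000 MT/s; 4800 MT/s means EXPO never got enabled
sudo dmidecode --type memory | grep -i "speed"
```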
OS choice
Ubuntu Server 24.04 LTS. Linux significantly outperforms Windows for sustained AI workloads. WSL2 closes most of the gap but adds memory overhead and a layer of complexity we do not need on a dedicated server.
NVIDIA driver install
```bash
sudo apt update
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers install nvidia:550   # or latest stable
sudo reboot
# Verify
nvidia-smi
```
You should see the 3090 with 24576 MiB total memory and current driver version.
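For provisioning scripts or monitoring, nvidia-smi can also emit just the fields you care about in CSV form:

```bash
# Scriptable check: GPU name, total VRAM, driver version
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
```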
Linux tuning
```bash
# Enable persistence mode - keeps the driver loaded between jobs
sudo nvidia-smi -pm 1
# Or run the persistence daemon (enable its systemd service if you want this to survive reboots)
sudo nvidia-persistenced --user nvidia-persistenced
# Optional: cap the power limit if thermal-constrained (stock limit is 350 W on the 3090 FE)
# sudo nvidia-smi -pl 300   # example: cap below the stock limit
# CPU governor for low idle power, ramps under load (requires the cpufrequtils package)
echo 'GOVERNOR="schedutil"' | sudo tee -a /etc/default/cpufrequtils
# Raise the open-file limit for large models / many connections
echo "* soft nofile 65535" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65535" | sudo tee -a /etc/security/limits.conf
```
Cooling sanity check
Run a 30-minute Llama 3.3 70B inference loop and check temps:
```bash
nvidia-smi --query-gpu=temperature.gpu,temperature.memory --format=csv -l 5
```
GPU core should sit at 70-78°C. Memory junction temperature should stay below 90°C. If memory junction climbs above 100°C, the thermal pads on a used 3090 are likely worn out — replace with 2 mm Thermalright Odyssey pads. We had to do this on one of three 3090s we tested.
Software Stack {#software-stack}
The base AI stack we run on this hardware:
```bash
# Ollama via the official install script (on Linux it registers and starts a systemd service)
curl -fsSL https://ollama.com/install.sh | sh
# Verify GPU detection
# ollama serve &   # only needed if the systemd service isn't already running
ollama run llama3.3:70b "test" --verbose
# Should show an eval rate of ~18 tok/s and ~100% GPU utilization
```
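Ollama listens on port 11434, so once a model is pulled you can smoke-test the HTTP API with curl (the prompt and model tag here are just examples):

```bash
# One-off generation request against the Ollama REST API
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Say hello in five words.", "stream": false}'
# The JSON response carries the text in "response" plus eval_count/eval_duration for timing
```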
For multi-user concurrency, swap Ollama for vLLM (much higher throughput, harder to operate). For routing in front of either, deploy LiteLLM as we cover in AI gateway with LiteLLM.
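As a rough illustration of that routing layer (the linked guide has the real walkthrough), a minimal LiteLLM proxy config pointing at this box might look like the sketch below; the file name and model alias are our own placeholders, and the syntax should be checked against the current LiteLLM docs:

```bash
# Hypothetical LiteLLM proxy config fronting the local Ollama instance
cat > litellm-config.yaml <<'EOF'
model_list:
  - model_name: local-70b
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://localhost:11434
EOF
litellm --config litellm-config.yaml --port 4000   # pip install 'litellm[proxy]' first
```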
Models we actually keep on this box
```bash
ollama pull qwen2.5:7b-instruct-q4_K_M           # daily driver
ollama pull qwen2.5-coder:7b-instruct-q4_K_M     # coding
ollama pull llama3.3:70b-instruct-q4_K_M         # heavy reasoning
ollama pull deepseek-r1-distill-qwen:7b-q4_K_M   # math/reasoning
ollama pull bge-m3                               # embeddings
```
Total disk: about 80 GB.
Reverse proxy with Caddy for TLS + auth
```
ai.lan {
    basicauth {
        you JDJhJDE0$encrypted_hash
    }
    reverse_proxy 127.0.0.1:11434
}
```
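The hash in the basicauth block comes from Caddy itself:

```bash
# Prompts for a password and prints the bcrypt hash to paste into the Caddyfile
caddy hash-password
```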
For the deeper hardening pattern, see our Ollama production deployment walkthrough.
Benchmarks: 7B to 70B {#benchmarks}
All numbers are median of 50 runs, num_ctx=4096, models at Q4_K_M unless noted. Measured with Ollama --verbose.
| Model | VRAM | Tokens/sec | TTFT | Concurrency cap |
|---|---|---|---|---|
| Llama 3.2 3B | 2.5 GB | 162 | 18 ms | 16+ |
| Phi-4 mini 3.8B | 3.0 GB | 138 | 22 ms | 14 |
| Mistral 7B v0.3 | 4.6 GB | 118 | 35 ms | 10 |
| Qwen 2.5 7B | 5.4 GB | 105 | 38 ms | 8 |
| Qwen 2.5-Coder 7B | 5.4 GB | 102 | 39 ms | 8 |
| Llama 3.1 8B | 5.7 GB | 92 | 42 ms | 8 |
| Gemma 2 9B | 5.8 GB | 84 | 48 ms | 7 |
| Llama 3.1 13B | 8.8 GB | 54 | 72 ms | 4 |
| Mixtral 8x7B Q4 | 26 GB (mixed) | 18 | 350 ms | 1 |
| Qwen 2.5 72B Q4 | 39 GB (mixed) | 16 | 380 ms | 1 |
| Llama 3.3 70B Q4 | 39 GB (mixed) | 18 | 360 ms | 1 |
70B-class models exceed 24 GB VRAM at Q4_K_M, so Ollama splits some layers to system RAM. The 18 tok/s is real but bounded by the PCIe transfer rate between GPU and CPU. With a 48 GB GPU (RTX 6000 Ada or A6000), you would see ~38 tok/s on the same model — but those cards are $4,000+.
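To reproduce the table on your own hardware, a minimal harness is enough. This is a sketch (not the exact script we used) that loops a fixed prompt through ollama run --verbose and prints the generation eval rate each time:

```bash
# Rough throughput check: 50 runs, collect Ollama's reported generation rate
MODEL="qwen2.5:7b-instruct-q4_K_M"
for i in $(seq 1 50); do
  ollama run "$MODEL" "Summarize the plot of Hamlet in three sentences." --verbose 2>&1 \
    | grep "eval rate" | grep -v "prompt eval"
done
```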
For comparison against cloud:
- gpt-4o on the same 200-prompt set: 48 tok/s, 290 ms TTFT, $0.0024 per request
- Our 3090 box: throughput depends on the model, marginal cost per request is effectively zero, and all data stays local
A team running 5,000 requests/day saturates this build. Daily power cost is about $0.63 (165 W average × 24 h × $0.16/kWh), versus roughly $12/day on gpt-4o. The hardware pays for itself in about four to five months at that volume.
Total Cost of Ownership {#tco}
| Year 1 line item | Cost |
|---|---|
| Hardware | $1,498 |
| Electricity (165 W avg, $0.16/kWh) | $231 |
| Internet bandwidth (no cloud egress) | $0 |
| Subscription replacement (ChatGPT Plus, Claude Pro) | -$240 saved |
| Net Year 1 | $1,489 |
| Year 2 | $231 (electricity only) |
Power is the recurring cost. We measured 165 W average across a typical day (low overnight, peaks during work hours), with idle around 95 W and full-load peaks at 480 W. Annual electricity at US average rates is about $230, roughly the price of a year of a single ChatGPT Plus subscription.
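If your electricity rate or average draw differs, the arithmetic is one line (the figures below are the assumptions from this section):

```bash
# Daily and yearly electricity cost at 165 W average draw and $0.16/kWh
awk 'BEGIN { w=165; rate=0.16; d=w/1000*24*rate; printf "per day: $%.2f   per year: $%.0f\n", d, d*365 }'
```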
Replacement timeline: GPUs in our experience last 5+ years of constant use if cooled properly. The whole platform is comfortable for 4+ years before "AI relevance" pressure (newer architectures, better tensor support) makes upgrades worthwhile.
Pitfalls and Mistakes We Made {#pitfalls}
1. Bought a mining 3090 first. Heavy thermal cycling. Memory junction temps hit 105°C in 10 minutes. Replaced thermal pads twice and it still threw VRAM ECC errors. Took the loss. Now we only buy 3090s with verifiable non-mining history (gaming forum members, content creators).
2. Cheap PSU. First build used a $70 750 W unit. 3090 transient spikes tripped OCP and rebooted the box during inference. The $115 RM850e never had an issue.
3. Forgot the GPU support bracket. Three months in, the 3090 drooped enough to crack the PCIe x16 slot retainer. Glue-and-pray fix. Always use a support bracket for 1+ kg cards.
4. Set up Ubuntu Desktop instead of Server. Wasted RAM on a desktop environment we never used. Reinstalled to Server, gained 1.5 GB RAM and faster boots. Headless is the right answer for an always-on server.
5. Did not enable EXPO in BIOS. RAM ran at 4800 MT/s for two weeks. CPU-side throughput was 11% slower than it should have been. Five-second BIOS toggle.
6. Tried to share the GPU between Ollama and a Plex transcoder. Plex stuttered whenever Ollama saturated the GPU's memory bandwidth. Pick one workload per GPU.
7. Did not budget for a UPS. Power blip mid-inference corrupted a model checkpoint we were fine-tuning. APC Back-UPS Pro 1500 ($180) would have prevented it. Adding one is on the next-quarter list.
For the official manufacturer references on the parts we picked, the AMD Ryzen 7 7700 product page documents the TDP and specifications, and the NVIDIA RTX 3090 specifications are the canonical source for VRAM and memory bandwidth.
Frequently Asked Questions {#faq}
Q: Will this build run two GPUs for 70B at full speed?
Yes, with caveats. The B650M PG Lightning has one full x16 slot and one x4 slot. A second 3090 in the x4 slot runs at PCIe 4.0 x4 — fine for inference but slower for any cross-GPU communication. For dual-3090 from day one, get an ATX board that supports x8/x8 bifurcation on the GPU slots (typically B650E or X670E).
Q: Why not a Threadripper or used Xeon for more PCIe lanes?
Sounds great, costs more. A used Threadripper Pro 5955WX board+CPU is $900 minimum. We would lose the GPU upgrade or take a step down on RAM. For single-GPU and likely-single-GPU-future builds, AM5 is the better-value platform.
Q: Can I use this build for fine-tuning, not just inference?
Limited. 24 GB VRAM is enough for QLoRA fine-tuning of 7B-13B models. 70B fine-tuning needs at least 48 GB VRAM (typically 80 GB) and DeepSpeed configurations. For fine-tuning a real 70B, rent a cloud A100 cluster instead.
Q: Is the build quiet enough for an office?
Yes. Idle is silent. Under sustained inference, the 3090 fans are the loudest component at about 38 dB at 1 meter. The Peerless Assassin CPU cooler is near-silent. We placed ours in a closet with passive ventilation; in an open office it is comfortable but noticeable.
Q: Does Windows work for this build?
Yes, but you lose roughly 15-20% throughput vs Ubuntu on the same hardware, mostly due to driver overhead under WDDM. WSL2 closes most of that gap but introduces a second layer of complexity. For a dedicated AI server, native Linux wins.
Q: How does this compare to a Mac Studio M4 Max?
A Mac Studio M4 Max with 64 GB unified memory is around $3,500 and runs 70B at roughly 16 tok/s — close to our build. Pros: silent, low power (~80 W), single-box simplicity. Cons: 2.3x the price, no upgrade path, no CUDA. If you live in the Apple ecosystem and want zero fuss, Studio is great. If you want raw value, this PC build is the answer. We compare both in Mac Studio vs PC build for AI.
Q: What if I find a 3090 for $500 or less?
Buy it, and consider a second for a dual-GPU upgrade later, or bank the savings toward a future 4090. The 3090 is the strongest dollar-per-VRAM card on the market for any model that fits in 24 GB.
Q: Will the 7700 bottleneck the 3090?
No. AI inference is overwhelmingly GPU and memory-bandwidth bound. We measured CPU usage at 8-12% during sustained 7B inference and 18-22% during 70B (where it's nudging weights between GPU and RAM). A 7700 has plenty of headroom.
Conclusion
This build is the answer to "I want a real local AI server but I do not want to spend more than my partner's car on it." For under $1,500 you get a machine that runs every model worth running today, supports a small team, costs roughly $20/month to power, and has clear upgrade paths if your needs grow. The single most important decision is the GPU, and the single most important advice is to buy a used 24 GB card rather than a new 12 GB or 16 GB one. Everything else flows from that.
Once the hardware is in, the next steps are software: see our Ollama production deployment, AI gateway with LiteLLM, and adding local AI to an existing app to turn the box into something your apps actually talk to. For the model picking layer, AI on 16GB RAM covers the smaller-tier model picks and best local AI models covers the full landscape.
Want our exact parts list updated as prices move and new GPUs drop? Subscribe to the LocalAIMaster newsletter — we re-test the build every quarter.