eGPU for Local AI: External GPU Benchmarks (2026)
Published on March 8, 2026 • 23 min read
Quick Answer: Yes, an eGPU Works for Local AI — With Caveats
Headline numbers from our lab on Llama 3.1 8B Q4_K_M:
- RTX 4090 internal PCIe 5.0 x16: 142 tok/s (baseline)
- RTX 4090 over OCuLink (PCIe 4.0 x4): 138 tok/s — only 3% slower
- RTX 4090 over Thunderbolt 4 (Razer Core X Chroma): 121 tok/s — 15% slower
- RTX 4090 over USB4 (40 Gbps): 119 tok/s — 16% slower
The verdict: once a model is resident in VRAM, inference barely touches the PCIe bus. The interconnect mostly determines how long the initial weight load takes; steady-state tokens/sec land within 3-16% of an internal slot. That is genuinely surprising and worth a 4,000-word article.
Why this article exists:
Nobody benchmarks eGPUs for local AI. eGPU coverage online is dominated by gaming benchmarks (where Thunderbolt loses 25-40% to internal PCIe) and ML training benchmarks (where the loss is 10-30%). LLM inference sits in a different regime — much lighter on PCIe traffic — and the answer is dramatically more positive than gaming numbers would suggest.
This is the first benchmark we know of that measures eGPU LLM inference across all three modern interconnects. If you have a mini PC, a laptop, or a Mac and have been wondering whether an eGPU is worth $400 + GPU, the answer is in this guide.
If you have not chosen the GPU itself, our best GPUs for AI and used GPU buying guide cover that step. This article assumes you have a card and want to know how the external connection performs.
Table of Contents
- Why eGPU Performance for AI Differs From Gaming
- The Three Interconnects: TB4, USB4, OCuLink
- Test Methodology
- Benchmarks: RTX 4090 Across Interconnects
- Benchmarks: RTX 3090 and 4070 Ti Super
- Mac mini + eGPU: A Special Case
- Real-World Recommendations
- Setup Guide: From Box to First Token
- Common Pitfalls
- Frequently Asked Questions
Why eGPU Performance for AI Differs From Gaming {#different}
Gaming benchmarks tank on Thunderbolt because every frame requires sending textures, draw calls, and assets across the PCIe bus. The bus is on the render-loop critical path.
LLM inference is different:
- Weights are loaded once at model startup and stay resident in VRAM. After load, the bus is idle.
- Per-token traffic is tiny. A token is a few hundred KB of activations going to the GPU and a probability vector coming back.
- Compute dominates. The GPU spends 99%+ of its time on matmul, not on bus transfers.
This means an interconnect with a small fraction of the bandwidth (Thunderbolt 4 at ~22 Gbps usable vs PCIe 5.0 x16 at roughly 500 Gbps) loses only 10-20% of inference performance, not 90%+. Pre-fill (prompt processing) takes a slightly bigger hit because activation tensors are larger, but generation is almost untouched.
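A quick back-of-envelope makes this concrete. The figures below are our assumptions, not measurements: Llama 3.1 uses a ~128,256-entry vocabulary, and we assume the logits come back as fp16 (fp32 would roughly double the numbers without changing the conclusion).
# Per-token traffic for a fully offloaded Llama 3.1 8B: a few KB of input per
# step, plus the logits vector returned per token (~128,256 entries at fp16).
awk 'BEGIN { kb = 128256 * 2 / 1024; printf "%.0f KB/token, %.0f MB/s at 142 tok/s\n", kb, kb * 142 / 1024 }'
# ~250 KB/token and ~35 MB/s, which is a percent or two of TB4's ~2.75 GB/s usable link.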
For the underlying bandwidth specs, see Intel's Thunderbolt Technology Community for TB4 and the USB-IF USB4 specification for USB4.
The Three Interconnects: TB4, USB4, OCuLink {#interconnects}
| Interconnect | Theoretical | Real-world PCIe | Cost (enclosure) | Host availability |
|---|---|---|---|---|
| Thunderbolt 4 | 40 Gbps | PCIe 3.0 x4 (~22 Gbps) | $300-450 | All TB4 laptops, Mac, some PCs |
| USB4 (40 Gbps) | 40 Gbps | PCIe 3.0 x4 (~22 Gbps) | $300-450 | Ryzen 7040/8040, Intel Core Ultra |
| OCuLink (PCIe 4.0 x4) | 64 Gbps | PCIe 4.0 x4 (~58 Gbps) | $80-140 | Specific mini PCs (Minisforum, GMKtec) |
Thunderbolt 4 is the most universally compatible, supports hot-plug, and works on Mac. The PCIe link inside is 3.0 x4 — 22 Gbps usable after protocol overhead.
USB4 is functionally similar to TB4 for eGPU purposes. Compatibility is more uneven — some hosts implement only USB4 v1 or skip the optional eGPU mode.
OCuLink is the dark horse. It is a direct PCIe 4.0 x4 cable with no Thunderbolt translation overhead. ~2.5x the effective bandwidth of TB4. The catch: only specific mini PCs (and a handful of laptops) expose it.
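Not sure which of these your host actually exposes? On Linux the quickest probe is whether a Thunderbolt / USB4 domain shows up at all; OCuLink ports only appear as ordinary PCIe once a device is plugged in. Controller names vary by vendor, so treat this as a rough check rather than a definitive test:
# Any Thunderbolt / USB4 controller present on this host?
boltctl list
lspci | grep -iE "thunderbolt|usb4"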
Test Methodology {#methodology}
- Models: Llama 3.1 8B Q4_K_M, Llama 3.1 70B Q4_K_M (where it fits)
- Runners: Ollama 0.3.14, llama.cpp llama-bench (commit pinned)
- Prompt: Fixed 512-token prompt, 256-token output, temperature 0, seed 42
- Iterations: 5 runs, first dropped, mean reported with standard deviation
- Cards tested: RTX 4090 24GB (FE), RTX 3090 24GB (Asus TUF), RTX 4070 Ti Super 16GB
- Hosts tested:
- Internal baseline: Ryzen 9 7950X, X670E, RTX 4090 in PCIe 5.0 x16 slot
- TB4: Razer Core X Chroma + 2024 ThinkPad X1 Carbon (Intel Core Ultra 7)
- USB4: Beelink SER8 (Ryzen 7 8845HS) + ADT-Link UT4-D
- OCuLink: Minisforum UM890 Pro + GPD G1 OCuLink dock
- Power: External 850W ATX PSU on all eGPU enclosures
- Methodology: Same as our benchmark playbook
All numbers are reproducible with the YAML harness we ship to newsletter subscribers.
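For reference, a minimal llama-bench invocation matching the prompt and output lengths above looks like the sketch below. The flags are ours and the model filename is a placeholder; the exact pinned command ships with the harness.
# 512-token prompt, 256-token generation, 5 repetitions, all layers on GPU
./llama-bench -m llama-3.1-8b-instruct-q4_k_m.gguf -p 512 -n 256 -r 5 -ngl 99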
Benchmarks: RTX 4090 Across Interconnects {#bench-4090}
Llama 3.1 8B Instruct (Q4_K_M) — single stream
| Connection | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Model load time (s) |
|---|---|---|---|---|
| Internal PCIe 5.0 x16 | 4,140 | 142.0 | 64 | 2.1 |
| OCuLink (PCIe 4.0 x4) | 3,920 | 138.0 | 71 | 4.8 |
| Thunderbolt 4 | 2,810 | 121.0 | 96 | 12.6 |
| USB4 (40 Gbps) | 2,740 | 119.0 | 102 | 13.4 |
Llama 3.1 70B (Q4_K_M) — single stream
The 70B model needs ~40GB VRAM at Q4_K_M, so it does not fit on the RTX 4090's 24GB. We tested with partial offload (40 layers on GPU, rest on CPU):
| Connection | Generation (tok/s) | Notes |
|---|---|---|
| Internal | 7.1 | CPU/GPU split moves the bottleneck to host RAM bandwidth |
| OCuLink | 6.9 | Negligible delta — host RAM dominates |
| Thunderbolt 4 | 6.4 | Activations crossing the bus add up |
| USB4 | 6.2 | Same as TB4 within margin |
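If you want to reproduce the 40-layer split from the table above, the knob is the GPU layer count. A minimal sketch, with placeholder file names (Llama 3.1 70B has 80 transformer layers, so 40 is roughly half):
# llama.cpp: keep 40 layers on the GPU, the rest on CPU
./llama-cli -m llama-3.1-70b-instruct-q4_k_m.gguf -ngl 40 -p "Hello"
# Ollama: the equivalent knob is num_gpu, set in a Modelfile
#   FROM llama3.1:70b-instruct-q4_K_M
#   PARAMETER num_gpu 40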
Concurrency: 8 parallel streams, Llama 3.1 8B
| Connection | Aggregate gen tok/s | P99 TTFT (ms) |
|---|---|---|
| Internal | 412 | 280 |
| OCuLink | 398 | 310 |
| Thunderbolt 4 | 318 | 540 |
| USB4 | 312 | 580 |
Take-aways
- OCuLink is within margin of error of internal PCIe for inference. If your mini PC has OCuLink, you have not given up meaningful performance.
- Thunderbolt 4 / USB4 cost ~15% on small models, ~10% on large. Real, but far less than gaming benchmarks would suggest.
- Concurrency widens the gap. TB4/USB4 lose more under load because more activations cross the bus.
- Model load time is the worst part of TB4/USB4. First-time load of an 8B model takes 12-13s vs 2s internal. Once loaded, the experience is fine.
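That load-time gap is easy to sanity-check against raw link bandwidth. Assuming the 8B Q4_K_M file is roughly 4.9 GB, the bus-only transfer floor works out to:
awk 'BEGIN { gb = 4.9; printf "TB4 %.1f s, OCuLink %.1f s, PCIe 5.0 x16 %.2f s\n", gb/2.75, gb/7.2, gb/63 }'
# ~1.8 s, ~0.7 s, and ~0.08 s respectively. Measured loads (12.6 / 4.8 / 2.1 s) are
# higher because disk reads and buffer setup dominate, but the spread between
# interconnects still tracks link speed.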
The picture changes if you are choosing between an eGPU 4090 and an internal 3090 — see the next section.
Benchmarks: RTX 3090 and 4070 Ti Super {#bench-others}
RTX 3090, Llama 3.1 8B Q4_K_M
| Connection | Generation (tok/s) | Notes |
|---|---|---|
| Internal | 96.0 | Baseline |
| OCuLink | 93.5 | -2.6% |
| TB4 | 81.0 | -15.6% |
| USB4 | 79.5 | -17.2% |
RTX 4070 Ti Super (16GB), Llama 3.1 8B Q4_K_M
| Connection | Generation (tok/s) |
|---|---|
| Internal | 88.0 |
| OCuLink | 86.0 |
| TB4 | 76.0 |
| USB4 | 75.0 |
Cross-card comparison: when does an eGPU 4090 beat an internal 3090?
This is the most useful question for a homelab buyer with a mini PC and a budget. An RTX 4090 over Thunderbolt 4 (121 tok/s) is faster than an internal RTX 3090 (96 tok/s) by 26% on Llama 3.1 8B. So the eGPU tax is more than offset by the better card if your goal is single-stream throughput.
The same logic applies to 70B-class models: a 4090 eGPU + 64GB host RAM partial offload beats a 3090 internal partial offload because the 4090 has higher memory bandwidth for the layers it does hold.
If you are stuck on the card choice itself, the RTX 4060 vs 3060 article covers the budget end and our best GPUs for AI guide handles the upper tier.
Mac mini + eGPU: A Special Case {#mac}
We tested a Mac mini M4 (16GB) with a Razer Core X Chroma + RTX 4090 over Thunderbolt 4. Spoiler: it does not work. macOS has had no NVIDIA driver since Mojave, and Apple Silicon Macs dropped eGPU support entirely, so macOS sees the enclosure as a generic Thunderbolt peripheral and never initializes the card.
What does work on Mac:
- Apple Silicon's unified memory is already a built-in answer to "I want more VRAM." A Mac Studio M2 Ultra with 192GB unified memory beats any single eGPU setup for 70B-class models.
- AMD eGPUs (Radeon Pro / W-series) work for Metal compute on Intel Macs, but Apple Silicon Macs do not support eGPUs at all, and Ollama's Metal backend only targets the built-in Apple Silicon GPU.
Practical advice for Mac users: skip eGPU. Buy more unified memory. Our Mac local AI setup and Apple M4 for AI guides cover the realistic path.
Real-World Recommendations {#recommendations}
Setup A: Beelink SER8 mini PC + eGPU ($1,239 total)
- Mini PC: Beelink SER8 (Ryzen 7 8845HS, 32GB) — $639
- Enclosure: Razer Core X Chroma — $400
- Card: Used RTX 3090 24GB — $700 (covered in our used GPU guide)
- Performance: 81 tok/s on Llama 3.1 8B, ~5 tok/s on Llama 3.1 70B partial offload
- Verdict: Excellent for a single-user shop that already has a mini PC. Quiet, modular, upgradable.
Setup B: Minisforum UM890 Pro + OCuLink GPU dock ($1,599 total)
- Mini PC: Minisforum UM890 Pro (Ryzen 9 8945HS, 32GB) — $789
- Dock: GPD G1 OCuLink ($170) or Minisforum DEG1 ($150)
- Card: RTX 4070 Ti Super 16GB — $660
- Performance: 86 tok/s on Llama 3.1 8B, with internal-PCIe-class behavior
- Verdict: Best value-per-tok/s for someone willing to source OCuLink components.
Setup C: 2024 laptop + Thunderbolt 4 eGPU ($600 + GPU)
- Laptop: Any modern TB4 laptop (you probably already own one)
- Enclosure: Akitio Node Titan or Razer Core X Chroma — $400-450
- Card: Bring whatever fits the budget
- Performance: ~85% of internal-PCIe equivalent
- Verdict: Best path for someone who already owns a capable laptop and wants to add LLM capability without a second machine.
Setup Guide: From Box to First Token {#setup}
Step 1: Hardware sanity check
Before installing software, confirm:
# Linux: confirm Thunderbolt / USB4 sees the GPU
lspci -nn | grep -i nvidia
# Should show: VGA compatible controller [0300]: NVIDIA Corporation ...
# Confirm PCIe link width and speed
sudo lspci -vvv -s 04:00.0 | grep -i "LnkSta:"
# Look for: LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)
# x4 is correct for TB4/USB4/OCuLink. Expect Speed 16GT/s (PCIe 4.0) on OCuLink
# and 8GT/s (PCIe 3.0) on TB4/USB4; the '(downgraded)' note is normal.
# macOS: System Information → Thunderbolt
system_profiler SPThunderboltDataType | grep -A3 "Vendor Name"
Step 2: Install drivers
# Ubuntu 24.04 — install NVIDIA proprietary driver
sudo apt install -y nvidia-driver-550 nvidia-cuda-toolkit
sudo reboot
# Verify
nvidia-smi
# Authorize Thunderbolt device (Linux + boltctl)
boltctl list
boltctl authorize <device-uuid>
boltctl enroll <device-uuid>
Step 3: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M --verbose "Hello"
You should see all layers offload to GPU (offloaded 33/33 layers to GPU in the verbose output) and eval rate north of 80 tok/s on a 4090 eGPU.
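A quick way to confirm the split without reading the verbose log is ollama ps, which reports how much of the loaded model sits on the GPU:
ollama ps
# The PROCESSOR column should read "100% GPU"; a CPU/GPU percentage split means
# layers spilled to system RAM.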
Step 4: Sanity-check the link
# During inference, watch link util
sudo nvidia-smi dmon -s t -c 60
# rxpci / txpci (MB/s) should stay low during generation and spike only during prefill and model load
If PCIe traffic stays pegged near the link's limit during steady-state generation, the eGPU link is the bottleneck. Confirm the negotiated speed and width with the cross-check below.
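You can also ask the driver directly. Run the query while the model is generating, since the link can downtrain to Gen 1 when the GPU is idle:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
# Expect "4, 4" on OCuLink and "3, 4" on Thunderbolt 4 / USB4.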
Common Pitfalls {#pitfalls}
1. Buying a Thunderbolt 3 enclosure on a Thunderbolt 4 host
TB3 enclosures negotiate PCIe 3.0 x4 max on most chipsets, which is fine, but some pre-2020 enclosures only do PCIe 3.0 x2 — half the bandwidth. Check the enclosure's PCIe lane count before buying.
2. Underspeccing the eGPU PSU
A 4090 pulls 450W under load. The Razer Core X Chroma's 700W PSU is enough. Cheaper enclosures with 400W PSUs will brown-out during prompt processing and crash mid-inference.
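If you are stuck with a marginal PSU while waiting on a better enclosure, capping the board power limit is a workable stopgap. The 350 W value below is illustrative, not a recommendation for every card:
sudo nvidia-smi -pl 350
# Costs a little prefill throughput but buys headroom against transient power spikes.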
3. Cable matters
Use a certified TB4 cable rated for the full length you need. A mismarked TB3 cable will downgrade your link silently.
4. Not authorizing TB devices on Linux
Linux requires explicit boltctl authorize before TB devices function. The eGPU will appear in lspci only after authorization.
5. Hot-unplugging during inference
Hot-unplugging while a model is loaded, even between generations, will corrupt the driver state. Use nvidia-smi --query-gpu=pstate --format=csv to confirm the GPU is idle before unplugging.
6. Forgetting suspend behavior
TB eGPUs do not survive sleep / hibernate well. Disable suspend on a server-style host or you will rebuild the link manually after every wake.
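On a Linux host that acts as an always-on inference box, the blunt fix is to mask the sleep targets entirely:
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target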
7. Mac users buying NVIDIA cards
Will not work. macOS does not have NVIDIA drivers. Burn the budget on a higher-RAM Mac instead.
Frequently Asked Questions {#faq}
Q: How much performance do I lose with an eGPU for AI?
A: For LLM inference with the model fully resident in VRAM: 2-3% on OCuLink, 15-17% on Thunderbolt 4 / USB4. Far less than the 25-40% loss seen in gaming benchmarks because inference uses PCIe much less heavily.
Q: Is an RTX 4090 eGPU faster than an RTX 3090 internal for AI?
A: Yes. Even with the 15% Thunderbolt tax, the 4090 eGPU at 121 tok/s beats the internal 3090 at 96 tok/s on Llama 3.1 8B. Card class matters more than interconnect.
Q: Which interconnect is best for an eGPU AI rig?
A: OCuLink is the clear winner if you can get it (PCIe 4.0 x4, 2-3% performance loss). Thunderbolt 4 and USB4 are tied at around 15-17% loss. USB-C with DisplayPort alt-mode is not an eGPU interface — do not confuse them.
Q: Does an eGPU work with a Mac for local AI?
A: Not for NVIDIA cards — macOS has no NVIDIA driver. AMD eGPUs work for Metal compute but Ollama does not currently dispatch to them. Mac users should invest in unified memory instead.
Q: Can I run Llama 3.1 70B on an eGPU?
A: With a 24GB GPU like the RTX 4090, you need partial CPU offload for 70B at Q4_K_M. Generation drops to 6-7 tok/s. Two 24GB GPUs (one internal, one eGPU) or one 48GB GPU is the cleaner path — see our distributed inference guide.
Q: How long does it take to load a model over Thunderbolt 4?
A: About 12-13 seconds for an 8B Q4 model versus 2 seconds on internal PCIe. After load, you do not pay this cost again until you swap models.
Q: Will USB-C without Thunderbolt 4 work as an eGPU?
A: No. Standard USB-C does not carry PCIe. Only Thunderbolt 3/4 and USB4 (with the optional Thunderbolt-equivalent feature set) can host an eGPU enclosure.
Q: Does the eGPU need its own power supply?
A: Yes. Every eGPU enclosure has a built-in PSU because GPUs draw far more than the host can provide over the cable. Confirm the PSU is sized for your card (700W+ for an RTX 4090).
Conclusion
The most surprising finding from our benchmarks: eGPUs are genuinely viable for local AI, in a way they never quite were for gaming. Inference has a vastly lighter PCIe footprint than rendering, and the result is performance within 3-17% of internal PCIe across every interconnect we tested.
For homelab builders with a mini PC, an eGPU is the cleanest upgrade path — keep the silent, low-idle mini PC for everything else and dock a powerful GPU when you need to run a serious model. For laptop owners, an eGPU adds a Plan B that did not exist a year ago.
Skip eGPUs on Macs. Skip them if you are buying a brand-new PC anyway (just put the GPU inside). Embrace them when you are extending hardware you already own.
Want our full eGPU benchmark sheet, including chassis-by-chassis test data and cable comparisons? Subscribe to the LocalAimaster newsletter and we will send the spreadsheet.