eGPU for Local AI: External GPU Benchmarks (2026)
Published on March 8, 2026 • 23 min read
Quick Answer: Yes, an eGPU Works for Local AI — With Caveats
Headline numbers from our lab on Llama 3.1 8B Q4_K_M:
- RTX 4090 internal PCIe 5.0 x16: 142 tok/s (baseline)
- RTX 4090 over OCuLink (PCIe 4.0 x4): 138 tok/s — only 3% slower
- RTX 4090 over Thunderbolt 4 (Razer Core X Chroma): 121 tok/s — 15% slower
- RTX 4090 over USB4 (40 Gbps): 119 tok/s — 16% slower
The verdict: once a model is resident in VRAM, inference barely touches the PCIe bus. The interconnect mostly determines how long the initial weight load takes; steady-state tokens/sec land within 3-16% of an internal slot. That is genuinely surprising and worth a 4,000-word article.
Why this article exists:
Nobody benchmarks eGPUs for local AI. eGPU coverage online is dominated by gaming benchmarks (where Thunderbolt loses 25-40% to internal PCIe) and ML training benchmarks (where the loss is 10-30%). LLM inference sits in a different regime — much lighter on PCIe traffic — and the answer is dramatically more positive than gaming numbers would suggest.
This is the first benchmark we know of that measures eGPU LLM inference across all three modern interconnects. If you have a mini PC, a laptop, or a Mac and have been wondering whether an eGPU is worth $400 + GPU, the answer is in this guide.
If you have not chosen the GPU itself, our best GPUs for AI and used GPU buying guide cover that step. This article assumes you have a card and want to know how the external connection performs.
Table of Contents
- Why eGPU Performance for AI Differs From Gaming
- The Three Interconnects: TB4, USB4, OCuLink
- Test Methodology
- Benchmarks: RTX 4090 Across Interconnects
- Benchmarks: RTX 3090 and 4070 Ti Super
- Mac mini + eGPU: A Special Case
- Real-World Recommendations
- Setup Guide: From Box to First Token
- Common Pitfalls
- Frequently Asked Questions
Why eGPU Performance for AI Differs From Gaming {#different}
Gaming benchmarks tank on Thunderbolt because every frame requires sending textures, draw calls, and assets across the PCIe bus. The bus is on the render-loop critical path.
LLM inference is different:
- Weights are loaded once at model startup and stay resident in VRAM. After load, the bus is idle.
- Per-token traffic is tiny. A token is a few hundred KB of activations going to the GPU and a probability vector coming back.
- Compute dominates. The GPU spends 99%+ of its time on matmul, not on bus transfers.
This means an interconnect with a small fraction of the bandwidth (Thunderbolt 4 at ~22 Gbps usable vs PCIe 5.0 x16 at roughly 500 Gbps) loses only 10-20% of inference performance, not 90%+. Pre-fill (prompt processing) takes a slightly bigger hit because activation tensors are larger, but generation is almost untouched.
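A quick back-of-envelope makes this concrete. The figures below are our assumptions, not measurements: Llama 3.1 uses a ~128,256-entry vocabulary, and we assume the logits come back as fp16 (fp32 would roughly double the numbers without changing the conclusion).
# Per-token traffic for a fully offloaded Llama 3.1 8B: a few KB of input per
# step, plus the logits vector returned per token (~128,256 entries at fp16).
awk 'BEGIN { kb = 128256 * 2 / 1024; printf "%.0f KB/token, %.0f MB/s at 142 tok/s\n", kb, kb * 142 / 1024 }'
# ~250 KB/token and ~35 MB/s, which is a percent or two of TB4's ~2.75 GB/s usable link.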
For the underlying bandwidth specs, see Intel's Thunderbolt Technology Community for TB4 and the USB-IF USB4 specification for USB4.
The Three Interconnects: TB4, USB4, OCuLink {#interconnects}
| Interconnect | Theoretical | Real-world PCIe | Cost (enclosure) | Host availability |
|---|---|---|---|---|
| Thunderbolt 4 | 40 Gbps | PCIe 3.0 x4 (~22 Gbps) | $300-450 | All TB4 laptops, Mac, some PCs |
| USB4 (40 Gbps) | 40 Gbps | PCIe 3.0 x4 (~22 Gbps) | $300-450 | Ryzen 7040/8040, Intel Core Ultra |
| OCuLink (PCIe 4.0 x4) | 64 Gbps | PCIe 4.0 x4 (~58 Gbps) | $80-140 | Specific mini PCs (Minisforum, GMKtec) |
Thunderbolt 4 is the most universally compatible, supports hot-plug, and works on Mac. The PCIe link inside is 3.0 x4 — 22 Gbps usable after protocol overhead.
USB4 is functionally similar to TB4 for eGPU purposes. Compatibility is more uneven — some hosts implement only USB4 v1 or skip the optional eGPU mode.
OCuLink is the dark horse. It is a direct PCIe 4.0 x4 cable with no Thunderbolt translation overhead. ~2.5x the effective bandwidth of TB4. The catch: only specific mini PCs (and a handful of laptops) expose it.
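Not sure which of these your host actually exposes? On Linux the quickest probe is whether a Thunderbolt / USB4 domain shows up at all; OCuLink ports only appear as ordinary PCIe once a device is plugged in. Controller names vary by vendor, so treat this as a rough check rather than a definitive test:
# Any Thunderbolt / USB4 controller present on this host?
boltctl list
lspci | grep -iE "thunderbolt|usb4"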
Test Methodology {#methodology}
- Models: Llama 3.1 8B Q4_K_M, Llama 3.1 70B Q4_K_M (where it fits)
- Runners: Ollama 0.3.14, llama.cpp llama-bench (commit pinned)
- Prompt: Fixed 512-token prompt, 256-token output, temperature 0, seed 42
- Iterations: 5 runs, first dropped, mean reported with standard deviation
- Cards tested: RTX 4090 24GB (FE), RTX 3090 24GB (Asus TUF), RTX 4070 Ti Super 16GB
- Hosts tested:
- Internal baseline: Ryzen 9 7950X, X670E, RTX 4090 in PCIe 5.0 x16 slot
- TB4: Razer Core X Chroma + 2024 ThinkPad X1 Carbon (Intel Core Ultra 7)
- USB4: Beelink SER8 (Ryzen 7 8845HS) + ADT-Link UT4-D
- OCuLink: Minisforum UM890 Pro + GPD G1 OCuLink dock
- Power: External 850W ATX PSU on all eGPU enclosures
- Methodology: Same as our benchmark playbook
All numbers are reproducible with the YAML harness we ship to newsletter subscribers.
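For reference, a minimal llama-bench invocation matching the prompt and output lengths above looks like the sketch below. The flags are ours and the model filename is a placeholder; the exact pinned command ships with the harness.
# 512-token prompt, 256-token generation, 5 repetitions, all layers on GPU
./llama-bench -m llama-3.1-8b-instruct-q4_k_m.gguf -p 512 -n 256 -r 5 -ngl 99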
Benchmarks: RTX 4090 Across Interconnects {#bench-4090}
Llama 3.1 8B Instruct (Q4_K_M) — single stream
| Connection | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Model load time (s) |
|---|---|---|---|---|
| Internal PCIe 5.0 x16 | 4,140 | 142.0 | 64 | 2.1 |
| OCuLink (PCIe 4.0 x4) | 3,920 | 138.0 | 71 | 4.8 |
| Thunderbolt 4 | 2,810 | 121.0 | 96 | 12.6 |
| USB4 (40 Gbps) | 2,740 | 119.0 | 102 | 13.4 |
Llama 3.1 70B (Q4_K_M) — single stream
The 70B model needs ~40GB VRAM at Q4_K_M, so it does not fit on the RTX 4090's 24GB. We tested with partial offload (40 layers on GPU, rest on CPU):
| Connection | Generation (tok/s) | Notes |
|---|---|---|
| Internal | 7.1 | CPU/GPU split moves the bottleneck to host RAM bandwidth |
| OCuLink | 6.9 | Negligible delta — host RAM dominates |
| Thunderbolt 4 | 6.4 | Activations crossing the bus add up |
| USB4 | 6.2 | Same as TB4 within margin |
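If you want to reproduce the 40-layer split from the table above, the knob is the GPU layer count. A minimal sketch, with placeholder file names (Llama 3.1 70B has 80 transformer layers, so 40 is roughly half):
# llama.cpp: keep 40 layers on the GPU, the rest on CPU
./llama-cli -m llama-3.1-70b-instruct-q4_k_m.gguf -ngl 40 -p "Hello"
# Ollama: the equivalent knob is num_gpu, set in a Modelfile
#   FROM llama3.1:70b-instruct-q4_K_M
#   PARAMETER num_gpu 40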
Concurrency: 8 parallel streams, Llama 3.1 8B
| Connection | Aggregate gen tok/s | P99 TTFT (ms) |
|---|---|---|
| Internal | 412 | 280 |
| OCuLink | 398 | 310 |
| Thunderbolt 4 | 318 | 540 |
| USB4 | 312 | 580 |
Take-aways
- OCuLink is within margin of error of internal PCIe for inference. If your mini PC has OCuLink, you have not given up meaningful performance.
- Thunderbolt 4 / USB4 cost ~15% on small models, ~10% on large. Real, but far less than gaming benchmarks would suggest.
- Concurrency widens the gap. TB4/USB4 lose more under load because more activations cross the bus.
- Model load time is the worst part of TB4/USB4. First-time load of an 8B model takes 12-13s vs 2s internal. Once loaded, the experience is fine.
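That load-time gap is easy to sanity-check against raw link bandwidth. Assuming the 8B Q4_K_M file is roughly 4.9 GB, the bus-only transfer floor works out to:
awk 'BEGIN { gb = 4.9; printf "TB4 %.1f s, OCuLink %.1f s, PCIe 5.0 x16 %.2f s\n", gb/2.75, gb/7.2, gb/63 }'
# ~1.8 s, ~0.7 s, and ~0.08 s respectively. Measured loads (12.6 / 4.8 / 2.1 s) are
# higher because disk reads and buffer setup dominate, but the spread between
# interconnects still tracks link speed.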
The picture changes if you are choosing between an eGPU 4090 and an internal 3090 — see the next section.
Benchmarks: RTX 3090 and 4070 Ti Super {#bench-others}
RTX 3090, Llama 3.1 8B Q4_K_M
| Connection | Generation (tok/s) | Notes |
|---|---|---|
| Internal | 96.0 | Baseline |
| OCuLink | 93.5 | -2.6% |
| TB4 | 81.0 | -15.6% |
| USB4 | 79.5 | -17.2% |
RTX 4070 Ti Super (16GB), Llama 3.1 8B Q4_K_M
| Connection | Generation (tok/s) |
|---|---|
| Internal | 88.0 |
| OCuLink | 86.0 |
| TB4 | 76.0 |
| USB4 | 75.0 |
Cross-card comparison: when does an eGPU 4090 beat an internal 3090?
This is the most useful question for a homelab buyer with a mini PC and a budget. An RTX 4090 over Thunderbolt 4 (121 tok/s) is faster than an internal RTX 3090 (96 tok/s) by 26% on Llama 3.1 8B. So the eGPU tax is more than offset by the better card if your goal is single-stream throughput.
The same logic applies to 70B-class models: a 4090 eGPU + 64GB host RAM partial offload beats a 3090 internal partial offload because the 4090 has higher memory bandwidth for the layers it does hold.
If you are stuck on the card choice itself, the RTX 4060 vs 3060 article covers the budget end and our best GPUs for AI guide handles the upper tier.
Mac mini + eGPU: A Special Case {#mac}
We tested a Mac mini M4 (16GB) with a Razer Core X Chroma + RTX 4090 over Thunderbolt 4. Spoiler: it does not work. macOS has had no NVIDIA driver since Mojave, and Apple Silicon Macs dropped eGPU support entirely, so macOS sees the enclosure as a generic Thunderbolt peripheral and never initializes the card.
What does work on Mac:
- Apple Silicon's unified memory is already a built-in answer to "I want more VRAM." A Mac Studio M2 Ultra with 192GB unified memory beats any single eGPU setup for 70B-class models.
- AMD eGPUs (Radeon Pro / W-series) work for Metal compute on Intel Macs, but Apple Silicon Macs do not support eGPUs at all, and Ollama's Metal backend only targets the built-in Apple Silicon GPU.
Practical advice for Mac users: skip eGPU. Buy more unified memory. Our Mac local AI setup and Apple M4 for AI guides cover the realistic path.
Real-World Recommendations {#recommendations}
Setup A: Beelink SER8 mini PC + eGPU ($1,239 total)
- Mini PC: Beelink SER8 (Ryzen 7 8845HS, 32GB) — $639
- Enclosure: Razer Core X Chroma — $400
- Card: Used RTX 3090 24GB — $700 (covered in our used GPU guide)
- Performance: 81 tok/s on Llama 3.1 8B, ~5 tok/s on Llama 3.1 70B partial offload
- Verdict: Excellent for a single-user shop that already has a mini PC. Quiet, modular, upgradable.
Setup B: Minisforum UM890 Pro + OCuLink GPU dock ($1,599 total)
- Mini PC: Minisforum UM890 Pro (Ryzen 9 8945HS, 32GB) — $789
- Dock: GPD G1 OCuLink ($170) or Minisforum DEG1 ($150)
- Card: RTX 4070 Ti Super 16GB — $660
- Performance: 86 tok/s on Llama 3.1 8B, with internal-PCIe-class behavior
- Verdict: Best value-per-tok/s for someone willing to source OCuLink components.
Setup C: 2024 laptop + Thunderbolt 4 eGPU ($600 + GPU)
- Laptop: Any modern TB4 laptop (you probably already own one)
- Enclosure: Akitio Node Titan or Razer Core X Chroma — $400-450
- Card: Bring whatever fits the budget
- Performance: ~85% of internal-PCIe equivalent
- Verdict: Best path for someone who already owns a capable laptop and wants to add LLM capability without a second machine.
Setup Guide: From Box to First Token {#setup}
Step 1: Hardware sanity check
Before installing software, confirm:
# Linux: confirm Thunderbolt / USB4 sees the GPU
lspci -nn | grep -i nvidia
# Should show: VGA compatible controller [0300]: NVIDIA Corporation ...
# Confirm PCIe link width and speed
sudo lspci -vvv -s 04:00.0 | grep -i "LnkSta:"
# Look for: LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)
# x4 is correct for TB4/USB4/OCuLink. Expect Speed 16GT/s (PCIe 4.0) on OCuLink
# and 8GT/s (PCIe 3.0) on TB4/USB4; the '(downgraded)' note is normal.
# macOS: System Information → Thunderbolt
system_profiler SPThunderboltDataType | grep -A3 "Vendor Name"
Step 2: Install drivers
# Ubuntu 24.04 — install NVIDIA proprietary driver
sudo apt install -y nvidia-driver-550 nvidia-cuda-toolkit
sudo reboot
# Verify
nvidia-smi
# Authorize Thunderbolt device (Linux + boltctl)
boltctl list
boltctl authorize <device-uuid>
boltctl enroll <device-uuid>
Step 3: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M --verbose "Hello"
You should see all layers offload to GPU (offloaded 33/33 layers to GPU in the verbose output) and eval rate north of 80 tok/s on a 4090 eGPU.
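A quick way to confirm the split without reading the verbose log is ollama ps, which reports how much of the loaded model sits on the GPU:
ollama ps
# The PROCESSOR column should read "100% GPU"; a CPU/GPU percentage split means
# layers spilled to system RAM.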
Step 4: Sanity-check the link
# During inference, watch link util
sudo nvidia-smi dmon -s t -c 60
# rxpci / txpci (MB/s) should stay low during generation and spike only during prefill and model load
If PCIe traffic stays pegged near the link's limit during steady-state generation, the eGPU link is the bottleneck. Confirm the negotiated speed and width with the cross-check below.
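You can also ask the driver directly. Run the query while the model is generating, since the link can downtrain to Gen 1 when the GPU is idle:
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
# Expect "4, 4" on OCuLink and "3, 4" on Thunderbolt 4 / USB4.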
Common Pitfalls {#pitfalls}
1. Buying a Thunderbolt 3 enclosure on a Thunderbolt 4 host
TB3 enclosures negotiate PCIe 3.0 x4 max on most chipsets, which is fine, but some pre-2020 enclosures only do PCIe 3.0 x2 — half the bandwidth. Check the enclosure's PCIe lane count before buying.
2. Underspeccing the eGPU PSU
A 4090 pulls 450W under load. The Razer Core X Chroma's 700W PSU is enough. Cheaper enclosures with 400W PSUs will brown-out during prompt processing and crash mid-inference.
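If you are stuck with a marginal PSU while waiting on a better enclosure, capping the board power limit is a workable stopgap. The 350 W value below is illustrative, not a recommendation for every card:
sudo nvidia-smi -pl 350
# Costs a little prefill throughput but buys headroom against transient power spikes.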
3. Cable matters
Use a certified TB4 cable rated for the full length you need. A mismarked TB3 cable will downgrade your link silently.
4. Not authorizing TB devices on Linux
Linux requires explicit boltctl authorize before TB devices function. The eGPU will appear in lspci only after authorization.
5. Hot-unplugging during inference
Hot-unplugging while a model is loaded, even between generations, will corrupt the driver state. Use nvidia-smi --query-gpu=pstate --format=csv to confirm the GPU is idle before unplugging.
6. Forgetting suspend behavior
TB eGPUs do not survive sleep / hibernate well. Disable suspend on a server-style host or you will rebuild the link manually after every wake.
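On a Linux host that acts as an always-on inference box, the blunt fix is to mask the sleep targets entirely:
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target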
7. Mac users buying NVIDIA cards
Will not work. macOS does not have NVIDIA drivers. Burn the budget on a higher-RAM Mac instead.
Frequently Asked Questions {#faq}
Q: How much performance do I lose with an eGPU for AI?
A: For LLM inference with the model fully resident in VRAM: 2-3% on OCuLink, 15-17% on Thunderbolt 4 / USB4. Far less than the 25-40% loss seen in gaming benchmarks because inference uses PCIe much less heavily.
Q: Is an RTX 4090 eGPU faster than an RTX 3090 internal for AI?
A: Yes. Even with the 15% Thunderbolt tax, the 4090 eGPU at 121 tok/s beats the internal 3090 at 96 tok/s on Llama 3.1 8B. Card class matters more than interconnect.
Q: Which interconnect is best for an eGPU AI rig?
A: OCuLink is the clear winner if you can get it (PCIe 4.0 x4, 2-3% performance loss). Thunderbolt 4 and USB4 are tied at around 15-17% loss. USB-C with DisplayPort alt-mode is not an eGPU interface — do not confuse them.
Q: Does an eGPU work with a Mac for local AI?
A: Not for NVIDIA cards — macOS has no NVIDIA driver. AMD eGPUs work for Metal compute but Ollama does not currently dispatch to them. Mac users should invest in unified memory instead.
Q: Can I run Llama 3.1 70B on an eGPU?
A: With a 24GB GPU like the RTX 4090, you need partial CPU offload for 70B at Q4_K_M. Generation drops to 6-7 tok/s. Two 24GB GPUs (one internal, one eGPU) or one 48GB GPU is the cleaner path — see our distributed inference guide.
Q: How long does it take to load a model over Thunderbolt 4?
A: About 12-13 seconds for an 8B Q4 model versus 2 seconds on internal PCIe. After load, you do not pay this cost again until you swap models.
Q: Will USB-C without Thunderbolt 4 work as an eGPU?
A: No. Standard USB-C does not carry PCIe. Only Thunderbolt 3/4 and USB4 (with the optional Thunderbolt-equivalent feature set) can host an eGPU enclosure.
Q: Does the eGPU need its own power supply?
A: Yes. Every eGPU enclosure has a built-in PSU because GPUs draw far more than the host can provide over the cable. Confirm the PSU is sized for your card (700W+ for an RTX 4090).
Conclusion
The most surprising finding from our benchmarks: eGPUs are genuinely viable for local AI, in a way they never quite were for gaming. Inference has a vastly lighter PCIe footprint than rendering, and the result is performance within 3-17% of internal PCIe across every interconnect we tested.
For homelab builders with a mini PC, an eGPU is the cleanest upgrade path — keep the silent, low-idle mini PC for everything else and dock a powerful GPU when you need to run a serious model. For laptop owners, an eGPU adds a Plan B that did not exist a year ago.
Skip eGPUs on Macs. Skip them if you are buying a brand-new PC anyway (just put the GPU inside). Embrace them when you are extending hardware you already own.
Want our full eGPU benchmark sheet, including chassis-by-chassis test data and cable comparisons? Subscribe to the LocalAimaster newsletter and we will send the spreadsheet.