Hardware Benchmarks

eGPU for Local AI: External GPU Benchmarks (2026)

March 8, 2026
23 min read
LocalAimaster Research Team


Quick Answer: Yes, an eGPU Works for Local AI — With Caveats

Headline numbers from our lab on Llama 3.1 8B Q4_K_M:

  • RTX 4090 internal PCIe 5.0 x16: 142 tok/s (baseline)
  • RTX 4090 over OCuLink (PCIe 4.0 x4): 138 tok/s — only 3% slower
  • RTX 4090 over Thunderbolt 4 (Razer Core X Chroma): 121 tok/s — 15% slower
  • RTX 4090 over USB4 (40 Gbps): 119 tok/s — 16% slower

The verdict: once the model is loaded into VRAM, inference barely uses PCIe bandwidth. The interconnect mainly determines how quickly weights move into VRAM at load time; after that, your tokens/sec land within 3-16% of internal. That result is genuinely surprising and worth a 4,000-word article.


Why this article exists:

Nobody benchmarks eGPUs for local AI. eGPU coverage online is dominated by gaming benchmarks (where Thunderbolt loses 25-40% to internal PCIe) and ML training benchmarks (where the loss is 10-30%). LLM inference sits in a different regime — much lighter on PCIe traffic — and the answer is dramatically more positive than gaming numbers would suggest.

This is the first benchmark we know of that measures eGPU LLM inference across all three modern interconnects. If you have a mini PC, a laptop, or a Mac and have been wondering whether an eGPU is worth $400 + GPU, the answer is in this guide.

If you have not chosen the GPU itself, our best GPUs for AI and used GPU buying guide cover that step. This article assumes you have a card and want to know how the external connection performs.

Table of Contents

  1. Why eGPU Performance for AI Differs From Gaming
  2. The Three Interconnects: TB4, USB4, OCuLink
  3. Test Methodology
  4. Benchmarks: RTX 4090 Across Interconnects
  5. Benchmarks: RTX 3090 and 4070 Ti Super
  6. Mac mini + eGPU: A Special Case
  7. Real-World Recommendations
  8. Setup Guide: From Box to First Token
  9. Common Pitfalls
  10. Frequently Asked Questions

Why eGPU Performance for AI Differs From Gaming {#different}

Gaming benchmarks tank on Thunderbolt because every frame requires sending textures, draw calls, and assets across the PCIe bus. The bus is on the render-loop critical path.

LLM inference is different:

  1. Weights are loaded once at model startup and stay resident in VRAM. After load, the bus is idle.
  2. Per-token traffic is tiny. A token is a few hundred KB of activations going to the GPU and a probability vector coming back.
  3. Compute dominates. The GPU spends 99%+ of its time on matmul, not on bus transfers.

This means an interconnect with a small fraction of the bandwidth (Thunderbolt 4 at ~22 Gbps usable vs PCIe 5.0 x16 at roughly 500 Gbps) only loses 10-20% of inference performance, not 80%. Pre-fill (prompt processing) takes a slightly bigger hit because activation tensors are larger, but generation is almost untouched.
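The arithmetic behind that claim can be sanity-checked in a few lines. The ~300 KB per-token figure is an illustrative assumption, not a measurement:

```python
# Back-of-envelope: why generation barely feels the interconnect.
per_token_bytes = 300 * 1024          # ~300 KB of activations per token (assumption)
tb4_usable_gbps = 22                  # TB4 usable PCIe bandwidth (~22 Gbps)
tb4_bytes_per_s = tb4_usable_gbps / 8 * 1e9

bus_time_ms = per_token_bytes / tb4_bytes_per_s * 1e3
token_time_ms = 1e3 / 121             # measured: 121 tok/s on TB4

print(f"bus transfer per token: {bus_time_ms:.3f} ms")
print(f"token budget at 121 tok/s: {token_time_ms:.2f} ms")
print(f"bus share of token time: {bus_time_ms / token_time_ms:.1%}")
```

Even on the slowest interconnect we tested, moving a token's activations takes on the order of a tenth of a millisecond against a token budget of roughly eight milliseconds, so the bus sits idle for the vast majority of each generation step.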

For the underlying bandwidth specs, see Intel's Thunderbolt Technology Community for TB4 and the USB-IF USB4 specification for USB4.


The Three Interconnects: TB4, USB4, OCuLink

| Interconnect | Theoretical | Real-world PCIe | Cost (enclosure) | Host availability |
|---|---|---|---|---|
| Thunderbolt 4 | 40 Gbps | PCIe 3.0 x4 (~22 Gbps) | $300-450 | All TB4 laptops, Mac, some PCs |
| USB4 (40 Gbps) | 40 Gbps | PCIe 3.0 x4 (~22 Gbps) | $300-450 | Ryzen 7040/8040, Intel Core Ultra |
| OCuLink (PCIe 4.0 x4) | 64 Gbps | PCIe 4.0 x4 (~58 Gbps) | $80-140 | Specific mini PCs (Minisforum, GMKtec) |

Thunderbolt 4 is the most universally compatible, supports hot-plug, and works on Mac. The PCIe link inside is 3.0 x4 — 22 Gbps usable after protocol overhead.

USB4 is functionally similar to TB4 for eGPU purposes. Compatibility is more uneven — some hosts implement only USB4 v1 or skip the optional eGPU mode.

OCuLink is the dark horse. It is a direct PCIe 4.0 x4 cable with no Thunderbolt translation overhead. ~2.5x the effective bandwidth of TB4. The catch: only specific mini PCs (and a handful of laptops) expose it.
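To see what these bandwidth differences mean for model loading, here is a rough lower-bound estimate. The 4.9 GB file size for Llama 3.1 8B Q4_K_M is approximate, and real loads also pay disk-read, host-copy, and allocation overhead, which is why the measured load times later in this article are several times larger than these floors:

```python
# Minimum time for model weights to cross the link once (pure transfer).
model_gb = 4.9                                    # ~size of Llama 3.1 8B Q4_K_M (assumption)
links_gbps = {"OCuLink (PCIe 4.0 x4)": 58, "TB4/USB4": 22}

load_floor_s = {name: model_gb * 8 / gbps for name, gbps in links_gbps.items()}
for name, t in load_floor_s.items():
    print(f"{name:>22}: {t:.2f} s minimum transfer time")
```

The takeaway: the link itself accounts for under two seconds of an 8B load even on TB4; the rest of the wall-clock time is overhead shared by all interconnects.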


Test Methodology {#methodology}

  • Models: Llama 3.1 8B Q4_K_M, Llama 3.1 70B Q4_K_M (where it fits)
  • Runners: Ollama 0.3.14, llama.cpp llama-bench (commit pinned)
  • Prompt: Fixed 512-token prompt, 256-token output, temperature 0, seed 42
  • Iterations: 5 runs, first dropped, mean reported with standard deviation
  • Cards tested: RTX 4090 24GB (FE), RTX 3090 24GB (Asus TUF), RTX 4070 Ti Super 16GB
  • Hosts tested:
    • Internal baseline: Ryzen 9 7950X, X670E, RTX 4090 in PCIe 5.0 x16 slot
    • TB4: Razer Core X Chroma + 2024 ThinkPad X1 Carbon (Intel Core Ultra 7)
    • USB4: Beelink SER8 (Ryzen 7 8845HS) + ADT-Link UT4-D
    • OCuLink: Minisforum UM890 Pro + GPD G1 OCuLink dock
  • Power: External 850W ATX PSU on all eGPU enclosures
  • Methodology: Same as our benchmark playbook

All numbers are reproducible with the YAML harness we ship to newsletter subscribers.
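The aggregation step described above can be sketched as follows; the run values here are hypothetical, not output from our harness:

```python
import statistics

# 5 runs, first (warm-up) dropped, mean and sample stdev of the rest.
runs_tok_s = [129.4, 138.2, 137.8, 138.5, 137.5]  # hypothetical llama-bench results
steady = runs_tok_s[1:]                            # drop the warm-up run

mean = statistics.mean(steady)
stdev = statistics.stdev(steady)
print(f"{mean:.1f} ± {stdev:.1f} tok/s")
```

Dropping the first run matters more on eGPUs than internal cards, because the warm-up run absorbs the slow first weight transfer across the external link.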


Benchmarks: RTX 4090 Across Interconnects {#bench-4090}

Llama 3.1 8B Instruct (Q4_K_M) — single stream

| Connection | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Model load time (s) |
|---|---|---|---|---|
| Internal PCIe 5.0 x16 | 4,140 | 142.0 | 64 | 2.1 |
| OCuLink (PCIe 4.0 x4) | 3,920 | 138.0 | 71 | 4.8 |
| Thunderbolt 4 | 2,810 | 121.0 | 96 | 12.6 |
| USB4 (40 Gbps) | 2,740 | 119.0 | 102 | 13.4 |

Llama 3.1 70B (Q4_K_M) — single stream

The 70B model needs ~40GB VRAM at Q4_K_M, so it does not fit on the RTX 4090's 24GB. We tested with partial offload (40 layers on GPU, rest on CPU):

| Connection | Generation (tok/s) | Notes |
|---|---|---|
| Internal | 7.1 | CPU/GPU split shifts more bottleneck onto host RAM |
| OCuLink | 6.9 | Negligible delta; host RAM dominates |
| Thunderbolt 4 | 6.4 | Activations crossing the bus add up |
| USB4 | 6.2 | Same as TB4 within margin |
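The ~40GB figure for 70B at Q4_K_M follows from the quant's average bits per weight (roughly 4.8 bits for Q4_K_M's mixed 4/6-bit blocks; this is an approximation, and KV cache plus runtime buffers add more on top):

```python
# Why 70B Q4_K_M cannot fit in 24 GB of VRAM: weights alone exceed it.
params = 70e9
bits_per_weight = 4.8                 # approximate average for Q4_K_M (assumption)

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB, vs 24 GB on an RTX 4090")
```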

Concurrency: 8 parallel streams, Llama 3.1 8B

| Connection | Aggregate gen tok/s | P99 TTFT (ms) |
|---|---|---|
| Internal | 412 | 280 |
| OCuLink | 398 | 310 |
| Thunderbolt 4 | 318 | 540 |
| USB4 | 312 | 580 |

Take-aways

  1. OCuLink is within margin of error of internal PCIe for inference. If your mini PC has OCuLink, you have not given up meaningful performance.
  2. Thunderbolt 4 / USB4 cost ~15% on small models, ~10% on large. Real, but far less than gaming benchmarks would suggest.
  3. Concurrency widens the gap. TB4/USB4 lose more under load because more activations cross the bus.
  4. Model load time is the worst part of TB4/USB4. First-time load of an 8B model takes 12-13s vs 2s internal. Once loaded, the experience is fine.

The picture changes if you are choosing between an eGPU 4090 and an internal 3090 — see the next section.


Benchmarks: RTX 3090 and 4070 Ti Super {#bench-others}

RTX 3090, Llama 3.1 8B Q4_K_M

| Connection | Generation (tok/s) | Delta vs internal |
|---|---|---|
| Internal | 96.0 | Baseline |
| OCuLink | 93.5 | -2.6% |
| TB4 | 81.0 | -15.6% |
| USB4 | 79.5 | -17.2% |

RTX 4070 Ti Super (16GB), Llama 3.1 8B Q4_K_M

| Connection | Generation (tok/s) |
|---|---|
| Internal | 88.0 |
| OCuLink | 86.0 |
| TB4 | 76.0 |
| USB4 | 75.0 |

Cross-card comparison: when does an eGPU 4090 beat an internal 3090?

This is the most useful question for a homelab buyer with a mini PC and a budget. An RTX 4090 over Thunderbolt 4 (121 tok/s) is faster than an internal RTX 3090 (96 tok/s) by 26% on Llama 3.1 8B. So the eGPU tax is more than offset by the better card if your goal is single-stream throughput.

The same logic applies to 70B-class models: a 4090 eGPU + 64GB host RAM partial offload beats a 3090 internal partial offload because the 4090 has higher memory bandwidth for the layers it does hold.
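The buying rule reduces to simple arithmetic: apply the interconnect tax to the better card and compare against the weaker card running internally. Using the measured numbers above:

```python
# Measured single-stream numbers, Llama 3.1 8B Q4_K_M.
internal_3090 = 96.0
internal_4090 = 142.0
tb4_tax = 0.15                        # ~15% generation loss over Thunderbolt 4

egpu_4090 = internal_4090 * (1 - tb4_tax)
print(f"4090 over TB4: ~{egpu_4090:.0f} tok/s vs internal 3090: {internal_3090:.0f} tok/s")
assert egpu_4090 > internal_3090      # the better card wins despite the eGPU tax
```

The estimate (~121 tok/s) matches the measured TB4 result in the 4090 table, which is a useful sanity check that the ~15% tax generalizes.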

If you are stuck on the card choice itself, the RTX 4060 vs 3060 article covers the budget end and our best GPUs for AI guide handles the upper tier.


Mac mini + eGPU: A Special Case {#mac}

We tested a Mac mini M4 (16GB) with a Razer Core X Chroma + RTX 4090 over Thunderbolt 4. Spoiler: it does not work — Apple removed support for NVIDIA / external CUDA GPUs years ago, and macOS recognizes the device only as a generic PCIe peripheral.

What does work on Mac:

  • Apple Silicon's unified memory is already a built-in answer to "I want more VRAM." A Mac Studio M2 Ultra with 192GB unified memory beats any single eGPU setup for 70B-class models.
  • AMD eGPUs (Radeon Pro / W-series) work for Metal compute on Intel Macs only — Apple Silicon Macs do not support eGPUs at all — and Ollama on macOS does not dispatch to external GPUs in either case.

Practical advice for Mac users: skip eGPU. Buy more unified memory. Our Mac local AI setup and Apple M4 for AI guides cover the realistic path.


Real-World Recommendations {#recommendations}

Setup A: Mini PC + Thunderbolt 4 eGPU (~$1,040 + GPU)

  • Mini PC: Beelink SER8 (Ryzen 7 8845HS, 32GB) — $639
  • Enclosure: Razer Core X Chroma — $400
  • Card: Used RTX 3090 24GB — $700 (covered in our used GPU guide)
  • Performance: 81 tok/s on Llama 3.1 8B, ~5 tok/s on Llama 3.1 70B partial offload
  • Verdict: Excellent for a single-user shop that already has a mini PC. Quiet, modular, upgradable.

Setup B: Mini PC + OCuLink dock (~$940-960 + GPU)

  • Mini PC: Minisforum UM890 Pro (Ryzen 9 8945HS, 32GB) — $789
  • Dock: GPD G1 OCuLink ($170) or Minisforum DEG1 ($150)
  • Card: RTX 4070 Ti Super 16GB — $660
  • Performance: 86 tok/s on Llama 3.1 8B, with internal-PCIe-class behavior
  • Verdict: Best value-per-tok/s for someone willing to source OCuLink components.

Setup C: 2024 laptop + Thunderbolt 4 eGPU ($600 + GPU)

  • Laptop: Any modern TB4 laptop (you probably already own one)
  • Enclosure: Akitio Node Titan or Razer Core X Chroma — $400-450
  • Card: Bring whatever fits the budget
  • Performance: ~85% of internal-PCIe equivalent
  • Verdict: Best path for someone who already owns a capable laptop and wants to add LLM capability without a second machine.
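Using the listed component prices, a rough dollars-per-tok/s comparison of the two fully priced builds (Setup C's total depends on the card you bring):

```python
# (total cost in USD, measured Llama 3.1 8B generation tok/s)
setups = {
    "Setup A (TB4 + used 3090)":    (639 + 400 + 700, 81.0),
    "Setup B (OCuLink + 4070 TiS)": (789 + 170 + 660, 86.0),
}
for name, (cost, tok_s) in setups.items():
    print(f"{name}: ${cost} all-in, ${cost / tok_s:.1f} per tok/s")
```

Setup B comes out cheaper per token of throughput, which is the arithmetic behind the "best value" verdict above.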

Setup Guide: From Box to First Token {#setup}

Step 1: Hardware sanity check

Before installing software, confirm:

# Linux: confirm Thunderbolt / USB4 sees the GPU
lspci -nn | grep -i nvidia
# Should show: VGA compatible controller [0300]: NVIDIA Corporation ...

# Confirm PCIe link width and speed (substitute your device address)
sudo lspci -vvv -s 04:00.0 | grep -i "LnkSta:"
# Look for: LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)
# x4 is correct for TB4/USB4/OCuLink. Speed 16GT/s = PCIe 4.0.

# macOS: System Information → Thunderbolt, or:
system_profiler SPThunderboltDataType | grep -A3 "Vendor Name"

Step 2: Install drivers

# Ubuntu 24.04 — install NVIDIA proprietary driver
sudo apt install -y nvidia-driver-550 nvidia-cuda-toolkit
sudo reboot

# Verify
nvidia-smi
# Authorize Thunderbolt device (Linux + boltctl)
boltctl list
boltctl authorize <device-uuid>
boltctl enroll <device-uuid>

Step 3: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M --verbose "Hello"

You should see all layers offload to GPU (offloaded 33/33 layers to GPU in the verbose output) and eval rate north of 80 tok/s on a 4090 eGPU.

# During inference, watch PCIe traffic
sudo nvidia-smi dmon -s t -c 60
# rxpci/txpci columns (MB/s) should stay modest during generation
# and spike during prefill

If PCIe traffic sits pinned near the link's maximum during steady-state generation, the eGPU link is the bottleneck: confirm you are negotiating PCIe 4.0 x4, not PCIe 3.0 x4 or lower.
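If you script your health checks, the LnkSta line from lspci is easy to parse. The sample strings below are illustrative:

```python
import re

def parse_lnksta(line: str) -> tuple[int, int]:
    """Return (pcie_generation, lane_width) from an lspci LnkSta line."""
    m = re.search(r"Speed ([\d.]+)GT/s.*Width x(\d+)", line)
    speed, width = float(m.group(1)), int(m.group(2))
    # PCIe generations by per-lane transfer rate
    gen = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}[speed]
    return gen, width

# A healthy OCuLink link (hypothetical sample output):
gen, width = parse_lnksta("LnkSta: Speed 16GT/s (downgraded), Width x4 (downgraded)")
print(f"PCIe {gen}.0 x{width}")
```

"(downgraded)" here is expected: it only means the link is narrower or slower than the card's maximum, which is inherent to any x4 eGPU connection.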


Common Pitfalls {#pitfalls}

1. Buying a Thunderbolt 3 enclosure on a Thunderbolt 4 host

TB3 enclosures negotiate PCIe 3.0 x4 max on most chipsets, which is fine, but some pre-2020 enclosures only do PCIe 3.0 x2 — half the bandwidth. Check the enclosure's PCIe lane count before buying.

2. Underspeccing the eGPU PSU

A 4090 pulls 450W under load. The Razer Core X Chroma's 700W PSU is enough. Cheaper enclosures with 400W PSUs will brown-out during prompt processing and crash mid-inference.

3. Cable matters

Use a certified TB4 cable rated for the full length you need. A mismarked TB3 cable will downgrade your link silently.

4. Not authorizing TB devices on Linux

Linux requires explicit boltctl authorize before TB devices function. The eGPU will appear in lspci only after authorization.

5. Hot-unplugging during inference

Hot-unplugging after the model is loaded, even before generation starts, will corrupt the driver state. Use nvidia-smi --query-gpu=pstate --format=csv to confirm the GPU is idle before unplugging.

6. Forgetting suspend behavior

TB eGPUs do not survive sleep / hibernate well. Disable suspend on a server-style host or you will rebuild the link manually after every wake.

7. Mac users buying NVIDIA cards

Will not work. macOS does not have NVIDIA drivers. Burn the budget on a higher-RAM Mac instead.


Frequently Asked Questions {#faq}

Q: How much performance do I lose with an eGPU for AI?

A: For LLM inference with the model fully resident in VRAM: 2-3% on OCuLink, 15-17% on Thunderbolt 4 / USB4. Far less than the 25-40% loss seen in gaming benchmarks because inference uses PCIe much less heavily.

Q: Is an RTX 4090 eGPU faster than an RTX 3090 internal for AI?

A: Yes. Even with the 15% Thunderbolt tax, the 4090 eGPU at 121 tok/s beats the internal 3090 at 96 tok/s on Llama 3.1 8B. Card class matters more than interconnect.

Q: Which interconnect is best for an eGPU AI rig?

A: OCuLink is the clear winner if you can get it (PCIe 4.0 x4, 2-3% performance loss). Thunderbolt 4 and USB4 are tied at around 15-17% loss. USB-C with DisplayPort alt-mode is not an eGPU interface — do not confuse them.

Q: Does an eGPU work with a Mac for local AI?

A: Not for NVIDIA cards — macOS has no NVIDIA driver. AMD eGPUs work for Metal compute but Ollama does not currently dispatch to them. Mac users should invest in unified memory instead.

Q: Can I run Llama 3.1 70B on an eGPU?

A: With a 24GB GPU like the RTX 4090, you need partial CPU offload for 70B at Q4_K_M. Generation drops to 6-7 tok/s. Two 24GB GPUs (one internal, one eGPU) or one 48GB GPU is the cleaner path — see our distributed inference guide.

Q: How long does it take to load a model over Thunderbolt 4?

A: About 12-13 seconds for an 8B Q4 model versus 2 seconds on internal PCIe. After load, you do not pay this cost again until you swap models.

Q: Will USB-C without Thunderbolt 4 work as an eGPU?

A: No. Standard USB-C does not carry PCIe. Only Thunderbolt 3/4 and USB4 (with the optional Thunderbolt-equivalent feature set) can host an eGPU enclosure.

Q: Does the eGPU need its own power supply?

A: Yes. Every eGPU enclosure has a built-in PSU because GPUs draw far more than the host can provide over the cable. Confirm the PSU is sized for your card (700W+ for an RTX 4090).


Conclusion

The most surprising finding from our benchmarks: eGPUs are genuinely viable for local AI, in a way they never quite were for gaming. Inference has a vastly lighter PCIe footprint than rendering, and the result is performance within 3-17% of internal PCIe across every interconnect we tested.

For homelab builders with a mini PC, an eGPU is the cleanest upgrade path — keep the silent, low-idle mini PC for everything else and dock a powerful GPU when you need to run a serious model. For laptop owners, an eGPU adds a Plan B option that did not exist a year ago.

Skip eGPUs on Macs. Skip them if you are buying a brand-new PC anyway (just put the GPU inside). Embrace them when you are extending hardware you already own.


Want our full eGPU benchmark sheet, including chassis-by-chassis test data and cable comparisons? Subscribe to the LocalAimaster newsletter and we will send the spreadsheet.



Published: March 8, 2026 · Last Updated: April 23, 2026

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
