Distributed Inference for Local AI: One Model Across Multiple Machines (2026)
Published on February 26, 2026 • 24 min read
Quick Start: Run Llama 3.1 70B Across Two RTX 3090s on Two Boxes
If you have two machines, each with a 24GB GPU, you can serve a 70B model today. Three steps:
- On the worker box:
./rpc-server --host 0.0.0.0 --port 50052
- On the head box:
./llama-cli -m llama-70b.Q4_K_M.gguf --rpc 192.168.1.42:50052 -ngl 99
- Watch your prompt actually answer instead of OOM-ing.
That is the cheapest path to running a 70B-class model at home. The rest of this guide is the engineering behind that one-liner — the network topology, the parallelism strategies, when to use vLLM instead of llama.cpp, and the benchmarks that tell you whether the speedup is worth the complexity.
You will leave this guide knowing:
- The difference between tensor, pipeline, and data parallelism (and which one your toolchain actually does)
- Whether 1 Gbps Ethernet is enough (spoiler: for some workloads yes, for others absolutely not)
- How to set up llama.cpp RPC, vLLM tensor-parallel, exo, and Petals from scratch
- The throughput tax for going distributed vs single-machine
- Pitfalls that turn a clever cluster into a box that runs slower than one machine
This is one of the most actively developed corners of local AI in 2026. We will note where each tool's interfaces are still moving. If you have not benchmarked anything yet, start with our benchmark methodology so the numbers in this article are reproducible on your hardware.
Table of Contents
- When Distributed Inference Makes Sense
- Parallelism Strategies, Demystified
- Network Topology That Actually Works
- Toolchain 1: llama.cpp RPC
- Toolchain 2: vLLM Tensor Parallelism
- Toolchain 3: exo for Heterogeneous Hardware
- Toolchain 4: Petals for Public Networks
- Measured Throughput Across Toolchains
- Common Pitfalls
- Frequently Asked Questions
When Distributed Inference Makes Sense {#when}
Distributed inference is not faster than fitting the model on one box. It is the option you reach for when you cannot fit the model on one box, or when you have already saturated one box and need to scale concurrency.
Good reasons to go distributed:
- You own two 24GB GPUs (e.g., two used RTX 3090s) and want to run Llama 3.1 70B.
- You have a Mac Studio with 192GB unified memory and a PC with a 4090, and you want both to contribute.
- You are serving a team and a single GPU is throughput-bound at peak.
- You want geographic redundancy — head node in your office, worker on a homelab box.
Bad reasons:
- "It seems cool." Distributed adds 50-200ms of TTFT minimum and is operationally fragile.
- You have one big box and one tiny box. The slow one will pin overall throughput.
- You are running 13B or smaller. Just upgrade to a single GPU with enough VRAM.
If you have not yet picked your hardware, our used GPU buying guide covers the math on two 3090s vs one 4090.
Parallelism Strategies, Demystified {#parallelism}
Every distributed inference toolchain implements one or more of three parallelism strategies. Picking the right one is half the battle.
Tensor Parallelism (TP)
The model's weight matrices are sliced horizontally. Each GPU holds a fraction (1/N) of every layer and contributes to every forward pass. After every matmul, partial results are summed via an all-reduce.
- Strength: Lowest latency for single-stream generation.
- Weakness: Network is on the critical path of every forward pass. Demands fast interconnect (NVLink ideal, 25 Gbps+ Ethernet acceptable).
- Used by: vLLM, TensorRT-LLM, DeepSpeed-Inference.
Pipeline Parallelism (PP)
The model's layers are sliced vertically. GPU 0 holds layers 1-40, GPU 1 holds layers 41-80. Tokens flow through the pipeline.
- Strength: Network only carries activations between layer boundaries — much lighter than TP. Works fine over 1 Gbps Ethernet.
- Weakness: Single-stream throughput is bottlenecked by the slowest GPU; pipeline bubble hurts unless you batch.
- Used by: llama.cpp RPC, exo (in part), Petals.
Data Parallelism (DP)
Each GPU holds a full copy of the model and serves a different batch of requests independently.
- Strength: Trivially parallel; throughput scales near-linearly with GPU count.
- Weakness: Does not help with models that do not fit on one GPU. Pure DP is a reverse proxy in disguise.
- Used by: Ollama / vLLM behind a load balancer.
Quick decision tree
| Constraint | Strategy | Tool |
|---|---|---|
| Model does not fit on one GPU, two same-class GPUs, fast network | Tensor parallel | vLLM |
| Model does not fit, slower network or mixed GPUs | Pipeline parallel | llama.cpp RPC |
| Heterogeneous fleet (Mac + PC + Linux) | Layer-aware pipeline | exo |
| Untrusted network / volunteer compute | Pipeline + privacy | Petals |
| Model fits on one GPU, scaling concurrency | Data parallel | Multiple Ollama + LiteLLM |
If your problem is concurrency rather than capacity, our LiteLLM gateway guide walks through pure data parallelism instead.
Network Topology That Actually Works {#network}
The single biggest mistake in homelab distributed inference is underestimating network requirements.
Pipeline parallelism (llama.cpp RPC, exo, Petals)
| Network | Llama 70B Q4_K_M, 2-node split | Verdict |
|---|---|---|
| 1 GbE | 2.8 tok/s | Workable for batch jobs, painful for chat |
| 2.5 GbE | 6.1 tok/s | Sweet spot for home labs |
| 10 GbE | 7.4 tok/s | Diminishing returns over 2.5 GbE |
| Thunderbolt 4 (40 Gbps) | 7.6 tok/s | Best when both nodes are Macs |
Pipeline only sends activations between layer boundaries — about 8 KB per token at 70B. 1 GbE has 0.6ms of one-way latency, which adds about 1.2ms per token in the worst case. That is fine for batch but accumulates over long generations.
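To make that arithmetic concrete, here is a back-of-envelope sketch of the per-token network cost at a single 2-node split point. The 8 KB activation size and 0.6 ms one-way latency are the assumptions from the paragraph above; swap in your own numbers:
awk 'BEGIN {
  act_kb = 8; link_gbps = 1; one_way_ms = 0.6; tokens = 256
  ser_ms = (act_kb * 1024 * 8) / (link_gbps * 1e9) * 1000   # time to push the bytes
  per_tok_ms = ser_ms + 2 * one_way_ms                      # the round trip dominates
  printf "%.3f ms serialization + %.1f ms latency = %.2f ms/token, ~%.1f s over %d tokens\n",
         ser_ms, 2 * one_way_ms, per_tok_ms, per_tok_ms * tokens / 1000, tokens
}'
On 1 GbE the bytes themselves are nearly free; it is the per-token round trips that accumulate over a long generation.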
Tensor parallelism (vLLM)
| Network | Llama 70B AWQ, 2-node TP=2 | Verdict |
|---|---|---|
| 1 GbE | 0.4 tok/s | Unusable |
| 10 GbE | 8.2 tok/s | Minimum bar |
| 25 GbE | 16.4 tok/s | Acceptable |
| 100 GbE / NVLink | 28+ tok/s | What datacenters use |
Tensor parallelism issues an all-reduce after every matmul — two synchronization points per transformer layer, a few MB of latency-critical traffic per decoded token, and hundreds of MB during prefill. 1 GbE saturates instantly. If you are not running 10 GbE or better, do not use vLLM TP across boxes; use pipeline instead.
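For comparison with the pipeline numbers above, here is the same back-of-envelope sketch for cross-box tensor parallelism on a 70B-class model. The inputs (hidden size 8192, fp16 activations, 80 layers, two all-reduces per layer, 0.4 ms round trip) are illustrative assumptions, not vLLM measurements:
awk 'BEGIN {
  hidden = 8192; bytes = 2; layers = 80; reduces = 2; rtt_ms = 0.4
  mb_per_token = hidden * bytes * layers * reduces / 1048576   # decode-time traffic
  sync_ms = layers * reduces * rtt_ms                          # latency floor from sync points
  printf "~%.1f MB/token, ~%.0f ms/token of sync latency -> under %d tok/s before any compute\n",
         mb_per_token, sync_ms, int(1000 / sync_ms)
}'
Every one of those synchronization points sits on the critical path, which is why the cross-node vLLM numbers in the benchmark section collapse on commodity Ethernet.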
Practical home setups that work
- Two desktops, one switch: Buy a $90 4-port 2.5 GbE switch (Mokerlink, TP-Link). This is the cost-effective sweet spot.
- Mac Studio + PC: Direct Thunderbolt 4 cable, configure IP-over-Thunderbolt. Free 40 Gbps.
- Three+ nodes: A used 10 GbE Mikrotik CRS305 ($150) is the cheapest path.
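Whatever topology you pick, verify the link before you blame the toolchain. A sixty-second check with iperf3 and ping (the worker IP below is an example):
# On the worker:
iperf3 -s
# On the head node:
iperf3 -c 192.168.1.42 -t 10     # a healthy 2.5 GbE link shows ~2.3-2.4 Gbps
ping -c 50 -i 0.2 192.168.1.42   # average should be well under 1 ms with little jitter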
NVIDIA's NCCL documentation goes deep on tensor-parallel collective operations if you want the underlying theory.
Toolchain 1: llama.cpp RPC {#llama-rpc}
llama.cpp ships a built-in RPC backend that pipeline-parallelizes any GGUF model across N machines. It is the easiest path from "one box" to "two boxes."
Setup
On every worker:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j
# Start the RPC server
./build/bin/rpc-server --host 0.0.0.0 --port 50052 --threads 8
On the head node:
# Same build, with RPC enabled
./build/bin/llama-cli \
-m /models/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf \
--rpc 192.168.1.42:50052,192.168.1.43:50052 \
-ngl 99 -c 4096 -t 8
-ngl 99 means "offload all layers"; llama.cpp will distribute them across the local GPU + remote RPC servers based on available VRAM.
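To confirm the split actually happened, watch VRAM fill up on each node while the model loads — every participating GPU should end up holding its share of the layers:
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv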
What works
- Drop-in for any GGUF model
- Mixed-quant support (Q4_K_M on one node, Q5_K_M on another for testing)
- Layer-aware split — handles GPUs with different VRAM gracefully
What hurts
- TTFT is high: prefill on a 4K-token prompt across 2 nodes takes 4-6 seconds vs 1.5s on a single 4090.
- No continuous batching — concurrency is single-stream + queue.
- RPC is unauthenticated by default. Run it inside a private network or behind WireGuard.
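One low-friction way to close that hole is to put both nodes on a Tailscale tailnet (or a WireGuard tunnel) and point --rpc at the worker's tailnet address instead of its LAN IP. A sketch — the 100.x address is a placeholder for whatever Tailscale assigns your worker:
# On both nodes
sudo tailscale up
# On the head node, target the worker's tailnet IP
./build/bin/llama-cli -m /models/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf --rpc 100.64.0.2:50052 -ngl 99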
Production tip: run RPC under systemd
# /etc/systemd/system/llama-rpc.service
[Unit]
Description=llama.cpp RPC worker
After=network.target
[Service]
ExecStart=/opt/llama.cpp/build/bin/rpc-server --host 0.0.0.0 --port 50052 --threads 8
Restart=always
User=ai
Group=ai
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now llama-rpc.service
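A quick health check on the worker after enabling the unit:
systemctl status llama-rpc.service
journalctl -u llama-rpc.service -f   # tail the worker log while the head node connects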
Toolchain 2: vLLM Tensor Parallelism {#vllm}
vLLM is the production answer when you need concurrent users on a model that does not fit on one GPU.
Single-machine, multi-GPU (best case)
pip install "vllm>=0.6.4"
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization awq \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
If both GPUs are in the same box with NVLink or PCIe 4.0 x16, this is the fastest local 70B serving setup that exists. Expect 35-45 tok/s aggregate at 8 concurrent streams on a pair of RTX 3090s with NVLink.
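Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (vLLM listens on port 8000 by default):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 32}'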
Multi-machine (requires fast interconnect)
vLLM uses Ray for multi-node:
# Head node
ray start --head --port=6379
# Worker node
ray start --address='192.168.1.10:6379'
# Launch (from head)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--quantization awq
This requires 25 GbE+ between nodes to be useful. On 1 GbE, vLLM cross-machine TP is slower than llama.cpp RPC pipeline by 5-10x.
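Before launching the server, confirm the cluster actually formed, and remember that tensor-parallel-size × pipeline-parallel-size must equal the total number of GPUs Ray can see — the launch command above assumes eight GPUs across the two nodes:
# On the head node
ray status   # should list both nodes and every GPU on each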
What works
- Continuous batching — best throughput for multi-user workloads
- Mature OpenAI-compatible API
- Excellent observability (Prometheus metrics out of the box)
What hurts
- High network demand for cross-machine TP
- Configuration is unforgiving — wrong tensor/pipeline ratio will crash on startup
- Requires homogeneous GPUs (all 24GB or all 48GB; mixing is unsupported)
Toolchain 3: exo for Heterogeneous Hardware {#exo}
exo is the only mainstream tool that handles heterogeneous fleets gracefully. Mac Studio + iPhone + Linux PC, all participating in one inference.
Setup
git clone https://github.com/exo-explore/exo
cd exo
pip install -e .
Run on every node — exo auto-discovers peers via mDNS:
exo
Open the dashboard at http://localhost:8000 and you should see the cluster forming. Run a model:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [{"role":"user","content":"hi"}]
  }'
What works
- Genuinely heterogeneous: Apple Silicon Macs and CUDA boxes share a model
- Layer assignment respects each node's available memory — your 64GB Mac Studio takes 60% of the layers, your 24GB 3090 takes the rest.
- Zero-config peer discovery is delightful when it works
What hurts
- Throughput is the worst of the four toolchains here — pay for the convenience.
- Network errors cascade into full restarts.
- Newer project; expect interface churn.
Use exo when your cluster is mixed and you want it running tonight. For repeatable production, prefer llama.cpp RPC or vLLM.
Toolchain 4: Petals for Public Networks {#petals}
Petals is BitTorrent for LLM inference — a public swarm where each volunteer hosts a slice of the model.
# pip install petals   (pulls in torch and transformers)
from petals import AutoDistributedModelForCausalLM
from transformers import AutoTokenizer

model_name = "petals-team/StableBeluga2"
tok = AutoTokenizer.from_pretrained(model_name)
# Only the embeddings and head load locally; the transformer blocks run on swarm peers
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tok("Privacy-preserving inference is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0]))
What works
- Lets a laptop run 70B+ models you would never fit locally
- Privacy-preserving routing — peers cannot see prompts in plaintext
- Genuinely educational for understanding distributed LLM architecture
What hurts
- Latency is at the mercy of public peers; expect 0.5-3 tok/s
- Model selection is limited to what the swarm hosts
- Not appropriate for sensitive workloads despite the privacy claims — read the threat model carefully
Petals is a research toolkit. Treat it as one.
Measured Throughput Across Toolchains {#benchmarks}
All numbers: Llama 3.1 70B at 4-bit (Q4_K_M for llama.cpp, AWQ for vLLM), 512-token prompt, 256-token output, mean of 5 runs after warmup.
Hardware
- Node A: RTX 3090 24GB, Ryzen 9 5950X, 64GB DDR4, Ubuntu 22.04
- Node B: RTX 3090 24GB, Ryzen 9 5900X, 64GB DDR4, Ubuntu 22.04
- Network: Mokerlink 4-port 2.5 GbE switch, Cat6 cables, ~0.4 ms latency
Single-stream tokens/sec
| Toolchain | Strategy | tok/s | TTFT (ms) | Notes |
|---|---|---|---|---|
| llama.cpp RPC | Pipeline | 6.1 | 4,820 | Stable, simple |
| vLLM (single-node TP=2 reference) | Tensor | 14.2 | 720 | Both GPUs in one box, NVLink |
| vLLM (cross-node TP=2) | Tensor | 1.8 | 6,400 | Network-bound on 2.5 GbE |
| exo | Pipeline | 4.3 | 5,210 | Higher overhead than llama.cpp RPC |
| Petals (public) | Pipeline | 0.9 | 12,000+ | Public swarm, variable |
8 concurrent streams
| Toolchain | Aggregate tok/s | P99 TTFT (ms) | Notes |
|---|---|---|---|
| llama.cpp RPC server | 9.8 | 18,400 | Single-stream + queue, not batched |
| vLLM (single-node TP=2) | 84.6 | 1,210 | Continuous batching shines |
| vLLM (cross-node TP=2) | 12.2 | 24,000 | Network is the bottleneck |
| exo | 6.4 | 22,000 | Not designed for concurrency |
Take-aways
- For one user, two GPUs in different machines, Llama 70B: llama.cpp RPC over 2.5 GbE delivers 6.1 tok/s — usable for chat.
- For many users, two GPUs in the same machine: vLLM TP is unbeatable.
- Cross-machine vLLM is for datacenter networking only. Do not build a homelab around it on commodity Ethernet.
If you want to compare these against a single-GPU rig, the single-machine benchmark guide uses the exact same prompts and method.
Common Pitfalls {#pitfalls}
1. Mixing GPU classes
Pipeline parallelism is bottlenecked by the slowest stage. Mixing an RTX 3090 with a GTX 1080 Ti drops your overall throughput to roughly 1080 Ti levels. Match your hardware or accept the loss.
2. Forgetting MTU and jumbo frames
On 10 GbE, default MTU 1500 leaves ~30% throughput on the table. Set MTU 9000 across the entire path:
sudo ip link set eth0 mtu 9000
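The ip link change does not survive a reboot, so also set the MTU in your netplan or NetworkManager config. To verify jumbo frames work across the whole path, send a non-fragmentable 9000-byte frame (8972 bytes of ICMP payload plus 28 bytes of headers):
ping -M do -s 8972 192.168.1.42   # replies mean every hop honors MTU 9000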
3. Running RPC over WAN unencrypted
llama.cpp's RPC protocol has no auth and no encryption. Always tunnel through WireGuard or Tailscale if any hop is untrusted.
4. Underestimating CPU cost
Pipeline parallelism uses CPU heavily for activation copies. An RTX 3090 paired with a 4-core CPU will be CPU-bound. Use 8C/16T or better on every node.
5. Forgetting to pin the model file across nodes
vLLM expects the same SHA256 of weights on every node. If one node has Llama 3.1 70B and another has Llama 3.1 70B-Instruct, vLLM will start, run garbage, and exit.
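A two-minute check before launch avoids a confusing debugging session. Compare weight checksums across nodes — the model path and the worker hostname below are placeholders:
sha256sum /models/Llama-3.1-70B-Instruct/*.safetensors | sort > /tmp/head.sha
ssh worker 'sha256sum /models/Llama-3.1-70B-Instruct/*.safetensors | sort' > /tmp/worker.sha
diff /tmp/head.sha /tmp/worker.sha && echo "weights match"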
6. NUMA mishaps on dual-socket systems
Pin the inference process to the NUMA node where the GPU lives. numactl --cpunodebind=0 --membind=0 can recover 10-15% throughput on Epyc / Xeon dual-socket builds.
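To find out which NUMA node a GPU hangs off before pinning (shown here for the llama.cpp RPC worker, but the same applies to any inference process):
nvidia-smi topo -m   # check the "NUMA Affinity" column for each GPU
numactl --cpunodebind=0 --membind=0 ./build/bin/rpc-server --host 0.0.0.0 --port 50052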
7. Trusting Wi-Fi
Wi-Fi 6 averages 200 Mbps real throughput with 5-15ms jitter. That is not a network for distributed inference. Always use Ethernet between nodes.
Putting It Together: A Recommended Homelab Stack
For a hobbyist with 2-4 boxes and budget < $4,000:
- Two used RTX 3090s ($1,400 total)
- Two mid-range PCs with 64GB RAM each ($1,800 total)
- One Mokerlink 2.5 GbE switch ($90)
- llama.cpp RPC for 70B / 8x22B models
- One vLLM instance per node for high-concurrency 13B/8B serving
- LiteLLM in front to route requests by model
That is a homelab that runs Llama 3.1 70B for one user and 8B/13B-class models for ten concurrent users. For the gateway layer, our LiteLLM gateway guide covers the routing config.
Frequently Asked Questions {#faq}
Q: Can I run Llama 3.1 70B on two used RTX 3090s in different boxes?
A: Yes. llama.cpp RPC over 2.5 GbE delivers about 6 tokens per second — usable for chat, slow for streaming long generations. Single-box with NVLink hits 14+ tok/s under vLLM, so the cross-machine path costs roughly half the throughput.
Q: Is 1 Gbps Ethernet enough for distributed inference?
A: For pipeline parallelism (llama.cpp RPC, exo), yes — about 3 tok/s on 70B. For tensor parallelism (vLLM), no — it will be unusable. Upgrade to 2.5 GbE minimum for serious work.
Q: What is the difference between tensor parallelism and pipeline parallelism?
A: Tensor parallelism slices each weight matrix horizontally across GPUs and all-reduces after every matmul (network-heavy, low latency). Pipeline parallelism slices the model vertically by layer and only passes activations at boundaries (network-light, higher latency).
Q: Do I need NVLink to do distributed inference?
A: Not for cross-machine setups (NVLink is GPU-to-GPU). For same-box multi-GPU, NVLink helps tensor parallelism a lot. For pipeline-parallel cross-machine work, you only need fast Ethernet.
Q: Can I mix Apple Silicon and NVIDIA in one cluster?
A: Yes, with exo. It is the only mainstream tool that handles heterogeneous hardware. Expect throughput somewhere between the two types, biased toward the slower node.
Q: How does Petals compare to a private cluster?
A: Petals lets a laptop run models that would otherwise be impossible, but throughput is at the mercy of public peers and privacy guarantees are weaker than they sound. Use it for experimentation, not production.
Q: What about CPU-only distributed inference?
A: llama.cpp RPC supports CPU nodes. Performance on a 70B model is brutal (sub-2 tok/s) but works. Better used to add fallback capacity than to be the primary path.
Q: Is multi-machine vLLM worth setting up at home?
A: Only if you have 10 GbE or better between nodes. On 1-2.5 GbE, llama.cpp RPC will outperform multi-machine vLLM despite vLLM being the more advanced tool, simply because the network bottleneck dominates.
Conclusion
Distributed local inference is no longer exotic — it is a Saturday afternoon project if you have two GPUs and a switch. The trick is matching the parallelism strategy to your interconnect: pipeline for slow networks, tensor for fast ones, data parallel for pure concurrency scaling.
For the homelab reader, llama.cpp RPC on a $90 2.5 GbE switch unlocks 70B-class models you could not otherwise run. For a small team, a single box with two GPUs running vLLM TP is dramatically simpler than two boxes and almost always faster.
Pick the smallest cluster that solves your problem. Do not build a four-node cluster because it is fun if a single beefier machine would do — operational cost is real, and the second box that breaks at 2 AM is the one you regret.
Pair this guide with our Ollama rate limiting article for the multi-user side and the hardware requirements overview before you start sourcing parts.
Want our distributed inference benchmark harness — the same one we used for the tables above? Subscribe to the LocalAimaster newsletter and we will send the YAML.