Distributed Inference for Local AI: One Model Across Multiple Machines (2026)
Published on February 26, 2026 • 24 min read
Quick Start: Run Llama 3.1 70B Across Two RTX 3090s on Two Boxes
If you have two machines, each with a 24GB GPU, you can serve a 70B model today. Three steps:
- On the worker box:
./rpc-server --host 0.0.0.0 --port 50052
- On the head box:
./llama-cli -m llama-70b.Q4_K_M.gguf --rpc 192.168.1.42:50052 -ngl 99
- Watch your prompt actually answer instead of OOM-ing.
That is the cheapest path to running a 70B-class model at home. The rest of this guide is the engineering behind that one-liner — the network topology, the parallelism strategies, when to use vLLM instead of llama.cpp, and the benchmarks that tell you whether the speedup is worth the complexity.
You will leave this guide knowing:
- The difference between tensor, pipeline, and data parallelism (and which one your toolchain actually does)
- Whether 1 Gbps Ethernet is enough (spoiler: for some workloads yes, for others absolutely not)
- How to set up llama.cpp RPC, vLLM tensor-parallel, exo, and Petals from scratch
- The throughput tax for going distributed vs single-machine
- Pitfalls that turn a clever cluster into a box that runs slower than one machine
This is one of the most actively developed corners of local AI in 2026. We will note where each tool's interfaces are still moving. If you have not benchmarked anything yet, start with our benchmark methodology so the numbers in this article are reproducible on your hardware.
Table of Contents
- When Distributed Inference Makes Sense
- Parallelism Strategies, Demystified
- Network Topology That Actually Works
- Toolchain 1: llama.cpp RPC
- Toolchain 2: vLLM Tensor Parallelism
- Toolchain 3: exo for Heterogeneous Hardware
- Toolchain 4: Petals for Public Networks
- Measured Throughput Across Toolchains
- Common Pitfalls
- Frequently Asked Questions
When Distributed Inference Makes Sense {#when}
Distributed inference is not faster than fitting the model on one box. It is the option you reach for when you cannot fit the model on one box, or when you have already saturated one box and need to scale concurrency.
Good reasons to go distributed:
- You own two 24GB GPUs (e.g., two used RTX 3090s) and want to run Llama 3.1 70B.
- You have a Mac Studio with 192GB unified memory and a PC with a 4090, and you want both to contribute.
- You are serving a team and a single GPU is throughput-bound at peak.
- You want geographic redundancy — head node in your office, worker on a homelab box.
Bad reasons:
- "It seems cool." Distributed adds 50-200ms of TTFT minimum and is operationally fragile.
- You have one big box and one tiny box. The slow one will pin overall throughput.
- You are running 13B or smaller. Just upgrade to a single GPU with enough VRAM.
If you have not yet picked your hardware, our used GPU buying guide covers the math on two 3090s vs one 4090.
Parallelism Strategies, Demystified {#parallelism}
Every distributed inference toolchain implements one or more of three parallelism strategies. Picking the right one is half the battle.
Tensor Parallelism (TP)
The model's weight matrices are sliced horizontally. Each GPU holds a fraction (1/N) of every layer and contributes to every forward pass. After every matmul, partial results are summed via an all-reduce.
- Strength: Lowest latency for single-stream generation.
- Weakness: Network is on the critical path of every forward pass. Demands fast interconnect (NVLink ideal, 25 Gbps+ Ethernet acceptable).
- Used by: vLLM, TensorRT-LLM, DeepSpeed-Inference.
Pipeline Parallelism (PP)
The model's layers are sliced vertically. GPU 0 holds layers 1-40, GPU 1 holds layers 41-80. Tokens flow through the pipeline.
- Strength: Network only carries activations between layer boundaries — much lighter than TP. Works fine over 1 Gbps Ethernet.
- Weakness: Single-stream throughput is bottlenecked by the slowest GPU; pipeline bubble hurts unless you batch.
- Used by: llama.cpp RPC, exo (in part), Petals.
Data Parallelism (DP)
Each GPU holds a full copy of the model and serves a different batch of requests independently.
- Strength: Trivially parallel; throughput scales near-linearly with GPU count.
- Weakness: Does not help with models that do not fit on one GPU. Pure DP is a reverse proxy in disguise.
- Used by: Ollama / vLLM behind a load balancer.
Quick decision tree
| Constraint | Strategy | Tool |
|---|---|---|
| Model does not fit on one GPU, two same-class GPUs, fast network | Tensor parallel | vLLM |
| Model does not fit, slower network or mixed GPUs | Pipeline parallel | llama.cpp RPC |
| Heterogeneous fleet (Mac + PC + Linux) | Layer-aware pipeline | exo |
| Untrusted network / volunteer compute | Pipeline + privacy | Petals |
| Model fits on one GPU, scaling concurrency | Data parallel | Multiple Ollama + LiteLLM |
If your problem is concurrency rather than capacity, our LiteLLM gateway guide walks through pure data parallelism instead.
Network Topology That Actually Works {#network}
The single biggest mistake in homelab distributed inference is underestimating network requirements.
Pipeline parallelism (llama.cpp RPC, exo, Petals)
| Network | Llama 70B Q4_K_M, 2-node split | Verdict |
|---|---|---|
| 1 GbE | 2.8 tok/s | Workable for batch jobs, painful for chat |
| 2.5 GbE | 6.1 tok/s | Sweet spot for home labs |
| 10 GbE | 7.4 tok/s | Diminishing returns over 2.5 GbE |
| Thunderbolt 4 (40 Gbps) | 7.6 tok/s | Best when both nodes are Macs |
Pipeline only sends activations between layer boundaries — about 8 KB per token at 70B. 1 GbE has 0.6ms of one-way latency, which adds about 1.2ms per token in the worst case. That is fine for batch but accumulates over long generations.
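To make that arithmetic concrete, here is a back-of-envelope sketch of the per-token network cost at a single 2-node split point. The 8 KB activation size and 0.6 ms one-way latency are the assumptions from the paragraph above; swap in your own numbers:
awk 'BEGIN {
  act_kb = 8; link_gbps = 1; one_way_ms = 0.6; tokens = 256
  ser_ms = (act_kb * 1024 * 8) / (link_gbps * 1e9) * 1000   # time to push the bytes
  per_tok_ms = ser_ms + 2 * one_way_ms                      # the round trip dominates
  printf "%.3f ms serialization + %.1f ms latency = %.2f ms/token, ~%.1f s over %d tokens\n",
         ser_ms, 2 * one_way_ms, per_tok_ms, per_tok_ms * tokens / 1000, tokens
}'
On 1 GbE the bytes themselves are nearly free; it is the per-token round trips that accumulate over a long generation.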
Tensor parallelism (vLLM)
| Network | Llama 70B AWQ, 2-node TP=2 | Verdict |
|---|---|---|
| 1 GbE | 0.4 tok/s | Unusable |
| 10 GbE | 8.2 tok/s | Minimum bar |
| 25 GbE | 16.4 tok/s | Acceptable |
| 100 GbE / NVLink | 28+ tok/s | What datacenters use |
Tensor parallelism issues an all-reduce after every matmul — two synchronization points per transformer layer, a few MB of latency-critical traffic per decoded token, and hundreds of MB during prefill. 1 GbE saturates instantly. If you are not running 10 GbE or better, do not use vLLM TP across boxes; use pipeline instead.
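For comparison with the pipeline numbers above, here is the same back-of-envelope sketch for cross-box tensor parallelism on a 70B-class model. The inputs (hidden size 8192, fp16 activations, 80 layers, two all-reduces per layer, 0.4 ms round trip) are illustrative assumptions, not vLLM measurements:
awk 'BEGIN {
  hidden = 8192; bytes = 2; layers = 80; reduces = 2; rtt_ms = 0.4
  mb_per_token = hidden * bytes * layers * reduces / 1048576   # decode-time traffic
  sync_ms = layers * reduces * rtt_ms                          # latency floor from sync points
  printf "~%.1f MB/token, ~%.0f ms/token of sync latency -> under %d tok/s before any compute\n",
         mb_per_token, sync_ms, int(1000 / sync_ms)
}'
Every one of those synchronization points sits on the critical path, which is why the cross-node vLLM numbers in the benchmark section collapse on commodity Ethernet.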
Practical home setups that work
- Two desktops, one switch: Buy a $90 4-port 2.5 GbE switch (Mokerlink, TP-Link). This is the cost-effective sweet spot.
- Mac Studio + PC: Direct Thunderbolt 4 cable, configure IP-over-Thunderbolt. Free 40 Gbps.
- Three+ nodes: A used 10 GbE Mikrotik CRS305 ($150) is the cheapest path.
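Whatever topology you pick, verify the link before you blame the toolchain. A sixty-second check with iperf3 and ping (the worker IP below is an example):
# On the worker:
iperf3 -s
# On the head node:
iperf3 -c 192.168.1.42 -t 10     # a healthy 2.5 GbE link shows ~2.3-2.4 Gbps
ping -c 50 -i 0.2 192.168.1.42   # average should be well under 1 ms with little jitter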
NVIDIA's NCCL documentation goes deep on tensor-parallel collective operations if you want the underlying theory.
Toolchain 1: llama.cpp RPC {#llama-rpc}
llama.cpp ships a built-in RPC backend that pipeline-parallelizes any GGUF model across N machines. It is the easiest path from "one box" to "two boxes."
Setup
On every worker:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j
# Start the RPC server
./build/bin/rpc-server --host 0.0.0.0 --port 50052 --threads 8
On the head node:
# Same build, with RPC enabled
./build/bin/llama-cli \
-m /models/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf \
--rpc 192.168.1.42:50052,192.168.1.43:50052 \
-ngl 99 -c 4096 -t 8
-ngl 99 means "offload all layers"; llama.cpp will distribute them across the local GPU + remote RPC servers based on available VRAM.
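To confirm the split actually happened, watch VRAM fill up on each node while the model loads — every participating GPU should end up holding its share of the layers:
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv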
What works
- Drop-in for any GGUF model
- Mixed-quant support (Q4_K_M on one node, Q5_K_M on another for testing)
- Layer-aware split — handles GPUs with different VRAM gracefully
What hurts
- TTFT is high: prefill on a 4K-token prompt across 2 nodes takes 4-6 seconds vs 1.5s on a single 4090.
- No continuous batching — concurrency is single-stream + queue.
- RPC is unauthenticated by default. Run it inside a private network or behind WireGuard.
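One low-friction way to close that hole is to put both nodes on a Tailscale tailnet (or a WireGuard tunnel) and point --rpc at the worker's tailnet address instead of its LAN IP. A sketch — the 100.x address is a placeholder for whatever Tailscale assigns your worker:
# On both nodes
sudo tailscale up
# On the head node, target the worker's tailnet IP
./build/bin/llama-cli -m /models/Meta-Llama-3.1-70B-Instruct.Q4_K_M.gguf --rpc 100.64.0.2:50052 -ngl 99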
Production tip: run RPC under systemd
# /etc/systemd/system/llama-rpc.service
[Unit]
Description=llama.cpp RPC worker
After=network.target
[Service]
ExecStart=/opt/llama.cpp/build/bin/rpc-server --host 0.0.0.0 --port 50052 --threads 8
Restart=always
User=ai
Group=ai
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now llama-rpc.service
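A quick health check on the worker after enabling the unit:
systemctl status llama-rpc.service
journalctl -u llama-rpc.service -f   # tail the worker log while the head node connects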
Toolchain 2: vLLM Tensor Parallelism {#vllm}
vLLM is the production answer when you need concurrent users on a model that does not fit on one GPU.
Single-machine, multi-GPU (best case)
pip install "vllm>=0.6.4"
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--quantization awq \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
If both GPUs are in the same box with NVLink or PCIe 4.0 x16, this is the fastest local 70B serving setup that exists. Expect 35-45 tok/s aggregate at 8 concurrent streams on a pair of RTX 3090s with NVLink.
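Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (vLLM listens on port 8000 by default):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 32}'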
Multi-machine (requires fast interconnect)
vLLM uses Ray for multi-node:
# Head node
ray start --head --port=6379
# Worker node
ray start --address='192.168.1.10:6379'
# Launch (from head)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2 \
--quantization awq
This requires 25 GbE+ between nodes to be useful. On 1 GbE, vLLM cross-machine TP is slower than llama.cpp RPC pipeline by 5-10x.
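Before launching the server, confirm the cluster actually formed, and remember that tensor-parallel-size × pipeline-parallel-size must equal the total number of GPUs Ray can see — the launch command above assumes eight GPUs across the two nodes:
# On the head node
ray status   # should list both nodes and every GPU on each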
What works
- Continuous batching — best throughput for multi-user workloads
- Mature OpenAI-compatible API
- Excellent observability (Prometheus metrics out of the box)
What hurts
- High network demand for cross-machine TP
- Configuration is unforgiving — wrong tensor/pipeline ratio will crash on startup
- Requires homogeneous GPUs (all 24GB or all 48GB; mixing is unsupported)
Toolchain 3: exo for Heterogeneous Hardware {#exo}
exo is the only mainstream tool that handles heterogeneous fleets gracefully. Mac Studio + iPhone + Linux PC, all participating in one inference.
Setup
git clone https://github.com/exo-explore/exo
cd exo
pip install -e .
Run on every node — exo auto-discovers peers via mDNS:
exo
Open the dashboard at http://localhost:8000 and you should see the cluster forming. Run a model:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b",
    "messages": [{"role":"user","content":"hi"}]
  }'
What works
- Genuinely heterogeneous: Apple Silicon Macs and CUDA boxes share a model
- Layer assignment respects each node's available memory — your 64GB Mac Studio takes 60% of the layers, your 24GB 3090 takes the rest.
- Zero-config peer discovery is delightful when it works
What hurts
- Throughput is the worst of the four toolchains here — pay for the convenience.
- Network errors cascade into full restarts.
- Newer project; expect interface churn.
Use exo when your cluster is mixed and you want it running tonight. For repeatable production, prefer llama.cpp RPC or vLLM.
Toolchain 4: Petals for Public Networks {#petals}
Petals is BitTorrent for LLM inference — a public swarm where each volunteer hosts a slice of the model.
# pip install petals   (pulls in torch and transformers)
from petals import AutoDistributedModelForCausalLM
from transformers import AutoTokenizer

model_name = "petals-team/StableBeluga2"
tok = AutoTokenizer.from_pretrained(model_name)
# Only the embeddings and head load locally; the transformer blocks run on swarm peers
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tok("Privacy-preserving inference is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0]))
What works
- Lets a laptop run 70B+ models you would never fit locally
- Privacy-preserving routing — peers cannot see prompts in plaintext
- Genuinely educational for understanding distributed LLM architecture
What hurts
- Latency is at the mercy of public peers; expect 0.5-3 tok/s
- Model selection is limited to what the swarm hosts
- Not appropriate for sensitive workloads despite the privacy claims — read the threat model carefully
Petals is a research toolkit. Treat it as one.
Measured Throughput Across Toolchains {#benchmarks}
All numbers: Llama 3.1 70B at 4-bit (Q4_K_M for llama.cpp, AWQ for vLLM), 512-token prompt, 256-token output, mean of 5 runs after warmup.
Hardware
- Node A: RTX 3090 24GB, Ryzen 9 5950X, 64GB DDR4, Ubuntu 22.04
- Node B: RTX 3090 24GB, Ryzen 9 5900X, 64GB DDR4, Ubuntu 22.04
- Network: Mokerlink 4-port 2.5 GbE switch, Cat6 cables, ~0.4 ms latency
Single-stream tokens/sec
| Toolchain | Strategy | tok/s | TTFT (ms) | Notes |
|---|---|---|---|---|
| llama.cpp RPC | Pipeline | 6.1 | 4,820 | Stable, simple |
| vLLM (single-node TP=2 reference) | Tensor | 14.2 | 720 | Both GPUs in one box, NVLink |
| vLLM (cross-node TP=2) | Tensor | 1.8 | 6,400 | Network-bound on 2.5 GbE |
| exo | Pipeline | 4.3 | 5,210 | Higher overhead than llama.cpp RPC |
| Petals (public) | Pipeline | 0.9 | 12,000+ | Public swarm, variable |
8 concurrent streams
| Toolchain | Aggregate tok/s | P99 TTFT (ms) | Notes |
|---|---|---|---|
| llama.cpp RPC server | 9.8 | 18,400 | Single-stream + queue, not batched |
| vLLM (single-node TP=2) | 84.6 | 1,210 | Continuous batching shines |
| vLLM (cross-node TP=2) | 12.2 | 24,000 | Network is the bottleneck |
| exo | 6.4 | 22,000 | Not designed for concurrency |
Take-aways
- For one user, two GPUs in different machines, Llama 70B: llama.cpp RPC over 2.5 GbE delivers 6.1 tok/s — usable for chat.
- For many users, two GPUs in the same machine: vLLM TP is unbeatable.
- Cross-machine vLLM is for datacenter networking only. Do not build a homelab around it on commodity Ethernet.
If you want to compare these against a single-GPU rig, the single-machine benchmark guide uses the exact same prompts and method.
Common Pitfalls {#pitfalls}
1. Mixing GPU classes
Pipeline parallelism is bottlenecked by the slowest stage. Mixing an RTX 3090 with a GTX 1080 Ti drops your overall throughput to roughly 1080 Ti levels. Match your hardware or accept the loss.
2. Forgetting MTU and jumbo frames
On 10 GbE, default MTU 1500 leaves ~30% throughput on the table. Set MTU 9000 across the entire path:
sudo ip link set eth0 mtu 9000
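The ip link change does not survive a reboot, so also set the MTU in your netplan or NetworkManager config. To verify jumbo frames work across the whole path, send a non-fragmentable 9000-byte frame (8972 bytes of ICMP payload plus 28 bytes of headers):
ping -M do -s 8972 192.168.1.42   # replies mean every hop honors MTU 9000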
3. Running RPC over WAN unencrypted
llama.cpp's RPC protocol has no auth and no encryption. Always tunnel through WireGuard or Tailscale if any hop is untrusted.
4. Underestimating CPU cost
Pipeline parallelism uses CPU heavily for activation copies. An RTX 3090 paired with a 4-core CPU will be CPU-bound. Use 8C/16T or better on every node.
5. Forgetting to pin the model file across nodes
vLLM expects the same SHA256 of weights on every node. If one node has Llama 3.1 70B and another has Llama 3.1 70B-Instruct, vLLM will start, run garbage, and exit.
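A two-minute check before launch avoids a confusing debugging session. Compare weight checksums across nodes — the model path and the worker hostname below are placeholders:
sha256sum /models/Llama-3.1-70B-Instruct/*.safetensors | sort > /tmp/head.sha
ssh worker 'sha256sum /models/Llama-3.1-70B-Instruct/*.safetensors | sort' > /tmp/worker.sha
diff /tmp/head.sha /tmp/worker.sha && echo "weights match"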
6. NUMA mishaps on dual-socket systems
Pin the inference process to the NUMA node where the GPU lives. numactl --cpunodebind=0 --membind=0 can recover 10-15% throughput on Epyc / Xeon dual-socket builds.
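To find out which NUMA node a GPU hangs off before pinning (shown here for the llama.cpp RPC worker, but the same applies to any inference process):
nvidia-smi topo -m   # check the "NUMA Affinity" column for each GPU
numactl --cpunodebind=0 --membind=0 ./build/bin/rpc-server --host 0.0.0.0 --port 50052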
7. Trusting Wi-Fi
Wi-Fi 6 averages 200 Mbps real throughput with 5-15ms jitter. That is not a network for distributed inference. Always use Ethernet between nodes.
Putting It Together: A Recommended Homelab Stack
For a hobbyist with 2-4 boxes and budget < $4,000:
- Two used RTX 3090s ($1,400 total)
- Two mid-range PCs with 64GB RAM each ($1,800 total)
- One Mokerlink 2.5 GbE switch ($90)
- llama.cpp RPC for 70B / 8x22B models
- One vLLM instance per node for high-concurrency 13B/8B serving
- LiteLLM in front to route requests by model
That is a homelab that runs Llama 3.1 70B for one user and 8B/13B-class models for ten concurrent users. For the gateway layer, our LiteLLM gateway guide covers the routing config.
Frequently Asked Questions {#faq}
Q: Can I run Llama 3.1 70B on two used RTX 3090s in different boxes?
A: Yes. llama.cpp RPC over 2.5 GbE delivers about 6 tokens per second — usable for chat, slow for streaming long generations. Single-box with NVLink hits 14+ tok/s under vLLM, so the cross-machine path costs roughly half the throughput.
Q: Is 1 Gbps Ethernet enough for distributed inference?
A: For pipeline parallelism (llama.cpp RPC, exo), yes — about 3 tok/s on 70B. For tensor parallelism (vLLM), no — it will be unusable. Upgrade to 2.5 GbE minimum for serious work.
Q: What is the difference between tensor parallelism and pipeline parallelism?
A: Tensor parallelism slices each weight matrix horizontally across GPUs and all-reduces after every matmul (network-heavy, low latency). Pipeline parallelism slices the model vertically by layer and only passes activations at boundaries (network-light, higher latency).
Q: Do I need NVLink to do distributed inference?
A: Not for cross-machine setups (NVLink is GPU-to-GPU). For same-box multi-GPU, NVLink helps tensor parallelism a lot. For pipeline-parallel cross-machine work, you only need fast Ethernet.
Q: Can I mix Apple Silicon and NVIDIA in one cluster?
A: Yes, with exo. It is the only mainstream tool that handles heterogeneous hardware. Expect throughput somewhere between the two types, biased toward the slower node.
Q: How does Petals compare to a private cluster?
A: Petals lets a laptop run models that would otherwise be impossible, but throughput is at the mercy of public peers and privacy guarantees are weaker than they sound. Use it for experimentation, not production.
Q: What about CPU-only distributed inference?
A: llama.cpp RPC supports CPU nodes. Performance on a 70B model is brutal (sub-2 tok/s) but works. Better used to add fallback capacity than to be the primary path.
Q: Is multi-machine vLLM worth setting up at home?
A: Only if you have 10 GbE or better between nodes. On 1-2.5 GbE, llama.cpp RPC will outperform multi-machine vLLM despite vLLM being the more advanced tool, simply because the network bottleneck dominates.
Conclusion
Distributed local inference is no longer exotic — it is a Saturday afternoon project if you have two GPUs and a switch. The trick is matching the parallelism strategy to your interconnect: pipeline for slow networks, tensor for fast ones, data parallel for pure concurrency scaling.
For the homelab reader, llama.cpp RPC on a $90 2.5 GbE switch unlocks 70B-class models you could not otherwise run. For a small team, a single box with two GPUs running vLLM TP is dramatically simpler than two boxes and almost always faster.
Pick the smallest cluster that solves your problem. Do not build a four-node cluster because it is fun if a single beefier machine would do — operational cost is real, and the second box that breaks at 2 AM is the one you regret.
Pair this guide with our Ollama rate limiting article for the multi-user side and the hardware requirements overview before you start sourcing parts.
Want our distributed inference benchmark harness — the same one we used for the tables above? Subscribe to the LocalAimaster newsletter and we will send the YAML.