
Multi-GPU Ollama: Split 70B Models Across Two or More GPUs

April 23, 2026
19 min read
LocalAimaster Research Team



The day you graduate from a single 4090 to two GPUs is the day Ollama stops feeling like a toy. A 70B model that crawled at 2 tokens per second from system RAM suddenly streams at 18-22 tok/s. A 120B mixture-of-experts model that refused to load now boots in 30 seconds. Two GPUs are not just "more VRAM" — they unlock an entirely different class of model.

But multi-GPU Ollama is also where every tutorial on the internet falls silent. The official docs cover ollama run. They do not cover OLLAMA_SCHED_SPREAD, mixed-VRAM scheduling, NVLink myths, or what to do when nvidia-smi shows GPU 1 sitting cold while GPU 0 melts at 92 degrees. This guide does, with benchmarks from real rigs.

Quick Start: If you have two NVIDIA GPUs already detected by nvidia-smi, set OLLAMA_SCHED_SPREAD=1 and CUDA_VISIBLE_DEVICES=0,1 in the environment of the Ollama server process (for a systemd install, via the service override shown in the installation section), restart Ollama, then run ollama run llama3.3:70b. The model will split itself across both cards automatically.
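If you run ollama serve by hand instead of as a systemd service, the whole quick start fits in a handful of shell lines (a minimal sketch; the service-based setup is covered step by step below):

# Run Ollama in the foreground with both GPUs visible and even layer spreading
export CUDA_VISIBLE_DEVICES=0,1
export OLLAMA_SCHED_SPREAD=1
ollama serve &

# In another terminal, load a model large enough to need both cards
ollama run llama3.3:70b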


Table of Contents

  1. Why Multi-GPU for Ollama
  2. Hardware Compatibility Matrix
  3. Installation and Driver Setup
  4. Layer-Splitting Configuration
  5. Benchmarks: Real Numbers
  6. NVLink, PCIe, and Bus Reality
  7. Mixed VRAM and Asymmetric Rigs
  8. Troubleshooting Pitfalls
  9. Frequently Asked Questions

Why Multi-GPU for Ollama {#why-multi-gpu}

Three legitimate reasons exist to run Ollama across multiple GPUs:

1. Models that exceed single-card VRAM. Llama 3.3 70B Q4_K_M needs 42-44 GB. Mixtral 8x22B Q4_K_M needs 80+ GB. DeepSeek V3 671B in any usable quantization needs hundreds of GB. None of these fit on a 24GB card. Splitting them across two or four GPUs is the only way to keep the entire model resident in fast memory.

2. KV cache headroom for long context. Even when the weights fit, the KV cache for 32K or 128K context can double VRAM requirements. A 70B Q4_K_M model on a 48GB A6000 starts swapping into CPU RAM around 16K context. Two cards give you breathing room without surrendering throughput to system memory. A rough way to size that cache is sketched just after this list.

3. Concurrent workloads. With OLLAMA_NUM_PARALLEL=2 and two GPUs, you can serve two different sessions simultaneously — one on each card — instead of queuing requests behind a single inference stream. This is the cheapest path to a multi-user team server.
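To put a number on the KV cache from point 2: for a grouped-query-attention model it grows linearly with context, at 2 (K and V) x layers x KV heads x head dimension x bytes per element per token. Plugging in the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and an FP16 cache gives a useful estimate; the actual llama.cpp allocation differs a little because the cache can be quantized and padded:

# Back-of-envelope KV cache size for a Llama 3 70B-class model at 32K context
awk 'BEGIN {
  layers=80; kv_heads=8; head_dim=128; bytes=2; ctx=32768;
  gib = 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 / 1024;
  printf "KV cache at %d tokens: about %.1f GiB\n", ctx, gib
}'
# Prints roughly 10 GiB, on top of the 42-44 GB of Q4_K_M weights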

What multi-GPU does not give you with stock Ollama is doubled tokens per second on a single request. That requires tensor parallelism, which Ollama does not implement. We will return to that limitation throughout this guide.

For background on quantization choices that affect how much VRAM each layer consumes, see our AWQ vs GPTQ vs GGUF comparison.


Hardware Compatibility Matrix {#hardware-matrix}

| Combination | Total VRAM | 70B Q4_K_M | 70B Q8 | 120B MoE | Notes |
|---|---|---|---|---|---|
| 2x RTX 3090 | 48 GB | Yes (8K ctx) | No | No | Cheapest viable 70B rig |
| 2x RTX 4090 | 48 GB | Yes (8K ctx) | No | No | ~30% faster than 3090 pair |
| 1x 4090 + 1x 3090 | 48 GB | Yes (8K ctx) | No | No | Layers auto-split by VRAM |
| 2x RTX A6000 | 96 GB | Yes (32K ctx) | Yes | Tight | Workstation grade |
| 4x RTX 3090 | 96 GB | Yes (32K ctx) | Yes | Yes (Q3) | Power: 1400 W under load |
| 2x A100 80GB | 160 GB | Yes (128K) | Yes | Yes | Datacenter, NVLink helps |
| 4x A6000 Ada | 192 GB | Yes (128K) | Yes | Yes | Best price/VRAM workstation |
| 1x H100 80GB | 80 GB | Yes (32K) | Yes | Yes | Single-card class above 70B |

PCIe lane requirements. Each GPU should be on at least PCIe 4.0 x8. Consumer motherboards with two x16 slots usually drop both to x8 when populated. That is fine. PCIe 3.0 x4 (typical of M.2-converted slots) is not fine — llama.cpp will use the slow GPU but throughput collapses 30-40 percent.
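You can check the link each card actually negotiated without opening the case; nvidia-smi exposes the current PCIe generation and lane width as query fields:

# Confirm negotiated PCIe generation and lane width per GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
# Note: cards downshift the link at idle to save power, so check while a model is generating.
# Gen 3 at width 4 under load means a choked slot or riser.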

Power and cooling. Two RTX 3090s at full load draw 700 W. Two 4090s under heavy inference draw 800-900 W. Add CPU and overhead, and a 1200 W PSU is a minimum, 1500 W comfortable. If your case has front intake and rear exhaust only, the second GPU will heat-soak the first. Plan for either blower-style cards (rare on consumer 3090/4090) or a mining-style open frame with PCIe risers.
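Before committing to a PSU, it is worth watching what the cards actually draw and how hot they run during a long generation. GPU-reported draw sits below wall draw, but it is enough to catch a card that is throttling:

# Log per-GPU power draw, power limit, and temperature every 2 seconds (Ctrl+C to stop)
nvidia-smi --query-gpu=index,power.draw,power.limit,temperature.gpu --format=csv -l 2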

For full power and budget planning before you commit, walk through the budget local AI machine guide.


Installation and Driver Setup {#installation}

Step 1: Install or Update NVIDIA Drivers

# Verify both GPUs are detected before installing Ollama
nvidia-smi

# Expected output should list every GPU with a non-zero memory total
# Driver version 550.54.14 or newer is required for Ada Lovelace + Ampere mixing

If only one GPU appears, the second card is either not seated, missing power, or in a disabled BIOS slot. Fix that before going further. Multi-GPU Ollama cannot work around hardware that the driver cannot see.

Step 2: Install the Latest Ollama

# Linux: install or upgrade in place
curl -fsSL https://ollama.com/install.sh | sh

# Verify version (multi-GPU scheduling improvements landed in 0.1.40+)
ollama --version

Versions older than 0.1.40 had a bug where the second GPU was discovered but layers were never assigned to it on systems with asymmetric VRAM. Anything newer is fine.

Step 3: Configure the systemd Service

# Edit the service environment
sudo systemctl edit ollama.service

Add the following:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_HOST=0.0.0.0:11434"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo journalctl -u ollama -f

The journal output should mention both GPUs by name when a model loads. If only one is mentioned, jump to the pitfalls section.
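The exact discovery wording changes between Ollama releases, so grep the journal broadly rather than for a fixed phrase; something along these lines works:

# Scan recent service logs for GPU discovery and layer-offload messages
journalctl -u ollama --since "10 minutes ago" --no-pager | grep -iE "gpu|cuda|offload"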

Step 4: Pull a Model Worth Splitting

# Anything below 30B fits on a single 24GB card and will not split
ollama pull llama3.3:70b
ollama pull mixtral:8x7b

For details on which model deserves the VRAM you are spending, the best Ollama models shortlist is the right next stop.


Layer-Splitting Configuration {#layer-splitting}

Ollama exposes three primary controls for multi-GPU placement. Used together they cover 95 percent of real-world rigs.

Control 1: CUDA_VISIBLE_DEVICES

Limits which GPUs Ollama can see. The order in this string is the order of preference for layer placement.

# Use both GPUs, GPU 0 first
export CUDA_VISIBLE_DEVICES=0,1

# Use only the second GPU (force a different model to a separate card)
export CUDA_VISIBLE_DEVICES=1

# Use four GPUs but skip GPU 2 (e.g., used for display)
export CUDA_VISIBLE_DEVICES=0,1,3

Control 2: OLLAMA_SCHED_SPREAD

Default behaviour packs the first GPU before touching the second. Setting this to 1 forces an even split based on relative free VRAM.

# Even split — best for symmetric rigs (2x 3090, 2x 4090)
export OLLAMA_SCHED_SPREAD=1

# Greedy fill — best for asymmetric rigs (4090 + 3090)
unset OLLAMA_SCHED_SPREAD

On asymmetric rigs the greedy default is usually correct: pack the faster card first so it owns the early transformer layers (which run on every token), then spill remainder onto the slower card.

Control 3: num_gpu (Per-Model Override)

Inside a Modelfile or via the API you can pin a specific layer count on the primary GPU. This is the escape hatch when automatic placement gets it wrong.

# Modelfile fragment for a 70B model with manual placement
# Pins 50 layers to GPU 0; the rest spill to the next visible GPU
FROM llama3.3:70b
PARAMETER num_gpu 50
PARAMETER num_ctx 8192
PARAMETER num_batch 512

# Then build and run it from the shell
ollama create llama3-multi -f Modelfile
ollama run llama3-multi

A Llama 3.3 70B model has 80 transformer layers. Pinning 50 to GPU 0 places the remaining 30 on GPU 1, which roughly mirrors the VRAM ratio of a 24GB + 16GB pair. Always validate with nvidia-smi dmon while running.
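The same override can also be sent per request through the HTTP API instead of being baked into a Modelfile, since num_gpu is accepted in the request options:

# Per-request layer pinning via the API (prompt is just an example)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Summarise the tradeoffs of pipeline parallelism in two sentences.",
  "stream": false,
  "options": { "num_gpu": 50, "num_ctx": 8192 }
}'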

Confirming the Split

# Watch utilisation while a prompt is processing
nvidia-smi dmon -s u -c 30

# Or one-shot check during inference
watch -n 0.5 nvidia-smi

A correctly split model shows both GPUs swinging between 60 and 95 percent utilisation in lockstep. If GPU 1 sits at 0 percent while GPU 0 saturates, the split did not happen.
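Ollama can also report placement directly. ollama ps lists each loaded model and whether it is resident on GPU, on CPU, or split between the two:

# List loaded models and their CPU/GPU placement
ollama ps
# The PROCESSOR column should read "100% GPU"; any CPU percentage means layers spilled to system RAM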


Benchmarks: Real Numbers {#benchmarks}

All numbers below are from a 256-token completion at temperature 0, prompt size 512 tokens, Llama 3.3 70B Q4_K_M unless noted. Power draw is wall-socket measured.
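To reproduce this style of measurement, the generate endpoint returns token counts and durations in nanoseconds, so a single curl piped through jq is enough (the prompt below is a stand-in, not the one used for these tables):

# Measure decode throughput from the API timing fields
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Write a 250-word overview of the history of the printing press.",
  "stream": false,
  "options": { "temperature": 0, "num_predict": 256 }
}' | jq '{tokens: .eval_count, tok_per_sec: (.eval_count / (.eval_duration / 1e9)), prompt_eval_ms: (.prompt_eval_duration / 1e6)}'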

Llama 3.3 70B Q4_K_M (8K context)

| Rig | Tokens/sec | First-token latency | Wall power | VRAM used |
|---|---|---|---|---|
| 1x A6000 (48GB) | 19.4 | 380 ms | 320 W | 46.1 GB |
| 2x RTX 3090 | 17.1 | 410 ms | 690 W | 23.8 + 22.4 GB |
| 2x RTX 4090 | 22.6 | 350 ms | 800 W | 23.6 + 22.6 GB |
| 4090 + 3090 (asym) | 18.3 | 395 ms | 720 W | 23.5 + 22.7 GB |
| 4x RTX 3090 (32K ctx) | 16.4 | 470 ms | 1320 W | ~12 GB each |

Two key takeaways. First, doubling GPUs does not double throughput — pipeline parallelism only adds VRAM, not compute. Second, the 4090 pair beats a single A6000 only narrowly despite costing more, because the inter-GPU latency of pipeline parallelism eats most of the gain.

Mixtral 8x22B Q4_K_M

| Rig | Tokens/sec | Notes |
|---|---|---|
| 2x A6000 (96GB) | 24.8 | Sweet spot for MoE |
| 4x RTX 3090 | 21.1 | Power-hungry alternative |
| 2x RTX 4090 | OOM | Insufficient combined VRAM |

MoE models reward more cards because the router activates different experts each token, and llama.cpp can keep more experts hot when more VRAM is available.

Comparison Against vLLM (Tensor Parallel)

| Framework | Hardware | Llama 3.3 70B Q4 tok/s |
|---|---|---|
| Ollama (pipeline) | 2x RTX 4090 | 22.6 |
| vLLM (tensor parallel) | 2x RTX 4090 | 41.2 |
| TGI (tensor parallel) | 2x RTX 4090 | 38.9 |

If single-stream throughput matters more than developer ergonomics, pair Ollama with LiteLLM for the API surface and run vLLM behind it. The cost is a substantially more complex deployment.
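For scale, tensor parallelism in vLLM is a single flag. A representative launch is sketched below; the model ID is illustrative, and an unquantized 70B checkpoint will not fit on two 24 GB cards, so in practice you would point this at a quantized variant that does:

# Illustrative vLLM launch with tensor parallelism across two GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192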

For a deeper look at these tradeoffs, the official llama.cpp project documents the underlying split modes in detail at the llama.cpp tensor split discussion.


NVLink, PCIe, and Bus Reality {#nvlink}

NVLink is the most over-recommended upgrade in the local AI community. For Ollama specifically, here is what it actually does.

Pipeline parallelism transfers per token: approximately 16 KB to 64 KB of hidden-state activations between cards, depending on hidden dimension and batch. A 70B model has hidden_size 8192, so each token boundary moves roughly 16 KB in FP16.

PCIe 4.0 x16 bandwidth: ~32 GB/s theoretical, ~28 GB/s practical.

NVLink 3.0 (3090 SLI bridge): 112.5 GB/s.

A 30 tok/s inference stream at 16 KB per token-boundary needs about 0.5 MB/s — more than four orders of magnitude below PCIe 4.0 capacity. The actual benchmark gap between PCIe-only and NVLink on 2x 3090 in Ollama is 1-3 percent.
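The arithmetic behind that claim, as a quick sanity check:

# Per-token activation traffic vs practical PCIe 4.0 x16 bandwidth
awk 'BEGIN {
  hidden=8192; bytes=2; tok_per_sec=30;
  traffic_mb = hidden * bytes * tok_per_sec / 1e6;   # about 0.5 MB/s
  pcie_mb = 28000;                                   # about 28 GB/s practical
  printf "Traffic: %.2f MB/s, PCIe headroom: roughly %.0fx\n", traffic_mb, pcie_mb / traffic_mb
}'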

When NVLink does help:

  • Loading the model (one-time cost, not per inference)
  • Tensor-parallel frameworks (vLLM, TGI) that move bulk activations
  • Prefill of very long prompts where activations are larger

When NVLink does not help meaningfully:

  • Standard token-by-token Ollama inference
  • Short prompts followed by long generations
  • Mixed CPU/GPU offload setups

Save the NVLink money for a third GPU or more system RAM.


Mixed VRAM and Asymmetric Rigs {#mixed-vram}

The single most popular real-world combination right now: a new RTX 4090 paired with a used RTX 3090 from the previous build. Here is how to make it work cleanly.

Layout the Layers Manually

A 4090 has 24 GB. A 3090 has 24 GB. They are technically symmetric in capacity but asymmetric in compute. The 4090 is roughly 35-45 percent faster on FP16 inference. So you want more layers on the 4090 even though VRAM matches.

# Modelfile pinning 70 layers to GPU 0 (the 4090)
cat > Modelfile.l3-asym <<'EOF'
FROM llama3.3:70b
PARAMETER num_gpu 70
PARAMETER num_ctx 8192
EOF

ollama create llama3-asym -f Modelfile.l3-asym

The remaining 10 layers land on the 3090, which becomes the bottleneck only briefly per token. Net throughput in our tests was 21.4 tok/s — about 95 percent of a pure 4090 pair.

Mixed 24GB + 16GB (4090 + 4070 Ti)

When VRAM is genuinely asymmetric, stay with the greedy default (OLLAMA_SCHED_SPREAD unset, equivalent to 0) and let the larger card take its share first. Override only if GPU 1 OOMs during model load — then dial num_gpu down by 2-4 layers at a time until it is stable.

Mixing NVIDIA Generations

Ada (4090, 4080) + Ampere (3090, 3080) works. Hopper (H100) + Ada works. Mixing Turing (RTX 2080) with Ampere or newer is risky — older CUDA capability can force fallback kernels that cap throughput. Stick to capability 8.0+ on every card in the rig.
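Recent drivers let you query compute capability directly, which is the fastest way to spot a pre-Ampere straggler in a mixed rig:

# Query CUDA compute capability per GPU (needs a reasonably recent driver)
nvidia-smi --query-gpu=index,name,compute_cap --format=csv
# Ampere reports 8.6, Ada 8.9, Hopper 9.0; Turing reports 7.5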


Troubleshooting Pitfalls {#pitfalls}

Pitfall 1: Second GPU Stays Idle

Symptom: nvidia-smi shows GPU 1 at 0 percent utilisation and 0 MB allocated during inference.

Causes:

  1. The model fits entirely on GPU 0. Confirm with ollama show llama3.3:70b. If size is under 24 GB, expected behaviour.
  2. CUDA_VISIBLE_DEVICES is set to a single device by your shell or another systemd override.
  3. OLLAMA_SCHED_SPREAD is unset on a symmetric rig.

Fix:

# Confirm what Ollama actually sees
sudo systemctl show ollama | grep -E "Environment|CUDA"

# Force the spread
sudo systemctl edit ollama
# add: Environment="OLLAMA_SCHED_SPREAD=1"
sudo systemctl restart ollama

Pitfall 2: OOM on Model Load Despite Combined VRAM Being Sufficient

Symptom: A 70B model with 44 GB weight requirement fails to load on 2x 24 GB cards.

Cause: Each GPU also needs context buffer, KV cache, and CUDA workspace. On a 24 GB card, that overhead is 1.5-2.5 GB. So combined usable VRAM is 42-43 GB, not 48 GB.

Fix: Lower context size or quantization.

# Drop ctx from 16K to 8K
ollama run llama3.3:70b --ctx-size 8192

# Or pull a tighter quantization (Q3_K_M instead of Q4_K_M)
ollama pull llama3.3:70b-instruct-q3_K_M

Pitfall 3: Throughput Worse Than Single GPU

Symptom: A model that fits on one card but is being split anyway runs slower than on the single card.

Cause: You forced a split that should not have happened. Pipeline parallelism adds inter-GPU latency. If the model fits on one card, run it on one card.

Fix: Either remove OLLAMA_SCHED_SPREAD=1, or pin the model entirely with PARAMETER num_gpu 999 in a Modelfile to keep it on GPU 0.
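As a concrete form of that fix, a two-line Modelfile pins everything to the first visible GPU (the llama3-single name is just an example):

# Modelfile that keeps every layer on the primary GPU, no split
FROM llama3.3:70b
PARAMETER num_gpu 999

# Then from the shell:
ollama create llama3-single -f Modelfile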

Pitfall 4: Random Crashes Mid-Generation

Symptom: Ollama segfaults or the GPU drops off the bus during long generations.

Causes (in order of frequency): thermal throttling, PSU undervolt under transient load, driver bug.

Fix:

# Cap power on each card to reduce transient spikes
sudo nvidia-smi -i 0 -pl 320
sudo nvidia-smi -i 1 -pl 320

# Check thermals during a stress run
nvidia-smi dmon -s pucvmt -c 60 > thermals.log

If either GPU hits 85 C+ under load, improve airflow before tuning anything else. Sustained 90 C is the leading cause of "random" multi-GPU failures.

Pitfall 5: Different Results Between Runs

Symptom: Same prompt, same seed, different output across two-GPU and one-GPU runs.

Cause: Floating-point reduction order differs by layer placement. This is expected and harmless for normal usage, but breaks deterministic tests.

Fix: Pin the rig to a single GPU for any test that demands exact reproducibility. Multi-GPU is for production throughput, not regression testing.

For the broader pattern of fixing Ollama issues, the Ollama troubleshooting guide covers single-GPU problems exhaustively.


Final Notes

Multi-GPU Ollama is best understood as VRAM expansion, not compute expansion. The moment you internalise that, every confusing benchmark and every disappointed Reddit thread snaps into place. Two 4090s give you 48 GB of fast memory, modest throughput improvement, and the ability to run a model class that simply does not fit anywhere else on consumer hardware. They do not give you twice the speed.

Build the rig. Set OLLAMA_SCHED_SPREAD=1. Pin the layers if your cards are asymmetric. Watch nvidia-smi dmon until both GPUs swing in unison. Then forget the hardware exists and go build the application that needed a 70B model in the first place.

If single-stream throughput is non-negotiable, the rest of the local AI ecosystem — vLLM, TGI, SGLang — is waiting. But for the team that wants 70B-class reasoning on hardware they own, with a CLI as friendly as ollama run, this is still the cleanest path.
