
Multi-GPU Ollama: Split 70B Models Across Two or More GPUs

April 23, 2026
19 min read
LocalAimaster Research Team



The day you graduate from a single 4090 to two GPUs is the day Ollama stops feeling like a toy. A 70B model that crawled at 2 tokens per second from system RAM suddenly streams at 18-22 tok/s. A 120B mixture-of-experts model that refused to load now boots in 30 seconds. Two GPUs are not just "more VRAM" — they unlock an entirely different class of model.

But multi-GPU Ollama is also where every tutorial on the internet falls silent. The official docs cover ollama run. They do not cover OLLAMA_SCHED_SPREAD, mixed-VRAM scheduling, NVLink myths, or what to do when nvidia-smi shows GPU 1 sitting cold while GPU 0 melts at 92 degrees. This guide does, with benchmarks from real rigs.

Quick Start: If you have two NVIDIA GPUs already detected by nvidia-smi, set OLLAMA_SCHED_SPREAD=1 and CUDA_VISIBLE_DEVICES=0,1 in the environment of the Ollama server process (for a systemd install, via the service override shown in the installation section), restart Ollama, then run ollama run llama3.3:70b. The model will split itself across both cards automatically.
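If you run ollama serve by hand instead of as a systemd service, the whole quick start fits in a handful of shell lines (a minimal sketch; the service-based setup is covered step by step below):

# Run Ollama in the foreground with both GPUs visible and even layer spreading
export CUDA_VISIBLE_DEVICES=0,1
export OLLAMA_SCHED_SPREAD=1
ollama serve &

# In another terminal, load a model large enough to need both cards
ollama run llama3.3:70b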


Table of Contents

  1. Why Multi-GPU for Ollama
  2. Hardware Compatibility Matrix
  3. Installation and Driver Setup
  4. Layer-Splitting Configuration
  5. Benchmarks: Real Numbers
  6. NVLink, PCIe, and Bus Reality
  7. Mixed VRAM and Asymmetric Rigs
  8. Troubleshooting Pitfalls
  9. Frequently Asked Questions

Why Multi-GPU for Ollama {#why-multi-gpu}

Three legitimate reasons exist to run Ollama across multiple GPUs:

1. Models that exceed single-card VRAM. Llama 3.3 70B Q4_K_M needs 42-44 GB. Mixtral 8x22B Q4_K_M needs 80+ GB. DeepSeek V3 671B in any usable quantization needs hundreds of GB. None of these fit on a 24GB card. Splitting them across two or four GPUs is the only way to keep the entire model resident in fast memory.

2. KV cache headroom for long context. Even when the weights fit, the KV cache for 32K or 128K context can double VRAM requirements. A 70B Q4_K_M model on a 48GB A6000 starts swapping into CPU RAM around 16K context. Two cards give you breathing room without surrendering throughput to system memory. A rough way to size that cache is sketched just after this list.

3. Concurrent workloads. With OLLAMA_NUM_PARALLEL=2 and two GPUs, you can serve two different sessions simultaneously — one on each card — instead of queuing requests behind a single inference stream. This is the cheapest path to a multi-user team server.
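To put a number on the KV cache from point 2: for a grouped-query-attention model it grows linearly with context, at 2 (K and V) x layers x KV heads x head dimension x bytes per element per token. Plugging in the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and an FP16 cache gives a useful estimate; the actual llama.cpp allocation differs a little because the cache can be quantized and padded:

# Back-of-envelope KV cache size for a Llama 3 70B-class model at 32K context
awk 'BEGIN {
  layers=80; kv_heads=8; head_dim=128; bytes=2; ctx=32768;
  gib = 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 / 1024;
  printf "KV cache at %d tokens: about %.1f GiB\n", ctx, gib
}'
# Prints roughly 10 GiB, on top of the 42-44 GB of Q4_K_M weights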

What multi-GPU does not give you with stock Ollama is doubled tokens per second on a single request. That requires tensor parallelism, which Ollama does not implement. We will return to that limitation throughout this guide.

For background on quantization choices that affect how much VRAM each layer consumes, see our AWQ vs GPTQ vs GGUF comparison.


Hardware Compatibility Matrix {#hardware-matrix}

| Combination | Total VRAM | 70B Q4_K_M | 70B Q8 | 120B MoE | Notes |
|---|---|---|---|---|---|
| 2x RTX 3090 | 48 GB | Yes (8K ctx) | No | No | Cheapest viable 70B rig |
| 2x RTX 4090 | 48 GB | Yes (8K ctx) | No | No | ~30% faster than 3090 pair |
| 1x 4090 + 1x 3090 | 48 GB | Yes (8K ctx) | No | No | Layers auto-split by VRAM |
| 2x RTX A6000 | 96 GB | Yes (32K ctx) | Yes | Tight | Workstation grade |
| 4x RTX 3090 | 96 GB | Yes (32K ctx) | Yes | Yes (Q3) | Power: 1400 W under load |
| 2x A100 80GB | 160 GB | Yes (128K) | Yes | Yes | Datacenter, NVLink helps |
| 4x A6000 Ada | 192 GB | Yes (128K) | Yes | Yes | Best price/VRAM workstation |
| 1x H100 80GB | 80 GB | Yes (32K) | Yes | Yes | Single-card class above 70B |

PCIe lane requirements. Each GPU should be on at least PCIe 4.0 x8. Consumer motherboards with two x16 slots usually drop both to x8 when populated. That is fine. PCIe 3.0 x4 (typical of M.2-converted slots) is not fine — llama.cpp will use the slow GPU but throughput collapses 30-40 percent.
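You can check the link each card actually negotiated without opening the case; nvidia-smi exposes the current PCIe generation and lane width as query fields:

# Confirm negotiated PCIe generation and lane width per GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
# Note: cards downshift the link at idle to save power, so check while a model is generating.
# Gen 3 at width 4 under load means a choked slot or riser.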

Power and cooling. Two RTX 3090s at full load draw 700 W. Two 4090s under heavy inference draw 800-900 W. Add CPU and overhead, and a 1200 W PSU is a minimum, 1500 W comfortable. If your case has front intake and rear exhaust only, the second GPU will heat-soak the first. Plan for either blower-style cards (rare on consumer 3090/4090) or a mining-style open frame with PCIe risers.
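Before committing to a PSU, it is worth watching what the cards actually draw and how hot they run during a long generation. GPU-reported draw sits below wall draw, but it is enough to catch a card that is throttling:

# Log per-GPU power draw, power limit, and temperature every 2 seconds (Ctrl+C to stop)
nvidia-smi --query-gpu=index,power.draw,power.limit,temperature.gpu --format=csv -l 2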

For full power and budget planning before you commit, walk through the budget local AI machine guide.


Installation and Driver Setup {#installation}

Step 1: Install or Update NVIDIA Drivers

# Verify both GPUs are detected before installing Ollama
nvidia-smi

# Expected output should list every GPU with a non-zero memory total
# Driver version 550.54.14 or newer is required for Ada Lovelace + Ampere mixing

If only one GPU appears, the second card is either not seated, missing power, or in a disabled BIOS slot. Fix that before going further. Multi-GPU Ollama cannot work around hardware that the driver cannot see.

Step 2: Install the Latest Ollama

# Linux: install or upgrade in place
curl -fsSL https://ollama.com/install.sh | sh

# Verify version (multi-GPU scheduling improvements landed in 0.1.40+)
ollama --version

Versions older than 0.1.40 had a bug where the second GPU was discovered but layers were never assigned to it on systems with asymmetric VRAM. Anything newer is fine.

Step 3: Configure the systemd Service

# Edit the service environment
sudo systemctl edit ollama.service

Add the following:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_HOST=0.0.0.0:11434"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo journalctl -u ollama -f

The journal output should mention both GPUs by name when a model loads. If only one is mentioned, jump to the pitfalls section.
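The exact discovery wording changes between Ollama releases, so grep the journal broadly rather than for a fixed phrase; something along these lines works:

# Scan recent service logs for GPU discovery and layer-offload messages
journalctl -u ollama --since "10 minutes ago" --no-pager | grep -iE "gpu|cuda|offload"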

Step 4: Pull a Model Worth Splitting

# Anything below 30B fits on a single 24GB card and will not split
ollama pull llama3.3:70b
ollama pull mixtral:8x7b

For details on which model deserves the VRAM you are spending, the best Ollama models shortlist is the right next stop.


Layer-Splitting Configuration {#layer-splitting}

Ollama exposes three primary controls for multi-GPU placement. Used together they cover 95 percent of real-world rigs.

Control 1: CUDA_VISIBLE_DEVICES

Limits which GPUs Ollama can see. The order in this string is the order of preference for layer placement.

# Use both GPUs, GPU 0 first
export CUDA_VISIBLE_DEVICES=0,1

# Use only the second GPU (force a different model to a separate card)
export CUDA_VISIBLE_DEVICES=1

# Use four GPUs but skip GPU 2 (e.g., used for display)
export CUDA_VISIBLE_DEVICES=0,1,3

Control 2: OLLAMA_SCHED_SPREAD

Default behaviour packs the first GPU before touching the second. Setting this to 1 forces an even split based on relative free VRAM.

# Even split — best for symmetric rigs (2x 3090, 2x 4090)
export OLLAMA_SCHED_SPREAD=1

# Greedy fill — best for asymmetric rigs (4090 + 3090)
unset OLLAMA_SCHED_SPREAD

On asymmetric rigs the greedy default is usually correct: pack the faster card first so it owns the early transformer layers (which run on every token), then spill remainder onto the slower card.

Control 3: num_gpu (Per-Model Override)

Inside a Modelfile or via the API you can pin a specific layer count on the primary GPU. This is the escape hatch when automatic placement gets it wrong.

# Modelfile fragment for a 70B model with manual placement
# Pins 50 layers to GPU 0; the rest spill to the next visible GPU
FROM llama3.3:70b
PARAMETER num_gpu 50
PARAMETER num_ctx 8192
PARAMETER num_batch 512

# Then build and run it from the shell
ollama create llama3-multi -f Modelfile
ollama run llama3-multi

A Llama 3.3 70B model has 80 transformer layers. Pinning 50 to GPU 0 places the remaining 30 on GPU 1, which roughly mirrors the VRAM ratio of a 24GB + 16GB pair. Always validate with nvidia-smi dmon while running.
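The same override can also be sent per request through the HTTP API instead of being baked into a Modelfile, since num_gpu is accepted in the request options:

# Per-request layer pinning via the API (prompt is just an example)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Summarise the tradeoffs of pipeline parallelism in two sentences.",
  "stream": false,
  "options": { "num_gpu": 50, "num_ctx": 8192 }
}'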

Confirming the Split

# Watch utilisation while a prompt is processing
nvidia-smi dmon -s u -c 30

# Or one-shot check during inference
watch -n 0.5 nvidia-smi

A correctly split model shows both GPUs swinging between 60 and 95 percent utilisation in lockstep. If GPU 1 sits at 0 percent while GPU 0 saturates, the split did not happen.
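Ollama can also report placement directly. ollama ps lists each loaded model and whether it is resident on GPU, on CPU, or split between the two:

# List loaded models and their CPU/GPU placement
ollama ps
# The PROCESSOR column should read "100% GPU"; any CPU percentage means layers spilled to system RAM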


Benchmarks: Real Numbers {#benchmarks}

All numbers below are from a 256-token completion at temperature 0, prompt size 512 tokens, Llama 3.3 70B Q4_K_M unless noted. Power draw is wall-socket measured.
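To reproduce this style of measurement, the generate endpoint returns token counts and durations in nanoseconds, so a single curl piped through jq is enough (the prompt below is a stand-in, not the one used for these tables):

# Measure decode throughput from the API timing fields
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b",
  "prompt": "Write a 250-word overview of the history of the printing press.",
  "stream": false,
  "options": { "temperature": 0, "num_predict": 256 }
}' | jq '{tokens: .eval_count, tok_per_sec: (.eval_count / (.eval_duration / 1e9)), prompt_eval_ms: (.prompt_eval_duration / 1e6)}'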

Llama 3.3 70B Q4_K_M (8K context)

| Rig | Tokens/sec | First-token latency | Wall power | VRAM used |
|---|---|---|---|---|
| 1x A6000 (48GB) | 19.4 | 380 ms | 320 W | 46.1 GB |
| 2x RTX 3090 | 17.1 | 410 ms | 690 W | 23.8 + 22.4 GB |
| 2x RTX 4090 | 22.6 | 350 ms | 800 W | 23.6 + 22.6 GB |
| 4090 + 3090 (asym) | 18.3 | 395 ms | 720 W | 23.5 + 22.7 GB |
| 4x RTX 3090 (32K ctx) | 16.4 | 470 ms | 1320 W | ~12 GB each |

Two key takeaways. First, doubling GPUs does not double throughput — pipeline parallelism only adds VRAM, not compute. Second, the 4090 pair beats a single A6000 only narrowly despite costing more, because the inter-GPU latency of pipeline parallelism eats most of the gain.

Mixtral 8x22B Q4_K_M

| Rig | Tokens/sec | Notes |
|---|---|---|
| 2x A6000 (96GB) | 24.8 | Sweet spot for MoE |
| 4x RTX 3090 | 21.1 | Power-hungry alternative |
| 2x RTX 4090 | OOM | Insufficient combined VRAM |

MoE models reward more cards because the router activates different experts each token, and llama.cpp can keep more experts hot when more VRAM is available.

Comparison Against vLLM (Tensor Parallel)

| Framework | Hardware | Llama 3.3 70B Q4 tok/s |
|---|---|---|
| Ollama (pipeline) | 2x RTX 4090 | 22.6 |
| vLLM (tensor parallel) | 2x RTX 4090 | 41.2 |
| TGI (tensor parallel) | 2x RTX 4090 | 38.9 |

If single-stream throughput matters more than developer ergonomics, pair Ollama with LiteLLM for the API surface and run vLLM behind it. The cost is a substantially more complex deployment.
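For scale, tensor parallelism in vLLM is a single flag. A representative launch is sketched below; the model ID is illustrative, and an unquantized 70B checkpoint will not fit on two 24 GB cards, so in practice you would point this at a quantized variant that does:

# Illustrative vLLM launch with tensor parallelism across two GPUs
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192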

For a deeper look at these tradeoffs, the official llama.cpp project documents the underlying split modes in detail at the llama.cpp tensor split discussion.


NVLink, PCIe, and Bus Reality {#nvlink}

NVLink is the most over-recommended upgrade in the local AI community. For Ollama specifically, here is what it actually does.

Pipeline parallelism transfers per token: approximately 16 KB to 64 KB of hidden-state activations between cards, depending on hidden dimension and batch. A 70B model has hidden_size 8192, so each token boundary moves roughly 16 KB in FP16.

PCIe 4.0 x16 bandwidth: ~32 GB/s theoretical, ~28 GB/s practical.

NVLink 3.0 (3090 SLI bridge): 112.5 GB/s.

A 30 tok/s inference stream at 16 KB per token-boundary needs about 0.5 MB/s — more than four orders of magnitude below PCIe 4.0 capacity. The actual benchmark gap between PCIe-only and NVLink on 2x 3090 in Ollama is 1-3 percent.
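The arithmetic behind that claim, as a quick sanity check:

# Per-token activation traffic vs practical PCIe 4.0 x16 bandwidth
awk 'BEGIN {
  hidden=8192; bytes=2; tok_per_sec=30;
  traffic_mb = hidden * bytes * tok_per_sec / 1e6;   # about 0.5 MB/s
  pcie_mb = 28000;                                   # about 28 GB/s practical
  printf "Traffic: %.2f MB/s, PCIe headroom: roughly %.0fx\n", traffic_mb, pcie_mb / traffic_mb
}'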

When NVLink does help:

  • Loading the model (one-time cost, not per inference)
  • Tensor-parallel frameworks (vLLM, TGI) that move bulk activations
  • Prefill of very long prompts where activations are larger

When NVLink does not help meaningfully:

  • Standard token-by-token Ollama inference
  • Short prompts followed by long generations
  • Mixed CPU/GPU offload setups

Save the NVLink money for a third GPU or more system RAM.


Mixed VRAM and Asymmetric Rigs {#mixed-vram}

The single most popular real-world combination right now: a new RTX 4090 paired with a used RTX 3090 from the previous build. Here is how to make it work cleanly.

Layout the Layers Manually

A 4090 has 24 GB. A 3090 has 24 GB. They are technically symmetric in capacity but asymmetric in compute. The 4090 is roughly 35-45 percent faster on FP16 inference. So you want more layers on the 4090 even though VRAM matches.

# Modelfile pinning 70 layers to GPU 0 (the 4090)
cat > Modelfile.l3-asym <<'EOF'
FROM llama3.3:70b
PARAMETER num_gpu 70
PARAMETER num_ctx 8192
EOF

ollama create llama3-asym -f Modelfile.l3-asym

The remaining 10 layers land on the 3090, which becomes the bottleneck only briefly per token. Net throughput in our tests was 21.4 tok/s — about 95 percent of a pure 4090 pair.

Mixed 24GB + 16GB (4090 + 4070 Ti)

When VRAM is genuinely asymmetric, stay with the greedy default (OLLAMA_SCHED_SPREAD unset, equivalent to 0) and let the larger card take its share first. Override only if GPU 1 OOMs during model load — then dial num_gpu down by 2-4 layers at a time until it is stable.

Mixing NVIDIA Generations

Ada (4090, 4080) + Ampere (3090, 3080) works. Hopper (H100) + Ada works. Mixing Turing (RTX 2080) with Ampere or newer is risky — older CUDA capability can force fallback kernels that cap throughput. Stick to capability 8.0+ on every card in the rig.
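Recent drivers let you query compute capability directly, which is the fastest way to spot a pre-Ampere straggler in a mixed rig:

# Query CUDA compute capability per GPU (needs a reasonably recent driver)
nvidia-smi --query-gpu=index,name,compute_cap --format=csv
# Ampere reports 8.6, Ada 8.9, Hopper 9.0; Turing reports 7.5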


Troubleshooting Pitfalls {#pitfalls}

Pitfall 1: Second GPU Stays Idle

Symptom: nvidia-smi shows GPU 1 at 0 percent utilisation and 0 MB allocated during inference.

Causes:

  1. The model fits entirely on GPU 0. Confirm with ollama show llama3.3:70b. If size is under 24 GB, expected behaviour.
  2. CUDA_VISIBLE_DEVICES is set to a single device by your shell or another systemd override.
  3. OLLAMA_SCHED_SPREAD is unset on a symmetric rig.

Fix:

# Confirm what Ollama actually sees
sudo systemctl show ollama | grep -E "Environment|CUDA"

# Force the spread
sudo systemctl edit ollama
# add: Environment="OLLAMA_SCHED_SPREAD=1"
sudo systemctl restart ollama

Pitfall 2: OOM on Model Load Despite Combined VRAM Being Sufficient

Symptom: A 70B model with 44 GB weight requirement fails to load on 2x 24 GB cards.

Cause: Each GPU also needs context buffer, KV cache, and CUDA workspace. On a 24 GB card, that overhead is 1.5-2.5 GB. So combined usable VRAM is 42-43 GB, not 48 GB.

Fix: Lower context size or quantization.

# Drop ctx from 16K to 8K
ollama run llama3.3:70b --ctx-size 8192

# Or pull a tighter quantization (Q3_K_M instead of Q4_K_M)
ollama pull llama3.3:70b-instruct-q3_K_M

Pitfall 3: Throughput Worse Than Single GPU

Symptom: A model that fits on one card but is being split anyway runs slower than on the single card.

Cause: You forced a split that should not have happened. Pipeline parallelism adds inter-GPU latency. If the model fits on one card, run it on one card.

Fix: Either remove OLLAMA_SCHED_SPREAD=1, or pin the model entirely with PARAMETER num_gpu 999 in a Modelfile to keep it on GPU 0.
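As a concrete form of that fix, a two-line Modelfile pins everything to the first visible GPU (the llama3-single name is just an example):

# Modelfile that keeps every layer on the primary GPU, no split
FROM llama3.3:70b
PARAMETER num_gpu 999

# Then from the shell:
ollama create llama3-single -f Modelfile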

Pitfall 4: Random Crashes Mid-Generation

Symptom: Ollama segfaults or the GPU drops off the bus during long generations.

Causes (in order of frequency): thermal throttling, PSU undervolt under transient load, driver bug.

Fix:

# Cap power on each card to reduce transient spikes
sudo nvidia-smi -i 0 -pl 320
sudo nvidia-smi -i 1 -pl 320

# Check thermals during a stress run
nvidia-smi dmon -s pucvmt -c 60 > thermals.log

If either GPU hits 85 C+ under load, improve airflow before tuning anything else. Sustained 90 C is the leading cause of "random" multi-GPU failures.

Pitfall 5: Different Results Between Runs

Symptom: Same prompt, same seed, different output across two-GPU and one-GPU runs.

Cause: Floating-point reduction order differs by layer placement. This is expected and harmless for normal usage, but breaks deterministic tests.

Fix: Pin the rig to a single GPU for any test that demands exact reproducibility. Multi-GPU is for production throughput, not regression testing.

For the broader pattern of fixing Ollama issues, the Ollama troubleshooting guide covers single-GPU problems exhaustively.


Final Notes

Multi-GPU Ollama is best understood as VRAM expansion, not compute expansion. The moment you internalise that, every confusing benchmark and every disappointed Reddit thread snaps into place. Two 4090s give you 48 GB of fast memory, modest throughput improvement, and the ability to run a model class that simply does not fit anywhere else on consumer hardware. They do not give you twice the speed.

Build the rig. Set OLLAMA_SCHED_SPREAD=1. Pin the layers if your cards are asymmetric. Watch nvidia-smi dmon until both GPUs swing in unison. Then forget the hardware exists and go build the application that needed a 70B model in the first place.

If single-stream throughput is non-negotiable, the rest of the local AI ecosystem — vLLM, TGI, SGLang — is waiting. But for the team that wants 70B-class reasoning on hardware they own, with a CLI as friendly as ollama run, this is still the cleanest path.
