Can Ollama run without a GPU (CPU only)?

Yes. Ollama runs on CPU-only systems, but performance drops dramatically. A modern 8-core CPU generates roughly 3-8 tokens/sec with a 7B model versus 40-80 tokens/sec on a mid-range GPU. CPU-only mode is usable for small models (3B-7B) if you do not need fast responses. For anything above 7B on CPU, expect multi-second delays between responses. Ollama automatically falls back to CPU when no supported GPU is detected.

What NVIDIA GPUs does Ollama support?

Ollama supports NVIDIA GPUs with compute capability 5.0 or higher through CUDA. This includes the GTX 900 series and newer: GTX 1060+, RTX 2060+, RTX 3060+, RTX 4060+, RTX 5070+, and all Tesla, Quadro, and A-series datacenter GPUs. You need the NVIDIA driver version 450.80.02+ on Linux or 452.39+ on Windows. The CUDA toolkit is bundled with Ollama, so you do not need to install it separately.

Does Ollama support AMD GPUs?

Yes, Ollama supports AMD GPUs through ROCm on Linux. Supported cards include the Radeon RX 6000 series (6700 XT+), RX 7000 series (7600+), and Radeon PRO/Instinct datacenter GPUs. ROCm support on Windows is still experimental as of June 2026. AMD GPU performance typically reaches 60-80% of equivalent NVIDIA cards due to less mature AI software optimization. Install ROCm 5.7+ before installing Ollama on Linux.

How much RAM does Ollama need?

Ollama itself uses minimal RAM (under 200 MB). The models consume the memory. For GPU inference, you need VRAM: 4-6 GB for 7B models, 8-10 GB for 14B, 18-22 GB for 32B (all at Q4 quantization). For CPU inference or Apple Silicon, system RAM is used instead: minimum 8 GB for small models, 16 GB recommended for 7B-14B, 32 GB for 32B models. Ollama loads the model into memory on first request and keeps it resident.

What are the Ollama storage requirements?

Ollama models range from 2 GB (small 3B quantized) to 40+ GB (70B Q4). A typical setup with 3-5 models needs 15-30 GB of disk space. Models are stored in ~/.ollama/models on Linux/Mac and C:\Users\ \.ollama\models on Windows. SSD storage is recommended for faster model loading times -- an NVMe SSD loads a 7B model in 2-3 seconds versus 8-12 seconds on a spinning hard drive.

Can Ollama use Apple Silicon GPU acceleration?

Yes. Ollama automatically uses Metal GPU acceleration on all Apple Silicon Macs (M1, M2, M3, M4 and their Pro/Max/Ultra variants). No additional configuration is needed. Apple Silicon uses unified memory, so the full system RAM is available for model weights. A 16 GB M2 MacBook Air can run 7B-14B models at 20-35 tokens/sec. Metal acceleration provides roughly 3-5x speedup over CPU-only inference on the same chip.

Does Ollama work inside Docker containers?

Yes. Ollama provides an official Docker image (ollama/ollama). For GPU support in Docker, you need the NVIDIA Container Toolkit (nvidia-docker2) installed on the host. Run with --gpus all flag: docker run --gpus all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama. Docker adds minimal overhead (under 2% performance impact). AMD GPU passthrough in Docker requires --device /dev/kfd --device /dev/dri flags with ROCm.

What operating systems does Ollama support?

Ollama supports macOS 11 (Big Sur) or later, Windows 10/11, and Linux (Ubuntu 20.04+, Fedora 36+, Debian 11+, Arch Linux, and most distributions with glibc 2.31+). FreeBSD has community support but is not officially tested. ChromeOS is not supported. On Linux, Ollama runs natively or through Docker. On Windows, it runs natively (no WSL required since Ollama v0.3).

Ollama System Requirements 2026: 8GB RAM Minimum, No GPU Needed

Published on April 10, 2026 · Updated June 20, 2026 -- 24 min read

Short answer: Ollama's minimum requirement is 8 GB of RAM, 10 GB of free disk, and a 64-bit CPU with AVX2 -- no GPU required. For comfortable daily use, target 16 GB RAM plus an 8-12 GB GPU (NVIDIA RTX 3060/4060 or a 16 GB Apple Silicon Mac), which runs 7B-14B models at 30-60 tokens/sec. As a rule of thumb at Q4 quantization: 8 GB runs 7B models, 16 GB runs 13B-14B, and 24 GB+ runs 32B. NVIDIA needs CUDA compute capability 5.0+ (GTX 900 series or newer); AMD needs ROCm on Linux; Apple Silicon uses Metal automatically. The rest of this guide gives the exact VRAM-per-model table and measured tokens/sec on real hardware.

Updated June 20, 2026 for Ollama v0.30.10 (released June 17, 2026), which pairs the MLX engine on Apple Silicon with continued llama.cpp improvements, and adds native support for 2026 models like OpenAI's gpt-oss, Gemma 4, and Llama 4. The base hardware floor (8 GB RAM, no GPU) has not changed -- but the newer models shift where the sweet spot sits. If you want a deeper breakdown of memory by component, see our RAM requirements for local AI guide and the dedicated VRAM requirements guide for 2026.

Ollama's documentation says "macOS, Linux, Windows" and leaves it at that. But when you pull a 14B model and your system grinds to a halt, the missing details matter. How much VRAM for a 32B model? Does your AMD GPU work? Can a CPU-only machine handle it?

I have tested Ollama across 14 different hardware configurations -- from a Raspberry Pi 5 to dual RTX 4090s -- and documented exactly what works, what barely works, and what does not work at all. This guide gives you the specific specs, the real performance numbers, and the commands to check whether your system is ready.

Minimum vs Recommended Specs {#minimum-vs-recommended}

Absolute Minimum (Will Run, But Slowly)

Component	Requirement
OS	macOS 11+, Windows 10+, Linux (glibc 2.31+)
CPU	Any x86_64 with AVX2 or Apple Silicon
RAM	8 GB system memory
Storage	10 GB free space
GPU	None (CPU-only mode)

With minimum specs, expect 3-8 tokens/sec on a 7B model. Usable for testing. Painful for actual work.

Recommended (Comfortable Daily Use)

Component	Requirement
OS	macOS 13+, Windows 11, Ubuntu 22.04+
CPU	8+ core modern CPU (AMD Ryzen 5000+, Intel 12th Gen+)
RAM	16 GB system memory
GPU	NVIDIA RTX 3060 12GB or Apple Silicon M1 16GB
Storage	50 GB SSD (NVMe preferred)

This setup runs 7B-14B models smoothly at 30-60 tokens/sec. Good enough for code assistance, chat, and RAG pipelines.

Optimal (Power User / Small Team)

Component	Requirement
OS	Ubuntu 22.04 LTS or macOS 14+
CPU	16+ core (AMD Ryzen 9 or Apple M3 Pro/Max)
RAM	32-64 GB
GPU	NVIDIA RTX 4090 24GB or Apple M3 Max 48GB
Storage	200 GB NVMe SSD

Handles 32B models at interactive speeds. Serves multiple concurrent users through Ollama's API.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

GPU Requirements: NVIDIA, AMD, Apple {#gpu-requirements}

NVIDIA GPUs (Best Support)

Ollama uses CUDA for NVIDIA GPU acceleration. The official Ollama repository lists compute capability 5.0+ as the minimum.

Supported NVIDIA GPUs:

GPU Family	Compute Capability	VRAM Range	Status
GTX 900 series	5.2	2-8 GB	Minimal (7B Q3 only)
GTX 1000 series	6.1	3-11 GB	Basic (7B models)
RTX 2000 series	7.5	6-11 GB	Good (7B-14B)
RTX 3000 series	8.6	8-24 GB	Great (7B-32B)
RTX 4000 series	8.9	8-24 GB	Excellent (7B-32B)
RTX 5000 series	12.0	12-32 GB	Excellent (7B-32B)
A100/A6000	8.0-8.6	40-80 GB	Professional (7B-70B)

Driver requirements:

Linux: NVIDIA driver 450.80.02 or later
Windows: NVIDIA driver 452.39 or later

Check your driver and CUDA version:

# Linux/Windows
nvidia-smi

# Look for "CUDA Version" and "Driver Version" in output

Ollama bundles its own CUDA runtime, so you do not need to install the CUDA Toolkit separately. Just the NVIDIA driver.

AMD GPUs (Linux ROCm)

AMD support works through ROCm (Radeon Open Compute). As of June 2026, support is solid on Linux but still experimental on Windows.

Supported AMD GPUs:

GPU	VRAM	ROCm Support	Performance vs NVIDIA Equiv.
RX 6700 XT	12 GB	Full	~70% of RTX 3060 12GB
RX 6800 XT	16 GB	Full	~65% of RTX 3070 Ti
RX 7600	8 GB	Full	~75% of RTX 4060
RX 7900 XTX	24 GB	Full	~70% of RTX 4090
Radeon PRO W7900	48 GB	Full	~60% of A6000
MI250X	128 GB	Full	Competitive with A100

ROCm installation (Ubuntu):

# Add ROCm repository
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60002-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm

# Verify
rocminfo | grep "Name:"

The performance gap between AMD and NVIDIA exists because CUDA has 15+ years of AI optimization behind it. ROCm is improving rapidly, but for maximum local AI performance per dollar, NVIDIA still leads.

Apple Silicon (Metal)

Every Apple Silicon chip supports Ollama through Metal GPU acceleration -- no configuration needed.

Performance by Apple Silicon chip:

Chip	Unified Memory	7B tok/s	14B tok/s	32B tok/s
M1 (8 GPU cores)	8-16 GB	25-35	12-18	Too slow
M1 Pro (16 GPU cores)	16-32 GB	35-45	20-28	8-12
M2 (10 GPU cores)	8-24 GB	30-40	15-22	Too slow
M2 Max (38 GPU cores)	32-96 GB	40-55	25-35	12-18
M3 (10 GPU cores)	8-24 GB	35-45	18-25	Too slow
M3 Max (40 GPU cores)	36-128 GB	50-65	30-42	15-22
M3 Ultra (80 GPU cores)	64-192 GB	60-75	40-52	22-30
M4 Max (40 GPU cores)	36-128 GB	55-70	35-48	18-26

Apple Silicon benefits from unified memory -- the GPU accesses the same RAM as the CPU with no copy overhead. A 32 GB M2 Max can run models that would require 32 GB of dedicated VRAM on an NVIDIA system, though NVIDIA hardware generates tokens faster at equivalent memory sizes.

For detailed Apple Silicon setup, see our Mac local AI setup guide.

RTX 50-Series & Blackwell Support {#blackwell-support}

If you bought a 2025-2026 NVIDIA card, the good news is Ollama handles it with zero manual setup. The RTX 5090, 5080, 5070 Ti and 5070 are all "desktop Blackwell" cards (compute capability 12.0). Ollama auto-detects them through CUDA and enables flash attention automatically -- no driver flags or rebuilds needed.

GPU	VRAM	Compute Cap.	Best Model Tier
RTX 5070	12 GB	12.0	7B-14B comfortably
RTX 5070 Ti	16 GB	12.0	7B-22B, tight on 32B
RTX 5080	16 GB	12.0	7B-22B, partial 32B offload
RTX 5090	32 GB	12.0	32B native, 70B with offload

Two things matter for Ollama on Blackwell:

Native MXFP4 in silicon. Blackwell executes 4-bit microscaling-float math directly in hardware rather than emulating it. That is exactly the format OpenAI's gpt-oss models ship in, so a 5090 runs gpt-oss efficiently at full speed.
Memory bandwidth, not just capacity. The 5090's GDDR7 gives roughly 28-67% more tokens/sec than a 4090 depending on model and context length (community benchmarks), even though both are 24-32 GB-class cards. For Ollama, bandwidth is often the real bottleneck once a model fits in VRAM.

The practical takeaway: a 16 GB card (5070 Ti / 5080) is the new mainstream sweet spot for 7B-22B work, and the 32 GB RTX 5090 is the first single consumer card that runs 32B models comfortably without any CPU offload. Not sure whether your specific card and RAM combo will run a given model? Run it through our free "Can I run local AI?" hardware checker before you download anything.

VRAM Needs Per Model Size {#vram-per-model}

This is the table that answers the most common question: "Can my GPU run this model?"

All values are for Q4_K_M quantization (the recommended sweet spot) with 4K context:

Model Size	VRAM Required	Example Models	Fits On
1B-3B	1.5-2.5 GB	Phi-3 Mini, Llama 3.2 1B	Any 4GB+ GPU
7B-8B	4-6 GB	Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B	RTX 3060 8GB, RTX 4060 8GB
13B-14B	8-10 GB	Qwen 2.5 14B, Phi-3 Medium	RTX 3060 12GB, RTX 4060 Ti 16GB
20B-22B	12-14 GB	Mistral Small 22B	RTX 4060 Ti 16GB, RX 7900 XT 20GB
32B-34B	18-22 GB	Qwen 2.5 32B, DeepSeek R1 32B	RTX 4090 24GB, RX 7900 XTX 24GB
70B	38-42 GB	Llama 3.3 70B, Qwen 2.5 72B	A6000 48GB, 2x RTX 4090

Context length increases memory usage. Each additional 1K context tokens adds roughly:

7B model: ~0.3-0.5 GB
14B model: ~0.5-0.8 GB
32B model: ~0.8-1.5 GB
70B model: ~1.0-2.0 GB

A 32B model at Q4_K_M with 32K context might need 28-32 GB instead of 20 GB.

What happens when a model exceeds VRAM? Ollama automatically offloads layers to system RAM (CPU). This works, but speed drops by 5-10x. If you see generation speeds below 5 tokens/sec on a GPU system, partial CPU offloading is likely happening. Check with:

# See how many layers are on GPU vs CPU
OLLAMA_DEBUG=1 ollama run llama3.1:8b "test" 2>&1 | grep "layers"

For a model-by-model lookup beyond the summary above -- including exact pull sizes and the minimum card for each -- see our Ollama model RAM & VRAM table, which lists every popular model in one reference.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

Requirements for 2026 Models (gpt-oss, Gemma 4, Llama 4) {#new-models-2026}

The 7B/13B/32B rules of thumb still hold, but the headline 2026 models are mostly mixture-of-experts (MoE) designs, which changes how you should read the numbers. An MoE model only activates a fraction of its parameters per token, so it runs faster than a dense model of the same total size -- but you still need enough memory to hold all the weights resident. Here is what the current crop actually needs in Ollama at the recommended quantization:

Model	Pull Command	Approx. Size on Disk	Approx. VRAM/RAM (Q4 / native)	Comfortable On
Gemma 4 E2B	`ollama run gemma4:e2b`	~2 GB	~2 GB	Any 4 GB+ GPU, 8 GB Mac
Gemma 4 E4B	`ollama run gemma4:e4b`	~5 GB	~6 GB	RTX 3060/4060 8 GB, 16 GB Mac
gpt-oss 20B	`ollama run gpt-oss:20b`	~12-13 GB	~16 GB (ships in MXFP4)	RTX 4070 Ti/5070 Ti 16 GB, 16-24 GB Mac
Gemma 4 26B (A4B MoE)	`ollama run gemma4:26b`	~15 GB	~14 GB Q4 / ~28 GB BF16	RTX 4090/5090 24-32 GB
Llama 4 Scout	`ollama run llama4:scout`	~60+ GB	64 GB+ unified / multi-GPU	M3/M4 Max 64-128 GB, A100
gpt-oss 120B	`ollama run gpt-oss:120b`	~65 GB	~80 GB	A100 80 GB, 96 GB+ Mac, multi-GPU

A few things worth calling out:

gpt-oss 20B is the new "fits a 16 GB card" flagship. OpenAI ships it natively in MXFP4, so it lands around 16 GB of memory without you quantizing anything yourself. On a Blackwell card (RTX 5070 Ti/5080/5090) the 4-bit math runs in hardware, so it is fast.
Gemma 4 is multimodal and goes tiny. The E2B variant runs in roughly 2 GB of VRAM at Q4, making it the most capable model you can squeeze onto an 8 GB machine or even a strong laptop. Image input adds some overhead beyond text-only use.
The 100B+ tier is still a workstation/server story. gpt-oss 120B needs an 80 GB-class GPU (or a high-memory Mac / multi-GPU rig). Do not expect to run it on a single consumer card.

The base Ollama floor (8 GB RAM, AVX2 CPU, no GPU) has not moved -- but if your goal is specifically to run the 2026 models well, plan for 16 GB of VRAM as the practical entry point and 24-32 GB if you want the 26B-32B tier with headroom.

CPU-Only Performance {#cpu-only-performance}

No GPU? Ollama still works. Here is what to expect:

CPU Performance by Processor

CPU	Cores/Threads	7B Q4 tok/s	14B Q4 tok/s	RAM Needed
Intel i5-12400	6/12	5-7	2-4	16 GB
Intel i7-13700K	16/24	8-12	5-7	16 GB
Intel i9-14900K	24/32	12-16	7-10	32 GB
AMD Ryzen 5 5600X	6/12	5-8	3-5	16 GB
AMD Ryzen 7 7800X3D	8/16	9-13	5-8	16 GB
AMD Ryzen 9 7950X	16/32	14-18	8-12	32 GB

CPU inference tips:

RAM speed matters. DDR5-6000 gives 15-25% faster inference than DDR4-3200.
More cores help, but diminishing returns after 12-16 threads.
AVX-512 support (Intel 12th Gen+, AMD Zen 4) improves performance by 10-20%.
Close memory-hungry applications. Ollama needs contiguous RAM blocks.

# Force CPU-only mode even if GPU is present
CUDA_VISIBLE_DEVICES="" ollama run llama3.1:8b

# Set thread count for CPU inference
OLLAMA_NUM_THREADS=12 ollama serve

Bottom line: CPU-only is viable for 7B models if you have a modern 8+ core processor. Anything above 14B on CPU is an exercise in patience.

OS-Specific Requirements {#os-specific}

Windows

Requirement	Details
OS Version	Windows 10 22H2+ or Windows 11
Architecture	x86_64 only (ARM64 not yet supported)
NVIDIA Driver	452.39+ for CUDA support
RAM	8 GB minimum (16 GB recommended)
Storage	Models stored in C:\Users<user>.ollama\models
Firewall	Allow port 11434 for API access

Windows-specific install:

# Download and install (no WSL needed since v0.3)
winget install Ollama.Ollama

# Verify installation
ollama --version

# Check GPU detection
ollama run llama3.2:1b "hello" 2>&1

For detailed Windows setup, see our Ollama Windows installation guide.

macOS

Requirement	Details
OS Version	macOS 11 Big Sur+ (13 Ventura+ recommended)
Architecture	Apple Silicon (M1+) or Intel x86_64
GPU	Metal acceleration automatic on Apple Silicon
RAM	8 GB minimum (16 GB recommended for Apple Silicon)
Storage	Models stored in ~/.ollama/models

macOS install:

# Homebrew (recommended)
brew install ollama

# Or download from ollama.com
curl -fsSL https://ollama.com/install.sh | sh

# Start the service
brew services start ollama

Intel Macs work but are significantly slower than Apple Silicon. An Intel i9 MacBook Pro (2019) generates roughly 4-6 tokens/sec with a 7B model. If you have an Intel Mac, consider upgrading to Apple Silicon or using a cloud GPU.

Linux

Requirement	Details
Kernel	5.4+ (5.15+ recommended)
glibc	2.31+
NVIDIA	Driver 450.80.02+ for CUDA
AMD	ROCm 5.7+ for GPU support
RAM	8 GB minimum
Storage	Models stored in ~/.ollama/models (or /usr/share/ollama if installed as service)

Linux install:

# One-line install (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection
ollama run llama3.2:1b "test"

# Check if CUDA is being used
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "cuda\|rocm\|metal"

Linux gets the best GPU performance. The NVIDIA driver on Linux is generally 5-10% faster than the same driver version on Windows for AI inference workloads. ROCm is Linux-only for full AMD support.

Storage Requirements {#storage-requirements}

Model Sizes on Disk

Model	Q4_K_M Size	Q5_K_M Size	Q8_0 Size	FP16 Size
Llama 3.2 1B	0.8 GB	1.0 GB	1.5 GB	2.5 GB
Llama 3.1 8B	4.7 GB	5.5 GB	8.5 GB	16 GB
Qwen 2.5 14B	8.9 GB	10.5 GB	15.5 GB	29 GB
Qwen 2.5 32B	19.8 GB	23.5 GB	34 GB	65 GB
Llama 3.3 70B	40.5 GB	48 GB	74 GB	140 GB

Disk Speed Impact on Load Times

Storage Type	7B Load Time	32B Load Time
NVMe SSD (Gen 4)	1-2 sec	5-8 sec
SATA SSD	3-5 sec	12-18 sec
HDD 7200 RPM	8-15 sec	40-60 sec
Network (NFS)	10-25 sec	60-120 sec

Model loading only happens once per session. After the first load, the model stays in memory until it times out (default: 5 minutes of inactivity).

# Keep model loaded indefinitely (no timeout)
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Check where models are stored
du -sh ~/.ollama/models/

# Move model storage to another drive
# Stop Ollama first, then:
mv ~/.ollama /mnt/fast-ssd/.ollama
ln -s /mnt/fast-ssd/.ollama ~/.ollama

Docker and Container Setup {#docker-requirements}

Running Ollama in Docker is common for team deployments and reproducible environments. Here are the requirements:

Docker with NVIDIA GPU

# Prerequisites: Docker 20.10+ and NVIDIA Container Toolkit
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run Ollama with GPU
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama run llama3.1:8b "Hello"

Docker with AMD GPU

# Requires ROCm installed on host
docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm

Docker CPU-Only

# No special flags needed
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Docker adds less than 2% performance overhead compared to bare-metal Ollama. The main consideration is ensuring GPU passthrough works correctly.

Cloud GPU Options {#cloud-gpu-options}

If your local hardware falls short, cloud GPU instances let you run Ollama remotely.

Cloud Provider Comparison

Provider	GPU	VRAM	Hourly Cost	Best For
RunPod	RTX 4090	24 GB	$0.39/hr	7B-32B models
RunPod	A100 80GB	80 GB	$1.19/hr	70B models
Lambda	A100 40GB	40 GB	$0.75/hr	32B-70B models
Vast.ai	RTX 3090	24 GB	$0.15-0.25/hr	Budget 7B-32B
Google Cloud	T4	16 GB	$0.35/hr	7B-14B models
AWS	g5.xlarge (A10G)	24 GB	$1.01/hr	Enterprise 7B-32B

Setting Up Ollama on a Cloud GPU

# SSH into your cloud instance
ssh user@your-cloud-ip

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Bind to all interfaces (so you can access remotely)
OLLAMA_HOST=0.0.0.0:11434 ollama serve &

# Pull model
ollama pull qwen2.5:32b

# From your local machine, connect:
export OLLAMA_HOST=http://your-cloud-ip:11434
ollama run qwen2.5:32b "test"

Cost comparison: A RunPod RTX 4090 at $0.39/hr for 8 hours/day costs $93/month. Buying an RTX 4090 ($1,600) breaks even in about 17 months. If you need 32B+ models daily, buying hardware is cheaper long-term. For occasional use or testing large models, cloud is more economical.

What Changed in Ollama in 2026? {#whats-new-2026}

If you last set Ollama up in 2024 or early 2025, a few things are worth knowing before you check requirements -- none of them raise the minimum bar, but several change what your existing hardware can do:

MLX engine on Apple Silicon. The v0.30 line added Apple's MLX backend alongside llama.cpp. On M-series Macs this generally improves throughput and memory efficiency for supported models, so your existing Mac may run newer models faster than the older tokens/sec figures suggest.
Native MXFP4 / Blackwell. As covered above, RTX 50-series cards run 4-bit microscaling math in hardware, which is the format gpt-oss ships in. No config required -- Ollama detects it.
No WSL on Windows, and ARM64 is closer. Native Windows installs have been the default since v0.3 (no WSL), and Windows on ARM support has been improving, though x86_64 is still the safe path for full GPU acceleration today.
Bigger default context windows. Several 2026 models ship with 128K+ context, which means context-related VRAM growth (the per-1K-token table earlier) matters more than it used to. If you load long documents, budget extra memory.

On the Linux side, requirements are unchanged (glibc 2.31+, NVIDIA 450.80.02+ or ROCm), and Ubuntu remains the smoothest distro for GPU acceleration. If Ubuntu specifically is your target, our can I run AI on Ubuntu walkthrough covers driver setup and the exact commands to confirm CUDA/ROCm is active.

Performance Optimization Tips {#optimization-tips}

1. GPU Layer Allocation

# Force all layers to GPU (default behavior if VRAM allows)
OLLAMA_NUM_GPU=999 ollama serve

# Manually set GPU layers (useful for partial offload)
OLLAMA_NUM_GPU=28 ollama serve  # Load 28 layers on GPU, rest on CPU

2. Context Length Tuning

Shorter context = less VRAM = faster inference.

# Set context length in Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
EOF
ollama create fast-llama -f Modelfile

3. Concurrent Request Handling

# Allow 2 parallel requests (doubles VRAM for KV cache)
OLLAMA_NUM_PARALLEL=2 ollama serve

# Max loaded models (default: 1, increase if you have VRAM)
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

4. Flash Attention

Ollama enables flash attention automatically on supported hardware. It reduces VRAM usage for the KV cache by 40-60% and speeds up long-context inference. Verify it is active:

OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "flash"

5. System-Level Optimizations

# Linux: increase shared memory (for large models)
echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Linux: disable GPU power management (prevents throttling)
sudo nvidia-smi -pm 1

# macOS: disable Spotlight indexing on model directory
sudo mdutil -i off ~/.ollama

# All platforms: check if swap is being used (bad for performance)
free -h  # Linux
vm_stat  # macOS

For more on the latest Ollama performance improvements, check the official blog.

Checking Your System {#checking-your-system}

Run these commands to verify your system meets the requirements before installing Ollama:

GPU Detection

# NVIDIA GPU check
nvidia-smi
# Output should show GPU name, VRAM, driver version

# AMD GPU check (Linux)
rocminfo | grep "Name:"
# Should list your GPU model

# Apple Silicon check
system_profiler SPDisplaysDataType | grep "Chipset Model\|Metal\|VRAM"
# Should show Metal: Supported

Memory Check

# Linux
free -h | grep Mem
# Look at "total" column

# macOS
sysctl hw.memsize | awk '{printf "%.1f GB\n", $2/1073741824}'

# Windows (PowerShell)
(Get-CimInstance Win32_PhysicalMemory | Measure-Object -Property Capacity -Sum).Sum / 1GB

Storage Check

# Linux/macOS
df -h ~/.ollama 2>/dev/null || df -h ~

# Windows (PowerShell)
Get-PSDrive C | Select-Object Free

AVX Support Check (Required for CPU Inference)

# Linux
grep -o 'avx[^ ]*' /proc/cpuinfo | head -1
# Should output "avx2" at minimum

# macOS (Apple Silicon always supports equivalent)
sysctl -a | grep machdep.cpu.features | grep AVX

Post-Install Verification

# After installing Ollama, run this to verify everything works
ollama --version
ollama pull llama3.2:1b  # Small model for testing
ollama run llama3.2:1b "What is 2+2?"
# If you get a response, your setup is working

# Check which acceleration backend is being used
OLLAMA_DEBUG=1 ollama run llama3.2:1b "test" 2>&1 | head -20

Conclusion

Ollama runs on almost anything with a CPU and 8 GB of RAM. But there is a wide gap between "runs" and "runs well." The sweet spot for most users is an RTX 4060 Ti 16GB ($400) or an Apple Silicon Mac with 16 GB unified memory. Either setup handles 7B-14B models at speeds that feel responsive, and that covers the majority of local AI use cases.

If you need bigger models, an RTX 4090 opens up the 32B tier, which is where local AI quality starts matching cloud APIs. And if you occasionally need 70B-class models, cloud GPU instances at $0.39-1.19/hr make more sense than buying dual GPUs for sporadic use.

Check your system with the commands above, pick a model that fits your VRAM, and start with ollama pull. You can always upgrade hardware later -- the models and configuration carry over unchanged.

Already have Ollama installed? Check our complete Ollama guide for advanced configuration, or browse the best Ollama models to find what works best for your hardware.

Ollama System Requirements 2026: Exact RAM, GPU & CPU Specs

Want to go deeper than this article?

Minimum vs Recommended Specs {#minimum-vs-recommended}

Absolute Minimum (Will Run, But Slowly)

Recommended (Comfortable Daily Use)

Optimal (Power User / Small Team)

Reading articles is good. Building is better.

GPU Requirements: NVIDIA, AMD, Apple {#gpu-requirements}

NVIDIA GPUs (Best Support)

AMD GPUs (Linux ROCm)

Apple Silicon (Metal)

RTX 50-Series & Blackwell Support {#blackwell-support}

VRAM Needs Per Model Size {#vram-per-model}

Reading articles is good. Building is better.

Requirements for 2026 Models (gpt-oss, Gemma 4, Llama 4) {#new-models-2026}

CPU-Only Performance {#cpu-only-performance}

CPU Performance by Processor

OS-Specific Requirements {#os-specific}

Windows

macOS

Linux

Storage Requirements {#storage-requirements}

Model Sizes on Disk

Disk Speed Impact on Load Times

Docker and Container Setup {#docker-requirements}

Docker with NVIDIA GPU

Docker with AMD GPU

Docker CPU-Only

Cloud GPU Options {#cloud-gpu-options}

Cloud Provider Comparison

Setting Up Ollama on a Cloud GPU

What Changed in Ollama in 2026? {#whats-new-2026}

Performance Optimization Tips {#optimization-tips}

1. GPU Layer Allocation

2. Context Length Tuning

3. Concurrent Request Handling

4. Flash Attention

5. System-Level Optimizations

Checking Your System {#checking-your-system}

GPU Detection

Memory Check

Storage Check

AVX Support Check (Required for CPU Inference)

Post-Install Verification

Conclusion

Ollama’s running. Here’s what to build with it.

Liked this? 20 full AI courses are waiting.

Local AI Master Research Team

Build Real AI on Your Machine

Want structured AI education?

Continue Your Local AI Journey

How to Install Your First Local AI Model

How to Choose the Right AI Model for Your Computer

Comments (0)

Written by the Local AI Master Team

Build Real AI on Your Machine

🎓 Continue Learning

Related Guides

Ollama on Windows: Complete Installation Guide

Ollama on Mac: Apple Silicon Setup Guide

AI Hardware Requirements Complete Guide

AI RAM Requirements: 7B, 13B & 70B

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Ollama’s running. Here’s what to build with it.