
Ollama System Requirements: CPU, GPU, RAM Guide

April 10, 2026
20 min read
Local AI Master Research Team


Ollama's documentation says "macOS, Linux, Windows" and leaves it at that. But when you pull a 14B model and your system grinds to a halt, the missing details matter. How much VRAM for a 32B model? Does your AMD GPU work? Can a CPU-only machine handle it?

I have tested Ollama across 14 different hardware configurations -- from a Raspberry Pi 5 to dual RTX 4090s -- and documented exactly what works, what barely works, and what does not work at all. This guide gives you the specific specs, the real performance numbers, and the commands to check whether your system is ready.


Absolute Minimum (Will Run, But Slowly)

| Component | Requirement |
|---|---|
| OS | macOS 11+, Windows 10+, Linux (glibc 2.31+) |
| CPU | Any x86_64 with AVX2, or Apple Silicon |
| RAM | 8 GB system memory |
| Storage | 10 GB free space |
| GPU | None (CPU-only mode) |

With minimum specs, expect 3-8 tokens/sec on a 7B model. Usable for testing. Painful for actual work.
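Tokens/sec translates directly into how long you wait for an answer. A quick sketch of the arithmetic (the 250-token reply length is just an illustrative assumption):

```python
def wait_seconds(reply_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream a full reply at a given generation speed."""
    return reply_tokens / tokens_per_sec

# A typical ~250-token answer:
print(wait_seconds(250, 5))   # minimum-spec CPU: 50.0 seconds
print(wait_seconds(250, 40))  # recommended GPU: 6.25 seconds
```

Fifty seconds per answer is why "runs" and "usable" are different claims.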

Recommended (Daily Use)

| Component | Requirement |
|---|---|
| OS | macOS 13+, Windows 11, Ubuntu 22.04+ |
| CPU | 8+ core modern CPU (AMD Ryzen 5000+, Intel 12th Gen+) |
| RAM | 16 GB system memory |
| GPU | NVIDIA RTX 3060 12GB or Apple Silicon M1 16GB |
| Storage | 50 GB SSD (NVMe preferred) |

This setup runs 7B-14B models smoothly at 30-60 tokens/sec. Good enough for code assistance, chat, and RAG pipelines.

Optimal (Power User / Small Team)

| Component | Requirement |
|---|---|
| OS | Ubuntu 22.04 LTS or macOS 14+ |
| CPU | 16+ core (AMD Ryzen 9 or Apple M3 Pro/Max) |
| RAM | 32-64 GB |
| GPU | NVIDIA RTX 4090 24GB or Apple M3 Max 48GB |
| Storage | 200 GB NVMe SSD |

Handles 32B models at interactive speeds. Serves multiple concurrent users through Ollama's API.
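Serving through the API means any script on your network can use the model. A minimal client sketch against Ollama's documented `/api/generate` endpoint (model name and prompt here are placeholders):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint (stream=False returns one object)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one prompt and return the full response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.1:8b", "One sentence on VRAM.")  # needs a running Ollama server
```

Concurrent callers share the loaded model; see the `OLLAMA_NUM_PARALLEL` setting later in this guide.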


GPU Requirements: NVIDIA, AMD, Apple {#gpu-requirements}

NVIDIA GPUs (Best Support)

Ollama uses CUDA for NVIDIA GPU acceleration. The official Ollama repository lists compute capability 5.0+ as the minimum.

Supported NVIDIA GPUs:

| GPU Family | Compute Capability | VRAM Range | Status |
|---|---|---|---|
| GTX 900 series | 5.2 | 2-8 GB | Minimal (7B Q3 only) |
| GTX 1000 series | 6.1 | 3-11 GB | Basic (7B models) |
| RTX 2000 series | 7.5 | 6-11 GB | Good (7B-14B) |
| RTX 3000 series | 8.6 | 8-24 GB | Great (7B-32B) |
| RTX 4000 series | 8.9 | 8-24 GB | Excellent (7B-32B) |
| RTX 5000 series | 12.0 | 12-32 GB | Excellent (7B-32B) |
| A100/A6000 | 8.0-8.6 | 40-80 GB | Professional (7B-70B) |

Driver requirements:

  • Linux: NVIDIA driver 450.80.02 or later
  • Windows: NVIDIA driver 452.39 or later

Check your driver and CUDA version:

# Linux/Windows
nvidia-smi

# Look for "CUDA Version" and "Driver Version" in output

Ollama bundles its own CUDA runtime, so you do not need to install the CUDA Toolkit separately. Just the NVIDIA driver.
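If you want to script the driver check, the CUDA version can be pulled out of the `nvidia-smi` header with a small parser (the sample line below is illustrative output, not from a specific driver release):

```python
import re

def cuda_version(smi_output: str) -> str:
    """Extract the CUDA version string from nvidia-smi header output."""
    m = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return m.group(1) if m else "not found"

sample = "| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4 |"
print(cuda_version(sample))  # 12.4

# Real use:
# import subprocess
# cuda_version(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```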

AMD GPUs (Linux ROCm)

AMD support works through ROCm (Radeon Open Compute). As of April 2026, support is solid on Linux but experimental on Windows.

Supported AMD GPUs:

| GPU | VRAM | ROCm Support | Performance vs NVIDIA Equiv. |
|---|---|---|---|
| RX 6700 XT | 12 GB | Full | ~70% of RTX 3060 12GB |
| RX 6800 XT | 16 GB | Full | ~65% of RTX 3070 Ti |
| RX 7600 | 8 GB | Full | ~75% of RTX 4060 |
| RX 7900 XTX | 24 GB | Full | ~70% of RTX 4090 |
| Radeon PRO W7900 | 48 GB | Full | ~60% of A6000 |
| MI250X | 128 GB | Full | Competitive with A100 |

ROCm installation (Ubuntu):

# Add ROCm repository
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60002-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm

# Verify
rocminfo | grep "Name:"

The performance gap between AMD and NVIDIA exists because CUDA has 15+ years of AI optimization behind it. ROCm is improving rapidly, but for maximum local AI performance per dollar, NVIDIA still leads.

Apple Silicon (Metal)

Every Apple Silicon chip supports Ollama through Metal GPU acceleration -- no configuration needed.

Performance by Apple Silicon chip:

| Chip | Unified Memory | 7B tok/s | 14B tok/s | 32B tok/s |
|---|---|---|---|---|
| M1 (8 GPU cores) | 8-16 GB | 25-35 | 12-18 | Too slow |
| M1 Pro (16 GPU cores) | 16-32 GB | 35-45 | 20-28 | 8-12 |
| M2 (10 GPU cores) | 8-24 GB | 30-40 | 15-22 | Too slow |
| M2 Max (38 GPU cores) | 32-96 GB | 40-55 | 25-35 | 12-18 |
| M3 (10 GPU cores) | 8-24 GB | 35-45 | 18-25 | Too slow |
| M3 Max (40 GPU cores) | 36-128 GB | 50-65 | 30-42 | 15-22 |
| M3 Ultra (80 GPU cores) | 64-192 GB | 60-75 | 40-52 | 22-30 |
| M4 Max (40 GPU cores) | 36-128 GB | 55-70 | 35-48 | 18-26 |

Apple Silicon benefits from unified memory -- the GPU accesses the same RAM as the CPU with no copy overhead. A 32 GB M2 Max can load models that would need roughly 24 GB of dedicated VRAM on an NVIDIA card (macOS reserves part of unified memory for the system), though NVIDIA hardware generates tokens faster at equivalent memory sizes.

For detailed Apple Silicon setup, see our Mac local AI setup guide.


VRAM Needs Per Model Size {#vram-per-model}

This is the table that answers the most common question: "Can my GPU run this model?"

All values are for Q4_K_M quantization (the recommended sweet spot) with 4K context:

| Model Size | VRAM Required | Example Models | Fits On |
|---|---|---|---|
| 1B-3B | 1.5-2.5 GB | Phi-3 Mini, Llama 3.2 1B | Any 4GB+ GPU |
| 7B-8B | 4-6 GB | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B | RTX 3060 8GB, RTX 4060 8GB |
| 13B-14B | 8-10 GB | Qwen 2.5 14B, Phi-3 Medium | RTX 3060 12GB, RTX 4060 Ti 16GB |
| 20B-22B | 12-14 GB | Mistral Small 22B | RTX 4060 Ti 16GB, RX 7900 XT 20GB |
| 32B-34B | 18-22 GB | Qwen 2.5 32B, DeepSeek R1 32B | RTX 4090 24GB, RX 7900 XTX 24GB |
| 70B | 38-42 GB | Llama 3.3 70B, Qwen 2.5 72B | A6000 48GB, 2x RTX 4090 |

Context length increases memory usage through the KV cache. For modern GQA models (Llama 3.x, Qwen 2.5) with an FP16 KV cache, each additional 1K context tokens adds roughly:

  • 7B-8B model: ~0.13 GB
  • 14B model: ~0.2 GB
  • 32B model: ~0.25 GB
  • 70B model: ~0.3 GB

Older MHA architectures (Llama 2 era) need several times more per token. A 32B model at Q4_K_M with 32K context therefore needs roughly 28-32 GB instead of 20 GB.
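The same arithmetic can be packed into a back-of-envelope estimator. The coefficients below are illustrative assumptions for modern GQA models with an FP16 KV cache, not official Ollama figures; MHA-era models need considerably more KV memory per token:

```python
# Rough VRAM estimate for Q4_K_M models: weights + KV cache.
WEIGHT_GB_PER_B = 0.6  # ~0.6 GB of Q4_K_M weights per billion parameters (assumption)
KV_GB_PER_1K = {8: 0.13, 14: 0.2, 32: 0.25, 70: 0.3}  # GB per 1K context tokens

def estimate_vram_gb(params_b: float, context_tokens: int) -> float:
    """Pick the nearest model-size bucket and add its KV-cache cost to the weights."""
    bucket = min(KV_GB_PER_1K, key=lambda b: abs(b - params_b))
    return params_b * WEIGHT_GB_PER_B + KV_GB_PER_1K[bucket] * context_tokens / 1024

print(round(estimate_vram_gb(8, 4096), 1))    # 8B at 4K context -> 5.3
print(round(estimate_vram_gb(32, 32768), 1))  # 32B at 32K context -> 27.2
```

Real usage varies by architecture and quantization, so treat the output as a floor, not a guarantee.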

What happens when a model exceeds VRAM? Ollama automatically offloads layers to system RAM (CPU). This works, but speed drops by 5-10x. If you see generation speeds below 5 tokens/sec on a GPU system, partial CPU offloading is likely happening. Check with:

# See how many layers are on GPU vs CPU
OLLAMA_DEBUG=1 ollama run llama3.1:8b "test" 2>&1 | grep "layers"

CPU-Only Performance {#cpu-only-performance}

No GPU? Ollama still works. Here is what to expect:

CPU Performance by Processor

| CPU | Cores/Threads | 7B Q4 tok/s | 14B Q4 tok/s | RAM Needed |
|---|---|---|---|---|
| Intel i5-12400 | 6/12 | 5-7 | 2-4 | 16 GB |
| Intel i7-13700K | 16/24 | 8-12 | 5-7 | 16 GB |
| Intel i9-14900K | 24/32 | 12-16 | 7-10 | 32 GB |
| AMD Ryzen 5 5600X | 6/12 | 5-8 | 3-5 | 16 GB |
| AMD Ryzen 7 7800X3D | 8/16 | 9-13 | 5-8 | 16 GB |
| AMD Ryzen 9 7950X | 16/32 | 14-18 | 8-12 | 32 GB |

CPU inference tips:

  • RAM speed matters. DDR5-6000 gives 15-25% faster inference than DDR4-3200.
  • More cores help, but diminishing returns after 12-16 threads.
  • AVX-512 support (AMD Zen 4+, Intel 11th Gen and server parts; disabled on Intel 12th Gen+ consumer chips) improves performance by 10-20%.
  • Close memory-hungry applications. Ollama needs contiguous RAM blocks.
# Force CPU-only mode even if GPU is present
CUDA_VISIBLE_DEVICES="" ollama run llama3.1:8b

# Set thread count for CPU inference
OLLAMA_NUM_THREADS=12 ollama serve

Bottom line: CPU-only is viable for 7B models if you have a modern 8+ core processor. Anything above 14B on CPU is an exercise in patience.


OS-Specific Requirements {#os-specific}

Windows

| Requirement | Details |
|---|---|
| OS Version | Windows 10 22H2+ or Windows 11 |
| Architecture | x86_64 only (ARM64 not yet supported) |
| NVIDIA Driver | 452.39+ for CUDA support |
| RAM | 8 GB minimum (16 GB recommended) |
| Storage | Models stored in C:\Users\\<user>\.ollama\models |
| Firewall | Allow port 11434 for API access |

Windows-specific install:

# Download and install (no WSL needed since v0.3)
winget install Ollama.Ollama

# Verify installation
ollama --version

# Check GPU detection
ollama run llama3.2:1b "hello" 2>&1

For detailed Windows setup, see our Ollama Windows installation guide.

macOS

| Requirement | Details |
|---|---|
| OS Version | macOS 11 Big Sur+ (13 Ventura+ recommended) |
| Architecture | Apple Silicon (M1+) or Intel x86_64 |
| GPU | Metal acceleration automatic on Apple Silicon |
| RAM | 8 GB minimum (16 GB recommended for Apple Silicon) |
| Storage | Models stored in ~/.ollama/models |

macOS install:

# Homebrew (recommended)
brew install ollama

# Or download the macOS app from ollama.com/download
# (the curl script at ollama.com/install.sh is Linux-only)

# Start the service
brew services start ollama

Intel Macs work but are significantly slower than Apple Silicon. An Intel i9 MacBook Pro (2019) generates roughly 4-6 tokens/sec with a 7B model. If you have an Intel Mac, consider upgrading to Apple Silicon or using a cloud GPU.

Linux

| Requirement | Details |
|---|---|
| Kernel | 5.4+ (5.15+ recommended) |
| glibc | 2.31+ |
| NVIDIA | Driver 450.80.02+ for CUDA |
| AMD | ROCm 5.7+ for GPU support |
| RAM | 8 GB minimum |
| Storage | Models stored in ~/.ollama/models (or /usr/share/ollama if installed as a service) |

Linux install:

# One-line install (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection
ollama run llama3.2:1b "test"

# Check if CUDA is being used
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "cuda\|rocm\|metal"

Linux gets the best GPU performance. The NVIDIA driver on Linux is generally 5-10% faster than the same driver version on Windows for AI inference workloads. ROCm is Linux-only for full AMD support.


Storage Requirements {#storage-requirements}

Model Sizes on Disk

| Model | Q4_K_M Size | Q5_K_M Size | Q8_0 Size | FP16 Size |
|---|---|---|---|---|
| Llama 3.2 1B | 0.8 GB | 1.0 GB | 1.5 GB | 2.5 GB |
| Llama 3.1 8B | 4.7 GB | 5.5 GB | 8.5 GB | 16 GB |
| Qwen 2.5 14B | 8.9 GB | 10.5 GB | 15.5 GB | 29 GB |
| Qwen 2.5 32B | 19.8 GB | 23.5 GB | 34 GB | 65 GB |
| Llama 3.3 70B | 40.5 GB | 48 GB | 74 GB | 140 GB |

Disk Speed Impact on Load Times

| Storage Type | 7B Load Time | 32B Load Time |
|---|---|---|
| NVMe SSD (Gen 4) | 1-2 sec | 5-8 sec |
| SATA SSD | 3-5 sec | 12-18 sec |
| HDD 7200 RPM | 8-15 sec | 40-60 sec |
| Network (NFS) | 10-25 sec | 60-120 sec |

Model loading only happens once per session. After the first load, the model stays in memory until it times out (default: 5 minutes of inactivity).

# Keep model loaded indefinitely (no timeout)
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Check where models are stored
du -sh ~/.ollama/models/

# Move model storage to another drive
# Stop Ollama first, then:
mv ~/.ollama /mnt/fast-ssd/.ollama
ln -s /mnt/fast-ssd/.ollama ~/.ollama
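The `du -sh` check above can also be scripted, which is handy when deciding what to prune before moving the directory. A small sketch (the default `~/.ollama/models` path is Ollama's documented location on Linux/macOS):

```python
from pathlib import Path

def models_size_gb(root: str) -> float:
    """Total size of all files under a directory, in decimal GB (like `du`)."""
    return sum(f.stat().st_size for f in Path(root).rglob("*") if f.is_file()) / 1e9

# Real use:
# print(round(models_size_gb(str(Path.home() / ".ollama" / "models")), 1))
```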

Docker and Container Setup {#docker-requirements}

Running Ollama in Docker is common for team deployments and reproducible environments. Here are the requirements:

Docker with NVIDIA GPU

# Prerequisites: Docker 20.10+ and NVIDIA Container Toolkit
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run Ollama with GPU
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama run llama3.1:8b "Hello"

Docker with AMD GPU

# Requires ROCm installed on host
docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm

Docker CPU-Only

# No special flags needed
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Docker adds less than 2% performance overhead compared to bare-metal Ollama. The main consideration is ensuring GPU passthrough works correctly.


Cloud GPU Options {#cloud-gpu-options}

If your local hardware falls short, cloud GPU instances let you run Ollama remotely.

Cloud Provider Comparison

| Provider | GPU | VRAM | Hourly Cost | Best For |
|---|---|---|---|---|
| RunPod | RTX 4090 | 24 GB | $0.39/hr | 7B-32B models |
| RunPod | A100 80GB | 80 GB | $1.19/hr | 70B models |
| Lambda | A100 40GB | 40 GB | $0.75/hr | 32B-70B models |
| Vast.ai | RTX 3090 | 24 GB | $0.15-0.25/hr | Budget 7B-32B |
| Google Cloud | T4 | 16 GB | $0.35/hr | 7B-14B models |
| AWS | g5.xlarge (A10G) | 24 GB | $1.01/hr | Enterprise 7B-32B |

Setting Up Ollama on a Cloud GPU

# SSH into your cloud instance
ssh user@your-cloud-ip

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Bind to all interfaces (so you can access remotely)
OLLAMA_HOST=0.0.0.0:11434 ollama serve &

# Pull model
ollama pull qwen2.5:32b

# From your local machine, connect:
export OLLAMA_HOST=http://your-cloud-ip:11434
ollama run qwen2.5:32b "test"

Cost comparison: A RunPod RTX 4090 at $0.39/hr for 8 hours/day costs $93/month. Buying an RTX 4090 ($1,600) breaks even in about 17 months. If you need 32B+ models daily, buying hardware is cheaper long-term. For occasional use or testing large models, cloud is more economical.
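The break-even arithmetic above generalizes to any GPU and rental rate (30 days/month assumed):

```python
def breakeven_months(gpu_price: float, hourly_rate: float, hours_per_day: float) -> float:
    """Months of cloud rental that equal the purchase price of a local GPU."""
    return gpu_price / (hourly_rate * hours_per_day * 30)

# RTX 4090 ($1,600) vs RunPod at $0.39/hr, 8 hours/day:
print(round(breakeven_months(1600, 0.39, 8), 1))  # 17.1 months
```

Plug in your own usage pattern; at 2 hours/day the break-even stretches past five years, which is why light users should rent.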


Performance Optimization Tips {#optimization-tips}

1. GPU Layer Allocation

# Force all layers to GPU (default behavior if VRAM allows)
OLLAMA_NUM_GPU=999 ollama serve

# Manually set GPU layers (useful for partial offload)
OLLAMA_NUM_GPU=28 ollama serve  # Load 28 layers on GPU, rest on CPU

2. Context Length Tuning

Shorter context = less VRAM = faster inference.

# Set context length in Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
EOF
ollama create fast-llama -f Modelfile
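The same `num_ctx` knob can be set per request through the REST API's `options` field, without creating a new model (model name and prompt below are placeholders):

```python
import json

# Per-request context override for Ollama's /api/generate endpoint.
payload = json.dumps({
    "model": "llama3.1:8b",
    "prompt": "Summarize the design doc.",
    "stream": False,
    "options": {"num_ctx": 4096},  # same knob as PARAMETER num_ctx in a Modelfile
})
# POST this body to http://localhost:11434/api/generate
print(json.loads(payload)["options"]["num_ctx"])  # 4096
```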

3. Concurrent Request Handling

# Allow 2 parallel requests (doubles VRAM for KV cache)
OLLAMA_NUM_PARALLEL=2 ollama serve

# Max loaded models (default: 1, increase if you have VRAM)
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

4. Flash Attention

Ollama enables flash attention automatically on supported hardware (on older builds, set OLLAMA_FLASH_ATTENTION=1 to turn it on explicitly). It reduces VRAM usage for the KV cache by 40-60% and speeds up long-context inference. Verify it is active:

OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "flash"

5. System-Level Optimizations

# Linux: allow memory overcommit (helps when mmapping very large model files)
echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Linux: enable persistence mode (keeps the driver initialized between runs)
sudo nvidia-smi -pm 1

# macOS: disable Spotlight indexing on model directory
sudo mdutil -i off ~/.ollama

# All platforms: check if swap is being used (bad for performance)
free -h  # Linux
vm_stat  # macOS
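On Linux, the swap check can be scripted by parsing /proc/meminfo. A small sketch (the sample string stands in for real kernel output):

```python
def swap_used_kb(meminfo: str) -> int:
    """Parse SwapTotal/SwapFree from /proc/meminfo text; return used swap in kB."""
    fields = dict(line.split(":") for line in meminfo.strip().splitlines())
    total = int(fields["SwapTotal"].split()[0])
    free = int(fields["SwapFree"].split()[0])
    return total - free

sample = "SwapTotal:       8388604 kB\nSwapFree:        8388604 kB"
print(swap_used_kb(sample))  # 0 -> no swap pressure

# Real use: swap_used_kb(open("/proc/meminfo").read())
```

A nonzero result while a model is loaded usually means the model spilled out of RAM, and generation speed will crater.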

For more on the latest Ollama performance improvements, check the official blog.


Checking Your System {#checking-your-system}

Run these commands to verify your system meets the requirements before installing Ollama:

GPU Detection

# NVIDIA GPU check
nvidia-smi
# Output should show GPU name, VRAM, driver version

# AMD GPU check (Linux)
rocminfo | grep "Name:"
# Should list your GPU model

# Apple Silicon check
system_profiler SPDisplaysDataType | grep "Chipset Model\|Metal\|VRAM"
# Should show Metal: Supported

Memory Check

# Linux
free -h | grep Mem
# Look at "total" column

# macOS
sysctl hw.memsize | awk '{printf "%.1f GB\n", $2/1073741824}'

# Windows (PowerShell)
(Get-CimInstance Win32_PhysicalMemory | Measure-Object -Property Capacity -Sum).Sum / 1GB

Storage Check

# Linux/macOS
df -h ~/.ollama 2>/dev/null || df -h ~

# Windows (PowerShell)
Get-PSDrive C | Select-Object Free

AVX Support Check (Required for CPU Inference)

# Linux
grep -o 'avx2\|avx512' /proc/cpuinfo | sort -u
# Should list "avx2" at minimum

# macOS: Intel Macs only (Apple Silicon lacks AVX but its NEON units cover the same role)
sysctl -a | grep machdep.cpu.features | grep AVX

Post-Install Verification

# After installing Ollama, run this to verify everything works
ollama --version
ollama pull llama3.2:1b  # Small model for testing
ollama run llama3.2:1b "What is 2+2?"
# If you get a response, your setup is working

# Check which acceleration backend is being used
OLLAMA_DEBUG=1 ollama run llama3.2:1b "test" 2>&1 | head -20

Conclusion

Ollama runs on almost anything with a CPU and 8 GB of RAM. But there is a wide gap between "runs" and "runs well." The sweet spot for most users is an RTX 4060 Ti 16GB ($400) or an Apple Silicon Mac with 16 GB unified memory. Either setup handles 7B-14B models at speeds that feel responsive, and that covers the majority of local AI use cases.

If you need bigger models, an RTX 4090 opens up the 32B tier, which is where local AI quality starts matching cloud APIs. And if you occasionally need 70B-class models, cloud GPU instances at $0.39-1.19/hr make more sense than buying dual GPUs for sporadic use.

Check your system with the commands above, pick a model that fits your VRAM, and start with ollama pull. You can always upgrade hardware later -- the models and configuration carry over unchanged.


Already have Ollama installed? Check our complete Ollama guide for advanced configuration, or browse the best Ollama models to find what works best for your hardware.

📅 Published: April 10, 2026 · 🔄 Last Updated: April 10, 2026

Written by Pattanaik Ramswarup, AI Engineer & Dataset Architect