Ollama System Requirements: What You Actually Need to Run It
Published on April 10, 2026 -- 20 min read
Ollama's documentation says "macOS, Linux, Windows" and leaves it at that. But when you pull a 14B model and your system grinds to a halt, the missing details matter. How much VRAM for a 32B model? Does your AMD GPU work? Can a CPU-only machine handle it?
I have tested Ollama across 14 different hardware configurations -- from a Raspberry Pi 5 to dual RTX 4090s -- and documented exactly what works, what barely works, and what does not work at all. This guide gives you the specific specs, the real performance numbers, and the commands to check whether your system is ready.
Minimum vs Recommended Specs {#minimum-vs-recommended}
Absolute Minimum (Will Run, But Slowly)
| Component | Requirement |
|---|---|
| OS | macOS 11+, Windows 10+, Linux (glibc 2.31+) |
| CPU | Any x86_64 with AVX2 or Apple Silicon |
| RAM | 8 GB system memory |
| Storage | 10 GB free space |
| GPU | None (CPU-only mode) |
With minimum specs, expect 3-8 tokens/sec on a 7B model. Usable for testing. Painful for actual work.
Recommended (Comfortable Daily Use)
| Component | Requirement |
|---|---|
| OS | macOS 13+, Windows 11, Ubuntu 22.04+ |
| CPU | 8+ core modern CPU (AMD Ryzen 5000+, Intel 12th Gen+) |
| RAM | 16 GB system memory |
| GPU | NVIDIA RTX 3060 12GB or Apple Silicon M1 16GB |
| Storage | 50 GB SSD (NVMe preferred) |
This setup runs 7B-14B models smoothly at 30-60 tokens/sec. Good enough for code assistance, chat, and RAG pipelines.
Optimal (Power User / Small Team)
| Component | Requirement |
|---|---|
| OS | Ubuntu 22.04 LTS or macOS 14+ |
| CPU | 16+ core (AMD Ryzen 9 or Apple M3 Pro/Max) |
| RAM | 32-64 GB |
| GPU | NVIDIA RTX 4090 24GB or Apple M3 Max 48GB |
| Storage | 200 GB NVMe SSD |
Handles 32B models at interactive speeds. Serves multiple concurrent users through Ollama's API.
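That API is a plain HTTP interface on port 11434. As a quick illustration, here is a minimal Python sketch that builds a request for the documented `/api/generate` endpoint; the model name and prompt are placeholders, and you would POST the payload with any HTTP client:

```python
import json

# Ollama listens on localhost:11434 by default; /api/generate is the
# documented one-shot completion endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str, num_ctx: int = 4096) -> dict:
    """Build the JSON payload for a non-streaming generate call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,               # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},
    }

payload = build_request("llama3.1:8b", "Why is the sky blue?")
print(json.dumps(payload, indent=2))
```

Send it with `requests.post(OLLAMA_URL, json=payload)` or an equivalent `curl -d` call; each concurrent user is just another POST against the same server.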
GPU Requirements: NVIDIA, AMD, Apple {#gpu-requirements}
NVIDIA GPUs (Best Support)
Ollama uses CUDA for NVIDIA GPU acceleration. The official Ollama repository lists compute capability 5.0+ as the minimum.
Supported NVIDIA GPUs:
| GPU Family | Compute Capability | VRAM Range | Status |
|---|---|---|---|
| GTX 900 series | 5.2 | 2-8 GB | Minimal (7B Q3 only) |
| GTX 1000 series | 6.1 | 3-11 GB | Basic (7B models) |
| RTX 2000 series | 7.5 | 6-11 GB | Good (7B-14B) |
| RTX 3000 series | 8.6 | 8-24 GB | Great (7B-32B) |
| RTX 4000 series | 8.9 | 8-24 GB | Excellent (7B-32B) |
| RTX 5000 series | 12.0 | 12-32 GB | Excellent (7B-32B) |
| A100/A6000 | 8.0-8.6 | 40-80 GB | Professional (7B-70B) |
Driver requirements:
- Linux: NVIDIA driver 450.80.02 or later
- Windows: NVIDIA driver 452.39 or later
Check your driver and CUDA version:
```bash
# Linux/Windows
nvidia-smi
# Look for "CUDA Version" and "Driver Version" in the output
```
Ollama bundles its own CUDA runtime, so you do not need to install the CUDA Toolkit separately. Just the NVIDIA driver.
AMD GPUs (Linux ROCm)
AMD support works through ROCm (Radeon Open Compute). As of April 2026, support is solid on Linux but experimental on Windows.
Supported AMD GPUs:
| GPU | VRAM | ROCm Support | Performance vs NVIDIA Equiv. |
|---|---|---|---|
| RX 6700 XT | 12 GB | Full | ~70% of RTX 3060 12GB |
| RX 6800 XT | 16 GB | Full | ~65% of RTX 3070 Ti |
| RX 7600 | 8 GB | Full | ~75% of RTX 4060 |
| RX 7900 XTX | 24 GB | Full | ~70% of RTX 4090 |
| Radeon PRO W7900 | 48 GB | Full | ~60% of A6000 |
| MI250X | 128 GB | Full | Competitive with A100 |
ROCm installation (Ubuntu):
```bash
# Add the ROCm repository
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
sudo dpkg -i amdgpu-install_6.0.60002-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm

# Verify the GPU is visible
rocminfo | grep "Name:"
```
The performance gap between AMD and NVIDIA exists because CUDA has 15+ years of AI optimization behind it. ROCm is improving rapidly, but for maximum local AI performance per dollar, NVIDIA still leads.
Apple Silicon (Metal)
Every Apple Silicon chip supports Ollama through Metal GPU acceleration -- no configuration needed.
Performance by Apple Silicon chip:
| Chip | Unified Memory | 7B tok/s | 14B tok/s | 32B tok/s |
|---|---|---|---|---|
| M1 (8 GPU cores) | 8-16 GB | 25-35 | 12-18 | Too slow |
| M1 Pro (16 GPU cores) | 16-32 GB | 35-45 | 20-28 | 8-12 |
| M2 (10 GPU cores) | 8-24 GB | 30-40 | 15-22 | Too slow |
| M2 Max (38 GPU cores) | 32-96 GB | 40-55 | 25-35 | 12-18 |
| M3 (10 GPU cores) | 8-24 GB | 35-45 | 18-25 | Too slow |
| M3 Max (40 GPU cores) | 36-128 GB | 50-65 | 30-42 | 15-22 |
| M3 Ultra (80 GPU cores) | 64-192 GB | 60-75 | 40-52 | 22-30 |
| M4 Max (40 GPU cores) | 36-128 GB | 55-70 | 35-48 | 18-26 |
Apple Silicon benefits from unified memory -- the GPU accesses the same RAM as the CPU with no copy overhead. A 32 GB M2 Max can run models that would need a 24 GB-class discrete GPU on the NVIDIA side (macOS reserves a slice of unified memory for the system, so not quite all of it is available to the GPU), though NVIDIA hardware generates tokens faster at equivalent memory sizes.
For detailed Apple Silicon setup, see our Mac local AI setup guide.
VRAM Needs Per Model Size {#vram-per-model}
This is the table that answers the most common question: "Can my GPU run this model?"
All values are for Q4_K_M quantization (the recommended sweet spot) with 4K context:
| Model Size | VRAM Required | Example Models | Fits On |
|---|---|---|---|
| 1B-3B | 1.5-2.5 GB | Phi-3 Mini, Llama 3.2 1B | Any 4GB+ GPU |
| 7B-8B | 4-6 GB | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B | RTX 3060 8GB, RTX 4060 8GB |
| 13B-14B | 8-10 GB | Qwen 2.5 14B, Phi-3 Medium | RTX 3060 12GB, RTX 4060 Ti 16GB |
| 20B-22B | 12-14 GB | Mistral Small 22B | RTX 4060 Ti 16GB, RX 7900 XT 20GB |
| 32B-34B | 18-22 GB | Qwen 2.5 32B, DeepSeek R1 32B | RTX 4090 24GB, RX 7900 XTX 24GB |
| 70B | 38-42 GB | Llama 3.3 70B, Qwen 2.5 72B | A6000 48GB, 2x RTX 4090 |
Context length increases memory usage through the KV cache. For current models that use grouped-query attention, each additional 1K context tokens adds roughly:
- 7B model: ~0.1-0.2 GB
- 14B model: ~0.2-0.3 GB
- 32B model: ~0.25-0.4 GB
- 70B model: ~0.3-0.5 GB
Older architectures without grouped-query attention need several times more. A 32B model at Q4_K_M with 32K context might need 28-32 GB instead of 20 GB.
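Putting weights and cache growth together gives a quick "will it fit" estimator. This sketch uses my own assumed midpoints for base VRAM and per-1K KV-cache growth on modern grouped-query-attention models -- treat the output as a ballpark, not a guarantee:

```python
# Ballpark VRAM estimate for a Q4_K_M model: base weights at 4K context
# plus KV-cache growth per extra 1K tokens. The per-1K figures are rough
# assumed midpoints for grouped-query-attention models, not exact values.

BASE_VRAM_GB = {7: 5.0, 14: 9.0, 32: 20.0, 70: 40.0}
KV_GB_PER_1K = {7: 0.15, 14: 0.25, 32: 0.3, 70: 0.4}

def estimate_vram_gb(model_b: int, context_tokens: int) -> float:
    """Approximate VRAM needed at a given context length (in tokens)."""
    extra_1k = max(0.0, (context_tokens - 4096) / 1024)
    return BASE_VRAM_GB[model_b] + extra_1k * KV_GB_PER_1K[model_b]

print(f"32B at 32K context: ~{estimate_vram_gb(32, 32768):.0f} GB")
```

If the estimate lands above your card's VRAM, either drop the context length or accept partial CPU offload and the speed penalty that comes with it.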
What happens when a model exceeds VRAM? Ollama automatically offloads layers to system RAM (CPU). This works, but speed drops by 5-10x. If you see generation speeds below 5 tokens/sec on a GPU system, partial CPU offloading is likely happening. Check with:
```bash
# See how many layers were offloaded to GPU (logged by the server at load)
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "offload"
# Then, in another terminal:
ollama run llama3.1:8b "test"
```
CPU-Only Performance {#cpu-only-performance}
No GPU? Ollama still works. Here is what to expect:
CPU Performance by Processor
| CPU | Cores/Threads | 7B Q4 tok/s | 14B Q4 tok/s | RAM Needed |
|---|---|---|---|---|
| Intel i5-12400 | 6/12 | 5-7 | 2-4 | 16 GB |
| Intel i7-13700K | 16/24 | 8-12 | 5-7 | 16 GB |
| Intel i9-14900K | 24/32 | 12-16 | 7-10 | 32 GB |
| AMD Ryzen 5 5600X | 6/12 | 5-8 | 3-5 | 16 GB |
| AMD Ryzen 7 7800X3D | 8/16 | 9-13 | 5-8 | 16 GB |
| AMD Ryzen 9 7950X | 16/32 | 14-18 | 8-12 | 32 GB |
CPU inference tips:
- RAM speed matters. DDR5-6000 gives 15-25% faster inference than DDR4-3200.
- More cores help, but diminishing returns after 12-16 threads.
- AVX-512 support (Intel 12th Gen+, AMD Zen 4) improves performance by 10-20%.
- Close memory-hungry applications. Ollama needs contiguous RAM blocks.
```bash
# Force CPU-only mode even if a GPU is present
CUDA_VISIBLE_DEVICES="" ollama serve

# Set the CPU thread count via the num_thread model parameter
cat > Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_thread 12
EOF
ollama create cpu-tuned -f Modelfile
```
Bottom line: CPU-only is viable for 7B models if you have a modern 8+ core processor. Anything above 14B on CPU is an exercise in patience.
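Those numbers track memory bandwidth more than core count: generating one token streams essentially the whole model through RAM, so peak bandwidth divided by model size gives a hard ceiling on tokens per second. A minimal sketch, using approximate dual-channel peak bandwidths as assumptions:

```python
# Token generation is largely memory-bandwidth-bound: every token reads
# (roughly) the whole model from RAM. Ceiling = bandwidth / model size.

def max_tokens_per_sec(bandwidth_gbps: float, model_size_gb: float) -> float:
    return bandwidth_gbps / model_size_gb

DDR4_3200_DUAL = 51.2   # GB/s, approximate dual-channel peak
DDR5_6000_DUAL = 96.0

# Llama 3.1 8B at Q4_K_M occupies ~4.7 GB in RAM
for name, bw in [("DDR4-3200", DDR4_3200_DUAL), ("DDR5-6000", DDR5_6000_DUAL)]:
    print(f"{name}: <= {max_tokens_per_sec(bw, 4.7):.0f} tok/s theoretical ceiling")
```

Real-world throughput lands well below the ceiling (cores, cache behavior, and prompt processing all take their cut), but the ratio explains why DDR5 systems in the table above consistently beat DDR4 ones with similar core counts.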
OS-Specific Requirements {#os-specific}
Windows
| Requirement | Details |
|---|---|
| OS Version | Windows 10 22H2+ or Windows 11 |
| Architecture | x86_64 only (ARM64 not yet supported) |
| NVIDIA Driver | 452.39+ for CUDA support |
| RAM | 8 GB minimum (16 GB recommended) |
| Storage | Models stored in C:\Users\<user>\.ollama\models |
| Firewall | Allow port 11434 for API access |
Windows-specific install:
```bash
# Download and install the native Windows app (no WSL needed)
winget install Ollama.Ollama

# Verify installation
ollama --version

# Check GPU detection with a small model
ollama run llama3.2:1b "hello" 2>&1
```
For detailed Windows setup, see our Ollama Windows installation guide.
macOS
| Requirement | Details |
|---|---|
| OS Version | macOS 11 Big Sur+ (13 Ventura+ recommended) |
| Architecture | Apple Silicon (M1+) or Intel x86_64 |
| GPU | Metal acceleration automatic on Apple Silicon |
| RAM | 8 GB minimum (16 GB recommended for Apple Silicon) |
| Storage | Models stored in ~/.ollama/models |
macOS install:
```bash
# Homebrew (recommended)
brew install ollama

# Or download the desktop app from ollama.com
# (the official curl install script targets Linux, not macOS)

# Start the background service
brew services start ollama
```
Intel Macs work but are significantly slower than Apple Silicon. An Intel i9 MacBook Pro (2019) generates roughly 4-6 tokens/sec with a 7B model. If you have an Intel Mac, consider upgrading to Apple Silicon or using a cloud GPU.
Linux
| Requirement | Details |
|---|---|
| Kernel | 5.4+ (5.15+ recommended) |
| glibc | 2.31+ |
| NVIDIA | Driver 450.80.02+ for CUDA |
| AMD | ROCm 5.7+ for GPU support |
| RAM | 8 GB minimum |
| Storage | Models stored in ~/.ollama/models (or /usr/share/ollama if installed as service) |
Linux install:
```bash
# One-line install (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection with a small model
ollama run llama3.2:1b "test"

# Check which backend is in use
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "cuda\|rocm\|metal"
```
Linux gets the best GPU performance. The NVIDIA driver on Linux is generally 5-10% faster than the same driver version on Windows for AI inference workloads. ROCm is Linux-only for full AMD support.
Storage Requirements {#storage-requirements}
Model Sizes on Disk
| Model | Q4_K_M Size | Q5_K_M Size | Q8_0 Size | FP16 Size |
|---|---|---|---|---|
| Llama 3.2 1B | 0.8 GB | 1.0 GB | 1.5 GB | 2.5 GB |
| Llama 3.1 8B | 4.7 GB | 5.5 GB | 8.5 GB | 16 GB |
| Qwen 2.5 14B | 8.9 GB | 10.5 GB | 15.5 GB | 29 GB |
| Qwen 2.5 32B | 19.8 GB | 23.5 GB | 34 GB | 65 GB |
| Llama 3.3 70B | 40.5 GB | 48 GB | 74 GB | 140 GB |
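These sizes follow from a simple rule: parameters × effective bits per weight ÷ 8, plus a small margin for embeddings and metadata. The bits-per-weight values below are approximate effective rates I am assuming for each llama.cpp quantization scheme, so treat the result as an estimate:

```python
# Quantized model size ~= parameters * effective bits-per-weight / 8.
# The bits-per-weight figures are approximate effective rates for each
# llama.cpp quantization scheme (assumed, not exact spec values).

BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

def model_size_gb(params_billions: float, quant: str) -> float:
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"8B at Q4_K_M: ~{model_size_gb(8, 'Q4_K_M'):.1f} GB")
print(f"70B at Q4_K_M: ~{model_size_gb(70, 'Q4_K_M'):.1f} GB")
```

The estimates land within a few percent of the table above, which is close enough to size a disk or a download before pulling a model.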
Disk Speed Impact on Load Times
| Storage Type | 7B Load Time | 32B Load Time |
|---|---|---|
| NVMe SSD (Gen 4) | 1-2 sec | 5-8 sec |
| SATA SSD | 3-5 sec | 12-18 sec |
| HDD 7200 RPM | 8-15 sec | 40-60 sec |
| Network (NFS) | 10-25 sec | 60-120 sec |
Model loading only happens once per session. After the first load, the model stays in memory until it times out (default: 5 minutes of inactivity).
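The load times above are, to first order, model size divided by sequential read speed, plus per-session setup overhead. A minimal sketch, with throughput figures that are my assumptions for typical drives:

```python
# First-order load-time floor: bytes read / sequential throughput.
# Real loads add memory-mapping and setup overhead on top of this.

def load_time_sec(model_gb: float, read_gbps: float) -> float:
    return model_gb / read_gbps

# Assumed sequential reads: Gen4 NVMe ~7 GB/s, SATA SSD ~0.55, HDD ~0.15
for name, speed in [("NVMe Gen4", 7.0), ("SATA SSD", 0.55), ("HDD", 0.15)]:
    print(f"{name}: ~{load_time_sec(19.8, speed):.0f}s floor for a 32B Q4 model")
```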
```bash
# Keep the model loaded indefinitely (no timeout)
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Check how much space models are using
du -sh ~/.ollama/models/

# Move model storage to another drive (stop Ollama first)
mv ~/.ollama /mnt/fast-ssd/.ollama
ln -s /mnt/fast-ssd/.ollama ~/.ollama
```
Docker and Container Setup {#docker-requirements}
Running Ollama in Docker is common for team deployments and reproducible environments. Here are the requirements:
Docker with NVIDIA GPU
```bash
# Prerequisites: Docker 20.10+ and the NVIDIA Container Toolkit

# Install the NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run Ollama with GPU access
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama run llama3.1:8b "Hello"
```
Docker with AMD GPU
```bash
# Requires ROCm installed on the host
docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm
```
Docker CPU-Only
```bash
# No special flags needed
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```
Docker adds less than 2% performance overhead compared to bare-metal Ollama. The main consideration is ensuring GPU passthrough works correctly.
Cloud GPU Options {#cloud-gpu-options}
If your local hardware falls short, cloud GPU instances let you run Ollama remotely.
Cloud Provider Comparison
| Provider | GPU | VRAM | Hourly Cost | Best For |
|---|---|---|---|---|
| RunPod | RTX 4090 | 24 GB | $0.39/hr | 7B-32B models |
| RunPod | A100 80GB | 80 GB | $1.19/hr | 70B models |
| Lambda | A100 40GB | 40 GB | $0.75/hr | 32B-70B models |
| Vast.ai | RTX 3090 | 24 GB | $0.15-0.25/hr | Budget 7B-32B |
| Google Cloud | T4 | 16 GB | $0.35/hr | 7B-14B models |
| AWS | g5.xlarge (A10G) | 24 GB | $1.01/hr | Enterprise 7B-32B |
Setting Up Ollama on a Cloud GPU
```bash
# SSH into your cloud instance
ssh user@your-cloud-ip

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Bind to all interfaces so you can reach it remotely
OLLAMA_HOST=0.0.0.0:11434 ollama serve &

# Pull a model
ollama pull qwen2.5:32b

# From your local machine, point the CLI at the remote server:
export OLLAMA_HOST=http://your-cloud-ip:11434
ollama run qwen2.5:32b "test"
```
Cost comparison: A RunPod RTX 4090 at $0.39/hr for 8 hours/day costs $93/month. Buying an RTX 4090 ($1,600) breaks even in about 17 months. If you need 32B+ models daily, buying hardware is cheaper long-term. For occasional use or testing large models, cloud is more economical.
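The arithmetic behind that break-even, as a small sketch (the prices are the figures quoted above and will drift over time):

```python
# Months of cloud rental that add up to the card's purchase price.

def breakeven_months(card_price: float, hourly_rate: float,
                     hours_per_day: float, days_per_month: int = 30) -> float:
    monthly = hourly_rate * hours_per_day * days_per_month
    return card_price / monthly

m = breakeven_months(1600, 0.39, 8)   # RTX 4090 vs the RunPod rate above
print(f"~{m:.0f} months to break even")
```

Rerun it with your own usage pattern; at 2 hours/day the break-even stretches past five years, which is why occasional users should rent.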
Performance Optimization Tips {#optimization-tips}
1. GPU Layer Allocation
```bash
# By default Ollama loads as many layers onto the GPU as VRAM allows.
# To pin the layer count (useful for partial offload), set the num_gpu
# parameter in a Modelfile:
cat > Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_gpu 28
EOF
ollama create partial-offload -f Modelfile
```
2. Context Length Tuning
Shorter context = less VRAM = faster inference.
```bash
# Set the context length in a Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
EOF
ollama create fast-llama -f Modelfile
```
3. Concurrent Request Handling
```bash
# Allow 2 parallel requests (roughly doubles KV-cache VRAM)
OLLAMA_NUM_PARALLEL=2 ollama serve

# Max models kept loaded at once (default: 1, increase if you have VRAM)
OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```
4. Flash Attention
Ollama enables flash attention automatically on supported hardware. It reduces VRAM usage for the KV cache by 40-60% and speeds up long-context inference. Verify it is active:
```bash
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "flash"
```
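To see why the KV cache dominates long-context VRAM, here is the standard fp16 cache-size formula, using Llama 3.1 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) as the worked example:

```python
# fp16 KV-cache size: 2 tensors (K and V) x layers x KV heads x head dim
# x 2 bytes, per token. Architecture numbers are for Llama 3.1 8B.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1024**3

full = kv_cache_gb(32, 8, 128, 8192)   # 8K context, fp16
print(f"{full:.2f} GB fp16 KV cache at 8K context")
```

Quantizing the cache to q8_0 (Ollama's `OLLAMA_KV_CACHE_TYPE=q8_0`, which requires flash attention to be active) roughly halves that figure.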
5. System-Level Optimizations
```bash
# Linux: allow memory overcommit (helps when loading large models)
echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Linux: enable NVIDIA persistence mode (keeps the driver initialized)
sudo nvidia-smi -pm 1

# macOS: disable Spotlight indexing on the model volume
# (mdutil applies per volume, not per folder)
sudo mdutil -i off ~/.ollama

# All platforms: check whether swap is in use (bad for performance)
free -h   # Linux
vm_stat   # macOS
```
For more on the latest Ollama performance improvements, check the official blog.
Checking Your System {#checking-your-system}
Run these commands to verify your system meets the requirements before installing Ollama:
GPU Detection
```bash
# NVIDIA GPU check
nvidia-smi
# Output should show GPU name, VRAM, and driver version

# AMD GPU check (Linux)
rocminfo | grep "Name:"
# Should list your GPU model

# Apple Silicon check
system_profiler SPDisplaysDataType | grep "Chipset Model\|Metal\|VRAM"
# Should show "Metal: Supported"
```
Memory Check
```bash
# Linux (look at the "total" column)
free -h | grep Mem

# macOS
sysctl hw.memsize | awk '{printf "%.1f GB\n", $2/1073741824}'

# Windows (PowerShell)
(Get-CimInstance Win32_PhysicalMemory | Measure-Object -Property Capacity -Sum).Sum / 1GB
```
Storage Check
```bash
# Linux/macOS
df -h ~/.ollama 2>/dev/null || df -h ~

# Windows (PowerShell)
Get-PSDrive C | Select-Object Free
```
AVX Support Check (Required for CPU Inference)
```bash
# Linux: prints "avx2" if supported
grep -o -m1 'avx2' /proc/cpuinfo

# macOS (Intel only -- Apple Silicon uses NEON and always qualifies)
sysctl -a | grep -i avx
```
Post-Install Verification
```bash
# After installing Ollama, verify everything works
ollama --version
ollama pull llama3.2:1b   # small model for testing
ollama run llama3.2:1b "What is 2+2?"
# If you get a response, your setup is working

# Check which acceleration backend the server picked
OLLAMA_DEBUG=1 ollama serve 2>&1 | head -20
```
Conclusion
Ollama runs on almost anything with a CPU and 8 GB of RAM. But there is a wide gap between "runs" and "runs well." The sweet spot for most users is an RTX 4060 Ti 16GB ($400) or an Apple Silicon Mac with 16 GB unified memory. Either setup handles 7B-14B models at speeds that feel responsive, and that covers the majority of local AI use cases.
If you need bigger models, an RTX 4090 opens up the 32B tier, which is where local AI quality starts matching cloud APIs. And if you occasionally need 70B-class models, cloud GPU instances at $0.39-1.19/hr make more sense than buying dual GPUs for sporadic use.
Check your system with the commands above, pick a model that fits your VRAM, and start with ollama pull. You can always upgrade hardware later -- the models and configuration carry over unchanged.
Already have Ollama installed? Check our complete Ollama guide for advanced configuration, or browse the best Ollama models to find what works best for your hardware.