First-Time Ollama Setup: 15 Mistakes Everyone Makes
Published on April 11, 2026 • 18 min read
I have watched dozens of people set up Ollama for the first time. Smart people, experienced developers, sysadmins with decades of Linux under their belts. They all hit the same walls. Not because the documentation is bad, but because there are assumptions baked into the setup process that nobody warns you about until you have already wasted two hours staring at a cryptic error message.
This is not a troubleshooting guide. This is the list of mistakes I have seen repeated so often that I can predict which one you will hit based on your OS and hardware. Every fix includes the exact command to diagnose and resolve the problem.
If you have not installed Ollama yet, start with the Windows installation guide or the Mac setup guide. Then come back here when something goes wrong. It will.
Mistake 1: Installing Wrong or Outdated GPU Drivers {#mistake-1}
This is the number one reason Ollama runs on CPU instead of GPU. The model loads, text generates, but it is painfully slow because your GPU is sitting idle.
The problem: Ollama needs CUDA 11.8+ for NVIDIA GPUs. Many systems ship with older drivers, or the driver version does not match the CUDA toolkit version.
Diagnose it:
# Check your current driver version
nvidia-smi
# You need Driver 520+ for CUDA 11.8, Driver 535+ for CUDA 12.2
# If this command fails, you have no NVIDIA driver installed at all
Fix it:
# Ubuntu/Debian - install latest driver
sudo apt update
sudo apt install nvidia-driver-550
# Reboot required after driver install
sudo reboot
# Verify after reboot
nvidia-smi
# Should show Driver Version: 550.xx and CUDA Version: 12.x
For AMD GPUs: Ollama supports ROCm on Linux. You need ROCm 5.7+ and a supported GPU (RX 6000/7000 series or Instinct). Check the Ollama GitHub repo for the current compatibility matrix.
Mistake 2: Pulling a Model Too Large for Your RAM {#mistake-2}
Everyone wants to run Llama 3.1 70B. Almost nobody has the hardware for it. The model pulls successfully, but inference either crashes or crawls at 0.5 tokens per second because the system is swapping to disk.
The rule of thumb: A model needs roughly 1.2x its file size in RAM during inference. A Q4_K_M quantized 70B model is about 40GB on disk, so you need at least 48GB of available RAM.
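The 1.2x rule of thumb is easy to sanity-check in a few lines of Python. The 1.2 factor is the article's approximation, not an exact figure; real overhead varies with context size and runtime:

```python
# Rough RAM estimate for running a quantized model, using the ~1.2x
# rule of thumb. The overhead factor is an approximation.

def ram_needed_gb(model_file_gb: float, overhead: float = 1.2) -> float:
    """Estimate RAM needed to run a model of the given on-disk size."""
    return model_file_gb * overhead

def fits(model_file_gb: float, available_ram_gb: float) -> bool:
    """True if the model should fit without swapping."""
    return ram_needed_gb(model_file_gb) <= available_ram_gb

# A ~40GB Q4 70B model needs about 48GB of RAM:
print(ram_needed_gb(40))   # 48.0
print(fits(40, 32))        # False - a 32GB machine will swap
```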
Diagnose it:
# Check available memory
free -h # Linux
vm_stat # macOS
# Check model sizes you have pulled
ollama list
# The SIZE column shows each model's on-disk size - a close proxy for the RAM it needs
What actually fits:
| Available RAM | Max Model Size | Best Models |
|---|---|---|
| 8GB | 3B-7B (Q4) | Phi-4 Mini, Llama 3.2 3B, Gemma 3 1B |
| 16GB | 7B-13B (Q4) | Llama 3.2 7B, Mistral 7B, CodeLlama 13B |
| 32GB | 13B-30B (Q4) | Mixtral 8x7B, Llama 3.1 13B, Qwen 2.5 32B |
| 64GB+ | 70B (Q4) | Llama 3.1 70B, Qwen 2.5 72B |
If you have 8GB of RAM, stop trying to run anything larger than 7B. Read our RAM requirements guide for detailed model-to-memory mapping.
Mistake 3: Not Setting OLLAMA_HOST for Remote Access {#mistake-3}
By default, Ollama only listens on 127.0.0.1:11434. If you try to connect from another machine on your network, or from a Docker container, or from Open WebUI running on a different port, you get "connection refused."
Diagnose it:
# Check what Ollama is listening on
ss -tlnp | grep 11434 # Linux
lsof -i :11434 # macOS
# If it shows 127.0.0.1:11434, it is localhost-only
Fix it:
# Option 1: Environment variable (temporary)
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Option 2: Systemd override (persistent on Linux)
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Option 3: launchd on macOS
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# Then restart Ollama from the menu bar
Security note: Setting 0.0.0.0 exposes Ollama to your entire network. If you are on a shared network, use a firewall rule to restrict access to specific IPs.
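Before blaming OLLAMA_HOST, it helps to confirm whether anything is accepting connections on the port at all. A minimal TCP probe like the sketch below distinguishes "server not listening" from application-level errors (the `192.168.1.50` address in the comment is a hypothetical LAN IP):

```python
import socket

def ollama_reachable(host: str = "127.0.0.1", port: int = 11434,
                     timeout: float = 2.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, host unreachable, ...
        return False

# From another machine, test the server's LAN address, not localhost:
# ollama_reachable("192.168.1.50")  -> False means a firewall or OLLAMA_HOST issue
```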
Mistake 4: Windows PATH Not Updated After Install {#mistake-4}
On Windows, the Ollama installer adds itself to PATH, but the terminal session you had open before installation does not pick up the change. You type ollama and get "'ollama' is not recognized."
Fix it:
# Option 1: Just close and reopen your terminal
# Option 2: Refresh PATH in current session
$env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path","User")
# Option 3: Verify Ollama is in PATH
where.exe ollama
# Should return: C:\Users\YourName\AppData\Local\Programs\Ollama\ollama.exe
If the path is completely missing, the installer may have failed silently. Reinstall from ollama.com and run as Administrator.
For a complete Windows walkthrough, see our Ollama Windows installation guide.
Mistake 5: Forgetting to Enable WSL2 on Windows {#mistake-5}
If you are running Ollama inside WSL2 (instead of the native Windows build), you need WSL2 properly configured with GPU passthrough. WSL1 does not support CUDA.
Diagnose it:
# Check WSL version
wsl --list --verbose
# VERSION column must say 2, not 1
# Inside WSL, check GPU access
nvidia-smi
# If this fails inside WSL, GPU passthrough is not working
Fix it:
# Upgrade WSL1 to WSL2
wsl --set-version Ubuntu 2
# If WSL2 is not installed at all
wsl --install
# Ensure Windows GPU driver is recent (not the Linux driver inside WSL)
# WSL2 uses the Windows host driver - do NOT install nvidia-driver inside WSL
Critical detail: Inside WSL2, you should NOT install NVIDIA drivers. The Windows host driver is shared automatically. Installing Linux NVIDIA drivers inside WSL2 will break GPU access.
Mistake 6: Choosing the Wrong Quantization Level {#mistake-6}
You see a model available in Q2_K, Q4_K_M, Q5_K_M, Q8_0, and F16. You pick Q2 because it is the smallest, and then wonder why the output is garbled nonsense.
The hierarchy:
| Quantization | Size vs F16 | Quality | When to Use |
|---|---|---|---|
| F16 | 100% | Perfect | You have unlimited RAM |
| Q8_0 | 50% | Near-perfect | You have plenty of RAM |
| Q6_K | 38% | Excellent | Best quality-to-size ratio |
| Q5_K_M | 33% | Very good | Daily driver sweet spot |
| Q4_K_M | 25% | Good | Most popular choice |
| Q3_K_M | 19% | Acceptable | Tight on RAM |
| Q2_K | 13% | Poor | Avoid unless desperate |
My recommendation: Start with Q4_K_M. It is the default Ollama quantization for good reason. Drop to Q3_K_M only if the model does not quite fit in RAM. Never use Q2 unless you are testing whether a model architecture works at all.
# Pull a specific quantization
ollama pull llama3.2:7b-q4_K_M
ollama pull llama3.2:7b-q5_K_M
# Check what quantization you have
ollama show llama3.2 --modelfile
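You can estimate how big any quantization will be before pulling it. The bits-per-weight figures below are rough averages for llama.cpp quant formats, so treat the results as ballpark values:

```python
# Approximate on-disk size of a quantized model from its parameter count.
# Bits-per-weight values are rough averages, not exact format specs.
BITS_PER_WEIGHT = {
    "F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.85, "Q3_K_M": 3.9, "Q2_K": 2.6,
}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Convert parameter count + quant format to an estimated size in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

print(round(model_size_gb(70, "Q4_K_M"), 1))  # about 42 GB, matching the ~40GB figure above
```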
Mistake 7: Docker GPU Passthrough Not Configured {#mistake-7}
You run Ollama in Docker, pull a model, start generating, and it is extremely slow. Docker does not pass through the GPU by default.
Diagnose it:
# Inside the container, check for GPU
docker exec -it ollama nvidia-smi
# If this fails, GPU passthrough is not working
Fix it:
# Step 1: Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Step 2: Run Ollama with GPU access
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
The key flag is --gpus all. Without it, the container has zero GPU access regardless of what is installed on the host.
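If you prefer Docker Compose, the equivalent of `--gpus all` is a device reservation. A minimal sketch, assuming the NVIDIA Container Toolkit from Step 1 is already installed on the host:

```yaml
# docker-compose.yml - Ollama with GPU access
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```

Start it with `docker compose up -d`, then re-run the `nvidia-smi` check inside the container to confirm passthrough.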
Mistake 8: macOS Gatekeeper Blocking Ollama {#mistake-8}
On macOS, downloading Ollama directly (not through Homebrew) triggers Gatekeeper. You get "Ollama cannot be opened because the developer cannot be verified." Double-clicking does nothing.
Fix it:
# Option 1: Right-click → Open (bypasses Gatekeeper once)
# In Finder, right-click Ollama.app → Open → Click "Open" in the dialog
# Option 2: Remove the quarantine attribute
xattr -cr /Applications/Ollama.app
# Option 3: Use Homebrew instead (no Gatekeeper issues)
brew install ollama
Homebrew-installed packages are not flagged by Gatekeeper because Homebrew does not set the quarantine attribute that browsers attach to downloads. This is the simplest way to avoid the issue entirely. See the full Mac setup guide for the Homebrew approach.
Mistake 9: Running Out of Disk Space Mid-Download {#mistake-9}
Ollama stores models in ~/.ollama/models/ by default. A single 70B Q4 model is 40GB. If you are pulling multiple models on a 256GB laptop, you run out fast, and the error message is not always clear about what happened.
Diagnose it:
# Check disk usage of Ollama models
du -sh ~/.ollama/models/
# Check remaining disk space
df -h / # Linux and macOS
# List all pulled models with sizes
ollama list
Fix it:
# Remove models you are not using
ollama rm mixtral:8x7b
ollama rm llama3.1:70b
# Move model storage to a larger drive
# Stop Ollama first
sudo systemctl stop ollama # Linux
# or quit from menu bar # macOS
# Move and symlink
mv ~/.ollama /mnt/bigdrive/ollama
ln -s /mnt/bigdrive/ollama ~/.ollama
# Or set the environment variable
export OLLAMA_MODELS=/mnt/bigdrive/ollama/models
Planning ahead: Budget 10GB per 7B model, 25GB per 30B model, 45GB per 70B model. If you plan to experiment with multiple models, you want at least 100GB free.
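The per-model budgets above can be turned into a quick planning check against your actual free space. A sketch using only the standard library (the budget numbers are the article's ballpark figures for Q4 models):

```python
import shutil

# Ballpark per-model disk budgets from this article (Q4 quantization)
BUDGET_GB = {"7b": 10, "30b": 25, "70b": 45}

def disk_plan(models: list[str], path: str = "/"):
    """Return (needed_gb, free_gb) for a planned set of model sizes."""
    needed = sum(BUDGET_GB[m] for m in models)
    free = shutil.disk_usage(path).free / 1e9
    return needed, free

needed, free = disk_plan(["7b", "7b", "30b"])
print(f"need ~{needed}GB, {free:.0f}GB free")
```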
Mistake 10: Using the Wrong Model for the Task {#mistake-10}
Not all models are general-purpose. Running a coding model for creative writing produces stiff, mechanical text. Running a general model for code completion produces syntactically broken garbage.
Task-to-model mapping:
| Task | Best Model | Why |
|---|---|---|
| General chat | Llama 3.2 7B, Mistral 7B | Trained on diverse data |
| Code generation | CodeLlama 7B/13B, DeepSeek Coder | Code-specific training |
| Creative writing | Llama 3.2 7B, Qwen 2.5 7B | Better narrative coherence |
| Summarization | Phi-4 Mini (3.8B) | Fast, great at extraction |
| Math/reasoning | Qwen 2.5 Math, DeepSeek R1 | Explicit reasoning chains |
| Document Q&A | Llama 3.2 7B + RAG | Good context following |
# Pull task-specific models
ollama pull codellama:7b # For code
ollama pull llama3.2:7b # For general use
ollama pull deepseek-r1:7b # For reasoning
Check our best Ollama models guide for benchmark comparisons across tasks.
Mistake 11: Ignoring Context Window Limits {#mistake-11}
Every model has a maximum context window. Llama 3.2 defaults to 2048 tokens in Ollama. If you paste in a 5,000-word document and ask questions about it, the model silently drops the beginning of the input. The answers are wrong, but they sound confident.
Diagnose it:
# Check current context window
ollama show llama3.2 --modelfile | grep num_ctx
# Default is often 2048 - way too small for long documents
Fix it:
# Set a larger context from inside an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 8192
# Or create a Modelfile with a larger default
cat > Modelfile << 'EOF'
FROM llama3.2
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
EOF
ollama create llama3.2-long -f Modelfile
ollama run llama3.2-long
The tradeoff: Doubling the context window roughly doubles RAM usage for the KV cache. On an 8GB machine running a 7B model, going from 2048 to 8192 context might push you into swap. Monitor memory usage when increasing context.
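The KV-cache growth is linear in context length, which you can see from the standard size formula. The layer and head counts below are assumed values for a typical Llama-style 7-8B model with grouped-query attention; check your model card for the real figures:

```python
# Rough KV-cache size estimate, showing why context length scales RAM
# linearly. Architecture numbers are assumptions for a 7-8B GQA model.

def kv_cache_gb(num_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x tokens x kv_heads x head_dim x dtype size."""
    return 2 * n_layers * num_ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

print(round(kv_cache_gb(2048), 2))
print(round(kv_cache_gb(8192), 2))  # 4x the context -> 4x the cache
```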
Mistake 12: Not Creating a Custom Modelfile {#mistake-12}
Running ollama run llama3.2 uses default parameters. These defaults are conservative. A custom Modelfile lets you set system prompts, temperature, context size, and stop sequences, turning a generic model into a specialized tool.
Create one:
cat > Modelfile << 'EOF'
FROM llama3.2:7b
# System prompt - define the model's role
SYSTEM """You are a senior software engineer. You write clean, well-documented code.
You explain your reasoning before writing code. You always include error handling."""
# Parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
# Stop sequences
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
EOF
# Build the custom model
ollama create my-coder -f Modelfile
# Run it
ollama run my-coder
You can create multiple Modelfiles for different tasks: one for coding, one for writing, one for analysis. Each loads the same base model weights but behaves differently.
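If you maintain several task-specific Modelfiles, generating them from one helper keeps the parameters consistent. A sketch (the `make_modelfile` helper and the `my-writer` name are illustrative, not part of Ollama; the PARAMETER keys are real Modelfile directives):

```python
# Generate Modelfile text for different roles, so multiple task-specific
# models stay consistent. Write the result to a file, then build it with
# `ollama create <name> -f Modelfile`.

def make_modelfile(base: str, system: str, **params) -> str:
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"

mf = make_modelfile(
    "llama3.2:7b",
    "You are a concise technical writer.",
    temperature=0.4,
    num_ctx=4096,
)
print(mf)
# with open("Modelfile", "w") as f:
#     f.write(mf)
```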
Mistake 13: Missing CUDA Toolkit (Not Just the Driver) {#mistake-13}
The NVIDIA driver alone is not enough on some setups. If you are building Ollama from source, or running certain GPU-accelerated backends, you need the CUDA Toolkit installed separately.
Diagnose it:
# Check if nvcc (CUDA compiler) is available
nvcc --version
# If this fails, the CUDA toolkit is not installed
# Check if the driver alone is present
nvidia-smi
# This can succeed even without the toolkit
Fix it:
# Ubuntu - install CUDA toolkit
sudo apt install nvidia-cuda-toolkit
# Or install a specific version
# Visit: https://developer.nvidia.com/cuda-downloads
# Select your OS and follow the instructions
# After install, add to PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Note: For standard Ollama binary installs (not from source), you typically only need the driver. The CUDA toolkit matters when building from source or using advanced features.
Mistake 14: Thermal Throttling on Laptops {#mistake-14}
Your laptop runs a 7B model at 30 tok/s for the first minute, then drops to 15 tok/s. The GPU is hitting thermal limits and clocking down. This is especially common on gaming laptops in "quiet" mode and ultrabooks with thin cooling.
Diagnose it:
# Monitor GPU temperature (NVIDIA)
watch -n 1 nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
# Throttling starts around 83-90C depending on the GPU
# If you see temperatures above 85C, you are throttling
# macOS - check thermal state
sudo powermetrics --samplers thermal -i 1000 -n 5
Fix it:
- Use a laptop cooling pad (seriously, $20 makes a measurable difference)
- Set your laptop to "Performance" mode in power settings
- On NVIDIA, set a power limit: sudo nvidia-smi -pl 120 (120W instead of 150W reduces heat with roughly 10% performance loss)
- Elevate the back of the laptop for better airflow
- On macOS, close the lid and use an external display to let the bottom breathe
Long sessions: If you are running inference for hours, desktop hardware or a server makes more sense. See our system requirements guide for hardware recommendations.
Mistake 15: Trying to Fine-Tune When Prompting Would Work {#mistake-15}
You want the model to respond in a specific format. Or always speak in a certain tone. Or know about your company's products. So you start looking into fine-tuning, which requires datasets, GPU hours, and a decent amount of ML knowledge.
Before you fine-tune, try these in order:
- System prompt: Most behavioral changes are solved by a good system prompt in a Modelfile
- Few-shot examples: Include 2-3 examples of the desired input/output format in your prompt
- RAG (Retrieval-Augmented Generation): For knowledge about your documents, RAG is almost always better than fine-tuning
- Fine-tuning: Only if the above three fail and you need deeply baked-in behavior changes
# Example: Instead of fine-tuning for JSON output, use a system prompt
cat > Modelfile << 'EOF'
FROM llama3.2:7b
SYSTEM """You are a data extraction assistant. You ALWAYS respond in valid JSON format.
Never include explanations outside the JSON. Example output format:
{"name": "value", "category": "value", "confidence": 0.95}"""
PARAMETER temperature 0.3
EOF
ollama create json-extractor -f Modelfile
# Single quotes keep the $ in $999 from being expanded by the shell
echo 'Extract entities: The new iPhone 16 Pro costs $999 and has 8GB RAM' | ollama run json-extractor
Fine-tuning has its place, but 90% of "I need to fine-tune" situations are actually "I need a better prompt." Save yourself weeks of work and try prompting first.
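Whichever prompting approach you use for structured output, validate the response on your side before trusting it. A minimal sketch of the validate-and-retry pattern (the retry loop and any client call that produces `text` are left to your setup):

```python
import json

# Validate a model response that is supposed to be a JSON object.
# If this returns None, re-prompt rather than parsing by hand.

def parse_json_response(text):
    """Return the parsed dict, or None if the model drifted off-format."""
    try:
        obj = json.loads(text.strip())
        return obj if isinstance(obj, dict) else None
    except json.JSONDecodeError:
        return None

print(parse_json_response('{"name": "iPhone 16 Pro", "price": 999}'))
print(parse_json_response("Sure! Here is the JSON: {..."))  # None - retry this one
```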
The Quick-Reference Checklist {#checklist}
Before you spend an hour debugging, run through this list:
# 1. Is Ollama actually running?
curl http://localhost:11434/api/version
# 2. Is the GPU detected?
nvidia-smi # NVIDIA
rocm-smi # AMD
# 3. Is the model fully downloaded?
ollama list
# 4. How much RAM is available right now?
free -h # Linux
vm_stat # macOS
# 5. Is Ollama using the GPU?
# While a model is running:
nvidia-smi # Check "Processes" section at the bottom
# 6. What is the actual model size vs your RAM?
ollama show llama3.2 --modelfile
# 7. Is the port accessible?
curl http://localhost:11434/api/tags
If all seven checks pass and you still have issues, check the Ollama logs:
# Linux (systemd)
journalctl -u ollama -f
# macOS
cat ~/.ollama/logs/server.log
# Windows
# Check: %LOCALAPPDATA%\Ollama\server.log
Conclusion
Every one of these mistakes has cost someone at least an hour. Some have cost entire weekends. The pattern is almost always the same: something that should "just work" does not, and the error message points in the wrong direction.
Bookmark this page. The next time Ollama acts up, run the seven-command checklist first. Most problems resolve in under five minutes once you know where to look.
For ongoing setup and optimization, our Ollama system requirements guide covers hardware planning in depth, and the complete Ollama guide walks through advanced configuration.
Built something with Ollama? Hit a wall not covered here? Our community courses cover advanced troubleshooting scenarios including multi-GPU setups, clustered inference, and production deployments.