Complete Ollama Guide 2026: Install, Run & Manage Local AI Models
Ollama is a free, open-source tool that runs AI models locally on your computer with one command. Install it with curl -fsSL https://ollama.com/install.sh | sh, then run ollama run llama3.1 to start chatting. It supports 500+ models, works on macOS, Windows, and Linux, and requires zero cloud accounts or API keys.
Ollama is the simplest way to run large language models on your own computer. One command installs it. One command downloads a model. One command starts a conversation. No cloud accounts, no API keys, no monthly fees — your data stays on your machine.
This guide covers everything from first install to production deployment: model management, the REST API, custom Modelfiles, tool calling, GPU acceleration, Docker, and performance optimization. Whether you are setting up your first local AI or building applications, start here.
Table of Contents
- What Is Ollama?
- Install Ollama
- Your First Model
- Essential Commands
- Choosing the Right Model
- Custom Models with Modelfiles
- The Ollama REST API
- Tool Calling / Function Calling
- GPU Acceleration
- Docker Deployment
- Performance Optimization
- Chat Interfaces
- Troubleshooting
- FAQ
What Is Ollama? {#what-is-ollama}
Ollama is an open-source tool that runs large language models (LLMs) locally on your computer. It handles model downloading, quantization, GPU acceleration, and serving — all behind a simple CLI and REST API.
Why use Ollama over cloud AI (ChatGPT, Claude)?
- Privacy: Your conversations never leave your machine
- Cost: $0/month. No API fees, no usage limits
- Speed: No network latency — responses start instantly
- Offline: Works without internet after downloading models
- Control: Customize system prompts, temperature, context length
Ollama supports 500+ models from the Ollama library, including Llama 3.3 70B, Qwen 2.5, Mistral, DeepSeek R1, Gemma 2, and GPT-OSS. It runs on macOS, Windows, and Linux.
Install Ollama {#install-ollama}
macOS
# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg
# Option 2: Homebrew
brew install ollama
After installing, Ollama runs as a menubar app. It automatically uses Metal GPU acceleration on Apple Silicon — no configuration needed.
Windows
# Download the installer from https://ollama.com/download/windows
# Or use winget:
winget install Ollama.Ollama
For detailed Windows setup including WSL2 options, see our Ollama Windows Installation Guide.
Linux
# One-line install (recommended)
curl -fsSL https://ollama.com/install.sh | sh
# Or install manually:
# 1. Download binary from GitHub releases
# 2. Place in /usr/local/bin/
# 3. Create systemd service (see docs)
The install script detects your GPU (NVIDIA/AMD) and configures drivers automatically. For manual NVIDIA setup, ensure CUDA 11.8+ drivers are installed.
Verify Installation
ollama --version
# Should output: ollama version 0.5.x
Your First Model {#your-first-model}
# Download and run Llama 3.1 8B (recommended starter model)
ollama run llama3.1
# That's it. Start chatting:
>>> What is the capital of France?
The capital of France is Paris...
# Press Ctrl+D to exit
The first run downloads the model (~4.7 GB for Llama 3.1 8B Q4). Subsequent runs start instantly from cache.
Recommended first models by hardware:
| VRAM / RAM | Best Starter Model | Command | Download Size |
|---|---|---|---|
| 4-8 GB | Llama 3.2 3B | ollama run llama3.2:3b | ~2.0 GB |
| 8-12 GB | Llama 3.1 8B | ollama run llama3.1 | ~4.7 GB |
| 16 GB | Qwen 2.5 14B | ollama run qwen2.5:14b | ~9.0 GB |
| 24 GB | Qwen 2.5 32B | ollama run qwen2.5:32b | ~20 GB |
| 48 GB+ | Llama 3.3 70B | ollama run llama3.3:70b | ~40 GB |
Not sure which model fits your hardware? Use our Model Recommender or VRAM Calculator.
Essential Ollama Commands {#essential-ollama-commands}
Model Management
# List downloaded models
ollama list
# Download a model without running it
ollama pull qwen2.5-coder:7b
# Remove a model
ollama rm mistral
# Show model details (size, parameters, template)
ollama show llama3.1
# Copy a model (for creating variants)
ollama cp llama3.1 my-assistant
Running Models
# Interactive chat
ollama run llama3.1
# Single prompt (no interactive mode)
ollama run llama3.1 "Explain quantum computing in one paragraph"
# Pipe input from a file
cat code.py | ollama run qwen2.5-coder:7b "Review this code for bugs"
# Set a system prompt from inside an interactive session
ollama run llama3.1
>>> /set system You are a helpful Python tutor
Server Management
# Start the Ollama server (macOS does this automatically)
ollama serve
# Check running models
ollama ps
# Stop a loaded model (free memory)
ollama stop llama3.1
Useful Environment Variables
# Change model storage location (default: ~/.ollama/models)
export OLLAMA_MODELS=/path/to/models
# Change API port (default: 11434)
export OLLAMA_HOST=0.0.0.0:11434
# Set how long models stay loaded (default: 5m)
export OLLAMA_KEEP_ALIVE=30m
# Limit concurrent requests per model
export OLLAMA_NUM_PARALLEL=2
# Enable debug logging
export OLLAMA_DEBUG=1
Choosing the Right Model {#choosing-the-right-model}
Ollama's library has 500+ models. Here is what to actually use:
For Coding
| Model | Params | VRAM (Q4) | HumanEval | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 22 GB | 92.7% | Best overall code quality |
| Qwen 2.5 Coder 7B | 7B | 5 GB | 88.4% | Daily coding on 8GB hardware |
| DeepSeek Coder V2 16B | 16B | 10 GB | 85.2% | Multi-language support |
| Qwen 2.5 Coder 1.5B | 1.5B | 1.5 GB | 74.5% | IDE autocomplete |
ollama run qwen2.5-coder:32b # Best coding model
For General Chat & Q&A
| Model | Params | VRAM (Q4) | MMLU | Best For |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | 42 GB | 86.0% | Maximum quality |
| Qwen 2.5 32B | 32B | 22 GB | 83.2% | Best mid-range |
| Llama 3.1 8B | 8B | 5.5 GB | 73.0% | Budget all-rounder |
| Phi-3.5 Mini | 3.8B | 3 GB | 69.0% | Ultra-low hardware |
For Reasoning & Math
| Model | Params | VRAM (Q4) | MATH | Best For |
|---|---|---|---|---|
| DeepSeek R1 32B | 32B | 22 GB | 88.5% | Complex reasoning |
| DeepSeek R1 14B | 14B | 9.5 GB | 82.3% | Budget reasoning |
| Qwen 2.5 14B | 14B | 9.5 GB | 78.0% | Balanced option |
ollama run deepseek-r1:14b # Watch it think step-by-step
For the complete ranking of all 15 top models, see our Best Ollama Models guide.
Custom Models with Modelfiles {#custom-models-with-modelfiles}
Modelfiles let you create customized models with specific system prompts, parameters, and behaviors.
Basic Modelfile
Create a file named Modelfile:
FROM llama3.1
SYSTEM """
You are a senior Python developer at a tech startup. You write clean,
well-documented code following PEP 8. You prefer simple solutions over
clever ones. Always include type hints and docstrings.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
Build and run it:
ollama create python-dev -f Modelfile
ollama run python-dev
Advanced Modelfile Parameters
FROM qwen2.5:14b
SYSTEM "You are a technical writing assistant."
# Generation parameters
PARAMETER temperature 0.7 # Creativity (0.0-2.0, default 0.8)
PARAMETER top_p 0.9 # Nucleus sampling
PARAMETER top_k 40 # Top-k sampling
PARAMETER repeat_penalty 1.1 # Reduce repetition
PARAMETER num_ctx 32768 # Context window size
PARAMETER num_predict 2048 # Max tokens to generate
PARAMETER stop "<|end|>" # Stop sequence
# Template customization (for chat format)
TEMPLATE """
{{- if .System }}<|system|>{{ .System }}<|end|>{{ end }}
{{- range .Messages }}<|{{ .Role }}|>{{ .Content }}<|end|>{{ end }}
<|assistant|>
"""
Loading LoRA Adapters
FROM llama3.1
ADAPTER /path/to/my-lora-adapter.gguf
SYSTEM "You are a domain-specific assistant."
This lets you use fine-tuned models with Ollama.
The Ollama REST API {#the-ollama-rest-api}
Ollama runs a REST API on http://localhost:11434 by default. This is how you integrate Ollama into applications.
Chat Endpoint (Recommended)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
"stream": false
}'
Response:
{
"model": "llama3.1",
"message": {
"role": "assistant",
"content": "Machine learning is a subset of AI..."
},
"done": true,
"total_duration": 1250000000,
"eval_count": 145,
"eval_duration": 980000000
}
Generate Endpoint (Single Prompt)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Write a haiku about programming",
"stream": false
}'
Embeddings Endpoint (For RAG)
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "What is retrieval-augmented generation?"
}'
Returns a 768-dimensional vector for use in RAG pipelines.
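In a RAG pipeline you rank documents by cosine similarity between the query embedding and each document embedding. A minimal sketch; the helper function below is our own illustration, not part of the Ollama library:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0, orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

With a running server, embed the query and each document through /api/embed, score every pair with this function, and pass the top-scoring documents to the model as context.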
OpenAI-Compatible API
Ollama supports the OpenAI API format, so existing code works with minimal changes:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="not-needed" # Ollama ignores API keys
)
response = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
This compatibility means Ollama works with LangChain, LlamaIndex, CrewAI, and any tool that supports the OpenAI API.
Python Library
import ollama
# Chat
response = ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'Why is the sky blue?'}]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'Tell me a story'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
# Embeddings
embedding = ollama.embed(model='nomic-embed-text', input='Hello world')
Install with: pip install ollama
JavaScript/TypeScript Library
import { Ollama } from 'ollama'
const ollama = new Ollama()
const response = await ollama.chat({
model: 'llama3.1',
messages: [{ role: 'user', content: 'Hello!' }]
})
console.log(response.message.content)
Install with: npm install ollama
Tool Calling / Function Calling {#tool-calling-function-calling}
Since Ollama 0.4, compatible models can call functions you define. This enables AI agents that interact with databases, APIs, file systems, and more — all running locally.
Basic Tool Calling
import ollama
# Define available tools
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["city"]
}
}
}]
# Send request with tools
response = ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
tools=tools
)
# Check for tool calls
if response['message'].get('tool_calls'):
for tool_call in response['message']['tool_calls']:
print(f"Function: {tool_call['function']['name']}")
print(f"Arguments: {tool_call['function']['arguments']}")
# Execute function and send result back
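To close the loop, execute the requested function locally and send its output back to the model as a tool message. A sketch under the same assumptions as above; the get_weather implementation here returns canned data, where a real app would call an actual weather API:

```python
def get_weather(city, unit="celsius"):
    # Stand-in implementation; a real app would query a weather service
    return f"22 degrees {unit} and sunny in {city}"

def run_tool_call(tool_call):
    # Dispatch one entry from response['message']['tool_calls']
    name = tool_call['function']['name']
    args = tool_call['function']['arguments']
    if name == "get_weather":
        return get_weather(**args)
    raise ValueError(f"unknown tool: {name}")

# Against a live server you would then append the tool result and ask again:
#   messages.append(response['message'])
#   messages.append({'role': 'tool', 'content': run_tool_call(call)})
#   final = ollama.chat(model='llama3.1', messages=messages)
print(run_tool_call({'function': {'name': 'get_weather',
                                  'arguments': {'city': 'Tokyo'}}}))
```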
Compatible models for tool calling: Llama 3.1+, Qwen 2.5+, Mistral 7B Instruct, Mistral Small. For building full AI agent systems locally, combine tool calling with frameworks like CrewAI or LangGraph.
GPU Acceleration Setup {#gpu-acceleration-setup}
NVIDIA GPUs (Recommended)
Ollama automatically uses NVIDIA GPUs if CUDA drivers are installed:
# Check GPU is detected
nvidia-smi
# Verify Ollama sees GPU
ollama ps
# PROCESSOR column should show "GPU" for loaded models
If GPU is not detected:
- Install latest NVIDIA drivers from nvidia.com
- Restart Ollama service
- Run nvidia-smi and confirm the driver loads
Apple Silicon (Metal)
Metal acceleration is automatic on M1/M2/M3/M4 Macs. No configuration needed. Apple Silicon's unified memory lets the GPU address system RAM, so a 32GB Mac can hold a 32B model at Q4 quantization entirely in memory.
AMD GPUs (ROCm)
Linux: Install ROCm 6.0+ and Ollama will detect AMD GPUs automatically. Windows: AMD GPU support is experimental. Most users fall back to CPU inference on Windows with AMD.
Partial GPU Offloading
If your model is too large for VRAM, Ollama automatically splits between GPU and CPU:
# Manually control GPU layers via the num_gpu option (useful for
# fine-tuning memory usage). Set it in a Modelfile:
#   PARAMETER num_gpu 20
# or per request through the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b", "prompt": "hi",
  "options": {"num_gpu": 20}
}'
# 0 = CPU only, -1 = all layers on GPU
The layers on GPU run fast; layers on CPU run slower. For details on how this affects performance, see our RTX 5090 vs 5080 comparison which includes real offloading benchmarks.
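As a rough way to reason about offloading, divide usable VRAM by the per-layer size to guess how many layers fit on the GPU. This back-of-the-envelope helper is our own illustration, not something Ollama exposes:

```python
def layers_that_fit(vram_gb, total_layers, model_size_gb, reserve_gb=1.0):
    # Assumes layers are roughly equal in size and reserves some VRAM
    # for the KV cache and runtime overhead. A rule of thumb only.
    per_layer_gb = model_size_gb / total_layers
    usable = max(vram_gb - reserve_gb, 0)
    return min(total_layers, int(usable / per_layer_gb))

# A ~40 GB 70B model with 80 layers on a 24 GB card:
print(layers_that_fit(24, 80, 40))  # roughly 46 of 80 layers on GPU
```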
Docker Deployment {#docker-deployment}
CPU Only
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
NVIDIA GPU
docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
AMD GPU
docker run -d \
--device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama:rocm
Docker Compose (with Open WebUI)
version: '3.8'
services:
ollama:
image: ollama/ollama
ports:
- "11434:11434"
volumes:
- ollama:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
volumes:
- open-webui:/app/backend/data
volumes:
ollama:
open-webui:
Run with docker compose up -d, then open http://localhost:3000 for a ChatGPT-like interface. See our Open WebUI setup guide for full configuration.
Performance Optimization {#performance-optimization}
1. Use the Right Quantization
Ollama defaults to Q4_K_M quantization — the best balance of quality and speed. For higher quality at the cost of more VRAM:
ollama pull llama3.1:8b-instruct-q8_0 # 8-bit, near-lossless
ollama pull llama3.1:8b-instruct-fp16 # Full precision (doubles VRAM)
For lower VRAM usage:
ollama pull llama3.1:8b-instruct-q2_K # 2-bit, noticeable quality loss
See our quantization format comparison for detailed quality benchmarks.
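As a rough guide, weight memory is the parameter count times bits per weight divided by 8, plus overhead for the KV cache and buffers. The estimator below is our own rule of thumb, not an official formula:

```python
def estimate_vram_gb(params_billions, bits_per_weight, overhead=1.2):
    # Weights in GB plus ~20% for KV cache and runtime buffers.
    # Real usage varies with context length and quantization format.
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

# An 8B model at Q4_K_M (~4.5 effective bits per weight):
print(round(estimate_vram_gb(8, 4.5), 1))  # ~5.4 GB
```

This lines up with the table earlier in the guide, where Llama 3.1 8B at Q4 needs about 5.5 GB.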
2. Optimize Context Length
Longer context uses more memory. Set only what you need:
# In Modelfile:
PARAMETER num_ctx 4096 # Default, good for most chat
PARAMETER num_ctx 32768 # For long documents
PARAMETER num_ctx 131072 # Max (Llama 3.1), uses significant VRAM
3. Keep Models Loaded
# Keep model in memory for 30 minutes (default: 5m)
export OLLAMA_KEEP_ALIVE=30m
# Keep loaded indefinitely (useful for servers)
export OLLAMA_KEEP_ALIVE=-1
4. Batch Processing
For high-throughput scenarios, use the API with multiple concurrent requests. Ollama handles request queuing automatically.
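A minimal sketch of fanning out requests with a thread pool. The send function is injected so the dispatch logic runs without a live server; with one running, send could wrap ollama.generate as shown in the comment:

```python
from concurrent.futures import ThreadPoolExecutor

def ask_all(prompts, send, workers=4):
    # Fan prompts out over a thread pool; Ollama queues whatever it
    # cannot serve in parallel (see OLLAMA_NUM_PARALLEL on the server).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))

# With a running server you might use:
#   import ollama
#   send = lambda p: ollama.generate(model='llama3.1', prompt=p)['response']
# Offline demo with a stand-in send function:
print(ask_all(["hello", "world"], send=str.upper))
```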
5. Monitor Performance
# Check loaded models and memory usage
ollama ps
# Measure tokens/second from API response
# Look for eval_count and eval_duration in JSON response
# Speed = eval_count / (eval_duration / 1e9) tokens/sec
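The arithmetic above can be wrapped in a small helper. This is our own convenience function, using the eval_count and eval_duration fields that appear in every /api/chat and /api/generate response:

```python
def tokens_per_second(response):
    # eval_duration is in nanoseconds; eval_count is generated tokens
    return response['eval_count'] / (response['eval_duration'] / 1e9)

# Using the example chat response shown earlier in this guide:
sample = {'eval_count': 145, 'eval_duration': 980000000}
print(round(tokens_per_second(sample), 1))  # ~148.0 tokens/sec
```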
Chat Interfaces for Ollama {#chat-interfaces-for-ollama}
While the CLI works great, many users prefer a visual interface:
| Interface | Stars | Features | Install |
|---|---|---|---|
| Open WebUI | 126K+ | ChatGPT-like UI, RAG, plugins | Docker or pip |
| Jan | 30K+ | Desktop app, offline-first | Download from jan.ai |
| Chatbox | 25K+ | Minimal desktop client | Download from chatboxai.app |
| Enchanted | 5K+ | macOS/iOS native client | App Store |
| LobeChat | 50K+ | Advanced multi-agent UI | Docker |
Open WebUI is the most popular — it supports file uploads, RAG, model switching, conversation management, and plugins.
Troubleshooting {#troubleshooting}
Model Won't Download
# Check available storage
df -h ~/.ollama
# Clear partial downloads
ollama rm <model-name>
ollama pull <model-name>
GPU Not Detected
# NVIDIA: Check driver
nvidia-smi
# Check Ollama sees GPU
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i gpu
# Restart Ollama
systemctl restart ollama # Linux
# Or restart the app on macOS/Windows
Out of Memory
# Use smaller quantization
ollama pull llama3.1:8b-instruct-q2_K
# Reduce context length in Modelfile
PARAMETER num_ctx 2048
# Use a smaller model
ollama run phi3.5 # 3.8B instead of 8B
Slow Generation
- Check GPU is being used: ollama ps should show GPU in the PROCESSOR column
- Reduce context length
- Close other GPU-intensive apps
- Consider a model that fits entirely in VRAM — partial offloading is 5-10x slower
For hardware recommendations, see our hardware requirements guide.
FAQ {#faq}
See the FAQ section below for answers to the most common Ollama questions.
Sources: Ollama GitHub | Ollama Documentation | Ollama Model Library | Ollama API Reference