Complete Ollama Guide 2026: Install, Run & Manage Local AI Models

March 18, 2026
25 min read
LocalAimaster Research Team


Ollama is a free, open-source tool that runs AI models locally on your computer with one command. Install it with curl -fsSL https://ollama.com/install.sh | sh, then run ollama run llama3.1 to start chatting. It supports 500+ models, works on macOS, Windows, and Linux, and requires zero cloud accounts or API keys.

Ollama is the simplest way to run large language models on your own computer. One command installs it. One command downloads a model. One command starts a conversation. No cloud accounts, no API keys, no monthly fees — your data stays on your machine.

This guide covers everything from first install to production deployment: model management, the REST API, custom Modelfiles, tool calling, GPU acceleration, Docker, and performance optimization. Whether you are setting up your first local AI or building applications, start here.

Table of Contents

  1. What Is Ollama?
  2. Install Ollama
  3. Your First Model
  4. Essential Ollama Commands
  5. Choosing the Right Model
  6. Custom Models with Modelfiles
  7. The Ollama REST API
  8. Tool Calling / Function Calling
  9. GPU Acceleration Setup
  10. Docker Deployment
  11. Performance Optimization
  12. Chat Interfaces for Ollama
  13. Troubleshooting

What Is Ollama? {#what-is-ollama}

Ollama is an open-source tool that runs large language models (LLMs) locally on your computer. It handles model downloading, quantization, GPU acceleration, and serving — all behind a simple CLI and REST API.

Why use Ollama over cloud AI (ChatGPT, Claude)?

  • Privacy: Your conversations never leave your machine
  • Cost: $0/month. No API fees, no usage limits
  • Speed: No network latency — responses start instantly
  • Offline: Works without internet after downloading models
  • Control: Customize system prompts, temperature, context length

Ollama supports 500+ models from the Ollama library, including Llama 3.3 70B, Qwen 2.5, Mistral, DeepSeek R1, Gemma 2, and GPT-OSS. It runs on macOS, Windows, and Linux.


Install Ollama {#install-ollama}

macOS

# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg

# Option 2: Homebrew
brew install ollama

After installing, Ollama runs as a menubar app. It automatically uses Metal GPU acceleration on Apple Silicon — no configuration needed.

Windows

# Download the installer from https://ollama.com/download/windows
# Or use winget:
winget install Ollama.Ollama

For detailed Windows setup including WSL2 options, see our Ollama Windows Installation Guide.

Linux

# One-line install (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Or install manually:
# 1. Download binary from GitHub releases
# 2. Place in /usr/local/bin/
# 3. Create systemd service (see docs)

The install script detects your GPU (NVIDIA/AMD) and configures drivers automatically. For manual NVIDIA setup, ensure CUDA 11.8+ drivers are installed.

Verify Installation

ollama --version
# Prints the installed version, e.g.: ollama version 0.5.x

Your First Model {#your-first-model}

# Download and run Llama 3.1 8B (recommended starter model)
ollama run llama3.1

# That's it. Start chatting:
>>> What is the capital of France?
The capital of France is Paris...

# Press Ctrl+D to exit

The first run downloads the model (~4.7 GB for Llama 3.1 8B Q4). Subsequent runs start instantly from cache.

Recommended first models by hardware:

| VRAM / RAM | Best Starter Model | Command | Download Size |
|---|---|---|---|
| 4-8 GB | Llama 3.2 3B | ollama run llama3.2:3b | ~2.0 GB |
| 8-12 GB | Llama 3.1 8B | ollama run llama3.1 | ~4.7 GB |
| 16 GB | Qwen 2.5 14B | ollama run qwen2.5:14b | ~9.0 GB |
| 24 GB | Qwen 2.5 32B | ollama run qwen2.5:32b | ~20 GB |
| 48 GB+ | Llama 3.3 70B | ollama run llama3.3:70b | ~40 GB |

Not sure which model fits your hardware? Use our Model Recommender or VRAM Calculator.
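If you just want a quick estimate without checking a model page, Q4_K_M downloads work out to roughly 0.6 bytes per parameter. This is an empirical rule of thumb, not an exact formula; actual sizes vary with architecture and quantization details:

```python
def q4_download_gb(params_billions: float) -> float:
    """Rough Q4_K_M download size: ~0.6 bytes per parameter.

    Matches the table above to within about 10%.
    """
    BYTES_PER_PARAM_Q4 = 0.6  # empirical average for Q4_K_M GGUF files
    return params_billions * BYTES_PER_PARAM_Q4

for size in (3, 8, 14, 32, 70):
    print(f"{size}B -> ~{q4_download_gb(size):.1f} GB")
```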


Essential Ollama Commands {#essential-ollama-commands}

Model Management

# List downloaded models
ollama list

# Download a model without running it
ollama pull qwen2.5-coder:7b

# Remove a model
ollama rm mistral

# Show model details (size, parameters, template)
ollama show llama3.1

# Copy a model (for creating variants)
ollama cp llama3.1 my-assistant

Running Models

# Interactive chat
ollama run llama3.1

# Single prompt (no interactive mode)
ollama run llama3.1 "Explain quantum computing in one paragraph"

# Pipe input from a file
cat code.py | ollama run qwen2.5-coder:7b "Review this code for bugs"

# Set a system prompt from inside the interactive session
ollama run llama3.1
>>> /set system You are a helpful Python tutor

Server Management

# Start the Ollama server (macOS does this automatically)
ollama serve

# Check running models
ollama ps

# Stop a loaded model (free memory)
ollama stop llama3.1

Useful Environment Variables

# Change model storage location (default: ~/.ollama/models)
export OLLAMA_MODELS=/path/to/models

# Change API port (default: 11434)
export OLLAMA_HOST=0.0.0.0:11434

# Set how long models stay loaded (default: 5m)
export OLLAMA_KEEP_ALIVE=30m

# Limit GPU layers (for partial offloading)
export OLLAMA_NUM_GPU=20

# Enable debug logging
export OLLAMA_DEBUG=1
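Client code often needs to respect OLLAMA_HOST as well. A minimal sketch of how a client might resolve the base URL; the default matches Ollama's, but this helper is illustrative, not part of any library:

```python
import os

def ollama_base_url() -> str:
    """Resolve the API base URL: honor OLLAMA_HOST if set,
    otherwise fall back to the default local port 11434."""
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host

print(ollama_base_url())
```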

Choosing the Right Model {#choosing-the-right-model}

Ollama's library has 500+ models. Here is what to actually use:

For Coding

| Model | Params | VRAM (Q4) | HumanEval | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 22 GB | 92.7% | Best overall code quality |
| Qwen 2.5 Coder 7B | 7B | 5 GB | 88.4% | Daily coding on 8GB hardware |
| DeepSeek Coder V2 16B | 16B | 10 GB | 85.2% | Multi-language support |
| Qwen 2.5 Coder 1.5B | 1.5B | 1.5 GB | 74.5% | IDE autocomplete |

ollama run qwen2.5-coder:32b  # Best coding model

For General Chat & Q&A

| Model | Params | VRAM (Q4) | MMLU | Best For |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | 42 GB | 86.0% | Maximum quality |
| Qwen 2.5 32B | 32B | 22 GB | 83.2% | Best mid-range |
| Llama 3.1 8B | 8B | 5.5 GB | 73.0% | Budget all-rounder |
| Phi-3.5 Mini | 3.8B | 3 GB | 69.0% | Ultra-low hardware |

For Reasoning & Math

| Model | Params | VRAM (Q4) | MATH | Best For |
|---|---|---|---|---|
| DeepSeek R1 32B | 32B | 22 GB | 88.5% | Complex reasoning |
| DeepSeek R1 14B | 14B | 9.5 GB | 82.3% | Budget reasoning |
| Qwen 2.5 14B | 14B | 9.5 GB | 78.0% | Balanced option |

ollama run deepseek-r1:14b  # Watch it think step-by-step

For the complete ranking of all 15 top models, see our Best Ollama Models guide.


Custom Models with Modelfiles {#custom-models-with-modelfiles}

Modelfiles let you create customized models with specific system prompts, parameters, and behaviors.

Basic Modelfile

Create a file named Modelfile:

FROM llama3.1

SYSTEM """
You are a senior Python developer at a tech startup. You write clean,
well-documented code following PEP 8. You prefer simple solutions over
clever ones. Always include type hints and docstrings.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

Build and run it:

ollama create python-dev -f Modelfile
ollama run python-dev

Advanced Modelfile Parameters

FROM qwen2.5:14b

SYSTEM "You are a technical writing assistant."

# Generation parameters
PARAMETER temperature 0.7      # Creativity (0.0-2.0, default 0.8)
PARAMETER top_p 0.9            # Nucleus sampling
PARAMETER top_k 40             # Top-k sampling
PARAMETER repeat_penalty 1.1   # Reduce repetition
PARAMETER num_ctx 32768        # Context window size
PARAMETER num_predict 2048     # Max tokens to generate
PARAMETER stop "<|end|>"       # Stop sequence

# Template customization (for chat format)
TEMPLATE """
{{- if .System }}<|system|>{{ .System }}<|end|>{{ end }}
{{- range .Messages }}<|{{ .Role }}|>{{ .Content }}<|end|>{{ end }}
<|assistant|>
"""

Loading LoRA Adapters

FROM llama3.1
ADAPTER /path/to/my-lora-adapter.gguf
SYSTEM "You are a domain-specific assistant."

This lets you use fine-tuned models with Ollama.


The Ollama REST API {#the-ollama-rest-api}

Ollama runs a REST API on http://localhost:11434 by default. This is how you integrate Ollama into applications.

Chat Endpoint (Multi-Turn)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
  ],
  "stream": false
}'

Response:

{
  "model": "llama3.1",
  "message": {
    "role": "assistant",
    "content": "Machine learning is a subset of AI..."
  },
  "done": true,
  "total_duration": 1250000000,
  "eval_count": 145,
  "eval_duration": 980000000
}
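The eval_count and eval_duration fields let you compute generation speed. Durations are reported in nanoseconds, so a quick sketch using the response above:

```python
import json

# The (truncated) /api/chat response shown above
raw = """{
  "model": "llama3.1",
  "done": true,
  "eval_count": 145,
  "eval_duration": 980000000
}"""

resp = json.loads(raw)
# eval_duration is in nanoseconds; divide by 1e9 to get seconds
tokens_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/sec")  # -> 148.0 tokens/sec
```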

Generate Endpoint (Single Prompt)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about programming",
  "stream": false
}'

Embeddings Endpoint (For RAG)

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "What is retrieval-augmented generation?"
}'

Returns a 768-dimensional vector for use in RAG pipelines.
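Once you have embeddings, retrieval is just nearest-neighbor search. A minimal sketch using cosine similarity, with toy 3-dimensional vectors standing in for real 768-dimensional ones:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors in place of real /api/embed output
query = [0.1, 0.9, 0.2]
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}

# Retrieve the most similar document to the query
best = max(docs, key=lambda k: cosine(query, docs[k]))
print(best)  # -> doc_a
```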

OpenAI-Compatible API

Ollama supports the OpenAI API format, so existing code works with minimal changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama ignores API keys
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This compatibility means Ollama works with LangChain, LlamaIndex, CrewAI, and any tool that supports the OpenAI API.

Python Library

import ollama

# Chat
response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

# Embeddings
embedding = ollama.embed(model='nomic-embed-text', input='Hello world')

Install with: pip install ollama

JavaScript/TypeScript Library

import { Ollama } from 'ollama'

const ollama = new Ollama()

const response = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Hello!' }]
})
console.log(response.message.content)

Install with: npm install ollama


Tool Calling / Function Calling {#tool-calling-function-calling}

Since Ollama 0.3, compatible models can call functions you define. This enables AI agents that interact with databases, APIs, file systems, and more — all running locally.

Basic Tool Calling

import ollama

# Define available tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}]

# Send request with tools
response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
    tools=tools
)

# Check for tool calls
if response['message'].get('tool_calls'):
    for tool_call in response['message']['tool_calls']:
        print(f"Function: {tool_call['function']['name']}")
        print(f"Arguments: {tool_call['function']['arguments']}")
        # Execute function and send result back

Compatible models for tool calling: Llama 3.1+, Qwen 2.5+, Mistral 7B Instruct, Mistral Small. For building full AI agent systems locally, combine tool calling with frameworks like CrewAI or LangGraph.
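To complete the loop, execute the requested function locally, append the result as a role "tool" message, and send the conversation back to the model. A sketch with a fake get_weather implementation; the dispatch table and helper are illustrative, not part of the ollama library:

```python
import json

# Hypothetical local implementation of the get_weather tool
def get_weather(city: str, unit: str = "celsius") -> str:
    fake_db = {"Tokyo": 18}
    return json.dumps({"city": city, "temp": fake_db.get(city), "unit": unit})

DISPATCH = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and wrap its result as a 'tool' message
    to append to the conversation before the follow-up chat request."""
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):  # some models return arguments as a JSON string
        args = json.loads(args)
    result = DISPATCH[fn["name"]](**args)
    return {"role": "tool", "content": result}

msg = run_tool_call({"function": {"name": "get_weather",
                                  "arguments": {"city": "Tokyo"}}})
print(msg["content"])
```

Append msg to the messages list and call ollama.chat again so the model can phrase its final answer from the tool result.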


GPU Acceleration Setup {#gpu-acceleration-setup}

NVIDIA GPUs (CUDA)

Ollama automatically uses NVIDIA GPUs when CUDA drivers are installed:

# Check GPU is detected
nvidia-smi

# Verify Ollama sees GPU
ollama ps
# PROCESSOR column should show "GPU" for loaded models

If GPU is not detected:

  1. Install latest NVIDIA drivers from nvidia.com
  2. Restart Ollama service
  3. Check with nvidia-smi that driver loads

Apple Silicon (Metal)

Metal acceleration is automatic on M1/M2/M3/M4 Macs. No configuration needed. Apple Silicon's unified memory lets models use most of your system RAM, so an M4 Max with 32GB can run 32B models at Q4 entirely in memory.

AMD GPUs (ROCm)

Linux: Install ROCm 6.0+ and Ollama will detect AMD GPUs automatically. Windows: AMD GPU support is experimental. Most users fall back to CPU inference on Windows with AMD.

Partial GPU Offloading

If your model is too large for VRAM, Ollama automatically splits between GPU and CPU:

# Manually control GPU layers (useful for fine-tuning memory usage)
OLLAMA_NUM_GPU=20 ollama run llama3.3:70b

# 0 = CPU only, -1 = all layers on GPU

The layers on GPU run fast; layers on CPU run slower. For details on how this affects performance, see our RTX 5090 vs 5080 comparison which includes real offloading benchmarks.
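You can roughly predict how many layers will fit before reaching for OLLAMA_NUM_GPU. This sketch assumes layers are about equal in size and reserves fixed headroom for the KV cache and CUDA buffers, both of which are simplifications:

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes roughly equal layer sizes; reserve_gb holds back room
    for the KV cache and runtime buffers.
    """
    per_layer = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer)
    return max(0, min(fit, n_layers))

# e.g. a ~40 GB 70B model (80 layers) on a 24 GB card
print(gpu_layers(40, 80, 24))  # -> 45
```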


Docker Deployment {#docker-deployment}

CPU Only

docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

NVIDIA GPU

docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

AMD GPU

docker run -d \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm

Docker Compose (with Open WebUI)

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:

Run with docker compose up -d, then open http://localhost:3000 for a ChatGPT-like interface. See our Open WebUI setup guide for full configuration.


Performance Optimization {#performance-optimization}

1. Use the Right Quantization

Ollama defaults to Q4_K_M quantization — the best balance of quality and speed. For higher quality at the cost of more VRAM:

ollama pull llama3.1:8b-instruct-q8_0   # 8-bit, near-lossless
ollama pull llama3.1:8b-instruct-fp16    # Full precision (doubles VRAM)

For lower VRAM usage:

ollama pull llama3.1:8b-instruct-q2_K    # 2-bit, noticeable quality loss

See our quantization format comparison for detailed quality benchmarks.

2. Optimize Context Length

Longer context uses more memory. Set only what you need:

# In Modelfile:
PARAMETER num_ctx 4096    # Default, good for most chat
PARAMETER num_ctx 32768   # For long documents
PARAMETER num_ctx 131072  # Max (Llama 3.1), uses significant VRAM
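The memory cost of context is dominated by the KV cache. A sketch of the standard FP16 KV-cache formula, plugged with Llama 3.1 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                num_ctx: int, bytes_per_value: int = 2) -> float:
    """FP16 KV-cache size: keys + values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4096, 32768, 131072):
    print(f"num_ctx={ctx}: {kv_cache_gb(32, 8, 128, ctx):.1f} GB")
```

So the jump from 4K to the full 128K context costs roughly 15 extra GB on this model, on top of the weights.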

3. Keep Models Loaded

# Keep model in memory for 30 minutes (default: 5m)
export OLLAMA_KEEP_ALIVE=30m

# Keep loaded indefinitely (useful for servers)
export OLLAMA_KEEP_ALIVE=-1

4. Batch Processing

For high-throughput scenarios, use the API with multiple concurrent requests. Ollama handles request queuing automatically.
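On the client side, that pattern is just a thread pool fanning out requests. A sketch with a stand-in ask function so it runs without a server; in practice you would pass a wrapper around ollama.chat or the REST API:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompts, ask, max_workers=4):
    """Send prompts concurrently. Ollama queues requests server-side,
    so the client only needs parallel connections; `ask` is whatever
    function performs one chat call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))

# Stand-in `ask` so the sketch runs without a server
results = fan_out(["a", "b", "c"], ask=lambda p: p.upper())
print(results)  # -> ['A', 'B', 'C']
```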

5. Monitor Performance

# Check loaded models and memory usage
ollama ps

# Measure tokens/second from API response
# Look for eval_count and eval_duration in JSON response
# Speed = eval_count / (eval_duration / 1e9) tokens/sec

Chat Interfaces for Ollama {#chat-interfaces-for-ollama}

While the CLI works great, many users prefer a visual interface:

| Interface | Stars | Features | Install |
|---|---|---|---|
| Open WebUI | 126K+ | ChatGPT-like UI, RAG, plugins | Docker or pip |
| Jan | 30K+ | Desktop app, offline-first | Download from jan.ai |
| Chatbox | 25K+ | Minimal desktop client | Download from chatboxai.app |
| Enchanted | 5K+ | macOS/iOS native client | App Store |
| LobeChat | 50K+ | Advanced multi-agent UI | Docker |

Open WebUI is the most popular — it supports file uploads, RAG, model switching, conversation management, and plugins.


Troubleshooting {#troubleshooting}

Model Won't Download

# Check available storage
df -h ~/.ollama

# Clear partial downloads
ollama rm <model-name>
ollama pull <model-name>

GPU Not Detected

# NVIDIA: Check driver
nvidia-smi

# Check Ollama sees GPU
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i gpu

# Restart Ollama
systemctl restart ollama  # Linux
# Or restart the app on macOS/Windows

Out of Memory

# Use smaller quantization
ollama pull llama3.1:8b-instruct-q2_K

# Reduce context length in Modelfile
PARAMETER num_ctx 2048

# Use a smaller model
ollama run phi3.5  # 3.8B instead of 8B

Slow Generation

  1. Check GPU is being used: ollama ps should show GPU processor
  2. Reduce context length
  3. Close other GPU-intensive apps
  4. Consider a model that fits entirely in VRAM — partial offloading is 5-10x slower

For hardware recommendations, see our hardware requirements guide.




Sources: Ollama GitHub | Ollama Documentation | Ollama Model Library | Ollama API Reference
