Complete Ollama Guide 2026: Install, Run & Manage Local AI Models

March 18, 2026
25 min read
LocalAimaster Research Team


Ollama is a free, open-source tool that runs AI models locally on your computer with one command. Install it with curl -fsSL https://ollama.com/install.sh | sh, then run ollama run llama3.1 to start chatting. It supports 500+ models, works on macOS, Windows, and Linux, and requires zero cloud accounts or API keys.

Ollama is the simplest way to run large language models on your own computer. One command installs it. One command downloads a model. One command starts a conversation. No cloud accounts, no API keys, no monthly fees — your data stays on your machine.

This guide covers everything from first install to production deployment: model management, the REST API, custom Modelfiles, tool calling, GPU acceleration, Docker, and performance optimization. Whether you are setting up your first local AI or building applications, start here.

Table of Contents

  1. What Is Ollama?
  2. Install Ollama
  3. Your First Model
  4. Essential Ollama Commands
  5. Choosing the Right Model
  6. Custom Models with Modelfiles
  7. The Ollama REST API
  8. Tool Calling / Function Calling
  9. GPU Acceleration Setup
  10. Docker Deployment
  11. Performance Optimization
  12. Chat Interfaces for Ollama
  13. Troubleshooting

What Is Ollama? {#what-is-ollama}

Ollama is an open-source tool that runs large language models (LLMs) locally on your computer. It handles model downloading, quantization, GPU acceleration, and serving — all behind a simple CLI and REST API.

Why use Ollama over cloud AI (ChatGPT, Claude)?

  • Privacy: Your conversations never leave your machine
  • Cost: $0/month. No API fees, no usage limits
  • Speed: No network latency — responses start instantly
  • Offline: Works without internet after downloading models
  • Control: Customize system prompts, temperature, context length

Ollama supports 500+ models from the Ollama library, including Llama 3.3 70B, Qwen 2.5, Mistral, DeepSeek R1, Gemma 2, and GPT-OSS. It runs on macOS, Windows, and Linux.


Install Ollama {#install-ollama}

macOS

# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg

# Option 2: Homebrew
brew install ollama

After installing, Ollama runs as a menubar app. It automatically uses Metal GPU acceleration on Apple Silicon — no configuration needed.

Windows

# Download the installer from https://ollama.com/download/windows
# Or use winget:
winget install Ollama.Ollama

For detailed Windows setup including WSL2 options, see our Ollama Windows Installation Guide.

Linux

# One-line install (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Or install manually:
# 1. Download binary from GitHub releases
# 2. Place in /usr/local/bin/
# 3. Create systemd service (see docs)

The install script detects your GPU (NVIDIA/AMD) and configures drivers automatically. For manual NVIDIA setup, ensure CUDA 11.8+ drivers are installed.

Verify Installation

ollama --version
# Prints the installed version, e.g.: ollama version 0.5.x

Your First Model {#your-first-model}

# Download and run Llama 3.1 8B (recommended starter model)
ollama run llama3.1

# That's it. Start chatting:
>>> What is the capital of France?
The capital of France is Paris...

# Press Ctrl+D to exit

The first run downloads the model (~4.7 GB for Llama 3.1 8B Q4). Subsequent runs start instantly from cache.

Recommended first models by hardware:

| VRAM / RAM | Best Starter Model | Command | Download Size |
|---|---|---|---|
| 4-8 GB | Llama 3.2 3B | ollama run llama3.2:3b | ~2.0 GB |
| 8-12 GB | Llama 3.1 8B | ollama run llama3.1 | ~4.7 GB |
| 16 GB | Qwen 2.5 14B | ollama run qwen2.5:14b | ~9.0 GB |
| 24 GB | Qwen 2.5 32B | ollama run qwen2.5:32b | ~20 GB |
| 48 GB+ | Llama 3.3 70B | ollama run llama3.3:70b | ~40 GB |

Not sure which model fits your hardware? Use our Model Recommender or VRAM Calculator.
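If you just want a quick estimate without checking a model page, Q4_K_M downloads work out to roughly 0.6 bytes per parameter. This is an empirical rule of thumb, not an exact formula; actual sizes vary with architecture and quantization details:

```python
def q4_download_gb(params_billions: float) -> float:
    """Rough Q4_K_M download size: ~0.6 bytes per parameter.

    Matches the table above to within about 10%.
    """
    BYTES_PER_PARAM_Q4 = 0.6  # empirical average for Q4_K_M GGUF files
    return params_billions * BYTES_PER_PARAM_Q4

for size in (3, 8, 14, 32, 70):
    print(f"{size}B -> ~{q4_download_gb(size):.1f} GB")
```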


Essential Ollama Commands {#essential-ollama-commands}

Model Management

# List downloaded models
ollama list

# Download a model without running it
ollama pull qwen2.5-coder:7b

# Remove a model
ollama rm mistral

# Show model details (size, parameters, template)
ollama show llama3.1

# Copy a model (for creating variants)
ollama cp llama3.1 my-assistant

Running Models

# Interactive chat
ollama run llama3.1

# Single prompt (no interactive mode)
ollama run llama3.1 "Explain quantum computing in one paragraph"

# Pipe input from a file
cat code.py | ollama run qwen2.5-coder:7b "Review this code for bugs"

# Set a system prompt from inside the interactive session
ollama run llama3.1
>>> /set system You are a helpful Python tutor

Server Management

# Start the Ollama server (macOS does this automatically)
ollama serve

# Check running models
ollama ps

# Stop a loaded model (free memory)
ollama stop llama3.1

Useful Environment Variables

# Change model storage location (default: ~/.ollama/models)
export OLLAMA_MODELS=/path/to/models

# Change API port (default: 11434)
export OLLAMA_HOST=0.0.0.0:11434

# Set how long models stay loaded (default: 5m)
export OLLAMA_KEEP_ALIVE=30m

# Limit GPU layers (for partial offloading)
export OLLAMA_NUM_GPU=20

# Enable debug logging
export OLLAMA_DEBUG=1
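Client code often needs to respect OLLAMA_HOST as well. A minimal sketch of how a client might resolve the base URL; the default matches Ollama's, but this helper is illustrative, not part of any library:

```python
import os

def ollama_base_url() -> str:
    """Resolve the API base URL: honor OLLAMA_HOST if set,
    otherwise fall back to the default local port 11434."""
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host

print(ollama_base_url())
```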

Choosing the Right Model {#choosing-the-right-model}

Ollama's library has 500+ models. Here is what to actually use:

For Coding

| Model | Params | VRAM (Q4) | HumanEval | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 22 GB | 92.7% | Best overall code quality |
| Qwen 2.5 Coder 7B | 7B | 5 GB | 88.4% | Daily coding on 8GB hardware |
| DeepSeek Coder V2 16B | 16B | 10 GB | 85.2% | Multi-language support |
| Qwen 2.5 Coder 1.5B | 1.5B | 1.5 GB | 74.5% | IDE autocomplete |

ollama run qwen2.5-coder:32b  # Best coding model

For General Chat & Q&A

| Model | Params | VRAM (Q4) | MMLU | Best For |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | 42 GB | 86.0% | Maximum quality |
| Qwen 2.5 32B | 32B | 22 GB | 83.2% | Best mid-range |
| Llama 3.1 8B | 8B | 5.5 GB | 73.0% | Budget all-rounder |
| Phi-3.5 Mini | 3.8B | 3 GB | 69.0% | Ultra-low hardware |

For Reasoning & Math

| Model | Params | VRAM (Q4) | MATH | Best For |
|---|---|---|---|---|
| DeepSeek R1 32B | 32B | 22 GB | 88.5% | Complex reasoning |
| DeepSeek R1 14B | 14B | 9.5 GB | 82.3% | Budget reasoning |
| Qwen 2.5 14B | 14B | 9.5 GB | 78.0% | Balanced option |

ollama run deepseek-r1:14b  # Watch it think step-by-step

For the complete ranking of all 15 top models, see our Best Ollama Models guide.


Custom Models with Modelfiles {#custom-models-with-modelfiles}

Modelfiles let you create customized models with specific system prompts, parameters, and behaviors.

Basic Modelfile

Create a file named Modelfile:

FROM llama3.1

SYSTEM """
You are a senior Python developer at a tech startup. You write clean,
well-documented code following PEP 8. You prefer simple solutions over
clever ones. Always include type hints and docstrings.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

Build and run it:

ollama create python-dev -f Modelfile
ollama run python-dev

Advanced Modelfile Parameters

FROM qwen2.5:14b

SYSTEM "You are a technical writing assistant."

# Generation parameters
PARAMETER temperature 0.7      # Creativity (0.0-2.0, default 0.8)
PARAMETER top_p 0.9            # Nucleus sampling
PARAMETER top_k 40             # Top-k sampling
PARAMETER repeat_penalty 1.1   # Reduce repetition
PARAMETER num_ctx 32768        # Context window size
PARAMETER num_predict 2048     # Max tokens to generate
PARAMETER stop "<|end|>"       # Stop sequence

# Template customization (for chat format)
TEMPLATE """
{{- if .System }}<|system|>{{ .System }}<|end|>{{ end }}
{{- range .Messages }}<|{{ .Role }}|>{{ .Content }}<|end|>{{ end }}
<|assistant|>
"""

Loading LoRA Adapters

FROM llama3.1
ADAPTER /path/to/my-lora-adapter.gguf
SYSTEM "You are a domain-specific assistant."

This lets you use fine-tuned models with Ollama.


The Ollama REST API {#the-ollama-rest-api}

Ollama runs a REST API on http://localhost:11434 by default. This is how you integrate Ollama into applications.

Chat Endpoint (Multi-Turn)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
  ],
  "stream": false
}'

Response:

{
  "model": "llama3.1",
  "message": {
    "role": "assistant",
    "content": "Machine learning is a subset of AI..."
  },
  "done": true,
  "total_duration": 1250000000,
  "eval_count": 145,
  "eval_duration": 980000000
}
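The eval_count and eval_duration fields let you compute generation speed. Durations are reported in nanoseconds, so a quick sketch using the response above:

```python
import json

# The (truncated) /api/chat response shown above
raw = """{
  "model": "llama3.1",
  "done": true,
  "eval_count": 145,
  "eval_duration": 980000000
}"""

resp = json.loads(raw)
# eval_duration is in nanoseconds; divide by 1e9 to get seconds
tokens_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/sec")  # -> 148.0 tokens/sec
```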

Generate Endpoint (Single Prompt)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a haiku about programming",
  "stream": false
}'

Embeddings Endpoint (For RAG)

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "What is retrieval-augmented generation?"
}'

Returns a 768-dimensional vector for use in RAG pipelines.
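Once you have embeddings, retrieval is just nearest-neighbor search. A minimal sketch using cosine similarity, with toy 3-dimensional vectors standing in for real 768-dimensional ones:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors in place of real /api/embed output
query = [0.1, 0.9, 0.2]
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}

# Retrieve the most similar document to the query
best = max(docs, key=lambda k: cosine(query, docs[k]))
print(best)  # -> doc_a
```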

OpenAI-Compatible API

Ollama supports the OpenAI API format, so existing code works with minimal changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama ignores API keys
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This compatibility means Ollama works with LangChain, LlamaIndex, CrewAI, and any tool that supports the OpenAI API.

Python Library

import ollama

# Chat
response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

# Embeddings
embedding = ollama.embed(model='nomic-embed-text', input='Hello world')

Install with: pip install ollama

JavaScript/TypeScript Library

import { Ollama } from 'ollama'

const ollama = new Ollama()

const response = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Hello!' }]
})
console.log(response.message.content)

Install with: npm install ollama


Tool Calling / Function Calling {#tool-calling-function-calling}

Since Ollama 0.3, compatible models can call functions you define. This enables AI agents that interact with databases, APIs, file systems, and more — all running locally.

Basic Tool Calling

import ollama

# Define available tools
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}]

# Send request with tools
response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'What is the weather in Tokyo?'}],
    tools=tools
)

# Check for tool calls
if response['message'].get('tool_calls'):
    for tool_call in response['message']['tool_calls']:
        print(f"Function: {tool_call['function']['name']}")
        print(f"Arguments: {tool_call['function']['arguments']}")
        # Execute function and send result back

Compatible models for tool calling: Llama 3.1+, Qwen 2.5+, Mistral 7B Instruct, Mistral Small. For building full AI agent systems locally, combine tool calling with frameworks like CrewAI or LangGraph.
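To complete the loop, execute the requested function locally, append the result as a role "tool" message, and send the conversation back to the model. A sketch with a fake get_weather implementation; the dispatch table and helper are illustrative, not part of the ollama library:

```python
import json

# Hypothetical local implementation of the get_weather tool
def get_weather(city: str, unit: str = "celsius") -> str:
    fake_db = {"Tokyo": 18}
    return json.dumps({"city": city, "temp": fake_db.get(city), "unit": unit})

DISPATCH = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and wrap its result as a 'tool' message
    to append to the conversation before the follow-up chat request."""
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):  # some models return arguments as a JSON string
        args = json.loads(args)
    result = DISPATCH[fn["name"]](**args)
    return {"role": "tool", "content": result}

msg = run_tool_call({"function": {"name": "get_weather",
                                  "arguments": {"city": "Tokyo"}}})
print(msg["content"])
```

Append msg to the messages list and call ollama.chat again so the model can phrase its final answer from the tool result.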


GPU Acceleration Setup {#gpu-acceleration-setup}

NVIDIA GPUs (CUDA)

Ollama automatically uses NVIDIA GPUs when CUDA drivers are installed:

# Check GPU is detected
nvidia-smi

# Verify Ollama sees GPU
ollama ps
# PROCESSOR column should show "GPU" for loaded models

If GPU is not detected:

  1. Install latest NVIDIA drivers from nvidia.com
  2. Restart Ollama service
  3. Check with nvidia-smi that driver loads

Apple Silicon (Metal)

Metal acceleration is automatic on M1/M2/M3/M4 Macs. No configuration needed. Apple Silicon's unified memory lets models use most of your system RAM, so an M4 Max with 32GB can run 32B models at Q4 entirely in memory.

AMD GPUs (ROCm)

Linux: Install ROCm 6.0+ and Ollama will detect AMD GPUs automatically. Windows: AMD GPU support is experimental. Most users fall back to CPU inference on Windows with AMD.

Partial GPU Offloading

If your model is too large for VRAM, Ollama automatically splits between GPU and CPU:

# Manually control GPU layers (useful for fine-tuning memory usage)
OLLAMA_NUM_GPU=20 ollama run llama3.3:70b

# 0 = CPU only, -1 = all layers on GPU

The layers on GPU run fast; layers on CPU run slower. For details on how this affects performance, see our RTX 5090 vs 5080 comparison which includes real offloading benchmarks.
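You can roughly predict how many layers will fit before reaching for OLLAMA_NUM_GPU. This sketch assumes layers are about equal in size and reserves fixed headroom for the KV cache and CUDA buffers, both of which are simplifications:

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes roughly equal layer sizes; reserve_gb holds back room
    for the KV cache and runtime buffers.
    """
    per_layer = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer)
    return max(0, min(fit, n_layers))

# e.g. a ~40 GB 70B model (80 layers) on a 24 GB card
print(gpu_layers(40, 80, 24))  # -> 45
```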


Docker Deployment {#docker-deployment}

CPU Only

docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

NVIDIA GPU

docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

AMD GPU

docker run -d \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm

Docker Compose (with Open WebUI)

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:

Run with docker compose up -d, then open http://localhost:3000 for a ChatGPT-like interface. See our Open WebUI setup guide for full configuration.


Performance Optimization {#performance-optimization}

1. Use the Right Quantization

Ollama defaults to Q4_K_M quantization — the best balance of quality and speed. For higher quality at the cost of more VRAM:

ollama pull llama3.1:8b-instruct-q8_0   # 8-bit, near-lossless
ollama pull llama3.1:8b-instruct-fp16    # Full precision (doubles VRAM)

For lower VRAM usage:

ollama pull llama3.1:8b-instruct-q2_K    # 2-bit, noticeable quality loss

See our quantization format comparison for detailed quality benchmarks.

2. Optimize Context Length

Longer context uses more memory. Set only what you need:

# In Modelfile:
PARAMETER num_ctx 4096    # Default, good for most chat
PARAMETER num_ctx 32768   # For long documents
PARAMETER num_ctx 131072  # Max (Llama 3.1), uses significant VRAM
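The memory cost of context is dominated by the KV cache. A sketch of the standard FP16 KV-cache formula, plugged with Llama 3.1 8B's architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                num_ctx: int, bytes_per_value: int = 2) -> float:
    """FP16 KV-cache size: keys + values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4096, 32768, 131072):
    print(f"num_ctx={ctx}: {kv_cache_gb(32, 8, 128, ctx):.1f} GB")
```

So the jump from 4K to the full 128K context costs roughly 15 extra GB on this model, on top of the weights.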

3. Keep Models Loaded

# Keep model in memory for 30 minutes (default: 5m)
export OLLAMA_KEEP_ALIVE=30m

# Keep loaded indefinitely (useful for servers)
export OLLAMA_KEEP_ALIVE=-1

4. Batch Processing

For high-throughput scenarios, use the API with multiple concurrent requests. Ollama handles request queuing automatically.
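On the client side, that pattern is just a thread pool fanning out requests. A sketch with a stand-in ask function so it runs without a server; in practice you would pass a wrapper around ollama.chat or the REST API:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompts, ask, max_workers=4):
    """Send prompts concurrently. Ollama queues requests server-side,
    so the client only needs parallel connections; `ask` is whatever
    function performs one chat call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))

# Stand-in `ask` so the sketch runs without a server
results = fan_out(["a", "b", "c"], ask=lambda p: p.upper())
print(results)  # -> ['A', 'B', 'C']
```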

5. Monitor Performance

# Check loaded models and memory usage
ollama ps

# Measure tokens/second from API response
# Look for eval_count and eval_duration in JSON response
# Speed = eval_count / (eval_duration / 1e9) tokens/sec

Chat Interfaces for Ollama {#chat-interfaces-for-ollama}

While the CLI works great, many users prefer a visual interface:

| Interface | Stars | Features | Install |
|---|---|---|---|
| Open WebUI | 126K+ | ChatGPT-like UI, RAG, plugins | Docker or pip |
| Jan | 30K+ | Desktop app, offline-first | Download from jan.ai |
| Chatbox | 25K+ | Minimal desktop client | Download from chatboxai.app |
| Enchanted | 5K+ | macOS/iOS native client | App Store |
| LobeChat | 50K+ | Advanced multi-agent UI | Docker |

Open WebUI is the most popular — it supports file uploads, RAG, model switching, conversation management, and plugins.


Troubleshooting {#troubleshooting}

Model Won't Download

# Check available storage
df -h ~/.ollama

# Clear partial downloads
ollama rm <model-name>
ollama pull <model-name>

GPU Not Detected

# NVIDIA: Check driver
nvidia-smi

# Check Ollama sees GPU
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i gpu

# Restart Ollama
systemctl restart ollama  # Linux
# Or restart the app on macOS/Windows

Out of Memory

# Use smaller quantization
ollama pull llama3.1:8b-instruct-q2_K

# Reduce context length in Modelfile
PARAMETER num_ctx 2048

# Use a smaller model
ollama run phi3.5  # 3.8B instead of 8B

Slow Generation

  1. Check GPU is being used: ollama ps should show GPU processor
  2. Reduce context length
  3. Close other GPU-intensive apps
  4. Consider a model that fits entirely in VRAM — partial offloading is 5-10x slower

For hardware recommendations, see our hardware requirements guide.




Sources: Ollama GitHub | Ollama Documentation | Ollama Model Library | Ollama API Reference
