Should I use the official ollama-python package or call the REST API directly?

For new code, use the official ollama-python package. It handles base URL configuration, streaming generators, and async clients for you; retries are still your job. The REST API is fine for one-off scripts or non-Python clients, but the package is a thin wrapper that adds value at almost no cost. The only reason to bypass it in Python is if you are already on httpx or aiohttp and want to share a connection pool, in which case POST to /api/chat directly.
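For comparison, here is a minimal sketch of the direct REST route using only the standard library. It assumes the default local address and a non-streaming request; the helper names build_chat_request and chat_via_rest are illustrative and not part of any package.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used throughout this guide


def build_chat_request(prompt: str, model: str = "llama3.2",
                       base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST to /api/chat without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a chunked stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_via_rest(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant text (needs a running Ollama)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything touches the network.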

How do I stream tokens from Ollama in Python?

Pass stream=True to ollama.chat() or ollama.generate() and iterate the returned generator. Each iteration yields a chunk whose message.content (response for generate()) holds the new text. For async, use ollama.AsyncClient and async for. Either exhaust the generator or close it explicitly: abandoning it mid-stream leaves an open HTTP connection that can hold GPU resources until Ollama times out the request, typically 30 to 60 seconds later.

Can I use the OpenAI Python SDK with Ollama?

Yes. Ollama 0.1.24+ exposes an OpenAI-compatible endpoint at /v1. Set base_url="http://localhost:11434/v1" and api_key="ollama" (any non-empty string works) on the OpenAI client and most OpenAI code runs unmodified. The compatibility surface covers chat completions, embeddings, and basic tool calling. It does not yet cover the assistants API, file uploads, or fine-tuning. For pure Ollama features (custom Modelfiles, model management) use ollama-python directly.

How do I get JSON output from Ollama in Python?

Three options, ranked by reliability. Best: pass a JSON schema, for example format=Model.model_json_schema() from a Pydantic model, which constrains generation to that schema. Good: pass format="json" to ollama.chat(), which constrains generation to syntactically valid JSON at the token sampler level but leaves the keys up to the model. Worst: ask in the system prompt and hope the reply parses; this works most of the time and fails on long outputs. Wrap json.loads() in a try/except even with constrained output, because generation can still be cut off mid-object when it hits the num_predict limit.

What happens if Ollama is not running when my Python code calls it?

You get a connection error. Recent versions of the package raise Python's built-in ConnectionError with a hint that Ollama may not be running, while older versions surface httpx.ConnectError, so catch both. The ollama package does not auto-start the daemon. Production code should either ensure Ollama is launched as a systemd service before the Python app boots, or implement a startup health check that polls /api/tags until it succeeds before serving traffic. For development on macOS, ollama serve in a tmux pane is fine. For Docker, depends_on with a healthcheck on the Ollama service is the cleanest pattern.
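The startup health check described above could look roughly like the sketch below. It uses only the standard library; the function names, the 30-second budget, and the half-second interval are assumptions to adapt, not package defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if GET /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_ollama(probe: Callable[[], bool] = ollama_is_up,
                    timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `probe` until it succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Call wait_for_ollama() once during application startup and refuse to serve traffic if it returns False. The probe is injectable so the loop can be exercised in tests without a running daemon.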

How do I generate embeddings with the Python API?

Use ollama.embed() (newer) or ollama.embeddings() (legacy single-input). Pass an embeddings model — nomic-embed-text, mxbai-embed-large, or bge-m3 are the popular choices. The newer ollama.embed() accepts a list of strings and returns a list of vectors in one call, which is roughly 8x faster than looping the legacy endpoint. Output dimensions vary: 768 for nomic, 1024 for mxbai-embed-large, 1024 for bge-m3.

How do I handle long-running requests without timing out?

Ollama itself has no request timeout: it will run for hours on a 70B model with 128K context. The timeout you hit comes from the HTTP client. Raw httpx defaults to 5 seconds, while requests has no default timeout and waits indefinitely unless you pass one. Set an explicit, generous timeout on the client: ollama.Client(host=..., timeout=600). For truly long jobs, prefer streaming so you receive partial output continuously rather than waiting for completion. For batch jobs, run them async and use a job queue rather than holding open HTTP connections.

Can I run multiple Ollama Python clients concurrently?

Yes, and the package is thread-safe. But Ollama itself processes requests sequentially per loaded model unless OLLAMA_NUM_PARALLEL is increased. So sending 10 concurrent chat requests to a single Ollama instance produces a queue, not parallel inference. To get real concurrency, either increase OLLAMA_NUM_PARALLEL (bounded by GPU memory) or run multiple Ollama instances behind a load balancer.

Ollama Python API Guide: From Hello World to Production

Published on April 23, 2026 -- 22 min read

To call Ollama from Python, run pip install ollama, start the Ollama daemon, then call ollama.chat(model="llama3.2", messages=[...]) and read response.message.content. Pass stream=True to stream tokens, use ollama.AsyncClient for async/FastAPI apps, format="json" or a Pydantic schema for structured output, and ollama.embed() for batched embeddings. As of June 2026 the official ollama package is at v0.6.2 (Python 3.8+); everything below runs against a real local Ollama instance.

The single line pip install ollama is everything most tutorials cover before declaring victory. The interesting questions start two minutes later. How do I stream tokens? Why does my Pydantic schema validate sometimes and not others? Should I keep using requests or switch to the official package? Why does my FastAPI endpoint hang forever? What changes when I move this code from a laptop to a shared Ollama server?

This guide is the answer to those questions, in the order a real Python application encounters them. We start with the smallest working example, then layer on streaming, async, structured output, embeddings, retries, FastAPI integration, and finally a comparison with the OpenAI SDK route. Every snippet runs against a real local Ollama instance, with benchmarks where they matter.

Quick Start: pip install ollama then run the four-line example below. If it answers, you are ready for the rest of this guide.

import ollama

response = ollama.chat(model="llama3.2", messages=[
    {"role": "user", "content": "What is the capital of Japan?"}
])
print(response.message.content)

Setup and First Call
Streaming Tokens
Async Client
Structured Output and JSON Mode
Embeddings
Tool Calling and Functions
Production Patterns: Retries, Timeouts, Pools
FastAPI Integration
OpenAI SDK Compatibility
Common Pitfalls
Frequently Asked Questions

Setup and First Call {#setup}

Install the Package

# Requires Python 3.8+
pip install ollama

# Pin a known-good version for production (latest as of June 2026)
pip install ollama==0.6.2

Verify Ollama Is Running

# Should return a JSON list of installed models
curl -s http://localhost:11434/api/tags | python -m json.tool

If the curl fails, Ollama is not running. On Linux: sudo systemctl start ollama. On macOS: brew services start ollama or ollama serve in another terminal. On Windows the Ollama installer adds a tray app — make sure it is running.

Smallest Working Example

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Three reasons to learn Python."},
    ],
)
print(response.message.content)
print(f"Tokens: {response.eval_count} in {response.eval_duration / 1e9:.2f}s")

The response object is a ChatResponse with attributes: message.content, message.role, model, created_at, done_reason, total_duration, prompt_eval_count, prompt_eval_duration, eval_count, eval_duration. The duration fields are nanoseconds — divide by 1e9 for seconds.

For a guide on which model to put after model=, the best Ollama models shortlist is the right starting point.

Streaming Tokens {#streaming}

Streaming is non-optional for any user-facing app. Without it, the user stares at a spinner for 8 seconds; with it, they see the first word in 200 ms.

Synchronous Stream

import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain TLS in three sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
print()

Each chunk yields one token (or a small group of tokens). chunk.done becomes True on the final iteration, and that final chunk contains the full eval timing fields.

Capture Both Tokens and Final Stats

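import ollama

msgs = [{"role": "user", "content": "Explain TLS in three sentences."}]

buffer = []
for chunk in ollama.chat(model="llama3.2", messages=msgs, stream=True):
    print(chunk.message.content, end="", flush=True)  # emit as it arrives
    buffer.append(chunk.message.content)
    if chunk.done:
        # The final chunk carries the timing fields
        full_text = "".join(buffer)
        tps = chunk.eval_count / (chunk.eval_duration / 1e9)
        print(f"\n[generated {chunk.eval_count} tokens at {tps:.1f} tok/s]")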

This pattern — accumulate while emitting, capture stats on the final chunk — is the right shape for most production code.

Cancelling a Stream Mid-Flight

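import ollama

msgs = [{"role": "user", "content": "Explain TLS in three sentences."}]

def user_cancelled() -> bool:
    # Placeholder: replace with your own cancellation check
    return False

stream = ollama.chat(model="llama3.2", messages=msgs, stream=True)
try:
    for chunk in stream:
        print(chunk.message.content, end="", flush=True)
        if user_cancelled():
            break
finally:
    stream.close()  # Critical: releases the HTTP connection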

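Forgetting stream.close() is a common cause of "Ollama keeps using GPU after my script exits." The HTTP connection stays open until Ollama's internal timeout (30-60 seconds) cleans it up. Separately, the model itself stays loaded in memory for the keep_alive period (5 minutes by default), which is expected behaviour and not a leak.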

Async Client {#async}

Anywhere you would use asyncio, FastAPI, or aiohttp, use ollama.AsyncClient instead of the sync API.

import asyncio
import ollama

async def main():
    client = ollama.AsyncClient()
    response = await client.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Hello in three languages."}],
    )
    print(response.message.content)

asyncio.run(main())

Async Streaming

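import asyncio
import ollama

msgs = [{"role": "user", "content": "Explain TLS in three sentences."}]

async def stream_async():
    client = ollama.AsyncClient()
    async for chunk in await client.chat(
        model="llama3.2",
        messages=msgs,
        stream=True,
    ):
        print(chunk.message.content, end="", flush=True)

asyncio.run(stream_async())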

Note the await inside the async for: await client.chat(stream=True) returns the async generator, and async for then iterates it. This is the documented signature even though it looks redundant.

Concurrency Caveat

async def fan_out():
    client = ollama.AsyncClient()
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [
        client.chat(model="llama3.2", messages=[{"role": "user", "content": p}])
        for p in prompts
    ]
    results = await asyncio.gather(*tasks)
    return [r.message.content for r in results]

This will run 3 chats. Whether they execute in parallel or sequentially depends on the server, not the client. Set OLLAMA_NUM_PARALLEL=4 in the Ollama environment to actually parallelise. Otherwise the requests queue.
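If the server only works on a few requests at a time, it can help to cap the number of in-flight calls on the client side as well, so a large fan-out does not pile up hundreds of open connections. The following is a minimal sketch with an asyncio.Semaphore; bounded_gather and its ask parameter are hypothetical names, and the limit of 4 simply mirrors the OLLAMA_NUM_PARALLEL value mentioned above.

```python
import asyncio
from typing import Awaitable, Callable, List


async def bounded_gather(prompts: List[str],
                         ask: Callable[[str], Awaitable[str]],
                         limit: int = 4) -> List[str]:
    """Run `ask` over all prompts with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(prompt: str) -> str:
        async with semaphore:
            return await ask(prompt)

    # gather preserves input order in its result list
    return await asyncio.gather(*(run_one(p) for p in prompts))
```

Here ask would be a small coroutine that wraps client.chat() and returns response.message.content. Keeping it as a parameter means the concurrency logic can be tested with a fake.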

Structured Output and JSON Mode {#structured}

Real applications need structured output, not free text. Three options exist.

Option 1: `format="json"`

import json
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": "Extract name, age, and email from: 'I am Alice, 32, alice@example.com'. Return JSON."
    }],
    format="json",
)
data = json.loads(response.message.content)
print(data)
# {'name': 'Alice', 'age': 32, 'email': 'alice@example.com'}

format="json" constrains generation to valid JSON at the sampler. The output parses unless generation is cut off early (for example by a num_predict limit), but the schema is whatever the model decides.
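Because the keys are up to the model and a reply can still be cut short by a length limit, a defensive parser is worth having. The sketch below is one possible shape, assuming the caller knows which keys it needs; parse_json_reply is a hypothetical helper, not part of the ollama package.

```python
import json
from typing import Any, Iterable, Optional


def parse_json_reply(text: str, required: Iterable[str] = ()) -> Optional[dict]:
    """Return the decoded object, or None if it is truncated, not an object,
    or missing any of the `required` keys."""
    try:
        data: Any = json.loads(text)
    except json.JSONDecodeError:
        return None  # e.g. output cut off before the closing brace
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required):
        return None
    return data
```

A None result is the signal to retry the request or fall back, for example parse_json_reply(response.message.content, ["name", "age", "email"]) for the extraction prompt above.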

Option 2: Pydantic Schema

from pydantic import BaseModel
import ollama

class Person(BaseModel):
    name: str
    age: int
    email: str

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Extract: Alice, 32, alice@example.com"}],
    format=Person.model_json_schema(),
)
person = Person.model_validate_json(response.message.content)

Pydantic schema mode constrains generation to the actual schema. This is what you want for production extraction pipelines.

Option 3: Tools (for function-calling models)

Tool calling models (Llama 3.1+, Qwen 2.5, Mistral Large) accept a tools array and return a tool call object. See the tool calling section below for a full example.

Reliability Comparison

| Method | Schema enforcement | Truncation safe | Token overhead |
|---|---|---|---|
| Plain prompt | None | No | 0% |
| `format="json"` | JSON only | Sometimes | ~5% |
| Pydantic schema | Full schema | Sometimes | ~10% |
| Tool call | Function signature | Yes (typed) | ~15% |

For embeddings-driven retrieval pipelines that complement structured extraction, the Ollama semantic search guide covers the surrounding architecture.

Embeddings {#embeddings}

The Ollama Python API supports two endpoints: embeddings() (legacy, single input) and embed() (newer, batched).

import ollama

# Pull an embedding model first
# ollama pull nomic-embed-text

# Single input
result = ollama.embeddings(model="nomic-embed-text", prompt="Hello world")
print(len(result.embedding))  # 768

# Batched (recommended)
result = ollama.embed(
    model="nomic-embed-text",
    input=["Document A", "Document B", "Document C"],
)
print(len(result.embeddings))     # 3
print(len(result.embeddings[0]))  # 768

Throughput Comparison

For 1000 short documents on an RTX 4090:

| Method | Wall time | Tokens/sec |
|---|---|---|
| `embeddings()` looped | 24.3 s | ~330 |
| `embed()` batched (32) | 3.1 s | ~2600 |
| `embed()` batched (128) | 2.4 s | ~3400 |

Always batch when you can. Diminishing returns kick in around batch size 64 on consumer GPUs and 256 on workstation cards.
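A small batching helper makes that advice easy to apply to an arbitrarily long list of documents. This is a sketch under the assumption that the embedding call is passed in as a function; chunked and embed_in_batches are illustrative names, and the default of 64 follows the consumer-GPU figure above.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts: Sequence[str],
                     embed_batch: Callable[[Sequence[str]], List[List[float]]],
                     batch_size: int = 64) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With the package, embed_batch could be something like lambda b: ollama.embed(model="nomic-embed-text", input=list(b)).embeddings.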

Storing for Retrieval

import numpy as np

vectors = np.array(result.embeddings, dtype=np.float32)
# Normalise for cosine similarity via dot product
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
np.save("embeddings.npy", vectors)

Pair with FAISS, ChromaDB, or pgvector depending on scale.
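For a few thousand vectors a vector database is optional; a brute-force cosine search is often enough to start with. The sketch below is written in plain Python to stay dependency-free (the same idea is a single matrix product in NumPy), and normalise and top_k are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k vectors closest to `query`."""
    q = normalise(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalise(v))))
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned indices map back to the original document list, so the caller can look up the matching text.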

Tool Calling and Functions {#tool-calling}

Tool calling lets the model decide to invoke a function with structured arguments instead of responding in prose.

import ollama
import json

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "condition": "clear"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.message.tool_calls:
    for call in response.message.tool_calls:
        if call.function.name == "get_weather":
            args = call.function.arguments
            result = get_weather(**args)
            # Send result back to model for final response
            follow_up = ollama.chat(
                model="llama3.1",
                messages=[
                    {"role": "user", "content": "What's the weather in Tokyo?"},
                    response.message,
                    {"role": "tool", "content": json.dumps(result)},
                ],
            )
            print(follow_up.message.content)

Tool calling reliability depends heavily on the model. Llama 3.1 8B is hit-or-miss on multi-tool decisions. Llama 3.1 70B and Qwen 2.5 72B are excellent (and the newer Qwen 3 models push reliability further still). Smaller 3B models often hallucinate tool names. The official Ollama tool support post tracks which models are tested for tool calling.

Production Patterns: Retries, Timeouts, Pools {#production}

Custom Client with Sane Defaults

import ollama
import httpx

client = ollama.Client(
    host="http://ollama.internal:11434",
    timeout=httpx.Timeout(60.0, read=600.0, connect=5.0),
    headers={"X-Source": "myapp/1.0"},
)

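The three-part timeout is critical: 5s to connect, 600s between reads (which is what slow generations and streaming need), and 60s as the default for the remaining phases (write and pool acquisition). Note that httpx has no single total-request timeout.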

Retry with Backoff

import random
import time
import httpx
from ollama import Client, ResponseError

def chat_with_retry(client: Client, messages, model="llama3.2", max_attempts=3):
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat(model=model, messages=messages)
        except ResponseError as e:
            # 4xx means a bad request; retrying will not help
            if 400 <= e.status_code < 500:
                raise
            if attempt == max_attempts - 1:
                raise
        except (ConnectionError, httpx.ConnectError, httpx.ReadTimeout):
            # Daemon down or slow; newer package versions raise the built-in ConnectionError
            if attempt == max_attempts - 1:
                raise
        # Exponential backoff with jitter before the next attempt
        time.sleep(delay + random.uniform(0, delay / 2))
        delay *= 2

Distinguish 4xx (client error, do not retry) from 5xx and connection errors (worth retrying). Exponential backoff with jitter prevents thundering herd when an Ollama instance restarts.

Connection Pooling

The default ollama.Client already pools HTTP connections via httpx. For high-throughput workloads, share one client across the application — do not instantiate per request.

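# Bad: builds a new client, and a new connection pool, on every call
def chat(msg):
    client = ollama.Client()
    return client.chat(model="llama3.2", messages=[{"role": "user", "content": msg}])

# Good: one shared client for the whole application
# (the module-level ollama.chat() also reuses a single default client)
_client = ollama.Client()
def chat(msg):
    return _client.chat(model="llama3.2", messages=[{"role": "user", "content": msg}])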

For the wider context of running Ollama under heavy traffic, the Ollama production deployment guide covers the server-side configuration.

FastAPI Integration {#fastapi}

FastAPI is the most common pairing with Ollama in Python production stacks.

from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama

class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama3.2"

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up the model on startup
    client = ollama.AsyncClient()
    await client.chat(model="llama3.2", messages=[{"role": "user", "content": "ping"}])
    app.state.ollama = client
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def generate():
        async for chunk in await app.state.ollama.chat(
            model=req.model,
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        ):
            yield chunk.message.content

    return StreamingResponse(generate(), media_type="text/plain")

Three things make this production-ready: the lifespan handler warms the model so the first user does not pay the load cost, the AsyncClient is shared in app state, and the response is a true streaming response so users see tokens immediately.

Server-Sent Events

import json

@app.post("/chat/sse")
async def chat_sse(req: ChatRequest):
    async def generate():
        async for chunk in await app.state.ollama.chat(
            model=req.model,
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        ):
            # JSON-encode so newlines in the text cannot break SSE framing
            payload = json.dumps({"content": chunk.message.content})
            yield f"data: {payload}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

SSE is the right protocol for browser-side streaming and integrates cleanly with the Vercel AI SDK or similar frontend libraries.
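On the consuming side, a Python client or test script has to undo that framing. The sketch below shows one way to read data: lines from such an endpoint, assuming the stream ends with the [DONE] marker used above; iter_sse_data is a hypothetical helper and ignores the less common SSE fields (event, id, retry).

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str], done_marker: str = "[DONE]") -> Iterator[str]:
    """Yield the payload of each `data:` line until the done marker appears."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):]
        if payload.startswith(" "):
            payload = payload[1:]  # the SSE format strips one leading space
        if payload == done_marker:
            return
        yield payload
```

Feed it the decoded lines of the HTTP response body, for example from a streaming httpx or urllib response.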

OpenAI SDK Compatibility {#openai-compat}

If you have existing OpenAI code, you can swap in Ollama with three lines of change.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, value is ignored
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

What Works

Chat completions (sync and streaming)
Embeddings (client.embeddings.create())
Basic tool calling
Most prompt parameters (temperature, top_p, max_tokens)

What Does Not Work

Assistants API (no equivalent)
File uploads
Fine-tuning endpoints
Custom Modelfile creation (use ollama-python for that)
Some OpenAI-specific parameters (logprobs, n>1)

When to Use Which

| Scenario | Recommended |
|---|---|
| Migrating from OpenAI to local | OpenAI SDK with Ollama base_url |
| Greenfield local-only app | ollama-python |
| Need custom Modelfiles | ollama-python |
| Hybrid local + cloud routing | OpenAI SDK behind LiteLLM gateway |

For more on the gateway pattern, see the AI gateway guide.

The official Python SDK for Ollama is documented at github.com/ollama/ollama-python and stays current with the Ollama server release cycle.

Common Pitfalls {#pitfalls}

Pitfall 1: Forgetting to Pull the Model

ollama.chat(model="llama3.2", messages=msgs)
# ResponseError: model "llama3.2" not found

The Python client does not auto-pull. Always run ollama pull llama3.2 first, or implement a pull-if-missing helper:

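import ollama

def ensure_model(name: str):
    # Installed names carry a tag, so "llama3.2" is listed as "llama3.2:latest"
    wanted = name if ":" in name else f"{name}:latest"
    installed = {m.model for m in ollama.list().models}
    if wanted not in installed:
        for progress in ollama.pull(name, stream=True):
            print(progress.status, end="\r")
        print()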

Pitfall 2: Mixing Sync and Async Clients

The sync ollama.chat() and the AsyncClient are not interchangeable. Calling sync from inside an async event loop blocks the loop. Always pick one and stick with it per code path.
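When a blocking call cannot be avoided inside async code (a legacy helper built on the sync client, say), one workaround is to push it onto a worker thread so the event loop keeps running. This is a minimal sketch; run_blocking is a hypothetical helper, and slow_sync_call only simulates a blocking request with a sleep.

```python
import asyncio
import functools
import time
from typing import Any, Callable


async def run_blocking(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a blocking callable in the default thread pool so the loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, functools.partial(fn, *args, **kwargs))


def slow_sync_call(prompt: str) -> str:
    """Hypothetical stand-in for a blocking call such as the sync ollama.chat()."""
    time.sleep(0.05)
    return f"echo: {prompt}"


async def demo() -> list:
    # Both calls overlap because neither blocks the event loop
    return await asyncio.gather(
        run_blocking(slow_sync_call, "one"),
        run_blocking(slow_sync_call, prompt="two"),
    )
```

This is a stopgap rather than a replacement for AsyncClient: each blocked call still occupies a thread for its full duration.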

Pitfall 3: Long Context Without `num_ctx`

Ollama uses a small default context window unless you override it: 2048 tokens historically, 4096 in more recent releases. Pass a long document without raising it and the prompt is silently truncated:

response = ollama.chat(
    model="llama3.2",
    messages=msgs,
    options={"num_ctx": 8192},  # explicitly set
)

Pitfall 4: Not Handling Empty Tool Calls

response.message.tool_calls can be None even on a tool-enabled model — the model decides whether to call a tool. Always null-check before iterating.
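One way to make that null-check hard to forget is to route every response through a small dispatcher. The sketch below assumes tool calls expose function.name and function.arguments as in the tool calling example earlier; run_tool_calls and the registry dict are hypothetical, and the stand-in message is built by hand so the snippet runs without a model.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def run_tool_calls(message: Any, registry: Dict[str, Callable[..., Any]]) -> List[dict]:
    """Execute each requested tool and return one result record per call."""
    results = []
    # tool_calls may be None or missing when the model answers in prose
    for call in getattr(message, "tool_calls", None) or []:
        name = call.function.name
        handler = registry.get(name)
        if handler is None:
            # Models sometimes invent tool names; report it rather than crash
            results.append({"tool": name, "error": "unknown tool"})
            continue
        results.append({"tool": name, "result": handler(**call.function.arguments)})
    return results


# Usage with a hand-built stand-in for response.message
fake_call = SimpleNamespace(
    function=SimpleNamespace(name="get_weather", arguments={"city": "Tokyo"})
)
fake_message = SimpleNamespace(tool_calls=[fake_call])
print(run_tool_calls(fake_message, {"get_weather": lambda city: {"city": city, "temp_c": 21}}))
```

An empty result list means the model chose to answer directly, so the caller should use message.content instead.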

Pitfall 5: Logging Whole Responses

response.message.content can be megabytes for long generations. Logging the full object on every request fills disks fast. Log a hash and length, not the content.
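A sketch of what logging a hash and length can look like is shown below. The logger name and the 16-character hash prefix are arbitrary choices for illustration.

```python
import hashlib
import logging

logger = logging.getLogger("ollama_app")  # hypothetical logger name


def summarise_for_log(text: str) -> dict:
    """Return a compact, non-reversible description of a model response."""
    encoded = text.encode("utf-8")
    return {
        "chars": len(text),
        "bytes": len(encoded),
        # A short prefix is enough to correlate identical responses
        "sha256": hashlib.sha256(encoded).hexdigest()[:16],
    }


def log_response(text: str) -> None:
    """Log the summary instead of the full response body."""
    logger.info("ollama response %s", summarise_for_log(text))
```

Identical responses produce identical hashes, which is usually enough to spot repeated or cached output without storing the text itself.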

For broader troubleshooting beyond Python-specific issues, the Ollama troubleshooting guide covers server-side problems.

Final Notes

The Ollama Python API is small enough to learn in an afternoon and capable enough to ship to production. The interesting work is not the API surface — it is what you build on top: streaming UIs that feel instant, retry policies that survive a flaky GPU server, structured-output pipelines that extract data from messy text, embedding stores that turn a folder of PDFs into a searchable knowledge base.

Start with the four-line example. Layer in streaming once the basic call works. Move to AsyncClient when the app gets a real server. Add retries when the first 502 wakes you up. Switch to the OpenAI SDK only if you genuinely need cross-provider portability — most apps do not. By the time you have walked through every section above, you will have a Python integration that is indistinguishable in quality from a paid API client, with the bill set permanently to zero.

Written by the Local AI Master Team
