Ollama Python API Guide: From Hello World to Production

April 23, 2026
22 min read
LocalAimaster Research Team


The single line pip install ollama is everything most tutorials cover before declaring victory. The interesting questions start two minutes later. How do I stream tokens? Why does my Pydantic schema validate sometimes and not others? Should I keep using requests or switch to the official package? Why does my FastAPI endpoint hang forever? What changes when I move this code from a laptop to a shared Ollama server?

This guide is the answer to those questions, in the order a real Python application encounters them. We start with the smallest working example, then layer on streaming, async, structured output, embeddings, retries, FastAPI integration, and finally a comparison with the OpenAI SDK route. Every snippet runs against a real local Ollama instance, with benchmarks where they matter.

Quick Start: pip install ollama then run the four-line example below. If it answers, you are ready for the rest of this guide.

import ollama

response = ollama.chat(model="llama3.2", messages=[
    {"role": "user", "content": "What is the capital of Japan?"}
])
print(response.message.content)

Table of Contents

  1. Setup and First Call
  2. Streaming Tokens
  3. Async Client
  4. Structured Output and JSON Mode
  5. Embeddings
  6. Tool Calling and Functions
  7. Production Patterns: Retries, Timeouts, Pools
  8. FastAPI Integration
  9. OpenAI SDK Compatibility
  10. Common Pitfalls
  11. Frequently Asked Questions

Setup and First Call {#setup}

Install the Package

# Requires Python 3.8+
pip install ollama

# Pin a known-good version for production
pip install ollama==0.4.4

Verify Ollama Is Running

# Should return a JSON list of installed models
curl -s http://localhost:11434/api/tags | python -m json.tool

If the curl fails, Ollama is not running. On Linux: sudo systemctl start ollama. On macOS: brew services start ollama or ollama serve in another terminal. On Windows the Ollama installer adds a tray app — make sure it is running.
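The same check can be made from Python before your application starts. This is a minimal sketch using only the standard library; the function name `ollama_is_up` and the two-second timeout are illustrative choices, not part of the Ollama client.

```python
import json
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the /api/tags endpoint answers with a JSON model list."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers non-JSON bodies
        return False
    return isinstance(payload, dict) and "models" in payload
```

Calling `ollama_is_up()` at startup lets you fail fast with a clear message instead of hitting a connection error on the first chat request.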

Smallest Working Example

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Three reasons to learn Python."},
    ],
)
print(response.message.content)
print(f"Tokens: {response.eval_count} in {response.eval_duration / 1e9:.2f}s")

The response object is a ChatResponse with attributes: message.content, message.role, model, created_at, done_reason, total_duration, prompt_eval_count, prompt_eval_duration, eval_count, eval_duration. The duration fields are nanoseconds — divide by 1e9 for seconds.
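Because every duration is in nanoseconds, a small helper keeps the unit conversion in one place. The sketch below assumes only the `eval_count` and `eval_duration` attribute names listed above; `tokens_per_second` is a hypothetical helper, and a stand-in object is used so the snippet runs without a server.

```python
from types import SimpleNamespace


def tokens_per_second(response) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    if not response.eval_duration:
        return 0.0
    return response.eval_count / (response.eval_duration / 1e9)


# Stand-in object with the same attribute names as ChatResponse
fake = SimpleNamespace(eval_count=120, eval_duration=2_000_000_000)
print(f"{tokens_per_second(fake):.1f} tok/s")  # prints "60.0 tok/s"
```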

For a guide on which model to put after model=, the best Ollama models shortlist is the right starting point.


Streaming Tokens {#streaming}

Streaming is non-optional for any user-facing app. Without it, the user stares at a spinner for 8 seconds; with it, they see the first word in 200 ms.

Synchronous Stream

import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain TLS in three sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
print()

Each chunk yields one token (or a small group of tokens). chunk.done becomes True on the final iteration, and that final chunk contains the full eval timing fields.

Capture Both Tokens and Final Stats

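import ollama

msgs = [{"role": "user", "content": "Explain TLS in three sentences."}]

buffer = []
for chunk in ollama.chat(model="llama3.2", messages=msgs, stream=True):
    piece = chunk.message.content
    buffer.append(piece)
    print(piece, end="", flush=True)  # emit while accumulating
    if chunk.done:
        full_text = "".join(buffer)
        tps = chunk.eval_count / (chunk.eval_duration / 1e9)
        print(f"\n[generated {chunk.eval_count} tokens at {tps:.1f} tok/s]")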

This pattern — accumulate while emitting, capture stats on the final chunk — is the right shape for most production code.
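Wrapped as a reusable function, the same shape might look like the sketch below. `consume_stream` and `_fake_chunk` are hypothetical names; the fake chunks only mirror the attributes used in this section so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Optional, Tuple


def consume_stream(chunks: Iterable, emit: Callable[[str], None]) -> Tuple[str, Optional[float]]:
    """Emit each piece as it arrives; return the full text and tokens/sec."""
    parts = []
    tps = None
    for chunk in chunks:
        piece = chunk.message.content
        parts.append(piece)
        emit(piece)
        if chunk.done and chunk.eval_duration:
            tps = chunk.eval_count / (chunk.eval_duration / 1e9)
    return "".join(parts), tps


def _fake_chunk(text, done=False, eval_count=None, eval_duration=None):
    # Test double mirroring the attributes used above
    return SimpleNamespace(message=SimpleNamespace(content=text), done=done,
                           eval_count=eval_count, eval_duration=eval_duration)


shown = []
text, tps = consume_stream(
    [_fake_chunk("Hel"), _fake_chunk("lo"), _fake_chunk("", True, 2, 500_000_000)],
    shown.append,
)
```

With a real stream you would pass `ollama.chat(..., stream=True)` as `chunks` and something like `lambda s: print(s, end="", flush=True)` as `emit`.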

Cancelling a Stream Mid-Flight

import ollama

stream = ollama.chat(model="llama3.2", messages=msgs, stream=True)
try:
    for chunk in stream:
        print(chunk.message.content, end="", flush=True)
        if user_cancelled():
            break
finally:
    stream.close()  # Critical: releases the HTTP connection

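Skipping stream.close() can leave the HTTP connection open after you stop reading, so the server may keep generating tokens nobody consumes until the connection is torn down. Separately, Ollama keeps a model loaded for its keep_alive window (five minutes by default) after the last request, so GPU memory that stays allocated after your script exits is usually expected behaviour rather than a leak.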
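If you would rather not repeat the try/finally everywhere, `contextlib.closing` can guarantee the close for you. A minimal sketch: `stream_until` is a hypothetical helper, and a plain generator stands in for the Ollama stream so the behaviour is visible without a server.

```python
from contextlib import closing


def stream_until(stream, should_stop):
    """Yield chunks until should_stop() is true; always close the stream."""
    with closing(stream):
        for chunk in stream:
            yield chunk
            if should_stop():
                break


# Demonstration with a plain generator standing in for the Ollama stream
closed = []

def fake_stream():
    try:
        for token in ["a", "b", "c", "d"]:
            yield token
    finally:
        closed.append(True)  # runs when the generator is closed

seen = []
for tok in stream_until(fake_stream(), lambda: len(seen) >= 2):
    seen.append(tok)
```

After the loop, `seen` holds the first two tokens and the underlying generator has been closed, which is what releases the connection in the real client.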


Async Client {#async}

Anywhere you would use asyncio, FastAPI, or aiohttp, use ollama.AsyncClient instead of the sync API.

import asyncio
import ollama

async def main():
    client = ollama.AsyncClient()
    response = await client.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Hello in three languages."}],
    )
    print(response.message.content)

asyncio.run(main())

Async Streaming

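import ollama

msgs = [{"role": "user", "content": "Explain TLS in three sentences."}]

async def stream_async():
    client = ollama.AsyncClient()
    # client.chat(...) must be awaited to obtain the async iterator
    async for chunk in await client.chat(
        model="llama3.2",
        messages=msgs,
        stream=True,
    ):
        print(chunk.message.content, end="", flush=True)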

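Note the awkward "async for ... in await" construction: await client.chat(stream=True) returns the async iterator, and async for then iterates it. There is only one await, but it is easy to omit; without it, async for receives a coroutine and raises a TypeError. This is the documented usage even though it looks redundant.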

Concurrency Caveat

async def fan_out():
    client = ollama.AsyncClient()
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [
        client.chat(model="llama3.2", messages=[{"role": "user", "content": p}])
        for p in prompts
    ]
    results = await asyncio.gather(*tasks)
    return [r.message.content for r in results]

This will run 3 chats. Whether they execute in parallel or sequentially depends on the server, not the client. Set OLLAMA_NUM_PARALLEL=4 in the Ollama environment to actually parallelise. Otherwise the requests queue.
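On the client side it can help to cap in-flight requests so a large fan-out does not pile up in the server queue. The sketch below uses `asyncio.Semaphore`; `bounded_gather` is a hypothetical helper, matching `limit` to OLLAMA_NUM_PARALLEL is an assumption rather than a requirement, and a stand-in coroutine replaces `client.chat` so the example runs on its own.

```python
import asyncio


async def bounded_gather(worker, items, limit: int = 4):
    """Run worker(item) for every item with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


# Demonstration with a stand-in coroutine instead of client.chat
in_flight = 0
peak = 0

async def fake_chat(prompt):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return prompt.upper()

results = asyncio.run(bounded_gather(fake_chat, ["a", "b", "c", "d", "e"], limit=2))
```

Results come back in input order, and `peak` never exceeds the limit.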


Structured Output and JSON Mode {#structured}

Real applications need structured output, not free text. Three options exist.

Option 1: format="json"

import json
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": "Extract name, age, and email from: 'I am Alice, 32, alice@example.com'. Return JSON."
    }],
    format="json",
)
data = json.loads(response.message.content)
print(data)
# {'name': 'Alice', 'age': 32, 'email': 'alice@example.com'}

format="json" constrains generation to valid JSON at the sampler. The output parses as long as generation finishes normally; a response cut off by a token limit can still end mid-object. The schema is whatever the model decides.

Option 2: Pydantic Schema

from pydantic import BaseModel
import ollama

class Person(BaseModel):
    name: str
    age: int
    email: str

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Extract: Alice, 32, alice@example.com"}],
    format=Person.model_json_schema(),
)
person = Person.model_validate_json(response.message.content)

Pydantic schema mode constrains generation to the actual schema. This is what you want for production extraction pipelines.
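Even with a schema, a response can occasionally fail validation, for example when it is cut off. One common mitigation is to validate and retry. The sketch below is deliberately generic: `extract_with_retry`, `call_model`, and `validate` are hypothetical names, and the demo uses canned replies and a hand-written validator so it runs without a model.

```python
import json


def extract_with_retry(call_model, validate, max_attempts: int = 3):
    """Call the model until validate() accepts the output or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model()
        try:
            return validate(raw)
        except ValueError as exc:  # JSON and Pydantic validation errors subclass ValueError
            last_error = exc
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error


def validate_person(raw: str) -> dict:
    data = json.loads(raw)
    if not isinstance(data.get("age"), int):
        raise ValueError("age must be an integer")
    return data


# First canned reply is truncated JSON, second is valid
replies = iter(['{"name": "Alice", "age": "32"', '{"name": "Alice", "age": 32}'])
person = extract_with_retry(lambda: next(replies), validate_person)
```

In the Pydantic example above, `call_model` would wrap the `ollama.chat(...)` call and return `response.message.content`, and `validate` would be `Person.model_validate_json`.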

Option 3: Tools (for function-calling models)

Tool calling models (Llama 3.1+, Qwen 2.5, Mistral Large) accept a tools array and return a tool call object. See the tool calling section below for a full example.

Reliability Comparison

| Method | Schema enforcement | Truncation safe | Token overhead |
|---|---|---|---|
| Plain prompt | None | No | 0 |
| format="json" | JSON only | Sometimes | ~5% |
| Pydantic schema | Full schema | Sometimes | ~10% |
| Tool call | Function signature | Yes (typed) | ~15% |

For embeddings-driven retrieval pipelines that complement structured extraction, the Ollama semantic search guide covers the surrounding architecture.


Embeddings {#embeddings}

The Ollama Python API supports two endpoints: embeddings() (legacy, single input) and embed() (newer, batched).

import ollama

# Pull an embedding model first
# ollama pull nomic-embed-text

# Single input
result = ollama.embeddings(model="nomic-embed-text", prompt="Hello world")
print(len(result.embedding))  # 768

# Batched (recommended)
result = ollama.embed(
    model="nomic-embed-text",
    input=["Document A", "Document B", "Document C"],
)
print(len(result.embeddings))     # 3
print(len(result.embeddings[0]))  # 768

Throughput Comparison

For 1000 short documents on an RTX 4090:

| Method | Wall time | Tokens/sec |
|---|---|---|
| embeddings() looped | 24.3 s | ~330 |
| embed() batched (32) | 3.1 s | ~2600 |
| embed() batched (128) | 2.4 s | ~3400 |

Always batch when you can. Diminishing returns kick in around batch size 64 on consumer GPUs and 256 on workstation cards.
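Batching a large corpus comes down to slicing the input and concatenating the results in order. A minimal sketch: `batches` and `embed_all` are hypothetical helpers, and `embed_batch` is any callable that maps a list of strings to a list of vectors, so the example can be exercised with a fake.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(embed_batch: Callable[[Sequence[str]], List[List[float]]],
              docs: Sequence[str], batch_size: int = 64) -> List[List[float]]:
    """Embed docs batch by batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batches(docs, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Fake embedder: one-dimensional "vector" holding the document length
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(doc))] for doc in batch]

vectors = embed_all(fake_embed, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
```

With the real client, `embed_batch` could be `lambda b: ollama.embed(model="nomic-embed-text", input=list(b)).embeddings`.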

Storing for Retrieval

import numpy as np

vectors = np.array(result.embeddings, dtype=np.float32)
# Normalise for cosine similarity via dot product
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
np.save("embeddings.npy", vectors)

Pair with FAISS, ChromaDB, or pgvector depending on scale.
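Before reaching for a vector database, a brute-force search is often enough for a few thousand vectors. The sketch below is dependency-free and purely illustrative; `normalise` and `top_k` are hypothetical helpers, and the two-dimensional vectors are toy data rather than real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) pairs for the k closest vectors."""
    q = normalise(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalise(v))))
        for i, v in enumerate(vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([2.0, 0.1], docs, k=2)
```

Here the query points almost along the first axis, so document 0 ranks first and the diagonal document 2 second.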


Tool Calling and Functions {#tool-calling}

Tool calling lets the model decide to invoke a function with structured arguments instead of responding in prose.

import ollama
import json

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "condition": "clear"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.message.tool_calls:
    for call in response.message.tool_calls:
        if call.function.name == "get_weather":
            args = call.function.arguments
            result = get_weather(**args)
            # Send result back to model for final response
            follow_up = ollama.chat(
                model="llama3.1",
                messages=[
                    {"role": "user", "content": "What's the weather in Tokyo?"},
                    response.message,
                    {"role": "tool", "content": json.dumps(result)},
                ],
            )
            print(follow_up.message.content)

Tool calling reliability depends heavily on the model. Llama 3.1 8B is hit-or-miss on multi-tool decisions. Llama 3.1 70B and Qwen 2.5 72B are excellent. Smaller 3B models often hallucinate tool names. The official Ollama tool support post tracks which models are tested for tool calling.
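Since smaller models can invent tool names or arguments, it is worth validating every call before executing it. A minimal sketch of a registry-based dispatcher: `dispatch_tool_call` and the error payload shape are assumptions for illustration, not part of the Ollama API.

```python
def dispatch_tool_call(registry: dict, name: str, arguments: dict) -> dict:
    """Run a registered tool, or return an error payload the model can read."""
    func = registry.get(name)
    if func is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": func(**arguments)}
    except TypeError as exc:
        # Typically missing or unexpected arguments (also catches TypeErrors raised inside the tool)
        return {"error": f"bad arguments for {name}: {exc}"}


def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "condition": "clear"}


registry = {"get_weather": get_weather}
ok = dispatch_tool_call(registry, "get_weather", {"city": "Tokyo"})
unknown = dispatch_tool_call(registry, "get_wether", {"city": "Tokyo"})
bad = dispatch_tool_call(registry, "get_weather", {"town": "Tokyo"})
```

Returning the error as the tool message, instead of raising, gives the model a chance to correct itself on the next turn.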


Production Patterns: Retries, Timeouts, Pools {#production}

Custom Client with Sane Defaults

import ollama
import httpx

client = ollama.Client(
    host="http://ollama.internal:11434",
    timeout=httpx.Timeout(60.0, read=600.0, connect=5.0),
    headers={"X-Source": "myapp/1.0"},
)

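The timeout has three parts: 5 s to connect, 600 s between reads (which is what long generations and streaming need), and 60 s as the default for the remaining write and pool timeouts. httpx has no single total-request timeout, so the 60.0 is not a cap on the whole call.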

Retry with Backoff

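import random
import time
import httpx
from ollama import Client, ResponseError

def chat_with_retry(client: Client, messages, model="llama3.2", max_attempts=3):
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat(model=model, messages=messages)
        # Some client versions re-raise connection failures as the built-in ConnectionError
        except (httpx.ConnectError, httpx.ReadTimeout, ConnectionError):
            # Transient network failure: retry unless this was the last attempt
            if attempt == max_attempts - 1:
                raise
        except ResponseError as e:
            # 4xx is a bad request and will not succeed on retry;
            # also give up once the attempts are exhausted
            if 400 <= e.status_code < 500 or attempt == max_attempts - 1:
                raise
        # Exponential backoff with jitter
        time.sleep(delay + random.uniform(0, delay / 2))
        delay *= 2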

Distinguish 4xx (client error, do not retry) from 5xx and connection errors (worth retrying). Exponential backoff with jitter prevents thundering herd when an Ollama instance restarts.
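The jitter part is easy to get subtly wrong, so it can help to isolate the delay schedule in its own function. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; `backoff_delays` and its default values are illustrative, and `rng` is injectable so the schedule can be tested deterministically.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0, rng=random.random):
    """Yield one sleep duration per retry using exponential backoff with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling  # uniform in [0, ceiling)
```

A retry loop would then call `time.sleep(delay)` for each value from `backoff_delays(max_attempts - 1)` between attempts.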

Connection Pooling

The default ollama.Client already pools HTTP connections via httpx. For high-throughput workloads, share one client across the application — do not instantiate per request.

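# Bad: builds a new client, and a new connection pool, on every call
def chat(msg):
    client = ollama.Client()
    return client.chat(model="llama3.2", messages=[{"role": "user", "content": msg}])

# Good: one shared client for the whole application
# (the module-level ollama.chat() also reuses a single default client)
_client = ollama.Client()
def chat(msg):
    return _client.chat(model="llama3.2", messages=[{"role": "user", "content": msg}])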

For the wider context of running Ollama under heavy traffic, the Ollama production deployment guide covers the server-side configuration.


FastAPI Integration {#fastapi}

FastAPI is the most common pairing with Ollama in Python production stacks.

from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama

class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama3.2"

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up the model on startup
    client = ollama.AsyncClient()
    await client.chat(model="llama3.2", messages=[{"role": "user", "content": "ping"}])
    app.state.ollama = client
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def generate():
        async for chunk in await app.state.ollama.chat(
            model=req.model,
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        ):
            yield chunk.message.content

    return StreamingResponse(generate(), media_type="text/plain")

Three things make this production-ready: the lifespan handler warms the model so the first user does not pay the load cost, the AsyncClient is shared in app state, and the response is a true streaming response so users see tokens immediately.

Server-Sent Events

import json

@app.post("/chat/sse")
async def chat_sse(req: ChatRequest):
    async def generate():
        async for chunk in await app.state.ollama.chat(
            model=req.model,
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        ):
            # JSON-encode each piece so newlines in the text cannot break SSE framing
            payload = json.dumps({"content": chunk.message.content})
            yield f"data: {payload}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

SSE is the right protocol for browser-side streaming and integrates cleanly with the Vercel AI SDK or similar frontend libraries.
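Model output often contains newlines, and a raw newline inside a data: field ends that field early. One way to keep frames well-formed when sending raw text is to emit one data: line per text line, which the SSE format allows and which clients rejoin with newlines. `sse_event` below is a hypothetical helper sketching that framing.

```python
def sse_event(data: str, event: str = "") -> str:
    """Frame a payload as one Server-Sent Event, one data: line per text line."""
    lines = [f"event: {event}"] if event else []
    # split("\n") keeps empty segments, so blank lines inside the payload survive
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"
```

Inside a streaming generator you would `yield sse_event(chunk.message.content)` for each chunk and finish with something like `sse_event("", event="done")`.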


OpenAI SDK Compatibility {#openai-compat}

If you have existing OpenAI code, you can swap in Ollama with three lines of change.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, value is ignored
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

What Works

  • Chat completions (sync and streaming)
  • Embeddings (client.embeddings.create())
  • Basic tool calling
  • Most prompt parameters (temperature, top_p, max_tokens)

What Does Not Work

  • Assistants API (no equivalent)
  • File uploads
  • Fine-tuning endpoints
  • Custom Modelfile creation (use ollama-python for that)
  • Some OpenAI-specific parameters (logprobs, n>1)

When to Use Which

| Scenario | Recommended |
|---|---|
| Migrating from OpenAI to local | OpenAI SDK with Ollama base_url |
| Greenfield local-only app | ollama-python |
| Need custom Modelfiles | ollama-python |
| Hybrid local + cloud routing | OpenAI SDK behind LiteLLM gateway |

For more on the gateway pattern, see the AI gateway guide.

The official Python SDK for Ollama is documented at github.com/ollama/ollama-python and stays current with the Ollama server release cycle.


Common Pitfalls {#pitfalls}

Pitfall 1: Forgetting to Pull the Model

ollama.chat(model="llama3.2", messages=msgs)
# ResponseError: model "llama3.2" not found

The Python client does not auto-pull. Always run ollama pull llama3.2 first, or implement a pull-if-missing helper:

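def ensure_model(name: str):
    # Installed names carry a tag (e.g. "llama3.2:latest"), so normalise first
    wanted = name if ":" in name else f"{name}:latest"
    installed = {m.model for m in ollama.list().models}
    if wanted not in installed:
        for progress in ollama.pull(name, stream=True):
            print(progress.status, end="\r")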

Pitfall 2: Mixing Sync and Async Clients

The sync ollama.chat() and the AsyncClient are not interchangeable. Calling sync from inside an async event loop blocks the loop. Always pick one and stick with it per code path.
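When a sync call inside async code is unavoidable, for instance a legacy helper you cannot rewrite yet, pushing it onto a worker thread keeps the event loop responsive. A minimal sketch using `run_in_executor`; `run_blocking` is a hypothetical helper and `slow_sync_chat` is a stand-in for a blocking `ollama.chat` call.

```python
import asyncio
import functools
import time


async def run_blocking(func, *args, **kwargs):
    """Run a blocking callable in the default thread pool without stalling the loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))


def slow_sync_chat(prompt: str) -> str:
    time.sleep(0.05)  # stand-in for a blocking ollama.chat call
    return prompt[::-1]


async def demo():
    ticks = 0

    async def ticker():
        # Keeps counting only if the event loop is not blocked
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    task = asyncio.create_task(ticker())
    result = await run_blocking(slow_sync_chat, "ping")
    task.cancel()
    return result, ticks

result, ticks = asyncio.run(demo())
```

The ticker keeps advancing while the blocking call runs, which is the behaviour you lose when the sync client is called directly on the loop.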

Pitfall 3: Long Context Without num_ctx

Ollama applies a small default context window unless you override it (2048 tokens in many releases; check the default for your server version). Pass long documents and the model silently truncates the prompt:

response = ollama.chat(
    model="llama3.2",
    messages=msgs,
    options={"num_ctx": 8192},  # explicitly set
)
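Choosing the value can be automated with a rough estimate. The sketch below assumes about four characters per token, which is only a rule of thumb for English text and not a real tokenizer count; `pick_num_ctx`, the 512-token reply budget, and the 32768 ceiling are all illustrative assumptions.

```python
def pick_num_ctx(prompt_chars: int, reply_tokens: int = 512,
                 chars_per_token: float = 4.0, max_ctx: int = 32768) -> int:
    """Smallest power-of-two context that fits the estimated prompt plus reply."""
    needed = int(prompt_chars / chars_per_token) + reply_tokens
    ctx = 2048
    while ctx < needed and ctx < max_ctx:
        ctx *= 2
    return ctx
```

You would then pass `options={"num_ctx": pick_num_ctx(len(document))}`. Larger contexts use more memory, so the ceiling should reflect what your hardware and model actually support.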

Pitfall 4: Not Handling Empty Tool Calls

response.message.tool_calls can be None even on a tool-enabled model — the model decides whether to call a tool. Always null-check before iterating.

Pitfall 5: Logging Whole Responses

response.message.content can be megabytes for long generations. Logging the full object on every request fills disks fast. Log a hash and length, not the content.
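A compact summary is enough to correlate log lines with stored responses. The sketch below is one possible shape; `log_fields` and the 16-character hash prefix are arbitrary illustrative choices.

```python
import hashlib


def log_fields(content: str) -> dict:
    """Compact, non-reversible summary of a response for log lines."""
    encoded = content.encode("utf-8")
    return {
        "sha256": hashlib.sha256(encoded).hexdigest()[:16],
        "chars": len(content),
        "bytes": len(encoded),
    }
```

For example, `logger.info("chat done", extra=log_fields(response.message.content))` records the size and a fingerprint without writing the text itself.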

For broader troubleshooting beyond Python-specific issues, the Ollama troubleshooting guide covers server-side problems.


Final Notes

The Ollama Python API is small enough to learn in an afternoon and capable enough to ship to production. The interesting work is not the API surface — it is what you build on top: streaming UIs that feel instant, retry policies that survive a flaky GPU server, structured-output pipelines that extract data from messy text, embedding stores that turn a folder of PDFs into a searchable knowledge base.

Start with the four-line example. Layer in streaming once the basic call works. Move to AsyncClient when the app gets a real server. Add retries when the first 502 wakes you up. Switch to the OpenAI SDK only if you genuinely need cross-provider portability — most apps do not. By the time you have walked through every section above, you will have a Python integration that is indistinguishable in quality from a paid API client, with the bill set permanently to zero.

Published: April 23, 2026. Last updated: April 23, 2026. Manually reviewed.

Written by Pattanaik Ramswarup
