Ollama Python API Guide: From Hello World to Production
Published on April 23, 2026 -- 22 min read
The single line pip install ollama is everything most tutorials cover before declaring victory. The interesting questions start two minutes later. How do I stream tokens? Why does my Pydantic schema validate sometimes and not others? Should I keep using requests or switch to the official package? Why does my FastAPI endpoint hang forever? What changes when I move this code from a laptop to a shared Ollama server?
This guide is the answer to those questions, in the order a real Python application encounters them. We start with the smallest working example, then layer on streaming, async, structured output, embeddings, retries, FastAPI integration, and finally a comparison with the OpenAI SDK route. Every snippet runs against a real local Ollama instance, with benchmarks where they matter.
Quick Start:
pip install ollama, then run the four-line example below. If it answers, you are ready for the rest of this guide.
import ollama

response = ollama.chat(model="llama3.2", messages=[
    {"role": "user", "content": "What is the capital of Japan?"}
])
print(response.message.content)
Table of Contents
- Setup and First Call
- Streaming Tokens
- Async Client
- Structured Output and JSON Mode
- Embeddings
- Tool Calling and Functions
- Production Patterns: Retries, Timeouts, Pools
- FastAPI Integration
- OpenAI SDK Compatibility
- Common Pitfalls
- Frequently Asked Questions
Setup and First Call {#setup}
Install the Package
# Requires Python 3.8+
pip install ollama
# Pin a known-good version for production
pip install ollama==0.4.4
Verify Ollama Is Running
# Should return a JSON list of installed models
curl -s http://localhost:11434/api/tags | python -m json.tool
If the curl fails, Ollama is not running. On Linux: sudo systemctl start ollama. On macOS: brew services start ollama or ollama serve in another terminal. On Windows the Ollama installer adds a tray app — make sure it is running.
Smallest Working Example
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Three reasons to learn Python."},
    ],
)
print(response.message.content)
print(f"Tokens: {response.eval_count} in {response.eval_duration / 1e9:.2f}s")
The response object is a ChatResponse with attributes: message.content, message.role, model, created_at, done_reason, total_duration, prompt_eval_count, prompt_eval_duration, eval_count, eval_duration. The duration fields are nanoseconds — divide by 1e9 for seconds.
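Because every duration field is in nanoseconds, it helps to keep the unit conversion in one place. The helper below is a minimal sketch: the name tokens_per_second is hypothetical, and it takes the two numbers directly so you can try it without a running server.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert eval_count and eval_duration (nanoseconds) to tokens per second."""
    if eval_duration_ns <= 0:
        # Guard against missing or zero timings
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# 120 tokens generated in 2.4 seconds (2.4e9 ns)
print(f"{tokens_per_second(120, 2_400_000_000):.1f} tok/s")
```

With a real response you would call tokens_per_second(response.eval_count, response.eval_duration).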
For a guide on which model to put after model=, the best Ollama models shortlist is the right starting point.
Streaming Tokens {#streaming}
Streaming is non-optional for any user-facing app. Without it, the user stares at a spinner for 8 seconds; with it, they see the first word in 200 ms.
Synchronous Stream
import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain TLS in three sentences."}],
    stream=True,
)
for chunk in stream:
    print(chunk.message.content, end="", flush=True)
print()
Each chunk yields one token (or a small group of tokens). chunk.done becomes True on the final iteration, and that final chunk contains the full eval timing fields.
Capture Both Tokens and Final Stats
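import ollama

msgs = [{"role": "user", "content": "Explain TLS in three sentences."}]
buffer = []
for chunk in ollama.chat(model="llama3.2", messages=msgs, stream=True):
    # Emit the token immediately and keep a copy for the full text
    print(chunk.message.content, end="", flush=True)
    buffer.append(chunk.message.content)
    if chunk.done:
        full_text = "".join(buffer)
        tps = chunk.eval_count / (chunk.eval_duration / 1e9)
        print(f"\n[generated {chunk.eval_count} tokens at {tps:.1f} tok/s]")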
This pattern — accumulate while emitting, capture stats on the final chunk — is the right shape for most production code.
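The same shape can be wrapped in a reusable function. The sketch below is an illustration, not part of the ollama package: collect_stream and the _fake_chunk test double are hypothetical names, and the fake chunks only mimic the message.content, done, and eval fields described above so the logic can be exercised offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_token=print):
    """Emit each token as it arrives and return (full_text, final_chunk)."""
    buffer = []
    final = None
    for chunk in chunks:
        text = chunk.message.content
        buffer.append(text)
        on_token(text)
        if chunk.done:
            final = chunk  # the last chunk carries the timing stats
    return "".join(buffer), final


def _fake_chunk(text, done=False, **stats):
    # Stand-in for a streamed chat chunk (hypothetical test double)
    return SimpleNamespace(message=SimpleNamespace(content=text), done=done, **stats)


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(", world"),
    _fake_chunk("", done=True, eval_count=3, eval_duration=1_500_000_000),
]
seen = []
full_text, final_chunk = collect_stream(fake_stream, on_token=seen.append)
```

Against a live server you would pass ollama.chat(model=..., messages=..., stream=True) as the chunks argument and a UI callback as on_token.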
Cancelling a Stream Mid-Flight
import ollama

# msgs is your message list; user_cancelled() is a placeholder for
# whatever cancellation check your application provides
stream = ollama.chat(model="llama3.2", messages=msgs, stream=True)
try:
    for chunk in stream:
        print(chunk.message.content, end="", flush=True)
        if user_cancelled():
            break
finally:
    stream.close()  # Critical: releases the HTTP connection
Forgetting stream.close() is the most common cause of "Ollama keeps using GPU after my script exits." The HTTP connection stays open until Ollama's internal timeout (30-60 seconds) cleans it up.
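If the try/finally feels easy to forget, contextlib.closing from the standard library does the same job in a with block. The sketch below assumes the stream is a generator that exposes close(), as the cancel example above does; fake_stream is a hypothetical stand-in so the behaviour can be checked without a server.

```python
from contextlib import closing

closed = []


def fake_stream():
    # Stand-in for ollama.chat(..., stream=True); records when it is closed
    try:
        for token in ["TLS ", "encrypts ", "traffic ", "in ", "transit."]:
            yield token
    finally:
        closed.append(True)


received = []
with closing(fake_stream()) as stream:
    for token in stream:
        received.append(token)
        if len(received) == 2:  # simulate the user cancelling early
            break

# Leaving the with block called stream.close(), which ran the finally clause
print(received, closed)
```

The with block closes the stream on a normal exit, on break, and when an exception propagates.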
Async Client {#async}
Anywhere you would use asyncio, FastAPI, or aiohttp, use ollama.AsyncClient instead of the sync API.
import asyncio
import ollama

async def main():
    client = ollama.AsyncClient()
    response = await client.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Hello in three languages."}],
    )
    print(response.message.content)

asyncio.run(main())
Async Streaming
async def stream_async():
    client = ollama.AsyncClient()
    async for chunk in await client.chat(
        model="llama3.2",
        messages=msgs,
        stream=True,
    ):
        print(chunk.message.content, end="", flush=True)
Note the awkward async for ... in await combination: await client.chat(stream=True) returns the async generator, then async for iterates it. This is the documented usage even though it looks redundant.
Concurrency Caveat
async def fan_out():
    client = ollama.AsyncClient()
    prompts = ["Question 1", "Question 2", "Question 3"]
    tasks = [
        client.chat(model="llama3.2", messages=[{"role": "user", "content": p}])
        for p in prompts
    ]
    results = await asyncio.gather(*tasks)
    return [r.message.content for r in results]
This will run 3 chats. Whether they execute in parallel or sequentially depends on the server, not the client. Set OLLAMA_NUM_PARALLEL=4 in the Ollama environment to actually parallelise. Otherwise the requests queue.
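Since the server decides how many requests actually run in parallel, it can be worth capping the fan-out on the client so you do not queue hundreds of requests at once. The sketch below uses asyncio.Semaphore for that; bounded_gather and fake_chat are hypothetical names, and fake_chat stands in for client.chat so the example runs without a server.

```python
import asyncio


async def bounded_gather(coro_fns, limit: int):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(fn):
        async with semaphore:
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_chat(prompt):
        # Stand-in for client.chat(...); tracks how many calls overlap
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"answer to {prompt}"

    prompts = [f"Question {i}" for i in range(6)]
    # Bind each prompt via a default argument so every lambda keeps its own value
    jobs = [lambda p=p: fake_chat(p) for p in prompts]
    results = await bounded_gather(jobs, limit=2)
    return results, peak


results, peak = asyncio.run(_demo())
```

A reasonable starting point is to set limit to the same value as OLLAMA_NUM_PARALLEL, so the client never has more requests in flight than the server will process concurrently.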
Structured Output and JSON Mode {#structured}
Real applications need structured output, not free text. Three options exist.
Option 1: format="json"
import json
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": "Extract name, age, and email from: 'I am Alice, 32, alice@example.com'. Return JSON."
    }],
    format="json",
)
data = json.loads(response.message.content)
print(data)
# {'name': 'Alice', 'age': 32, 'email': 'alice@example.com'}
format="json" constrains generation to valid JSON at the sampler. The output parses as long as generation is not cut off by a token limit, but the schema is whatever the model decides.
Option 2: Pydantic Schema
from pydantic import BaseModel
import ollama

class Person(BaseModel):
    name: str
    age: int
    email: str

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Extract: Alice, 32, alice@example.com"}],
    format=Person.model_json_schema(),
)
person = Person.model_validate_json(response.message.content)
Pydantic schema mode constrains generation to the actual schema. This is what you want for production extraction pipelines.
Option 3: Tools (for function-calling models)
Tool calling models (Llama 3.1+, Qwen 2.5, Mistral Large) accept a tools array and return a tool call object. See the tool calling section below for a full example.
Reliability Comparison
| Method | Schema enforcement | Truncation safe | Token overhead |
|---|---|---|---|
| Plain prompt | None | No | 0 |
| format="json" | JSON only | Sometimes | ~5% |
| Pydantic schema | Full schema | Sometimes | ~10% |
| Tool call | Function signature | Yes (typed) | ~15% |
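The "Sometimes" entries in the truncation column are the reason to treat parsing as fallible even in JSON mode. The sketch below is one way to surface a cut-off reply as a distinct error. It rests on an assumption worth verifying against your Ollama version: that done_reason is "length" when the token limit ended generation. TruncatedReply and parse_json_reply are hypothetical names.

```python
import json
from typing import Any, Optional


class TruncatedReply(ValueError):
    """Raised when a structured reply appears to have been cut off."""


def parse_json_reply(content: str, done_reason: Optional[str] = None) -> Any:
    # Assumption: done_reason == "length" means the token limit ended generation
    if done_reason == "length":
        raise TruncatedReply("generation stopped at the token limit")
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        raise TruncatedReply(f"reply is not complete JSON: {exc}") from exc


ok = parse_json_reply('{"name": "Alice", "age": 32}', done_reason="stop")

try:
    parse_json_reply('{"name": "Alice", "age"', done_reason="stop")
    truncated = False
except TruncatedReply:
    truncated = True
```

With a real call you would pass response.message.content and response.done_reason, then retry with a larger token budget when TruncatedReply is raised.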
For embeddings-driven retrieval pipelines that complement structured extraction, the Ollama semantic search guide covers the surrounding architecture.
Embeddings {#embeddings}
The Ollama Python API supports two endpoints: embeddings() (legacy, single input) and embed() (newer, batched).
import ollama
# Pull an embedding model first
# ollama pull nomic-embed-text
# Single input
result = ollama.embeddings(model="nomic-embed-text", prompt="Hello world")
print(len(result.embedding)) # 768
# Batched (recommended)
result = ollama.embed(
    model="nomic-embed-text",
    input=["Document A", "Document B", "Document C"],
)
print(len(result.embeddings)) # 3
print(len(result.embeddings[0])) # 768
Throughput Comparison
For 1000 short documents on an RTX 4090:
| Method | Wall time | Tokens/sec |
|---|---|---|
| embeddings() looped | 24.3 s | ~330 |
| embed() batched (32) | 3.1 s | ~2600 |
| embed() batched (128) | 2.4 s | ~3400 |
Always batch when you can. Diminishing returns kick in around batch size 64 on consumer GPUs and 256 on workstation cards.
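Batching a large corpus just means slicing the document list before each embed() call. The helper below is a small sketch with a hypothetical name (batched); it is pure Python, so it runs without a model.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


docs = [f"Document {i}" for i in range(150)]
batch_sizes = [len(batch) for batch in batched(docs, 64)]
print(batch_sizes)  # [64, 64, 22]
```

In a real pipeline each batch would be passed as input= to ollama.embed() and the returned embeddings appended to one list.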
Storing for Retrieval
import numpy as np
vectors = np.array(result.embeddings, dtype=np.float32)
# Normalise for cosine similarity via dot product
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
np.save("embeddings.npy", vectors)
Pair with FAISS, ChromaDB, or pgvector depending on scale.
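Before reaching for a vector database, a few thousand normalised vectors can be searched with a single matrix product. The sketch below assumes the rows are already unit-normalised, as in the snippet above; top_k is a hypothetical helper, and the 2-D vectors are toy values standing in for real embeddings.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k rows of doc_vecs most similar to the query.

    Assumes the rows of doc_vecs are already unit-normalised.
    """
    query = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ query              # cosine similarity via dot product
    order = np.argsort(scores)[::-1][:k]   # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy unit vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
hits = top_k(np.array([1.0, 0.1], dtype=np.float32), docs, k=2)
print(hits)
```

The query vector would normally come from ollama.embed() with the same model that produced the stored vectors.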
Tool Calling and Functions {#tool-calling}
Tool calling lets the model decide to invoke a function with structured arguments instead of responding in prose.
import ollama
import json

def get_weather(city: str) -> dict:
    # Stubbed result; a real implementation would query a weather service
    return {"city": city, "temp_c": 21, "condition": "clear"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.message.tool_calls:
    for call in response.message.tool_calls:
        if call.function.name == "get_weather":
            args = call.function.arguments
            result = get_weather(**args)
            # Send result back to model for final response
            follow_up = ollama.chat(
                model="llama3.1",
                messages=[
                    {"role": "user", "content": "What's the weather in Tokyo?"},
                    response.message,
                    {"role": "tool", "content": json.dumps(result)},
                ],
            )
            print(follow_up.message.content)
else:
    # The model answered directly without requesting a tool
    print(response.message.content)
Tool calling reliability depends heavily on the model. Llama 3.1 8B is hit-or-miss on multi-tool decisions. Llama 3.1 70B and Qwen 2.5 72B are excellent. Smaller 3B models often hallucinate tool names. The official Ollama tool support post tracks which models are tested for tool calling.
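Because smaller models can invent tool names, it is safer to dispatch through an explicit registry than through a chain of if statements. The sketch below is illustrative: TOOL_REGISTRY, run_tool_calls, and the _fake_call test double are hypothetical names, and the fake calls only mimic the function.name and function.arguments attributes used in the example above.

```python
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "condition": "clear"}


# Only functions listed here may be invoked on the model's behalf
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls):
    """Execute known tool calls; report unknown names instead of raising."""
    results = []
    for call in tool_calls or []:  # tool_calls may be None
        fn = TOOL_REGISTRY.get(call.function.name)
        if fn is None:
            results.append({"error": f"unknown tool: {call.function.name}"})
            continue
        results.append(fn(**call.function.arguments))
    return results


def _fake_call(name, **arguments):
    # Stand-in for one entry of response.message.tool_calls (hypothetical test double)
    return SimpleNamespace(function=SimpleNamespace(name=name, arguments=arguments))


outputs = run_tool_calls([
    _fake_call("get_weather", city="Tokyo"),
    _fake_call("get_stock", ticker="X"),  # a hallucinated tool name
])
```

Returning the error as a tool result, rather than raising, gives the model a chance to recover in the follow-up turn.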
Production Patterns: Retries, Timeouts, Pools {#production}
Custom Client with Sane Defaults
import ollama
import httpx

client = ollama.Client(
    host="http://ollama.internal:11434",
    timeout=httpx.Timeout(60.0, read=600.0, connect=5.0),
    headers={"X-Source": "myapp/1.0"},
)
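The three-part timeout is critical: 5 s to connect, 600 s between reads (which is what streaming needs), and 60 s as the default for the remaining write and pool timeouts. Note that httpx has no single total-request timeout; the first argument only sets the default for the values you do not override.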
Retry with Backoff
import time
import httpx
from ollama import Client, ResponseError

def chat_with_retry(client: Client, messages, model="llama3.2", max_attempts=3):
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat(model=model, messages=messages)
        except (ConnectionError, httpx.ConnectError, httpx.ReadTimeout):
            # Connection failures are worth retrying. Some client versions
            # surface them as the built-in ConnectionError, so catch both.
            if attempt == max_attempts - 1:
                raise
        except ResponseError as e:
            # 4xx is a bad request: do not retry. Also give up on the
            # last attempt instead of silently returning None.
            if 400 <= e.status_code < 500 or attempt == max_attempts - 1:
                raise
        time.sleep(delay)
        delay *= 2
Distinguish 4xx (client error, do not retry) from 5xx and connection errors (worth retrying). Exponential backoff with jitter prevents thundering herd when an Ollama instance restarts.
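Jitter is the part that is easiest to leave out. The generator below is a sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential ceiling; backoff_delays and its parameters are hypothetical names chosen for illustration.

```python
import random
from typing import Iterator, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one sleep duration per retry using full-jitter exponential backoff."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so clients do not retry in lockstep
        yield rng.uniform(0.0, ceiling)


# A seeded generator makes the sequence reproducible in tests
delays = list(backoff_delays(5, rng=random.Random(42)))
print([round(d, 2) for d in delays])
```

In a retry loop you would call time.sleep(next(delays_iterator)) in place of a fixed doubling delay.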
Connection Pooling
The default ollama.Client already pools HTTP connections via httpx. For high-throughput workloads, share one client across the application — do not instantiate per request.
# Bad: builds a new client (and connection pool) on every call
def chat(msg):
    client = ollama.Client()
    return client.chat(model="llama3.2", messages=[{"role": "user", "content": msg}])

# Good: one shared client for the whole application
_client = ollama.Client()

def chat(msg):
    return _client.chat(model="llama3.2", messages=[{"role": "user", "content": msg}])
For the wider context of running Ollama under heavy traffic, the Ollama production deployment guide covers the server-side configuration.
FastAPI Integration {#fastapi}
FastAPI is the most common pairing with Ollama in Python production stacks.
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama

class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama3.2"

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up the model on startup
    client = ollama.AsyncClient()
    await client.chat(model="llama3.2", messages=[{"role": "user", "content": "ping"}])
    app.state.ollama = client
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def generate():
        async for chunk in await app.state.ollama.chat(
            model=req.model,
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        ):
            yield chunk.message.content
    return StreamingResponse(generate(), media_type="text/plain")
Three things make this production-ready: the lifespan handler warms the model so the first user does not pay the load cost, the AsyncClient is shared in app state, and the response is a true streaming response so users see tokens immediately.
Server-Sent Events
@app.post("/chat/sse")
async def chat_sse(req: ChatRequest):
    async def generate():
        async for chunk in await app.state.ollama.chat(
            model=req.model,
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
        ):
            yield f"data: {chunk.message.content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
SSE is the right protocol for browser-side streaming and integrates cleanly with the Vercel AI SDK or similar frontend libraries.
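One detail to watch with SSE: a blank line ends an event, so a token that itself contains a newline would break the framing if interpolated directly into a data: line. The helper below is a sketch of one way to handle that, by emitting one data: line per line of text, which the event-stream format rejoins with newlines on the client. sse_event is a hypothetical name.

```python
def sse_event(data: str) -> str:
    """Format text as one Server-Sent Event, with one data: line per line of text."""
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


print(repr(sse_event("Hello")))        # 'data: Hello\n\n'
print(repr(sse_event("line1\nline2"))) # 'data: line1\ndata: line2\n\n'
```

In the endpoint above, the generator would yield sse_event(chunk.message.content) instead of building the string inline.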
OpenAI SDK Compatibility {#openai-compat}
If you have existing OpenAI code, you can swap in Ollama with three lines of change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, value is ignored
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
What Works
- Chat completions (sync and streaming)
- Embeddings (client.embeddings.create())
- Basic tool calling
- Most prompt parameters (temperature, top_p, max_tokens)
What Does Not Work
- Assistants API (no equivalent)
- File uploads
- Fine-tuning endpoints
- Custom Modelfile creation (use ollama-python for that)
- Some OpenAI-specific parameters (logprobs, n>1)
When to Use Which
| Scenario | Recommended |
|---|---|
| Migrating from OpenAI to local | OpenAI SDK with Ollama base_url |
| Greenfield local-only app | ollama-python |
| Need custom Modelfiles | ollama-python |
| Hybrid local + cloud routing | OpenAI SDK behind LiteLLM gateway |
For more on the gateway pattern, see the AI gateway guide.
The official Python SDK for Ollama is documented at github.com/ollama/ollama-python and stays current with the Ollama server release cycle.
Common Pitfalls {#pitfalls}
Pitfall 1: Forgetting to Pull the Model
ollama.chat(model="llama3.2", messages=msgs)
# ResponseError: model "llama3.2" not found
The Python client does not auto-pull. Always run ollama pull llama3.2 first, or implement a pull-if-missing helper:
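def ensure_model(name: str):
    # Installed names carry a tag (e.g. "llama3.2:latest"), so add the
    # default tag before comparing
    wanted = name if ":" in name else f"{name}:latest"
    installed = {m.model for m in ollama.list().models}
    if wanted not in installed:
        for progress in ollama.pull(name, stream=True):
            print(progress.status, end="\r")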
Pitfall 2: Mixing Sync and Async Clients
The sync ollama.chat() and the AsyncClient are not interchangeable. Calling sync from inside an async event loop blocks the loop. Always pick one and stick with it per code path.
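When a synchronous call cannot be avoided inside async code (a legacy helper, for instance), pushing it onto a worker thread keeps the event loop responsive. The sketch below uses asyncio.to_thread, which requires Python 3.9 or later; blocking_chat is a hypothetical stand-in for a sync call such as ollama.chat(), and the heartbeat task exists only to show that the loop keeps running.

```python
import asyncio
import time


def blocking_chat(prompt: str) -> str:
    # Stand-in for a synchronous call such as ollama.chat(...)
    time.sleep(0.05)
    return f"reply to {prompt}"


async def main():
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets a turn while the call runs
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker thread
    reply = await asyncio.to_thread(blocking_chat, "ping")
    beat.cancel()
    return reply, ticks


reply, ticks = asyncio.run(main())
print(reply, ticks)
```

Calling blocking_chat("ping") directly inside main() would freeze the heartbeat for the full duration of the call; AsyncClient remains the better choice when you control the code path.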
Pitfall 3: Long Context Without num_ctx
Ollama uses a small default context window unless you override it (2048 tokens in older releases; more recent releases raised the default to 4096). Pass a longer document and the prompt is silently truncated:
response = ollama.chat(
    model="llama3.2",
    messages=msgs,
    options={"num_ctx": 8192},  # explicitly set
)
Pitfall 4: Not Handling Empty Tool Calls
response.message.tool_calls can be None even on a tool-enabled model — the model decides whether to call a tool. Always null-check before iterating.
Pitfall 5: Logging Whole Responses
response.message.content can be megabytes for long generations. Logging the full object on every request fills disks fast. Log a hash and length, not the content.
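A hash-and-length log entry takes only a few lines with hashlib. The sketch below is one possible shape; summarize_for_log and the choice of a 16-character SHA-256 prefix are illustrative assumptions, not a standard.

```python
import hashlib


def summarize_for_log(content: str) -> dict:
    """Return a compact description of a response body that is safe to log."""
    encoded = content.encode("utf-8")
    return {
        "chars": len(content),
        "bytes": len(encoded),
        # A short prefix is enough to correlate identical responses across log lines
        "sha256": hashlib.sha256(encoded).hexdigest()[:16],
    }


summary = summarize_for_log("Tokyo is the capital of Japan.")
print(summary)
```

You would log summarize_for_log(response.message.content) alongside the model name and timing fields, and keep full content only behind a debug flag.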
For broader troubleshooting beyond Python-specific issues, the Ollama troubleshooting guide covers server-side problems.
Final Notes
The Ollama Python API is small enough to learn in an afternoon and capable enough to ship to production. The interesting work is not the API surface — it is what you build on top: streaming UIs that feel instant, retry policies that survive a flaky GPU server, structured-output pipelines that extract data from messy text, embedding stores that turn a folder of PDFs into a searchable knowledge base.
Start with the four-line example. Layer in streaming once the basic call works. Move to AsyncClient when the app gets a real server. Add retries when the first 502 wakes you up. Switch to the OpenAI SDK only if you genuinely need cross-provider portability — most apps do not. By the time you have walked through every section above, you will have a Python integration that is indistinguishable in quality from a paid API client, with the bill set permanently to zero.