
Ollama + LangChain: The Complete Local AI Integration Guide

April 23, 2026
21 min read
LocalAimaster Research Team


LangChain hides a lot of glue code. Ollama hides a lot of GPU plumbing. Plug them together and you get a private, local AI app stack with the same ergonomics most teams know from OpenAI tutorials — but with the model running on hardware you control. The catch is that 80% of LangChain blog posts assume cloud APIs, and the Ollama-specific gotchas (tool calling support varies by model, streaming buffers in chains, embedding model selection matters) are scattered across GitHub issues.

This guide is what I wish existed when I first wired Ollama into a LangChain RAG pipeline two years ago. Every code block has been run on Ollama 0.5.7 with LangChain 0.3.x as of April 2026, on both Python 3.12 and Node 22.

Quick Start: Hello LangChain in 90 Seconds

# Shell
pip install -U langchain langchain-ollama
ollama pull llama3.1:8b

# Python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0)
print(llm.invoke("In one sentence, what is unified memory?").content)

That is the whole hello-world. No API key, no cloud dependency, no per-token billing. The model lives on your disk and answers from your VRAM.

The rest of this article shows what to build after the demo: streaming, RAG, tool calling, agents with LangGraph, structured output, and the production deployment story.

Table of Contents

  1. Setup and Versions
  2. ChatOllama vs OllamaLLM
  3. Streaming Tokens
  4. Prompt Templates and LCEL
  5. Structured Output
  6. Tool Calling
  7. Local RAG Pipeline
  8. Agents with LangGraph
  9. JavaScript and TypeScript Path
  10. Performance Benchmarks
  11. Pitfalls and Production Notes
  12. FAQs

Setup and Versions {#setup}

Pin these versions for the rest of this guide. LangChain 0.3.x changed several APIs relative to 0.2.x, and the Ollama integration finally split out of langchain-community into the standalone langchain-ollama package.

# Python
pip install -U \
  langchain==0.3.18 \
  langchain-ollama==0.2.3 \
  langchain-core==0.3.34 \
  langchain-community==0.3.16 \
  langgraph==0.2.74 \
  pydantic==2.10.4

# Ollama itself
brew install ollama  # or curl install for Linux
ollama --version  # should be 0.5.7+

# Pull models we will use
ollama pull llama3.1:8b           # general chat + tools
ollama pull qwen2.5-coder:7b      # code tasks
ollama pull nomic-embed-text      # RAG embeddings

Verify the connection from Python:

import requests
r = requests.get("http://localhost:11434/api/tags")
print([m["name"] for m in r.json()["models"]])
# ['llama3.1:8b', 'qwen2.5-coder:7b', 'nomic-embed-text:latest']

If you are running Ollama on a different host (Docker, K8s, remote server), set OLLAMA_HOST or pass base_url="http://your-host:11434" to ChatOllama.
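Both options in one quick sketch (the host address here is just a placeholder):

import os
from langchain_ollama import ChatOllama

# Option A: environment variable, picked up by the Ollama client when no
# base_url is passed. Set it before the ChatOllama object is created.
os.environ["OLLAMA_HOST"] = "http://192.168.1.50:11434"  # placeholder address

# Option B: explicit base_url on the LangChain object itself.
remote_llm = ChatOllama(model="llama3.1:8b", base_url="http://192.168.1.50:11434")
print(remote_llm.invoke("Reply with the word pong.").content)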


ChatOllama vs OllamaLLM {#chat-vs-llm}

LangChain ships two Ollama interfaces:

| Class | Use case | Streaming | Tool calling | Structured output |
|---|---|---|---|---|
| ChatOllama | Chat models, agents, RAG | Yes | Yes (compatible models) | Yes |
| OllamaLLM | Base completion models | Yes | No | Limited |

Use ChatOllama for everything modern. OllamaLLM is for legacy completion-style models like llama2-uncensored that do not have a chat template.

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.3,
    num_ctx=8192,           # context window in tokens
    num_predict=512,        # max output tokens
    top_k=40,
    top_p=0.9,
    repeat_penalty=1.1,
    keep_alive="30m",       # how long to keep model in VRAM after last use
    base_url="http://localhost:11434",
)

The keep_alive parameter is the single most impactful flag for app responsiveness. Set it to "24h" if the same model is used by your whole app — the model stays resident, eliminating the 4-8 second cold reload.
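One cheap trick that pairs with a long keep_alive: fire a throwaway request at app startup so the model is already resident before the first real user arrives. A minimal sketch (the prompt content is irrelevant):

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", keep_alive="24h")

# Throwaway warm-up call: loads the weights into VRAM once, so the first real
# request skips the 4-8 second cold load. The response is discarded.
llm.invoke("ping")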


Streaming Tokens {#streaming}

Sync streaming:

for chunk in llm.stream("Write a haiku about local AI"):
    print(chunk.content, end="", flush=True)

Async streaming (use this for web servers — FastAPI, Quart, anything ASGI):

import asyncio

async def main():
    async for chunk in llm.astream("Write a haiku about local AI"):
        print(chunk.content, end="", flush=True)

asyncio.run(main())

Streaming through a chain — this is where most beginners hit a wall. The chain only streams end-to-end if every step supports streaming:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise tech writer."),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()

async def stream_chain():
    async for chunk in chain.astream({"question": "What is RAG?"}):
        print(chunk, end="", flush=True)

StrOutputParser passes chunks through untouched. If you swap it for PydanticOutputParser, the parser buffers the full output before returning — streaming dies. Use PydanticToolsParser or with_structured_output(method="json_schema") instead if you need streaming + structure (more on this below).


Prompt Templates and LCEL {#prompt-templates}

LangChain Expression Language (LCEL) is the pipe operator (|) syntax. It is the recommended way to compose chains in LangChain 0.3+.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

translate_prompt = ChatPromptTemplate.from_template(
    "Translate to {language}: {text}"
)

translate_chain = (
    {"text": RunnablePassthrough(), "language": lambda _: "Spanish"}
    | translate_prompt
    | llm
    | StrOutputParser()
)

print(translate_chain.invoke("Where is the library?"))
# ¿Dónde está la biblioteca?

LCEL gives you batching, async, streaming, and retries for free:

results = await translate_chain.abatch([
    "Where is the library?",
    "What time is it?",
    "I need a coffee.",
])

Ollama serves these in parallel up to OLLAMA_NUM_PARALLEL (default 4). On an RTX 4090 with llama3.1:8b, batch of 4 short translations completes in 1.1 seconds vs 3.8 seconds sequentially.
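The "retries for free" part comes from the Runnable interface. A sketch wrapping the same translate_chain so transient failures (an Ollama restart, a saturated queue) are retried with backoff:

# Retry the whole chain up to 3 attempts with jittered exponential backoff.
resilient_chain = translate_chain.with_retry(
    retry_if_exception_type=(ConnectionError,),  # broaden to taste
    wait_exponential_jitter=True,
    stop_after_attempt=3,
)
print(resilient_chain.invoke("Where is the library?"))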


Structured Output {#structured-output}

Three ways to force JSON output, ranked by reliability on Ollama models.

Method 1: with_structured_output (JSON mode, most reliable)

from pydantic import BaseModel, Field

class Recipe(BaseModel):
    name: str = Field(description="Recipe name")
    prep_time_minutes: int = Field(description="Prep time in minutes")
    ingredients: list[str] = Field(description="List of ingredients")
    steps: list[str] = Field(description="Ordered cooking steps")

structured_llm = llm.with_structured_output(Recipe)
result = structured_llm.invoke("Give me a 15-minute pasta recipe")
print(result.name, result.prep_time_minutes)
print(result.ingredients)

Under the hood this uses Ollama's format: "json" mode (or format: <json_schema> on Ollama 0.5+) which constrains decoding to valid JSON. Way more reliable than prompt-based JSON instructions.
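To see roughly what that looks like at the HTTP layer, here is the equivalent raw call against Ollama's /api/chat endpoint (a sketch; the schema comes from the Recipe model above):

import json
import requests

# Pydantic v2 emits the JSON schema that Ollama 0.5+ accepts in the "format" field.
schema = Recipe.model_json_schema()

r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Give me a 15-minute pasta recipe"}],
    "format": schema,   # or the string "json" for plain JSON mode
    "stream": False,
})
data = json.loads(r.json()["message"]["content"])
print(data["name"])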

Method 2: PydanticOutputParser (no JSON mode)

from langchain_core.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=Recipe)
prompt = ChatPromptTemplate.from_messages([
    ("system", "{format_instructions}"),
    ("human", "{query}"),
]).partial(format_instructions=parser.get_format_instructions())

chain = prompt | llm | parser
result = chain.invoke({"query": "Give me a 15-minute pasta recipe"})

This dumps a JSON schema in the system prompt and asks the model to comply. Works on every model but fails ~15% of the time on smaller models. Use Method 1 if you can.

Method 3: bind_tools with one tool (cleanest for tool-capable models)

llm_with_tools = llm.bind_tools([Recipe])
response = llm_with_tools.invoke("Give me a 15-minute pasta recipe")
recipe = Recipe(**response.tool_calls[0]["args"])

Only works on tool-capable models (llama3.1, qwen2.5, mistral-nemo, command-r). Most reliable when it does work.


Tool Calling {#tool-calling}

Tool calling is the foundation of agents. Bind a Pydantic-decorated function:

from langchain_core.tools import tool

@tool
def get_weather(city: str, country: str = "US") -> str:
    """Get the current weather for a city."""
    # Real implementation would hit OpenWeather, etc.
    return f"It's 72°F and sunny in {city}, {country}."

@tool
def search_web(query: str) -> str:
    """Search the web for a query and return top 3 results."""
    return f"[mock results for '{query}']"

llm_with_tools = ChatOllama(model="llama3.1:8b", temperature=0).bind_tools([
    get_weather, search_web
])

response = llm_with_tools.invoke("What's the weather in Boston?")
print(response.tool_calls)
# [{'name': 'get_weather', 'args': {'city': 'Boston'}, 'id': '...'}]

Then execute the tool calls and feed results back:

from langchain_core.messages import HumanMessage, ToolMessage

messages = [HumanMessage(content="What's the weather in Boston?")]
response = llm_with_tools.invoke(messages)
messages.append(response)

for tc in response.tool_calls:
    tool_result = {"get_weather": get_weather, "search_web": search_web}[tc["name"]].invoke(tc["args"])
    messages.append(ToolMessage(content=str(tool_result), tool_call_id=tc["id"]))

final = llm_with_tools.invoke(messages)
print(final.content)
# "It's currently 72°F and sunny in Boston."

Reliability ranking from my testing on Ollama (April 2026):

| Model | Tool calling reliability | Notes |
|---|---|---|
| qwen2.5:14b | 96% | Best on Ollama for tools |
| llama3.1:70b | 94% | If you have the VRAM |
| llama3.1:8b | 88% | Good default |
| qwen2.5:7b | 85% | Very competitive |
| mistral-nemo | 80% | 12B, occasional schema drift |
| command-r:35b | 78% | Older, still works |
| firefunction-v2 | 91% | Specialized for tool calls |

Test your specific tools — model performance varies significantly by tool schema complexity.
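A quick way to do that is a throwaway smoke test that checks whether obvious prompts route to the expected tool. It reuses llm_with_tools from above; the prompts and expected tools are made-up examples:

# Hypothetical routing check: each prompt should trigger the named tool.
cases = [
    ("What's the weather in Paris?", "get_weather"),
    ("Find recent articles about LangGraph", "search_web"),
]

hits = 0
for prompt_text, expected_tool in cases:
    resp = llm_with_tools.invoke(prompt_text)
    called = [tc["name"] for tc in resp.tool_calls]
    hits += expected_tool in called
    print(f"{prompt_text!r} -> {called}")

print(f"{hits}/{len(cases)} prompts routed to the expected tool")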


Local RAG Pipeline {#rag}

End-to-end private RAG using Ollama for both LLM and embeddings, with Chroma as the vector store:

pip install langchain-chroma chromadb pypdf

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Load
loader = PyPDFDirectoryLoader("./docs")
documents = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Loaded {len(documents)} docs, split into {len(chunks)} chunks")

# 3. Embed and store
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# 4. Retrieve and generate
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOllama(model="llama3.1:8b", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant. Answer the question using ONLY the context below. "
     "If the context does not contain the answer, say 'I don't know.'\n\n"
     "Context:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(f"[source: {d.metadata.get('source')}] {d.page_content}" for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is our company refund policy?"))

This is a complete, private RAG pipeline. Document ingestion, chunking, embedding, retrieval, and generation all run locally. No data leaves the machine.
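On later runs you can skip ingestion entirely and reopen the persisted store. A sketch, assuming the same ./chroma_db directory and the same embedding model used at ingestion time:

# Re-attach to the existing Chroma collection without re-embedding anything.
# The embedding model MUST match the one used during ingestion (see pitfall 3 below).
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})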

For deeper RAG patterns, see our local RAG setup guide which covers reranking, hybrid search, and chunk size tuning.

Embedding model benchmarks (Ollama, April 2026)

| Model | Dimensions | Context | Speed (RTX 4090) | Quality (MTEB avg) |
|---|---|---|---|---|
| nomic-embed-text | 768 | 8192 | 850 chunks/sec | 62.4 |
| mxbai-embed-large | 1024 | 512 | 620 chunks/sec | 64.1 |
| bge-m3 | 1024 | 8192 | 480 chunks/sec | 66.2 |
| snowflake-arctic-embed | 1024 | 512 | 720 chunks/sec | 63.8 |

For most English RAG use cases, nomic-embed-text is the right default — best speed-to-quality ratio with the long context window.


Agents with LangGraph {#langgraph}

For anything stateful — multi-turn agents, human-in-the-loop, persistent memory — use LangGraph instead of plain chains.

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_ollama import ChatOllama

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOllama(model="llama3.1:8b", temperature=0).bind_tools([get_weather, search_web])

def call_model(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def call_tool(state: AgentState):
    last = state["messages"][-1]
    results = []
    for tc in last.tool_calls:
        fn = {"get_weather": get_weather, "search_web": search_web}[tc["name"]]
        result = fn.invoke(tc["args"])
        results.append(ToolMessage(content=str(result), tool_call_id=tc["id"]))
    return {"messages": results}

def should_continue(state: AgentState):
    return "call_tool" if state["messages"][-1].tool_calls else END

graph = StateGraph(AgentState)
graph.add_node("call_model", call_model)
graph.add_node("call_tool", call_tool)
graph.set_entry_point("call_model")
graph.add_conditional_edges("call_model", should_continue)
graph.add_edge("call_tool", "call_model")

agent = graph.compile()

result = agent.invoke({
    "messages": [
        SystemMessage(content="You are a helpful assistant with access to weather and web search."),
        HumanMessage(content="What's the weather in Tokyo, and search for tourist spots there."),
    ]
})

for m in result["messages"]:
    print(f"[{type(m).__name__}] {m.content[:200]}")

This is a real ReAct-style agent in 40 lines. LangGraph handles the loop, the tool routing, and the message accumulation. Add a checkpointer (Postgres, SQLite) and you get persistent agent state across requests.
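A minimal checkpointer sketch using the in-memory saver that ships with LangGraph (swap in the SQLite or Postgres saver for durable storage); the thread_id value is an arbitrary example:

from langgraph.checkpoint.memory import MemorySaver

agent = graph.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "user-42"}}  # one thread per conversation

agent.invoke({"messages": [HumanMessage(content="What's the weather in Tokyo?")]}, config)

# Same thread_id: the second call resumes with the full prior message history.
followup = agent.invoke({"messages": [HumanMessage(content="And in Osaka?")]}, config)
print(followup["messages"][-1].content)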

Average turn latency on RTX 4090 with llama3.1:8b: 1.8-3.5 seconds per loop iteration. Plan on 2-4 iterations per real user query.


JavaScript and TypeScript Path {#javascript}

Same patterns work in TypeScript with @langchain/ollama:

npm install @langchain/ollama @langchain/core @langchain/community

import { ChatOllama, OllamaEmbeddings } from "@langchain/ollama";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnablePassthrough } from "@langchain/core/runnables";

const llm = new ChatOllama({
  model: "llama3.1:8b",
  temperature: 0,
  numPredict: 512,
  baseUrl: "http://localhost:11434",
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a concise assistant."],
  ["human", "{question}"],
]);

const chain = prompt.pipe(llm).pipe(new StringOutputParser());

const stream = await chain.stream({ question: "What is RAG?" });
for await (const chunk of stream) process.stdout.write(chunk);

Tool calling with Zod schemas:

import { z } from "zod";
import { tool } from "@langchain/core/tools";

const getWeather = tool(
  async ({ city }) => `72F and sunny in ${city}`,
  {
    name: "get_weather",
    description: "Get current weather for a city",
    schema: z.object({ city: z.string() }),
  }
);

const llmWithTools = llm.bindTools([getWeather]);
const response = await llmWithTools.invoke("Weather in Boston?");
console.log(response.tool_calls);

Same RAG pattern, same agent pattern, same caveats. The TypeScript ecosystem is about 2-3 months behind Python for new LangChain features but the Ollama integration is feature-complete.

For a Vercel-AI-SDK alternative if you are building Next.js apps, see our Ollama + Vercel AI SDK guide once it ships.


Performance Benchmarks {#benchmarks}

Real numbers from my dev box: RTX 4090 24GB, Ryzen 9 7950X, 64GB DDR5, Ollama 0.5.7.

Single-request latency (cold model load excluded)

| Operation | Model | Latency |
|---|---|---|
| Simple chat (50 tok output) | llama3.1:8b | 0.7 s |
| Simple chat (500 tok output) | llama3.1:8b | 4.8 s |
| Tool call (1 round trip) | llama3.1:8b | 1.1 s |
| RAG query (5 chunks, 200 tok answer) | llama3.1:8b + nomic-embed | 2.3 s |
| Structured output (Pydantic) | llama3.1:8b | 1.4 s |
| LangGraph agent (3 loop iterations) | llama3.1:8b | 6.2 s |
| Embedding (1000 chunks) | nomic-embed-text | 1.2 s |

Throughput (concurrent requests)

| OLLAMA_NUM_PARALLEL | Concurrent reqs | Throughput | TTFB p95 |
|---|---|---|---|
| 1 | 1 | 92 tok/s | 180 ms |
| 4 | 4 | 280 tok/s | 320 ms |
| 8 | 8 | 410 tok/s | 580 ms |
| 16 | 16 | 480 tok/s | 1100 ms |

Sweet spot is OLLAMA_NUM_PARALLEL=4 for most apps. Beyond 8, VRAM contention starts hurting per-request latency more than parallelism helps total throughput.


Pitfalls and Production Notes {#pitfalls}

1. Streaming dies on the wrong parser. PydanticOutputParser buffers. Use with_structured_output() for streaming + structure.

2. Default num_ctx is 2048. Many models support 8k-128k. If you are doing RAG with long contexts, explicitly set num_ctx=8192 (or higher) or your retrieved chunks get truncated silently.

3. Embedding model mismatch between ingestion and retrieval. If you embed with nomic-embed-text and retrieve with mxbai-embed-large, similarity search is meaningless. Pin the embedding model in config.

4. Tool calling on non-tool models. LangChain will accept bind_tools() on any model but the model just ignores it. Always test that response.tool_calls is populated.

5. keep_alive default of 5 minutes. Cold reload kills UX. Set "24h" or "1h" depending on your traffic pattern.

6. Forgetting to handle Ollama unavailable. Wrap calls in try/except for ConnectionError and httpx.RequestError. In production, add a circuit breaker that falls back to a cloud API or returns a degraded response (see the sketch after this list).

7. Token limits in multi-turn agents. LangGraph keeps the full message history by default. After 10 turns with tool calls, you can blow the context window. Use a message trimmer or summarization step.

8. Async-sync mixing. ChatOllama supports both, but mixing them in one chain (chain.astream over a sync invoke) silently falls back to sync, killing concurrency. Be consistent.

9. Embedding cost on re-ingestion. Re-embedding a 100k-document corpus takes hours. Use Chroma's add_documents with deduplication keys and only re-embed changed docs.

10. Logs leak prompts. LangChain's verbose mode prints full prompts and outputs. Disable in production or you will accidentally log PII.
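A bare-bones version of the fallback from pitfall 6. No circuit breaker, just the try/except shape; the degraded message is a placeholder:

import httpx
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0)

def answer(question: str) -> str:
    try:
        return llm.invoke(question).content
    except (ConnectionError, httpx.RequestError):
        # Ollama is down or saturated: degrade gracefully instead of raising a 500.
        # A real deployment would also trip a circuit breaker or fall back to a cloud model.
        return "The local model is temporarily unavailable. Please try again shortly."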

For a deeper dive on tool calling, see Ollama function calling and tool use. For the production deployment side, Ollama in production covers the SSL, auth, and monitoring layer that LangChain calls into.

The official LangChain Ollama integration docs are the authoritative reference for the latest API surface.


Conclusion

LangChain + Ollama is the most underrated stack in local AI right now. You get LangChain's ecosystem — chains, agents, RAG, structured output, evaluations — without the OpenAI bill or the data-leaving-your-network conversation. The integration is mature enough that the same code works locally with Ollama and in the cloud with OpenAI/Anthropic if you swap one import.

The honest tradeoff: latency is 2-4x cloud APIs, tool calling is less reliable on smaller models, and you maintain the inference infrastructure yourself. For internal apps, prototypes, privacy-critical workloads, and offline use cases, that tradeoff is wildly worth it. For consumer-facing products with sub-second SLAs, mix and match — Ollama for the heavy lifting where users will tolerate 2 seconds, cloud API for the snappy interactions.

Start with the Quick Start above, get streaming working, add a Pydantic structured output, and you have a real foundation. From there, the agent and RAG paths open up the same way they do with cloud LLMs — with the bonus that your data and your model both stay home.

