
Ollama + LangChain: The Complete Local AI Integration Guide

April 23, 2026
21 min read
LocalAimaster Research Team


LangChain hides a lot of glue code. Ollama hides a lot of GPU plumbing. Plug them together and you get a private, local AI app stack with the same ergonomics most teams know from OpenAI tutorials — but with the model running on hardware you control. The catch is that 80% of LangChain blog posts assume cloud APIs, and the Ollama-specific gotchas (tool calling support varies by model, streaming buffers in chains, embedding model selection matters) are scattered across GitHub issues.

This guide is what I wish existed when I first wired Ollama into a LangChain RAG pipeline two years ago. Every code block has been run on Ollama 0.5.7 with LangChain 0.3.x as of April 2026, on both Python 3.12 and Node 22.

Quick Start: Hello LangChain in 90 Seconds

# Shell
pip install -U langchain langchain-ollama
ollama pull llama3.1:8b

# Python
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0)
print(llm.invoke("In one sentence, what is unified memory?").content)

That is the whole hello-world. No API key, no cloud dependency, no per-token billing. The model lives on your disk and answers from your VRAM.

The rest of this article shows what to build after the demo: streaming, RAG, tool calling, agents with LangGraph, structured output, and the production deployment story.

Table of Contents

  1. Setup and Versions
  2. ChatOllama vs OllamaLLM
  3. Streaming Tokens
  4. Prompt Templates and LCEL
  5. Structured Output
  6. Tool Calling
  7. Local RAG Pipeline
  8. Agents with LangGraph
  9. JavaScript and TypeScript Path
  10. Performance Benchmarks
  11. Pitfalls and Production Notes
  12. FAQs

Setup and Versions {#setup}

Pin these versions for the rest of this guide. LangChain 0.3.x changed several APIs relative to 0.2.x, and the Ollama integration finally split out of langchain-community into the standalone langchain-ollama package.

# Python
pip install -U \
  langchain==0.3.18 \
  langchain-ollama==0.2.3 \
  langchain-core==0.3.34 \
  langchain-community==0.3.16 \
  langgraph==0.2.74 \
  pydantic==2.10.4

# Ollama itself
brew install ollama  # or curl install for Linux
ollama --version  # should be 0.5.7+

# Pull models we will use
ollama pull llama3.1:8b           # general chat + tools
ollama pull qwen2.5-coder:7b      # code tasks
ollama pull nomic-embed-text      # RAG embeddings

Verify the connection from Python:

import requests
r = requests.get("http://localhost:11434/api/tags")
print([m["name"] for m in r.json()["models"]])
# ['llama3.1:8b', 'qwen2.5-coder:7b', 'nomic-embed-text:latest']

If you are running Ollama on a different host (Docker, K8s, remote server), set OLLAMA_HOST or pass base_url="http://your-host:11434" to ChatOllama.
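Both options in one quick sketch (the host address here is just a placeholder):

import os
from langchain_ollama import ChatOllama

# Option A: environment variable, picked up by the Ollama client when no
# base_url is passed. Set it before the ChatOllama object is created.
os.environ["OLLAMA_HOST"] = "http://192.168.1.50:11434"  # placeholder address

# Option B: explicit base_url on the LangChain object itself.
remote_llm = ChatOllama(model="llama3.1:8b", base_url="http://192.168.1.50:11434")
print(remote_llm.invoke("Reply with the word pong.").content)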


ChatOllama vs OllamaLLM {#chat-vs-llm}

LangChain ships two Ollama interfaces:

| Class | Use case | Streaming | Tool calling | Structured output |
|---|---|---|---|---|
| ChatOllama | Chat models, agents, RAG | Yes | Yes (compatible models) | Yes |
| OllamaLLM | Base completion models | Yes | No | Limited |

Use ChatOllama for everything modern. OllamaLLM is for legacy completion-style models like llama2-uncensored that do not have a chat template.

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:8b",
    temperature=0.3,
    num_ctx=8192,           # context window in tokens
    num_predict=512,        # max output tokens
    top_k=40,
    top_p=0.9,
    repeat_penalty=1.1,
    keep_alive="30m",       # how long to keep model in VRAM after last use
    base_url="http://localhost:11434",
)

The keep_alive parameter is the single most impactful flag for app responsiveness. Set it to "24h" if the same model is used by your whole app — the model stays resident, eliminating the 4-8 second cold reload.
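One cheap trick that pairs with a long keep_alive: fire a throwaway request at app startup so the model is already resident before the first real user arrives. A minimal sketch (the prompt content is irrelevant):

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", keep_alive="24h")

# Throwaway warm-up call: loads the weights into VRAM once, so the first real
# request skips the 4-8 second cold load. The response is discarded.
llm.invoke("ping")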


Streaming Tokens {#streaming}

Sync streaming:

for chunk in llm.stream("Write a haiku about local AI"):
    print(chunk.content, end="", flush=True)

Async streaming (use this for web servers — FastAPI, Quart, anything ASGI):

import asyncio

async def main():
    async for chunk in llm.astream("Write a haiku about local AI"):
        print(chunk.content, end="", flush=True)

asyncio.run(main())

Streaming through a chain — this is where most beginners hit a wall. The chain only streams end-to-end if every step supports streaming:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise tech writer."),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()

async def stream_chain():
    async for chunk in chain.astream({"question": "What is RAG?"}):
        print(chunk, end="", flush=True)

StrOutputParser passes chunks through untouched. If you swap it for PydanticOutputParser, the parser buffers the full output before returning — streaming dies. Use PydanticToolsParser or with_structured_output(method="json_schema") instead if you need streaming + structure (more on this below).


Prompt Templates and LCEL {#prompt-templates}

LangChain Expression Language (LCEL) is the pipe operator (|) syntax. It is the recommended way to compose chains in LangChain 0.3+.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

translate_prompt = ChatPromptTemplate.from_template(
    "Translate to {language}: {text}"
)

translate_chain = (
    {"text": RunnablePassthrough(), "language": lambda _: "Spanish"}
    | translate_prompt
    | llm
    | StrOutputParser()
)

print(translate_chain.invoke("Where is the library?"))
# ¿Dónde está la biblioteca?

LCEL gives you batching, async, streaming, and retries for free:

results = await translate_chain.abatch([
    "Where is the library?",
    "What time is it?",
    "I need a coffee.",
])

Ollama serves these in parallel up to OLLAMA_NUM_PARALLEL (default 4). On an RTX 4090 with llama3.1:8b, batch of 4 short translations completes in 1.1 seconds vs 3.8 seconds sequentially.
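The "retries for free" part comes from the Runnable interface. A sketch wrapping the same translate_chain so transient failures (an Ollama restart, a saturated queue) are retried with backoff:

# Retry the whole chain up to 3 attempts with jittered exponential backoff.
resilient_chain = translate_chain.with_retry(
    retry_if_exception_type=(ConnectionError,),  # broaden to taste
    wait_exponential_jitter=True,
    stop_after_attempt=3,
)
print(resilient_chain.invoke("Where is the library?"))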


Structured Output {#structured-output}

Three ways to force JSON output, ranked by reliability on Ollama models.

Method 1: with_structured_output (JSON mode, most reliable)

from pydantic import BaseModel, Field

class Recipe(BaseModel):
    name: str = Field(description="Recipe name")
    prep_time_minutes: int = Field(description="Prep time in minutes")
    ingredients: list[str] = Field(description="List of ingredients")
    steps: list[str] = Field(description="Ordered cooking steps")

structured_llm = llm.with_structured_output(Recipe)
result = structured_llm.invoke("Give me a 15-minute pasta recipe")
print(result.name, result.prep_time_minutes)
print(result.ingredients)

Under the hood this uses Ollama's format: "json" mode (or format: <json_schema> on Ollama 0.5+) which constrains decoding to valid JSON. Way more reliable than prompt-based JSON instructions.
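To see roughly what that looks like at the HTTP layer, here is the equivalent raw call against Ollama's /api/chat endpoint (a sketch; the schema comes from the Recipe model above):

import json
import requests

# Pydantic v2 emits the JSON schema that Ollama 0.5+ accepts in the "format" field.
schema = Recipe.model_json_schema()

r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Give me a 15-minute pasta recipe"}],
    "format": schema,   # or the string "json" for plain JSON mode
    "stream": False,
})
data = json.loads(r.json()["message"]["content"])
print(data["name"])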

Method 2: PydanticOutputParser (no JSON mode)

from langchain_core.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=Recipe)
prompt = ChatPromptTemplate.from_messages([
    ("system", "{format_instructions}"),
    ("human", "{query}"),
]).partial(format_instructions=parser.get_format_instructions())

chain = prompt | llm | parser
result = chain.invoke({"query": "Give me a 15-minute pasta recipe"})

This dumps a JSON schema in the system prompt and asks the model to comply. Works on every model but fails ~15% of the time on smaller models. Use Method 1 if you can.

Method 3: bind_tools with one tool (cleanest for tool-capable models)

llm_with_tools = llm.bind_tools([Recipe])
response = llm_with_tools.invoke("Give me a 15-minute pasta recipe")
recipe = Recipe(**response.tool_calls[0]["args"])

Only works on tool-capable models (llama3.1, qwen2.5, mistral-nemo, command-r). Most reliable when it does work.


Tool Calling {#tool-calling}

Tool calling is the foundation of agents. Bind a Pydantic-decorated function:

from langchain_core.tools import tool

@tool
def get_weather(city: str, country: str = "US") -> str:
    """Get the current weather for a city."""
    # Real implementation would hit OpenWeather, etc.
    return f"It's 72°F and sunny in {city}, {country}."

@tool
def search_web(query: str) -> str:
    """Search the web for a query and return top 3 results."""
    return f"[mock results for '{query}']"

llm_with_tools = ChatOllama(model="llama3.1:8b", temperature=0).bind_tools([
    get_weather, search_web
])

response = llm_with_tools.invoke("What's the weather in Boston?")
print(response.tool_calls)
# [{'name': 'get_weather', 'args': {'city': 'Boston'}, 'id': '...'}]

Then execute the tool calls and feed results back:

from langchain_core.messages import HumanMessage, ToolMessage

messages = [HumanMessage(content="What's the weather in Boston?")]
response = llm_with_tools.invoke(messages)
messages.append(response)

for tc in response.tool_calls:
    tool_result = {"get_weather": get_weather, "search_web": search_web}[tc["name"]].invoke(tc["args"])
    messages.append(ToolMessage(content=str(tool_result), tool_call_id=tc["id"]))

final = llm_with_tools.invoke(messages)
print(final.content)
# "It's currently 72°F and sunny in Boston."

Reliability ranking from my testing on Ollama (April 2026):

| Model | Tool calling reliability | Notes |
|---|---|---|
| qwen2.5:14b | 96% | Best on Ollama for tools |
| llama3.1:70b | 94% | If you have the VRAM |
| llama3.1:8b | 88% | Good default |
| qwen2.5:7b | 85% | Very competitive |
| mistral-nemo | 80% | 12B, occasional schema drift |
| command-r:35b | 78% | Older, still works |
| firefunction-v2 | 91% | Specialized for tool calls |

Test your specific tools — model performance varies significantly by tool schema complexity.
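A quick way to do that is a throwaway smoke test that checks whether obvious prompts route to the expected tool. It reuses llm_with_tools from above; the prompts and expected tools are made-up examples:

# Hypothetical routing check: each prompt should trigger the named tool.
cases = [
    ("What's the weather in Paris?", "get_weather"),
    ("Find recent articles about LangGraph", "search_web"),
]

hits = 0
for prompt_text, expected_tool in cases:
    resp = llm_with_tools.invoke(prompt_text)
    called = [tc["name"] for tc in resp.tool_calls]
    hits += expected_tool in called
    print(f"{prompt_text!r} -> {called}")

print(f"{hits}/{len(cases)} prompts routed to the expected tool")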


Local RAG Pipeline {#rag}

End-to-end private RAG using Ollama for both LLM and embeddings, with Chroma as the vector store:

pip install langchain-chroma chromadb pypdf

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Load
loader = PyPDFDirectoryLoader("./docs")
documents = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Loaded {len(documents)} docs, split into {len(chunks)} chunks")

# 3. Embed and store
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# 4. Retrieve and generate
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOllama(model="llama3.1:8b", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant. Answer the question using ONLY the context below. "
     "If the context does not contain the answer, say 'I don't know.'\n\n"
     "Context:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(f"[source: {d.metadata.get('source')}] {d.page_content}" for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is our company refund policy?"))

This is a complete, private RAG pipeline. Document ingestion, chunking, embedding, retrieval, and generation all run locally. No data leaves the machine.
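On later runs you can skip ingestion entirely and reopen the persisted store. A sketch, assuming the same ./chroma_db directory and the same embedding model used at ingestion time:

# Re-attach to the existing Chroma collection without re-embedding anything.
# The embedding model MUST match the one used during ingestion (see pitfall 3 below).
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})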

For deeper RAG patterns, see our local RAG setup guide which covers reranking, hybrid search, and chunk size tuning.

Embedding model benchmarks (Ollama, April 2026)

| Model | Dimensions | Context | Speed (RTX 4090) | Quality (MTEB avg) |
|---|---|---|---|---|
| nomic-embed-text | 768 | 8192 | 850 chunks/sec | 62.4 |
| mxbai-embed-large | 1024 | 512 | 620 chunks/sec | 64.1 |
| bge-m3 | 1024 | 8192 | 480 chunks/sec | 66.2 |
| snowflake-arctic-embed | 1024 | 512 | 720 chunks/sec | 63.8 |

For most English RAG use cases, nomic-embed-text is the right default — best speed-to-quality ratio with the long context window.


Agents with LangGraph {#langgraph}

For anything stateful — multi-turn agents, human-in-the-loop, persistent memory — use LangGraph instead of plain chains.

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_ollama import ChatOllama

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOllama(model="llama3.1:8b", temperature=0).bind_tools([get_weather, search_web])

def call_model(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def call_tool(state: AgentState):
    last = state["messages"][-1]
    results = []
    for tc in last.tool_calls:
        fn = {"get_weather": get_weather, "search_web": search_web}[tc["name"]]
        result = fn.invoke(tc["args"])
        results.append(ToolMessage(content=str(result), tool_call_id=tc["id"]))
    return {"messages": results}

def should_continue(state: AgentState):
    return "call_tool" if state["messages"][-1].tool_calls else END

graph = StateGraph(AgentState)
graph.add_node("call_model", call_model)
graph.add_node("call_tool", call_tool)
graph.set_entry_point("call_model")
graph.add_conditional_edges("call_model", should_continue)
graph.add_edge("call_tool", "call_model")

agent = graph.compile()

result = agent.invoke({
    "messages": [
        SystemMessage(content="You are a helpful assistant with access to weather and web search."),
        HumanMessage(content="What's the weather in Tokyo, and search for tourist spots there."),
    ]
})

for m in result["messages"]:
    print(f"[{type(m).__name__}] {m.content[:200]}")

This is a real ReAct-style agent in 40 lines. LangGraph handles the loop, the tool routing, and the message accumulation. Add a checkpointer (Postgres, SQLite) and you get persistent agent state across requests.
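A minimal checkpointer sketch using the in-memory saver that ships with LangGraph (swap in the SQLite or Postgres saver for durable storage); the thread_id value is an arbitrary example:

from langgraph.checkpoint.memory import MemorySaver

agent = graph.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "user-42"}}  # one thread per conversation

agent.invoke({"messages": [HumanMessage(content="What's the weather in Tokyo?")]}, config)

# Same thread_id: the second call resumes with the full prior message history.
followup = agent.invoke({"messages": [HumanMessage(content="And in Osaka?")]}, config)
print(followup["messages"][-1].content)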

Average turn latency on RTX 4090 with llama3.1:8b: 1.8-3.5 seconds per loop iteration. Plan on 2-4 iterations per real user query.


JavaScript and TypeScript Path {#javascript}

Same patterns work in TypeScript with @langchain/ollama:

npm install @langchain/ollama @langchain/core @langchain/community

import { ChatOllama, OllamaEmbeddings } from "@langchain/ollama";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnablePassthrough } from "@langchain/core/runnables";

const llm = new ChatOllama({
  model: "llama3.1:8b",
  temperature: 0,
  numPredict: 512,
  baseUrl: "http://localhost:11434",
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a concise assistant."],
  ["human", "{question}"],
]);

const chain = prompt.pipe(llm).pipe(new StringOutputParser());

const stream = await chain.stream({ question: "What is RAG?" });
for await (const chunk of stream) process.stdout.write(chunk);

Tool calling with Zod schemas:

import { z } from "zod";
import { tool } from "@langchain/core/tools";

const getWeather = tool(
  async ({ city }) => `72F and sunny in ${city}`,
  {
    name: "get_weather",
    description: "Get current weather for a city",
    schema: z.object({ city: z.string() }),
  }
);

const llmWithTools = llm.bindTools([getWeather]);
const response = await llmWithTools.invoke("Weather in Boston?");
console.log(response.tool_calls);

Same RAG pattern, same agent pattern, same caveats. The TypeScript ecosystem is about 2-3 months behind Python for new LangChain features but the Ollama integration is feature-complete.

For a Vercel-AI-SDK alternative if you are building Next.js apps, see our Ollama + Vercel AI SDK guide once it ships.


Performance Benchmarks {#benchmarks}

Real numbers from my dev box: RTX 4090 24GB, Ryzen 9 7950X, 64GB DDR5, Ollama 0.5.7.

Single-request latency (cold model load excluded)

| Operation | Model | Latency |
|---|---|---|
| Simple chat (50 tok output) | llama3.1:8b | 0.7 s |
| Simple chat (500 tok output) | llama3.1:8b | 4.8 s |
| Tool call (1 round trip) | llama3.1:8b | 1.1 s |
| RAG query (5 chunks, 200 tok answer) | llama3.1:8b + nomic-embed | 2.3 s |
| Structured output (Pydantic) | llama3.1:8b | 1.4 s |
| LangGraph agent (3 loop iterations) | llama3.1:8b | 6.2 s |
| Embedding (1000 chunks) | nomic-embed-text | 1.2 s |

Throughput (concurrent requests)

| OLLAMA_NUM_PARALLEL | Concurrent reqs | Throughput | TTFB p95 |
|---|---|---|---|
| 1 | 1 | 92 tok/s | 180 ms |
| 4 | 4 | 280 tok/s | 320 ms |
| 8 | 8 | 410 tok/s | 580 ms |
| 16 | 16 | 480 tok/s | 1100 ms |

Sweet spot is OLLAMA_NUM_PARALLEL=4 for most apps. Beyond 8, VRAM contention starts hurting per-request latency more than parallelism helps total throughput.


Pitfalls and Production Notes {#pitfalls}

1. Streaming dies on the wrong parser. PydanticOutputParser buffers. Use with_structured_output() for streaming + structure.

2. Default num_ctx is 2048. Many models support 8k-128k. If you are doing RAG with long contexts, explicitly set num_ctx=8192 (or higher) or your retrieved chunks get truncated silently.

3. Embedding model mismatch between ingestion and retrieval. If you embed with nomic-embed-text and retrieve with mxbai-embed-large, similarity search is meaningless. Pin the embedding model in config.

4. Tool calling on non-tool models. LangChain will accept bind_tools() on any model but the model just ignores it. Always test that response.tool_calls is populated.

5. keep_alive default of 5 minutes. Cold reload kills UX. Set "24h" or "1h" depending on your traffic pattern.

6. Forgetting to handle Ollama unavailable. Wrap calls in try/except for ConnectionError and httpx.RequestError. In production, add a circuit breaker that falls back to a cloud API or returns a degraded response (see the sketch after this list).

7. Token limits in multi-turn agents. LangGraph keeps the full message history by default. After 10 turns with tool calls, you can blow the context window. Use a message trimmer or summarization step.

8. Async-sync mixing. ChatOllama supports both, but mixing them in one chain (chain.astream over a sync invoke) silently falls back to sync, killing concurrency. Be consistent.

9. Embedding cost on re-ingestion. Re-embedding a 100k-document corpus takes hours. Use Chroma's add_documents with deduplication keys and only re-embed changed docs.

10. Logs leak prompts. LangChain's verbose mode prints full prompts and outputs. Disable in production or you will accidentally log PII.
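A bare-bones version of the fallback from pitfall 6. No circuit breaker, just the try/except shape; the degraded message is a placeholder:

import httpx
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0)

def answer(question: str) -> str:
    try:
        return llm.invoke(question).content
    except (ConnectionError, httpx.RequestError):
        # Ollama is down or saturated: degrade gracefully instead of raising a 500.
        # A real deployment would also trip a circuit breaker or fall back to a cloud model.
        return "The local model is temporarily unavailable. Please try again shortly."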

For a deeper dive on tool calling, see Ollama function calling and tool use. For the production deployment side, Ollama in production covers the SSL, auth, and monitoring layer that LangChain calls into.

The official LangChain Ollama integration docs are the authoritative reference for the latest API surface.


Conclusion

LangChain + Ollama is the most underrated stack in local AI right now. You get LangChain's ecosystem — chains, agents, RAG, structured output, evaluations — without the OpenAI bill or the data-leaving-your-network conversation. The integration is mature enough that the same code works locally with Ollama and in the cloud with OpenAI/Anthropic if you swap one import.

The honest tradeoff: latency is 2-4x cloud APIs, tool calling is less reliable on smaller models, and you maintain the inference infrastructure yourself. For internal apps, prototypes, privacy-critical workloads, and offline use cases, that tradeoff is wildly worth it. For consumer-facing products with sub-second SLAs, mix and match — Ollama for the heavy lifting where users will tolerate 2 seconds, cloud API for the snappy interactions.

Start with the Quick Start above, get streaming working, add a Pydantic structured output, and you have a real foundation. From there, the agent and RAG paths open up the same way they do with cloud LLMs — with the bonus that your data and your model both stay home.

