Ollama with JavaScript and TypeScript: Build a Local AI App
Published on April 23, 2026 • 18 min read
I have shipped two production apps in 2026 that use Ollama as their LLM layer: an internal documentation chat for a 40-person engineering team, and a personal "research notebook" that summarizes papers and saves to Obsidian. Both are TypeScript. Both run private. Both replaced a paid OpenAI bill that would have hit $300/month at our usage volume.
The good news for JavaScript developers: Ollama's API is genuinely easy to use from Node and the browser. The official ollama package is small (no Python ceremony), the streaming patterns map cleanly to fetch+ReadableStream, and the Vercel AI SDK has first-class Ollama support since version 4.
The not-so-good news: most tutorials show you a hello-world chat completion and stop. They skip streaming, error handling, model warm-up, structured outputs, and the deployment patterns that turn a demo into something you can actually ship.
This guide is the production playbook. By the end, you will have a streaming Next.js chat app, a structured-output extractor, and a tool-calling agent — all running locally, all written in TypeScript.
Quick Start: First Token in 60 Seconds {#quick-start}
# 1. Install Ollama and pull a model
brew install ollama
ollama pull llama3.2:8b
# 2. Create a Node project
mkdir local-ai-app && cd local-ai-app
npm init -y
npm pkg set type=module   # ESM, so top-level await in hello.ts works
npm install ollama
npm install --save-dev typescript tsx @types/node
npx tsc --init
// hello.ts
import ollama from "ollama";
const res = await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "Write a haiku about TypeScript." }],
});
console.log(res.message.content);
npx tsx hello.ts
# Curly braces wrap
# Types catch my dumb mistakes
# Build pipeline hums
That's a complete local AI app: 12 lines of code, no API keys, no cloud cost, no data leaving your machine. The rest of this guide is what you do after this.
Why Ollama From JavaScript {#why-ollama-js}
Three reasons:
- Surface area matches your stack. If you ship Node services, Next.js apps, or browser tools, you do not want a Python sidecar. The Ollama JS SDK keeps your stack monolingual.
- Streaming maps to web primitives. Ollama's streaming output is a simple async iterable, which converts to a Web ReadableStream in three lines and feeds Next.js, Remix, or any browser fetch consumer.
- No CORS pain on server routes. Browsers block direct calls to localhost:11434 from a different origin. Server routes solve this cleanly. Next.js, Remix, and Express are all easy hosts.
The SDK is the official ollama-js package. It works in Node 18+, Bun 1+, and modern browsers when paired with a server proxy.
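To ground the second bullet, here is what that conversion looks like. A minimal sketch: toReadableStream is just an illustrative helper name, and the Next.js route later in this guide uses the same pattern inline.

// stream-to-web.ts (illustrative)
import ollama from "ollama";

// Wrap Ollama's async iterable of chat chunks in a Web ReadableStream of UTF-8 bytes.
function toReadableStream(chunks: AsyncIterable<{ message: { content: string } }>) {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      for await (const part of chunks) {
        controller.enqueue(encoder.encode(part.message.content));
      }
      controller.close();
    },
  });
}

// Usage: hand the stream straight to a Response in any web framework.
const chunks = await ollama.chat({
  model: "llama3.2:8b",
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});
const body = toReadableStream(chunks);
// return new Response(body, { headers: { "Content-Type": "text/plain; charset=utf-8" } });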
SDK Basics {#sdk-basics}
The four functions you will use 95% of the time:
import ollama from "ollama";
// 1. Single-turn chat
const r1 = await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "Hello" }],
});
// 2. Streaming chat
const stream = await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "Write a paragraph about Mars." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.message.content);
}
// 3. Embeddings (for RAG, search)
const emb = await ollama.embeddings({
model: "nomic-embed-text",
prompt: "The cat sat on the mat",
});
console.log(emb.embedding.length); // 768
// 4. Generate (raw completion without chat formatting — useful for one-off prompts)
const r4 = await ollama.generate({
model: "llama3.2:8b",
prompt: "Capital of Australia is",
});
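If you have not used embeddings before, here is a hedged sketch of how the embeddings call above powers a toy similarity search. The cosine helper and the document list are illustrative, not part of the SDK.

// embed-search.ts (illustrative)
import ollama from "ollama";

const docs = [
  "Cats sleep for most of the day",
  "TypeScript adds static types to JavaScript",
  "Mars is the fourth planet from the sun",
];

// Plain cosine similarity between two vectors.
function cosine(a: number[], b: number[]) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Embed the documents once, then rank them against a query embedding.
const docVecs = await Promise.all(
  docs.map((d) => ollama.embeddings({ model: "nomic-embed-text", prompt: d }))
);
const query = await ollama.embeddings({ model: "nomic-embed-text", prompt: "statically typed languages" });

const ranked = docs
  .map((text, i) => ({ text, score: cosine(query.embedding, docVecs[i].embedding) }))
  .sort((a, b) => b.score - a.score);
console.log(ranked[0]); // the TypeScript document should rank first

Swap the in-memory array for a vector store once you have more than a few hundred documents; the companion RAG guide linked at the end covers that.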
For non-default Ollama hosts (production, Docker, remote), instantiate a client:
import { Ollama } from "ollama";
const client = new Ollama({ host: "http://10.0.0.50:11434" });
const r = await client.chat({ model: "llama3.2:8b", messages: [...] });
Build a Next.js Chat App {#nextjs-app}
A working streaming chat app in Next.js App Router takes about 80 lines.
1. Project setup
npx create-next-app@latest local-chat --typescript --app --tailwind
cd local-chat
npm install ollama
2. Streaming server route
// app/api/chat/route.ts
import { Ollama } from "ollama";
import { NextRequest } from "next/server";
const ollama = new Ollama({ host: process.env.OLLAMA_HOST || "http://127.0.0.1:11434" });
export const runtime = "nodejs";
export async function POST(req: NextRequest) {
const { messages, model = "llama3.2:8b" } = await req.json();
const response = await ollama.chat({
model,
messages,
stream: true,
options: { temperature: 0.4, num_ctx: 4096 },
});
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
for await (const part of response) {
controller.enqueue(encoder.encode(part.message.content));
}
} catch (err: any) {
controller.enqueue(encoder.encode(`\n[error: ${err.message}]`));
} finally {
controller.close();
}
},
});
return new Response(stream, {
headers: { "Content-Type": "text/plain; charset=utf-8" },
});
}
3. Client component
// app/page.tsx
"use client";
import { useState } from "react";
type Msg = { role: "user" | "assistant"; content: string };
export default function Chat() {
const [messages, setMessages] = useState<Msg[]>([]);
const [input, setInput] = useState("");
const [pending, setPending] = useState("");
const [busy, setBusy] = useState(false);
async function send() {
if (!input.trim() || busy) return;
const userMsg: Msg = { role: "user", content: input };
const next = [...messages, userMsg];
setMessages(next);
setInput("");
setBusy(true);
setPending("");
const res = await fetch("/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages: next }),
});
const reader = res.body!.getReader();
const dec = new TextDecoder();
let acc = "";
while (true) {
const { value, done } = await reader.read();
if (done) break;
acc += dec.decode(value);
setPending(acc);
}
setMessages([...next, { role: "assistant", content: acc }]);
setPending("");
setBusy(false);
}
return (
<main className="max-w-2xl mx-auto p-6">
<h1 className="text-2xl font-bold mb-4">Local Chat (Ollama)</h1>
<div className="space-y-3 mb-4">
{messages.map((m, i) => (
<div key={i} className={m.role === "user" ? "text-blue-600" : "text-gray-800"}>
<strong>{m.role}: </strong>{m.content}
</div>
))}
{pending && <div className="text-gray-500"><strong>assistant: </strong>{pending}</div>}
</div>
<div className="flex gap-2">
<input
className="flex-1 border rounded p-2"
value={input}
onChange={e => setInput(e.target.value)}
onKeyDown={e => e.key === "Enter" && send()}
disabled={busy}
placeholder="Ask anything..."
/>
<button onClick={send} disabled={busy} className="px-4 py-2 bg-black text-white rounded">
Send
</button>
</div>
</main>
);
}
npm run dev
# Open http://localhost:3000 — fully streaming local chat
That is a complete, private chat app. No environment variables, no third-party API, no data leaving the box.
Add the Vercel AI SDK {#vercel-ai-sdk}
For richer UI primitives — useChat, message lists, tool-call rendering — use the Vercel AI SDK.
npm install ai @ai-sdk/react ollama-ai-provider
// app/api/chat/route.ts (Vercel AI SDK version)
import { streamText } from "ai";
import { createOllama } from "ollama-ai-provider";
const ollama = createOllama({
baseURL: (process.env.OLLAMA_HOST ?? "http://127.0.0.1:11434") + "/api",
});
export async function POST(req: Request) {
const { messages } = await req.json();
const result = streamText({
model: ollama("llama3.2:8b"),
messages,
temperature: 0.4,
});
return result.toDataStreamResponse();
}
// app/page.tsx (Vercel AI SDK version)
"use client";
import { useChat } from "@ai-sdk/react";
export default function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat();
return (
<main className="max-w-2xl mx-auto p-6">
{messages.map(m => (
<div key={m.id}>
<strong>{m.role}: </strong>{m.content}
</div>
))}
<form onSubmit={handleSubmit} className="flex gap-2 mt-4">
<input value={input} onChange={handleInputChange} className="flex-1 border rounded p-2" />
<button disabled={isLoading} className="px-4 py-2 bg-black text-white rounded">Send</button>
</form>
</main>
);
}
The Vercel AI SDK handles streaming, message state, abort signals, and partial tool-call rendering. For non-trivial chat UIs, it saves a couple of weekends of plumbing.
Structured Outputs With Zod {#structured-outputs}
For data extraction tasks, you want JSON, not prose. Combine Ollama's format: "json" with Zod for runtime validation.
// extract.ts
import ollama from "ollama";
import { z } from "zod";
const PersonSchema = z.object({
name: z.string(),
email: z.string().email().nullable(),
role: z.enum(["engineer", "designer", "manager", "other"]),
skills: z.array(z.string()).max(10),
});
async function extractPerson(text: string) {
const res = await ollama.chat({
model: "llama3.2:8b",
messages: [
{
role: "system",
content:
"Extract person info from the user's text. Return JSON matching this shape: " +
'{ name: string, email: string|null, role: "engineer"|"designer"|"manager"|"other", skills: string[] }. ' +
"If a field is unknown, use null or an empty array.",
},
{ role: "user", content: text },
],
format: "json",
options: { temperature: 0.1 },
});
const raw = JSON.parse(res.message.content);
return PersonSchema.parse(raw); // throws if model returned bad shape
}
const out = await extractPerson(
"Maya Patel - senior react developer, knows TypeScript, Next.js, GraphQL. maya@example.com"
);
console.log(out);
Zod's .parse() rejects malformed responses. Catch the ZodError, inspect the issue, and you can retry with a corrective prompt.
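Here is a hedged sketch of that retry loop, built on the extractPerson function above; the corrective message wording and the single-retry budget are illustrative choices.

// extract-retry.ts (illustrative, continues extract.ts above)
import { ZodError } from "zod";

async function extractPersonWithRetry(text: string, retries = 1) {
  let prompt = text;
  for (;;) {
    try {
      return await extractPerson(prompt);
    } catch (err) {
      if (!(err instanceof ZodError) || retries-- <= 0) throw err;
      // Feed the validation issues back so the model can correct its own output.
      const issues = err.issues.map((i) => `${i.path.join(".")}: ${i.message}`).join("; ");
      prompt = `${text}\n\nYour previous JSON was invalid (${issues}). Return corrected JSON only.`;
    }
  }
}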
For deeper structured-output patterns and the strict tool-calling alternative, see our Ollama function calling guide.
Tool-Calling Agent in TypeScript {#tool-agent}
Function calling works the same way it does in Python — the JSON Schema is identical:
// agent.ts
import ollama from "ollama";
const tools = [
{
type: "function" as const,
function: {
name: "get_time",
description: "Get current ISO timestamp.",
parameters: { type: "object", properties: {} },
},
},
{
type: "function" as const,
function: {
name: "search_repo",
description: "Search code in the user's repository.",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search term." },
},
required: ["query"],
},
},
},
];
const TOOLS: Record<string, (args: any) => Promise<string>> = {
get_time: async () => JSON.stringify({ now: new Date().toISOString() }),
search_repo: async ({ query }) =>
JSON.stringify([{ file: "src/index.ts", line: 42, snippet: `// match for ${query}` }]),
};
async function runAgent(question: string) {
const messages: any[] = [{ role: "user", content: question }];
for (let i = 0; i < 6; i++) {
const res = await ollama.chat({ model: "qwen2.5:7b", messages, tools });
messages.push(res.message);
const calls = res.message.tool_calls ?? [];
if (calls.length === 0) return res.message.content;
for (const c of calls) {
const fn = TOOLS[c.function.name];
const result = fn ? await fn(c.function.arguments) : JSON.stringify({ error: "unknown tool" });
messages.push({ role: "tool", name: c.function.name, content: result });
}
}
return "Hit max turns.";
}
console.log(await runAgent("What time is it, and find any results matching 'logger' in the repo?"));
The pattern is the same as in Python: bounded loop, tool registry dispatch, structured error responses.
Bun and Edge Runtimes {#bun-edge}
Bun 1.1+ supports the Ollama SDK natively. No extra configuration:
bun add ollama
bun run hello.ts
For Edge runtimes (Cloudflare Workers, Vercel Edge Functions), you cannot bundle the Ollama SDK directly — the SDK uses Node-specific APIs. Two options:
- Use the Node runtime in Next.js (export const runtime = "nodejs"). Simplest, and almost always the right answer for Ollama because Ollama itself is on a server you control.
- Call the Ollama HTTP API directly with fetch from Edge. Works, but you give up the SDK's typing.
// edge-route.ts
export const runtime = "edge";
export async function POST(req: Request) {
const { messages } = await req.json();
// Ollama streams newline-delimited JSON objects, not SSE
const r = await fetch("https://your-ollama-host/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ model: "llama3.2:8b", messages, stream: true }),
});
return new Response(r.body, { headers: { "Content-Type": "application/x-ndjson" } });
}
For most apps, just use the Node runtime. Ollama is local — you do not need Edge's geographic distribution.
Express / Fastify Patterns {#express-fastify}
For non-Next.js Node services, the patterns are nearly identical:
// express-server.ts
import express from "express";
import ollama from "ollama";
const app = express();
app.use(express.json());
app.post("/api/chat", async (req, res) => {
res.setHeader("Content-Type", "text/plain; charset=utf-8");
const stream = await ollama.chat({
model: "llama3.2:8b",
messages: req.body.messages,
stream: true,
});
for await (const part of stream) {
res.write(part.message.content);
}
res.end();
});
app.listen(8080);
The Ollama SDK's async iterable plays nicely with any streaming HTTP server.
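The heading promises Fastify too, so here is a sketch of the equivalent route, assuming Fastify 4. Readable.from turns the SDK's async iterable into a Node stream that Fastify can send directly.

// fastify-server.ts (illustrative)
import Fastify from "fastify";
import ollama from "ollama";
import { Readable } from "node:stream";

const app = Fastify();

app.post("/api/chat", async (req, reply) => {
  const { messages } = req.body as { messages: { role: string; content: string }[] };
  const stream = await ollama.chat({ model: "llama3.2:8b", messages, stream: true });
  // Adapt the async iterable of chunks into a Node Readable of plain text.
  const text = Readable.from(
    (async function* () {
      for await (const part of stream) yield part.message.content;
    })()
  );
  return reply.type("text/plain; charset=utf-8").send(text);
});

await app.listen({ port: 8080 });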
Performance Patterns {#performance}
1. Warm the model. Cold-start of an 8B model takes 2-4 seconds on Apple Silicon, longer on disk-bound systems. On boot, fire a no-op call so the model is in memory:
async function warm() {
await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "warmup" }],
options: { num_predict: 1 },
});
}
2. Tune keep_alive. Ollama unloads models after 5 minutes by default. For chat apps, set keep_alive: "30m" per request to keep the model warm (see the sketch after this list).
3. Right-size context. num_ctx: 4096 is fast. num_ctx: 32768 is much slower and rarely needed for chat. Choose per workload.
4. Streaming over single response. Even if you do not show partial output, streaming lets you abort early on user navigation. Use AbortController.
5. Concurrent requests. Ollama 0.4+ handles ~3-5 concurrent inference requests on consumer hardware before slowing significantly. For higher concurrency, scale to multiple Ollama hosts behind a load balancer.
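A sketch combining tips 2 and 4: keep the model resident with keep_alive and stop consuming the stream when the caller aborts. The chatWithAbort name and the break-on-abort pattern are illustrative; keep_alive is a documented request field.

// warm-and-abort.ts (illustrative)
import ollama from "ollama";

async function chatWithAbort(messages: { role: string; content: string }[], signal: AbortSignal) {
  const stream = await ollama.chat({
    model: "llama3.2:8b",
    messages,
    stream: true,
    keep_alive: "30m",          // keep the model loaded between requests
    options: { num_ctx: 4096 }, // right-sized context (tip 3)
  });
  let out = "";
  for await (const part of stream) {
    if (signal.aborted) break;  // stop reading as soon as the client goes away
    out += part.message.content;
  }
  return out;
}

// Usage:
// const ac = new AbortController();
// const reply = chatWithAbort(messages, ac.signal);
// ac.abort(); // e.g. on route change or component unmount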
For deeper production scaling, see Ollama production deployment.
Error Handling {#error-handling}
Three failure modes you will see in production:
1. Ollama not running.
try {
await ollama.chat({ ... });
} catch (err: any) {
if (err.code === "ECONNREFUSED" || err.cause?.code === "ECONNREFUSED") {
return new Response("Local AI is not running. Start with 'ollama serve'.", { status: 503 });
}
throw err;
}
2. Model not pulled.
// Detect "model not found" and auto-pull
catch (err: any) {
if (err.message?.includes("model") && err.message?.includes("not found")) {
await ollama.pull({ model: "llama3.2:8b" });
return retryOnce();
}
throw err;
}
3. OOM on big context. Ollama returns a generic error. Catch it, halve num_ctx, retry once.
A robust wrapper:
async function safeChat(req: any, retries = 1): Promise<any> {
try {
return await ollama.chat(req);
} catch (err: any) {
if (retries > 0 && err.message?.includes("memory")) {
const halved = { ...req, options: { ...req.options, num_ctx: Math.max(1024, (req.options?.num_ctx ?? 4096) / 2) } };
return safeChat(halved, retries - 1);
}
throw err;
}
}
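A boot-time health check catches failure modes 1 and 2 before any user hits them. A sketch: ensureReady is an illustrative name, and the exact-match comparison assumes you pin full model tags.

// health-check.ts (illustrative, run once at server startup)
import ollama from "ollama";

async function ensureReady(model = "llama3.2:8b") {
  let list;
  try {
    list = await ollama.list(); // fails fast if the daemon is not running
  } catch {
    throw new Error("Ollama is not reachable. Start it with 'ollama serve'.");
  }
  if (!list.models.some((m) => m.name === model)) {
    console.log(`Pulling ${model}...`);
    await ollama.pull({ model }); // one-time download on first boot
  }
}

await ensureReady();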
Deployment Patterns {#deployment}
Three common shapes:
Local-only desktop tool. Ship as an Electron or Tauri app. Bundle nothing — assume Ollama runs locally. The simplest case.
Self-hosted on your own server. Run Ollama on a small VPS (Hetzner, Linode) or a home server. Put it behind Nginx with TLS. Restrict access by IP allowlist or token.
server {
  listen 443 ssl http2;
  server_name ai.example.com;
  # ssl_certificate and ssl_certificate_key go here; nginx will not start a TLS server without them
  location /api/ {
    proxy_pass http://127.0.0.1:11434;
    proxy_buffering off;
    proxy_read_timeout 600s;
    if ($http_authorization != "Bearer YOUR_TOKEN") { return 401; }
  }
}
Hybrid. Ollama for daily traffic, OpenAI/Anthropic API as fallback when the user wants a frontier model. Use a feature flag in your route to switch backends.
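A sketch of that switch, reusing the Vercel AI SDK provider from earlier. The frontier flag and the OpenAI fallback are illustrative assumptions, not part of either SDK.

// app/api/chat/route.ts (hybrid sketch)
import { streamText } from "ai";
import { createOllama } from "ollama-ai-provider";
import { openai } from "@ai-sdk/openai"; // npm install @ai-sdk/openai

const local = createOllama({
  baseURL: (process.env.OLLAMA_HOST ?? "http://127.0.0.1:11434") + "/api",
});

export async function POST(req: Request) {
  const { messages, frontier = false } = await req.json();
  // Feature flag: local model by default, cloud frontier model on explicit opt-in.
  const model =
    frontier && process.env.OPENAI_API_KEY ? openai("gpt-4o") : local("llama3.2:8b");
  const result = streamText({ model, messages, temperature: 0.4 });
  return result.toDataStreamResponse();
}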
The full hardening checklist (TLS, auth, rate limits, monitoring) is in our Ollama production deployment guide.
Pitfalls and Gotchas {#pitfalls}
1. fetch from the browser to localhost:11434 is blocked by CORS. Always proxy through a server route. Do not try to "fix" Ollama's CORS — it is intentional.
2. The SDK's streaming async iterable cannot be replayed. If you want to log responses while streaming to clients, tee the stream or accumulate into a string as you forward chunks (see the sketch after this list).
3. Model names are case-sensitive and tag-strict. llama3.2:8b and llama3.2 are different aliases, sometimes resolving to different files. Pin exact tags in production.
4. ollama-ai-provider lags behind upstream Ollama by 1-2 weeks on new features. For bleeding-edge features (new tool fields, JSON mode tweaks), fall back to the official SDK.
5. Long context = slow first token. Users notice latency on the first token, not throughput. Keep context tight — most chat workloads work fine at 4K.
6. stream and format: "json" together can produce ill-formed partial JSON. Buffer the full response before parsing. JSON mode is not safe to render mid-stream.
7. Bundling the SDK for the browser fails. It uses Node http and stream internally. Always import on the server.
8. Process exit while streaming hangs the SDK. Always pass an AbortSignal so the SDK closes cleanly when the request is cancelled.
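For pitfall 2, the simplest fix is to accumulate a copy while you forward chunks. A sketch of the route's inner loop with logging added; the console.log stands in for whatever persistence you actually use.

// stream-and-log.ts (illustrative)
import ollama from "ollama";

async function streamAndLog(messages: { role: string; content: string }[]) {
  const response = await ollama.chat({ model: "llama3.2:8b", messages, stream: true });
  const encoder = new TextEncoder();
  let full = ""; // local copy, accumulated while chunks are forwarded

  const stream = new ReadableStream({
    async start(controller) {
      for await (const part of response) {
        full += part.message.content;                              // keep the copy
        controller.enqueue(encoder.encode(part.message.content));  // forward to the client
      }
      controller.close();
      console.log("assistant reply length:", full.length);         // or hand `full` to your logger
    },
  });
  return new Response(stream, { headers: { "Content-Type": "text/plain; charset=utf-8" } });
}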
Reading Material and References {#references}
- Official ollama-js GitHub repo — SDK source, types, examples
- Vercel AI SDK Ollama provider docs — useChat, useCompletion, tool rendering
- Companion guide: Ollama function calling and tools
- Companion guide: Ollama + ChromaDB RAG pipeline
- Production layer: Ollama production deployment
Closing Take {#closing}
The best thing about building local AI in JavaScript is how boring it is. The SDK works like any other HTTP client. Streaming maps to ReadableStream. Tool calling maps to a switch statement. Next.js App Router handles the rest. There is no exotic infrastructure, no GPU drivers (on a Mac), no API key rotation, no rate limit pages.
That is exactly why I keep using it. Local AI in TypeScript is a quiet, productive experience. You ship features faster because you stop fighting cloud quirks. You sleep better because no customer data is leaving your servers. And you can hand the project to a colleague who has never seen Ollama before and they will be productive in an hour.
If you build only one thing this weekend, build the Next.js streaming chat in this guide. It is genuinely the fastest path from "no app" to "private AI assistant running in my browser." Everything else is iteration on top of that foundation.