Ollama with JavaScript and TypeScript: Build a Local AI App
Published on April 23, 2026 • 18 min read
I have shipped two production apps in 2026 that use Ollama as their LLM layer: an internal documentation chat for a 40-person engineering team, and a personal "research notebook" that summarizes papers and saves to Obsidian. Both are TypeScript. Both run private. Both replaced a paid OpenAI bill that would have hit $300/month at our usage volume.
The good news for JavaScript developers: Ollama's API is genuinely easy to use from Node and the browser. The official ollama package is small (no Python ceremony), the streaming patterns map cleanly to fetch+ReadableStream, and the Vercel AI SDK has first-class Ollama support since version 4.
The not-so-good news: most tutorials show you a hello-world chat completion and stop. They skip streaming, error handling, model warm-up, structured outputs, and the deployment patterns that turn a demo into something you can actually ship.
This guide is the production playbook. By the end, you will have a streaming Next.js chat app, a structured-output extractor, and a tool-calling agent — all running locally, all written in TypeScript.
Quick Start: First Token in 60 Seconds {#quick-start}
# 1. Install Ollama and pull a model
brew install ollama
ollama pull llama3.2:8b
# 2. Create a Node project
mkdir local-ai-app && cd local-ai-app
npm init -y
npm pkg set type=module   # ESM, so top-level await in hello.ts works
npm install ollama
npm install --save-dev typescript tsx @types/node
npx tsc --init
// hello.ts
import ollama from "ollama";
const res = await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "Write a haiku about TypeScript." }],
});
console.log(res.message.content);
npx tsx hello.ts
# Curly braces wrap
# Types catch my dumb mistakes
# Build pipeline hums
That's a complete local AI app: 12 lines of code, no API keys, no cloud cost, no data leaving your machine. The rest of this guide is what you do after this.
Why Ollama From JavaScript {#why-ollama-js}
Three reasons:
- Surface area matches your stack. If you ship Node services, Next.js apps, or browser tools, you do not want a Python sidecar. The Ollama JS SDK keeps your stack monolingual.
- Streaming maps to web primitives. Ollama's streaming output is a simple async iterable, which converts to a Web ReadableStream in three lines and feeds Next.js, Remix, or any browser fetch consumer.
- No CORS pain on server routes. Browsers block direct calls to localhost:11434 from a different origin. Server routes solve this cleanly. Next.js, Remix, and Express are all easy hosts.
The SDK is the official ollama-js package. It works in Node 18+, Bun 1+, and modern browsers when paired with a server proxy.
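To ground the second bullet, here is what that conversion looks like. A minimal sketch: toReadableStream is just an illustrative helper name, and the Next.js route later in this guide uses the same pattern inline.

// stream-to-web.ts (illustrative)
import ollama from "ollama";

// Wrap Ollama's async iterable of chat chunks in a Web ReadableStream of UTF-8 bytes.
function toReadableStream(chunks: AsyncIterable<{ message: { content: string } }>) {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      for await (const part of chunks) {
        controller.enqueue(encoder.encode(part.message.content));
      }
      controller.close();
    },
  });
}

// Usage: hand the stream straight to a Response in any web framework.
const chunks = await ollama.chat({
  model: "llama3.2:8b",
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});
const body = toReadableStream(chunks);
// return new Response(body, { headers: { "Content-Type": "text/plain; charset=utf-8" } });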
SDK Basics {#sdk-basics}
The four functions you will use 95% of the time:
import ollama from "ollama";
// 1. Single-turn chat
const r1 = await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "Hello" }],
});
// 2. Streaming chat
const stream = await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "Write a paragraph about Mars." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.message.content);
}
// 3. Embeddings (for RAG, search)
const emb = await ollama.embeddings({
model: "nomic-embed-text",
prompt: "The cat sat on the mat",
});
console.log(emb.embedding.length); // 768
// 4. Generate (raw completion without chat formatting — useful for one-off prompts)
const r4 = await ollama.generate({
model: "llama3.2:8b",
prompt: "Capital of Australia is",
});
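If you have not used embeddings before, here is a hedged sketch of how the embeddings call above powers a toy similarity search. The cosine helper and the document list are illustrative, not part of the SDK.

// embed-search.ts (illustrative)
import ollama from "ollama";

const docs = [
  "Cats sleep for most of the day",
  "TypeScript adds static types to JavaScript",
  "Mars is the fourth planet from the sun",
];

// Plain cosine similarity between two vectors.
function cosine(a: number[], b: number[]) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Embed the documents once, then rank them against a query embedding.
const docVecs = await Promise.all(
  docs.map((d) => ollama.embeddings({ model: "nomic-embed-text", prompt: d }))
);
const query = await ollama.embeddings({ model: "nomic-embed-text", prompt: "statically typed languages" });

const ranked = docs
  .map((text, i) => ({ text, score: cosine(query.embedding, docVecs[i].embedding) }))
  .sort((a, b) => b.score - a.score);
console.log(ranked[0]); // the TypeScript document should rank first

Swap the in-memory array for a vector store once you have more than a few hundred documents; the companion RAG guide linked at the end covers that.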
For non-default Ollama hosts (production, Docker, remote), instantiate a client:
import { Ollama } from "ollama";
const client = new Ollama({ host: "http://10.0.0.50:11434" });
const r = await client.chat({ model: "llama3.2:8b", messages: [...] });
Build a Next.js Chat App {#nextjs-app}
A working streaming chat app in Next.js App Router takes about 80 lines.
1. Project setup
npx create-next-app@latest local-chat --typescript --app --tailwind
cd local-chat
npm install ollama
2. Streaming server route
// app/api/chat/route.ts
import { Ollama } from "ollama";
import { NextRequest } from "next/server";
const ollama = new Ollama({ host: process.env.OLLAMA_HOST || "http://127.0.0.1:11434" });
export const runtime = "nodejs";
export async function POST(req: NextRequest) {
const { messages, model = "llama3.2:8b" } = await req.json();
const response = await ollama.chat({
model,
messages,
stream: true,
options: { temperature: 0.4, num_ctx: 4096 },
});
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
for await (const part of response) {
controller.enqueue(encoder.encode(part.message.content));
}
} catch (err: any) {
controller.enqueue(encoder.encode(`\n[error: ${err.message}]`));
} finally {
controller.close();
}
},
});
return new Response(stream, {
headers: { "Content-Type": "text/plain; charset=utf-8" },
});
}
3. Client component
// app/page.tsx
"use client";
import { useState } from "react";
type Msg = { role: "user" | "assistant"; content: string };
export default function Chat() {
const [messages, setMessages] = useState<Msg[]>([]);
const [input, setInput] = useState("");
const [pending, setPending] = useState("");
const [busy, setBusy] = useState(false);
async function send() {
if (!input.trim() || busy) return;
const userMsg: Msg = { role: "user", content: input };
const next = [...messages, userMsg];
setMessages(next);
setInput("");
setBusy(true);
setPending("");
const res = await fetch("/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages: next }),
});
const reader = res.body!.getReader();
const dec = new TextDecoder();
let acc = "";
while (true) {
const { value, done } = await reader.read();
if (done) break;
acc += dec.decode(value);
setPending(acc);
}
setMessages([...next, { role: "assistant", content: acc }]);
setPending("");
setBusy(false);
}
return (
<main className="max-w-2xl mx-auto p-6">
<h1 className="text-2xl font-bold mb-4">Local Chat (Ollama)</h1>
<div className="space-y-3 mb-4">
{messages.map((m, i) => (
<div key={i} className={m.role === "user" ? "text-blue-600" : "text-gray-800"}>
<strong>{m.role}: </strong>{m.content}
</div>
))}
{pending && <div className="text-gray-500"><strong>assistant: </strong>{pending}</div>}
</div>
<div className="flex gap-2">
<input
className="flex-1 border rounded p-2"
value={input}
onChange={e => setInput(e.target.value)}
onKeyDown={e => e.key === "Enter" && send()}
disabled={busy}
placeholder="Ask anything..."
/>
<button onClick={send} disabled={busy} className="px-4 py-2 bg-black text-white rounded">
Send
</button>
</div>
</main>
);
}
npm run dev
# Open http://localhost:3000 — fully streaming local chat
That is a complete, private chat app. No environment variables, no third-party API, no data leaving the box.
Add the Vercel AI SDK {#vercel-ai-sdk}
For richer UI primitives — useChat, message lists, tool-call rendering — use the Vercel AI SDK.
npm install ai @ai-sdk/react ollama-ai-provider
// app/api/chat/route.ts (Vercel AI SDK version)
import { streamText } from "ai";
import { createOllama } from "ollama-ai-provider";
const ollama = createOllama({
baseURL: (process.env.OLLAMA_HOST ?? "http://127.0.0.1:11434") + "/api",
});
export async function POST(req: Request) {
const { messages } = await req.json();
const result = streamText({
model: ollama("llama3.2:8b"),
messages,
temperature: 0.4,
});
return result.toDataStreamResponse();
}
// app/page.tsx (Vercel AI SDK version)
"use client";
import { useChat } from "@ai-sdk/react";
export default function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat();
return (
<main className="max-w-2xl mx-auto p-6">
{messages.map(m => (
<div key={m.id}>
<strong>{m.role}: </strong>{m.content}
</div>
))}
<form onSubmit={handleSubmit} className="flex gap-2 mt-4">
<input value={input} onChange={handleInputChange} className="flex-1 border rounded p-2" />
<button disabled={isLoading} className="px-4 py-2 bg-black text-white rounded">Send</button>
</form>
</main>
);
}
The Vercel AI SDK handles streaming, message state, abort signals, and partial tool-call rendering. For non-trivial chat UIs, it saves a couple of weekends of plumbing.
Structured Outputs With Zod {#structured-outputs}
For data extraction tasks, you want JSON, not prose. Combine Ollama's format: "json" with Zod for runtime validation.
// extract.ts
import ollama from "ollama";
import { z } from "zod";
const PersonSchema = z.object({
name: z.string(),
email: z.string().email().nullable(),
role: z.enum(["engineer", "designer", "manager", "other"]),
skills: z.array(z.string()).max(10),
});
async function extractPerson(text: string) {
const res = await ollama.chat({
model: "llama3.2:8b",
messages: [
{
role: "system",
content:
"Extract person info from the user's text. Return JSON matching this shape: " +
'{ name: string, email: string|null, role: "engineer"|"designer"|"manager"|"other", skills: string[] }. ' +
"If a field is unknown, use null or an empty array.",
},
{ role: "user", content: text },
],
format: "json",
options: { temperature: 0.1 },
});
const raw = JSON.parse(res.message.content);
return PersonSchema.parse(raw); // throws if model returned bad shape
}
const out = await extractPerson(
"Maya Patel - senior react developer, knows TypeScript, Next.js, GraphQL. maya@example.com"
);
console.log(out);
Zod's .parse() rejects malformed responses. Catch the ZodError, inspect the issue, and you can retry with a corrective prompt.
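Here is a hedged sketch of that retry loop, built on the extractPerson function above; the corrective message wording and the single-retry budget are illustrative choices.

// extract-retry.ts (illustrative, continues extract.ts above)
import { ZodError } from "zod";

async function extractPersonWithRetry(text: string, retries = 1) {
  let prompt = text;
  for (;;) {
    try {
      return await extractPerson(prompt);
    } catch (err) {
      if (!(err instanceof ZodError) || retries-- <= 0) throw err;
      // Feed the validation issues back so the model can correct its own output.
      const issues = err.issues.map((i) => `${i.path.join(".")}: ${i.message}`).join("; ");
      prompt = `${text}\n\nYour previous JSON was invalid (${issues}). Return corrected JSON only.`;
    }
  }
}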
For deeper structured-output patterns and the strict tool-calling alternative, see our Ollama function calling guide.
Tool-Calling Agent in TypeScript {#tool-agent}
Function calling works the same way it does in Python — the JSON Schema is identical:
// agent.ts
import ollama from "ollama";
const tools = [
{
type: "function" as const,
function: {
name: "get_time",
description: "Get current ISO timestamp.",
parameters: { type: "object", properties: {} },
},
},
{
type: "function" as const,
function: {
name: "search_repo",
description: "Search code in the user's repository.",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search term." },
},
required: ["query"],
},
},
},
];
const TOOLS: Record<string, (args: any) => Promise<string>> = {
get_time: async () => JSON.stringify({ now: new Date().toISOString() }),
search_repo: async ({ query }) =>
JSON.stringify([{ file: "src/index.ts", line: 42, snippet: `// match for ${query}` }]),
};
async function runAgent(question: string) {
const messages: any[] = [{ role: "user", content: question }];
for (let i = 0; i < 6; i++) {
const res = await ollama.chat({ model: "qwen2.5:7b", messages, tools });
messages.push(res.message);
const calls = res.message.tool_calls ?? [];
if (calls.length === 0) return res.message.content;
for (const c of calls) {
const fn = TOOLS[c.function.name];
const result = fn ? await fn(c.function.arguments) : JSON.stringify({ error: "unknown tool" });
messages.push({ role: "tool", name: c.function.name, content: result });
}
}
return "Hit max turns.";
}
console.log(await runAgent("What time is it, and find any results matching 'logger' in the repo?"));
The pattern is the same as in Python: bounded loop, tool registry dispatch, structured error responses.
Bun and Edge Runtimes {#bun-edge}
Bun 1.1+ supports the Ollama SDK natively. No extra configuration:
bun add ollama
bun run hello.ts
For Edge runtimes (Cloudflare Workers, Vercel Edge Functions), you cannot bundle the Ollama SDK directly — the SDK uses Node-specific APIs. Two options:
- Use the Node runtime in Next.js (export const runtime = "nodejs"). Simplest, and almost always the right answer for Ollama because Ollama itself is on a server you control.
- Call the Ollama HTTP API directly with fetch from Edge. Works, but you give up the SDK's typing.
// edge-route.ts
export const runtime = "edge";
export async function POST(req: Request) {
const { messages } = await req.json();
// Ollama streams newline-delimited JSON objects, not SSE
const r = await fetch("https://your-ollama-host/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ model: "llama3.2:8b", messages, stream: true }),
});
return new Response(r.body, { headers: { "Content-Type": "application/x-ndjson" } });
}
For most apps, just use the Node runtime. Ollama is local — you do not need Edge's geographic distribution.
Express / Fastify Patterns {#express-fastify}
For non-Next.js Node services, the patterns are nearly identical:
// express-server.ts
import express from "express";
import ollama from "ollama";
const app = express();
app.use(express.json());
app.post("/api/chat", async (req, res) => {
res.setHeader("Content-Type", "text/plain; charset=utf-8");
const stream = await ollama.chat({
model: "llama3.2:8b",
messages: req.body.messages,
stream: true,
});
for await (const part of stream) {
res.write(part.message.content);
}
res.end();
});
app.listen(8080);
The Ollama SDK's async iterable plays nicely with any streaming HTTP server.
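The heading promises Fastify too, so here is a sketch of the equivalent route, assuming Fastify 4. Readable.from turns the SDK's async iterable into a Node stream that Fastify can send directly.

// fastify-server.ts (illustrative)
import Fastify from "fastify";
import ollama from "ollama";
import { Readable } from "node:stream";

const app = Fastify();

app.post("/api/chat", async (req, reply) => {
  const { messages } = req.body as { messages: { role: string; content: string }[] };
  const stream = await ollama.chat({ model: "llama3.2:8b", messages, stream: true });
  // Adapt the async iterable of chunks into a Node Readable of plain text.
  const text = Readable.from(
    (async function* () {
      for await (const part of stream) yield part.message.content;
    })()
  );
  return reply.type("text/plain; charset=utf-8").send(text);
});

await app.listen({ port: 8080 });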
Performance Patterns {#performance}
1. Warm the model. Cold-start of an 8B model takes 2-4 seconds on Apple Silicon, longer on disk-bound systems. On boot, fire a no-op call so the model is in memory:
async function warm() {
await ollama.chat({
model: "llama3.2:8b",
messages: [{ role: "user", content: "warmup" }],
options: { num_predict: 1 },
});
}
2. Tune keep_alive. Ollama unloads models after 5 minutes by default. For chat apps, set keep_alive: "30m" per request to keep the model warm (see the sketch after this list).
3. Right-size context. num_ctx: 4096 is fast. num_ctx: 32768 is much slower and rarely needed for chat. Choose per workload.
4. Streaming over single response. Even if you do not show partial output, streaming lets you abort early on user navigation. Use AbortController.
5. Concurrent requests. Ollama 0.4+ handles ~3-5 concurrent inference requests on consumer hardware before slowing significantly. For higher concurrency, scale to multiple Ollama hosts behind a load balancer.
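A sketch combining tips 2 and 4: keep the model resident with keep_alive and stop consuming the stream when the caller aborts. The chatWithAbort name and the break-on-abort pattern are illustrative; keep_alive is a documented request field.

// warm-and-abort.ts (illustrative)
import ollama from "ollama";

async function chatWithAbort(messages: { role: string; content: string }[], signal: AbortSignal) {
  const stream = await ollama.chat({
    model: "llama3.2:8b",
    messages,
    stream: true,
    keep_alive: "30m",          // keep the model loaded between requests
    options: { num_ctx: 4096 }, // right-sized context (tip 3)
  });
  let out = "";
  for await (const part of stream) {
    if (signal.aborted) break;  // stop reading as soon as the client goes away
    out += part.message.content;
  }
  return out;
}

// Usage:
// const ac = new AbortController();
// const reply = chatWithAbort(messages, ac.signal);
// ac.abort(); // e.g. on route change or component unmount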
For deeper production scaling, see Ollama production deployment.
Error Handling {#error-handling}
Three failure modes you will see in production:
1. Ollama not running.
try {
await ollama.chat({ ... });
} catch (err: any) {
if (err.code === "ECONNREFUSED" || err.cause?.code === "ECONNREFUSED") {
return new Response("Local AI is not running. Start with 'ollama serve'.", { status: 503 });
}
throw err;
}
2. Model not pulled.
// Detect "model not found" and auto-pull
catch (err: any) {
if (err.message?.includes("model") && err.message?.includes("not found")) {
await ollama.pull({ model: "llama3.2:8b" });
return retryOnce();
}
throw err;
}
3. OOM on big context. Ollama returns a generic error. Catch it, halve num_ctx, retry once.
A robust wrapper:
async function safeChat(req: any, retries = 1): Promise<any> {
try {
return await ollama.chat(req);
} catch (err: any) {
if (retries > 0 && err.message?.includes("memory")) {
const halved = { ...req, options: { ...req.options, num_ctx: Math.max(1024, (req.options?.num_ctx ?? 4096) / 2) } };
return safeChat(halved, retries - 1);
}
throw err;
}
}
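A boot-time health check catches failure modes 1 and 2 before any user hits them. A sketch: ensureReady is an illustrative name, and the exact-match comparison assumes you pin full model tags.

// health-check.ts (illustrative, run once at server startup)
import ollama from "ollama";

async function ensureReady(model = "llama3.2:8b") {
  let list;
  try {
    list = await ollama.list(); // fails fast if the daemon is not running
  } catch {
    throw new Error("Ollama is not reachable. Start it with 'ollama serve'.");
  }
  if (!list.models.some((m) => m.name === model)) {
    console.log(`Pulling ${model}...`);
    await ollama.pull({ model }); // one-time download on first boot
  }
}

await ensureReady();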
Deployment Patterns {#deployment}
Three common shapes:
Local-only desktop tool. Ship as an Electron or Tauri app. Bundle nothing — assume Ollama runs locally. The simplest case.
Self-hosted on your own server. Run Ollama on a small VPS (Hetzner, Linode) or a home server. Put it behind Nginx with TLS. Restrict access by IP allowlist or token.
server {
  listen 443 ssl http2;
  server_name ai.example.com;
  # ssl_certificate and ssl_certificate_key go here; nginx will not start a TLS server without them
  location /api/ {
    proxy_pass http://127.0.0.1:11434;
    proxy_buffering off;
    proxy_read_timeout 600s;
    if ($http_authorization != "Bearer YOUR_TOKEN") { return 401; }
  }
}
Hybrid. Ollama for daily traffic, OpenAI/Anthropic API as fallback when the user wants a frontier model. Use a feature flag in your route to switch backends.
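A sketch of that switch, reusing the Vercel AI SDK provider from earlier. The frontier flag and the OpenAI fallback are illustrative assumptions, not part of either SDK.

// app/api/chat/route.ts (hybrid sketch)
import { streamText } from "ai";
import { createOllama } from "ollama-ai-provider";
import { openai } from "@ai-sdk/openai"; // npm install @ai-sdk/openai

const local = createOllama({
  baseURL: (process.env.OLLAMA_HOST ?? "http://127.0.0.1:11434") + "/api",
});

export async function POST(req: Request) {
  const { messages, frontier = false } = await req.json();
  // Feature flag: local model by default, cloud frontier model on explicit opt-in.
  const model =
    frontier && process.env.OPENAI_API_KEY ? openai("gpt-4o") : local("llama3.2:8b");
  const result = streamText({ model, messages, temperature: 0.4 });
  return result.toDataStreamResponse();
}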
The full hardening checklist (TLS, auth, rate limits, monitoring) is in our Ollama production deployment guide.
Pitfalls and Gotchas {#pitfalls}
1. fetch from the browser to localhost:11434 is blocked by CORS. Always proxy through a server route. Do not try to "fix" Ollama's CORS — it is intentional.
2. The SDK's streaming async iterable cannot be replayed. If you want to log responses while streaming to clients, tee the stream or accumulate into a string as you forward chunks (see the sketch after this list).
3. Model names are case-sensitive and tag-strict. llama3.2:8b and llama3.2 are different aliases, sometimes resolving to different files. Pin exact tags in production.
4. ollama-ai-provider lags behind upstream Ollama by 1-2 weeks on new features. For bleeding-edge features (new tool fields, JSON mode tweaks), fall back to the official SDK.
5. Long context = slow first token. Users notice latency on the first token, not throughput. Keep context tight — most chat workloads work fine at 4K.
6. stream and format: "json" together can produce ill-formed partial JSON. Buffer the full response before parsing. JSON mode is not safe to render mid-stream.
7. Bundling the SDK for the browser fails. It uses Node http and stream internally. Always import on the server.
8. Process exit while streaming hangs the SDK. Always pass an AbortSignal so the SDK closes cleanly when the request is cancelled.
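For pitfall 2, the simplest fix is to accumulate a copy while you forward chunks. A sketch of the route's inner loop with logging added; the console.log stands in for whatever persistence you actually use.

// stream-and-log.ts (illustrative)
import ollama from "ollama";

async function streamAndLog(messages: { role: string; content: string }[]) {
  const response = await ollama.chat({ model: "llama3.2:8b", messages, stream: true });
  const encoder = new TextEncoder();
  let full = ""; // local copy, accumulated while chunks are forwarded

  const stream = new ReadableStream({
    async start(controller) {
      for await (const part of response) {
        full += part.message.content;                              // keep the copy
        controller.enqueue(encoder.encode(part.message.content));  // forward to the client
      }
      controller.close();
      console.log("assistant reply length:", full.length);         // or hand `full` to your logger
    },
  });
  return new Response(stream, { headers: { "Content-Type": "text/plain; charset=utf-8" } });
}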
Reading Material and References {#references}
- Official ollama-js GitHub repo — SDK source, types, examples
- Vercel AI SDK Ollama provider docs — useChat, useCompletion, tool rendering
- Companion guide: Ollama function calling and tools
- Companion guide: Ollama + ChromaDB RAG pipeline
- Production layer: Ollama production deployment
Closing Take {#closing}
The best thing about building local AI in JavaScript is how boring it is. The SDK works like any other HTTP client. Streaming maps to ReadableStream. Tool calling maps to a switch statement. Next.js App Router handles the rest. There is no exotic infrastructure, no GPU drivers (on a Mac), no API key rotation, no rate limit pages.
That is exactly why I keep using it. Local AI in TypeScript is a quiet, productive experience. You ship features faster because you stop fighting cloud quirks. You sleep better because no customer data is leaving your servers. And you can hand the project to a colleague who has never seen Ollama before and they will be productive in an hour.
If you build only one thing this weekend, build the Next.js streaming chat in this guide. It is genuinely the fastest path from "no app" to "private AI assistant running in my browser." Everything else is iteration on top of that foundation.