Ollama + Vercel AI SDK: Build a Streaming Local AI Web App in Next.js
Published April 23, 2026 by the LocalAIMaster Research Team — 18 min read
I have shipped four production Next.js apps backed by Ollama in the last twelve months. Every one of them needed the same scaffolding: streaming chat, a tidy hook for the UI, a tool-calling layer, and a switch to fall back to a hosted model when the local box goes down. The Vercel AI SDK gives you all of that without reinventing the protocol — but only if you wire it correctly. Most blog posts gloss over the actual gotchas. This guide does not.
By the end you will have a Next.js 15 app with App Router running a streaming chat against llama3.1:8b on your own machine, plus tool calls, structured output, and a deployment plan that actually survives traffic.
Why pair Ollama with the Vercel AI SDK {#why-pair}
The Vercel AI SDK (v4.x as of April 2026) is the de facto standard for building AI UI in TypeScript. It handles the data stream protocol, parses tool calls, manages the React state for incremental updates, and exposes useChat, useCompletion, and useObject hooks that handle 90% of what a chat UI needs.
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. That means you can point the SDK's standard OpenAI provider at Ollama and most things just work — but a handful of subtle behaviors (streaming tool calls, model option pass-through, embeddings) are smoother with the community Ollama provider. We will use both.
If you have not run a model locally yet, our first-time Ollama setup mistakes guide covers installation pitfalls that bite Vercel AI SDK users specifically (port conflicts, IPv6 fallback, missing GPU detection).
Quick Start: 5-minute setup {#quick-start}
# 1. Pull a tool-capable model
ollama pull llama3.1:8b
# 2. Verify Ollama exposes the OpenAI-compatible endpoint
curl http://localhost:11434/v1/models
# 3. Scaffold a Next.js app
npx create-next-app@latest my-local-ai --typescript --app --tailwind
cd my-local-ai
# 4. Install the AI SDK + the OpenAI provider
npm install ai @ai-sdk/openai @ai-sdk/react zod
Add to .env.local:
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=llama3.1:8b
That is the entire bootstrap. The rest of this guide makes it production-ready.
Step-by-step: Next.js chat with streaming {#how-to}
1. Create the API route
Create app/api/chat/route.ts:
import { createOpenAI } from '@ai-sdk/openai'
import { streamText, type CoreMessage } from 'ai'

export const runtime = 'nodejs' // critical — see pitfalls
export const maxDuration = 60 // allow long generations

const ollama = createOpenAI({
  baseURL: process.env.OLLAMA_BASE_URL!,
  apiKey: 'ollama', // any non-empty string works
})

export async function POST(req: Request) {
  const { messages }: { messages: CoreMessage[] } = await req.json()

  const result = await streamText({
    model: ollama(process.env.OLLAMA_MODEL!),
    system: 'You are a concise senior engineer. Cite versions where relevant.',
    messages,
    temperature: 0.4,
    maxTokens: 1024,
  })

  return result.toDataStreamResponse()
}
The two non-obvious lines are runtime = 'nodejs' (Edge cannot reach localhost) and return result.toDataStreamResponse() (anything else breaks the protocol useChat expects).
2. Build the chat UI
Create app/page.tsx:
'use client'

import { useChat } from '@ai-sdk/react'

export default function Home() {
  const { messages, input, handleInputChange, handleSubmit, status } = useChat({
    api: '/api/chat',
  })

  return (
    <main className="mx-auto max-w-2xl p-6">
      <div className="space-y-4 mb-6">
        {messages.map(m => (
          <div key={m.id} className="rounded border p-3">
            <strong>{m.role}:</strong>
            <div className="whitespace-pre-wrap">{m.content}</div>
          </div>
        ))}
        {status === 'streaming' && <div className="text-gray-500">…</div>}
      </div>
      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          className="flex-1 border rounded p-2"
          value={input}
          onChange={handleInputChange}
          placeholder="Ask Llama 3.1 anything"
        />
        <button className="px-4 py-2 bg-black text-white rounded">Send</button>
      </form>
    </main>
  )
}
npm run dev and the tokens stream into the page. On a Mac M2 with 16GB RAM I see the first token in 380ms and a sustained 32 tokens/sec for Llama 3.1 8B Q4_K_M. On an RTX 4090 the same model hits 105 tokens/sec.
3. Add a system prompt and conversation memory
useChat already maintains the message array in React state, so multi-turn memory is free. To inject system context per-request without exposing it to the client, do it server-side as shown above. Never trust client-supplied system prompts in production.
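If you want to enforce that server-side, strip any system-role messages the client sends before they reach the model. A minimal sketch of the same route body, reusing the ollama instance defined above (SYSTEM_PROMPT is an illustrative constant, not an SDK feature):

import { streamText, type CoreMessage } from 'ai'

const SYSTEM_PROMPT = 'You are a concise senior engineer. Cite versions where relevant.'

export async function POST(req: Request) {
  const { messages }: { messages: CoreMessage[] } = await req.json()

  // Ignore any system prompt the client tries to smuggle in.
  const safeMessages = messages.filter(m => m.role !== 'system')

  // `ollama` is the createOpenAI instance defined earlier in this route file.
  const result = await streamText({
    model: ollama(process.env.OLLAMA_MODEL!),
    system: SYSTEM_PROMPT, // injected server-side only
    messages: safeMessages,
  })

  return result.toDataStreamResponse()
}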
Tool calling and structured output {#tools}
Llama 3.1 and 3.2 ship with native tool-calling support. Here is a tool that calls a real weather API and one that searches a vector store you already host. Replace the URLs with your own.
import { z } from 'zod'
import { tool } from 'ai'

const tools = {
  weather: tool({
    description: 'Get current temperature for a city.',
    parameters: z.object({ city: z.string() }),
    execute: async ({ city }) => {
      // Encode the model-supplied argument before interpolating it into a URL.
      const r = await fetch(`https://wttr.in/${encodeURIComponent(city)}?format=j1`)
      const j = await r.json()
      return { tempC: j.current_condition[0].temp_C }
    },
  }),
  searchDocs: tool({
    description: 'Search the internal knowledge base.',
    parameters: z.object({ query: z.string(), k: z.number().default(3) }),
    execute: async ({ query, k }) => {
      const r = await fetch(
        `http://localhost:8000/search?q=${encodeURIComponent(query)}&k=${k}`,
      )
      return r.json()
    },
  }),
}

const result = await streamText({
  model: ollama(process.env.OLLAMA_MODEL!),
  messages,
  tools,
  maxSteps: 4, // let the model chain tool calls
})
For strict structured output, use generateObject with a Zod schema:
import { generateObject } from 'ai'

const { object } = await generateObject({
  model: ollama('llama3.1:8b'),
  schema: z.object({
    summary: z.string(),
    tags: z.array(z.string()).max(5),
    sentiment: z.enum(['positive', 'neutral', 'negative']),
  }),
  prompt: 'Summarize: ' + articleText,
})
Caveats from production: 1B and 3B models routinely return malformed JSON. Run the schema validator in a retry loop bounded to 3 attempts, then fall back to a 14B model. Our Ollama tool calling guide covers the underlying protocol if you need to debug raw responses.
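A minimal sketch of that bounded retry with a larger-model fallback; the helper name, the attempt count, and the exact model IDs mirror the advice above but are otherwise illustrative:

import { createOpenAI } from '@ai-sdk/openai'
import { generateObject } from 'ai'
import { z } from 'zod'

const ollama = createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })

const schema = z.object({
  summary: z.string(),
  tags: z.array(z.string()).max(5),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
})

async function summarizeWithRetry(articleText: string) {
  // Three attempts on the small model, then one shot on a 14B fallback.
  const attempts = ['llama3.1:8b', 'llama3.1:8b', 'llama3.1:8b', 'qwen2.5:14b']
  let lastError: unknown

  for (const modelId of attempts) {
    try {
      const { object } = await generateObject({
        model: ollama(modelId),
        schema,
        prompt: 'Summarize: ' + articleText,
      })
      return object // schema-valid on this attempt
    } catch (err) {
      lastError = err // malformed JSON or schema mismatch: retry or fall back
    }
  }
  throw lastError
}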
Ollama vs OpenAI provider configurations {#comparison}
| Concern | OpenAI provider against Ollama | Community ai-sdk-provider-ollama |
|---|---|---|
| Install | @ai-sdk/openai (already vendored) | npm i ai-sdk-provider-ollama |
| Streaming text | Works | Works |
| Tool calls | Works for compatible models | Works, with native param exposure |
| Embeddings | Use openai.embedding('nomic-embed-text') | First-class createOllama().embedding() |
| Modelfile options (mirostat, num_ctx) | Pass via providerOptions.openai | Native options object |
| Pull progress / list models | Not exposed | Native helpers |
| Risk | Stable | Tracks SDK breaking changes a release behind |
For 80% of apps, the OpenAI provider is enough. Reach for the community provider when you need embeddings, custom num_ctx, or programmatic model management.
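For the embeddings row, here is a minimal sketch that stays on the OpenAI provider path from the table; it assumes you have pulled the model first (ollama pull nomic-embed-text):

import { createOpenAI } from '@ai-sdk/openai'
import { embed, embedMany } from 'ai'

const ollama = createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })

// Single string
const { embedding } = await embed({
  model: ollama.embedding('nomic-embed-text'),
  value: 'How do I stream tokens from Ollama into useChat?',
})

// Batch of document chunks
const { embeddings } = await embedMany({
  model: ollama.embedding('nomic-embed-text'),
  values: ['chunk one', 'chunk two'],
})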
Deployment and production hardening {#deploy}
You cannot deploy this to vanilla Vercel and have it call localhost:11434. The serverless function runs in us-east-1; your Ollama daemon does not. Three deployment patterns work:
Option A — Self-host the whole stack. Put both the Next.js app and Ollama on one VPS or homelab box. pm2 for the Node app, systemctl for Ollama, nginx for TLS. This is the simplest path. Our Ollama in production guide walks through the nginx + Let's Encrypt + monitoring setup.
Option B — Vercel for the app, public Ollama endpoint. Run Ollama on a beefy machine, expose it through Cloudflare Tunnel or a Tailscale Funnel, set OLLAMA_BASE_URL=https://ollama.yourdomain.com/v1, and keep the API route on runtime = 'nodejs'. Add an auth header on the tunnel — the OpenAI-compatible endpoint has no built-in auth.
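The simplest auth pattern is a shared-secret header that the tunnel or reverse proxy validates before forwarding to Ollama. A minimal sketch; the x-tunnel-token header and the OLLAMA_TUNNEL_TOKEN variable are illustrative names, not Ollama features:

import { createOpenAI } from '@ai-sdk/openai'

const ollama = createOpenAI({
  baseURL: process.env.OLLAMA_BASE_URL!, // e.g. https://ollama.yourdomain.com/v1
  apiKey: 'ollama',
  headers: {
    // Checked and stripped by the tunnel or nginx, never seen by Ollama itself.
    'x-tunnel-token': process.env.OLLAMA_TUNNEL_TOKEN!,
  },
})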
Option C — Hybrid with an environment switch.
import { createOpenAI } from '@ai-sdk/openai'

export const model = process.env.AI_PROVIDER === 'ollama'
  ? createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })(
      process.env.OLLAMA_MODEL!,
    )
  : createOpenAI({ apiKey: process.env.OPENAI_API_KEY! })('gpt-4o-mini')
Dev runs free against Llama 3.1 8B. Production switches to GPT-4o-mini, or vice versa. The SDK abstracts the difference completely.
For high-traffic deployments, put a queue in front of Ollama. A single 24GB GPU running Llama 3.1 8B serves about 8 concurrent streams comfortably; beyond that, response times degrade. Bull, BullMQ, or even SQS with a worker on the same box handles backpressure cleanly. If you outgrow one box, put nginx in front of multiple Ollama instances — see the load balancing Ollama guide.
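A minimal BullMQ sketch of that backpressure layer, assuming a Redis instance on the same box; the queue name is illustrative, and the worker returns a complete response rather than a token stream (streaming through a queue needs an extra relay, which is out of scope here):

import { Queue, Worker } from 'bullmq'
import { createOpenAI } from '@ai-sdk/openai'
import { generateText, type CoreMessage } from 'ai'

const connection = { host: '127.0.0.1', port: 6379 }
const ollama = createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })

// Producers (API routes) enqueue here instead of calling Ollama directly.
export const generationQueue = new Queue('ollama-generations', { connection })

// One worker per Ollama box; concurrency matches what the GPU serves comfortably.
new Worker(
  'ollama-generations',
  async job => {
    const { messages } = job.data as { messages: CoreMessage[] }
    const { text } = await generateText({
      model: ollama(process.env.OLLAMA_MODEL!),
      messages,
    })
    return text
  },
  { connection, concurrency: 8 },
)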
For a comparison of cost between this setup and a pure cloud one, see Ollama vs ChatGPT API cost at scale — at ~150K daily tokens a self-hosted RTX 4090 pays back in 2.4 months.
Pitfalls and how to avoid them {#pitfalls}
- Edge runtime gotcha. export const runtime = 'edge' cannot reach 127.0.0.1. The error is silent — fetch just hangs. Use 'nodejs' until Ollama is on a public hostname.
- maxDuration default. Hobby Vercel projects time out at 10 seconds. Long generations need export const maxDuration = 60. Self-hosted Node is uncapped.
- Ollama keep-alive. Without OLLAMA_KEEP_ALIVE=24h, the model unloads after 5 minutes idle. Every cold call adds a 4–10s reload. Set the env var on the Ollama process, not the Next.js app.
- Model pull at runtime. Calling a model that is not pulled returns a 404 from Ollama, which the SDK surfaces as a generic error. Add a startup check that runs ollama list and pulls missing models (a minimal sketch follows this list).
- Streaming response shape. new Response(stream) does not work with useChat. Always use result.toDataStreamResponse().
- CORS. If you call Ollama from the browser directly (do not), Ollama refuses cross-origin requests by default. Set OLLAMA_ORIGINS=http://localhost:3000 for local debugging only.
- Tool calling on tiny models. Phi-3 mini and Llama 3.2 1B will happily invent tool arguments. Limit tool use to 7B+ models in production.
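The startup check from the model-pull bullet, sketched against the Ollama REST API (the /api/tags and /api/pull endpoints are the programmatic equivalents of ollama list and ollama pull); the ensureModel helper name is illustrative:

// Note: this talks to the bare Ollama host, not the /v1 OpenAI-compatible path.
const OLLAMA_HOST = 'http://localhost:11434'

export async function ensureModel(model: string) {
  const res = await fetch(`${OLLAMA_HOST}/api/tags`)
  const { models } = (await res.json()) as { models: { name: string }[] }

  if (!models.some(m => m.name === model)) {
    // Blocking pull; expect several minutes for multi-gigabyte models.
    await fetch(`${OLLAMA_HOST}/api/pull`, {
      method: 'POST',
      body: JSON.stringify({ model, stream: false }),
    })
  }
}

// e.g. await ensureModel('llama3.1:8b') from an instrumentation hook or startup script.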
For a full diagnostic flowchart, the Ollama troubleshooting guide lists every error code we have seen.
Reference: tested model and runtime combinations
| Model | Size on disk | Min RAM | Tokens/sec (M2 16GB) | Tokens/sec (RTX 4090) | Tool calls? |
|---|---|---|---|---|---|
| llama3.2:1b | 1.3 GB | 4 GB | 95 | 220 | unreliable |
| llama3.2:3b | 2.0 GB | 6 GB | 58 | 165 | usable |
| llama3.1:8b | 4.7 GB | 8 GB | 32 | 105 | reliable |
| qwen2.5:7b | 4.4 GB | 8 GB | 35 | 110 | reliable |
| mistral-nemo:12b | 7.1 GB | 16 GB | 18 | 72 | reliable |
| qwen2.5:14b | 8.4 GB | 16 GB | 16 | 64 | reliable |
| qwen2.5:32b | 19 GB | 24 GB | — | 28 | excellent |
All numbers measured April 2026 with Q4_K_M quantization, default num_ctx=4096, single user, warm cache. Multi-user serving cuts per-stream throughput roughly proportionally.
The official Vercel AI SDK reference is at sdk.vercel.ai, and the Ollama OpenAI-compatibility surface is documented in the Ollama OpenAI compatibility doc.
What to ship next
Your chat works, streams, calls tools, and degrades gracefully to a hosted provider. Three things separate a demo from a real product:
- Auth and rate limits. Wrap the route with NextAuth and Upstash Ratelimit; a minimal sketch follows this list. Without rate limits, one bad actor drains your GPU.
- Observability. Log token counts, latency, and tool calls to PostHog or your warehouse. A surprising fraction of model failures are silent.
- Evals. Set up a 50-prompt regression suite with our model evaluation framework so you catch quality drops when you swap models.
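A minimal rate-limit sketch for the first item, wrapping the chat route with @upstash/ratelimit; the 20-requests-per-minute window and the IP-based key are illustrative choices:

import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, '1 m'),
})

export async function POST(req: Request) {
  const ip = req.headers.get('x-forwarded-for') ?? 'anonymous'
  const { success } = await ratelimit.limit(ip)

  if (!success) {
    return new Response('Too many requests', { status: 429 })
  }

  // ...the streaming handler shown earlier goes here...
}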
This is the foundation we use for every internal tool at LocalAIMaster — from our writing assistant to our internal RAG search. The combination of @ai-sdk/react on the front and Ollama on the back gives you a developer experience close to using OpenAI, with zero per-token cost and full data ownership.