Ollama + Vercel AI SDK: Build a Streaming Local AI Web App in Next.js
Published April 23, 2026 by the LocalAIMaster Research Team — 18 min read
I have shipped four production Next.js apps backed by Ollama in the last twelve months. Every one of them needed the same scaffolding: streaming chat, a tidy hook for the UI, a tool-calling layer, and a switch to fall back to a hosted model when the local box goes down. The Vercel AI SDK gives you all of that without reinventing the protocol — but only if you wire it correctly. Most blog posts gloss over the actual gotchas. This guide does not.
By the end you will have a Next.js 15 app with App Router running a streaming chat against llama3.1:8b on your own machine, plus tool calls, structured output, and a deployment plan that actually survives traffic.
Why pair Ollama with the Vercel AI SDK {#why-pair}
The Vercel AI SDK (v4.x as of April 2026) is the de facto standard for building AI UI in TypeScript. It handles the data stream protocol, parses tool calls, manages the React state for incremental updates, and exposes useChat, useCompletion, and useObject hooks that handle 90% of what a chat UI needs.
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. That means you can point the SDK's standard OpenAI provider at Ollama and most things just work — but a handful of subtle behaviors (streaming tool calls, model option pass-through, embeddings) are smoother with the community Ollama provider. We will use both.
If you have not run a model locally yet, our first-time Ollama setup mistakes guide covers installation pitfalls that bite Vercel AI SDK users specifically (port conflicts, IPv6 fallback, missing GPU detection).
Quick Start: 5-minute setup {#quick-start}
# 1. Pull a tool-capable model
ollama pull llama3.1:8b
# 2. Verify Ollama exposes the OpenAI-compatible endpoint
curl http://localhost:11434/v1/models
# 3. Scaffold a Next.js app
npx create-next-app@latest my-local-ai --typescript --app --tailwind
cd my-local-ai
# 4. Install the AI SDK + the OpenAI provider
npm install ai @ai-sdk/openai @ai-sdk/react zod
Add to .env.local:
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=llama3.1:8b
That is the entire bootstrap. The rest of this guide makes it production-ready.
Step-by-step: Next.js chat with streaming {#how-to}
1. Create the API route
Create app/api/chat/route.ts:
import { createOpenAI } from '@ai-sdk/openai'
import { streamText, type CoreMessage } from 'ai'

export const runtime = 'nodejs' // critical — see pitfalls
export const maxDuration = 60 // allow long generations

const ollama = createOpenAI({
  baseURL: process.env.OLLAMA_BASE_URL!,
  apiKey: 'ollama', // any non-empty string works
})

export async function POST(req: Request) {
  const { messages }: { messages: CoreMessage[] } = await req.json()

  const result = await streamText({
    model: ollama(process.env.OLLAMA_MODEL!),
    system: 'You are a concise senior engineer. Cite versions where relevant.',
    messages,
    temperature: 0.4,
    maxTokens: 1024,
  })

  return result.toDataStreamResponse()
}
The two non-obvious lines are runtime = 'nodejs' (Edge cannot reach localhost) and return result.toDataStreamResponse() (anything else breaks the protocol useChat expects).
2. Build the chat UI
Create app/page.tsx:
'use client'

import { useChat } from '@ai-sdk/react'

export default function Home() {
  const { messages, input, handleInputChange, handleSubmit, status } = useChat({
    api: '/api/chat',
  })

  return (
    <main className="mx-auto max-w-2xl p-6">
      <div className="space-y-4 mb-6">
        {messages.map(m => (
          <div key={m.id} className="rounded border p-3">
            <strong>{m.role}:</strong>
            <div className="whitespace-pre-wrap">{m.content}</div>
          </div>
        ))}
        {status === 'streaming' && <div className="text-gray-500">…</div>}
      </div>
      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          className="flex-1 border rounded p-2"
          value={input}
          onChange={handleInputChange}
          placeholder="Ask Llama 3.1 anything"
        />
        <button className="px-4 py-2 bg-black text-white rounded">Send</button>
      </form>
    </main>
  )
}
npm run dev and the tokens stream into the page. On a Mac M2 with 16GB RAM I see the first token in 380ms and a sustained 32 tokens/sec for Llama 3.1 8B Q4_K_M. On an RTX 4090 the same model hits 105 tokens/sec.
3. Add a system prompt and conversation memory
useChat already maintains the message array in React state, so multi-turn memory is free. To inject system context per-request without exposing it to the client, do it server-side as shown above. Never trust client-supplied system prompts in production.
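If you want to enforce that server-side, strip any system-role messages the client sends before they reach the model. A minimal sketch of the same route body, reusing the ollama instance defined above (SYSTEM_PROMPT is an illustrative constant, not an SDK feature):

import { streamText, type CoreMessage } from 'ai'

const SYSTEM_PROMPT = 'You are a concise senior engineer. Cite versions where relevant.'

export async function POST(req: Request) {
  const { messages }: { messages: CoreMessage[] } = await req.json()

  // Ignore any system prompt the client tries to smuggle in.
  const safeMessages = messages.filter(m => m.role !== 'system')

  // `ollama` is the createOpenAI instance defined earlier in this route file.
  const result = await streamText({
    model: ollama(process.env.OLLAMA_MODEL!),
    system: SYSTEM_PROMPT, // injected server-side only
    messages: safeMessages,
  })

  return result.toDataStreamResponse()
}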
Tool calling and structured output {#tools}
Llama 3.1 and 3.2 ship with native tool-calling support. Here is a tool that calls a real weather API and one that searches a vector store you already host. Replace the URLs with your own.
import { z } from 'zod'
import { tool } from 'ai'

const tools = {
  weather: tool({
    description: 'Get current temperature for a city.',
    parameters: z.object({ city: z.string() }),
    execute: async ({ city }) => {
      // Encode the model-supplied argument before interpolating it into a URL.
      const r = await fetch(`https://wttr.in/${encodeURIComponent(city)}?format=j1`)
      const j = await r.json()
      return { tempC: j.current_condition[0].temp_C }
    },
  }),
  searchDocs: tool({
    description: 'Search the internal knowledge base.',
    parameters: z.object({ query: z.string(), k: z.number().default(3) }),
    execute: async ({ query, k }) => {
      const r = await fetch(
        `http://localhost:8000/search?q=${encodeURIComponent(query)}&k=${k}`,
      )
      return r.json()
    },
  }),
}

const result = await streamText({
  model: ollama(process.env.OLLAMA_MODEL!),
  messages,
  tools,
  maxSteps: 4, // let the model chain tool calls
})
For strict structured output, use generateObject with a Zod schema:
import { generateObject } from 'ai'

const { object } = await generateObject({
  model: ollama('llama3.1:8b'),
  schema: z.object({
    summary: z.string(),
    tags: z.array(z.string()).max(5),
    sentiment: z.enum(['positive', 'neutral', 'negative']),
  }),
  prompt: 'Summarize: ' + articleText,
})
Caveats from production: 1B and 3B models routinely return malformed JSON. Run the schema validator in a retry loop bounded to 3 attempts, then fall back to a 14B model. Our Ollama tool calling guide covers the underlying protocol if you need to debug raw responses.
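A minimal sketch of that bounded retry with a larger-model fallback; the helper name, the attempt count, and the exact model IDs mirror the advice above but are otherwise illustrative:

import { createOpenAI } from '@ai-sdk/openai'
import { generateObject } from 'ai'
import { z } from 'zod'

const ollama = createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })

const schema = z.object({
  summary: z.string(),
  tags: z.array(z.string()).max(5),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
})

async function summarizeWithRetry(articleText: string) {
  // Three attempts on the small model, then one shot on a 14B fallback.
  const attempts = ['llama3.1:8b', 'llama3.1:8b', 'llama3.1:8b', 'qwen2.5:14b']
  let lastError: unknown

  for (const modelId of attempts) {
    try {
      const { object } = await generateObject({
        model: ollama(modelId),
        schema,
        prompt: 'Summarize: ' + articleText,
      })
      return object // schema-valid on this attempt
    } catch (err) {
      lastError = err // malformed JSON or schema mismatch: retry or fall back
    }
  }
  throw lastError
}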
Ollama vs OpenAI provider configurations {#comparison}
| Concern | OpenAI provider against Ollama | Community ai-sdk-provider-ollama |
|---|---|---|
| Install | @ai-sdk/openai (already vendored) | npm i ai-sdk-provider-ollama |
| Streaming text | Works | Works |
| Tool calls | Works for compatible models | Works, with native param exposure |
| Embeddings | Use openai.embedding('nomic-embed-text') | First-class createOllama().embedding() |
| Modelfile options (mirostat, num_ctx) | Pass via providerOptions.openai | Native options object |
| Pull progress / list models | Not exposed | Native helpers |
| Risk | Stable | Tracks SDK breaking changes a release behind |
For 80% of apps, the OpenAI provider is enough. Reach for the community provider when you need embeddings, custom num_ctx, or programmatic model management.
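For the embeddings row, here is a minimal sketch that stays on the OpenAI provider path from the table; it assumes you have pulled the model first (ollama pull nomic-embed-text):

import { createOpenAI } from '@ai-sdk/openai'
import { embed, embedMany } from 'ai'

const ollama = createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })

// Single string
const { embedding } = await embed({
  model: ollama.embedding('nomic-embed-text'),
  value: 'How do I stream tokens from Ollama into useChat?',
})

// Batch of document chunks
const { embeddings } = await embedMany({
  model: ollama.embedding('nomic-embed-text'),
  values: ['chunk one', 'chunk two'],
})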
Deployment and production hardening {#deploy}
You cannot deploy this to vanilla Vercel and have it call localhost:11434. The serverless function runs in us-east-1; your Ollama daemon does not. Three deployment patterns work:
Option A — Self-host the whole stack. Put both the Next.js app and Ollama on one VPS or homelab box. pm2 for the Node app, systemctl for Ollama, nginx for TLS. This is the simplest path. Our Ollama in production guide walks through the nginx + Let's Encrypt + monitoring setup.
Option B — Vercel for the app, public Ollama endpoint. Run Ollama on a beefy machine, expose it through Cloudflare Tunnel or a Tailscale Funnel, set OLLAMA_BASE_URL=https://ollama.yourdomain.com/v1, and keep the API route on runtime = 'nodejs'. Add an auth header on the tunnel — the OpenAI-compatible endpoint has no built-in auth.
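The simplest auth pattern is a shared-secret header that the tunnel or reverse proxy validates before forwarding to Ollama. A minimal sketch; the x-tunnel-token header and the OLLAMA_TUNNEL_TOKEN variable are illustrative names, not Ollama features:

import { createOpenAI } from '@ai-sdk/openai'

const ollama = createOpenAI({
  baseURL: process.env.OLLAMA_BASE_URL!, // e.g. https://ollama.yourdomain.com/v1
  apiKey: 'ollama',
  headers: {
    // Checked and stripped by the tunnel or nginx, never seen by Ollama itself.
    'x-tunnel-token': process.env.OLLAMA_TUNNEL_TOKEN!,
  },
})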
Option C — Hybrid with an environment switch.
import { createOpenAI } from '@ai-sdk/openai'

export const model = process.env.AI_PROVIDER === 'ollama'
  ? createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })(
      process.env.OLLAMA_MODEL!,
    )
  : createOpenAI({ apiKey: process.env.OPENAI_API_KEY! })('gpt-4o-mini')
Dev runs free against Llama 3.1 8B. Production switches to GPT-4o-mini, or vice versa. The SDK abstracts the difference completely.
For high-traffic deployments, put a queue in front of Ollama. A single 24GB GPU running Llama 3.1 8B serves about 8 concurrent streams comfortably; beyond that, response times degrade. Bull, BullMQ, or even SQS with a worker on the same box handles backpressure cleanly. If you outgrow one box, put nginx in front of multiple Ollama instances — see the load balancing Ollama guide.
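A minimal BullMQ sketch of that backpressure layer, assuming a Redis instance on the same box; the queue name is illustrative, and the worker returns a complete response rather than a token stream (streaming through a queue needs an extra relay, which is out of scope here):

import { Queue, Worker } from 'bullmq'
import { createOpenAI } from '@ai-sdk/openai'
import { generateText, type CoreMessage } from 'ai'

const connection = { host: '127.0.0.1', port: 6379 }
const ollama = createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })

// Producers (API routes) enqueue here instead of calling Ollama directly.
export const generationQueue = new Queue('ollama-generations', { connection })

// One worker per Ollama box; concurrency matches what the GPU serves comfortably.
new Worker(
  'ollama-generations',
  async job => {
    const { messages } = job.data as { messages: CoreMessage[] }
    const { text } = await generateText({
      model: ollama(process.env.OLLAMA_MODEL!),
      messages,
    })
    return text
  },
  { connection, concurrency: 8 },
)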
For a comparison of cost between this setup and a pure cloud one, see Ollama vs ChatGPT API cost at scale — at ~150K daily tokens a self-hosted RTX 4090 pays back in 2.4 months.
Pitfalls and how to avoid them {#pitfalls}
- Edge runtime gotcha. export const runtime = 'edge' cannot reach 127.0.0.1. The error is silent — fetch just hangs. Use 'nodejs' until Ollama is on a public hostname.
- maxDuration default. Hobby Vercel projects time out at 10 seconds. Long generations need export const maxDuration = 60. Self-hosted Node is uncapped.
- Ollama keep-alive. Without OLLAMA_KEEP_ALIVE=24h, the model unloads after 5 minutes idle. Every cold call adds a 4–10s reload. Set the env var on the Ollama process, not the Next.js app.
- Model pull at runtime. Calling a model that is not pulled returns a 404 from Ollama, which the SDK surfaces as a generic error. Add a startup check that runs ollama list and pulls missing models (a minimal sketch follows this list).
- Streaming response shape. new Response(stream) does not work with useChat. Always use result.toDataStreamResponse().
- CORS. If you call Ollama from the browser directly (do not), Ollama refuses cross-origin requests by default. Set OLLAMA_ORIGINS=http://localhost:3000 for local debugging only.
- Tool calling on tiny models. Phi-3 mini and Llama 3.2 1B will happily invent tool arguments. Limit tool use to 7B+ models in production.
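The startup check from the model-pull bullet, sketched against the Ollama REST API (the /api/tags and /api/pull endpoints are the programmatic equivalents of ollama list and ollama pull); the ensureModel helper name is illustrative:

// Note: this talks to the bare Ollama host, not the /v1 OpenAI-compatible path.
const OLLAMA_HOST = 'http://localhost:11434'

export async function ensureModel(model: string) {
  const res = await fetch(`${OLLAMA_HOST}/api/tags`)
  const { models } = (await res.json()) as { models: { name: string }[] }

  if (!models.some(m => m.name === model)) {
    // Blocking pull; expect several minutes for multi-gigabyte models.
    await fetch(`${OLLAMA_HOST}/api/pull`, {
      method: 'POST',
      body: JSON.stringify({ model, stream: false }),
    })
  }
}

// e.g. await ensureModel('llama3.1:8b') from an instrumentation hook or startup script.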
For a full diagnostic flowchart, the Ollama troubleshooting guide lists every error code we have seen.
Reference: tested model and runtime combinations
| Model | Size on disk | Min RAM | Tokens/sec (M2 16GB) | Tokens/sec (RTX 4090) | Tool calls? |
|---|---|---|---|---|---|
| llama3.2:1b | 1.3 GB | 4 GB | 95 | 220 | unreliable |
| llama3.2:3b | 2.0 GB | 6 GB | 58 | 165 | usable |
| llama3.1:8b | 4.7 GB | 8 GB | 32 | 105 | reliable |
| qwen2.5:7b | 4.4 GB | 8 GB | 35 | 110 | reliable |
| mistral-nemo:12b | 7.1 GB | 16 GB | 18 | 72 | reliable |
| qwen2.5:14b | 8.4 GB | 16 GB | 16 | 64 | reliable |
| qwen2.5:32b | 19 GB | 24 GB | — | 28 | excellent |
All numbers measured April 2026 with Q4_K_M quantization, default num_ctx=4096, single user, warm cache. Multi-user serving cuts per-stream throughput roughly proportionally.
The official Vercel AI SDK reference is at sdk.vercel.ai, and the Ollama OpenAI-compatibility surface is documented in the Ollama OpenAI compatibility doc.
What to ship next
Your chat works, streams, calls tools, and degrades gracefully to a hosted provider. Three things separate a demo from a real product:
- Auth and rate limits. Wrap the route with NextAuth and Upstash Ratelimit; a minimal sketch follows this list. Without rate limits, one bad actor drains your GPU.
- Observability. Log token counts, latency, and tool calls to PostHog or your warehouse. A surprising fraction of model failures are silent.
- Evals. Set up a 50-prompt regression suite with our model evaluation framework so you catch quality drops when you swap models.
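A minimal rate-limit sketch for the first item, wrapping the chat route with @upstash/ratelimit; the 20-requests-per-minute window and the IP-based key are illustrative choices:

import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, '1 m'),
})

export async function POST(req: Request) {
  const ip = req.headers.get('x-forwarded-for') ?? 'anonymous'
  const { success } = await ratelimit.limit(ip)

  if (!success) {
    return new Response('Too many requests', { status: 429 })
  }

  // ...the streaming handler shown earlier goes here...
}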
This is the foundation we use for every internal tool at LocalAIMaster — from our writing assistant to our internal RAG search. The combination of @ai-sdk/react on the front and Ollama on the back gives you a developer experience close to using OpenAI, with zero per-token cost and full data ownership.