
# Ollama + Vercel AI SDK: Build a Streaming Local AI Web App in Next.js

April 23, 2026
18 min read
LocalAimaster Research Team


I have shipped four production Next.js apps backed by Ollama in the last twelve months. Every one of them needed the same scaffolding: streaming chat, a tidy hook for the UI, a tool-calling layer, and a switch to fall back to a hosted model when the local box goes down. The Vercel AI SDK gives you all of that without reinventing the protocol — but only if you wire it correctly. Most blog posts gloss over the actual gotchas. This guide does not.

By the end you will have a Next.js 15 app with App Router running a streaming chat against llama3.1:8b on your own machine, plus tool calls, structured output, and a deployment plan that actually survives traffic.

## Why pair Ollama with the Vercel AI SDK {#why-pair}

The Vercel AI SDK (v4.x as of April 2026) is the de facto standard for building AI UI in TypeScript. It handles the data stream protocol, parses tool calls, manages the React state for incremental updates, and exposes useChat, useCompletion, and useObject hooks that handle 90% of what a chat UI needs.

Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. That means you can point the SDK's standard OpenAI provider at Ollama and most things just work — but a handful of subtle behaviors (streaming tool calls, model option pass-through, embeddings) are smoother with the community Ollama provider. We will use both.

If you have not run a model locally yet, our first-time Ollama setup mistakes guide covers installation pitfalls that bite Vercel AI SDK users specifically (port conflicts, IPv6 fallback, missing GPU detection).

## Quick Start: 5-minute setup {#quick-start}

```bash
# 1. Pull a tool-capable model
ollama pull llama3.1:8b

# 2. Verify Ollama exposes the OpenAI-compatible endpoint
curl http://localhost:11434/v1/models

# 3. Scaffold a Next.js app
npx create-next-app@latest my-local-ai --typescript --app --tailwind
cd my-local-ai

# 4. Install the AI SDK + the OpenAI provider
npm install ai @ai-sdk/openai @ai-sdk/react zod
```

Add to `.env.local`:

```bash
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=llama3.1:8b
```

That is the entire bootstrap. The rest of this guide makes it production-ready.
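One thing worth adding before the UI work: fail fast when the configured model has not been pulled, instead of surfacing a generic error later. This is a sketch of mine, not an SDK feature; `hasModel` and `assertModelAvailable` are hypothetical names, and the response shape is the OpenAI-compatible model list that `GET /v1/models` returns.

```typescript
// Minimal shape of the OpenAI-compatible GET /v1/models response.
type ModelList = { data: { id: string }[] }

// Pure helper: is the configured model in the list Ollama reports?
function hasModel(list: ModelList, name: string): boolean {
  return list.data.some(m => m.id === name)
}

// Call once at server boot (hypothetical helper, not an SDK export).
async function assertModelAvailable(baseURL: string, name: string): Promise<void> {
  const res = await fetch(`${baseURL}/models`)
  if (!res.ok) throw new Error(`Ollama unreachable at ${baseURL}`)
  if (!hasModel((await res.json()) as ModelList, name)) {
    throw new Error(`Model ${name} not pulled; run: ollama pull ${name}`)
  }
}
```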

## Step-by-step: Next.js chat with streaming {#how-to}

### 1. Create the API route

Create `app/api/chat/route.ts`:

```ts
import { createOpenAI } from '@ai-sdk/openai'
import { streamText, type CoreMessage } from 'ai'

export const runtime = 'nodejs'           // critical — see pitfalls
export const maxDuration = 60             // allow long generations

const ollama = createOpenAI({
  baseURL: process.env.OLLAMA_BASE_URL!,
  apiKey: 'ollama',                       // any non-empty string works
})

export async function POST(req: Request) {
  const { messages }: { messages: CoreMessage[] } = await req.json()

  const result = streamText({
    model: ollama(process.env.OLLAMA_MODEL!),
    system: 'You are a concise senior engineer. Cite versions where relevant.',
    messages,
    temperature: 0.4,
    maxTokens: 1024,
  })

  return result.toDataStreamResponse()
}
```

The two non-obvious lines are `runtime = 'nodejs'` (Edge cannot reach localhost) and `return result.toDataStreamResponse()` (anything else breaks the protocol `useChat` expects).

### 2. Build the chat UI

Create `app/page.tsx`:

```tsx
'use client'
import { useChat } from '@ai-sdk/react'

export default function Home() {
  const { messages, input, handleInputChange, handleSubmit, status } = useChat({
    api: '/api/chat',
  })

  return (
    <main className="mx-auto max-w-2xl p-6">
      <div className="space-y-4 mb-6">
        {messages.map(m => (
          <div key={m.id} className="rounded border p-3">
            <strong>{m.role}:</strong>
            <div className="whitespace-pre-wrap">{m.content}</div>
          </div>
        ))}
        {status === 'streaming' && <div className="text-gray-500">…</div>}
      </div>
      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          className="flex-1 border rounded p-2"
          value={input}
          onChange={handleInputChange}
          placeholder="Ask Llama 3.1 anything"
        />
        <button className="px-4 py-2 bg-black text-white rounded">Send</button>
      </form>
    </main>
  )
}
```

Run `npm run dev` and the tokens stream into the page. On a Mac M2 with 16GB RAM I see the first token in 380ms and a sustained 32 tokens/sec for Llama 3.1 8B Q4_K_M. On an RTX 4090 the same model hits 105 tokens/sec.

### 3. Add a system prompt and conversation memory

`useChat` already maintains the message array in React state, so multi-turn memory is free. To inject system context per request without exposing it to the client, do it server-side as shown above. Never trust client-supplied system prompts in production.
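Free multi-turn memory does mean the payload grows every turn, and a local model's `num_ctx` is finite. One pattern is to trim history server-side before calling `streamText`; the helper name and the cap below are my choices, and it counts messages rather than tokens, so swap in a real tokenizer if you need exact budgets.

```typescript
// Message shape matching what useChat posts to the route.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Keep only the most recent `max` messages so the prompt stays inside
// the model's context window. Crude: counts messages, not tokens.
function trimHistory(messages: ChatMessage[], max = 20): ChatMessage[] {
  return messages.length <= max ? messages : messages.slice(-max)
}
```

In the route, pass `messages: trimHistory(messages)` to `streamText`; the system prompt stays intact because it is supplied separately on the server.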

## Tool calling and structured output {#tools}

Llama 3.1 and 3.2 ship with native tool-calling support. Here is a tool that calls a real weather API and one that searches a vector store you already host. Replace the URLs with your own.

```ts
import { z } from 'zod'
import { tool } from 'ai'

const tools = {
  weather: tool({
    description: 'Get current temperature for a city.',
    parameters: z.object({ city: z.string() }),
    execute: async ({ city }) => {
      const r = await fetch(`https://wttr.in/${encodeURIComponent(city)}?format=j1`)
      const j = await r.json()
      return { tempC: j.current_condition[0].temp_C }
    },
  }),
  searchDocs: tool({
    description: 'Search the internal knowledge base.',
    parameters: z.object({ query: z.string(), k: z.number().default(3) }),
    execute: async ({ query, k }) => {
      const r = await fetch(
        `http://localhost:8000/search?q=${encodeURIComponent(query)}&k=${k}`,
      )
      return r.json()
    },
  }),
}

const result = streamText({
  model: ollama(process.env.OLLAMA_MODEL!),
  messages,
  tools,
  maxSteps: 4,                  // let the model chain tool calls
})
```

For strict structured output, use `generateObject` with a Zod schema:

```ts
import { generateObject } from 'ai'

const { object } = await generateObject({
  model: ollama('llama3.1:8b'),
  schema: z.object({
    summary: z.string(),
    tags: z.array(z.string()).max(5),
    sentiment: z.enum(['positive', 'neutral', 'negative']),
  }),
  prompt: 'Summarize: ' + articleText,
})
```

Caveats from production: 1B and 3B models routinely return malformed JSON. Run the schema validator in a retry loop bounded to 3 attempts, then fall back to a 14B model. Our Ollama tool calling guide covers the underlying protocol if you need to debug raw responses.
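The bounded retry described above can be sketched as a generic wrapper. `withRetries` is my name for it, not an SDK export; in practice `produce` would wrap `generateObject` on the small model and `fallback` would call the 14B one.

```typescript
// Retry an async producer until its output passes `validate`, up to
// `attempts` tries, then hand off to `fallback` (e.g. a larger model).
async function withRetries<T>(
  produce: () => Promise<T>,
  validate: (value: T) => boolean,
  fallback: () => Promise<T>,
  attempts = 3,
): Promise<T> {
  for (let i = 0; i < attempts; i++) {
    try {
      const value = await produce()
      if (validate(value)) return value
    } catch {
      // Malformed JSON from a small model throws inside generateObject;
      // swallow it and try again.
    }
  }
  return fallback()
}
```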

## Ollama vs OpenAI provider configurations {#comparison}

| Concern | OpenAI provider against Ollama | Community `ai-sdk-provider-ollama` |
|---|---|---|
| Install | `@ai-sdk/openai` (already vendored) | `npm i ai-sdk-provider-ollama` |
| Streaming text | Works | Works |
| Tool calls | Works for compatible models | Works, with native param exposure |
| Embeddings | Use `openai.embedding('nomic-embed-text')` | First-class `createOllama().embedding()` |
| Modelfile options (`mirostat`, `num_ctx`) | Pass via `providerOptions.openai` | Native options object |
| Pull progress / list models | Not exposed | Native helpers |
| Risk | Stable | Tracks SDK breaking changes a release behind |
For 80% of apps, the OpenAI provider is enough. Reach for the community provider when you need embeddings, custom num_ctx, or programmatic model management.
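If you do wire up embeddings through either provider, ranking results is plain cosine similarity over the returned vectors. A self-contained sketch, nothing provider-specific in it:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

Embed the query and each document with `nomic-embed-text`, then sort documents by `cosine(queryVec, docVec)` descending.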

## Deployment and production hardening {#deploy}

You cannot deploy this to vanilla Vercel and have it call localhost:11434. The serverless function runs in us-east-1; your Ollama daemon does not. Three deployment patterns work:

Option A — Self-host the whole stack. Put both the Next.js app and Ollama on one VPS or homelab box. pm2 for the Node app, systemctl for Ollama, nginx for TLS. This is the simplest path. Our Ollama in production guide walks through the nginx + Let's Encrypt + monitoring setup.

Option B — Vercel for the app, public Ollama endpoint. Run Ollama on a beefy machine, expose it through Cloudflare Tunnel or a Tailscale Funnel, set OLLAMA_BASE_URL=https://ollama.yourdomain.com/v1, and keep the API route on runtime = 'nodejs'. Add an auth header on the tunnel — the OpenAI-compatible endpoint has no built-in auth.

Option C — Hybrid with an environment switch.

```ts
import { createOpenAI } from '@ai-sdk/openai'

export const model = process.env.AI_PROVIDER === 'ollama'
  ? createOpenAI({ baseURL: process.env.OLLAMA_BASE_URL!, apiKey: 'ollama' })(
      process.env.OLLAMA_MODEL!,
    )
  : createOpenAI({ apiKey: process.env.OPENAI_API_KEY! })('gpt-4o-mini')
```

Dev runs free against Llama 3.1 8B. Production switches to GPT-4o-mini, or vice versa. The SDK abstracts the difference completely.

For high-traffic deployments, put a queue in front of Ollama. A single 24GB GPU running Llama 3.1 8B serves about 8 concurrent streams comfortably; beyond that, response times degrade. Bull, BullMQ, or even SQS with a worker on the same box handles backpressure cleanly. If you outgrow one box, put nginx in front of multiple Ollama instances — see the load balancing Ollama guide.
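On a single box you can get most of the backpressure benefit with an in-process gate before reaching for BullMQ. This is a minimal sketch under the assumption of one Node process; the class name and the limit of 8 are mine.

```typescript
// Tiny FIFO semaphore: at most `limit` tasks run at once; the rest
// wait their turn. Enough for one process on one box.
class Gate {
  private running = 0
  private waiters: Array<() => void> = []
  constructor(private limit: number) {}

  private acquire(): Promise<void> {
    if (this.running < this.limit) {
      this.running++
      return Promise.resolve()
    }
    return new Promise(resolve =>
      this.waiters.push(() => {
        this.running++
        resolve()
      }),
    )
  }

  private release(): void {
    this.running--
    this.waiters.shift()?.()
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire()
    try {
      return await task()
    } finally {
      this.release()
    }
  }
}
```

Share one `new Gate(8)` at module scope in the route and wrap the generation, e.g. `return gate.run(() => handleChat(req))`. Across multiple processes or boxes this no longer coordinates anything; that is where a real queue earns its keep.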

For a comparison of cost between this setup and a pure cloud one, see Ollama vs ChatGPT API cost at scale — at ~150K daily tokens a self-hosted RTX 4090 pays back in 2.4 months.
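That payback figure depends entirely on your token volume and the hosted price you would otherwise pay. The arithmetic itself is one line; every input below is an illustrative assumption you should replace with your own numbers, not a quote from the linked article.

```typescript
// Months until the GPU's upfront cost equals the hosted-API spend it
// avoids. All inputs are assumptions; plug in your own.
function paybackMonths(opts: {
  gpuCostUsd: number         // upfront hardware cost
  dailyTokens: number        // tokens per day you would otherwise buy
  apiUsdPerMTokens: number   // blended hosted price per 1M tokens
  powerUsdPerMonth: number   // electricity for the box
}): number {
  const apiSpendPerMonth =
    ((opts.dailyTokens * 30) / 1_000_000) * opts.apiUsdPerMTokens
  return opts.gpuCostUsd / (apiSpendPerMonth - opts.powerUsdPerMonth)
}
```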

## Pitfalls and how to avoid them {#pitfalls}

- **Edge runtime gotcha.** `export const runtime = 'edge'` cannot reach `127.0.0.1`. The error is silent — fetch just hangs. Use `'nodejs'` until Ollama is on a public hostname.
- **maxDuration default.** Hobby Vercel projects time out at 10 seconds. Long generations need `export const maxDuration = 60`. Self-hosted Node is uncapped.
- **Ollama keep-alive.** Without `OLLAMA_KEEP_ALIVE=24h`, the model unloads after 5 minutes idle. Every cold call adds a 4–10s reload. Set the env var on the Ollama process, not the Next.js app.
- **Model pull at runtime.** Calling a model that is not pulled returns a 404 from Ollama, which the SDK surfaces as a generic error. Add a startup check that runs `ollama list` and pulls missing models.
- **Streaming response shape.** `new Response(stream)` does not work with `useChat`. Always use `result.toDataStreamResponse()`.
- **CORS.** If you call Ollama from the browser directly (do not), Ollama refuses cross-origin requests by default. Set `OLLAMA_ORIGINS=http://localhost:3000` for local debugging only.
- **Tool calling on tiny models.** Phi-3 mini and Llama 3.2 1B will happily invent tool arguments. Limit tool use to 7B+ models in production.

For a full diagnostic flowchart, the Ollama troubleshooting guide lists every error code we have seen.

## Reference: tested model and runtime combinations

| Model | Size on disk | Min RAM | Tokens/sec (M2 16GB) | Tokens/sec (RTX 4090) | Tool calls? |
|---|---|---|---|---|---|
| llama3.2:1b | 1.3 GB | 4 GB | 95 | 220 | unreliable |
| llama3.2:3b | 2.0 GB | 6 GB | 58 | 165 | usable |
| llama3.1:8b | 4.7 GB | 8 GB | 32 | 105 | reliable |
| qwen2.5:7b | 4.4 GB | 8 GB | 35 | 110 | reliable |
| mistral-nemo:12b | 7.1 GB | 16 GB | 18 | 72 | reliable |
| qwen2.5:14b | 8.4 GB | 16 GB | 16 | 64 | reliable |
| qwen2.5:32b | 19 GB | 24 GB | — | 28 | excellent |

All numbers measured April 2026 with Q4_K_M quantization, default num_ctx=4096, single user, warm cache. Multi-user serving cuts per-stream throughput roughly proportionally.

The official Vercel AI SDK reference is at sdk.vercel.ai, and the Ollama OpenAI-compatibility surface is documented in the Ollama OpenAI compatibility doc.

## What to ship next

Your chat works, streams, calls tools, and degrades gracefully to a hosted provider. Three things separate a demo from a real product:

  1. Auth and rate limits. Wrap the route with NextAuth and Upstash Ratelimit. Without rate limits, one bad actor drains your GPU.
  2. Observability. Log token counts, latency, and tool calls to PostHog or your warehouse. A surprising fraction of model failures are silent.
  3. Evals. Set up a 50-prompt regression suite with our model evaluation framework so you catch quality drops when you swap models.

This is the foundation we use for every internal tool at LocalAIMaster — from our writing assistant to our internal RAG search. The combination of @ai-sdk/react on the front and Ollama on the back gives you a developer experience close to using OpenAI, with zero per-token cost and full data ownership.

Written by Pattanaik Ramswarup, Creator of Local AI Master.