Add Local AI to an Existing App: REST API Patterns That Work (2026)
Published April 23, 2026 • 18 min read
You already have a working app. A SaaS dashboard, an internal tool, a mobile backend. Now leadership wants "AI features" but the legal team will not sign off on shipping customer data to OpenAI. The good news: you do not need to rewrite anything. A local LLM behind a stable REST contract slots in cleanly next to your existing services. This guide walks through the patterns we have shipped to production across Node, Python, and Go codebases — including the failure modes nobody mentions in tutorials.
Quick Start: Wire Local AI Into Your App in 5 Minutes
If you just need a working endpoint to point your app at, do this first and read the rest later.
# 1. Install Ollama (one line)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a small, fast model
ollama pull llama3.2:3b
# 3. Start the OpenAI-compatible endpoint
ollama serve # listens on http://localhost:11434
In your existing app, change exactly two things:
# .env
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama # any non-empty string
That is it. The OpenAI Node and Python SDKs talk to Ollama unchanged because Ollama implements /v1/chat/completions. Latency on a 3B model with an RTX 3060 is around 18 ms time-to-first-token and 95 tokens/sec sustained throughput. Compare to 280 ms TTFT and rate-limited 30 tokens/sec from gpt-4o-mini over the public internet.
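Before touching any app code, you can sanity-check the wiring with a raw request; the payload is a standard Chat Completions body:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Say hi"}]}'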
If you want to do this properly — with retries, streaming, auth, and a fallback to a cloud model when the local box is busy — keep reading.
Table of Contents
- When Local AI Is the Right Call
- Architecture: Where the LLM Lives
- The OpenAI-Compatible Adapter Pattern
- Streaming Without Breaking Your Frontend
- Authentication and Per-User Quotas
- Retries, Timeouts, and Circuit Breakers
- Local + Cloud Fallback Routing
- Benchmarks: Local vs OpenAI on Real Workloads
- Pitfalls We Hit in Production
- FAQ
When Local AI Is the Right Call {#when-local-fits}
Not every app benefits from going local. Use this decision matrix before committing.
| Workload | Local LLM | Cloud LLM |
|---|---|---|
| Customer support drafting | Strong (privacy, low latency) | Fine if data is non-sensitive |
| Code completion in IDE | Strong (no token cost, fast) | Acceptable for occasional use |
| Document classification | Excellent (predictable load) | Wasteful at scale |
| Long-form creative writing | Mediocre below 13B | Strong on GPT-4-class models |
| Healthcare or legal records | Mandatory local | Compliance risk |
| Hobby project, low volume | Overkill — use the free tier | Easier |
| 1M+ requests/day | Local with a GPU pool wins on cost | Bills explode |
A 7B model on a $700 used RTX 3090 handles roughly 40,000 chat completions per day at 8 tokens/sec average load. The same volume on gpt-4o-mini costs about $42/day in input alone. At even modest scale, the GPU pays for itself inside 30 days.
If you are still deciding the hardware question, our budget local AI machine teardown and AI server build under $1500 walk through the parts list.
Architecture: Where the LLM Lives {#architecture}
Three patterns cover 95% of production deployments. Pick the one that matches your operational tolerance.
Pattern A: Sidecar (single-server apps)
The Ollama process runs on the same host as your app. localhost:11434 is the only network hop. This is the right answer for internal tools, prototypes, and apps with under 50 concurrent users.
┌──────────────────────────────┐
│ App container (Node/Py) │
│ │ │
│ ▼ │
│ localhost:11434 (Ollama) │
└──────────────────────────────┘
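If your app is containerized, the sidecar is one extra service. A minimal docker-compose sketch, assuming the official ollama/ollama image (note that inside the compose network the hostname is ollama, not localhost):
# docker-compose.yml
services:
  app:
    build: .
    environment:
      - AI_BASE_URL=http://ollama:11434/v1
      - AI_API_KEY=ollama
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # persist pulled models across restarts
volumes:
  ollama-models: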
Pattern B: Dedicated AI host (recommended for most teams)
Your app server stays a CPU-only box. A second machine with a GPU runs Ollama or vLLM. Communication is over a private VLAN or a WireGuard mesh.
[App server] ──https──> [ai.internal:443] ──> [Ollama on GPU host]
This is what we recommend in Ollama in production: nginx terminates TLS, forwards to 127.0.0.1:11434, and the GPU host has no public IP.
Pattern C: GPU pool behind a router
Multiple GPU workers behind LiteLLM, vLLM router, or a custom load balancer. This is where you go when one GPU saturates. We cover the routing layer in detail in our AI gateway with LiteLLM guide.
[App] ──> [LiteLLM router] ──> [worker-1 GPU0]
──> [worker-2 GPU1]
──> [worker-3 GPU2]
The OpenAI-Compatible Adapter Pattern {#openai-adapter}
The single highest-leverage decision is to keep your application code thinking it is talking to OpenAI. Ollama, vLLM, LM Studio, and llama.cpp all expose /v1/chat/completions. Your existing SDK calls work unchanged.
Node.js (TypeScript)
// lib/ai-client.ts
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: process.env.AI_BASE_URL || 'http://localhost:11434/v1',
apiKey: process.env.AI_API_KEY || 'ollama',
timeout: 60_000,
maxRetries: 0, // we handle retries ourselves; see below
})
export async function chat(messages: { role: string; content: string }[]) {
const completion = await client.chat.completions.create({
model: process.env.AI_MODEL || 'llama3.2:3b',
messages,
temperature: 0.2,
})
return completion.choices[0].message.content
}
Python
# lib/ai_client.py
import os
from openai import OpenAI
client = OpenAI(
base_url=os.getenv("AI_BASE_URL", "http://localhost:11434/v1"),
api_key=os.getenv("AI_API_KEY", "ollama"),
timeout=60.0,
max_retries=0,
)
def chat(messages: list[dict]) -> str:
resp = client.chat.completions.create(
model=os.getenv("AI_MODEL", "llama3.2:3b"),
messages=messages,
temperature=0.2,
)
return resp.choices[0].message.content
Go (using go-openai)
package ai
import (
"context"
"os"
openai "github.com/sashabaranov/go-openai"
)
func NewClient() *openai.Client {
	config := openai.DefaultConfig(os.Getenv("AI_API_KEY"))
	// fall back to the SDK default (api.openai.com) when no local base URL is set
	if base := os.Getenv("AI_BASE_URL"); base != "" {
		config.BaseURL = base // e.g. http://localhost:11434/v1
	}
	return openai.NewClientWithConfig(config)
}
func Chat(ctx context.Context, c *openai.Client, msgs []openai.ChatCompletionMessage) (string, error) {
resp, err := c.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
Model: os.Getenv("AI_MODEL"),
Messages: msgs,
Temperature: 0.2,
})
if err != nil {
return "", err
}
return resp.Choices[0].Message.Content, nil
}
This pattern means you can develop locally against Ollama, run staging against vLLM, and switch to OpenAI for one specific feature, all by flipping environment variables. No code change.
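Concretely, the same module runs against three backends with nothing but environment changes (the staging hostname below is illustrative):
# dev: Ollama on your laptop
AI_BASE_URL=http://localhost:11434/v1
AI_API_KEY=ollama
AI_MODEL=llama3.2:3b

# staging: vLLM on the GPU host
AI_BASE_URL=https://ai.internal.example.com/v1
AI_API_KEY=your-internal-bearer-secret
AI_MODEL=llama3.1:8b

# prod, one premium feature: OpenAI
AI_BASE_URL=https://api.openai.com/v1
AI_API_KEY=sk-...
AI_MODEL=gpt-4o-mini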
Streaming Without Breaking Your Frontend {#streaming}
Users hate watching a spinner. Stream tokens as they arrive. The OpenAI-compatible endpoint speaks Server-Sent Events the same way the real OpenAI API does.
Express (Node) streaming proxy
import express from 'express'
import OpenAI from 'openai'
const app = express()
const ai = new OpenAI({
baseURL: process.env.AI_BASE_URL,
apiKey: process.env.AI_API_KEY,
})
app.post('/api/chat', express.json(), async (req, res) => {
res.setHeader('Content-Type', 'text/event-stream')
res.setHeader('Cache-Control', 'no-cache')
res.setHeader('Connection', 'keep-alive')
res.flushHeaders()
const stream = await ai.chat.completions.create({
model: 'llama3.2:3b',
messages: req.body.messages,
stream: true,
})
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || ''
if (delta) {
res.write(`data: ${JSON.stringify({ token: delta })}\n\n`)
}
}
res.write('data: [DONE]\n\n')
res.end()
})
app.listen(3000)
Frontend consumer (vanilla fetch)
const res = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages }),
})
const reader = res.body!.getReader()
const decoder = new TextDecoder()
let buffer = ''
while (true) {
  const { value, done } = await reader.read()
  if (done) break
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true })
  // hold any partial SSE line in the buffer until the rest of it arrives
  const lines = buffer.split('\n')
  buffer = lines.pop() ?? ''
  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      const { token } = JSON.parse(line.slice(6))
      appendToUI(token)
    }
  }
}
Two non-obvious gotchas. First, nginx buffers the entire response by default. Set proxy_buffering off; on streaming routes (or have the app send an X-Accel-Buffering: no response header) or your tokens arrive in one fat chunk after 30 seconds. Second, Cloudflare's free tier caps responses at 100 seconds. If the model needs longer than that to finish, the connection drops mid-stream. Either raise the timeout (paid plan) or chunk the prompt.
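A minimal nginx location block for the streaming route, with both server-side fixes in one place (the upstream address is illustrative):
location /api/chat {
    proxy_pass http://127.0.0.1:3000;
    proxy_buffering off;        # pass SSE chunks through as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;    # allow long generations
}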
Authentication and Per-User Quotas {#auth}
Ollama by default has no auth. Do not expose port 11434 to the public internet. Put it behind your existing app's auth layer.
Pattern: app-level proxy with per-user limits
Your app already knows who the user is (session, JWT, API key). Wrap the LLM call:
import rateLimit from 'express-rate-limit'
const aiLimiter = rateLimit({
windowMs: 60 * 1000,
max: (req: any) => {
if (req.user?.tier === 'pro') return 60 // 60 req/min
if (req.user?.tier === 'free') return 10 // 10 req/min
return 0
},
keyGenerator: (req: any) => req.user?.id || req.ip,
})
app.post('/api/chat', requireAuth, aiLimiter, chatHandler)
For a multi-tenant SaaS, log every prompt + token count to your existing analytics table. We discuss the audit trail patterns in our local AI audit trail guide.
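A sketch of that usage log, assuming a hypothetical logAiUsage helper writing to your existing analytics table; the usage field is part of the standard Chat Completions response shape, and Ollama's compat endpoint populates it:
async function chatWithAudit(userId: string, messages: any[]) {
  const completion = await client.chat.completions.create({
    model: process.env.AI_MODEL || 'llama3.2:3b',
    messages,
  })
  // token counts come back on the completion object itself
  await logAiUsage({
    userId,
    model: completion.model,
    promptTokens: completion.usage?.prompt_tokens ?? 0,
    completionTokens: completion.usage?.completion_tokens ?? 0,
    at: new Date(),
  })
  return completion.choices[0].message.content
}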
Bearer token in front of Ollama (defense in depth)
Add a tiny nginx in front:
server {
listen 443 ssl http2;
server_name ai.internal.example.com;
location / {
if ($http_authorization != "Bearer YOUR_LONG_RANDOM_SECRET") {
return 401;
}
proxy_pass http://127.0.0.1:11434;
proxy_buffering off;
proxy_read_timeout 300s;
}
}
Even if the app server is compromised, the attacker cannot directly hit the GPU host without the bearer secret.
Retries, Timeouts, and Circuit Breakers {#resilience}
Local LLMs fail differently from OpenAI: connection refused (Ollama crashed), 503 (model still loading), or an outright hang (CUDA OOM while the model recovers). Treat them like any other flaky upstream.
import pRetry from 'p-retry'
async function chatWithRetry(messages: any[]) {
return pRetry(
async () => {
const ctrl = new AbortController()
const timeout = setTimeout(() => ctrl.abort(), 45_000)
try {
return await ai.chat.completions.create(
{ model: 'llama3.2:3b', messages },
{ signal: ctrl.signal },
)
} finally {
clearTimeout(timeout)
}
},
{
retries: 3,
factor: 2,
minTimeout: 500,
onFailedAttempt: (e) => {
console.warn(`AI attempt ${e.attemptNumber} failed: ${e.message}`)
},
},
)
}
Add a circuit breaker so a single dead GPU does not take down your whole app. We use opossum in Node and circuitbreaker in Go. Trip after 5 consecutive failures, half-open after 30 seconds.
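A sketch with opossum wrapping the retrying client above. One caveat: opossum trips on an error percentage over a rolling window rather than a strict consecutive-failure count, so errorThresholdPercentage only approximates the 5-failures policy:
import CircuitBreaker from 'opossum'

const breaker = new CircuitBreaker(chatWithRetry, {
  timeout: 120_000,              // generous: the retry wrapper may make several attempts
  errorThresholdPercentage: 50,  // open the circuit when half of recent calls fail
  resetTimeout: 30_000,          // half-open after 30 seconds
})

// degrade gracefully instead of throwing at the caller
breaker.fallback(() => ({ error: 'AI temporarily unavailable' }))

export const chatSafe = (messages: any[]) => breaker.fire(messages)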
Realistic timeout values
| Operation | Local 3B | Local 7B | Local 13B | OpenAI gpt-4o-mini |
|---|---|---|---|---|
| TTFT (time to first token) | 18 ms | 45 ms | 90 ms | 280 ms |
| Tokens/sec | 95 | 55 | 28 | ~50 (variable) |
| 500-token response | 5.4 s | 9.1 s | 17.9 s | ~12 s |
| Cold start (model load) | 2 s | 5 s | 11 s | 0 |
Set your client timeout to the cold start plus 2x the longest expected response. For a 7B model with up to 500-token replies, 30 seconds is comfortable.
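As a worked example of that rule using the 7B column above (numbers from the table; the helper is hypothetical):
// cold start + 2x the longest expected response, in ms
const clientTimeoutMs = (coldStartS: number, worstResponseS: number) =>
  (coldStartS + 2 * worstResponseS) * 1000

clientTimeoutMs(5, 9.1) // ≈ 23,200 ms for a 7B model; round up to 30_000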
Local + Cloud Fallback Routing {#fallback}
Sometimes the local GPU is busy, sometimes you need GPT-4-grade quality for a hard task. Route by feature, not by load.
type Task = 'summarize' | 'classify' | 'creative' | 'reasoning'
async function smartChat(task: Task, messages: any[]) {
const route = {
summarize: { client: localAI, model: 'llama3.2:3b' },
classify: { client: localAI, model: 'llama3.2:3b' },
creative: { client: localAI, model: 'llama3.1:8b' },
reasoning: { client: openAI, model: 'gpt-4o' },
}[task]
try {
return await route.client.chat.completions.create({
model: route.model, messages,
})
} catch (e) {
// graceful degradation: any failure falls back to OpenAI
return await openAI.chat.completions.create({
model: 'gpt-4o-mini', messages,
})
}
}
90% of typical SaaS LLM traffic is summarization, classification, or formatting — the kind of work an 8B model handles fine. Send the 10% hard reasoning to OpenAI and your bill drops by an order of magnitude.
Benchmarks: Local vs OpenAI on Real Workloads {#benchmarks}
Numbers are from a Ryzen 7 5800X, 32 GB RAM, RTX 3090 24 GB, Ubuntu 22.04, Ollama 0.4.x. Each test ran 1,000 prompts at 8 concurrent connections.
| Workload | Model | TTFT (ms) | Throughput (tok/s) | P99 latency | $/1M tokens |
|---|---|---|---|---|---|
| Email summary (300 in, 80 out) | llama3.2:3b | 22 | 95 | 1.2 s | $0 (electricity ~$0.02) |
| Customer support draft (500 in, 250 out) | llama3.1:8b | 48 | 55 | 6.1 s | $0 |
| Code review comment (1200 in, 400 out) | qwen2.5-coder:7b | 54 | 52 | 9.8 s | $0 |
| Long reasoning (800 in, 1500 out) | gpt-4o (cloud) | 290 | 48 | 35 s | $4.50 input / $13.50 output |
| Long reasoning (800 in, 1500 out) | llama3.3:70b (local) | 180 | 11 | 145 s | $0 |
Conclusion: for short-to-medium tasks, local crushes cloud on cost and latency. For 70B-class reasoning, local is cheaper but slower. Hybrid is the winning pattern.
Pitfalls We Hit in Production {#pitfalls}
1. Concurrency limits. Ollama serves requests sequentially per model by default. Set OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2. Without these you will see request queues build up under load.
2. Context window leaks. When you send long histories, your KV cache balloons VRAM usage. A 7B model with 32k context can OOM on a 24 GB GPU at concurrency 4. Cap num_ctx per request.
3. The "first request takes 8 seconds" problem. Ollama lazy-loads models, and after five minutes idle the model unloads. Set OLLAMA_KEEP_ALIVE=24h for predictable latency, or send a synthetic warm-up request every 60 seconds (see the keep-warm sketch after this list).
4. Streaming through a load balancer. AWS ALB has a 60-second idle timeout. NLB is 350 seconds. Use NLB or set idle_timeout higher. nginx proxy_read_timeout defaults to 60 seconds.
5. Token counting drift. OpenAI's tiktoken vocabulary does not match Llama's. If you bill users by token, count with the model's own tokenizer: Llama 3 ships its own BPE vocabulary, so load the actual tokenizer (for example through Hugging Face's tokenizers bindings) instead of approximating with tiktoken.
6. Logs leaking PII. The default Ollama log writes the full prompt and response at info level. In a HIPAA or GDPR setting that is a breach waiting to happen. Set OLLAMA_DEBUG=0 and pipe app-level logs through a redaction layer.
7. Model version drift between dev and prod. ollama pull llama3.2 resolves to whatever the latest tag points at today. Pin the digest: ollama pull llama3.2:3b@sha256:abc... so a silent upstream update cannot change behaviour.
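The keep-warm ping from pitfall 3 is a few lines anywhere in your app process; a minimal sketch (the interval and prompt are arbitrary):
// fire-and-forget warm-up so the model never unloads between real requests
setInterval(() => {
  client.chat.completions.create({
    model: process.env.AI_MODEL || 'llama3.2:3b',
    messages: [{ role: 'user', content: 'ping' }],
    max_tokens: 1, // keep the warm-up nearly free
  }).catch(() => { /* a failed ping is fine; real traffic has retries */ })
}, 60_000)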
For deeper hardening, the official Ollama API documentation is the authoritative reference, and the OpenAI Chat Completions reference tells you exactly which fields the compat endpoint accepts.
Frequently Asked Questions {#faq}
Q: Do I have to use the OpenAI SDK to talk to Ollama?
No. Ollama also has a native API at /api/chat with slightly more features (e.g. raw mode, format=json). Use the OpenAI SDK if you want a single client across cloud and local. Use the native API if you need Ollama-specific controls.
Q: Will my existing OpenAI function-calling code work?
Mostly. Ollama supports tool calling on most recent models (Llama 3.1+, Qwen 2.5+, Mistral 0.3+). Edge cases: parallel tool calls and the very latest tool_choice modes can differ. Test the specific model.
Q: How do I handle 100+ concurrent users on one GPU?
You will not. A single RTX 4090 saturates around 8-12 concurrent 7B inferences. For 100+ concurrent users, run vLLM (much higher throughput than Ollama) or scale horizontally with a router. See AI gateway with LiteLLM.
Q: Can I keep my existing prompt templates?
Yes, but expect to retune. A prompt that works perfectly on GPT-4o may produce mediocre output on a 7B local model. Generally: be more explicit, add examples, and use a smaller temperature.
Q: What about embeddings? Same pattern?
Yes. Ollama exposes /v1/embeddings. nomic-embed-text and bge-m3 are strong open-source embedding models. Compare quality against OpenAI text-embedding-3-small in our local vs OpenAI embeddings benchmark.
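The adapter pattern carries over unchanged; a minimal sketch against Ollama's /v1/embeddings (pull the model first with ollama pull nomic-embed-text):
const emb = await client.embeddings.create({
  model: 'nomic-embed-text',
  input: 'How do I reset my password?',
})
console.log(emb.data[0].embedding.length) // vector dimensionality; 768 for nomic-embed-text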
Q: How do I monitor this in production?
Hit /api/ps for loaded model state. Hit /api/tags for available models. Scrape custom metrics via Prometheus exporters. Our Prometheus and Grafana for Ollama guide covers the dashboards we ship.
Q: My app is in PHP/Ruby/Rust. Will this still work?
Yes. The OpenAI-compatible REST endpoint is HTTP. Any language with an HTTP client works. There are community OpenAI SDKs for almost every language; pick whichever your team uses.
Q: How do I keep the local model running after a server reboot?
Use systemd. The Ollama installer script creates an ollama.service unit by default. Check with systemctl status ollama. Set environment variables in /etc/systemd/system/ollama.service.d/override.conf.
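A sketch of that override file, wiring in the environment variables from the pitfalls section (run systemctl daemon-reload and systemctl restart ollama after editing):
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=24h"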
Conclusion
Adding local AI to an existing app is mostly about discipline, not novelty. Keep the OpenAI-compatible contract, isolate the AI calls in one module, add retries and timeouts the same way you do for any external service, and route by feature so cheap tasks stay local and hard tasks fall back to a cloud model. The tooling is mature enough in 2026 that the only thing standing between your app and private AI is one afternoon of integration work.
If you have not picked a model yet, start with our best local AI models guide and the Ollama production deployment walkthrough. For routing across multiple backends, jump to the AI gateway guide.
Want production playbooks like this one delivered weekly? Join the LocalAIMaster newsletter for hard-won lessons from teams running private AI at scale.