Add Local AI to an Existing App: REST API Patterns That Work (2026)
Published April 23, 2026 • 18 min read
You already have a working app. A SaaS dashboard, an internal tool, a mobile backend. Now leadership wants "AI features" but the legal team will not sign off on shipping customer data to OpenAI. The good news: you do not need to rewrite anything. A local LLM behind a stable REST contract slots in cleanly next to your existing services. This guide walks through the patterns we have shipped to production across Node, Python, and Go codebases — including the failure modes nobody mentions in tutorials.
Quick Start: Wire Local AI Into Your App in 5 Minutes
If you just need a working endpoint to point your app at, do this first and read the rest later.
# 1. Install Ollama (one line)
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a small, fast model
ollama pull llama3.2:3b
# 3. Start the OpenAI-compatible endpoint
ollama serve # listens on http://localhost:11434
In your existing app, change exactly two things:
# .env
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama # any non-empty string
That is it. The OpenAI Node and Python SDKs talk to Ollama unchanged because Ollama implements /v1/chat/completions. Latency on a 3B model with an RTX 3060 is around 18 ms time-to-first-token and 95 tokens/sec sustained throughput. Compare to 280 ms TTFT and rate-limited 30 tokens/sec from gpt-4o-mini over the public internet.
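Before touching any app code, you can sanity-check the wiring with a raw request; the payload is a standard Chat Completions body:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Say hi"}]}'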
If you want to do this properly — with retries, streaming, auth, and a fallback to a cloud model when the local box is busy — keep reading.
Table of Contents
- When Local AI Is the Right Call
- Architecture: Where the LLM Lives
- The OpenAI-Compatible Adapter Pattern
- Streaming Without Breaking Your Frontend
- Authentication and Per-User Quotas
- Retries, Timeouts, and Circuit Breakers
- Local + Cloud Fallback Routing
- Benchmarks: Local vs OpenAI on Real Workloads
- Pitfalls We Hit in Production
- FAQ
When Local AI Is the Right Call {#when-local-fits}
Not every app benefits from going local. Use this decision matrix before committing.
| Workload | Local LLM | Cloud LLM |
|---|---|---|
| Customer support drafting | Strong (privacy, low latency) | Fine if data is non-sensitive |
| Code completion in IDE | Strong (no token cost, fast) | Acceptable for occasional use |
| Document classification | Excellent (predictable load) | Wasteful at scale |
| Long-form creative writing | Mediocre below 13B | Strong on GPT-4-class models |
| Healthcare or legal records | Mandatory local | Compliance risk |
| Hobby project, low volume | Overkill — use the free tier | Easier |
| 1M+ requests/day | Local with a GPU pool wins on cost | Bills explode |
A 7B model on a $700 used RTX 3090 handles roughly 40,000 chat completions per day at 8 tokens/sec average load. The same volume on gpt-4o-mini costs about $42/day in input alone. At even modest scale, the GPU pays for itself inside 30 days.
If you are still deciding the hardware question, our budget local AI machine teardown and AI server build under $1500 walk through the parts list.
Architecture: Where the LLM Lives {#architecture}
Three patterns cover 95% of production deployments. Pick the one that matches your operational tolerance.
Pattern A: Sidecar (single-server apps)
The Ollama process runs on the same host as your app. localhost:11434 is the only network hop. This is the right answer for internal tools, prototypes, and apps with under 50 concurrent users.
┌──────────────────────────────┐
│ App container (Node/Py) │
│ │ │
│ ▼ │
│ localhost:11434 (Ollama) │
└──────────────────────────────┘
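If your app is containerized, the sidecar is one extra service. A minimal docker-compose sketch, assuming the official ollama/ollama image (note that inside the compose network the hostname is ollama, not localhost):
# docker-compose.yml
services:
  app:
    build: .
    environment:
      - AI_BASE_URL=http://ollama:11434/v1
      - AI_API_KEY=ollama
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # persist pulled models across restarts
volumes:
  ollama-models: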
Pattern B: Dedicated AI host (recommended for most teams)
Your app server stays a CPU-only box. A second machine with a GPU runs Ollama or vLLM. Communication is over a private VLAN or a WireGuard mesh.
[App server] ──https──> [ai.internal:443] ──> [Ollama on GPU host]
This is what we recommend in Ollama in production: nginx terminates TLS, forwards to 127.0.0.1:11434, and the GPU host has no public IP.
Pattern C: GPU pool behind a router
Multiple GPU workers behind LiteLLM, vLLM router, or a custom load balancer. This is where you go when one GPU saturates. We cover the routing layer in detail in our AI gateway with LiteLLM guide.
[App] ──> [LiteLLM router] ──> [worker-1 GPU0]
──> [worker-2 GPU1]
──> [worker-3 GPU2]
The OpenAI-Compatible Adapter Pattern {#openai-adapter}
The single highest-leverage decision is to keep your application code thinking it is talking to OpenAI. Ollama, vLLM, LM Studio, and llama.cpp all expose /v1/chat/completions. Your existing SDK calls work unchanged.
Node.js (TypeScript)
// lib/ai-client.ts
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: process.env.AI_BASE_URL || 'http://localhost:11434/v1',
apiKey: process.env.AI_API_KEY || 'ollama',
timeout: 60_000,
maxRetries: 0, // we handle retries ourselves; see below
})
export async function chat(messages: { role: string; content: string }[]) {
const completion = await client.chat.completions.create({
model: process.env.AI_MODEL || 'llama3.2:3b',
messages,
temperature: 0.2,
})
return completion.choices[0].message.content
}
Python
# lib/ai_client.py
import os
from openai import OpenAI
client = OpenAI(
base_url=os.getenv("AI_BASE_URL", "http://localhost:11434/v1"),
api_key=os.getenv("AI_API_KEY", "ollama"),
timeout=60.0,
max_retries=0,
)
def chat(messages: list[dict]) -> str:
resp = client.chat.completions.create(
model=os.getenv("AI_MODEL", "llama3.2:3b"),
messages=messages,
temperature=0.2,
)
return resp.choices[0].message.content
Go (using go-openai)
package ai
import (
"context"
"os"
openai "github.com/sashabaranov/go-openai"
)
func NewClient() *openai.Client {
	config := openai.DefaultConfig(os.Getenv("AI_API_KEY"))
	// fall back to the SDK default (api.openai.com) when no local base URL is set
	if base := os.Getenv("AI_BASE_URL"); base != "" {
		config.BaseURL = base // e.g. http://localhost:11434/v1
	}
	return openai.NewClientWithConfig(config)
}
func Chat(ctx context.Context, c *openai.Client, msgs []openai.ChatCompletionMessage) (string, error) {
resp, err := c.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
Model: os.Getenv("AI_MODEL"),
Messages: msgs,
Temperature: 0.2,
})
if err != nil {
return "", err
}
return resp.Choices[0].Message.Content, nil
}
This pattern means you can develop locally against Ollama, run staging against vLLM, and switch to OpenAI for one specific feature, all by flipping environment variables. No code change.
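Concretely, the same module runs against three backends with nothing but environment changes (the staging hostname below is illustrative):
# dev: Ollama on your laptop
AI_BASE_URL=http://localhost:11434/v1
AI_API_KEY=ollama
AI_MODEL=llama3.2:3b

# staging: vLLM on the GPU host
AI_BASE_URL=https://ai.internal.example.com/v1
AI_API_KEY=your-internal-bearer-secret
AI_MODEL=llama3.1:8b

# prod, one premium feature: OpenAI
AI_BASE_URL=https://api.openai.com/v1
AI_API_KEY=sk-...
AI_MODEL=gpt-4o-mini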
Streaming Without Breaking Your Frontend {#streaming}
Users hate watching a spinner. Stream tokens as they arrive. The OpenAI-compatible endpoint speaks Server-Sent Events the same way the real OpenAI API does.
Express (Node) streaming proxy
import express from 'express'
import OpenAI from 'openai'
const app = express()
const ai = new OpenAI({
baseURL: process.env.AI_BASE_URL,
apiKey: process.env.AI_API_KEY,
})
app.post('/api/chat', express.json(), async (req, res) => {
res.setHeader('Content-Type', 'text/event-stream')
res.setHeader('Cache-Control', 'no-cache')
res.setHeader('Connection', 'keep-alive')
res.flushHeaders()
const stream = await ai.chat.completions.create({
model: 'llama3.2:3b',
messages: req.body.messages,
stream: true,
})
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || ''
if (delta) {
res.write(`data: ${JSON.stringify({ token: delta })}\n\n`)
}
}
res.write('data: [DONE]\n\n')
res.end()
})
app.listen(3000)
Frontend consumer (vanilla fetch)
const res = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages }),
})
const reader = res.body!.getReader()
const decoder = new TextDecoder()
let buffer = ''
while (true) {
  const { value, done } = await reader.read()
  if (done) break
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true })
  // hold any partial SSE line in the buffer until the rest of it arrives
  const lines = buffer.split('\n')
  buffer = lines.pop() ?? ''
  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      const { token } = JSON.parse(line.slice(6))
      appendToUI(token)
    }
  }
}
Two non-obvious gotchas. First, nginx buffers the entire response by default. Set proxy_buffering off; on streaming routes (or have the app send an X-Accel-Buffering: no response header) or your tokens arrive in one fat chunk after 30 seconds. Second, Cloudflare's free tier caps responses at 100 seconds. If the model needs longer than that to finish, the connection drops mid-stream. Either raise the timeout (paid plan) or chunk the prompt.
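A minimal nginx location block for the streaming route, with both server-side fixes in one place (the upstream address is illustrative):
location /api/chat {
    proxy_pass http://127.0.0.1:3000;
    proxy_buffering off;        # pass SSE chunks through as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;    # allow long generations
}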
Authentication and Per-User Quotas {#auth}
Ollama by default has no auth. Do not expose port 11434 to the public internet. Put it behind your existing app's auth layer.
Pattern: app-level proxy with per-user limits
Your app already knows who the user is (session, JWT, API key). Wrap the LLM call:
import rateLimit from 'express-rate-limit'
const aiLimiter = rateLimit({
windowMs: 60 * 1000,
max: (req: any) => {
if (req.user?.tier === 'pro') return 60 // 60 req/min
if (req.user?.tier === 'free') return 10 // 10 req/min
return 0
},
keyGenerator: (req: any) => req.user?.id || req.ip,
})
app.post('/api/chat', requireAuth, aiLimiter, chatHandler)
For a multi-tenant SaaS, log every prompt + token count to your existing analytics table. We discuss the audit trail patterns in our local AI audit trail guide.
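A sketch of that usage log, assuming a hypothetical logAiUsage helper writing to your existing analytics table; the usage field is part of the standard Chat Completions response shape, and Ollama's compat endpoint populates it:
async function chatWithAudit(userId: string, messages: any[]) {
  const completion = await client.chat.completions.create({
    model: process.env.AI_MODEL || 'llama3.2:3b',
    messages,
  })
  // token counts come back on the completion object itself
  await logAiUsage({
    userId,
    model: completion.model,
    promptTokens: completion.usage?.prompt_tokens ?? 0,
    completionTokens: completion.usage?.completion_tokens ?? 0,
    at: new Date(),
  })
  return completion.choices[0].message.content
}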
Bearer token in front of Ollama (defense in depth)
Add a tiny nginx in front:
server {
listen 443 ssl http2;
server_name ai.internal.example.com;
location / {
if ($http_authorization != "Bearer YOUR_LONG_RANDOM_SECRET") {
return 401;
}
proxy_pass http://127.0.0.1:11434;
proxy_buffering off;
proxy_read_timeout 300s;
}
}
Even if the app server is compromised, the attacker cannot directly hit the GPU host without the bearer secret.
Retries, Timeouts, and Circuit Breakers {#resilience}
Local LLMs fail differently from OpenAI: connection refused (Ollama crashed), 503 (model still loading), or an outright hang (CUDA OOM while the model recovers). Treat them like any other flaky upstream.
import pRetry from 'p-retry'
async function chatWithRetry(messages: any[]) {
return pRetry(
async () => {
const ctrl = new AbortController()
const timeout = setTimeout(() => ctrl.abort(), 45_000)
try {
return await ai.chat.completions.create(
{ model: 'llama3.2:3b', messages },
{ signal: ctrl.signal },
)
} finally {
clearTimeout(timeout)
}
},
{
retries: 3,
factor: 2,
minTimeout: 500,
onFailedAttempt: (e) => {
console.warn(`AI attempt ${e.attemptNumber} failed: ${e.message}`)
},
},
)
}
Add a circuit breaker so a single dead GPU does not take down your whole app. We use opossum in Node and circuitbreaker in Go. Trip after 5 consecutive failures, half-open after 30 seconds.
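A sketch with opossum wrapping the retrying client above. One caveat: opossum trips on an error percentage over a rolling window rather than a strict consecutive-failure count, so errorThresholdPercentage only approximates the 5-failures policy:
import CircuitBreaker from 'opossum'

const breaker = new CircuitBreaker(chatWithRetry, {
  timeout: 120_000,              // generous: the retry wrapper may make several attempts
  errorThresholdPercentage: 50,  // open the circuit when half of recent calls fail
  resetTimeout: 30_000,          // half-open after 30 seconds
})

// degrade gracefully instead of throwing at the caller
breaker.fallback(() => ({ error: 'AI temporarily unavailable' }))

export const chatSafe = (messages: any[]) => breaker.fire(messages)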
Realistic timeout values
| Operation | Local 3B | Local 7B | Local 13B | OpenAI gpt-4o-mini |
|---|---|---|---|---|
| TTFT (time to first token) | 18 ms | 45 ms | 90 ms | 280 ms |
| Tokens/sec | 95 | 55 | 28 | ~50 (variable) |
| 500-token response | 5.4 s | 9.1 s | 17.9 s | ~12 s |
| Cold start (model load) | 2 s | 5 s | 11 s | 0 |
Set your client timeout to the cold start plus 2x the longest expected response. For a 7B model with up to 500-token replies, 30 seconds is comfortable.
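As a worked example of that rule using the 7B column above (numbers from the table; the helper is hypothetical):
// cold start + 2x the longest expected response, in ms
const clientTimeoutMs = (coldStartS: number, worstResponseS: number) =>
  (coldStartS + 2 * worstResponseS) * 1000

clientTimeoutMs(5, 9.1) // ≈ 23,200 ms for a 7B model; round up to 30_000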
Local + Cloud Fallback Routing {#fallback}
Sometimes the local GPU is busy, sometimes you need GPT-4-grade quality for a hard task. Route by feature, not by load.
type Task = 'summarize' | 'classify' | 'creative' | 'reasoning'
async function smartChat(task: Task, messages: any[]) {
const route = {
summarize: { client: localAI, model: 'llama3.2:3b' },
classify: { client: localAI, model: 'llama3.2:3b' },
creative: { client: localAI, model: 'llama3.1:8b' },
reasoning: { client: openAI, model: 'gpt-4o' },
}[task]
try {
return await route.client.chat.completions.create({
model: route.model, messages,
})
} catch (e) {
// graceful degradation: any failure falls back to OpenAI
return await openAI.chat.completions.create({
model: 'gpt-4o-mini', messages,
})
}
}
90% of typical SaaS LLM traffic is summarization, classification, or formatting — the kind of work an 8B model handles fine. Send the 10% hard reasoning to OpenAI and your bill drops by an order of magnitude.
Benchmarks: Local vs OpenAI on Real Workloads {#benchmarks}
Numbers are from a Ryzen 7 5800X, 32 GB RAM, RTX 3090 24 GB, Ubuntu 22.04, Ollama 0.4.x. Each test ran 1,000 prompts at 8 concurrent connections.
| Workload | Model | TTFT (ms) | Throughput (tok/s) | P99 latency | $/1M tokens |
|---|---|---|---|---|---|
| Email summary (300 in, 80 out) | llama3.2:3b | 22 | 95 | 1.2 s | $0 (electricity ~$0.02) |
| Customer support draft (500 in, 250 out) | llama3.1:8b | 48 | 55 | 6.1 s | $0 |
| Code review comment (1200 in, 400 out) | qwen2.5-coder:7b | 54 | 52 | 9.8 s | $0 |
| Long reasoning (800 in, 1500 out) | gpt-4o (cloud) | 290 | 48 | 35 s | $4.50 input / $13.50 output |
| Long reasoning (800 in, 1500 out) | llama3.3:70b (local) | 180 | 11 | 145 s | $0 |
Conclusion: for short-to-medium tasks, local crushes cloud on cost and latency. For 70B-class reasoning, local is cheaper but slower. Hybrid is the winning pattern.
Pitfalls We Hit in Production {#pitfalls}
1. Concurrency limits. Ollama serves requests sequentially per model by default. Set OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2. Without these you will see request queues build up under load.
2. Context window leaks. When you send long histories, your KV cache balloons VRAM usage. A 7B model with 32k context can OOM on a 24 GB GPU at concurrency 4. Cap num_ctx per request.
3. The "first request takes 8 seconds" problem. Ollama lazy-loads models, and after five minutes idle the model unloads. Set OLLAMA_KEEP_ALIVE=24h for predictable latency, or send a synthetic warm-up request every 60 seconds (see the keep-warm sketch after this list).
4. Streaming through a load balancer. AWS ALB has a 60-second idle timeout. NLB is 350 seconds. Use NLB or set idle_timeout higher. nginx proxy_read_timeout defaults to 60 seconds.
5. Token counting drift. OpenAI's tiktoken vocabulary does not match Llama's. If you bill users by token, count with the model's own tokenizer: Llama 3 ships its own BPE vocabulary, so load the actual tokenizer (for example through Hugging Face's tokenizers bindings) instead of approximating with tiktoken.
6. Logs leaking PII. The default Ollama log writes the full prompt and response at info level. In a HIPAA or GDPR setting that is a breach waiting to happen. Set OLLAMA_DEBUG=0 and pipe app-level logs through a redaction layer.
7. Model version drift between dev and prod. ollama pull llama3.2 resolves to whatever the latest tag points at today. Pin the digest: ollama pull llama3.2:3b@sha256:abc... so a silent upstream update cannot change behaviour.
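The keep-warm ping from pitfall 3 is a few lines anywhere in your app process; a minimal sketch (the interval and prompt are arbitrary):
// fire-and-forget warm-up so the model never unloads between real requests
setInterval(() => {
  client.chat.completions.create({
    model: process.env.AI_MODEL || 'llama3.2:3b',
    messages: [{ role: 'user', content: 'ping' }],
    max_tokens: 1, // keep the warm-up nearly free
  }).catch(() => { /* a failed ping is fine; real traffic has retries */ })
}, 60_000)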
For deeper hardening, the official Ollama API documentation is the authoritative reference, and the OpenAI Chat Completions reference tells you exactly which fields the compat endpoint accepts.
Frequently Asked Questions {#faq}
Q: Do I have to use the OpenAI SDK to talk to Ollama?
No. Ollama also has a native API at /api/chat with slightly more features (e.g. raw mode, format=json). Use the OpenAI SDK if you want a single client across cloud and local. Use the native API if you need Ollama-specific controls.
Q: Will my existing OpenAI function-calling code work?
Mostly. Ollama supports tool calling on most recent models (Llama 3.1+, Qwen 2.5+, Mistral 0.3+). Edge cases: parallel tool calls and the very latest tool_choice modes can differ. Test the specific model.
Q: How do I handle 100+ concurrent users on one GPU?
You will not. A single RTX 4090 saturates around 8-12 concurrent 7B inferences. For 100+ concurrent users, run vLLM (much higher throughput than Ollama) or scale horizontally with a router. See AI gateway with LiteLLM.
Q: Can I keep my existing prompt templates?
Yes, but expect to retune. A prompt that works perfectly on GPT-4o may produce mediocre output on a 7B local model. Generally: be more explicit, add examples, and use a smaller temperature.
Q: What about embeddings? Same pattern?
Yes. Ollama exposes /v1/embeddings. nomic-embed-text and bge-m3 are strong open-source embedding models. Compare quality against OpenAI text-embedding-3-small in our local vs OpenAI embeddings benchmark.
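The adapter pattern carries over unchanged; a minimal sketch against Ollama's /v1/embeddings (pull the model first with ollama pull nomic-embed-text):
const emb = await client.embeddings.create({
  model: 'nomic-embed-text',
  input: 'How do I reset my password?',
})
console.log(emb.data[0].embedding.length) // vector dimensionality; 768 for nomic-embed-text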
Q: How do I monitor this in production?
Hit /api/ps for loaded model state. Hit /api/tags for available models. Scrape custom metrics via Prometheus exporters. Our Prometheus and Grafana for Ollama guide covers the dashboards we ship.
Q: My app is in PHP/Ruby/Rust. Will this still work?
Yes. The OpenAI-compatible REST endpoint is HTTP. Any language with an HTTP client works. There are community OpenAI SDKs for almost every language; pick whichever your team uses.
Q: How do I keep the local model running after a server reboot?
Use systemd. The Ollama installer script creates an ollama.service unit by default. Check with systemctl status ollama. Set environment variables in /etc/systemd/system/ollama.service.d/override.conf.
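A sketch of that override file, wiring in the environment variables from the pitfalls section (run systemctl daemon-reload and systemctl restart ollama after editing):
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=24h"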
Conclusion
Adding local AI to an existing app is mostly about discipline, not novelty. Keep the OpenAI-compatible contract, isolate the AI calls in one module, add retries and timeouts the same way you do for any external service, and route by feature so cheap tasks stay local and hard tasks fall back to a cloud model. The tooling is mature enough in 2026 that the only thing standing between your app and private AI is one afternoon of integration work.
If you have not picked a model yet, start with our best local AI models guide and the Ollama production deployment walkthrough. For routing across multiple backends, jump to the AI gateway guide.
Want production playbooks like this one delivered weekly? Join the LocalAIMaster newsletter for hard-won lessons from teams running private AI at scale.