Architecture

Build a Private OpenAI-Compatible API on Your Own Hardware

April 23, 2026
19 min read
LocalAimaster Research Team


Almost every team I work with hits the same wall around the time their OpenAI bill crosses $4,000 a month or their security review lands on a tester's desk. The codebase is full of openai.chat.completions.create(...) calls. Replacing OpenAI everywhere would take weeks. The real fix is to point those calls at an endpoint they control — same protocol, same SDK, different host.

This guide is the playbook I run with those teams. By the end you will have an API at https://ai.yourcompany.com/v1 that any OpenAI client can hit, backed by your hardware, with per-user keys, rate limits, and an audit trail. We cover three deployment shapes — Ollama-only, LiteLLM in front of Ollama, and LiteLLM in front of vLLM — and the exact tradeoffs between them.

Why a private OpenAI-compatible API {#why}

Three reasons keep coming up:

  1. Cost. A team running 5M tokens/day through GPT-4o is paying about $1,800/month. The same workload on a $1,500 RTX 4090 (electricity included) breaks even in 2.4 months and is free thereafter. We tracked the math in Ollama vs ChatGPT API cost at scale.
  2. Compliance. GDPR, HIPAA, and SOC 2 all flag third-party processors as risk. A self-hosted endpoint inside your VPC removes the data egress entirely. See GDPR-compliant local AI for the legal shape of that argument.
  3. Vendor independence. OpenAI's pricing, deprecations, and rate limits are not yours to control. Your endpoint is.
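The cost argument in point 1 is just a break-even calculation. A quick sketch — note the ~$4,300 full-workstation cost and ~$50/month electricity below are illustrative assumptions, not the article's figures; plug in your own:

```python
def break_even_months(hardware_cost: float, monthly_api_bill: float,
                      monthly_electricity: float) -> float:
    """Months until the hardware has paid for itself versus the API bill."""
    monthly_savings = monthly_api_bill - monthly_electricity
    if monthly_savings <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    return hardware_cost / monthly_savings

# Assumed figures: ~$4,300 for a complete workstation around the GPU,
# ~$50/month electricity, against the $1,800/month API bill above.
print(round(break_even_months(4300, 1800, 50), 1))  # 2.5
```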

What makes this practical in 2026 is the OpenAI HTTP contract being treated as a de facto standard. Ollama, vLLM, llama.cpp's server, LM Studio, Together's stack, Groq, Anthropic's compatibility shim, and even AWS Bedrock all expose a path that mostly behaves like /v1/chat/completions. Building on that contract means client code does not care which backend wins next year.
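Concretely, "mostly behaves like /v1/chat/completions" comes down to a small request/response subset that nearly all client code depends on. A minimal sketch — field set abbreviated, values illustrative; real responses carry more metadata:

```python
import json

# The request subset every backend listed above accepts.
request = {
    "model": "llama3.1:8b",  # backend-specific name or a gateway alias
    "messages": [{"role": "user", "content": "hello"}],
    "stream": False,
}

# The response fields most client code actually reads.
response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi!"}}],
    "usage": {"prompt_tokens": 1, "completion_tokens": 2, "total_tokens": 3},
}

# Both sides are plain JSON; the contract survives a serialization round-trip.
assert json.loads(json.dumps(request))["messages"][0]["role"] == "user"
print(response["choices"][0]["message"]["content"])  # Hi!
```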

Architecture: three layers {#architecture}

A production-grade private API has three distinct layers. Skipping any one of them creates an outage waiting to happen.

┌──────────────────────┐
│ Edge layer           │  TLS + WAF + per-key auth
│ nginx / Caddy        │  rate limiting, IP allowlist
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│ Gateway layer        │  Virtual keys, per-team budgets
│ LiteLLM              │  Routing, fallbacks, retries
│                      │  Audit logging to Postgres
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│ Inference layer      │  Ollama  (general purpose)
│                      │  vLLM    (high throughput)
│                      │  llama.cpp (max efficiency)
└──────────────────────┘

The edge layer is non-optional. The gateway layer is optional for a single-team setup but mandatory the moment two teams share infrastructure. The inference layer is where you make the cost/throughput tradeoff.

Quick Start: Ollama only {#quick-start}

Smallest viable private API — single host, single user, useful for learning the contract.

# 1. Install Ollama on a server you control
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model
ollama pull llama3.1:8b

# 3. Bind to all interfaces (NOT public — see below)
sudo systemctl edit ollama.service
# Add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
#   Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama

# 4. Test the OpenAI-compatible endpoint
curl http://YOUR_HOST:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role":"user","content":"hello"}]
  }'

That endpoint speaks the OpenAI dialect. Any SDK or tool that accepts a baseURL connects to it. Do not expose port 11434 to the internet directly. Bind to a private network or put nginx in front with TLS and a token check before opening anything externally. This Quick Start is the inner core; the next section wraps it correctly.

How-To: Production stack with LiteLLM + Ollama {#how-to}

This is the configuration I run for teams up to ~50 users sharing a couple of GPU boxes.

1. Run Ollama on a private interface

Edit /etc/systemd/system/ollama.service.d/override.conf:

[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

OLLAMA_NUM_PARALLEL=4 lets a single model serve four concurrent streams; OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot in VRAM if you have headroom.

2. Install and configure LiteLLM

pip install "litellm[proxy]"

Create config.yaml:

model_list:
  - model_name: gpt-4o-mini   # the alias clients use
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://127.0.0.1:11434

  - model_name: gpt-4o        # route the heavier alias to a bigger model
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://127.0.0.1:11434

  - model_name: text-embedding-3-small
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://127.0.0.1:11434

litellm_settings:
  drop_params: true            # silently drop unsupported params
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: 127.0.0.1
    port: 6379

general_settings:
  master_key: sk-master-CHANGE-ME
  database_url: "postgresql://litellm:pass@127.0.0.1:5432/litellm"
  ui_username: admin
  ui_password: CHANGE-ME

The model_name field is what clients send. By aliasing gpt-4o-mini → llama3.1:8b, every existing line of code that says model="gpt-4o-mini" keeps working. No client edits required.

litellm --config config.yaml --port 4000
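What the gateway does with those aliases can be pictured as a lookup table from the client-facing name to the backend model. A hypothetical sketch of the idea, not LiteLLM's actual code — the names mirror the config above:

```python
# Client-facing alias -> (backend model, backend base URL), as in config.yaml.
MODEL_LIST = {
    "gpt-4o-mini": ("ollama/llama3.1:8b", "http://127.0.0.1:11434"),
    "gpt-4o": ("ollama/qwen2.5:32b", "http://127.0.0.1:11434"),
    "text-embedding-3-small": ("ollama/nomic-embed-text", "http://127.0.0.1:11434"),
}

def resolve(alias: str) -> tuple[str, str]:
    """Map the model name a client sent to the backend that serves it."""
    try:
        return MODEL_LIST[alias]
    except KeyError:
        # An unaliased name is rejected rather than guessed at.
        raise ValueError(f"no alias {alias!r} in model_list")

print(resolve("gpt-4o-mini")[0])  # ollama/llama3.1:8b
```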

3. Issue per-user keys

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -d '{
    "models": ["gpt-4o-mini", "text-embedding-3-small"],
    "max_budget": 5.00,
    "duration": "30d",
    "metadata": {"user_id": "engineer-42", "team": "platform"}
  }'

You get back sk-... keys that look exactly like OpenAI keys. Distribute them through your secrets manager.
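Under the hood, a gateway minting virtual keys is doing something like the following — an illustrative sketch, not LiteLLM's implementation. The point is that the sk- token is opaque: all policy (allowed models, budget, owner) lives server-side against the token, never inside it:

```python
import secrets

def generate_key(models: list[str], max_budget: float, user_id: str) -> dict:
    """Mint an opaque sk- token and the server-side policy row behind it."""
    token = "sk-" + secrets.token_urlsafe(24)
    return {
        "key": token,  # what the user receives; carries no information itself
        "policy": {    # what the gateway stores and checks on every request
            "models": models,
            "max_budget": max_budget,
            "user_id": user_id,
            "spend": 0.0,
        },
    }

record = generate_key(["gpt-4o-mini"], 5.00, "engineer-42")
assert record["key"].startswith("sk-")
```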

4. Put nginx in front with TLS

server {
  listen 443 ssl http2;
  server_name ai.yourcompany.com;
  ssl_certificate     /etc/letsencrypt/live/ai.yourcompany.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/ai.yourcompany.com/privkey.pem;

  # long read timeout and unbuffered responses so streaming completions flush as generated
  proxy_read_timeout 600s;
  proxy_buffering off;

  location / {
    proxy_pass http://127.0.0.1:4000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
  }
}

Issue the cert with certbot --nginx -d ai.yourcompany.com. The full hardening recipe (fail2ban, log rotation, healthcheck) is in Ollama in production.

5. Point any OpenAI client at it

from openai import OpenAI
client = OpenAI(
    base_url="https://ai.yourcompany.com/v1",
    api_key="sk-engineer-42-XXXXXXXX",
)
r = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role":"user","content":"summarize: ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="")

That same code worked against api.openai.com yesterday. Today it runs on your hardware.
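The loop above prints deltas as they arrive; assembling the full reply is the same pattern with accumulation (delta.content is None on role and stop chunks, hence the or "" guard). A stand-in sketch with plain strings in place of the SDK's chunk objects:

```python
def assemble(deltas) -> str:
    """Join streamed content deltas, skipping the None role/stop chunks."""
    return "".join(d or "" for d in deltas)

# A streamed reply arrives as fragments interleaved with None markers.
print(assemble(["Hel", None, "lo", None]))  # Hello
```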

When to swap Ollama for vLLM {#vllm}

Ollama is excellent for development and small teams. It serializes most requests through llama.cpp internally, so concurrent throughput plateaus quickly. For high-concurrency production (50+ simultaneous users on one model), vLLM is dramatically better — its PagedAttention scheduler handles 5–20× more concurrent requests on the same GPU.

Stand it up beside Ollama:

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92

Add it to LiteLLM:

  - model_name: gpt-4o-mini-fast
    litellm_params:
      model: openai/Meta-Llama-3.1-8B-Instruct
      api_base: http://127.0.0.1:8000/v1
      api_key: any

We measured 380 requests/minute single-GPU vLLM (RTX 4090, Llama 3.1 8B) vs 65 requests/minute single-GPU Ollama on the same model. vLLM also requires the model in HuggingFace format and full GPU residency — no CPU offload — which is why Ollama remains the default for mixed workloads.
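Those two measurements sit at the low end of the claimed range:

```python
# Ratio of the single-GPU throughputs quoted above (same model, same card).
ollama_rpm, vllm_rpm = 65, 380
speedup = vllm_rpm / ollama_rpm
print(round(speedup, 1))  # 5.8 -- inside the 5-20x PagedAttention claim
```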

Auth, rate limits, and audit {#auth}

The three controls a security review will ask about:

Authentication. LiteLLM virtual keys (sk-team-...) bound to specific models, expirations, and budgets. Master key for admin only, never in client code. MFA for the LiteLLM admin UI (proxy through Cloudflare Access or Tailscale).

Rate limiting. Per-key requests-per-minute and tokens-per-minute, configurable in config.yaml:

  - model_name: gpt-4o-mini
    litellm_params: { model: ollama/llama3.1:8b, api_base: http://127.0.0.1:11434 }
    rpm: 60                  # 60 requests per minute per key
    tpm: 100000              # 100K tokens per minute per key

For burst control beyond what LiteLLM offers, layer the nginx limit_req module in front. Our Ollama rate limiting guide goes deeper on multi-tenant patterns.
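The rpm setting above is easiest to reason about as a per-key token bucket: the bucket holds up to rpm tokens, refills at rpm per minute, and each request spends one. A hypothetical sketch of the mechanism, not LiteLLM's code:

```python
import time

class RpmLimiter:
    """Token bucket: capacity rpm, refilled at rpm tokens per minute."""

    def __init__(self, rpm: int, now=time.monotonic):
        self.rate = rpm / 60.0       # tokens added per second
        self.capacity = float(rpm)
        self.tokens = float(rpm)     # start full: an idle key can burst
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

limiter = RpmLimiter(rpm=60)
# A burst of 100 requests in well under a second: 60 pass, 40 are rejected.
print(sum(limiter.allow() for _ in range(100)))  # 60
```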

Audit logging. LiteLLM writes to Postgres by default. Enable Langfuse for prompt-level tracing:

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST (self-hosted Langfuse for full data residency). Every request, response, latency, token count, and user id ends up queryable in one place.

Comparison: Ollama, LiteLLM, vLLM {#comparison}

| Capability | Ollama (alone) | LiteLLM + Ollama | LiteLLM + vLLM |
|---|---|---|---|
| OpenAI-compatible endpoint | yes | yes | yes |
| Per-user API keys | no | yes | yes |
| Per-key rate limits | no | yes | yes |
| Per-key budgets | no | yes | yes |
| Audit log to Postgres | manual | built-in | built-in |
| Model aliasing (gpt-4o-mini → local) | no | yes | yes |
| Concurrent streams (RTX 4090, 8B) | ~8 | ~8 | ~60 |
| Embeddings | yes | yes | yes |
| Multi-modal (vision) | yes (LLaVA, Llama 3.2 Vision) | yes | partial |
| CPU fallback | yes | yes | no |
| Setup time | 10 min | 30 min | 1–2 hours |
| Best for | Solo dev, prototypes | Teams up to 50 | High-traffic prod |

Pitfalls and how to avoid them {#pitfalls}

  • Binding 0.0.0.0 without a firewall. I have audited three companies that exposed Ollama directly. All three were being used by random scrapers within a week.
  • Forgetting model aliases. If your client sends gpt-4o-mini and your config does not alias it, LiteLLM returns a 400 by default. Define every alias your code paths use, and keep drop_params: true so unsupported request parameters are dropped instead of rejected.
  • Streaming buffering at nginx. Without proxy_buffering off the user sees the entire response at once instead of streaming. Always disable buffering for AI endpoints.
  • No connection limits in nginx. A single client opening 200 streaming connections will starve everyone else. Add limit_conn_zone $binary_remote_addr zone=perip:10m; in the http block and limit_conn perip 10; in the server block.
  • Underprovisioned VRAM for embeddings. Embedding models share GPU memory with chat models. Either reserve a CPU-only model for embeddings (nomic-embed-text runs fine on CPU), or use a second GPU.
  • Cache leaks between users. LiteLLM's cache key by default does not include the API key. For multi-tenant deployments, set cache_params.namespace: "{api_key}" so a search query from user A never returns user B's cached response.
  • Token accounting drift. Self-hosted servers count tokens with their own tokenizer; OpenAI counts with tiktoken. Budgets enforced server-side will not match client-side estimates. Document this delta or you will field tickets every Friday.

Performance sanity check

Numbers we measured for a 4-week pilot at a 30-engineer fintech, replacing GPT-4o-mini for code review and ticket triage:

| Metric | OpenAI baseline | LiteLLM + Ollama (RTX 4090) | LiteLLM + vLLM (RTX 4090) |
|---|---|---|---|
| Avg requests/day | 28,400 | 28,400 | 28,400 |
| p50 latency to first token | 410 ms | 220 ms | 180 ms |
| p95 latency to full response (300 tok) | 4.8 s | 3.6 s | 3.1 s |
| Concurrent streams sustained | unlimited | 8 | 48 |
| Monthly cost | $1,820 | $0 (cap-ex amortized) | $0 (cap-ex amortized) |
| Quality (internal eval, 200 prompts) | 92.4% | 88.1% (Llama 3.1 8B) | 89.3% (Llama 3.1 8B FP8) |

The 4-point quality gap closes if you bump to Qwen 2.5 32B (which fits on the same GPU at Q4_K_M and scored 91.7%). The full eval methodology is in our AI benchmarks framework.

The OpenAI HTTP contract is documented at platform.openai.com/docs/api-reference — every endpoint LiteLLM and Ollama implement matches that schema.

What you actually get

  • A single base URL that any OpenAI client speaks, including tools you did not write (Continue.dev, Cursor, Cline, OpenWebUI, LangChain, LlamaIndex, the Vercel AI SDK).
  • A virtual key system with budgets, expirations, and an admin UI.
  • A complete audit trail of who asked what, with token counts and latencies.
  • The option to swap models behind the scenes — replace llama3.1:8b with qwen2.5:14b tomorrow and zero clients break.
  • Compliance-friendly architecture: no data leaves your boundary, every request is logged, every key is revocable.

If your team is currently sharing one OpenAI key, this stack is a strict upgrade. If your team is paying $5K+/month to OpenAI, this stack pays for the GPU in a quarter. And if your team needs to pass a security review, this stack is the only one that gives you the controls auditors actually ask for.

The next thing to build on top is RAG over your private documents — once the API surface is yours, embeddings, retrieval, and prompt assembly all happen inside your network. Pair this with our private AI knowledge base guide and you have replaced ChatGPT Enterprise without the per-seat fee.



Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
