Build a Private OpenAI-Compatible API on Your Own Hardware
Published April 23, 2026 · 19 min read · LocalAimaster Research Team
Almost every team I work with hits the same wall around the time their OpenAI bill crosses $4,000 a month or their security review lands on a tester's desk. The codebase is full of openai.chat.completions.create(...) calls. Replacing OpenAI everywhere would take weeks. The real fix is to point those calls at an endpoint they control — same protocol, same SDK, different host.
This guide is the playbook I run with those teams. By the end you will have an API at https://ai.yourcompany.com/v1 that any OpenAI client can hit, backed by your hardware, with per-user keys, rate limits, and an audit trail. We cover three deployment shapes — Ollama-only, LiteLLM in front of Ollama, and LiteLLM in front of vLLM — and the exact tradeoffs between them.
Why a private OpenAI-compatible API {#why}
Three reasons keep coming up:
- Cost. A team running 5M tokens/day through GPT-4o is paying about $1,800/month. The same workload on a $1,500 RTX 4090 (electricity included) breaks even in 2.4 months and is free thereafter. We tracked the math in Ollama vs ChatGPT API cost at scale.
- Compliance. GDPR, HIPAA, and SOC 2 all flag third-party processors as risk. A self-hosted endpoint inside your VPC removes the data egress entirely. See GDPR-compliant local AI for the legal shape of that argument.
- Vendor independence. OpenAI's pricing, deprecations, and rate limits are not yours to control. Your endpoint is.
What makes this practical in 2026 is the OpenAI HTTP contract being treated as a de facto standard. Ollama, vLLM, llama.cpp's server, LM Studio, Together's stack, Groq, Anthropic's compatibility shim, and even AWS Bedrock all expose a path that mostly behaves like /v1/chat/completions. Building on that contract means client code does not care which backend wins next year.
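A concrete way to see how thin that contract is: the request below runs unchanged against a local Ollama server and against OpenAI itself, with only the host, key, and model name swapped (hostnames are placeholders):

# Same JSON body, two backends; only the URL, auth header, and model name differ
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role":"user","content":"hello"}]}'

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"hello"}]}'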
Architecture: three layers {#architecture}
A production-grade private API has three distinct layers. Skipping any one of them creates an outage waiting to happen.
┌──────────────────────┐
│ Edge layer │ TLS + WAF + per-key auth
│ nginx / Caddy │ rate limiting, IP allowlist
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Gateway layer │ Virtual keys, per-team budgets
│ LiteLLM │ Routing, fallbacks, retries
│ │ Audit logging to Postgres
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Inference layer │ Ollama (general purpose)
│ │ vLLM (high throughput)
│ │ llama.cpp (max efficiency)
└──────────────────────┘
The edge layer is non-optional. The gateway layer is optional for a single-team setup but mandatory the moment two teams share infrastructure. The inference layer is where you make the cost/throughput tradeoff.
Quick Start: Ollama only {#quick-start}
Smallest viable private API — single host, single user, useful for learning the contract.
# 1. Install Ollama on a server you control
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a model
ollama pull llama3.1:8b
# 3. Bind to all interfaces (NOT public — see below)
sudo systemctl edit ollama.service
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama
# 4. Test the OpenAI-compatible endpoint
curl http://YOUR_HOST:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role":"user","content":"hello"}]
}'
That endpoint speaks the OpenAI dialect. Any SDK or tool that accepts a baseURL connects to it. Do not expose port 11434 to the internet directly. Bind to a private network or put nginx in front with TLS and a token check before opening anything externally. This Quick Start is the inner core; the next section wraps it correctly.
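The same endpoint also works with the official openai Python package; Ollama ignores the key, but the client wants a non-empty string, so a placeholder is fine. A minimal sketch against the Quick Start host:

from openai import OpenAI

# Point the stock OpenAI SDK at the Ollama host from the Quick Start.
# Ollama does not check the key, but the SDK requires a non-empty string.
client = OpenAI(base_url="http://YOUR_HOST:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)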
How-To: Production stack with LiteLLM + Ollama {#how-to}
This is the configuration I run for teams up to ~50 users sharing a couple of GPU boxes.
1. Run Ollama on a private interface
Edit /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
OLLAMA_NUM_PARALLEL=4 lets a single model serve four concurrent streams; OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot in VRAM if you have headroom.
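A quick sanity check that the override took effect: reload, confirm the model stays resident with ollama ps, then fire two requests at once and watch them stream in parallel rather than queue:

sudo systemctl daemon-reload && sudo systemctl restart ollama
ollama ps   # shows which models are loaded and how long they stay resident
# Two concurrent requests; with OLLAMA_NUM_PARALLEL=4 both stream at the same time
for i in 1 2; do
  curl -s http://127.0.0.1:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"count to 20"}]}' &
done
wait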
2. Install and configure LiteLLM
pip install "litellm[proxy]"
Create config.yaml:
model_list:
  - model_name: gpt-4o-mini            # the alias clients use
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://127.0.0.1:11434
  - model_name: gpt-4o                 # route the heavier alias to a bigger model
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://127.0.0.1:11434
  - model_name: text-embedding-3-small
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://127.0.0.1:11434

litellm_settings:
  drop_params: true                    # silently drop unsupported params
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: 127.0.0.1
    port: 6379

general_settings:
  master_key: sk-master-CHANGE-ME
  database_url: "postgresql://litellm:pass@127.0.0.1:5432/litellm"
  ui_username: admin
  ui_password: CHANGE-ME
The model_name field is what clients send. By aliasing gpt-4o-mini → llama3.1:8b, every existing line of code that says model="gpt-4o-mini" keeps working. No client edits required.
litellm --config config.yaml --port 4000
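Before handing out keys, a quick smoke test against the proxy confirms the alias routing works end to end; this assumes the config above, the default port, and the master key (swap in your own):

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}]}'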
3. Issue per-user keys
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o-mini", "text-embedding-3-small"],
    "max_budget": 5.00,
    "duration": "30d",
    "metadata": {"user_id": "engineer-42", "team": "platform"}
  }'
You get back sk-... keys that look exactly like OpenAI keys. Distribute them through your secrets manager.
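The same admin API can inspect and revoke keys; the /key/info and /key/delete routes below match current LiteLLM releases, but verify them against your version before scripting offboarding around them:

# Inspect a key's spend and limits
curl "http://localhost:4000/key/info?key=sk-THE-KEY-TO-CHECK" \
  -H "Authorization: Bearer sk-master-CHANGE-ME"

# Revoke a key, e.g. when an engineer leaves
curl -X POST http://localhost:4000/key/delete \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-THE-KEY-TO-REVOKE"]}'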
4. Put nginx in front with TLS
server {
    listen 443 ssl http2;
    server_name ai.yourcompany.com;

    ssl_certificate     /etc/letsencrypt/live/ai.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourcompany.com/privkey.pem;

    # Long read timeout and no buffering so streamed completions reach the client token by token
    proxy_read_timeout 600s;
    proxy_buffering off;

    location / {
        proxy_pass http://127.0.0.1:4000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Issue the cert with certbot --nginx -d ai.yourcompany.com. The full hardening recipe (fail2ban, log rotation, healthcheck) is in Ollama in production.
5. Point any OpenAI client at it
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourcompany.com/v1",
    api_key="sk-engineer-42-XXXXXXXX",
)

r = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "summarize: ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="")
That same code worked against api.openai.com yesterday. Today it runs on your hardware.
When to swap Ollama for vLLM {#vllm}
Ollama is excellent for development and small teams. It serializes most requests through llama.cpp internally, so concurrent throughput plateaus quickly. For high-concurrency production (50+ simultaneous users on one model), vLLM is dramatically better — its PagedAttention scheduler handles 5–20× more concurrent requests on the same GPU.
Stand it up beside Ollama:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 127.0.0.1 --port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
Add it to LiteLLM:
- model_name: gpt-4o-mini-fast
  litellm_params:
    model: openai/Meta-Llama-3.1-8B-Instruct
    api_base: http://127.0.0.1:8000/v1
    api_key: any
We measured 380 requests/minute single-GPU vLLM (RTX 4090, Llama 3.1 8B) vs 65 requests/minute single-GPU Ollama on the same model. vLLM also requires the model in HuggingFace format and full GPU residency — no CPU offload — which is why Ollama remains the default for mixed workloads.
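If you would rather reproduce that comparison on your own hardware than take our numbers, a short async load script against the gateway is enough. This is a rough sketch, reusing the alias and key format from earlier; tune concurrency and duration to taste:

import asyncio
import time

from openai import AsyncOpenAI

# Rough throughput check: N concurrent workers hammering one alias for a fixed window.
client = AsyncOpenAI(
    base_url="https://ai.yourcompany.com/v1",
    api_key="sk-engineer-42-XXXXXXXX",
)

async def worker(counter: list, deadline: float):
    # Keep issuing small completions until the deadline passes.
    while time.monotonic() < deadline:
        await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "write one sentence about queues"}],
            max_tokens=64,
        )
        counter[0] += 1

async def main(concurrency: int = 16, seconds: int = 60):
    counter = [0]
    deadline = time.monotonic() + seconds
    await asyncio.gather(*(worker(counter, deadline) for _ in range(concurrency)))
    print(f"{counter[0]} completed requests in {seconds}s "
          f"({counter[0] * 60 / seconds:.0f}/min) at concurrency {concurrency}")

asyncio.run(main())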
Auth, rate limits, and audit {#auth}
The three controls a security review will ask about:
Authentication. LiteLLM virtual keys (sk-team-...) bound to specific models, expirations, and budgets. Master key for admin only, never in client code. MFA for the LiteLLM admin UI (proxy through Cloudflare Access or Tailscale).
Rate limiting. Per-key requests-per-minute and tokens-per-minute, configurable in config.yaml:
- model_name: gpt-4o-mini
  litellm_params: { model: ollama/llama3.1:8b, api_base: http://127.0.0.1:11434 }
  rpm: 60        # 60 requests per minute per key
  tpm: 100000    # 100K tokens per minute per key
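Limits can also travel with an individual key rather than a model entry; the rpm_limit and tpm_limit fields below are the names LiteLLM's key-generation endpoint accepts in current releases (double-check against your version):

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o-mini"],
    "rpm_limit": 30,
    "tpm_limit": 50000,
    "metadata": {"user_id": "engineer-42"}
  }'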
For burst control beyond what LiteLLM offers, layer the nginx limit_req module in front. Our Ollama rate limiting guide goes deeper on multi-tenant patterns.
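A minimal nginx layer for that looks something like the block below, split between the http context and the server block from step 4; zone names and thresholds are illustrative:

# In the http{} context
limit_req_zone  $binary_remote_addr zone=ai_req:10m rate=120r/m;
limit_conn_zone $binary_remote_addr zone=perip:10m;

# Inside the server{} block for ai.yourcompany.com
location / {
    limit_req  zone=ai_req burst=40 nodelay;
    limit_conn perip 10;          # cap simultaneous (streaming) connections per IP
    proxy_pass http://127.0.0.1:4000;
}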
Audit logging. LiteLLM writes to Postgres by default. Enable Langfuse for prompt-level tracing:
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST (self-hosted Langfuse for full data residency). Every request, response, latency, token count, and user id ends up queryable in one place.
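The callbacks pick up their destination from environment variables set before starting the proxy; the hostname below is a placeholder for a self-hosted Langfuse instance:

export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://langfuse.internal.yourcompany.com"
litellm --config config.yaml --port 4000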
Comparison: Ollama, LiteLLM, vLLM {#comparison}
| Capability | Ollama (alone) | LiteLLM + Ollama | LiteLLM + vLLM |
|---|---|---|---|
| OpenAI-compatible endpoint | yes | yes | yes |
| Per-user API keys | no | yes | yes |
| Per-key rate limits | no | yes | yes |
| Per-key budgets | no | yes | yes |
| Audit log to Postgres | manual | built-in | built-in |
| Model aliasing (gpt-4o-mini → local) | no | yes | yes |
| Concurrent streams (RTX 4090, 8B) | ~8 | ~8 | ~60 |
| Embeddings | yes | yes | yes |
| Multi-modal (vision) | yes (LLaVA, Llama 3.2 Vision) | yes | partial |
| CPU fallback | yes | yes | no |
| Setup time | 10 min | 30 min | 1–2 hours |
| Best for | Solo dev, prototypes | Teams up to 50 | High-traffic prod |
Pitfalls and how to avoid them {#pitfalls}
- Binding 0.0.0.0 without a firewall. I have audited three companies that exposed Ollama directly. All three were being used by random scrapers within a week.
- Forgetting model aliases. If your client sends gpt-4o-mini and your config does not alias it, LiteLLM returns a 400 by default. Set drop_params: true and define every alias your code paths use.
- Streaming buffering at nginx. Without proxy_buffering off the user sees the entire response at once instead of streaming. Always disable buffering for AI endpoints.
- No connection limits in nginx. A single client opening 200 streaming connections will starve everyone else. Add limit_conn_zone and limit_conn perip 10.
- Underprovisioned VRAM for embeddings. Embedding models share GPU memory with chat models. Either reserve a CPU-only model for embeddings (nomic-embed-text runs fine on CPU), or use a second GPU.
- Cache leaks between users. LiteLLM's cache key by default does not include the API key. For multi-tenant deployments, set cache_params.namespace: "{api_key}" so a search query from user A never returns user B's cached response (see the sketch after this list).
- Token accounting drift. Self-hosted servers count tokens with their own tokenizer; OpenAI counts with tiktoken. Budgets enforced server-side will not match client-side estimates. Document this delta or you will field tickets every Friday.
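For the cache-leak item above, the fix is one extra line in the cache block from the earlier config. A sketch of what that might look like, using the namespace setting described above (check the exact key name against your LiteLLM version):

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: 127.0.0.1
    port: 6379
    namespace: "{api_key}"   # isolate cached responses per virtual key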
Performance sanity check
Numbers we measured for a 4-week pilot at a 30-engineer fintech, replacing GPT-4o-mini for code review and ticket triage:
| Metric | OpenAI baseline | LiteLLM + Ollama (RTX 4090) | LiteLLM + vLLM (RTX 4090) |
|---|---|---|---|
| Avg requests/day | 28,400 | 28,400 | 28,400 |
| p50 latency to first token | 410 ms | 220 ms | 180 ms |
| p95 latency to full response (300 tok) | 4.8 s | 3.6 s | 3.1 s |
| Concurrent streams sustained | unlimited | 8 | 48 |
| Monthly cost | $1,820 | $0 (cap-ex amortized) | $0 (cap-ex amortized) |
| Quality (internal eval, 200 prompts) | 92.4% | 88.1% (Llama 3.1 8B) | 89.3% (Llama 3.1 8B FP8) |
The 4-point quality gap closes if you bump to Qwen 2.5 32B (which fits on the same GPU at Q4_K_M and scored 91.7%). The full eval methodology is in our AI benchmarks framework.
The OpenAI HTTP contract is documented at platform.openai.com/docs/api-reference — every endpoint LiteLLM and Ollama implement matches that schema.
What you actually get
- A single base URL that any OpenAI client speaks, including tools you did not write (Continue.dev, Cursor, Cline, OpenWebUI, LangChain, LlamaIndex, the Vercel AI SDK).
- A virtual key system with budgets, expirations, and an admin UI.
- A complete audit trail of who asked what, with token counts and latencies.
- The option to swap models behind the scenes: replace llama3.1:8b with qwen2.5:14b tomorrow and zero clients break.
- Compliance-friendly architecture: no data leaves your boundary, every request is logged, every key is revocable.
If your team is currently sharing one OpenAI key, this stack is a strict upgrade. If your team is paying $5K+/month to OpenAI, this stack pays for the GPU in a quarter. And if your team needs to pass a security review, this stack is the only one that gives you the controls auditors actually ask for.
The next thing to build on top is RAG over your private documents — once the API surface is yours, embeddings, retrieval, and prompt assembly all happen inside your network. Pair this with our private AI knowledge base guide and you have replaced ChatGPT Enterprise without the per-seat fee.