Build a Private OpenAI-Compatible API on Your Own Hardware
Published April 23, 2026 · 19 min read · LocalAimaster Research Team
Almost every team I work with hits the same wall around the time their OpenAI bill crosses $4,000 a month or their security review lands on a tester's desk. The codebase is full of openai.chat.completions.create(...) calls. Replacing OpenAI everywhere would take weeks. The real fix is to point those calls at an endpoint they control — same protocol, same SDK, different host.
This guide is the playbook I run with those teams. By the end you will have an API at https://ai.yourcompany.com/v1 that any OpenAI client can hit, backed by your hardware, with per-user keys, rate limits, and an audit trail. We cover three deployment shapes — Ollama-only, LiteLLM in front of Ollama, and LiteLLM in front of vLLM — and the exact tradeoffs between them.
Why a private OpenAI-compatible API {#why}
Three reasons keep coming up:
- Cost. A team running 5M tokens/day through GPT-4o is paying about $1,800/month. The same workload on a $1,500 RTX 4090 (electricity included) breaks even in 2.4 months and is free thereafter. We tracked the math in Ollama vs ChatGPT API cost at scale.
- Compliance. GDPR, HIPAA, and SOC 2 all flag third-party processors as risk. A self-hosted endpoint inside your VPC removes the data egress entirely. See GDPR-compliant local AI for the legal shape of that argument.
- Vendor independence. OpenAI's pricing, deprecations, and rate limits are not yours to control. Your endpoint is.
What makes this practical in 2026 is the OpenAI HTTP contract being treated as a de facto standard. Ollama, vLLM, llama.cpp's server, LM Studio, Together's stack, Groq, Anthropic's compatibility shim, and even AWS Bedrock all expose a path that mostly behaves like /v1/chat/completions. Building on that contract means client code does not care which backend wins next year.
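A concrete way to see how thin that contract is: the request below runs unchanged against a local Ollama server and against OpenAI itself, with only the host, key, and model name swapped (hostnames are placeholders):

# Same JSON body, two backends; only the URL, auth header, and model name differ
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role":"user","content":"hello"}]}'

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"hello"}]}'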
Architecture: three layers {#architecture}
A production-grade private API has three distinct layers. Skipping any one of them creates an outage waiting to happen.
┌──────────────────────┐
│ Edge layer │ TLS + WAF + per-key auth
│ nginx / Caddy │ rate limiting, IP allowlist
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Gateway layer │ Virtual keys, per-team budgets
│ LiteLLM │ Routing, fallbacks, retries
│ │ Audit logging to Postgres
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Inference layer │ Ollama (general purpose)
│ │ vLLM (high throughput)
│ │ llama.cpp (max efficiency)
└──────────────────────┘
The edge layer is non-optional. The gateway layer is optional for a single-team setup but mandatory the moment two teams share infrastructure. The inference layer is where you make the cost/throughput tradeoff.
Quick Start: Ollama only {#quick-start}
Smallest viable private API — single host, single user, useful for learning the contract.
# 1. Install Ollama on a server you control
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a model
ollama pull llama3.1:8b
# 3. Bind to all interfaces (NOT public — see below)
sudo systemctl edit ollama.service
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama
# 4. Test the OpenAI-compatible endpoint
curl http://YOUR_HOST:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role":"user","content":"hello"}]
}'
That endpoint speaks the OpenAI dialect. Any SDK or tool that accepts a baseURL connects to it. Do not expose port 11434 to the internet directly. Bind to a private network or put nginx in front with TLS and a token check before opening anything externally. This Quick Start is the inner core; the next section wraps it correctly.
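The same endpoint also works with the official openai Python package; Ollama ignores the key, but the client wants a non-empty string, so a placeholder is fine. A minimal sketch against the Quick Start host:

from openai import OpenAI

# Point the stock OpenAI SDK at the Ollama host from the Quick Start.
# Ollama does not check the key, but the SDK requires a non-empty string.
client = OpenAI(base_url="http://YOUR_HOST:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)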
How-To: Production stack with LiteLLM + Ollama {#how-to}
This is the configuration I run for teams up to ~50 users sharing a couple of GPU boxes.
1. Run Ollama on a private interface
Edit /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
OLLAMA_NUM_PARALLEL=4 lets a single model serve four concurrent streams; OLLAMA_MAX_LOADED_MODELS=2 keeps two models hot in VRAM if you have headroom.
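A quick sanity check that the override took effect: reload, confirm the model stays resident with ollama ps, then fire two requests at once and watch them stream in parallel rather than queue:

sudo systemctl daemon-reload && sudo systemctl restart ollama
ollama ps   # shows which models are loaded and how long they stay resident
# Two concurrent requests; with OLLAMA_NUM_PARALLEL=4 both stream at the same time
for i in 1 2; do
  curl -s http://127.0.0.1:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"count to 20"}]}' &
done
wait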
2. Install and configure LiteLLM
pip install "litellm[proxy]"
Create config.yaml:
model_list:
  - model_name: gpt-4o-mini            # the alias clients use
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://127.0.0.1:11434
  - model_name: gpt-4o                 # route the heavier alias to a bigger model
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://127.0.0.1:11434
  - model_name: text-embedding-3-small
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://127.0.0.1:11434

litellm_settings:
  drop_params: true                    # silently drop unsupported params
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: 127.0.0.1
    port: 6379

general_settings:
  master_key: sk-master-CHANGE-ME
  database_url: "postgresql://litellm:pass@127.0.0.1:5432/litellm"
  ui_username: admin
  ui_password: CHANGE-ME
The model_name field is what clients send. By aliasing gpt-4o-mini → llama3.1:8b, every existing line of code that says model="gpt-4o-mini" keeps working. No client edits required.
litellm --config config.yaml --port 4000
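Before handing out keys, a quick smoke test against the proxy confirms the alias routing works end to end; this assumes the config above, the default port, and the master key (swap in your own):

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}]}'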
3. Issue per-user keys
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o-mini", "text-embedding-3-small"],
    "max_budget": 5.00,
    "duration": "30d",
    "metadata": {"user_id": "engineer-42", "team": "platform"}
  }'
You get back sk-... keys that look exactly like OpenAI keys. Distribute them through your secrets manager.
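The same admin API can inspect and revoke keys; the /key/info and /key/delete routes below match current LiteLLM releases, but verify them against your version before scripting offboarding around them:

# Inspect a key's spend and limits
curl "http://localhost:4000/key/info?key=sk-THE-KEY-TO-CHECK" \
  -H "Authorization: Bearer sk-master-CHANGE-ME"

# Revoke a key, e.g. when an engineer leaves
curl -X POST http://localhost:4000/key/delete \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-THE-KEY-TO-REVOKE"]}'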
4. Put nginx in front with TLS
server {
    listen 443 ssl http2;
    server_name ai.yourcompany.com;

    ssl_certificate     /etc/letsencrypt/live/ai.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourcompany.com/privkey.pem;

    # Long read timeout and no buffering so streamed completions reach the client token by token
    proxy_read_timeout 600s;
    proxy_buffering off;

    location / {
        proxy_pass http://127.0.0.1:4000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Issue the cert with certbot --nginx -d ai.yourcompany.com. The full hardening recipe (fail2ban, log rotation, healthcheck) is in Ollama in production.
5. Point any OpenAI client at it
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.yourcompany.com/v1",
    api_key="sk-engineer-42-XXXXXXXX",
)

r = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "summarize: ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="")
That same code worked against api.openai.com yesterday. Today it runs on your hardware.
When to swap Ollama for vLLM {#vllm}
Ollama is excellent for development and small teams. It serializes most requests through llama.cpp internally, so concurrent throughput plateaus quickly. For high-concurrency production (50+ simultaneous users on one model), vLLM is dramatically better — its PagedAttention scheduler handles 5–20× more concurrent requests on the same GPU.
Stand it up beside Ollama:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 127.0.0.1 --port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92
Add it to LiteLLM:
- model_name: gpt-4o-mini-fast
  litellm_params:
    model: openai/Meta-Llama-3.1-8B-Instruct
    api_base: http://127.0.0.1:8000/v1
    api_key: any
We measured 380 requests/minute single-GPU vLLM (RTX 4090, Llama 3.1 8B) vs 65 requests/minute single-GPU Ollama on the same model. vLLM also requires the model in HuggingFace format and full GPU residency — no CPU offload — which is why Ollama remains the default for mixed workloads.
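If you would rather reproduce that comparison on your own hardware than take our numbers, a short async load script against the gateway is enough. This is a rough sketch, reusing the alias and key format from earlier; tune concurrency and duration to taste:

import asyncio
import time

from openai import AsyncOpenAI

# Rough throughput check: N concurrent workers hammering one alias for a fixed window.
client = AsyncOpenAI(
    base_url="https://ai.yourcompany.com/v1",
    api_key="sk-engineer-42-XXXXXXXX",
)

async def worker(counter: list, deadline: float):
    # Keep issuing small completions until the deadline passes.
    while time.monotonic() < deadline:
        await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "write one sentence about queues"}],
            max_tokens=64,
        )
        counter[0] += 1

async def main(concurrency: int = 16, seconds: int = 60):
    counter = [0]
    deadline = time.monotonic() + seconds
    await asyncio.gather(*(worker(counter, deadline) for _ in range(concurrency)))
    print(f"{counter[0]} completed requests in {seconds}s "
          f"({counter[0] * 60 / seconds:.0f}/min) at concurrency {concurrency}")

asyncio.run(main())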
Auth, rate limits, and audit {#auth}
The three controls a security review will ask about:
Authentication. LiteLLM virtual keys (sk-team-...) bound to specific models, expirations, and budgets. Master key for admin only, never in client code. MFA for the LiteLLM admin UI (proxy through Cloudflare Access or Tailscale).
Rate limiting. Per-key requests-per-minute and tokens-per-minute, configurable in config.yaml:
- model_name: gpt-4o-mini
  litellm_params: { model: ollama/llama3.1:8b, api_base: http://127.0.0.1:11434 }
  rpm: 60        # 60 requests per minute per key
  tpm: 100000    # 100K tokens per minute per key
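Limits can also travel with an individual key rather than a model entry; the rpm_limit and tpm_limit fields below are the names LiteLLM's key-generation endpoint accepts in current releases (double-check against your version):

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-CHANGE-ME" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-4o-mini"],
    "rpm_limit": 30,
    "tpm_limit": 50000,
    "metadata": {"user_id": "engineer-42"}
  }'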
For burst control beyond what LiteLLM offers, layer the nginx limit_req module in front. Our Ollama rate limiting guide goes deeper on multi-tenant patterns.
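A minimal nginx layer for that looks something like the block below, split between the http context and the server block from step 4; zone names and thresholds are illustrative:

# In the http{} context
limit_req_zone  $binary_remote_addr zone=ai_req:10m rate=120r/m;
limit_conn_zone $binary_remote_addr zone=perip:10m;

# Inside the server{} block for ai.yourcompany.com
location / {
    limit_req  zone=ai_req burst=40 nodelay;
    limit_conn perip 10;          # cap simultaneous (streaming) connections per IP
    proxy_pass http://127.0.0.1:4000;
}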
Audit logging. LiteLLM writes to Postgres by default. Enable Langfuse for prompt-level tracing:
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST (self-hosted Langfuse for full data residency). Every request, response, latency, token count, and user id ends up queryable in one place.
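The callbacks pick up their destination from environment variables set before starting the proxy; the hostname below is a placeholder for a self-hosted Langfuse instance:

export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://langfuse.internal.yourcompany.com"
litellm --config config.yaml --port 4000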
Comparison: Ollama, LiteLLM, vLLM {#comparison}
| Capability | Ollama (alone) | LiteLLM + Ollama | LiteLLM + vLLM |
|---|---|---|---|
| OpenAI-compatible endpoint | yes | yes | yes |
| Per-user API keys | no | yes | yes |
| Per-key rate limits | no | yes | yes |
| Per-key budgets | no | yes | yes |
| Audit log to Postgres | manual | built-in | built-in |
| Model aliasing (gpt-4o-mini → local) | no | yes | yes |
| Concurrent streams (RTX 4090, 8B) | ~8 | ~8 | ~60 |
| Embeddings | yes | yes | yes |
| Multi-modal (vision) | yes (LLaVA, Llama 3.2 Vision) | yes | partial |
| CPU fallback | yes | yes | no |
| Setup time | 10 min | 30 min | 1–2 hours |
| Best for | Solo dev, prototypes | Teams up to 50 | High-traffic prod |
Pitfalls and how to avoid them {#pitfalls}
- Binding 0.0.0.0 without a firewall. I have audited three companies that exposed Ollama directly. All three were being used by random scrapers within a week.
- Forgetting model aliases. If your client sends gpt-4o-mini and your config does not alias it, LiteLLM returns a 400 by default. Set drop_params: true and define every alias your code paths use.
- Streaming buffering at nginx. Without proxy_buffering off the user sees the entire response at once instead of streaming. Always disable buffering for AI endpoints.
- No connection limits in nginx. A single client opening 200 streaming connections will starve everyone else. Add limit_conn_zone and limit_conn perip 10.
- Underprovisioned VRAM for embeddings. Embedding models share GPU memory with chat models. Either reserve a CPU-only model for embeddings (nomic-embed-text runs fine on CPU), or use a second GPU.
- Cache leaks between users. LiteLLM's cache key by default does not include the API key. For multi-tenant deployments, set cache_params.namespace: "{api_key}" so a search query from user A never returns user B's cached response (see the sketch after this list).
- Token accounting drift. Self-hosted servers count tokens with their own tokenizer; OpenAI counts with tiktoken. Budgets enforced server-side will not match client-side estimates. Document this delta or you will field tickets every Friday.
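For the cache-leak item above, the fix is one extra line in the cache block from the earlier config. A sketch of what that might look like, using the namespace setting described above (check the exact key name against your LiteLLM version):

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: 127.0.0.1
    port: 6379
    namespace: "{api_key}"   # isolate cached responses per virtual key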
Performance sanity check
Numbers we measured for a 4-week pilot at a 30-engineer fintech, replacing GPT-4o-mini for code review and ticket triage:
| Metric | OpenAI baseline | LiteLLM + Ollama (RTX 4090) | LiteLLM + vLLM (RTX 4090) |
|---|---|---|---|
| Avg requests/day | 28,400 | 28,400 | 28,400 |
| p50 latency to first token | 410 ms | 220 ms | 180 ms |
| p95 latency to full response (300 tok) | 4.8 s | 3.6 s | 3.1 s |
| Concurrent streams sustained | unlimited | 8 | 48 |
| Monthly cost | $1,820 | $0 (cap-ex amortized) | $0 (cap-ex amortized) |
| Quality (internal eval, 200 prompts) | 92.4% | 88.1% (Llama 3.1 8B) | 89.3% (Llama 3.1 8B FP8) |
The 4-point quality gap closes if you bump to Qwen 2.5 32B (which fits on the same GPU at Q4_K_M and scored 91.7%). The full eval methodology is in our AI benchmarks framework.
The OpenAI HTTP contract is documented at platform.openai.com/docs/api-reference — every endpoint LiteLLM and Ollama implement matches that schema.
What you actually get
- A single base URL that any OpenAI client speaks, including tools you did not write (Continue.dev, Cursor, Cline, OpenWebUI, LangChain, LlamaIndex, the Vercel AI SDK).
- A virtual key system with budgets, expirations, and an admin UI.
- A complete audit trail of who asked what, with token counts and latencies.
- The option to swap models behind the scenes: replace llama3.1:8b with qwen2.5:14b tomorrow and zero clients break.
- Compliance-friendly architecture: no data leaves your boundary, every request is logged, every key is revocable.
If your team is currently sharing one OpenAI key, this stack is a strict upgrade. If your team is paying $5K+/month to OpenAI, this stack pays for the GPU in a quarter. And if your team needs to pass a security review, this stack is the only one that gives you the controls auditors actually ask for.
The next thing to build on top is RAG over your private documents — once the API surface is yours, embeddings, retrieval, and prompt assembly all happen inside your network. Pair this with our private AI knowledge base guide and you have replaced ChatGPT Enterprise without the per-seat fee.