Building an AI Gateway with LiteLLM: Local + Cloud Routing in Production
Published April 23, 2026 • 19 min read
A team usually starts with one OpenAI key and a handful of services calling it directly. Six months later there are eleven services, three model providers, no usage attribution, no fallback when one provider has an outage, and the security team wants per-team quotas yesterday. That is the moment an AI gateway stops being optional. LiteLLM is the open-source proxy we have shipped to production for this exact problem — it speaks the OpenAI Chat Completions API on the front, and on the back it talks to roughly 100 model providers including local Ollama, vLLM, OpenAI, Anthropic, Google, Mistral, and Bedrock. This guide is the production deployment we wish we had when we started.
Quick Start: Run LiteLLM in 4 Minutes
# 1. Install with the proxy extras
pip install 'litellm[proxy]'
# 2. Minimal config: one local model + one cloud
cat > config.yaml <<'EOF'
model_list:
  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434
  - model_name: smart-cloud
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
EOF
# 3. Start the proxy
export OPENAI_API_KEY=sk-...
litellm --config config.yaml --port 4000
Now any OpenAI SDK can talk to either model through one endpoint:
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{"model": "fast-local", "messages": [{"role":"user","content":"hi"}]}'
Switch fast-local to smart-cloud in the request body and the same gateway routes the call to OpenAI instead. That is the entire developer experience. Everything else — auth, fallbacks, cost tracking, rate limits — bolts onto this base.
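The same call from application code looks like this: the official OpenAI Python SDK pointed at the gateway instead of api.openai.com (a minimal sketch; the endpoint and key are the ones from the quick start above).

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # the LiteLLM proxy, not api.openai.com
    api_key="sk-1234",                    # a LiteLLM virtual key, never a raw provider key
)

# Routed to the local Ollama deployment
resp = client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)

# Same client, same code path, routed to OpenAI instead
resp = client.chat.completions.create(
    model="smart-cloud",
    messages=[{"role": "user", "content": "Summarize what an AI gateway does in one line."}],
)
print(resp.choices[0].message.content)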
Table of Contents
- Why You Need a Gateway, Not Just an SDK
- Architecture Overview
- Production Config: All the Knobs
- Virtual Keys and Per-Team Budgets
- Fallbacks, Retries, and Cooldowns
- Cost Tracking and Logging
- Routing Strategies
- Benchmarks: LiteLLM Overhead Is Tiny
- Common Production Pitfalls
- FAQ
Why You Need a Gateway, Not Just an SDK {#why-gateway}
Direct SDK calls to providers work fine for one app and one provider. The pain compounds with each new service and each new provider:
| Pain Point | Without Gateway | With LiteLLM |
|---|---|---|
| Switch from OpenAI to local Ollama | Code change, redeploy every service | Edit config.yaml, reload |
| Per-team budgets | Custom code in every service | Built-in virtual keys |
| Cost attribution | Spreadsheet from billing CSV | Real-time per-key spend |
| Provider outage | All requests fail | Auto-fallback to backup |
| Rate limit handling | Per-service retry logic | Centralized with cooldowns |
| Audit trail | Scattered logs | Single Postgres or S3 sink |
| New model releases | Update SDK in N services | Add one line to config |
LiteLLM is essentially nginx for LLMs. Once it is in the path, switching providers, adding fallbacks, or capping a runaway team's spend is a config change.
If you have not yet decided whether to go local at all, Ollama vs ChatGPT API cost and hybrid local + cloud architecture are the prerequisites for this guide.
Architecture Overview {#architecture}
The gateway sits between every internal client and every model provider. Internal services only know one URL: https://ai-gw.internal.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Internal app │ │ Internal app │ │ Internal app │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────┬───────────┴───────────┬───────────┘
▼ ▼
┌──────────────────────────────────┐
│ LiteLLM Proxy (port 4000) │
│ - virtual keys │
│ - per-key budgets │
│ - routing & fallbacks │
│ - cost tracking → Postgres │
└──────────────────────────────────┘
│
┌─────────────┼─────────────┬─────────────────┬──────────────┐
▼ ▼ ▼ ▼ ▼
Ollama:11434 vLLM:8000 OpenAI API Anthropic API Google Vertex
(local 3B/7B) (local 70B)
For a single-team prototype, run LiteLLM as a single Python process. For anything serious: run it under uvicorn workers behind nginx with TLS, back it with Postgres for spend tracking, and keep a Redis instance for shared rate-limit state across replicas.
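Once the proxy runs as multiple replicas, wire up a readiness probe so nginx or Kubernetes only sends traffic to instances that are actually up. A minimal sketch, assuming the /health/readiness endpoint that recent LiteLLM proxy releases expose and the internal hostname above:

import sys

import requests

GATEWAY = "https://ai-gw.internal"

def gateway_ready(timeout: float = 2.0) -> bool:
    """Return True if the proxy reports itself ready to serve traffic."""
    try:
        r = requests.get(f"{GATEWAY}/health/readiness", timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Non-zero exit so Kubernetes or systemd treats this as a failed probe
    sys.exit(0 if gateway_ready() else 1)

Point the load balancer's health check at the same endpoint rather than a bare TCP check, so a replica that has lost Postgres or Redis is taken out of rotation.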
Production Config: All the Knobs {#production-config}
Here is the config.yaml we deploy, with comments. Adapt it to your model lineup.
# config.yaml
model_list:
  # ───── Local fast lane (cheap, low latency) ─────
  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://10.0.1.20:11434
      timeout: 30
      stream_timeout: 60
    model_info:
      mode: chat
      max_input_tokens: 8192

  # ───── Local heavy lane (big context, reasoning) ─────
  - model_name: heavy-local
    litellm_params:
      model: openai/llama-3.3-70b-instruct
      api_base: http://10.0.1.21:8000/v1   # vLLM endpoint
      api_key: dummy
      timeout: 180

  # ───── Cloud premium (when local cannot do it) ─────
  - model_name: smart-cloud
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      timeout: 90
  - model_name: smart-cloud
    litellm_params:
      model: anthropic/claude-3-7-sonnet-20250219
      api_key: os.environ/ANTHROPIC_API_KEY
      timeout: 90

router_settings:
  routing_strategy: usage-based-routing-v2
  num_retries: 2
  timeout: 600
  fallbacks:
    - { fast-local: ["heavy-local", "smart-cloud"] }
    - { heavy-local: ["smart-cloud"] }
  cooldown_time: 30              # seconds before retrying a failed deployment
  enable_pre_call_checks: true

litellm_settings:
  drop_params: true              # silently drop params an upstream rejects
  set_verbose: false
  json_logs: true
  request_timeout: 600
  cache: true
  cache_params:
    type: redis
    host: 10.0.1.30
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # required for /key endpoints
  database_url: os.environ/DATABASE_URL       # Postgres for spend tracking
  store_model_in_db: true
  alerting: ["slack"]
  alert_to_webhook_url: os.environ/SLACK_WEBHOOK_URL
  proxy_budget_rescheduler_min_time: 60
  proxy_budget_rescheduler_max_time: 64
Two non-obvious things. First, two entries can share the same model_name — that is how you give a single name multiple deployments for load balancing or fallback. Second, drop_params: true saves enormous pain. OpenAI accepts tools, but Ollama on a model that does not support tools rejects it. drop_params silently strips unsupported fields rather than failing.
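To see drop_params in action, send a tools-bearing request to the local model. With the flag enabled, the proxy strips a field the upstream cannot handle instead of surfacing a provider error (illustrative sketch; the tool definition is made up, and a small Ollama model may simply ignore it):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-1234")

# Hypothetical tool definition, for illustration only
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With drop_params: true this succeeds even if the upstream behind fast-local
# rejects the tools field; without it, the provider error bubbles up to the caller.
resp = client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.content)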
Virtual Keys and Per-Team Budgets {#virtual-keys}
Hand each team a separate API key, attach a budget, and let LiteLLM enforce it. Generate keys via the admin endpoint with the master key:
curl -X POST http://ai-gw.internal/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "models": ["fast-local", "smart-cloud"],
        "max_budget": 200,
        "budget_duration": "30d",
        "rpm_limit": 600,
        "tpm_limit": 100000,
        "metadata": { "team": "growth", "owner": "alex@" }
      }'
The response includes a key like sk-litellm-xxxxxxxx. Hand that to the growth team. They cannot exceed $200 in 30 days, 600 requests per minute, or 100k tokens per minute. When they hit 80% of budget, LiteLLM fires a Slack alert via the configured webhook. When they hit 100%, requests start failing with Budget Exceeded.
This single feature has saved us multiple times from runaway loops in someone's prototype that would otherwise have racked up four-figure OpenAI bills overnight.
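The same key lifecycle works from a script. A sketch, assuming the /key/generate and /key/info admin endpoints; exact response field names can vary slightly between LiteLLM versions:

import os

import requests

GATEWAY = "http://ai-gw.internal"
HEADERS = {"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"}

# Same call as the curl above, from Python
new_key = requests.post(
    f"{GATEWAY}/key/generate",
    headers=HEADERS,
    json={
        "models": ["fast-local", "smart-cloud"],
        "max_budget": 200,
        "budget_duration": "30d",
        "metadata": {"team": "growth", "owner": "alex@"},
    },
    timeout=10,
).json()["key"]

# Later: how much of the $200 has this key burned?
info = requests.get(
    f"{GATEWAY}/key/info", headers=HEADERS, params={"key": new_key}, timeout=10
).json()
print(info["info"]["spend"], "spent of", info["info"]["max_budget"])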
Hierarchy: Org → Team → Key
LiteLLM supports nested budgets. Set an org-wide cap of $5000/month, allocate $1500 to each team, and distribute keys within each team. Spend rolls up automatically.
# Create an organization
curl -X POST http://ai-gw.internal/organization/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"organization_alias": "engineering", "max_budget": 5000}'

# Create a team under it
curl -X POST http://ai-gw.internal/team/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"team_alias": "platform", "organization_id": "<org_id>", "max_budget": 1500}'
For the audit-trail story, see our local AI audit trail guide — LiteLLM logs every prompt and response keyed to the calling key, which is what you actually need for compliance.
Fallbacks, Retries, and Cooldowns {#fallbacks}
A real production gateway treats every upstream as flaky. Configure four layers of resilience.
1. In-deployment retries
If a single deployment returns 429 or 5xx, retry up to num_retries times with exponential backoff. Default is 2.
2. Cross-deployment fallback
If all retries fail, fall back to a different model. Define the fallback chain in router_settings.fallbacks. We use:
fallbacks:
  - { fast-local: ["heavy-local", "smart-cloud"] }
  - { heavy-local: ["smart-cloud"] }
  - { smart-cloud: ["heavy-local"] }   # cloud out? fall back to local
Yes, the last entry is correct. If OpenAI is down, fall back to your own GPU. We saw this play out for real on November 8, 2025 when both OpenAI and Anthropic had simultaneous degradations — local Ollama kept serving while every cloud-only competitor went dark.
3. Cooldowns
When a deployment fails, mark it cool for cooldown_time seconds and route around it. Without this, you will hammer a dying upstream and burn retries on a guaranteed-fail target.
cooldown_time: 30 # seconds
allowed_fails: 3 # trip after this many fails in cooldown_time window
4. Context-window-aware fallback
If a request exceeds the local model's context window, LiteLLM can automatically fall through to a model with a bigger window. Set context_window_fallbacks:
context_window_fallbacks:
  - { fast-local: ["smart-cloud"] }
A 32k-token request that would error on a 4k-context Ollama model now silently routes to GPT-4o instead.
Cost Tracking and Logging {#cost-tracking}
Wire LiteLLM to Postgres and you get per-request spend with no code in the apps. The schema is straightforward — every completed request lands in LiteLLM_SpendLogs with the calling key, model, prompt tokens, completion tokens, and computed cost.
-- Top 10 most expensive keys in the last 7 days
SELECT
  api_key,
  team_id,
  COUNT(*) AS calls,
  SUM(spend) AS dollars,
  SUM(prompt_tokens + completion_tokens) AS total_tokens
FROM "LiteLLM_SpendLogs"
WHERE "startTime" > NOW() - INTERVAL '7 days'
GROUP BY api_key, team_id
ORDER BY dollars DESC
LIMIT 10;
For exporting to existing observability stacks, configure the logging callbacks:
litellm_settings:
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "sentry"]
Langfuse gives you a per-trace UI with prompt/response inspection. Prometheus gives you metrics like litellm_total_tokens and litellm_request_duration_seconds for Grafana dashboards. We pair this with the Ollama Prometheus + Grafana setup for full-stack visibility.
Routing Strategies {#routing}
LiteLLM ships several routing strategies. Pick by your dominant constraint.
| Strategy | Best For | How It Decides |
|---|---|---|
| simple-shuffle | Even load distribution | Random pick across deployments with the same name |
| least-busy | Latency-sensitive | Routes to the deployment with the fewest active connections |
| usage-based-routing-v2 | RPM/TPM constraints | Picks the deployment furthest from its rate limit |
| latency-based-routing | P99 latency targets | Routes to the lowest 5-minute average latency |
| cost-based-routing | Save money | Picks the cheapest deployment that satisfies the request |
Our default is usage-based-routing-v2 — it considers both rate limits and current load. Switch to cost-based-routing if you have many provider tiers (e.g. Haiku/Sonnet/Opus) and want LiteLLM to always pick the cheapest sufficient model.
Tag-based routing
Route by request metadata. Add tags: ["pii"] to a request and force it to a local model:
router_settings:
  enable_tag_filtering: true

model_list:
  - model_name: fast-local
    litellm_params: { model: ollama/llama3.2:3b, api_base: ... }
    model_info: { tags: ["pii", "internal"] }
  - model_name: smart-cloud
    litellm_params: { model: openai/gpt-4o, api_key: ... }
    model_info: { tags: ["public"] }
Now any request from a service that tags itself pii will only ever route to local. This is how you enforce "no PII to cloud" as a config rule, not as a dev's discipline.
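On the client side, the tag rides along with the request. A sketch of how a PII-handling service would mark its calls, assuming tags are passed via request metadata as in recent LiteLLM releases (the virtual key is a placeholder):

from openai import OpenAI

# Placeholder virtual key issued to the PII-handling service
client = OpenAI(base_url="http://ai-gw.internal/v1", api_key="sk-litellm-pii-service")

resp = client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "Summarize this patient intake note: ..."}],
    # Tags travel in request metadata; with enable_tag_filtering the router
    # only considers deployments whose model_info.tags match.
    extra_body={"metadata": {"tags": ["pii"]}},
)
print(resp.choices[0].message.content)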
Benchmarks: LiteLLM Overhead Is Tiny {#benchmarks}
Tested on a c6i.xlarge (4 vCPU, 8 GB) with a 4-replica uvicorn LiteLLM behind nginx, against a local Ollama instance over a 1 Gbit private network.
| Test | Direct to Ollama | Via LiteLLM | Overhead |
|---|---|---|---|
| Median latency (small prompt) | 95 ms | 99 ms | 4 ms |
| P99 latency | 480 ms | 510 ms | 30 ms |
| Throughput (concurrent=20, llama3.2:3b) | 312 req/s | 305 req/s | 2.2% |
| CPU on gateway host @ 300 req/s | n/a | 38% | — |
| Memory on gateway host | n/a | 420 MB | — |
In other words, the proxy adds about 4 ms of unavoidable HTTP and routing overhead and 2-3% throughput cost. In exchange you get auth, budgets, fallbacks, and cost tracking. Worth it.
For comparison, here is the same load run against OpenAI directly versus through LiteLLM:
| Test | Direct to OpenAI | Via LiteLLM | Overhead |
|---|---|---|---|
| Median latency (gpt-4o-mini) | 412 ms | 416 ms | 4 ms |
| P99 latency | 1840 ms | 1880 ms | 40 ms |
| Reliability (24h, 1M requests) | 99.91% | 99.97% | +0.06% |
Reliability goes up because LiteLLM retries handle the small fraction of cloud-side blips that direct SDK calls would surface as user-facing errors.
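If you want to sanity-check the overhead on your own hardware, the measurement does not need a load-testing framework. A rough sketch that compares direct Ollama calls (via its OpenAI-compatible /v1 endpoint) against calls through the gateway; the hosts, model names, and keys are the ones assumed earlier in this guide:

import statistics
import time

from openai import OpenAI

def measure(base_url: str, api_key: str, model: str, n: int = 200) -> list[float]:
    """Fire n tiny chat requests and return per-request latency in milliseconds."""
    client = OpenAI(base_url=base_url, api_key=api_key)
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "hi"}],
            max_tokens=8,
        )
        samples.append((time.perf_counter() - t0) * 1000)
    return samples

direct = measure("http://10.0.1.20:11434/v1", "ollama", "llama3.2:3b")   # Ollama direct
via_gw = measure("http://localhost:4000/v1", "sk-1234", "fast-local")    # through LiteLLM

for name, s in [("direct", direct), ("gateway", via_gw)]:
    p99 = statistics.quantiles(s, n=100)[98]
    print(f"{name}: median={statistics.median(s):.0f} ms  p99={p99:.0f} ms")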
Common Production Pitfalls {#pitfalls}
1. Database is now in the critical path. When you enable spend tracking with Postgres, a slow database makes every LLM call slow. Use connection pooling (pgbouncer), keep the DB in the same VPC, and monitor pg_stat_activity for long queries.
2. Spend tracking lag. Costs are recorded asynchronously after the response. If the proxy crashes mid-write, you can under-count spend by a request. That is acceptable for most teams; if you cannot tolerate it, reconcile the LiteLLM_SpendLogs table against the provider's billing export on a schedule.
3. Streaming + cost tracking. For streamed responses, LiteLLM has to consume the entire stream to count tokens. If a client disconnects mid-stream, the spend record may be incomplete. Set forward_traceparent_to_llm_api: true and use OpenTelemetry to reconcile.
4. Master key in env files. The master_key grants full admin access. Never commit it. Use Vault, AWS Secrets Manager, or sealed secrets in K8s. Rotate every 90 days.
5. Cache hit attribution. When LiteLLM serves a response from cache, no upstream cost is incurred but the request still appears in logs. Filter on cache_hit = true when reconciling spend.
6. UI is optional, but worth it. litellm[proxy] ships a basic admin UI on /ui. Behind your VPN it is a fast way to add keys, view spend, and inspect failed requests without writing dashboards.
7. Pin the LiteLLM version. This project moves fast. New releases occasionally ship breaking changes to config schema. Pin in your Dockerfile: pip install 'litellm[proxy]==1.55.4' (or whatever version you tested).
For the deepest reference, the official LiteLLM Proxy documentation covers every flag, and the BerriAI/litellm GitHub repo is where new features land first.
Frequently Asked Questions {#faq}
Q: Is LiteLLM the same as the LiteLLM Python SDK?
The SDK (pip install litellm) is a unified client library you import in code. The proxy (pip install 'litellm[proxy]') is a standalone server. The proxy uses the SDK internally. For a multi-team gateway, you want the proxy.
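A sketch of the difference in practice, assuming both packages are installed: the SDK call runs inside your process, while the proxy call is a plain HTTP request to the gateway.

import litellm
from openai import OpenAI

# SDK: your process imports litellm and calls the provider directly
sdk_resp = litellm.completion(
    model="ollama/llama3.2:3b",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "hi"}],
)

# Proxy: your process is a plain OpenAI client; routing, keys, and
# budgets all live server-side in the gateway
proxy_client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-1234")
proxy_resp = proxy_client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "hi"}],
)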
Q: How does LiteLLM handle Anthropic's different message format?
LiteLLM translates between OpenAI's chat completions format and each provider's native format under the hood. Your client sees OpenAI; Anthropic sees Anthropic. System messages, tool calls, and image inputs are mapped automatically.
Q: Can I use LiteLLM with Continue.dev or Cursor?
Yes. Both support OpenAI-compatible base URLs. Point them at http://ai-gw.internal/v1 with a virtual key. You get cost tracking and rate limits per developer. Pair with the Continue.dev + Ollama setup.
Q: What is the practical maximum throughput?
A single LiteLLM uvicorn worker handles ~500 req/s on cheap CPU. Scale horizontally with multiple replicas behind nginx. We have run a 6-replica deployment at 2400 req/s sustained on c6i.xlarge instances. The bottleneck becomes Postgres writes, not LiteLLM itself.
Q: Does LiteLLM support function calling and tool use?
Yes for any underlying model that supports it. Pass the OpenAI tools field; LiteLLM translates to Anthropic's tools, Mistral's tool_choice, etc. For models without native tool support, set drop_params: true so the field is silently stripped.
Q: How do I migrate from a LangChain app already using direct provider SDKs?
Replace each provider client with an OpenAI client pointed at LiteLLM. LangChain's ChatOpenAI works against the proxy unchanged — set openai_api_base and openai_api_key to your gateway's URL and a virtual key.
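A migration sketch, assuming the langchain_openai package; the gateway URL and virtual key below are placeholders:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="smart-cloud",                         # a gateway model_name, not a provider model id
    openai_api_base="http://ai-gw.internal/v1",  # the LiteLLM proxy
    openai_api_key="sk-litellm-your-team-key",   # a virtual key (placeholder value)
    temperature=0,
)

print(llm.invoke("One line: what does an AI gateway do?").content)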
Q: Can I run LiteLLM and Ollama on the same machine?
Yes for development. For production, separate them. LLM inference is GPU-bound; LiteLLM is CPU/network-bound. Combining them on one host makes sizing harder and one workload can starve the other.
Q: What about latency-sensitive on-device or edge cases?
For sub-50 ms total latency requirements, the gateway adds about 4 ms — usually fine. If even that is unacceptable, give the latency-critical service a direct connection to Ollama and only route everything else through LiteLLM.
Conclusion
LiteLLM turns a sprawl of direct provider SDKs into a single internal endpoint that knows about budgets, fallbacks, audit trails, and cost. The pattern works whether you are routing exclusively across local Ollama clusters or fanning out to a dozen cloud providers. We treat it the same way we treat nginx: install it once, configure it with care, and forget it exists most days.
The next step depends on your starting point. If you do not yet have a local model running, our Ollama production deployment guide is the right place to begin. If you have local serving but no integration in your apps, jump to adding local AI to an existing app. For multi-GPU scale-out behind the gateway, see multi-GPU Ollama setup.
Get more production playbooks in the LocalAIMaster newsletter — every week, real lessons from teams running private AI in production.