
AI Gateway with LiteLLM: Route Local + Cloud Models in Production (2026)

April 23, 2026
19 min read
LocalAimaster Research Team


A team usually starts with one OpenAI key and a handful of services calling it directly. Six months later there are eleven services, three model providers, no usage attribution, no fallback when one provider has an outage, and the security team wants per-team quotas yesterday. That is the moment an AI gateway stops being optional. LiteLLM is the open-source proxy we have shipped to production for this exact problem — it speaks the OpenAI Chat Completions API on the front, and on the back it talks to roughly 100 model providers including local Ollama, vLLM, OpenAI, Anthropic, Google, Mistral, and Bedrock. This guide is the production deployment we wish we had when we started.

Quick Start: Run LiteLLM in 4 Minutes

# 1. Install with the proxy extras
pip install 'litellm[proxy]'

# 2. Minimal config: one local model + one cloud
cat > config.yaml <<'EOF'
model_list:
  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434
  - model_name: smart-cloud
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
EOF

# 3. Start the proxy
export OPENAI_API_KEY=sk-...
litellm --config config.yaml --port 4000

Now any OpenAI SDK can talk to either model through one endpoint:

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{"model": "fast-local", "messages": [{"role":"user","content":"hi"}]}'

Switch fast-local to smart-cloud in the request body and the same gateway routes the call to OpenAI instead. That is the entire developer experience. Everything else — auth, fallbacks, cost tracking, rate limits — bolts onto this base.
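The same call from Python needs nothing beyond the standard library. A minimal sketch, assuming the proxy from the quick start is listening on localhost:4000 (the payload builder is split out so you can see exactly what crosses the wire):

```python
import json
from urllib import request

GATEWAY_URL = "http://localhost:4000/v1/chat/completions"  # assumed from the quick start

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the OpenAI-format body; only the model name changes per route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str, api_key: str = "sk-1234") -> dict:
    """POST one chat completion through the LiteLLM gateway."""
    req = request.Request(
        GATEWAY_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Same code, different route: swap the model name to hit cloud instead.
    print(chat("fast-local", "hi")["choices"][0]["message"]["content"])
```

Swapping `"fast-local"` for `"smart-cloud"` is the entire client-side change for rerouting.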

Table of Contents

  1. Why You Need a Gateway, Not Just an SDK
  2. Architecture Overview
  3. Production Config: All the Knobs
  4. Virtual Keys and Per-Team Budgets
  5. Fallbacks, Retries, and Cooldowns
  6. Cost Tracking and Logging
  7. Routing Strategies
  8. Benchmarks: LiteLLM Overhead Is Tiny
  9. Common Production Pitfalls
  10. FAQ

Why You Need a Gateway, Not Just an SDK {#why-gateway}

Direct SDK calls to providers work fine for one app and one provider. The pain compounds with every new service and every new provider, because N services talking to M providers means N × M integration points to keep updated, secured, and monitored:

| Pain Point | Without Gateway | With LiteLLM |
| --- | --- | --- |
| Switch from OpenAI to local Ollama | Code change, redeploy every service | Edit config.yaml, reload |
| Per-team budgets | Custom code in every service | Built-in virtual keys |
| Cost attribution | Spreadsheet from billing CSV | Real-time per-key spend |
| Provider outage | All requests fail | Auto-fallback to backup |
| Rate limit handling | Per-service retry logic | Centralized with cooldowns |
| Audit trail | Scattered logs | Single Postgres or S3 sink |
| New model releases | Update SDK in N services | Add one line to config |

LiteLLM is essentially nginx for LLMs. Once it is in the path, switching providers, adding fallbacks, or capping a runaway team's spend is a config change.

If you have not yet decided whether to go local at all, Ollama vs ChatGPT API cost and hybrid local + cloud architecture are the prerequisites for this guide.


Architecture Overview {#architecture}

The gateway sits between every internal client and every model provider. Internal services only know one URL: https://ai-gw.internal.

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Internal app   │    │   Internal app   │    │   Internal app   │
└────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘
         │                       │                       │
         └───────────┬───────────┴───────────┬───────────┘
                     ▼                       ▼
              ┌──────────────────────────────────┐
              │     LiteLLM Proxy (port 4000)    │
              │   - virtual keys                 │
              │   - per-key budgets              │
              │   - routing & fallbacks          │
              │   - cost tracking → Postgres     │
              └──────────────────────────────────┘
                     │
       ┌─────────────┼─────────────┬─────────────────┬──────────────┐
       ▼             ▼             ▼                 ▼              ▼
   Ollama:11434  vLLM:8000   OpenAI API   Anthropic API   Google Vertex
   (local 3B/7B) (local 70B)

For a single-team prototype, run LiteLLM as a single Python process. For anything serious: run it under uvicorn workers behind nginx with TLS, back it with Postgres for spend tracking, and keep a Redis instance for shared rate-limit state across replicas.


Production Config: All the Knobs {#production-config}

Here is the config.yaml we deploy, with comments. Adapt it to your model lineup.

# config.yaml
model_list:
  # ───── Local fast lane (cheap, low latency) ─────
  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://10.0.1.20:11434
      timeout: 30
      stream_timeout: 60
    model_info:
      mode: chat
      max_input_tokens: 8192

  # ───── Local heavy lane (big context, reasoning) ─────
  - model_name: heavy-local
    litellm_params:
      model: openai/llama-3.3-70b-instruct
      api_base: http://10.0.1.21:8000/v1   # vLLM endpoint
      api_key: dummy
      timeout: 180

  # ───── Cloud premium (when local cannot do it) ─────
  - model_name: smart-cloud
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      timeout: 90

  - model_name: smart-cloud
    litellm_params:
      model: anthropic/claude-3-7-sonnet-20250219
      api_key: os.environ/ANTHROPIC_API_KEY
      timeout: 90

router_settings:
  routing_strategy: usage-based-routing-v2
  num_retries: 2
  timeout: 600
  fallbacks:
    - { fast-local: ["heavy-local", "smart-cloud"] }
    - { heavy-local: ["smart-cloud"] }
  cooldown_time: 30   # seconds before retrying a failed deployment
  enable_pre_call_checks: true

litellm_settings:
  drop_params: true       # silently drop params an upstream rejects
  set_verbose: false
  json_logs: true
  request_timeout: 600
  cache: true
  cache_params:
    type: redis
    host: 10.0.1.30
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # required for /key endpoints
  database_url: os.environ/DATABASE_URL       # Postgres for spend tracking
  store_model_in_db: true
  alerting: ["slack"]
  alert_to_webhook_url: os.environ/SLACK_WEBHOOK_URL
  proxy_budget_rescheduler_min_time: 60
  proxy_budget_rescheduler_max_time: 64

Two non-obvious things. First, two entries can share the same model_name — that is how you give a single name multiple deployments for load balancing or fallback. Second, drop_params: true saves enormous pain. OpenAI accepts tools, but Ollama on a model that does not support tools rejects it. drop_params silently strips unsupported fields rather than failing.
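What "silently strips unsupported fields" means is easy to picture. This is an illustrative sketch only, not LiteLLM's implementation (which keeps per-provider support tables internally):

```python
def drop_unsupported_params(payload: dict, supported: set) -> dict:
    """Keep only the request fields this deployment understands.

    Illustrative sketch of drop_params behavior, not LiteLLM's code.
    """
    return {k: v for k, v in payload.items() if k in supported}

# A tools-bearing request aimed at a model with no tool support:
request_body = {
    "model": "fast-local",
    "messages": [{"role": "user", "content": "hi"}],
    "tools": [{"type": "function", "function": {"name": "lookup"}}],
}
# Hypothetical support set for a tool-less Ollama model:
NO_TOOLS = {"model", "messages", "temperature", "stream"}

cleaned = drop_unsupported_params(request_body, NO_TOOLS)
# "tools" is stripped; the upstream call succeeds instead of erroring.
```

Without this, the same client code that works against GPT-4o throws a 400 the moment you reroute it to a small local model.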


Virtual Keys and Per-Team Budgets {#virtual-keys}

Hand each team a separate API key, attach a budget, and let LiteLLM enforce it. Generate keys via the admin endpoint with the master key:

curl -X POST http://ai-gw.internal/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["fast-local", "smart-cloud"],
    "max_budget": 200,
    "budget_duration": "30d",
    "rpm_limit": 600,
    "tpm_limit": 100000,
    "metadata": { "team": "growth", "owner": "alex@" }
  }'

The response includes a key like sk-litellm-xxxxxxxx. Hand that to the growth team. They cannot exceed $200 in 30 days, 600 requests per minute, or 100k tokens per minute. When they hit 80% of budget, LiteLLM fires a Slack alert via the configured webhook. When they hit 100%, requests start failing with Budget Exceeded.

This single feature has saved us multiple times from runaway loops in someone's prototype that would otherwise have racked up four-figure OpenAI bills overnight.

Hierarchy: Org → Team → Key

LiteLLM supports nested budgets. Set an org-wide cap of $5000/month, allocate $1500 to each team, and distribute keys within each team. Spend rolls up automatically.

# Create an organization
curl -X POST http://ai-gw.internal/organization/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"organization_alias": "engineering", "max_budget": 5000}'

# Create a team under it
curl -X POST http://ai-gw.internal/team/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -d '{"team_alias": "platform", "organization_id": "<org_id>", "max_budget": 1500}'
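The rollup logic itself is simple: a key's spend counts against its team, and a team's spend counts against its org. A toy model of that hierarchy (names and dollar figures below are invented for illustration, not LiteLLM internals):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One level of the Org -> Team -> Key hierarchy."""
    name: str
    max_budget: float
    spend: float = 0.0            # direct spend (nonzero for keys)
    children: list = field(default_factory=list)

    def total_spend(self) -> float:
        # Own spend plus everything rolled up from below.
        return self.spend + sum(c.total_spend() for c in self.children)

    def over_budget(self) -> bool:
        return self.total_spend() >= self.max_budget

# Hypothetical numbers:
growth_key = Node("sk-growth-demo", max_budget=200, spend=180)
platform_key = Node("sk-platform-demo", max_budget=200, spend=40)
team = Node("platform", max_budget=1500, children=[growth_key, platform_key])
org = Node("engineering", max_budget=5000, children=[team])
# org.total_spend() rolls up to 220; no cap tripped, but growth_key is at 90%.
```

The point of the sketch: a key can trip its own cap long before the team or org caps are in danger, which is exactly the containment behavior you want.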

For the audit-trail story, see our local AI audit trail guide — LiteLLM logs every prompt and response keyed to the calling key, which is what you actually need for compliance.


Fallbacks, Retries, and Cooldowns {#fallbacks}

A real production gateway treats every upstream as flaky. Configure three layers of resilience.

1. In-deployment retries

If a single deployment returns 429 or 5xx, retry up to num_retries times with exponential backoff. Default is 2.
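What `num_retries` buys you, sketched in Python. The backoff schedule here is illustrative; LiteLLM's actual timing and jitter differ:

```python
import time

RETRYABLE = {429, 500, 502, 503}

def call_with_retries(send, num_retries=2, base_delay=0.5):
    """Retry a deployment call on 429/5xx with exponential backoff.

    `send` is any zero-arg callable returning (status, body).
    Illustrative sketch, not LiteLLM's internal retry loop.
    """
    for attempt in range(num_retries + 1):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < num_retries:
            time.sleep(base_delay * (2 ** attempt))  # base, 2x base, 4x base, ...
    return status, body

# Simulate an upstream that fails twice, then recovers:
responses = iter([(429, None), (503, None), (200, "ok")])
status, body = call_with_retries(lambda: next(responses),
                                 num_retries=2, base_delay=0.01)
```

With `num_retries: 2`, the third attempt lands and the caller never sees the two transient failures.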

2. Cross-deployment fallback

If all retries fail, fall back to a different model. Define the fallback chain in router_settings.fallbacks. We use:

fallbacks:
  - { fast-local: ["heavy-local", "smart-cloud"] }
  - { heavy-local: ["smart-cloud"] }
  - { smart-cloud: ["heavy-local"] }   # cloud out? fall to local

Yes, the last entry is correct. If OpenAI is down, fall back to your own GPU. We saw this play out for real on November 8, 2025 when both OpenAI and Anthropic had simultaneous degradations — local Ollama kept serving while every cloud-only competitor went dark.

3. Cooldowns

When a deployment fails, mark it cool for cooldown_time seconds and route around it. Without this, you will hammer a dying upstream and burn retries on a guaranteed-fail target.

cooldown_time: 30   # seconds
allowed_fails: 3    # trip after this many fails in cooldown_time window
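The cooldown mechanics amount to a small state machine. A sketch with an injectable clock so it is testable (illustrative only; LiteLLM shares this state via Redis across replicas):

```python
import time
from collections import defaultdict

class CooldownTracker:
    """Trip a deployment after `allowed_fails` failures in the window,
    then skip it for `cooldown_time` seconds. Illustrative sketch."""

    def __init__(self, cooldown_time=30.0, allowed_fails=3, clock=time.monotonic):
        self.cooldown_time = cooldown_time
        self.allowed_fails = allowed_fails
        self.clock = clock
        self.fail_times = defaultdict(list)   # deployment -> recent failure times
        self.cooling_until = {}               # deployment -> end of cooldown

    def record_failure(self, deployment):
        now = self.clock()
        # Keep only failures inside the sliding window, then add this one.
        recent = [ts for ts in self.fail_times[deployment]
                  if now - ts < self.cooldown_time]
        recent.append(now)
        self.fail_times[deployment] = recent
        if len(recent) >= self.allowed_fails:
            self.cooling_until[deployment] = now + self.cooldown_time

    def available(self, deployment):
        return self.clock() >= self.cooling_until.get(deployment, 0.0)

# With a fake clock: three quick failures trip a 30-second cooldown.
t = [0.0]
tracker = CooldownTracker(clock=lambda: t[0])
for _ in range(3):
    tracker.record_failure("fast-local")
t[0] = 5.0
tripped = not tracker.available("fast-local")     # still cooling at t=5
t[0] = 31.0
recovered = tracker.available("fast-local")       # back in rotation at t=31
```

While `fast-local` is cooling, the router simply excludes it from candidate deployments, which is what stops the hammering.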

4. Context-window-aware fallback

If a request exceeds the local model's context window, LiteLLM can automatically fall through to a model with a bigger window. Set context_window_fallbacks:

context_window_fallbacks:
  - { fast-local: ["smart-cloud"] }

A 32k-token request that would error on a 4k-context Ollama model now silently routes to GPT-4o instead.
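The routing decision behind this is simple to sketch. Window sizes and token counts below are illustrative; LiteLLM uses real tokenizer counts plus the `model_info` you configure:

```python
def pick_deployment(prompt_tokens, deployments):
    """Return the first deployment whose context window fits the request.

    Illustrative sketch of context-window-aware fallback.
    """
    for d in deployments:
        if prompt_tokens <= d["max_input_tokens"]:
            return d["name"]
    return None  # nothing fits; surface an error to the caller

# Hypothetical lineup mirroring the fallback pair above:
LINEUP = [
    {"name": "fast-local", "max_input_tokens": 4_096},
    {"name": "smart-cloud", "max_input_tokens": 128_000},
]

small = pick_deployment(2_000, LINEUP)    # fits the local model
big = pick_deployment(32_000, LINEUP)     # falls through to cloud
```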


Cost Tracking and Logging {#cost-tracking}

Wire LiteLLM to Postgres and you get per-request spend with no code in the apps. The schema is straightforward — every completed request lands in LiteLLM_SpendLogs with the calling key, model, prompt tokens, completion tokens, and computed cost.

-- Top 10 most expensive keys in the last 7 days
SELECT
  api_key, team_id, COUNT(*) AS calls,
  SUM(spend) AS dollars,
  SUM(prompt_tokens + completion_tokens) AS total_tokens
FROM "LiteLLM_SpendLogs"
WHERE "startTime" > NOW() - INTERVAL '7 days'
GROUP BY api_key, team_id
ORDER BY dollars DESC
LIMIT 10;
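The `spend` column is just token counts times per-token prices. If you ever need to recompute it to sanity-check the logs, the arithmetic looks like this; the prices below are placeholders, not current list prices:

```python
# Hypothetical $-per-1M-token prices; substitute your providers' real rates.
PRICES = {
    "smart-cloud": {"prompt": 2.50, "completion": 10.00},
    "fast-local": {"prompt": 0.00, "completion": 0.00},  # your GPU, no metered cost
}

def request_cost(model, prompt_tokens, completion_tokens):
    """Dollar cost of one request from its token counts."""
    p = PRICES[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000

cloud = request_cost("smart-cloud", 1_000, 500)   # 0.0025 + 0.0050 = $0.0075
local = request_cost("fast-local", 1_000, 500)    # $0.00
```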

For exporting to existing observability stacks, configure the logging callbacks:

litellm_settings:
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "sentry"]

Langfuse gives you a per-trace UI with prompt/response inspection. Prometheus gives you metrics like litellm_total_tokens and litellm_request_duration_seconds for Grafana dashboards. We pair this with the Ollama Prometheus + Grafana setup for full-stack visibility.


Routing Strategies {#routing}

LiteLLM ships several routing strategies. Pick by your dominant constraint.

| Strategy | Best For | How It Decides |
| --- | --- | --- |
| simple-shuffle | Even load distribution | Random pick across deployments with same name |
| least-busy | Latency-sensitive | Routes to deployment with fewest active connections |
| usage-based-routing-v2 | RPM/TPM constraints | Picks deployment furthest from rate limit |
| latency-based-routing | P99 latency targets | Routes to lowest 5-min average latency |
| cost-based-routing | Save money | Picks cheapest deployment that satisfies the request |

Our default is usage-based-routing-v2 — it considers both rate limits and current load. Switch to cost-based-routing if you have many provider tiers (e.g. Haiku/Sonnet/Opus) and want LiteLLM to always pick the cheapest sufficient model.
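The core idea of usage-based routing, picking the deployment with the most remaining rate-limit headroom, fits in a few lines. An illustrative sketch (the real v2 strategy also weighs TPM and syncs counters across replicas via Redis):

```python
def pick_by_headroom(deployments):
    """Choose the deployment furthest from its RPM limit.

    Illustrative sketch of usage-based routing, not LiteLLM's router.
    """
    def headroom(d):
        return 1.0 - d["rpm_used"] / d["rpm_limit"]  # fraction of limit left
    return max(deployments, key=headroom)["name"]

# Hypothetical snapshot of two deployments sharing one model_name:
FLEET = [
    {"name": "smart-cloud-openai", "rpm_used": 540, "rpm_limit": 600},    # 10% left
    {"name": "smart-cloud-anthropic", "rpm_used": 100, "rpm_limit": 400}, # 75% left
]

choice = pick_by_headroom(FLEET)
```

Even though the OpenAI deployment has the higher absolute limit, the router sends the next request to the Anthropic one because it has more headroom.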

Tag-based routing

Route by request metadata. Add tags: ["pii"] to a request and force it to a local model:

router_settings:
  enable_tag_filtering: true

model_list:
  - model_name: fast-local
    litellm_params: { model: ollama/llama3.2:3b, api_base: ... }
    model_info: { tags: ["pii", "internal"] }
  - model_name: smart-cloud
    litellm_params: { model: openai/gpt-4o, api_key: ... }
    model_info: { tags: ["public"] }

Now any request from a service that tags itself pii will only ever route to local. This is how you enforce "no PII to cloud" as a config rule, not as a dev's discipline.
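The enforcement reduces to a set-containment check per deployment. A sketch of the rule "a request may only land on deployments carrying all of its tags" (illustrative semantics, not LiteLLM's code):

```python
def eligible_deployments(request_tags, deployments):
    """Return deployments whose tag set covers every tag on the request.

    Illustrative sketch of tag-based routing semantics.
    """
    return [
        d["name"]
        for d in deployments
        if set(request_tags) <= set(d["tags"])  # subset check
    ]

# Mirrors the config above:
DEPLOYMENTS = [
    {"name": "fast-local", "tags": ["pii", "internal"]},
    {"name": "smart-cloud", "tags": ["public"]},
]

pii_routes = eligible_deployments({"pii"}, DEPLOYMENTS)   # local only
open_routes = eligible_deployments(set(), DEPLOYMENTS)    # untagged: anywhere
```

An untagged request can go anywhere; a `pii`-tagged one has exactly one eligible destination, so there is no code path that leaks it to cloud.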


Benchmarks: LiteLLM Overhead Is Tiny {#benchmarks}

Tested on a c6i.xlarge (4 vCPU, 8 GB) with a 4-replica uvicorn LiteLLM behind nginx, against a local Ollama instance over a 1 Gbit private network.

| Test | Direct to Ollama | Via LiteLLM | Overhead |
| --- | --- | --- | --- |
| Median latency (small prompt) | 95 ms | 99 ms | 4 ms |
| P99 latency | 480 ms | 510 ms | 30 ms |
| Throughput (concurrent=20, llama3.2:3b) | 312 req/s | 305 req/s | 2.2% |
| CPU on gateway host @ 300 req/s | n/a | 38% | |
| Memory on gateway host | n/a | 420 MB | |

In other words, the proxy adds about 4 ms of unavoidable HTTP and routing overhead and 2-3% throughput cost. In exchange you get auth, budgets, fallbacks, and cost tracking. Worth it.

For comparison, here is the same load run against OpenAI directly versus through LiteLLM:

| Test | Direct to OpenAI | Via LiteLLM | Overhead |
| --- | --- | --- | --- |
| Median latency (gpt-4o-mini) | 412 ms | 416 ms | 4 ms |
| P99 latency | 1840 ms | 1880 ms | 40 ms |
| Reliability (24h, 1M requests) | 99.91% | 99.97% | +0.06% |

Reliability goes up because LiteLLM retries handle the small fraction of cloud-side blips that direct SDK calls would surface as user-facing errors.


Common Production Pitfalls {#pitfalls}

1. Database is now in the critical path. When you enable spend tracking with Postgres, a slow database makes every LLM call slow. Use connection pooling (pgbouncer), keep the DB in the same VPC, and monitor pg_stat_activity for long queries.

2. Spend tracking lag. Costs are recorded asynchronously after the response. If you crash mid-write, you might under-bill by one request. Acceptable for most teams; if you cannot tolerate it, write spend synchronously (litellm_settings.disable_spend_logs: false).

3. Streaming + cost tracking. For streamed responses, LiteLLM has to consume the entire stream to count tokens. If a client disconnects mid-stream, the spend record may be incomplete. Set forward_traceparent_to_llm_api: true and use OpenTelemetry to reconcile.

4. Master key in env files. The master_key grants full admin access. Never commit it. Use Vault, AWS Secrets Manager, or sealed secrets in K8s. Rotate every 90 days.

5. Cache hit attribution. When LiteLLM serves a response from cache, no upstream cost is incurred but the request still appears in logs. Filter on cache_hit = true when reconciling spend.

6. UI is optional, but worth it. litellm[proxy] ships a basic admin UI on /ui. Behind your VPN it is a fast way to add keys, view spend, and inspect failed requests without writing dashboards.

7. Pin the LiteLLM version. This project moves fast. New releases occasionally ship breaking changes to config schema. Pin in your Dockerfile: pip install 'litellm[proxy]==1.55.4' (or whatever version you tested).

For the deepest reference, the official LiteLLM Proxy documentation covers every flag, and the BerriAI/litellm GitHub repo is where new features land first.


Frequently Asked Questions {#faq}

Q: Is LiteLLM the same as the LiteLLM Python SDK?

The SDK (pip install litellm) is a unified client library you import in code. The proxy (pip install 'litellm[proxy]') is a standalone server. The proxy uses the SDK internally. For a multi-team gateway, you want the proxy.

Q: How does LiteLLM handle Anthropic's different message format?

LiteLLM translates between OpenAI's chat completions format and each provider's native format under the hood. Your client sees OpenAI; Anthropic sees Anthropic. System messages, tool calls, and image inputs are mapped automatically.

Q: Can I use LiteLLM with Continue.dev or Cursor?

Yes. Both support OpenAI-compatible base URLs. Point them at http://ai-gw.internal/v1 with a virtual key. You get cost tracking and rate limits per developer. Pair with the Continue.dev + Ollama setup.

Q: What is the practical maximum throughput?

A single LiteLLM uvicorn worker handles ~500 req/s on cheap CPU. Scale horizontally with multiple replicas behind nginx. We have run a 6-replica deployment at 2400 req/s sustained on c6i.xlarge instances. The bottleneck becomes Postgres writes, not LiteLLM itself.

Q: Does LiteLLM support function calling and tool use?

Yes for any underlying model that supports it. Pass the OpenAI tools field; LiteLLM translates to Anthropic's tools, Mistral's tool_choice, etc. For models without native tool support, set drop_params: true so the field is silently stripped.

Q: How do I migrate from a LangChain app already using direct provider SDKs?

Replace each provider client with an OpenAI client pointed at LiteLLM. LangChain's ChatOpenAI works against the proxy unchanged — set openai_api_base and openai_api_key to your gateway's URL and a virtual key.

Q: Can I run LiteLLM and Ollama on the same machine?

Yes for development. For production, separate them. LLM inference is GPU-bound; LiteLLM is CPU/network-bound. Combining them on one host makes sizing harder and one workload can starve the other.

Q: What about latency-sensitive on-device or edge cases?

For sub-50 ms total latency requirements, the gateway adds about 4 ms — usually fine. If even that is unacceptable, give the latency-critical service a direct connection to Ollama and only route everything else through LiteLLM.


Conclusion

LiteLLM turns a sprawl of direct provider SDKs into a single internal endpoint that knows about budgets, fallbacks, audit trails, and cost. The pattern works whether you are routing exclusively across local Ollama clusters or fanning out to a dozen cloud providers. We treat it the same way we treat nginx: install it once, configure it with care, and forget it exists most days.

The next step depends on your starting point. If you do not yet have a local model running, our Ollama production deployment guide is the right place to begin. If you have local serving but no integration in your apps, jump to adding local AI to an existing app. For multi-GPU scale-out behind the gateway, see multi-GPU Ollama setup.

Get more production playbooks in the LocalAIMaster newsletter — every week, real lessons from teams running private AI in production.


Written by Pattanaik Ramswarup

Creator of Local AI Master
