Building an AI Gateway with LiteLLM: Local + Cloud Routing in Production
Published April 23, 2026 • 19 min read
A team usually starts with one OpenAI key and a handful of services calling it directly. Six months later there are eleven services, three model providers, no usage attribution, no fallback when one provider has an outage, and the security team wants per-team quotas yesterday. That is the moment an AI gateway stops being optional. LiteLLM is the open-source proxy we have shipped to production for this exact problem — it speaks the OpenAI Chat Completions API on the front, and on the back it talks to roughly 100 model providers including local Ollama, vLLM, OpenAI, Anthropic, Google, Mistral, and Bedrock. This guide is the production deployment we wish we had when we started.
Quick Start: Run LiteLLM in 4 Minutes
# 1. Install with the proxy extras
pip install 'litellm[proxy]'
# 2. Minimal config: one local model + one cloud
cat > config.yaml <<'EOF'
model_list:
  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://localhost:11434
  - model_name: smart-cloud
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
EOF
# 3. Start the proxy
export OPENAI_API_KEY=sk-...
litellm --config config.yaml --port 4000
Now any OpenAI SDK can talk to either model through one endpoint:
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{"model": "fast-local", "messages": [{"role":"user","content":"hi"}]}'
Switch fast-local to smart-cloud in the request body and the same gateway routes the call to OpenAI instead. That is the entire developer experience. Everything else — auth, fallbacks, cost tracking, rate limits — bolts onto this base.
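The same call from application code looks like this: the official OpenAI Python SDK pointed at the gateway instead of api.openai.com (a minimal sketch; the endpoint and key are the ones from the quick start above).

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # the LiteLLM proxy, not api.openai.com
    api_key="sk-1234",                    # a LiteLLM virtual key, never a raw provider key
)

# Routed to the local Ollama deployment
resp = client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "hi"}],
)
print(resp.choices[0].message.content)

# Same client, same code path, routed to OpenAI instead
resp = client.chat.completions.create(
    model="smart-cloud",
    messages=[{"role": "user", "content": "Summarize what an AI gateway does in one line."}],
)
print(resp.choices[0].message.content)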
Table of Contents
- Why You Need a Gateway, Not Just an SDK
- Architecture Overview
- Production Config: All the Knobs
- Virtual Keys and Per-Team Budgets
- Fallbacks, Retries, and Cooldowns
- Cost Tracking and Logging
- Routing Strategies
- Benchmarks: LiteLLM Overhead Is Tiny
- Common Production Pitfalls
- FAQ
Why You Need a Gateway, Not Just an SDK {#why-gateway}
Direct SDK calls to providers work fine for one app and one provider. The pain compounds with each new service and each new provider:
| Pain Point | Without Gateway | With LiteLLM |
|---|---|---|
| Switch from OpenAI to local Ollama | Code change, redeploy every service | Edit config.yaml, reload |
| Per-team budgets | Custom code in every service | Built-in virtual keys |
| Cost attribution | Spreadsheet from billing CSV | Real-time per-key spend |
| Provider outage | All requests fail | Auto-fallback to backup |
| Rate limit handling | Per-service retry logic | Centralized with cooldowns |
| Audit trail | Scattered logs | Single Postgres or S3 sink |
| New model releases | Update SDK in N services | Add one line to config |
LiteLLM is essentially nginx for LLMs. Once it is in the path, switching providers, adding fallbacks, or capping a runaway team's spend is a config change.
If you have not yet decided whether to go local at all, Ollama vs ChatGPT API cost and hybrid local + cloud architecture are the prerequisites for this guide.
Architecture Overview {#architecture}
The gateway sits between every internal client and every model provider. Internal services only know one URL: https://ai-gw.internal.
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Internal app │ │ Internal app │ │ Internal app │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└───────────┬───────────┴───────────┬───────────┘
▼ ▼
┌──────────────────────────────────┐
│ LiteLLM Proxy (port 4000) │
│ - virtual keys │
│ - per-key budgets │
│ - routing & fallbacks │
│ - cost tracking → Postgres │
└──────────────────────────────────┘
│
┌─────────────┼─────────────┬─────────────────┬──────────────┐
▼ ▼ ▼ ▼ ▼
Ollama:11434 vLLM:8000 OpenAI API Anthropic API Google Vertex
(local 3B/7B) (local 70B)
For a single-team prototype, run LiteLLM as a single Python process. For anything serious: run it under uvicorn workers behind nginx with TLS, back it with Postgres for spend tracking, and keep a Redis instance for shared rate-limit state across replicas.
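Once the proxy runs as multiple replicas, wire up a readiness probe so nginx or Kubernetes only sends traffic to instances that are actually up. A minimal sketch, assuming the /health/readiness endpoint that recent LiteLLM proxy releases expose and the internal hostname above:

import sys

import requests

GATEWAY = "https://ai-gw.internal"

def gateway_ready(timeout: float = 2.0) -> bool:
    """Return True if the proxy reports itself ready to serve traffic."""
    try:
        r = requests.get(f"{GATEWAY}/health/readiness", timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # Non-zero exit so Kubernetes or systemd treats this as a failed probe
    sys.exit(0 if gateway_ready() else 1)

Point the load balancer's health check at the same endpoint rather than a bare TCP check, so a replica that has lost Postgres or Redis is taken out of rotation.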
Production Config: All the Knobs {#production-config}
Here is the config.yaml we deploy, with comments. Adapt it to your model lineup.
# config.yaml
model_list:
  # ───── Local fast lane (cheap, low latency) ─────
  - model_name: fast-local
    litellm_params:
      model: ollama/llama3.2:3b
      api_base: http://10.0.1.20:11434
      timeout: 30
      stream_timeout: 60
    model_info:
      mode: chat
      max_input_tokens: 8192

  # ───── Local heavy lane (big context, reasoning) ─────
  - model_name: heavy-local
    litellm_params:
      model: openai/llama-3.3-70b-instruct
      api_base: http://10.0.1.21:8000/v1   # vLLM endpoint
      api_key: dummy
      timeout: 180

  # ───── Cloud premium (when local cannot do it) ─────
  - model_name: smart-cloud
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      timeout: 90
  - model_name: smart-cloud
    litellm_params:
      model: anthropic/claude-3-7-sonnet-20250219
      api_key: os.environ/ANTHROPIC_API_KEY
      timeout: 90

router_settings:
  routing_strategy: usage-based-routing-v2
  num_retries: 2
  timeout: 600
  fallbacks:
    - { fast-local: ["heavy-local", "smart-cloud"] }
    - { heavy-local: ["smart-cloud"] }
  cooldown_time: 30              # seconds before retrying a failed deployment
  enable_pre_call_checks: true

litellm_settings:
  drop_params: true              # silently drop params an upstream rejects
  set_verbose: false
  json_logs: true
  request_timeout: 600
  cache: true
  cache_params:
    type: redis
    host: 10.0.1.30
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # required for /key endpoints
  database_url: os.environ/DATABASE_URL       # Postgres for spend tracking
  store_model_in_db: true
  alerting: ["slack"]
  alert_to_webhook_url: os.environ/SLACK_WEBHOOK_URL
  proxy_budget_rescheduler_min_time: 60
  proxy_budget_rescheduler_max_time: 64
Two non-obvious things. First, two entries can share the same model_name — that is how you give a single name multiple deployments for load balancing or fallback. Second, drop_params: true saves enormous pain. OpenAI accepts tools, but Ollama on a model that does not support tools rejects it. drop_params silently strips unsupported fields rather than failing.
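To see drop_params in action, send a tools-bearing request to the local model. With the flag enabled, the proxy strips a field the upstream cannot handle instead of surfacing a provider error (illustrative sketch; the tool definition is made up, and a small Ollama model may simply ignore it):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-1234")

# Hypothetical tool definition, for illustration only
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With drop_params: true this succeeds even if the upstream behind fast-local
# rejects the tools field; without it, the provider error bubbles up to the caller.
resp = client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.content)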
Virtual Keys and Per-Team Budgets {#virtual-keys}
Hand each team a separate API key, attach a budget, and let LiteLLM enforce it. Generate keys via the admin endpoint with the master key:
curl -X POST http://ai-gw.internal/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "models": ["fast-local", "smart-cloud"],
        "max_budget": 200,
        "budget_duration": "30d",
        "rpm_limit": 600,
        "tpm_limit": 100000,
        "metadata": { "team": "growth", "owner": "alex@" }
      }'
The response includes a key like sk-litellm-xxxxxxxx. Hand that to the growth team. They cannot exceed $200 in 30 days, 600 requests per minute, or 100k tokens per minute. When they hit 80% of budget, LiteLLM fires a Slack alert via the configured webhook. When they hit 100%, requests start failing with Budget Exceeded.
This single feature has saved us multiple times from runaway loops in someone's prototype that would otherwise have racked up four-figure OpenAI bills overnight.
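The same key lifecycle works from a script. A sketch, assuming the /key/generate and /key/info admin endpoints; exact response field names can vary slightly between LiteLLM versions:

import os

import requests

GATEWAY = "http://ai-gw.internal"
HEADERS = {"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"}

# Same call as the curl above, from Python
new_key = requests.post(
    f"{GATEWAY}/key/generate",
    headers=HEADERS,
    json={
        "models": ["fast-local", "smart-cloud"],
        "max_budget": 200,
        "budget_duration": "30d",
        "metadata": {"team": "growth", "owner": "alex@"},
    },
    timeout=10,
).json()["key"]

# Later: how much of the $200 has this key burned?
info = requests.get(
    f"{GATEWAY}/key/info", headers=HEADERS, params={"key": new_key}, timeout=10
).json()
print(info["info"]["spend"], "spent of", info["info"]["max_budget"])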
Hierarchy: Org → Team → Key
LiteLLM supports nested budgets. Set an org-wide cap of $5000/month, allocate $1500 to each team, and distribute keys within each team. Spend rolls up automatically.
# Create an organization
curl -X POST http://ai-gw.internal/organization/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"organization_alias": "engineering", "max_budget": 5000}'

# Create a team under it
curl -X POST http://ai-gw.internal/team/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"team_alias": "platform", "organization_id": "<org_id>", "max_budget": 1500}'
For the audit-trail story, see our local AI audit trail guide — LiteLLM logs every prompt and response keyed to the calling key, which is what you actually need for compliance.
Fallbacks, Retries, and Cooldowns {#fallbacks}
A real production gateway treats every upstream as flaky. Configure four layers of resilience.
1. In-deployment retries
If a single deployment returns 429 or 5xx, retry up to num_retries times with exponential backoff. Default is 2.
2. Cross-deployment fallback
If all retries fail, fall back to a different model. Define the fallback chain in router_settings.fallbacks. We use:
fallbacks:
  - { fast-local: ["heavy-local", "smart-cloud"] }
  - { heavy-local: ["smart-cloud"] }
  - { smart-cloud: ["heavy-local"] }   # cloud out? fall back to local
Yes, the last entry is correct. If OpenAI is down, fall back to your own GPU. We saw this play out for real on November 8, 2025 when both OpenAI and Anthropic had simultaneous degradations — local Ollama kept serving while every cloud-only competitor went dark.
3. Cooldowns
When a deployment fails, mark it cool for cooldown_time seconds and route around it. Without this, you will hammer a dying upstream and burn retries on a guaranteed-fail target.
cooldown_time: 30 # seconds
allowed_fails: 3 # trip after this many fails in cooldown_time window
4. Context-window-aware fallback
If a request exceeds the local model's context window, LiteLLM can automatically fall through to a model with a bigger window. Set context_window_fallbacks:
context_window_fallbacks:
  - { fast-local: ["smart-cloud"] }
A 32k-token request that would error on a 4k-context Ollama model now silently routes to GPT-4o instead.
Cost Tracking and Logging {#cost-tracking}
Wire LiteLLM to Postgres and you get per-request spend with no code in the apps. The schema is straightforward — every completed request lands in LiteLLM_SpendLogs with the calling key, model, prompt tokens, completion tokens, and computed cost.
-- Top 10 most expensive keys in the last 7 days
SELECT
  api_key,
  team_id,
  COUNT(*) AS calls,
  SUM(spend) AS dollars,
  SUM(prompt_tokens + completion_tokens) AS total_tokens
FROM "LiteLLM_SpendLogs"
WHERE "startTime" > NOW() - INTERVAL '7 days'
GROUP BY api_key, team_id
ORDER BY dollars DESC
LIMIT 10;
For exporting to existing observability stacks, configure the logging callbacks:
litellm_settings:
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["langfuse", "sentry"]
Langfuse gives you a per-trace UI with prompt/response inspection. Prometheus gives you metrics like litellm_total_tokens and litellm_request_duration_seconds for Grafana dashboards. We pair this with the Ollama Prometheus + Grafana setup for full-stack visibility.
Routing Strategies {#routing}
LiteLLM ships several routing strategies. Pick by your dominant constraint.
| Strategy | Best For | How It Decides |
|---|---|---|
| simple-shuffle | Even load distribution | Random pick across deployments with the same name |
| least-busy | Latency-sensitive | Routes to the deployment with the fewest active connections |
| usage-based-routing-v2 | RPM/TPM constraints | Picks the deployment furthest from its rate limit |
| latency-based-routing | P99 latency targets | Routes to the lowest 5-minute average latency |
| cost-based-routing | Save money | Picks the cheapest deployment that satisfies the request |
Our default is usage-based-routing-v2 — it considers both rate limits and current load. Switch to cost-based-routing if you have many provider tiers (e.g. Haiku/Sonnet/Opus) and want LiteLLM to always pick the cheapest sufficient model.
Tag-based routing
Route by request metadata. Add tags: ["pii"] to a request and force it to a local model:
router_settings:
  enable_tag_filtering: true

model_list:
  - model_name: fast-local
    litellm_params: { model: ollama/llama3.2:3b, api_base: ... }
    model_info: { tags: ["pii", "internal"] }
  - model_name: smart-cloud
    litellm_params: { model: openai/gpt-4o, api_key: ... }
    model_info: { tags: ["public"] }
Now any request from a service that tags itself pii will only ever route to local. This is how you enforce "no PII to cloud" as a config rule, not as a dev's discipline.
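On the client side, the tag rides along with the request. A sketch of how a PII-handling service would mark its calls, assuming tags are passed via request metadata as in recent LiteLLM releases (the virtual key is a placeholder):

from openai import OpenAI

# Placeholder virtual key issued to the PII-handling service
client = OpenAI(base_url="http://ai-gw.internal/v1", api_key="sk-litellm-pii-service")

resp = client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "Summarize this patient intake note: ..."}],
    # Tags travel in request metadata; with enable_tag_filtering the router
    # only considers deployments whose model_info.tags match.
    extra_body={"metadata": {"tags": ["pii"]}},
)
print(resp.choices[0].message.content)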
Benchmarks: LiteLLM Overhead Is Tiny {#benchmarks}
Tested on a c6i.xlarge (4 vCPU, 8 GB) with a 4-replica uvicorn LiteLLM behind nginx, against a local Ollama instance over a 1 Gbit private network.
| Test | Direct to Ollama | Via LiteLLM | Overhead |
|---|---|---|---|
| Median latency (small prompt) | 95 ms | 99 ms | 4 ms |
| P99 latency | 480 ms | 510 ms | 30 ms |
| Throughput (concurrent=20, llama3.2:3b) | 312 req/s | 305 req/s | 2.2% |
| CPU on gateway host @ 300 req/s | n/a | 38% | — |
| Memory on gateway host | n/a | 420 MB | — |
In other words, the proxy adds about 4 ms of unavoidable HTTP and routing overhead and 2-3% throughput cost. In exchange you get auth, budgets, fallbacks, and cost tracking. Worth it.
For comparison, here is the same load run against OpenAI directly versus through LiteLLM:
| Test | Direct to OpenAI | Via LiteLLM | Overhead |
|---|---|---|---|
| Median latency (gpt-4o-mini) | 412 ms | 416 ms | 4 ms |
| P99 latency | 1840 ms | 1880 ms | 40 ms |
| Reliability (24h, 1M requests) | 99.91% | 99.97% | +0.06% |
Reliability goes up because LiteLLM retries handle the small fraction of cloud-side blips that direct SDK calls would surface as user-facing errors.
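If you want to sanity-check the overhead on your own hardware, the measurement does not need a load-testing framework. A rough sketch that compares direct Ollama calls (via its OpenAI-compatible /v1 endpoint) against calls through the gateway; the hosts, model names, and keys are the ones assumed earlier in this guide:

import statistics
import time

from openai import OpenAI

def measure(base_url: str, api_key: str, model: str, n: int = 200) -> list[float]:
    """Fire n tiny chat requests and return per-request latency in milliseconds."""
    client = OpenAI(base_url=base_url, api_key=api_key)
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "hi"}],
            max_tokens=8,
        )
        samples.append((time.perf_counter() - t0) * 1000)
    return samples

direct = measure("http://10.0.1.20:11434/v1", "ollama", "llama3.2:3b")   # Ollama direct
via_gw = measure("http://localhost:4000/v1", "sk-1234", "fast-local")    # through LiteLLM

for name, s in [("direct", direct), ("gateway", via_gw)]:
    p99 = statistics.quantiles(s, n=100)[98]
    print(f"{name}: median={statistics.median(s):.0f} ms  p99={p99:.0f} ms")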
Common Production Pitfalls {#pitfalls}
1. Database is now in the critical path. When you enable spend tracking with Postgres, a slow database makes every LLM call slow. Use connection pooling (pgbouncer), keep the DB in the same VPC, and monitor pg_stat_activity for long queries.
2. Spend tracking lag. Costs are recorded asynchronously after the response. If the proxy crashes mid-write, you can under-count spend by a request. That is acceptable for most teams; if you cannot tolerate it, reconcile the LiteLLM_SpendLogs table against the provider's billing export on a schedule.
3. Streaming + cost tracking. For streamed responses, LiteLLM has to consume the entire stream to count tokens. If a client disconnects mid-stream, the spend record may be incomplete. Set forward_traceparent_to_llm_api: true and use OpenTelemetry to reconcile.
4. Master key in env files. The master_key grants full admin access. Never commit it. Use Vault, AWS Secrets Manager, or sealed secrets in K8s. Rotate every 90 days.
5. Cache hit attribution. When LiteLLM serves a response from cache, no upstream cost is incurred but the request still appears in logs. Filter on cache_hit = true when reconciling spend.
6. UI is optional, but worth it. litellm[proxy] ships a basic admin UI on /ui. Behind your VPN it is a fast way to add keys, view spend, and inspect failed requests without writing dashboards.
7. Pin the LiteLLM version. This project moves fast. New releases occasionally ship breaking changes to config schema. Pin in your Dockerfile: pip install 'litellm[proxy]==1.55.4' (or whatever version you tested).
For the deepest reference, the official LiteLLM Proxy documentation covers every flag, and the BerriAI/litellm GitHub repo is where new features land first.
Frequently Asked Questions {#faq}
Q: Is LiteLLM the same as the LiteLLM Python SDK?
The SDK (pip install litellm) is a unified client library you import in code. The proxy (pip install 'litellm[proxy]') is a standalone server. The proxy uses the SDK internally. For a multi-team gateway, you want the proxy.
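A sketch of the difference in practice, assuming both packages are installed: the SDK call runs inside your process, while the proxy call is a plain HTTP request to the gateway.

import litellm
from openai import OpenAI

# SDK: your process imports litellm and calls the provider directly
sdk_resp = litellm.completion(
    model="ollama/llama3.2:3b",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "hi"}],
)

# Proxy: your process is a plain OpenAI client; routing, keys, and
# budgets all live server-side in the gateway
proxy_client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-1234")
proxy_resp = proxy_client.chat.completions.create(
    model="fast-local",
    messages=[{"role": "user", "content": "hi"}],
)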
Q: How does LiteLLM handle Anthropic's different message format?
LiteLLM translates between OpenAI's chat completions format and each provider's native format under the hood. Your client sees OpenAI; Anthropic sees Anthropic. System messages, tool calls, and image inputs are mapped automatically.
Q: Can I use LiteLLM with Continue.dev or Cursor?
Yes. Both support OpenAI-compatible base URLs. Point them at http://ai-gw.internal/v1 with a virtual key. You get cost tracking and rate limits per developer. Pair with the Continue.dev + Ollama setup.
Q: What is the practical maximum throughput?
A single LiteLLM uvicorn worker handles ~500 req/s on cheap CPU. Scale horizontally with multiple replicas behind nginx. We have run a 6-replica deployment at 2400 req/s sustained on c6i.xlarge instances. The bottleneck becomes Postgres writes, not LiteLLM itself.
Q: Does LiteLLM support function calling and tool use?
Yes for any underlying model that supports it. Pass the OpenAI tools field; LiteLLM translates to Anthropic's tools, Mistral's tool_choice, etc. For models without native tool support, set drop_params: true so the field is silently stripped.
Q: How do I migrate from a LangChain app already using direct provider SDKs?
Replace each provider client with an OpenAI client pointed at LiteLLM. LangChain's ChatOpenAI works against the proxy unchanged — set openai_api_base and openai_api_key to your gateway's URL and a virtual key.
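A migration sketch, assuming the langchain_openai package; the gateway URL and virtual key below are placeholders:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="smart-cloud",                         # a gateway model_name, not a provider model id
    openai_api_base="http://ai-gw.internal/v1",  # the LiteLLM proxy
    openai_api_key="sk-litellm-your-team-key",   # a virtual key (placeholder value)
    temperature=0,
)

print(llm.invoke("One line: what does an AI gateway do?").content)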
Q: Can I run LiteLLM and Ollama on the same machine?
Yes for development. For production, separate them. LLM inference is GPU-bound; LiteLLM is CPU/network-bound. Combining them on one host makes sizing harder and one workload can starve the other.
Q: What about latency-sensitive on-device or edge cases?
For sub-50 ms total latency requirements, the gateway adds about 4 ms — usually fine. If even that is unacceptable, give the latency-critical service a direct connection to Ollama and only route everything else through LiteLLM.
Conclusion
LiteLLM turns a sprawl of direct provider SDKs into a single internal endpoint that knows about budgets, fallbacks, audit trails, and cost. The pattern works whether you are routing exclusively across local Ollama clusters or fanning out to a dozen cloud providers. We treat it the same way we treat nginx: install it once, configure it with care, and forget it exists most days.
The next step depends on your starting point. If you do not yet have a local model running, our Ollama production deployment guide is the right place to begin. If you have local serving but no integration in your apps, jump to adding local AI to an existing app. For multi-GPU scale-out behind the gateway, see multi-GPU Ollama setup.
Get more production playbooks in the LocalAIMaster newsletter — every week, real lessons from teams running private AI in production.