The Hybrid AI Architecture: Route Local + Cloud Intelligently
Published on April 11, 2026 · 20 min read
Most teams frame the AI infrastructure decision as a binary: run everything locally or send everything to cloud APIs. Both extremes are wrong for most workloads.
After deploying hybrid setups for three different production applications, I can tell you the pattern that works: route 85-95% of queries to local Ollama (fast, free, private) and send the remaining 5-15% to cloud APIs (GPT-4o, Claude) when you genuinely need them. The result is cloud-quality output at 80-95% lower cost, with sub-100ms first-token latency on most requests.
Here is the exact architecture, config files, and Docker Compose stack to build it.
Why Hybrid Beats Either-Or {#why-hybrid-beats-either-or}
All-cloud problems:
- Linear cost scaling ($1,800/month at 10K queries/day with GPT-4o)
- 500-2,000ms latency on every request
- All data leaves your network
- Rate limits at scale
- Single point of failure (API outages)
All-local problems:
- Limited or no multimodal support
- Context window limits (32K-128K vs 1M)
- No access to latest knowledge
- Complex reasoning still lags behind GPT-4o/Claude on edge cases
- Hardware failure = total outage
Hybrid solves both. Your application talks to one endpoint. Behind that endpoint, a router decides — in under 5ms — whether to send the query to the GPU under your desk or to OpenAI's data center.
The Architecture {#the-architecture}
┌─────────────────────────────┐
│ Your Application │
│ (OpenAI-compatible SDK) │
└─────────────┬───────────────┘
│
│ HTTP POST /v1/chat/completions
│
┌─────────────▼───────────────┐
│ LiteLLM Proxy │
│ (Router + Budget + Logs) │
│ │
│ ┌─────────────────────┐ │
│ │ Routing Engine │ │
│ │ │ │
│ │ model_name match? │ │
│ │ token_count check? │ │
│ │ budget remaining? │ │
│ │ health check ok? │ │
│ └──────┬──────┬───────┘ │
└─────────┼──────┼────────────┘
┌──────────┘ └──────────┐
│ │
┌────────────▼──────────┐ ┌────────────▼──────────┐
│ Ollama (Local) │ │ Cloud APIs │
│ │ │ │
│ Llama 3.3 70B │ │ GPT-4o (OpenAI) │
│ Qwen 2.5 72B │ │ Claude 4 (Anthropic) │
│ CodeLlama 34B │ │ Gemini (Google) │
│ Llama 3.2 8B │ │ │
│ │ │ (only when needed) │
│ ⚡ 50-150ms TTFT │ │ ☁️ 500-2000ms TTFT │
│ 🔒 Data stays local │ │ 💰 Pay per token │
└───────────────────────┘ └────────────────────────┘
The key insight: your application code does not know or care whether a request goes to Ollama or OpenAI. It uses the standard OpenAI SDK, points at http://localhost:4000 (LiteLLM), and gets back a response. All routing decisions happen inside the proxy.
Setting Up LiteLLM as the Router {#setting-up-litellm-as-the-router}
LiteLLM is the routing layer. It speaks the OpenAI API format on the frontend and translates to Ollama, Anthropic, Google, or any other provider on the backend.
Step 1: Install LiteLLM
# Option A: pip install (for development)
pip install 'litellm[proxy]'  # quotes keep zsh from globbing the brackets
# Option B: Docker (for production — recommended)
docker pull ghcr.io/berriai/litellm:main-latest
Step 2: Create the Config File
Save this as litellm_config.yaml:
# litellm_config.yaml — Hybrid routing configuration
model_list:
# LOCAL MODELS (Ollama) — fast, free, private
- model_name: "fast-local"
litellm_params:
model: "ollama/llama3.2:8b"
api_base: "http://ollama:11434"
stream: true
max_tokens: 4096
model_info:
description: "Fast local model for simple tasks"
- model_name: "strong-local"
litellm_params:
model: "ollama/llama3.3:70b-instruct-q4_K_M"
api_base: "http://ollama:11434"
stream: true
max_tokens: 8192
model_info:
description: "Strong local model for complex tasks"
- model_name: "code-local"
litellm_params:
model: "ollama/qwen2.5-coder:32b-instruct-q4_K_M"
api_base: "http://ollama:11434"
stream: true
max_tokens: 8192
# CLOUD MODELS — used for fallback and complex tasks
- model_name: "cloud-strong"
litellm_params:
model: "gpt-4o"
api_key: "os.environ/OPENAI_API_KEY"
max_tokens: 16384
- model_name: "cloud-fast"
litellm_params:
model: "gpt-4o-mini"
api_key: "os.environ/OPENAI_API_KEY"
max_tokens: 16384
- model_name: "cloud-reasoning"
litellm_params:
model: "claude-sonnet-4-20250514"
api_key: "os.environ/ANTHROPIC_API_KEY"
max_tokens: 8192
litellm_settings:
# Enable request logging
set_verbose: false
request_timeout: 120
# Fallback config
default_fallbacks: ["cloud-fast"]
context_window_fallbacks:
- strong-local: ["cloud-strong"]
- fast-local: ["cloud-fast"]
general_settings:
master_key: "sk-your-proxy-master-key"
database_url: "postgresql://litellm:password@postgres:5432/litellm"
# Budget management
max_budget: 200 # $200/month total cloud spend cap
budget_duration: "30d"
Step 3: Start the Proxy
# Development
litellm --config litellm_config.yaml --port 4000
# Production (Docker)
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml
Step 4: Test the Router
# Test local routing
curl http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-your-proxy-master-key" \
-H "Content-Type: application/json" \
-d '{
"model": "fast-local",
"messages": [{"role": "user", "content": "Classify this as positive or negative: Great product!"}]
}'
# Test cloud routing
curl http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-your-proxy-master-key" \
-H "Content-Type: application/json" \
-d '{
"model": "cloud-strong",
"messages": [{"role": "user", "content": "Analyze this complex legal document..."}]
}'
Your application just switches the model field. Everything else stays identical.
Configuring Routing Rules {#configuring-routing-rules}
The real power comes from intelligent routing. Here are the patterns that work in production:
Pattern 1: Route by Task Type
Your application tags each request with the intended task:
import openai
client = openai.OpenAI(
base_url="http://localhost:4000/v1",
api_key="sk-your-proxy-master-key"
)
def classify_text(text: str) -> str:
"""Simple task → local model"""
response = client.chat.completions.create(
model="fast-local", # Routes to Ollama llama3.2:8b
messages=[{"role": "user", "content": f"Classify sentiment: {text}"}],
max_tokens=10
)
return response.choices[0].message.content
def analyze_document(doc: str) -> str:
"""Complex task → cloud model"""
response = client.chat.completions.create(
model="cloud-strong", # Routes to GPT-4o
messages=[{"role": "user", "content": f"Provide detailed analysis: {doc}"}],
max_tokens=4096
)
return response.choices[0].message.content
def generate_code(prompt: str) -> str:
"""Code task → specialized local model"""
response = client.chat.completions.create(
model="code-local", # Routes to Qwen 2.5 Coder 32B
messages=[{"role": "user", "content": prompt}],
max_tokens=2048
)
return response.choices[0].message.content
Pattern 2: Route by Complexity Score
Estimate query complexity before routing:
def estimate_complexity(messages: list) -> str:
"""Score query complexity to pick the right model."""
total_tokens = sum(len(m["content"].split()) * 1.3 for m in messages)
has_code = any("code" in m["content"].lower() or
"```" in m["content"] for m in messages)
has_analysis = any(word in m["content"].lower()
for m in messages
for word in ["analyze", "compare", "evaluate", "explain why"])
multi_turn = len(messages) > 4
if total_tokens > 8000 or (has_analysis and has_code and multi_turn):
return "cloud-strong"
elif has_code:
return "code-local"
elif total_tokens > 2000 or has_analysis:
return "strong-local"
else:
return "fast-local"
def smart_query(messages: list) -> str:
model = estimate_complexity(messages)
response = client.chat.completions.create(
model=model,
messages=messages
)
return response.choices[0].message.content
Pattern 3: Route by Token Count
LiteLLM can automatically fall back when a request exceeds a model's context window:
# In litellm_config.yaml
litellm_settings:
context_window_fallbacks:
- fast-local: ["strong-local"] # 8B → 70B for longer context
- strong-local: ["cloud-strong"] # 70B → GPT-4o for very long context
If you send 50K tokens to fast-local (which has an 8K context in practice), LiteLLM automatically re-routes to strong-local. If that also fails, it goes to cloud-strong. Your application code handles zero of this.
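LiteLLM applies these fallbacks server-side, but you can also pre-check length on the client and skip a doomed request entirely. A minimal sketch, assuming the model names from the config above and rough context limits; the 4-characters-per-token ratio is a heuristic, not the real tokenizer:

```python
# Assumed context limits for the models defined in litellm_config.yaml.
CONTEXT_LIMITS = {
    "fast-local": 8_192,
    "strong-local": 32_768,
    "cloud-strong": 128_000,
}

# Order in which to escalate when a request is too long.
ESCALATION = ["fast-local", "strong-local", "cloud-strong"]

def estimate_tokens(messages: list) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def pick_model_for_length(messages: list, preferred: str = "fast-local") -> str:
    """Start at the preferred model and escalate until the context fits."""
    needed = estimate_tokens(messages)
    start = ESCALATION.index(preferred)
    for model in ESCALATION[start:]:
        if needed <= CONTEXT_LIMITS[model]:
            return model
    return ESCALATION[-1]  # nothing fits: send to the largest anyway
```

This duplicates what the proxy already does, so treat it as an optimization for latency-sensitive paths, not a replacement for the server-side fallback.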
Fallback Patterns {#fallback-patterns}
Hardware fails. Networks drop. GPUs overheat. The fallback configuration determines whether your users notice.
Health-Check Fallback
# litellm_config.yaml
router_settings:
routing_strategy: "latency-based-routing"
enable_pre_call_checks: true # Check model health before routing
# Health check settings
model_group_retry_policy:
fast-local:
timeout_seconds: 5 # If Ollama doesn't respond in 5s
num_retries: 1 # Retry once
fallbacks: ["cloud-fast"] # Then fall back to cloud
strong-local:
timeout_seconds: 30
num_retries: 1
fallbacks: ["cloud-strong"]
Cascading Fallback Chain
Request → fast-local (Ollama 8B)
├─ Success → Return response
└─ Fail (timeout/error) → strong-local (Ollama 70B)
├─ Success → Return response
└─ Fail → cloud-fast (GPT-4o-mini)
├─ Success → Return response
└─ Fail → Return error to client
This gives you three layers of redundancy. In six months of production use, I have had zero complete outages. Individual failures happen weekly (GPU thermal throttle, Ollama memory leak, OpenAI rate limit), but the cascade catches everything.
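The same cascade is easy to replicate in application code if you are not running the retry policy inside LiteLLM. A sketch with the call injected as a function, so it works with any client:

```python
# Client-side cascading fallback: try each model in order, return the
# first success, raise only when every layer fails. `call` is any
# function (model_name, messages) -> response, e.g. a thin wrapper
# around client.chat.completions.create.

CASCADE = ["fast-local", "strong-local", "cloud-fast"]

def query_with_cascade(call, messages, chain=CASCADE):
    last_error = None
    for model in chain:
        try:
            return call(model, messages)
        except Exception as e:  # timeout, connection error, 5xx, ...
            last_error = e      # remember it and try the next layer
    raise RuntimeError(f"all models in cascade failed: {last_error}")
```

Catching bare `Exception` is deliberate here: at this layer you care that the request failed, not why; the reason only matters in the final error.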
Full Docker Compose Stack {#full-docker-compose-stack}
This is the production-ready stack: Ollama + LiteLLM + Open WebUI + PostgreSQL (for LiteLLM logging).
# docker-compose.yaml
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=3
healthcheck:
      test: ["CMD", "ollama", "list"]  # the ollama image does not ship curl
interval: 30s
timeout: 10s
retries: 3
litellm:
image: ghcr.io/berriai/litellm:main-latest
container_name: litellm-proxy
restart: unless-stopped
ports:
- "4000:4000"
volumes:
- ./litellm_config.yaml:/app/config.yaml
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- DATABASE_URL=postgresql://litellm:litellm_pass@postgres:5432/litellm
command: --config /app/config.yaml --port 4000
depends_on:
ollama:
condition: service_healthy
postgres:
condition: service_healthy
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
environment:
- OPENAI_API_BASE_URL=http://litellm:4000/v1
- OPENAI_API_KEY=sk-your-proxy-master-key
- WEBUI_AUTH=true
depends_on:
- litellm
postgres:
image: postgres:16-alpine
container_name: litellm-db
restart: unless-stopped
environment:
- POSTGRES_DB=litellm
- POSTGRES_USER=litellm
- POSTGRES_PASSWORD=litellm_pass
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U litellm"]
interval: 10s
timeout: 5s
retries: 5
volumes:
ollama_data:
postgres_data:
Deploy the Stack
# Create .env file
cat > .env << 'ENVEOF'
OPENAI_API_KEY=sk-your-openai-key-here
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
ENVEOF
# Pull and start
docker compose pull
docker compose up -d
# Pull models into Ollama
docker exec ollama ollama pull llama3.2:8b
docker exec ollama ollama pull llama3.3:70b-instruct-q4_K_M
docker exec ollama ollama pull qwen2.5-coder:32b-instruct-q4_K_M
# Verify everything is running
docker compose ps
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-your-proxy-master-key"
Open WebUI at http://localhost:3000 now shows all your models — local and cloud — in a single dropdown. Users pick a model and the routing happens transparently.
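A quick way to verify the wiring from code is to list what the proxy exposes. This sketch uses the OpenAI SDK's `models.list()` against LiteLLM's `/v1/models` endpoint; `client` is an `openai.OpenAI` instance pointed at the LiteLLM base URL, as in the earlier examples:

```python
def available_models(client) -> list:
    """Model names the proxy currently exposes, local and cloud alike."""
    return sorted(m.id for m in client.models.list().data)

# With the config above, expect something like:
# ['cloud-fast', 'cloud-reasoning', 'cloud-strong',
#  'code-local', 'fast-local', 'strong-local']
```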
For more on the Ollama + Open WebUI portion of this stack, see our Docker setup guide.
Budget Caps and Cost Control {#budget-caps-and-cost-control}
The biggest risk with hybrid is cloud cost surprise. LiteLLM prevents this:
Set Monthly Budget
# litellm_config.yaml
general_settings:
max_budget: 200 # Hard cap: $200/month total cloud spend
budget_duration: "30d" # Reset every 30 days
Set Per-Model Budgets
model_list:
- model_name: "cloud-strong"
litellm_params:
model: "gpt-4o"
api_key: "os.environ/OPENAI_API_KEY"
model_info:
max_budget: 150 # Max $150/month for GPT-4o
budget_duration: "30d"
- model_name: "cloud-reasoning"
litellm_params:
model: "claude-sonnet-4-20250514"
api_key: "os.environ/ANTHROPIC_API_KEY"
model_info:
max_budget: 50 # Max $50/month for Claude
budget_duration: "30d"
Monitor Spend in Real-Time
# Check current spend
curl http://localhost:4000/global/spend/logs \
-H "Authorization: Bearer sk-your-proxy-master-key"
# Get spend by model
curl "http://localhost:4000/global/spend/logs?model=gpt-4o" \
-H "Authorization: Bearer sk-your-proxy-master-key"
When the budget is exhausted, LiteLLM returns a 429 error for cloud models. Your fallback config routes those requests to local models instead — degraded quality, but no surprise bills.
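If you want that degradation to be explicit in code rather than left to the fallback config, here is a sketch of the pattern. It checks the `status_code` attribute that openai-style exceptions carry; model names match the config above:

```python
# When the monthly cloud budget is exhausted, LiteLLM answers 429 for
# cloud models. This wrapper degrades to a local model instead of
# surfacing the error. `client` is any OpenAI-compatible client.

def query_with_budget_guard(client, messages,
                            cloud_model="cloud-strong",
                            local_model="strong-local"):
    try:
        resp = client.chat.completions.create(
            model=cloud_model, messages=messages)
    except Exception as e:
        # openai-style errors carry the HTTP status in .status_code
        if getattr(e, "status_code", None) == 429:
            resp = client.chat.completions.create(
                model=local_model, messages=messages)  # degraded, not down
        else:
            raise
    return resp
```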
Quality Monitoring {#quality-monitoring}
Running hybrid means you need to know when local quality is not good enough. Here is a lightweight monitoring setup:
A/B Logging
Send 1% of queries to both local and cloud, then compare:
import random
import json
from datetime import datetime
def monitored_query(messages: list, primary_model: str = "strong-local"):
"""Send to primary model. 1% of the time, also send to cloud for comparison."""
primary_response = client.chat.completions.create(
model=primary_model,
messages=messages
)
if random.random() < 0.01: # 1% sample rate
cloud_response = client.chat.completions.create(
model="cloud-strong",
messages=messages
)
log_entry = {
"timestamp": datetime.utcnow().isoformat(),
"prompt_preview": messages[-1]["content"][:200],
"local_response": primary_response.choices[0].message.content[:500],
"cloud_response": cloud_response.choices[0].message.content[:500],
"local_tokens": primary_response.usage.total_tokens,
"cloud_tokens": cloud_response.usage.total_tokens,
}
with open("quality_log.jsonl", "a") as f:
f.write(json.dumps(log_entry) + "\n")
return primary_response.choices[0].message.content
Weekly Quality Review
# Count quality log entries
wc -l quality_log.jsonl
# Extract for manual review (sample 10 random pairs)
# --json-lines (Python 3.8+) pretty-prints each JSONL record separately
shuf -n 10 quality_log.jsonl | python3 -m json.tool --json-lines
Review 10 pairs weekly. If local responses consistently lag on a specific task type, add a routing rule to send that task type to cloud. Over time, this tightens your routing rules to the optimal split.
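To make the weekly review less manual, a small script can surface the pairs where local and cloud diverge most. A sketch that uses response-length ratio as a cheap divergence proxy; a real review still needs human eyes:

```python
import json

def divergence_report(path="quality_log.jsonl", top_n=10):
    """Rank logged local/cloud pairs by response-length divergence.

    Length ratio is a crude proxy: large gaps often mean the local
    model truncated, refused, or rambled. It is a triage signal, not
    a quality metric.
    """
    entries = []
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            ratio = len(e["local_response"]) / max(len(e["cloud_response"]), 1)
            e["divergence"] = abs(1 - ratio)  # distance from parity
            entries.append(e)
    entries.sort(key=lambda e: e["divergence"], reverse=True)
    return entries[:top_n]
```

Run it before the weekly review and read the top ten instead of a random ten; the worst divergences are where new routing rules come from.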
Security: Sensitive Query Routing {#security-sensitive-query-routing}
For regulated industries, certain data must never leave your network.
import re
SENSITIVE_PATTERNS = [
r'\b\d{3}-\d{2}-\d{4}\b', # SSN
r'\b\d{16}\b', # Credit card
r'\b[A-Z]{2}\d{6,8}\b', # Passport
r'patient|diagnosis|medical record', # Medical terms
r'confidential|proprietary|nda', # Business sensitivity
]
def is_sensitive(text: str) -> bool:
    # Search the original text with IGNORECASE: lowercasing it first
    # would make the uppercase-only passport pattern unmatchable.
    # Matching lowercase letter pairs too errs on the side of caution.
    return any(re.search(pattern, text, re.IGNORECASE)
               for pattern in SENSITIVE_PATTERNS)
def secure_query(messages: list) -> str:
"""Route sensitive queries to local only. Never send to cloud."""
full_text = " ".join(m["content"] for m in messages)
if is_sensitive(full_text):
# Force local — no fallback to cloud
response = client.chat.completions.create(
model="strong-local",
messages=messages,
extra_body={"metadata": {"no_fallback": True}}
)
else:
response = client.chat.completions.create(
model=estimate_complexity(messages),
messages=messages
)
return response.choices[0].message.content
This is not optional for healthcare, finance, or legal applications. The hybrid architecture makes it trivial: sensitive data hits the GPU under your desk, everything else can use the cloud.
For a deeper dive into data privacy with local AI, see our Ollama production deployment guide.
Performance: Latency Comparison {#performance-latency-comparison}
Measured on our production hybrid stack (RTX 4090, Ollama 0.6.x, LiteLLM 1.57.x):
| Metric | Local (Llama 3.3 70B Q4) | Local (Llama 3.2 8B) | Cloud (GPT-4o) | Cloud (GPT-4o-mini) |
|---|---|---|---|---|
| Time to First Token | 80-150ms | 30-60ms | 500-2,000ms | 300-800ms |
| Tokens/sec | 25-35 | 80-120 | 50-80 | 80-120 |
| 400-token response | 12-16s | 3-5s | 5-8s + network | 3-5s + network |
| P99 TTFT | 200ms | 80ms | 3,000ms | 1,200ms |
| LiteLLM routing overhead | +2ms | +2ms | +2ms | +2ms |
Key takeaways:
- Local 8B model is faster than cloud on everything — first token and total completion. This is the sweet spot for high-volume simple tasks.
- Local 70B is slower on total completion but faster on first token. Good for streaming UIs where perceived speed matters.
- Cloud has unpredictable tail latency. P99 for GPT-4o is 3 seconds. For local models, it is 200ms. If you are building real-time features, local consistency matters.
- LiteLLM adds negligible overhead. 2ms routing delay is invisible.
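To reproduce these numbers on your own hardware, time the stream yourself. A sketch that measures TTFT and throughput through the proxy; chunk count stands in for token count, which is an approximation since chunks and tokens do not map 1:1 on every backend:

```python
import time

def measure_stream(client, model, messages):
    """Measure time-to-first-token and throughput for one request."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    for chunk in client.chat.completions.create(
            model=model, messages=messages, stream=True):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first chunk arrived
        chunks += 1
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    gen_time = end - (first_token_at or start)
    return {
        "ttft_ms": ttft * 1000,
        "tokens_per_sec": chunks / gen_time if gen_time > 0 else 0.0,
    }
```

Run it against `fast-local`, `strong-local`, and `cloud-strong` with the same prompt to build your own version of the table above.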
Scaling the Hybrid Stack {#scaling-the-hybrid-stack}
As traffic grows, scale each layer independently:
| Daily Queries | Local Hardware | Cloud Budget | LiteLLM Config |
|---|---|---|---|
| 1K-5K | Single RTX 3090/4090 | $50/month | Single instance |
| 5K-20K | Dual GPU or RTX 4090 | $100/month | Single instance |
| 20K-50K | 2x RTX 4090 server | $200/month | Single instance + Redis cache |
| 50K-200K | 4x GPU rack server | $500/month | 2 LiteLLM instances + load balancer |
| 200K+ | Multi-node cluster | $1,000/month | Kubernetes deployment |
At every scale, the hybrid approach costs 60-90% less than all-cloud while maintaining the same quality ceiling.
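That claim is plain arithmetic. A sketch of the cost model, with assumed per-query token counts and rough GPT-4o-class list prices; substitute your own numbers:

```python
def monthly_cloud_cost(queries_per_day: int, cloud_fraction: float,
                       in_tokens: int = 500, out_tokens: int = 400,
                       price_in: float = 2.50, price_out: float = 10.00) -> float:
    """Monthly cloud spend in USD.

    price_in / price_out are dollars per 1M tokens (rough GPT-4o-class
    list prices; check current pricing). Local queries are treated as
    zero marginal cost; electricity and hardware amortization are
    ignored here.
    """
    cloud_queries = queries_per_day * 30 * cloud_fraction
    per_query = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return cloud_queries * per_query

# All-cloud vs. a 90/10 hybrid split at 10K queries/day:
all_cloud = monthly_cloud_cost(10_000, 1.0)  # roughly $1,575/month
hybrid = monthly_cloud_cost(10_000, 0.1)     # roughly $158/month, a ~90% cut
```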
Putting It Together
The hybrid architecture is not theoretically elegant — it is pragmatically effective. You get local speed and privacy on the 90% of queries that do not need cloud capabilities, and you get GPT-4o/Claude quality on the 10% that do.
The setup takes about an hour with the Docker Compose stack above. The ongoing maintenance is minimal: update Ollama models monthly, review your quality logs weekly, and check your cloud spend dashboard. That is it.
Start by routing everything to local. Then add cloud routing only for the specific cases where local output is demonstrably worse. Most teams discover they need cloud for far fewer queries than they expected.
For the cost math behind this architecture, see our Ollama vs ChatGPT API cost breakdown. To build the underlying Ollama server, follow the production deployment guide.