Load Balancing Ollama with Nginx and HAProxy
Published April 23, 2026 • 19 min read
A single Ollama server saturates fast. One RTX 4090 with llama3.1:8b handles maybe 4-8 concurrent users before queue depth spikes. The fix is not "buy a bigger GPU" — it is horizontal scaling with a load balancer in front of multiple Ollama instances. This guide is the complete production playbook: working Nginx and HAProxy configs, when to use which, sticky session strategies for streaming, model-aware routing with LiteLLM, and benchmarks from a real 4-GPU rig under load.
I have been running this exact setup for an internal team of 22 people for 11 months. The configs below are what is in production today, on Ubuntu 24.04 + Ollama 0.5.7 + Nginx 1.26 + HAProxy 3.0.
Quick Start: 2-Backend Nginx Setup in 5 Minutes
Run two Ollama instances on different ports:
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=24h ollama serve &
OLLAMA_HOST=0.0.0.0:11435 OLLAMA_KEEP_ALIVE=24h ollama serve &
Drop this Nginx config at /etc/nginx/sites-available/ollama:
upstream ollama_backends {
least_conn;
server 127.0.0.1:11434 max_fails=3 fail_timeout=30s;
server 127.0.0.1:11435 max_fails=3 fail_timeout=30s;
}
server {
listen 8080;
location / {
proxy_pass http://ollama_backends;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_buffering off;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
}
}
Reload Nginx, point clients at port 8080, and you are load balanced. Five minutes, two backends, real production-shaped config. The rest of this guide is the engineering you do after the first incident.
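A quick sanity check before moving on — a minimal sketch that assumes a Debian/Ubuntu Nginx layout (sites-enabled symlinks) and that both backends already have llama3.1:8b pulled:
# Enable the site and reload Nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/ollama
sudo nginx -t && sudo systemctl reload nginx
# Hit the load balancer a few times; requests should spread across both backends
for i in 1 2 3 4; do
  curl -s http://localhost:8080/api/generate \
    -d '{"model":"llama3.1:8b","prompt":"ping","stream":false,"options":{"num_predict":1}}' \
    | head -c 120; echo
done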
Table of Contents
- Why Ollama Needs a Load Balancer
- Nginx vs HAProxy vs LiteLLM
- Production Nginx Configuration
- Production HAProxy Configuration
- Streaming and Sticky Sessions
- Model-Aware Routing with LiteLLM
- Health Checks Done Right
- TLS, Auth, and Rate Limiting
- Real Benchmarks (4× RTX 4090)
- Common Pitfalls
- FAQs
Why Ollama Needs a Load Balancer {#why}
A single Ollama instance is fundamentally bound by:
- VRAM: one model can run, maybe two if quantized.
- GPU compute: 4-8 concurrent generations max before TTFB degrades.
- Process model: single Go server, one node-local copy of the model weights.
You hit those limits faster than you think. With OLLAMA_NUM_PARALLEL=4 on an RTX 4090, the fifth concurrent request queues. On a Mac M2 with the same setting, even that is generous. A 22-person team easily generates 8-16 in-flight requests at peak.
Horizontal scaling — multiple Ollama instances behind a load balancer — sidesteps every one of these:
- Add a backend, double your capacity. No GPU swap.
- Mix model loadouts: some backends with 8B chat, others with 70B for special requests.
- Survive backend death. One node reboots, traffic routes around it.
- Cap request explosion: rate limit at the LB, not in every app.
Compared to scaling vertically (bigger GPU), a 4× RTX 4090 cluster ($6.4k of GPUs) outperforms a single H100 ($30k+) for parallel small-model serving by 2-3x. For mixed workloads with bursty concurrency, horizontal wins.
For the deeper deployment context, our Ollama production deployment guide covers the single-backend hardening, and Ollama on Kubernetes is the right step up once you outgrow systemd-managed VMs.
Nginx vs HAProxy vs LiteLLM {#comparison}
Three good options, different strengths:
| Feature | Nginx | HAProxy | LiteLLM Router |
|---|---|---|---|
| Setup complexity | Low | Medium | Low (Python) |
| Streaming support | Yes (with config) | Yes (native) | Yes |
| Least-connections | Yes | Yes | Yes |
| Model-aware routing | Hard | Hard | Native |
| Health checks | Passive + active | Active by default | Active |
| TLS termination | Yes | Yes | Use proxy in front |
| Rate limiting | limit_req module | stick-tables | Token-based per key |
| Observability | access_log + 3rd party | Native stats page | Built-in metrics |
| Resource usage | ~20 MB RAM | ~30 MB RAM | ~150 MB RAM |
| Best for | Generic LB + proxy | High concurrency, low latency | Fleet of mixed local + cloud LLMs |
My picks:
- Single team, 2-4 Ollama backends, simple routing → Nginx
- High concurrency, stricter latency SLOs, advanced health checks → HAProxy
- Multiple models with different VRAM footprints, or hybrid local+cloud → LiteLLM
You can also stack them: LiteLLM for model-aware routing, Nginx in front for TLS, auth, and rate limiting. That is what we run.
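The glue for that stack is one extra location block. A minimal sketch, assuming LiteLLM listens on 127.0.0.1:8000 (the port used in the LiteLLM section later) and a Debian-style Nginx layout — the snippet path and name are illustrative:
# Route the OpenAI-compatible /v1/ path to LiteLLM while Nginx keeps TLS, auth, and rate limiting.
# Then add `include /etc/nginx/snippets/litellm-route.conf;` inside the TLS server block shown below.
sudo tee /etc/nginx/snippets/litellm-route.conf > /dev/null <<'EOF'
location /v1/ {
    proxy_pass http://127.0.0.1:8000;   # LiteLLM router
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;                # keep token streaming intact
    proxy_read_timeout 600s;
}
EOF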
Production Nginx Configuration {#nginx}
The full /etc/nginx/sites-available/ollama we run in production:
upstream ollama_backends {
least_conn;
keepalive 32;
server 10.0.1.10:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.11:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.12:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.13:11434 max_fails=3 fail_timeout=30s;
}
# Rate limit zone: 60 req/min per key
limit_req_zone $http_authorization zone=ollama_per_key:10m rate=60r/m;
server {
listen 443 ssl http2;
server_name ollama.internal.example.com;
ssl_certificate /etc/letsencrypt/live/ollama.internal.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/ollama.internal.example.com/privkey.pem;
ssl_protocols TLSv1.3 TLSv1.2;
ssl_ciphers HIGH:!aNULL:!MD5;
# API key gate
if ($http_authorization !~ "^Bearer (sk-team-[a-z0-9]+)$") {
return 401;
}
# Rate limit (60/min per API key)
limit_req zone=ollama_per_key burst=20 nodelay;
limit_req_status 429;
# Streaming requirements
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_buffering off; # critical for token streaming
proxy_cache off;
proxy_read_timeout 600s; # long generations
proxy_send_timeout 600s;
proxy_request_buffering off; # don't buffer uploads (file inference)
# Health check endpoint (note: the server-level key gate above still applies here)
location = /health {
proxy_pass http://ollama_backends/;
access_log off;
}
# Main proxy
location / {
proxy_pass http://ollama_backends;
}
# Generous body size for image inputs (vision models)
client_max_body_size 50M;
}
server {
listen 80;
server_name ollama.internal.example.com;
return 301 https://$host$request_uri;
}
What every directive does
- least_conn — sends new request to backend with fewest in-flight. Critical for LLM workloads with variable response time.
- keepalive 32 — reuse upstream connections, saves TCP handshake on every request. Big win for streaming.
- max_fails=3 fail_timeout=30s — after 3 failures, mark backend down for 30s. Tune higher if you have flaky backends, lower for stricter health.
- proxy_buffering off — without this, Nginx buffers the entire response before sending to client. Streaming dies.
- proxy_http_version 1.1 + Connection "" — required for HTTP keepalive to upstream.
- proxy_read_timeout 600s — the generous one. Long 70B generations can take 5+ minutes.
- limit_req — per-API-key rate limit. Burst 20 means you can occasionally spike, sustained 60/min.
- client_max_body_size 50M — vision models accept image uploads. Default 1M will reject them.
Reload with nginx -t && systemctl reload nginx. Verify least_conn is working by tailing access logs and watching upstream_addr distribution under load.
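One way to watch that distribution — a sketch that assumes you add $upstream_addr to your log format (it is not in the default combined format; the log name and path here are illustrative):
# In nginx.conf (http block), log which backend served each request:
#   log_format upstreams '$remote_addr "$request" $status $upstream_addr $upstream_response_time';
#   access_log /var/log/nginx/ollama-upstreams.log upstreams;
# Then count requests per backend over the last 1,000 lines:
tail -n 1000 /var/log/nginx/ollama-upstreams.log \
  | awk '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn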
Production HAProxy Configuration {#haproxy}
HAProxy shines when you need stricter latency SLOs and more sophisticated health checks. The full /etc/haproxy/haproxy.cfg:
global
log /dev/log local0
log /dev/log local1 notice
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout connect 5s
timeout client 600s # long for streaming
timeout server 600s
timeout http-keep-alive 30s
timeout queue 60s
frontend ollama_frontend
bind *:443 ssl crt /etc/haproxy/ollama.pem
bind *:80
redirect scheme https if !{ ssl_fc }
# ACL: API key required
acl has_valid_key hdr_reg(authorization) -i ^Bearer\ sk-team-[a-z0-9]+$
http-request deny if !has_valid_key
# Rate limit: 60 requests per minute per API key
stick-table type string len 64 size 1m expire 1m store http_req_rate(60s)
http-request track-sc0 hdr(authorization)
http-request deny deny_status 429 if { sc_http_req_rate(0) gt 60 }
default_backend ollama_backend
backend ollama_backend
balance leastconn
option httpchk GET /
http-check expect status 200
default-server inter 5s fall 3 rise 2 maxconn 100
server ollama1 10.0.1.10:11434 check
server ollama2 10.0.1.11:11434 check
server ollama3 10.0.1.12:11434 check
server ollama4 10.0.1.13:11434 check
frontend stats
bind *:8404
stats enable
stats uri /stats
stats refresh 5s
stats admin if { src 10.0.0.0/8 }
Why HAProxy can be the right choice
- Active health checks by default — every 5 seconds (inter 5s), 3 failures to mark down (fall 3), 2 successes to bring back up (rise 2). No silent degradation.
- Built-in stick-tables for rate limiting without an external store like Redis.
- Native stats page at :8404/stats — see per-backend RPS, connection counts, response times in real time.
- Better connection draining during reloads — -sf flag to old PID does graceful handoff, in-flight requests complete on the old process.
- option redispatch — if a backend fails mid-request and we have retries left, redispatch to a different backend. Saves a lot of 5xx during rolling restarts.
The trade-off is config syntax that is more compact but less Googleable than Nginx. Once you grok it, HAProxy is rock-solid.
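A few day-two commands worth memorizing — validate before reload, reload gracefully, and pull live per-backend counters. The socat line assumes you add a runtime socket such as `stats socket /run/haproxy/admin.sock mode 660 level admin` to the global section (not shown in the config above) and that socat is installed:
# Validate the config before touching the running process
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
# Graceful reload: new process takes over, old one finishes in-flight requests
sudo systemctl reload haproxy
# Live per-backend counters: proxy, server, current sessions, status
echo "show stat" | sudo socat stdio /run/haproxy/admin.sock \
  | cut -d',' -f1,2,5,18 | column -s',' -t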
Streaming and Sticky Sessions {#streaming}
Ollama's /api/generate and /api/chat support streaming via NDJSON when stream: true. For load balancing, this matters in two ways:
1. Buffering must be off
Already covered in the Nginx config. In HAProxy, option http-server-close plus default buffering settings handle this fine — HAProxy does not buffer responses by default.
2. Sticky sessions are usually unnecessary
LLM generation is stateless on the Ollama side. Each request loads the model context from scratch (the messages array), generates, and returns. No backend-side conversation state.
You only need sticky sessions if:
- You are using Ollama's KV cache reuse for follow-up turns (rare, requires explicit prompt prefix matching).
- You are running per-backend speculative decoding caches that benefit from request locality.
- You have prompt caching at the proxy layer that is per-backend.
For 95% of teams: do not enable stickiness. Round-robin or least-connections is more efficient.
If you do need it, in Nginx:
upstream ollama_backends {
ip_hash; # client IP based stickiness
server 10.0.1.10:11434;
server 10.0.1.11:11434;
}
In HAProxy:
backend ollama_backend
balance source
hash-type consistent
server ollama1 10.0.1.10:11434 check
Note: sticky sessions can pin one user to a slow backend if they happen to land there during a model swap. Use only when you understand the trade-off.
Model-Aware Routing with LiteLLM {#litellm}
When backends do not all have the same models loaded, naive load balancing fails. Request for llama3.1:70b lands on a backend that only has llama3.1:8b — you get a slow on-demand pull, or worse, an error.
LiteLLM Router solves this. It knows which backends serve which models and routes accordingly:
pip install litellm[proxy]
config.yaml:
model_list:
# Backends 1-3 serve the 8B chat model
- model_name: chat
litellm_params:
model: ollama/llama3.1:8b
api_base: http://10.0.1.10:11434
- model_name: chat
litellm_params:
model: ollama/llama3.1:8b
api_base: http://10.0.1.11:11434
- model_name: chat
litellm_params:
model: ollama/llama3.1:8b
api_base: http://10.0.1.12:11434
# Backend 4 has the 70B model
- model_name: chat-large
litellm_params:
model: ollama/llama3.1:70b
api_base: http://10.0.1.13:11434
# Coding model on backends 1-2 only
- model_name: code
litellm_params:
model: ollama/qwen2.5-coder:14b
api_base: http://10.0.1.10:11434
- model_name: code
litellm_params:
model: ollama/qwen2.5-coder:14b
api_base: http://10.0.1.11:11434
router_settings:
routing_strategy: least-busy
num_retries: 2
timeout: 600
cooldown_time: 30
fallbacks:
- chat-large: ["chat"] # if 70B is down, fall back to 8B
litellm_settings:
set_verbose: false
drop_params: true
general_settings:
master_key: sk-master-key-rotate-me
database_url: "postgresql://litellm:pass@localhost/litellm"
Run it:
litellm --config config.yaml --port 8000
Clients use the OpenAI SDK pointed at http://lb-host:8000/v1 with model name chat, chat-large, or code. LiteLLM handles routing, retries, fallback, and per-key rate limiting. Bonus: same router can mix Ollama, OpenAI, Anthropic, and Bedrock backends — handy for hybrid local + cloud setups.
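From the client side it looks like any OpenAI-compatible endpoint. A minimal curl check — the key here is the master_key from the config above; a real deployment would issue per-user keys through LiteLLM instead:
curl -s http://lb-host:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-rotate-me" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chat",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'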
For the OpenAI-compatible API path see our private OpenAI-compatible API guide which covers the SDK-side setup.
Health Checks Done Right {#health-checks}
Three layers of health, increasing depth:
1. TCP check (always on)
Both Nginx and HAProxy mark a backend down if the TCP connect fails. Good baseline, terrible signal — Ollama can answer TCP while being completely deadlocked on a stuck inference.
2. HTTP check (recommended)
GET / returns "Ollama is running" with HTTP 200 when the server is alive and the listener thread is responsive.
In Nginx:
location = /healthz {
proxy_pass http://127.0.0.1:11434/;
access_log off;
}
In HAProxy:
option httpchk GET /
http-check expect status 200
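You can see exactly what the checker sees by hitting a backend directly:
curl -si http://10.0.1.10:11434/ | head -n 1   # expect: HTTP/1.1 200 OK
curl -s  http://10.0.1.10:11434/               # body: "Ollama is running"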
3. Inference smoke test (paranoid mode)
Periodic external probe that actually calls /api/generate with a tiny prompt:
#!/bin/bash
# /usr/local/bin/ollama-deep-check.sh
RESPONSE=$(curl -s -m 10 -X POST http://localhost:11434/api/generate \
-d '{"model":"llama3.1:8b","prompt":"ping","stream":false,"options":{"num_predict":1}}')
if [[ "$RESPONSE" == *"response"* ]]; then
exit 0
fi
exit 1
Run it from the LB host every 60 seconds, mark backend down on 3 consecutive failures. We do this in addition to the HTTP check — caught a deadlocked Ollama instance twice that the basic check missed.
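One way to wire that up against HAProxy — a sketch only, assuming a runtime socket is enabled in the global section (`stats socket /run/haproxy/admin.sock mode 660 level admin`), socat is installed, and using an illustrative script name:
#!/bin/bash
# /usr/local/bin/ollama-deep-check-all.sh
# Cron: * * * * * root /usr/local/bin/ollama-deep-check-all.sh
# Smoke-tests every backend; after 3 consecutive failures, disables that server
# via the HAProxy runtime API, and re-enables it once the check passes again.
declare -A SERVERS=( [ollama1]=10.0.1.10 [ollama2]=10.0.1.11 [ollama3]=10.0.1.12 [ollama4]=10.0.1.13 )
STATE_DIR=/var/run/ollama-checks
SOCK=/run/haproxy/admin.sock
mkdir -p "$STATE_DIR"
for name in "${!SERVERS[@]}"; do
  host=${SERVERS[$name]}
  if curl -s -m 10 -X POST "http://${host}:11434/api/generate" \
       -d '{"model":"llama3.1:8b","prompt":"ping","stream":false,"options":{"num_predict":1}}' \
     | grep -q '"response"'; then
    echo 0 > "$STATE_DIR/$name"
    echo "enable server ollama_backend/$name" | socat stdio "$SOCK" > /dev/null
  else
    fails=$(( $(cat "$STATE_DIR/$name" 2>/dev/null || echo 0) + 1 ))
    echo "$fails" > "$STATE_DIR/$name"
    (( fails >= 3 )) && echo "disable server ollama_backend/$name" | socat stdio "$SOCK" > /dev/null
  fi
done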
TLS, Auth, and Rate Limiting {#security}
TLS termination
Always terminate TLS at the LB, not at Ollama (Ollama has no TLS support). Use Let's Encrypt via certbot or cert-manager:
sudo certbot certonly --nginx -d ollama.internal.example.com
sudo systemctl enable certbot.timer # auto-renewal
API key auth
Three patterns from simple to robust:
Inline allowlist (small teams):
if ($http_authorization !~ "^Bearer (sk-team-eng|sk-team-product)$") {
return 401;
}
Map-based (cleaner for many keys):
map $http_authorization $valid_key {
default 0;
"Bearer sk-team-eng-2026" 1;
"Bearer sk-team-product-2026" 1;
}
server {
if ($valid_key = 0) { return 401; }
}
External auth service (real production): OAuth2-proxy or Pomerium ext-auth filter. Validates JWTs, supports key rotation without Nginx reload, can integrate with your IdP.
Rate limiting
Already shown above. Key choices:
- Per-API-key — fair across users, depends on key being present.
- Per-IP — fallback for unauthenticated traffic (probably should not exist for Ollama).
- Per-route — protect expensive endpoints like /api/generate separately from cheap ones like /api/tags.
For deeper rate limiting (token-based, per-model quotas), LiteLLM has it built in; Nginx and HAProxy do request-rate limiting only. A per-route sketch follows below.
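Here is that per-route sketch for Nginx — separate zones so a burst of cheap /api/tags calls cannot eat the /api/generate budget. Zone names, rates, and the conf.d path are illustrative, and conf.d being included in the http block assumes a Debian/Ubuntu layout:
# Zones go in the http block (conf.d is included there on Debian/Ubuntu layouts)
sudo tee /etc/nginx/conf.d/ollama-zones.conf > /dev/null <<'EOF'
limit_req_zone $http_authorization zone=ollama_generate:10m rate=30r/m;
limit_req_zone $http_authorization zone=ollama_cheap:10m rate=300r/m;
EOF
# Then, inside the server block, one location per route:
#   location /api/generate { limit_req zone=ollama_generate burst=10 nodelay; proxy_pass http://ollama_backends; }
#   location /api/tags     { limit_req zone=ollama_cheap    burst=50 nodelay; proxy_pass http://ollama_backends; }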
Real Benchmarks (4× RTX 4090) {#benchmarks}
Test rig:
- 4 nodes, each Ryzen 9 7950X + RTX 4090 24GB + 64GB DDR5 + NVMe
- Nginx 1.26 LB on a separate Ryzen 7 box
- Ollama 0.5.7, llama3.1:8b Q4_K_M, OLLAMA_NUM_PARALLEL=4 each
- Workload: 16 concurrent users, mixed prompt lengths (50-2000 tokens in)
| Metric | 1 backend | 2 backends | 4 backends |
|---|---|---|---|
| Aggregate throughput (tok/s) | 280 | 560 | 1,150 |
| TTFB p50 | 320 ms | 280 ms | 240 ms |
| TTFB p95 | 1,400 ms | 920 ms | 1,180 ms |
| Request queue depth (max) | 12 | 6 | 2 |
| Errors (5xx + timeouts) | 8/1000 | 1/1000 | 0/1000 |
| GPU utilization avg | 91% | 78% | 64% |
Scaling is roughly linear up to 4 backends for this workload. Beyond that you start seeing the LB itself become a bottleneck around 2,500 RPS — that is when you split into multiple LB instances behind DNS round-robin.
For the 70B model:
| Backends | Throughput | TTFB p50 | Notes |
|---|---|---|---|
| 1 (×4090) | does not fit | — | Need 48GB+ |
| 1 (×A6000 48GB) | 38 tok/s | 1,800 ms | 70B Q4 fits |
| 4 (×A6000 48GB) | 145 tok/s | 1,400 ms | 4 separate model copies |
For 70B+ scale, vLLM or sglang with tensor parallelism beats Ollama horizontal scaling. Ollama wins for "many small models, many users."
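If you want to reproduce a similar load shape against your own stack, a minimal concurrency harness looks like this — a sketch, not the exact benchmark client used for the numbers above:
# 200 requests, 16-way concurrency, streaming on so time_starttransfer approximates TTFB
cat > /tmp/bench-payload.json <<'EOF'
{"model":"llama3.1:8b","prompt":"Summarize load balancing in one sentence.","stream":true}
EOF
seq 200 | xargs -P 16 -I{} curl -s -o /dev/null \
    -w "%{time_starttransfer} %{time_total}\n" \
    -d @/tmp/bench-payload.json http://localhost:8080/api/generate \
  | awk '{ttfb+=$1; tot+=$2; n++} END {printf "n=%d  avg TTFB %.2fs  avg total %.2fs\n", n, ttfb/n, tot/n}'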
Common Pitfalls {#pitfalls}
1. Round-robin instead of least-connections. LLM request times vary 50x. Round-robin overloads slow backends. Always least_conn / leastconn.
2. proxy_buffering on (default in Nginx). Streaming dies. Set off.
3. 60-second default timeouts. Long generations fail mid-response. Bump to 600s.
4. Forgetting OLLAMA_HOST=0.0.0.0. Ollama binds to localhost by default. The LB cannot reach it from another host.
5. No keep_alive on backends. Models unload, every burst pays cold-start cost. OLLAMA_KEEP_ALIVE=24h everywhere.
6. TCP-only health checks. Ollama can hang while accepting TCP connections. Use HTTP checks at minimum.
7. Same model name, different versions across backends. Pin tags (llama3.1:8b-instruct-q4_K_M not llama3.1) and verify with ollama list on every backend — a quick cross-backend check follows after this list.
8. Missing client_max_body_size. Vision and file inputs get rejected with 413. Default 1MB is way too small.
9. Reloading Nginx during peak. systemctl reload nginx is graceful but restart is not. Always reload, never restart, during business hours.
10. No observability. Without per-backend metrics (RPS, p95, error rate), you cannot tell which backend is misbehaving. Add Prometheus + Grafana — the Ollama production deployment guide has dashboards.
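For pitfall 7, a quick cross-backend check via the API — backend IPs match the configs above, and jq is assumed to be installed:
# Print model name + digest from every backend; mismatching digests mean drift
for host in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "== $host =="
  curl -s "http://${host}:11434/api/tags" \
    | jq -r '.models[] | "\(.name)  \(.digest[0:12])"' | sort
done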
For the broader infra picture, the Nginx load balancing docs cover additional algorithms, and the HAProxy load balancing strategies guide goes deeper on the algorithm comparison.
Conclusion
Load balancing Ollama is the difference between a hobby project and a real shared service. Two backends behind Nginx with least-connections gets you 90% of the way. Add HAProxy or LiteLLM when you need stricter SLOs or model-aware routing. Layer in TLS, API keys, rate limits, and proper health checks and you have an internal AI gateway your team can actually depend on.
The thing nobody tells you up front: most of the work is not the LB config. It is the operational hygiene around it — pre-warming new backends, handling rolling restarts without dropping streams, monitoring per-backend latency so you catch a degrading GPU before users do. The configs above are battle-tested. The rest is your team's discipline.
When this stops being enough — usually around 6+ backends and 50+ users — Kubernetes is the natural next step. The Ollama on Kubernetes guide picks up exactly where this one ends, with StatefulSets, KEDA autoscaling, and ingress that handles all of the above more declaratively.
Want the next infrastructure deep dive (multi-region failover, GPU-aware Envoy filters, request-level cost attribution)? Subscribe to the Local AI Master newsletter — production playbooks, weekly.