Load Balancing Ollama with Nginx and HAProxy
Published April 23, 2026 • 19 min read
A single Ollama server saturates fast. One RTX 4090 with llama3.1:8b handles maybe 4-8 concurrent users before queue depth spikes. The fix is not "buy a bigger GPU" — it is horizontal scaling with a load balancer in front of multiple Ollama instances. This guide is the complete production playbook: working Nginx and HAProxy configs, when to use which, sticky session strategies for streaming, model-aware routing with LiteLLM, and benchmarks from a real 4-GPU rig under load.
I have been running this exact setup for an internal team of 22 people for 11 months. The configs below are what is in production today, on Ubuntu 24.04 + Ollama 0.5.7 + Nginx 1.26 + HAProxy 3.0.
Quick Start: 2-Backend Nginx Setup in 5 Minutes
Run two Ollama instances on different ports:
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=24h ollama serve &
OLLAMA_HOST=0.0.0.0:11435 OLLAMA_KEEP_ALIVE=24h ollama serve &
Drop this Nginx config at /etc/nginx/sites-available/ollama:
upstream ollama_backends {
least_conn;
server 127.0.0.1:11434 max_fails=3 fail_timeout=30s;
server 127.0.0.1:11435 max_fails=3 fail_timeout=30s;
}
server {
listen 8080;
location / {
proxy_pass http://ollama_backends;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_buffering off;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
}
}
Reload Nginx, point clients at port 8080, and you are load balanced. Five minutes, two backends, real production-shaped config. The rest of this guide is the engineering you do after the first incident.
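A quick sanity check before moving on — a minimal sketch that assumes a Debian/Ubuntu Nginx layout (sites-enabled symlinks) and that both backends already have llama3.1:8b pulled:
# Enable the site and reload Nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/ollama
sudo nginx -t && sudo systemctl reload nginx
# Hit the load balancer a few times; requests should spread across both backends
for i in 1 2 3 4; do
  curl -s http://localhost:8080/api/generate \
    -d '{"model":"llama3.1:8b","prompt":"ping","stream":false,"options":{"num_predict":1}}' \
    | head -c 120; echo
done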
Table of Contents
- Why Ollama Needs a Load Balancer
- Nginx vs HAProxy vs LiteLLM
- Production Nginx Configuration
- Production HAProxy Configuration
- Streaming and Sticky Sessions
- Model-Aware Routing with LiteLLM
- Health Checks Done Right
- TLS, Auth, and Rate Limiting
- Real Benchmarks (4× RTX 4090)
- Common Pitfalls
- FAQs
Why Ollama Needs a Load Balancer {#why}
A single Ollama instance is fundamentally bound by:
- VRAM: one model can run, maybe two if quantized.
- GPU compute: 4-8 concurrent generations max before TTFB degrades.
- Process model: single Go server, one node-local copy of the model weights.
You hit those limits faster than you think. With OLLAMA_NUM_PARALLEL=4 on an RTX 4090, the fifth concurrent request queues. On a Mac M2 with the same setting, even that is generous. A 22-person team easily generates 8-16 in-flight requests at peak.
Horizontal scaling — multiple Ollama instances behind a load balancer — sidesteps every one of these:
- Add a backend, double your capacity. No GPU swap.
- Mix model loadouts: some backends with 8B chat, others with 70B for special requests.
- Survive backend death. One node reboots, traffic routes around it.
- Cap request explosion: rate limit at the LB, not in every app.
Compared to scaling vertically (bigger GPU), a 4× RTX 4090 cluster ($6.4k of GPUs) outperforms a single H100 ($30k+) for parallel small-model serving by 2-3x. For mixed workloads with bursty concurrency, horizontal wins.
For the deeper deployment context, our Ollama production deployment guide covers the single-backend hardening, and Ollama on Kubernetes is the right step up once you outgrow systemd-managed VMs.
Nginx vs HAProxy vs LiteLLM {#comparison}
Three good options, different strengths:
| Feature | Nginx | HAProxy | LiteLLM Router |
|---|---|---|---|
| Setup complexity | Low | Medium | Low (Python) |
| Streaming support | Yes (with config) | Yes (native) | Yes |
| Least-connections | Yes | Yes | Yes |
| Model-aware routing | Hard | Hard | Native |
| Health checks | Passive + active | Active by default | Active |
| TLS termination | Yes | Yes | Use proxy in front |
| Rate limiting | limit_req module | stick-tables | Token-based per key |
| Observability | access_log + 3rd party | Native stats page | Built-in metrics |
| Resource usage | ~20 MB RAM | ~30 MB RAM | ~150 MB RAM |
| Best for | Generic LB + proxy | High concurrency, low latency | Fleet of mixed local + cloud LLMs |
My picks:
- Single team, 2-4 Ollama backends, simple routing → Nginx
- High concurrency, stricter latency SLOs, advanced health checks → HAProxy
- Multiple models with different VRAM footprints, or hybrid local+cloud → LiteLLM
You can also stack them: LiteLLM for model-aware routing, Nginx in front for TLS, auth, and rate limiting. That is what we run.
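The glue for that stack is one extra location block. A minimal sketch, assuming LiteLLM listens on 127.0.0.1:8000 (the port used in the LiteLLM section later) and a Debian-style Nginx layout — the snippet path and name are illustrative:
# Route the OpenAI-compatible /v1/ path to LiteLLM while Nginx keeps TLS, auth, and rate limiting.
# Then add `include /etc/nginx/snippets/litellm-route.conf;` inside the TLS server block shown below.
sudo tee /etc/nginx/snippets/litellm-route.conf > /dev/null <<'EOF'
location /v1/ {
    proxy_pass http://127.0.0.1:8000;   # LiteLLM router
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;                # keep token streaming intact
    proxy_read_timeout 600s;
}
EOF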
Production Nginx Configuration {#nginx}
The full /etc/nginx/sites-available/ollama we run in production:
upstream ollama_backends {
least_conn;
keepalive 32;
server 10.0.1.10:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.11:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.12:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.13:11434 max_fails=3 fail_timeout=30s;
}
# Rate limit zone: 60 req/min per key
limit_req_zone $http_authorization zone=ollama_per_key:10m rate=60r/m;
server {
listen 443 ssl http2;
server_name ollama.internal.example.com;
ssl_certificate /etc/letsencrypt/live/ollama.internal.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/ollama.internal.example.com/privkey.pem;
ssl_protocols TLSv1.3 TLSv1.2;
ssl_ciphers HIGH:!aNULL:!MD5;
# API key gate
if ($http_authorization !~ "^Bearer (sk-team-[a-z0-9]+)$") {
return 401;
}
# Rate limit (60/min per API key)
limit_req zone=ollama_per_key burst=20 nodelay;
limit_req_status 429;
# Streaming requirements
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_buffering off; # critical for token streaming
proxy_cache off;
proxy_read_timeout 600s; # long generations
proxy_send_timeout 600s;
proxy_request_buffering off; # don't buffer uploads (file inference)
# Health check endpoint (note: the server-level key gate above still applies here)
location = /health {
proxy_pass http://ollama_backends/;
access_log off;
}
# Main proxy
location / {
proxy_pass http://ollama_backends;
}
# Generous body size for image inputs (vision models)
client_max_body_size 50M;
}
server {
listen 80;
server_name ollama.internal.example.com;
return 301 https://$host$request_uri;
}
What every directive does
- least_conn — sends new request to backend with fewest in-flight. Critical for LLM workloads with variable response time.
- keepalive 32 — reuse upstream connections, saves TCP handshake on every request. Big win for streaming.
- max_fails=3 fail_timeout=30s — after 3 failures, mark backend down for 30s. Tune higher if you have flaky backends, lower for stricter health.
- proxy_buffering off — without this, Nginx buffers the entire response before sending to client. Streaming dies.
- proxy_http_version 1.1 + Connection "" — required for HTTP keepalive to upstream.
- proxy_read_timeout 600s — the generous one. Long 70B generations can take 5+ minutes.
- limit_req — per-API-key rate limit. Burst 20 means you can occasionally spike, sustained 60/min.
- client_max_body_size 50M — vision models accept image uploads. Default 1M will reject them.
Reload with nginx -t && systemctl reload nginx. Verify least_conn is working by tailing access logs and watching upstream_addr distribution under load.
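One way to watch that distribution — a sketch that assumes you add $upstream_addr to your log format (it is not in the default combined format; the log name and path here are illustrative):
# In nginx.conf (http block), log which backend served each request:
#   log_format upstreams '$remote_addr "$request" $status $upstream_addr $upstream_response_time';
#   access_log /var/log/nginx/ollama-upstreams.log upstreams;
# Then count requests per backend over the last 1,000 lines:
tail -n 1000 /var/log/nginx/ollama-upstreams.log \
  | awk '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn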
Production HAProxy Configuration {#haproxy}
HAProxy shines when you need stricter latency SLOs and more sophisticated health checks. The full /etc/haproxy/haproxy.cfg:
global
log /dev/log local0
log /dev/log local1 notice
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout connect 5s
timeout client 600s # long for streaming
timeout server 600s
timeout http-keep-alive 30s
timeout queue 60s
frontend ollama_frontend
bind *:443 ssl crt /etc/haproxy/ollama.pem
bind *:80
redirect scheme https if !{ ssl_fc }
# ACL: API key required
acl has_valid_key hdr_reg(authorization) -i ^Bearer\ sk-team-[a-z0-9]+$
http-request deny if !has_valid_key
# Rate limit: 60 requests per minute per API key
stick-table type string len 64 size 1m expire 1m store http_req_rate(60s)
http-request track-sc0 hdr(authorization)
http-request deny deny_status 429 if { sc_http_req_rate(0) gt 60 }
default_backend ollama_backend
backend ollama_backend
balance leastconn
option httpchk GET /
http-check expect status 200
default-server inter 5s fall 3 rise 2 maxconn 100
server ollama1 10.0.1.10:11434 check
server ollama2 10.0.1.11:11434 check
server ollama3 10.0.1.12:11434 check
server ollama4 10.0.1.13:11434 check
frontend stats
bind *:8404
stats enable
stats uri /stats
stats refresh 5s
stats admin if { src 10.0.0.0/8 }
Why HAProxy can be the right choice
- Active health checks by default — every 5 seconds (inter 5s), 3 failures to mark down (fall 3), 2 successes to bring back up (rise 2). No silent degradation.
- Built-in stick-tables for rate limiting without an external store like Redis.
- Native stats page at :8404/stats — see per-backend RPS, connection counts, response times in real time.
- Better connection draining during reloads — -sf flag to old PID does graceful handoff, in-flight requests complete on the old process.
- option redispatch — if a backend fails mid-request and we have retries left, redispatch to a different backend. Saves a lot of 5xx during rolling restarts.
The trade-off is config syntax that is more compact but less Googleable than Nginx. Once you grok it, HAProxy is rock-solid.
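A few day-two commands worth memorizing — validate before reload, reload gracefully, and pull live per-backend counters. The socat line assumes you add a runtime socket such as `stats socket /run/haproxy/admin.sock mode 660 level admin` to the global section (not shown in the config above) and that socat is installed:
# Validate the config before touching the running process
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
# Graceful reload: new process takes over, old one finishes in-flight requests
sudo systemctl reload haproxy
# Live per-backend counters: proxy, server, current sessions, status
echo "show stat" | sudo socat stdio /run/haproxy/admin.sock \
  | cut -d',' -f1,2,5,18 | column -s',' -t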
Streaming and Sticky Sessions {#streaming}
Ollama's /api/generate and /api/chat support streaming via NDJSON when stream: true. For load balancing, this matters in two ways:
1. Buffering must be off
Already covered in the Nginx config. In HAProxy, option http-server-close plus default buffering settings handle this fine — HAProxy does not buffer responses by default.
2. Sticky sessions are usually unnecessary
LLM generation is stateless on the Ollama side. Each request loads the model context from scratch (the messages array), generates, and returns. No backend-side conversation state.
You only need sticky sessions if:
- You are using Ollama's KV cache reuse for follow-up turns (rare, requires explicit prompt prefix matching).
- You are running per-backend speculative decoding caches that benefit from request locality.
- You have prompt caching at the proxy layer that is per-backend.
For 95% of teams: do not enable stickiness. Round-robin or least-connections is more efficient.
If you do need it, in Nginx:
upstream ollama_backends {
ip_hash; # client IP based stickiness
server 10.0.1.10:11434;
server 10.0.1.11:11434;
}
In HAProxy:
backend ollama_backend
balance source
hash-type consistent
server ollama1 10.0.1.10:11434 check
Note: sticky sessions can pin one user to a slow backend if they happen to land there during a model swap. Use only when you understand the trade-off.
Model-Aware Routing with LiteLLM {#litellm}
When backends do not all have the same models loaded, naive load balancing fails. Request for llama3.1:70b lands on a backend that only has llama3.1:8b — you get a slow on-demand pull, or worse, an error.
LiteLLM Router solves this. It knows which backends serve which models and routes accordingly:
pip install litellm[proxy]
config.yaml:
model_list:
# Backends 1-3 serve the 8B chat model
- model_name: chat
litellm_params:
model: ollama/llama3.1:8b
api_base: http://10.0.1.10:11434
- model_name: chat
litellm_params:
model: ollama/llama3.1:8b
api_base: http://10.0.1.11:11434
- model_name: chat
litellm_params:
model: ollama/llama3.1:8b
api_base: http://10.0.1.12:11434
# Backend 4 has the 70B model
- model_name: chat-large
litellm_params:
model: ollama/llama3.1:70b
api_base: http://10.0.1.13:11434
# Coding model on backends 1-2 only
- model_name: code
litellm_params:
model: ollama/qwen2.5-coder:14b
api_base: http://10.0.1.10:11434
- model_name: code
litellm_params:
model: ollama/qwen2.5-coder:14b
api_base: http://10.0.1.11:11434
router_settings:
routing_strategy: least-busy
num_retries: 2
timeout: 600
cooldown_time: 30
fallbacks:
- chat-large: ["chat"] # if 70B is down, fall back to 8B
litellm_settings:
set_verbose: false
drop_params: true
general_settings:
master_key: sk-master-key-rotate-me
database_url: "postgresql://litellm:pass@localhost/litellm"
Run it:
litellm --config config.yaml --port 8000
Clients use the OpenAI SDK pointed at http://lb-host:8000/v1 with model name chat, chat-large, or code. LiteLLM handles routing, retries, fallback, and per-key rate limiting. Bonus: same router can mix Ollama, OpenAI, Anthropic, and Bedrock backends — handy for hybrid local + cloud setups.
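From the client side it looks like any OpenAI-compatible endpoint. A minimal curl check — the key here is the master_key from the config above; a real deployment would issue per-user keys through LiteLLM instead:
curl -s http://lb-host:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-master-key-rotate-me" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "chat",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'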
For the OpenAI-compatible API path see our private OpenAI-compatible API guide which covers the SDK-side setup.
Health Checks Done Right {#health-checks}
Three layers of health, increasing depth:
1. TCP check (always on)
Both Nginx and HAProxy mark a backend down if the TCP connect fails. Good baseline, terrible signal — Ollama can answer TCP while being completely deadlocked on a stuck inference.
2. HTTP check (recommended)
GET / returns "Ollama is running" with HTTP 200 when the server is alive and the listener thread is responsive.
In Nginx:
location = /healthz {
proxy_pass http://127.0.0.1:11434/;
access_log off;
}
In HAProxy:
option httpchk GET /
http-check expect status 200
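You can see exactly what the checker sees by hitting a backend directly:
curl -si http://10.0.1.10:11434/ | head -n 1   # expect: HTTP/1.1 200 OK
curl -s  http://10.0.1.10:11434/               # body: "Ollama is running"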
3. Inference smoke test (paranoid mode)
Periodic external probe that actually calls /api/generate with a tiny prompt:
#!/bin/bash
# /usr/local/bin/ollama-deep-check.sh
RESPONSE=$(curl -s -m 10 -X POST http://localhost:11434/api/generate \
-d '{"model":"llama3.1:8b","prompt":"ping","stream":false,"options":{"num_predict":1}}')
if [[ "$RESPONSE" == *"response"* ]]; then
exit 0
fi
exit 1
Run it from the LB host every 60 seconds, mark backend down on 3 consecutive failures. We do this in addition to the HTTP check — caught a deadlocked Ollama instance twice that the basic check missed.
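One way to wire that up against HAProxy — a sketch only, assuming a runtime socket is enabled in the global section (`stats socket /run/haproxy/admin.sock mode 660 level admin`), socat is installed, and using an illustrative script name:
#!/bin/bash
# /usr/local/bin/ollama-deep-check-all.sh
# Cron: * * * * * root /usr/local/bin/ollama-deep-check-all.sh
# Smoke-tests every backend; after 3 consecutive failures, disables that server
# via the HAProxy runtime API, and re-enables it once the check passes again.
declare -A SERVERS=( [ollama1]=10.0.1.10 [ollama2]=10.0.1.11 [ollama3]=10.0.1.12 [ollama4]=10.0.1.13 )
STATE_DIR=/var/run/ollama-checks
SOCK=/run/haproxy/admin.sock
mkdir -p "$STATE_DIR"
for name in "${!SERVERS[@]}"; do
  host=${SERVERS[$name]}
  if curl -s -m 10 -X POST "http://${host}:11434/api/generate" \
       -d '{"model":"llama3.1:8b","prompt":"ping","stream":false,"options":{"num_predict":1}}' \
     | grep -q '"response"'; then
    echo 0 > "$STATE_DIR/$name"
    echo "enable server ollama_backend/$name" | socat stdio "$SOCK" > /dev/null
  else
    fails=$(( $(cat "$STATE_DIR/$name" 2>/dev/null || echo 0) + 1 ))
    echo "$fails" > "$STATE_DIR/$name"
    (( fails >= 3 )) && echo "disable server ollama_backend/$name" | socat stdio "$SOCK" > /dev/null
  fi
done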
TLS, Auth, and Rate Limiting {#security}
TLS termination
Always terminate TLS at the LB, not at Ollama (Ollama has no TLS support). Use Let's Encrypt via certbot or cert-manager:
sudo certbot certonly --nginx -d ollama.internal.example.com
sudo systemctl enable certbot.timer # auto-renewal
API key auth
Three patterns from simple to robust:
Inline allowlist (small teams):
if ($http_authorization !~ "^Bearer (sk-team-eng|sk-team-product)$") {
return 401;
}
Map-based (cleaner for many keys):
map $http_authorization $valid_key {
default 0;
"Bearer sk-team-eng-2026" 1;
"Bearer sk-team-product-2026" 1;
}
server {
if ($valid_key = 0) { return 401; }
}
External auth service (real production): OAuth2-proxy or Pomerium ext-auth filter. Validates JWTs, supports key rotation without Nginx reload, can integrate with your IdP.
Rate limiting
Already shown above. Key choices:
- Per-API-key — fair across users, depends on key being present.
- Per-IP — fallback for unauthenticated traffic (probably should not exist for Ollama).
- Per-route — protect expensive endpoints like /api/generate separately from cheap ones like /api/tags.
For deeper rate limiting (token-based, per-model quotas), LiteLLM has it built in; Nginx and HAProxy do request-rate limiting only. A per-route sketch follows below.
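Here is that per-route sketch for Nginx — separate zones so a burst of cheap /api/tags calls cannot eat the /api/generate budget. Zone names, rates, and the conf.d path are illustrative, and conf.d being included in the http block assumes a Debian/Ubuntu layout:
# Zones go in the http block (conf.d is included there on Debian/Ubuntu layouts)
sudo tee /etc/nginx/conf.d/ollama-zones.conf > /dev/null <<'EOF'
limit_req_zone $http_authorization zone=ollama_generate:10m rate=30r/m;
limit_req_zone $http_authorization zone=ollama_cheap:10m rate=300r/m;
EOF
# Then, inside the server block, one location per route:
#   location /api/generate { limit_req zone=ollama_generate burst=10 nodelay; proxy_pass http://ollama_backends; }
#   location /api/tags     { limit_req zone=ollama_cheap    burst=50 nodelay; proxy_pass http://ollama_backends; }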
Real Benchmarks (4× RTX 4090) {#benchmarks}
Test rig:
- 4 nodes, each Ryzen 9 7950X + RTX 4090 24GB + 64GB DDR5 + NVMe
- Nginx 1.26 LB on a separate Ryzen 7 box
- Ollama 0.5.7, llama3.1:8b Q4_K_M, OLLAMA_NUM_PARALLEL=4 each
- Workload: 16 concurrent users, mixed prompt lengths (50-2000 tokens in)
| Metric | 1 backend | 2 backends | 4 backends |
|---|---|---|---|
| Aggregate throughput (tok/s) | 280 | 560 | 1,150 |
| TTFB p50 | 320 ms | 280 ms | 240 ms |
| TTFB p95 | 1,400 ms | 920 ms | 1,180 ms |
| Request queue depth (max) | 12 | 6 | 2 |
| Errors (5xx + timeouts) | 8/1000 | 1/1000 | 0/1000 |
| GPU utilization avg | 91% | 78% | 64% |
Scaling is roughly linear up to 4 backends for this workload. Beyond that you start seeing the LB itself become a bottleneck around 2,500 RPS — that is when you split into multiple LB instances behind DNS round-robin.
For the 70B model:
| Backends | Throughput | TTFB p50 | Notes |
|---|---|---|---|
| 1 (×4090) | does not fit | — | Need 48GB+ |
| 1 (×A6000 48GB) | 38 tok/s | 1,800 ms | 70B Q4 fits |
| 4 (×A6000 48GB) | 145 tok/s | 1,400 ms | 4 separate model copies |
For 70B+ scale, vLLM or sglang with tensor parallelism beats Ollama horizontal scaling. Ollama wins for "many small models, many users."
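If you want to reproduce a similar load shape against your own stack, a minimal concurrency harness looks like this — a sketch, not the exact benchmark client used for the numbers above:
# 200 requests, 16-way concurrency, streaming on so time_starttransfer approximates TTFB
cat > /tmp/bench-payload.json <<'EOF'
{"model":"llama3.1:8b","prompt":"Summarize load balancing in one sentence.","stream":true}
EOF
seq 200 | xargs -P 16 -I{} curl -s -o /dev/null \
    -w "%{time_starttransfer} %{time_total}\n" \
    -d @/tmp/bench-payload.json http://localhost:8080/api/generate \
  | awk '{ttfb+=$1; tot+=$2; n++} END {printf "n=%d  avg TTFB %.2fs  avg total %.2fs\n", n, ttfb/n, tot/n}'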
Common Pitfalls {#pitfalls}
1. Round-robin instead of least-connections. LLM request times vary 50x. Round-robin overloads slow backends. Always least_conn / leastconn.
2. proxy_buffering on (default in Nginx). Streaming dies. Set off.
3. 60-second default timeouts. Long generations fail mid-response. Bump to 600s.
4. Forgetting OLLAMA_HOST=0.0.0.0. Ollama binds to localhost by default. The LB cannot reach it from another host.
5. No keep_alive on backends. Models unload, every burst pays cold-start cost. OLLAMA_KEEP_ALIVE=24h everywhere.
6. TCP-only health checks. Ollama can hang while accepting TCP connections. Use HTTP checks at minimum.
7. Same model name, different versions across backends. Pin tags (llama3.1:8b-instruct-q4_K_M not llama3.1) and verify with ollama list on every backend — a quick cross-backend check follows after this list.
8. Missing client_max_body_size. Vision and file inputs get rejected with 413. Default 1MB is way too small.
9. Reloading Nginx during peak. systemctl reload nginx is graceful but restart is not. Always reload, never restart, during business hours.
10. No observability. Without per-backend metrics (RPS, p95, error rate), you cannot tell which backend is misbehaving. Add Prometheus + Grafana — the Ollama production deployment guide has dashboards.
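For pitfall 7, a quick cross-backend check via the API — backend IPs match the configs above, and jq is assumed to be installed:
# Print model name + digest from every backend; mismatching digests mean drift
for host in 10.0.1.10 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "== $host =="
  curl -s "http://${host}:11434/api/tags" \
    | jq -r '.models[] | "\(.name)  \(.digest[0:12])"' | sort
done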
For the broader infra picture, the Nginx load balancing docs cover additional algorithms, and the HAProxy load balancing strategies guide goes deeper on the algorithm comparison.
Conclusion
Load balancing Ollama is the difference between a hobby project and a real shared service. Two backends behind Nginx with least-connections gets you 90% of the way. Add HAProxy or LiteLLM when you need stricter SLOs or model-aware routing. Layer in TLS, API keys, rate limits, and proper health checks and you have an internal AI gateway your team can actually depend on.
The thing nobody tells you up front: most of the work is not the LB config. It is the operational hygiene around it — pre-warming new backends, handling rolling restarts without dropping streams, monitoring per-backend latency so you catch a degrading GPU before users do. The configs above are battle-tested. The rest is your team's discipline.
When this stops being enough — usually around 6+ backends and 50+ users — Kubernetes is the natural next step. The Ollama on Kubernetes guide picks up exactly where this one ends, with StatefulSets, KEDA autoscaling, and ingress that handles all of the above more declaratively.
Want the next infrastructure deep dive (multi-region failover, GPU-aware Envoy filters, request-level cost attribution)? Subscribe to the Local AI Master newsletter — production playbooks, weekly.