Ollama Rate Limiting for Multi-User Setups: Quotas & Fairness
Published on April 23, 2026 -- 20 min read
The first user to discover your shared Ollama server runs a 200K-token research script overnight and locks the GPU until morning. The second user runs a tight loop of one-token completions and pegs the queue at 100. The third user opens 40 browser tabs, each polling for autocomplete every 300 ms. None of them are malicious. None of them think they are doing anything wrong. And by 9am Monday, the rest of the team cannot get a single response under 30 seconds.
Rate limiting is what stops that. Done well, it is invisible — every user gets fast responses, the GPU stays under 80 percent utilisation, and the per-user dashboards in Grafana show smooth bands. Done badly, it produces angry support tickets and an Ollama instance that is technically up but practically unusable. This guide is the difference.
Quick Start:
limit_req_zone $http_x_api_key zone=ollama_per_key:10m rate=30r/m; in your Nginx config gives you per-API-key rate limiting at 30 requests per minute. Combine it with a daily token budget enforced by a small Redis-backed sidecar and you cover 90 percent of multi-user Ollama needs.
Table of Contents
- Why Ollama Needs External Rate Limiting
- Choosing a Strategy
- Nginx Configuration
- HAProxy Configuration
- Caddy and Traefik
- Token Budgets with Redis
- Fairness Queue Pattern
- Communicating Limits to Users
- Pitfalls and Anti-Patterns
- Frequently Asked Questions
Why Ollama Needs External Rate Limiting {#why-rate-limit}
Ollama by itself has exactly two relevant knobs:
- OLLAMA_NUM_PARALLEL — how many requests the server processes concurrently per loaded model. Default 1.
- OLLAMA_MAX_LOADED_MODELS — how many distinct models can sit in VRAM. Default 1.
Neither addresses the problem. Concurrency 4 means four users get GPU time simultaneously, then the next four wait — and the queue is unbounded and unprioritised. There is no per-user budget, no fair-share scheduling, no way to differentiate a CI integration test from a human chat.
The fix is always to put a layer in front. That layer can be:
- A reverse proxy (Nginx, HAProxy, Caddy, Traefik) that enforces request-rate limits via headers or IP.
- An API gateway (Kong, Tyk, KrakenD, LiteLLM) that adds per-API-key auth, token-bucket limits, and routing.
- A custom queue service that implements fair-share scheduling, priority, and token budgets.
Most production Ollama deployments use a reverse proxy plus a small custom service for token budgets. That combination covers the request-per-minute case (proxy) and the tokens-per-day case (custom).
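Every request-per-minute limiter below is some variant of the token bucket: tokens refill at a steady rate, each request spends one, and an empty bucket means 429. A minimal, illustrative Python sketch of the idea (not code from any of the tools discussed):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill for the elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=0.5, capacity=5)   # 30 req/min, burst of 5
print([bucket.allow() for _ in range(7)])    # first five pass, then the bucket is empty
```

Nginx's burst= parameter corresponds to the capacity here, and rate=30r/m to the refill rate.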
For the wider production architecture this rate-limiting layer fits into, see the Ollama production deployment guide and the Ollama monitoring guide — quotas without observability are guesswork.
Choosing a Strategy {#strategy}
| Strategy | Enforcement | Granularity | Implementation | Use when |
|---|---|---|---|---|
| Per-IP req/min | Reverse proxy | Coarse | 5 lines Nginx | Single-tenant, prototype |
| Per-API-key req/min | Reverse proxy | Per user | 15 lines Nginx | Small team, named users |
| Per-API-key tokens/day | Sidecar + Redis | Per user, accurate | ~150 LOC Python | Multi-team, cost control |
| Tier-based priority | Custom queue | Per user, fair | ~400 LOC Go | Premium/free tiers |
| Full API gateway | Kong / LiteLLM | Per route, complex | Container | Multi-model, multi-tenant |
Pick the simplest tier that solves your actual problem. A research lab with 8 named users does not need Kong. A startup hosting a SaaS layer over Ollama probably does.
Nginx Configuration {#nginx}
Nginx is the most common reverse proxy in front of Ollama. The relevant directives are limit_req_zone, limit_conn_zone, and limit_req.
Step 1: Per-IP Limit (Baseline)
http {
    limit_req_zone $binary_remote_addr zone=ollama_ip:10m rate=30r/m;
    limit_req_status 429;

    upstream ollama_backend {
        server 127.0.0.1:11434;
        keepalive 32;
    }

    server {
        listen 443 ssl http2;
        server_name ollama.example.com;
        # ... ssl_certificate, ssl_certificate_key ...

        location / {
            limit_req zone=ollama_ip burst=5 nodelay;
            proxy_pass http://ollama_backend;
            proxy_http_version 1.1;
            proxy_buffering off;        # critical for streaming
            proxy_read_timeout 600s;    # large model = long replies
            proxy_set_header Connection "";
        }
    }
}
burst=5 nodelay allows a small burst of 5 requests above the steady rate before the proxy starts returning 429. proxy_buffering off is mandatory — without it, Ollama's streaming responses are buffered and users see no output until the full response is generated.
Step 2: Per-API-Key Limit
http {
    # API key required, fall back to IP if missing
    map $http_x_api_key $api_key_id {
        default $binary_remote_addr;
        ""      $binary_remote_addr;
        ~.      $http_x_api_key;
    }

    limit_req_zone $api_key_id zone=ollama_per_key:10m rate=30r/m;

    server {
        # ... tls, listen ...

        # Reject missing API key entirely
        if ($http_x_api_key = "") {
            return 401 '{"error":"missing X-API-Key header"}';
        }

        location / {
            limit_req zone=ollama_per_key burst=10 nodelay;
            proxy_pass http://ollama_backend;
            proxy_buffering off;
            proxy_read_timeout 600s;
        }
    }
}
Step 3: Per-API-Key Concurrency
Concurrency limit (limit_conn) is what stops a single user opening 40 simultaneous streaming connections.
http {
    limit_conn_zone $api_key_id zone=ollama_conn:10m;

    server {
        location / {
            limit_req zone=ollama_per_key burst=10 nodelay;
            limit_conn ollama_conn 4;    # max 4 simultaneous streams per key
            proxy_pass http://ollama_backend;
        }
    }
}
Step 4: Validate API Keys
For real auth (not just rate-limit grouping), add an auth_request to a small validation service:
location = /_auth {
    internal;
    proxy_pass http://127.0.0.1:8080/validate;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-Key $http_x_api_key;
}

location / {
    auth_request /_auth;
    limit_req zone=ollama_per_key burst=10 nodelay;
    limit_conn ollama_conn 4;
    proxy_pass http://ollama_backend;
}
The official Nginx documentation for the ngx_http_limit_req_module at nginx.org covers every option these directives accept.
HAProxy Configuration {#haproxy}
HAProxy is the right pick when you need richer ACLs, sticky sessions, or already use HAProxy for other services.
frontend ollama_in
    bind *:443 ssl crt /etc/haproxy/cert.pem alpn h2,http/1.1
    mode http

    # Capture API key for tracking
    http-request set-var(txn.api_key) req.hdr(X-API-Key)

    # Reject missing key
    http-request deny status 401 if !{ var(txn.api_key) -m found }

    # Track request rate per API key
    stick-table type string len 64 size 100k expire 1m store http_req_rate(60s)
    http-request track-sc0 var(txn.api_key)

    # Rate limit: 30 requests per minute per key
    http-request deny status 429 if { sc_http_req_rate(0) gt 30 }

    default_backend ollama_be

backend ollama_be
    mode http
    timeout server 600s
    option http-buffer-request
    server ollama1 127.0.0.1:11434 check
HAProxy's stick-table approach can be faster than Nginx's limit_req at very high request rates, since the counters live in the data plane rather than behind a shared-memory lock.
Caddy and Traefik {#caddy-traefik}
Both modern proxies have plugin-based rate limiting.
Caddy
{
    order rate_limit before reverse_proxy
}

ollama.example.com {
    rate_limit {
        zone per_key {
            key    {http.request.header.X-API-Key}
            events 30
            window 1m
        }
    }

    reverse_proxy 127.0.0.1:11434 {
        flush_interval -1
        transport http {
            response_header_timeout 600s
        }
    }
}
Caddy's rate_limit is a third-party module — install with xcaddy build --with github.com/mholt/caddy-ratelimit.
Traefik
http:
  middlewares:
    ollama-ratelimit:
      rateLimit:
        average: 30
        period: 1m
        burst: 10
        sourceCriterion:
          requestHeaderName: X-API-Key
  routers:
    ollama:
      rule: Host(`ollama.example.com`)
      middlewares:
        - ollama-ratelimit
      service: ollama
      tls:
        certResolver: letsencrypt
  services:
    ollama:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:11434
Traefik shines in container deployments where the proxy config lives next to the docker-compose. For bare-metal Ollama, Nginx is usually less work.
Token Budgets with Redis {#token-budgets}
Request count limits do not cover the user who sends one prompt with 100K input tokens. For that, you need a token budget. The pattern: a small sidecar reads the eval_count and prompt_eval_count fields from each Ollama response, debits the API key's daily budget in Redis, and rejects future requests when the budget is depleted.
Sidecar (FastAPI + Redis)
# ollama-quota.py
from datetime import datetime
import os

import httpx
import redis
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, Response

OLLAMA = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
DAILY_TOKEN_LIMIT = int(os.environ.get("DAILY_TOKEN_LIMIT", "250000"))
r = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
app = FastAPI()

def quota_key(api_key: str) -> str:
    today = datetime.utcnow().strftime("%Y%m%d")
    return f"ollama:quota:{api_key}:{today}"

@app.middleware("http")
async def enforce_quota(request: Request, call_next):
    api_key = request.headers.get("x-api-key")
    if not api_key:
        # Return a Response directly: exceptions raised inside middleware
        # bypass FastAPI's HTTPException handlers and surface as 500s
        return Response(status_code=401,
                        content='{"error":"missing X-API-Key header"}')
    used = int(r.get(quota_key(api_key)) or 0)
    if used >= DAILY_TOKEN_LIMIT:
        return Response(
            status_code=429,
            headers={
                "Retry-After": "3600",
                "X-RateLimit-Limit": str(DAILY_TOKEN_LIMIT),
                "X-RateLimit-Used": str(used),
            },
            content='{"error":"daily token budget exceeded"}',
        )
    return await call_next(request)

@app.post("/api/chat")
@app.post("/api/generate")
async def proxy(request: Request):
    body = await request.body()

    async def upstream():
        async with httpx.AsyncClient(timeout=600) as cli:
            async with cli.stream(
                "POST", f"{OLLAMA}{request.url.path}", content=body
            ) as resp:
                async for chunk in resp.aiter_raw():
                    yield chunk
                # The final chunk usually contains eval_count and
                # prompt_eval_count. In practice, parse the streamed
                # JSON to extract the counts and debit Redis.

    return StreamingResponse(upstream(), media_type="application/x-ndjson")
In production, the streaming response handler accumulates the JSON chunks, parses the final object's eval_count and prompt_eval_count, and runs:
total = response_json["prompt_eval_count"] + response_json["eval_count"]
r.incrby(quota_key(api_key), total)
r.expire(quota_key(api_key), 86400 * 2) # 48h safety window
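A sketch of that parsing step, assuming the response was buffered as NDJSON and that the final streamed object carries "done": true alongside the counts (which matches the fields named above):

```python
import json

def extract_token_counts(raw_stream: bytes) -> int:
    """Return prompt + completion tokens from a buffered NDJSON response.

    Ollama's final streamed object has "done": true plus eval_count and
    prompt_eval_count; earlier chunks omit the counts, so scan backwards.
    """
    for line in reversed(raw_stream.decode("utf-8").strip().splitlines()):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if obj.get("done"):
            return obj.get("prompt_eval_count", 0) + obj.get("eval_count", 0)
    return 0  # stream was cut off before the final chunk; charge nothing

sample = b'{"response":"Hi"}\n{"done":true,"prompt_eval_count":12,"eval_count":40}\n'
print(extract_token_counts(sample))  # 52
```

Charging zero for truncated streams is a policy choice; you could also estimate from the bytes sent.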
Wiring Into Nginx
location /api/ {
    limit_req zone=ollama_per_key burst=10 nodelay;
    proxy_pass http://127.0.0.1:8085;    # quota sidecar
    proxy_buffering off;
    proxy_read_timeout 600s;
}
Nginx still does request/min limiting; the sidecar adds tokens/day on top.
For a complete reference of the patterns and primitives that make this kind of accounting work cleanly, the official Redis pattern guide covers rate limiting and counters.
Fairness Queue Pattern {#fairness}
For deployments with mixed priority — say, free-tier users vs paid — pure rate limits are not enough. You want round-robin fairness so a paid user with budget always beats a free user, but two paid users at the same time alternate. That is a queue, not a rate limit.
# Simplified fairness queue (sketch)
import asyncio
from collections import deque, defaultdict

class FairQueue:
    def __init__(self):
        self.queues = defaultdict(deque)   # user_id -> pending (request, future)
        self.users = deque()               # round-robin order
        self.cv = asyncio.Condition()

    async def submit(self, user_id, request):
        async with self.cv:
            if user_id not in self.users:
                self.users.append(user_id)
            future = asyncio.get_running_loop().create_future()
            self.queues[user_id].append((request, future))
            self.cv.notify()
        return await future

    async def worker(self, ollama_call):
        while True:
            async with self.cv:
                while not any(self.queues.values()):
                    await self.cv.wait()
                # Round-robin: rotate users until we find one with work
                while not self.queues[self.users[0]]:
                    self.users.rotate(-1)
                user_id = self.users[0]
                request, future = self.queues[user_id].popleft()
                self.users.rotate(-1)
            # Run inference outside the lock so multiple workers can overlap
            try:
                result = await ollama_call(request)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
A single worker means strict serialisation. Multiple workers means parallel inference up to OLLAMA_NUM_PARALLEL. The queue depth and per-user wait-time become the primary capacity-planning metrics — both worth exporting to Prometheus.
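A sketch of that export using the prometheus_client package (metric names and the port are illustrative, and the commented hook points refer to the queue sketch above, not to existing methods):

```python
import time

from prometheus_client import Gauge, Histogram, start_http_server

QUEUE_DEPTH = Gauge("ollama_queue_depth",
                    "Requests waiting per user", ["user"])
WAIT_SECONDS = Histogram("ollama_queue_wait_seconds",
                         "Seconds from submit to dispatch")

# Hypothetical hook points inside the fairness queue:
#   submit():                QUEUE_DEPTH.labels(user=user_id).inc()
#   worker(), after popleft():
#       QUEUE_DEPTH.labels(user=user_id).dec()
#       WAIT_SECONDS.observe(time.monotonic() - enqueued_at)

start_http_server(9105)   # non-blocking; exposes /metrics on :9105
```

Point Prometheus at :9105/metrics and alert on queue depth, not GPU utilisation: depth rises minutes before users notice latency.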
Communicating Limits to Users {#communicating}
Rate-limited users will assume the system is broken unless you tell them otherwise. Three things make 429 responses humane.
1. Use the standard headers.
HTTP/1.1 429 Too Many Requests
Retry-After: 18
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714080540
Content-Type: application/json
{"error":"rate_limit_exceeded","message":"30 requests/minute exceeded. Retry in 18s."}
The OpenAI SDK already understands these headers — clients written against OpenAI will retry automatically.
2. Surface budgets in your UI.
A "Tokens used today: 187K of 250K" widget reduces support tickets dramatically. The numbers come from the same Redis counters as the enforcement, so adding the UI is roughly 30 lines of code.
3. Fail fast at the proxy, not deep in the pipeline.
A 429 in the first millisecond is far better UX than a request that waits 8 seconds in queue, then fails. Always enforce limits at the edge layer (Nginx, HAProxy) before any expensive work.
Pitfalls and Anti-Patterns {#pitfalls}
Pitfall 1: IP-Based Limits Behind a NAT
Symptom: Whole office gets rate-limited together.
Cause: $binary_remote_addr is the office NAT IP. All users share it.
Fix: Switch to API-key-based limiting and require the header. If a user does not have an API key, reject with 401.
Pitfall 2: proxy_buffering on
Symptom: Streaming responses arrive all at once after a long wait.
Cause: Nginx buffers the entire response before forwarding.
Fix: proxy_buffering off in the location block. This is non-negotiable for Ollama.
Pitfall 3: Limit Zone Too Small
Symptom: 503 errors under load with "limit_req: cannot allocate memory" in the Nginx log.
Cause: zone=ollama:1m only tracks a few thousand keys.
Fix: zone=ollama:10m holds roughly 160K states (Nginx stores about 16K 64-byte states per megabyte), which is fine for most teams. Use zone=ollama:100m for very large deployments.
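The arithmetic behind those numbers, using the nginx documentation's figure of roughly 16 thousand 64-byte states per megabyte (some platforms use 128-byte states, halving the capacity):

```python
def zone_capacity(zone_mb: int, state_bytes: int = 64) -> int:
    """Approximate number of keys a limit_req zone can track."""
    return zone_mb * 1024 * 1024 // state_bytes

print(zone_capacity(1))    # 16384  keys in a 1 MB zone
print(zone_capacity(10))   # 163840 keys in a 10 MB zone
```

Size the zone to a comfortable multiple of your expected distinct keys; when it fills, Nginx evicts old states or returns 503.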
Pitfall 4: Token Budget Reset Drift
Symptom: Daily token budget resets at random times instead of UTC midnight.
Cause: Using SET ... EX 86400 instead of an explicit UTC date in the key.
Fix: Build the key from datetime.utcnow().strftime("%Y%m%d") so the budget is naturally per-UTC-day.
Pitfall 5: Concurrent Stream Limit Set Too Low
Symptom: Power users complain that opening two browser tabs breaks one.
Cause: limit_conn 1 per key.
Fix: 4-8 is a reasonable concurrent-stream budget per user. The limit is only there to stop runaway clients with hundreds of open connections.
For the related multi-GPU scaling considerations once a single Ollama instance is fully loaded, see the multi-GPU Ollama guide.
Final Notes
Rate limiting is the part of an Ollama deployment that everyone knows they need but nobody plans for until something breaks. The tools are simple — Nginx, a few directives, a small Redis sidecar — but the design questions are not. How granular should limits be? Should you charge by request or by token? What happens when the GPU is overloaded but the limit is not hit? Who decides the priority order?
Start with the smallest thing that solves an actual problem. Per-IP for a prototype. Per-API-key for a real team. Token budgets when usage starts costing real money in electricity. Priority queues only when you have free and paid tiers to differentiate. Layer them, monitor them in Grafana, and revisit the thresholds every couple of months as traffic patterns evolve.
The goal is not perfect fairness — it is a deployment where every user gets the same fast experience they would have on a private Ollama instance, even though the GPU is shared. Get the limits right, and the rest of your team forgets the server exists. Get them wrong, and you spend Monday morning explaining to leadership why the AI tool everyone loved last week stopped working.