
Ollama Rate Limiting for Multi-User Setups: Quotas & Fairness

April 23, 2026
20 min read
LocalAimaster Research Team


The first user to discover your shared Ollama server runs a 200K-token research script overnight and locks the GPU until morning. The second user runs a tight loop of one-token completions and pegs the queue at 100. The third user opens 40 browser tabs, each polling for autocomplete every 300 ms. None of them are malicious. None of them think they are doing anything wrong. And by 9am Monday, the rest of the team cannot get a single response under 30 seconds.

Rate limiting is what stops that. Done well, it is invisible — every user gets fast responses, the GPU stays under 80 percent utilisation, and the per-user dashboards in Grafana show smooth bands. Done badly, it produces angry support tickets and an Ollama instance that is technically up but practically unusable. This guide is the difference.

Quick Start: limit_req_zone $http_x_api_key zone=ollama_per_key:10m rate=30r/m; in your Nginx config gives you per-API-key rate limiting at 30 requests per minute. Combine with a daily token budget enforced by a small Redis-backed sidecar and you cover 90 percent of multi-user Ollama needs.


Table of Contents

  1. Why Ollama Needs External Rate Limiting
  2. Choosing a Strategy
  3. Nginx Configuration
  4. HAProxy Configuration
  5. Caddy and Traefik
  6. Token Budgets with Redis
  7. Fairness Queue Pattern
  8. Communicating Limits to Users
  9. Pitfalls and Anti-Patterns

Why Ollama Needs External Rate Limiting {#why-rate-limit}

Ollama by itself has exactly two relevant knobs:

  • OLLAMA_NUM_PARALLEL — how many requests the server processes concurrently per loaded model. Recent releases auto-select 4 or 1 depending on available memory; older ones default to 1.
  • OLLAMA_MAX_LOADED_MODELS — how many distinct models can sit in VRAM at once. Defaults to 3 per GPU on recent releases, 1 on older ones.

Neither addresses the problem. Concurrency 4 means four users get GPU time simultaneously, then the next four wait — and the queue is unbounded and unprioritised. There is no per-user budget, no fair-share scheduling, no way to differentiate a CI integration test from a human chat.
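
To see the queueing behaviour concretely, the sketch below fires eight concurrent generations at a stock server and prints per-request latency. The endpoint and the model name "llama3" are placeholder assumptions.

# Sketch: observe Ollama's unbounded FIFO queue by timing concurrent requests.
# Assumes Ollama on localhost:11434 with a model named "llama3" already pulled.
import asyncio
import time

import httpx

async def one_request(client: httpx.AsyncClient, i: int) -> None:
    start = time.monotonic()
    resp = await client.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": "llama3", "prompt": "Say hi.", "stream": False},
    )
    resp.raise_for_status()
    print(f"request {i}: {time.monotonic() - start:.1f}s")

async def main() -> None:
    async with httpx.AsyncClient(timeout=600) as client:
        # With OLLAMA_NUM_PARALLEL=1, latencies grow roughly linearly:
        # request N waits for requests 1..N-1 to finish first.
        await asyncio.gather(*(one_request(client, i) for i in range(8)))

asyncio.run(main())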

The fix is always to put a layer in front. That layer can be:

  1. A reverse proxy (Nginx, HAProxy, Caddy, Traefik) that enforces request-rate limits via headers or IP.
  2. An API gateway (Kong, Tyk, KrakenD, LiteLLM) that adds per-API-key auth, token-bucket limits, and routing.
  3. A custom queue service that implements fair-share scheduling, priority, and token budgets.

Most production Ollama deployments use a reverse proxy plus a small custom service for token budgets. That combination covers the request-per-minute case (proxy) and the tokens-per-day case (custom).

For the wider production architecture this rate-limiting layer fits into, see the Ollama production deployment guide and the Ollama monitoring guide — quotas without observability are guesswork.


Choosing a Strategy {#strategy}

| Strategy | Enforcement | Granularity | Implementation | Use when |
|---|---|---|---|---|
| Per-IP req/min | Reverse proxy | Coarse | 5 lines Nginx | Single-tenant, prototype |
| Per-API-key req/min | Reverse proxy | Per user | 15 lines Nginx | Small team, named users |
| Per-API-key tokens/day | Sidecar + Redis | Per user, accurate | ~150 LOC Python | Multi-team, cost control |
| Tier-based priority | Custom queue | Per user, fair | ~400 LOC Go | Premium/free tiers |
| Full API gateway | Kong / LiteLLM | Per route, complex | Container | Multi-model, multi-tenant |

Pick the simplest tier that solves your actual problem. A research lab with 8 named users does not need Kong. A startup hosting a SaaS layer over Ollama probably does.


Nginx Configuration {#nginx}

Nginx is the most common reverse proxy in front of Ollama. The relevant directives are limit_req_zone, limit_conn_zone, and limit_req.

Step 1: Per-IP Limit (Baseline)

http {
    limit_req_zone $binary_remote_addr zone=ollama_ip:10m rate=30r/m;
    limit_req_status 429;

    upstream ollama_backend {
        server 127.0.0.1:11434;
        keepalive 32;
    }

    server {
        listen 443 ssl http2;
        server_name ollama.example.com;

        # ... ssl_certificate, ssl_certificate_key ...

        location / {
            limit_req zone=ollama_ip burst=5 nodelay;

            proxy_pass http://ollama_backend;
            proxy_http_version 1.1;
            proxy_buffering off;          # critical for streaming
            proxy_read_timeout 600s;       # large model = long replies
            proxy_set_header Connection "";
        }
    }
}

burst=5 nodelay allows a short burst of 5 requests above the steady rate before the proxy starts returning 429. Note that Nginx smooths rate=30r/m to one request every two seconds, not 30 anywhere in a minute, which is why the burst allowance matters. proxy_buffering off is mandatory: without it, Ollama's streaming responses are buffered and users see no output until the full response is generated.
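
A quick way to confirm the limit behaves as intended is to rapid-fire requests and count the 429s. A minimal sketch, assuming the Step 1 config is live at the placeholder hostname:

# Sketch: fire 20 rapid requests at the proxy and count rejections.
import httpx

ok = limited = 0
with httpx.Client() as client:
    for _ in range(20):
        status = client.get("https://ollama.example.com/api/tags").status_code
        if status == 429:
            limited += 1
        else:
            ok += 1

# With rate=30r/m burst=5 nodelay, expect ~6 passes (one steady slot plus
# five burst) and 429 for the rest of the rapid-fire batch.
print(f"passed={ok} rejected={limited}")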

Step 2: Per-API-Key Limit

http {
    # API key required, fall back to IP if missing
    map $http_x_api_key $api_key_id {
        default $binary_remote_addr;
        "" $binary_remote_addr;
        ~. $http_x_api_key;
    }

    limit_req_zone $api_key_id zone=ollama_per_key:10m rate=30r/m;

    server {
        # ... tls, listen ...

        # Reject missing API key entirely
        if ($http_x_api_key = "") {
            return 401 '{"error":"missing X-API-Key header"}';
        }

        location / {
            limit_req zone=ollama_per_key burst=10 nodelay;
            proxy_pass http://ollama_backend;
            proxy_buffering off;
            proxy_read_timeout 600s;
        }
    }
}

Step 3: Per-API-Key Concurrency

Concurrency limit (limit_conn) is what stops a single user opening 40 simultaneous streaming connections.

http {
    limit_conn_zone $api_key_id zone=ollama_conn:10m;

    server {
        location / {
            limit_req zone=ollama_per_key burst=10 nodelay;
            limit_conn ollama_conn 4;     # max 4 simultaneous streams per key
            proxy_pass http://ollama_backend;
        }
    }
}
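
To verify the cap, open more simultaneous streams than limit_conn allows. A sketch with placeholder key and model; note that limit_conn rejects with 503 by default unless limit_conn_status 429 is set:

# Sketch: 6 concurrent streams against limit_conn 4 -- expect two rejections.
import asyncio

import httpx

async def open_stream(client: httpx.AsyncClient, i: int) -> None:
    async with client.stream(
        "POST",
        "https://ollama.example.com/api/generate",
        headers={"X-API-Key": "demo-key"},  # placeholder key
        json={"model": "llama3", "prompt": "Count to 100 slowly."},
    ) as resp:
        print(f"stream {i}: HTTP {resp.status_code}")
        async for _ in resp.aiter_raw():
            pass  # drain the stream

async def main() -> None:
    async with httpx.AsyncClient(timeout=600) as client:
        await asyncio.gather(*(open_stream(client, i) for i in range(6)))

asyncio.run(main())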

Step 4: Validate API Keys

For real auth (not just rate-limit grouping), add an auth_request to a small validation service:

location = /_auth {
    internal;
    proxy_pass http://127.0.0.1:8080/validate;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-Key $http_x_api_key;
}

location / {
    auth_request /_auth;
    limit_req zone=ollama_per_key burst=10 nodelay;
    limit_conn ollama_conn 4;
    proxy_pass http://ollama_backend;
}
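
The validation service itself can be tiny. A sketch of the endpoint the auth_request points at, with the key store as a placeholder; auth_request treats any 2xx as allow and 401/403 as deny:

# validate.py -- key check behind Nginx auth_request.
# Placeholder key store; real deployments read a database or secrets file.
import os

from fastapi import FastAPI, Request, Response

app = FastAPI()
VALID_KEYS = set(filter(None, os.environ.get("OLLAMA_API_KEYS", "").split(",")))

@app.api_route("/validate", methods=["GET", "POST"])
async def validate(request: Request) -> Response:
    key = request.headers.get("x-original-key", "")
    if key in VALID_KEYS:
        return Response(status_code=204)  # 2xx: Nginx lets the request through
    return Response(status_code=401)      # 401/403: Nginx denies it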

The official Nginx documentation for the ngx_http_limit_req_module at nginx.org covers every option these directives accept.


HAProxy Configuration {#haproxy}

HAProxy is the right pick when you need richer ACLs, sticky sessions, or already use HAProxy for other services.

frontend ollama_in
    bind *:443 ssl crt /etc/haproxy/cert.pem alpn h2,http/1.1
    mode http

    # Capture API key for tracking
    http-request set-var(txn.api_key) req.hdr(X-API-Key)

    # Reject missing key
    http-request deny status 401 if !{ var(txn.api_key) -m found }

    # Track request rate per API key
    stick-table type string len 64 size 100k expire 1m store http_req_rate(60s)
    http-request track-sc0 var(txn.api_key)

    # Rate limit: 30 requests per minute per key
    http-request deny status 429 if { sc_http_req_rate(0) gt 30 }

    default_backend ollama_be

backend ollama_be
    mode http
    timeout server 600s
    option http-buffer-request
    server ollama1 127.0.0.1:11434 check

HAProxy's stick-table counters live in the proxy's own process memory, which tends to scale better at very high request rates than Nginx's limit_req, where every request locks a shared-memory zone.


Caddy and Traefik {#caddy-traefik}

Both modern proxies have plugin-based rate limiting.

Caddy

{
    order rate_limit before reverse_proxy
}

ollama.example.com {
    rate_limit {
        zone per_key {
            key {http.request.header.X-API-Key}
            events 30
            window 1m
        }
    }

    reverse_proxy 127.0.0.1:11434 {
        flush_interval -1
        transport http {
            response_header_timeout 600s
        }
    }
}

Caddy's rate_limit is a third-party module — install with xcaddy build --with github.com/mholt/caddy-ratelimit.

Traefik

http:
  middlewares:
    ollama-ratelimit:
      rateLimit:
        average: 30
        period: 1m
        burst: 10
        sourceCriterion:
          requestHeaderName: X-API-Key

  routers:
    ollama:
      rule: Host(`ollama.example.com`)
      middlewares:
        - ollama-ratelimit
      service: ollama
      tls:
        certResolver: letsencrypt

  services:
    ollama:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:11434

Traefik shines in container deployments where the proxy config lives next to the docker-compose. For bare-metal Ollama, Nginx is usually less work.


Token Budgets with Redis {#token-budgets}

Request count limits do not cover the user who sends one prompt with 100K input tokens. For that, you need a token budget. The pattern: a small sidecar reads the eval_count and prompt_eval_count fields from each Ollama response, debits the API key's daily budget in Redis, and rejects future requests when the budget is depleted.

Sidecar (FastAPI + Redis)

# ollama-quota.py
import json
import os
from datetime import datetime

import httpx
import redis
from fastapi import FastAPI, Request
from fastapi.responses import Response, StreamingResponse

OLLAMA = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
DAILY_TOKEN_LIMIT = int(os.environ.get("DAILY_TOKEN_LIMIT", "250000"))
r = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
app = FastAPI()

def quota_key(api_key: str) -> str:
    today = datetime.utcnow().strftime("%Y%m%d")
    return f"ollama:quota:{api_key}:{today}"

@app.middleware("http")
async def enforce_quota(request: Request, call_next):
    api_key = request.headers.get("x-api-key")
    if not api_key:
        # Return a Response here: an HTTPException raised inside middleware
        # bypasses FastAPI's exception handlers and surfaces as a 500.
        return Response(
            status_code=401,
            content='{"error":"missing X-API-Key"}',
            media_type="application/json",
        )

    used = int(r.get(quota_key(api_key)) or 0)
    if used >= DAILY_TOKEN_LIMIT:
        return Response(
            status_code=429,
            headers={
                "Retry-After": "3600",
                "X-RateLimit-Limit": str(DAILY_TOKEN_LIMIT),
                "X-RateLimit-Used": str(used),
            },
            content='{"error":"daily token budget exceeded"}',
            media_type="application/json",
        )
    return await call_next(request)

@app.post("/api/chat")
@app.post("/api/generate")
async def proxy(request: Request):
    api_key = request.headers["x-api-key"]
    body = await request.body()

    async def upstream():
        tail = b""
        async with httpx.AsyncClient(timeout=600) as cli:
            async with cli.stream(
                "POST", f"{OLLAMA}{request.url.path}", content=body
            ) as resp:
                async for chunk in resp.aiter_raw():
                    # Keep a generous tail: the final /api/generate object
                    # can carry a large "context" array.
                    tail = (tail + chunk)[-1_048_576:]
                    yield chunk
        # The final NDJSON object carries prompt_eval_count and eval_count.
        try:
            final = json.loads(tail.splitlines()[-1])
            total = final.get("prompt_eval_count", 0) + final.get("eval_count", 0)
            key = quota_key(api_key)
            r.incrby(key, total)
            r.expire(key, 86400 * 2)  # 48h safety window
        except (ValueError, IndexError):
            pass  # unparseable tail: skip accounting rather than break the stream

    return StreamingResponse(upstream(), media_type="application/x-ndjson")

The debit runs after the stream closes: the sidecar sums prompt_eval_count and eval_count from the final object and increments the day's Redis counter. The 48-hour expiry is deliberately longer than the one-day window; because the key already encodes the UTC date, the TTL exists only to garbage-collect stale keys.
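
One refinement worth considering: do the increment and the expiry in a single transaction, and set the TTL only when the counter is first created. A minimal sketch, assuming redis-py 4.2+ and a Redis 7.0+ server (for EXPIRE's NX flag):

# Sketch: atomic debit in one round trip. expire(..., nx=True) sets the TTL
# only when the key has none yet, so repeated debits never push it forward.
def debit(api_key: str, tokens: int) -> int:
    key = quota_key(api_key)
    pipe = r.pipeline()  # transaction=True by default: MULTI/EXEC
    pipe.incrby(key, tokens)
    pipe.expire(key, 86400 * 2, nx=True)
    used, _ = pipe.execute()
    return used  # handy for an X-RateLimit-Used header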

Wiring Into Nginx

location /api/ {
    proxy_pass http://127.0.0.1:8085;   # quota sidecar
    proxy_buffering off;
    proxy_read_timeout 600s;
}

Nginx still does request/min limiting; the sidecar adds tokens/day on top.

For a complete reference of the patterns and primitives that make this kind of accounting work cleanly, the official Redis pattern guide covers rate limiting and counters.


Fairness Queue Pattern {#fairness}

For deployments with mixed priority — say, free-tier users vs paid — pure rate limits are not enough. You want round-robin fairness so a paid user with budget always beats a free user, but two paid users at the same time alternate. That is a queue, not a rate limit. The sketch below implements the round-robin half; a priority-tier check would layer on top of it.

# Simplified fairness queue (round-robin sketch)
import asyncio
from collections import deque, defaultdict

class FairQueue:
    def __init__(self):
        self.queues = defaultdict(deque)  # per-user FIFO of (request, future)
        self.users = deque()              # round-robin order of user ids
        self.cv = asyncio.Condition()

    async def submit(self, user_id, request):
        async with self.cv:
            if user_id not in self.users:
                self.users.append(user_id)
            future = asyncio.get_running_loop().create_future()
            self.queues[user_id].append((request, future))
            self.cv.notify()
        return await future

    async def worker(self, ollama_call):
        while True:
            async with self.cv:
                while not any(self.queues.values()):
                    await self.cv.wait()
                # Round-robin: rotate users until we find one with work
                while not self.queues[self.users[0]]:
                    self.users.rotate(-1)
                user_id = self.users[0]
                request, future = self.queues[user_id].popleft()
                self.users.rotate(-1)
            # Call Ollama outside the lock so other workers can keep dequeuing
            try:
                result = await ollama_call(request)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)

A single worker means strict serialisation. Multiple workers mean parallel inference up to OLLAMA_NUM_PARALLEL. The queue depth and per-user wait time become the primary capacity-planning metrics; both are worth exporting to Prometheus.
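
A minimal wiring sketch, assuming the FairQueue class above, a local Ollama, and "llama3" as a placeholder model name:

# Sketch: two workers drain the queue round-robin. Worker count should match
# OLLAMA_NUM_PARALLEL so ordering is decided here, not in Ollama's own queue.
import asyncio

import httpx

async def call_ollama(request: dict) -> dict:
    async with httpx.AsyncClient(timeout=600) as client:
        resp = await client.post("http://127.0.0.1:11434/api/generate", json=request)
        resp.raise_for_status()
        return resp.json()

async def main() -> None:
    fq = FairQueue()
    workers = [asyncio.create_task(fq.worker(call_ollama)) for _ in range(2)]
    # Alice queues two jobs, Bob one: dispatch order is alice, bob, alice.
    results = await asyncio.gather(
        fq.submit("alice", {"model": "llama3", "prompt": "one", "stream": False}),
        fq.submit("alice", {"model": "llama3", "prompt": "two", "stream": False}),
        fq.submit("bob", {"model": "llama3", "prompt": "three", "stream": False}),
    )
    print([res["response"][:20] for res in results])
    for w in workers:
        w.cancel()

asyncio.run(main())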


Communicating Limits to Users {#communicating}

Rate-limited users will assume the system is broken unless you tell them otherwise. Three things make 429 responses humane.

1. Use the standard headers.

HTTP/1.1 429 Too Many Requests
Retry-After: 18
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714080540
Content-Type: application/json

{"error":"rate_limit_exceeded","message":"30 requests/minute exceeded. Retry in 18s."}

The OpenAI SDK already understands these headers — clients written against OpenAI will retry automatically.
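
For plain scripts that hit the proxy directly, honoring Retry-After takes only a few lines. A sketch, with the endpoint and key as placeholders:

# Sketch: a client that sleeps for the server-suggested interval on 429
# instead of hammering the proxy.
import time

import httpx

def generate_with_backoff(payload: dict, api_key: str, max_retries: int = 5) -> dict:
    with httpx.Client(timeout=600) as client:
        for _ in range(max_retries):
            resp = client.post(
                "https://ollama.example.com/api/generate",
                headers={"X-API-Key": api_key},
                json=payload,
            )
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp.json()
            time.sleep(int(resp.headers.get("Retry-After", "2")))
    raise RuntimeError("rate limited after retries")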

2. Surface budgets in your UI.

A "Tokens used today: 187K of 250K" widget reduces support tickets dramatically. The numbers come from the same Redis counters as the enforcement, so adding the UI is roughly 30 lines of code.

3. Fail fast at the proxy, not deep in the pipeline.

A 429 in the first millisecond is far better UX than a request that waits 8 seconds in queue, then fails. Always enforce limits at the edge layer (Nginx, HAProxy) before any expensive work.


Pitfalls and Anti-Patterns {#pitfalls}

Pitfall 1: IP-Based Limits Behind a NAT

Symptom: Whole office gets rate-limited together.

Cause: $binary_remote_addr is the office NAT IP. All users share it.

Fix: Switch to API-key-based limiting and require the header. If a user does not have an API key, reject with 401.

Pitfall 2: proxy_buffering on

Symptom: Streaming responses arrive all at once after a long wait.

Cause: Nginx buffers the entire response before forwarding.

Fix: proxy_buffering off in the location block. This is non-negotiable for Ollama.

Pitfall 3: Limit Zone Too Small

Symptom: 503 errors under load with "could not allocate node in limit_req zone" in the Nginx error log.

Cause: zone=ollama:1m holds only about 16K rate-limit states, fewer when long API keys are the key value.

Fix: zone=ollama:10m (roughly 16K states per MB, so ~160K keys) is fine for most teams. zone=ollama:100m for very large deployments.

Pitfall 4: Token Budget Reset Drift

Symptom: Daily token budget resets at random times instead of UTC midnight.

Cause: Using SET ... EX 86400 instead of an explicit UTC date in the key.

Fix: Build the key from datetime.utcnow().strftime("%Y%m%d") so the budget is naturally per-UTC-day.

Pitfall 5: Concurrent Stream Limit Set Too Low

Symptom: Power users complain that opening two browser tabs breaks one.

Cause: limit_conn 1 per key.

Fix: 4-8 is a reasonable concurrent-stream budget per user. The limit is only there to stop runaway clients with hundreds of open connections.

For the related multi-GPU scaling considerations once a single Ollama instance is fully loaded, see the multi-GPU Ollama guide.


Final Notes

Rate limiting is the part of an Ollama deployment that everyone knows they need but nobody plans for until something breaks. The tools are simple — Nginx, a few directives, a small Redis sidecar — but the design questions are not. How granular should limits be? Should you charge by request or by token? What happens when the GPU is overloaded but the limit is not hit? Who decides the priority order?

Start with the smallest thing that solves an actual problem. Per-IP for a prototype. Per-API-key for a real team. Token budgets when usage starts costing real money in electricity. Priority queues only when you have free and paid tiers to differentiate. Layer them, monitor them in Grafana, and revisit the thresholds every couple of months as traffic patterns evolve.

The goal is not perfect fairness — it is a deployment where every user gets the same fast experience they would have on a private Ollama instance, even though the GPU is shared. Get the limits right, and the rest of your team forgets the server exists. Get them wrong, and you spend Monday morning explaining to leadership why the AI tool everyone loved last week stopped working.
