
Ollama Rate Limiting for Multi-User Setups: Quotas & Fairness

April 23, 2026
20 min read
LocalAimaster Research Team


The first user to discover your shared Ollama server runs a 200K-token research script overnight and locks the GPU until morning. The second user runs a tight loop of one-token completions and pegs the queue at 100. The third user opens 40 browser tabs, each polling for autocomplete every 300 ms. None of them are malicious. None of them think they are doing anything wrong. And by 9am Monday, the rest of the team cannot get a single response under 30 seconds.

Rate limiting is what stops that. Done well, it is invisible — every user gets fast responses, the GPU stays under 80 percent utilisation, and the per-user dashboards in Grafana show smooth bands. Done badly, it produces angry support tickets and an Ollama instance that is technically up but practically unusable. This guide is the difference.

Quick Start: limit_req_zone $http_x_api_key zone=ollama_per_key:10m rate=30r/m; in your Nginx config gives you per-API-key rate limiting at 30 requests per minute. Combine with a daily token budget enforced by a small Redis-backed sidecar and you cover 90 percent of multi-user Ollama needs.


Table of Contents

  1. Why Ollama Needs External Rate Limiting
  2. Choosing a Strategy
  3. Nginx Configuration
  4. HAProxy Configuration
  5. Caddy and Traefik
  6. Token Budgets with Redis
  7. Fairness Queue Pattern
  8. Communicating Limits to Users
  9. Pitfalls and Anti-Patterns

Why Ollama Needs External Rate Limiting {#why-rate-limit}

Ollama by itself has exactly two relevant knobs:

  • OLLAMA_NUM_PARALLEL — how many requests the server processes concurrently per loaded model. Recent releases auto-select 4 or 1 depending on available memory; older ones default to 1.
  • OLLAMA_MAX_LOADED_MODELS — how many distinct models can sit in VRAM at once. Defaults to 3 per GPU on recent releases, 1 on older ones.

Neither addresses the problem. Concurrency 4 means four users get GPU time simultaneously, then the next four wait — and the queue is unbounded and unprioritised. There is no per-user budget, no fair-share scheduling, no way to differentiate a CI integration test from a human chat.
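
To see the queueing behaviour concretely, the sketch below fires eight concurrent generations at a stock server and prints per-request latency. The endpoint and the model name "llama3" are placeholder assumptions.

# Sketch: observe Ollama's unbounded FIFO queue by timing concurrent requests.
# Assumes Ollama on localhost:11434 with a model named "llama3" already pulled.
import asyncio
import time

import httpx

async def one_request(client: httpx.AsyncClient, i: int) -> None:
    start = time.monotonic()
    resp = await client.post(
        "http://127.0.0.1:11434/api/generate",
        json={"model": "llama3", "prompt": "Say hi.", "stream": False},
    )
    resp.raise_for_status()
    print(f"request {i}: {time.monotonic() - start:.1f}s")

async def main() -> None:
    async with httpx.AsyncClient(timeout=600) as client:
        # With OLLAMA_NUM_PARALLEL=1, latencies grow roughly linearly:
        # request N waits for requests 1..N-1 to finish first.
        await asyncio.gather(*(one_request(client, i) for i in range(8)))

asyncio.run(main())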

The fix is always to put a layer in front. That layer can be:

  1. A reverse proxy (Nginx, HAProxy, Caddy, Traefik) that enforces request-rate limits via headers or IP.
  2. An API gateway (Kong, Tyk, KrakenD, LiteLLM) that adds per-API-key auth, token-bucket limits, and routing.
  3. A custom queue service that implements fair-share scheduling, priority, and token budgets.

Most production Ollama deployments use a reverse proxy plus a small custom service for token budgets. That combination covers the request-per-minute case (proxy) and the tokens-per-day case (custom).

For the wider production architecture this rate-limiting layer fits into, see the Ollama production deployment guide and the Ollama monitoring guide — quotas without observability are guesswork.


Choosing a Strategy {#strategy}

| Strategy | Enforcement | Granularity | Implementation | Use when |
|---|---|---|---|---|
| Per-IP req/min | Reverse proxy | Coarse | 5 lines Nginx | Single-tenant, prototype |
| Per-API-key req/min | Reverse proxy | Per user | 15 lines Nginx | Small team, named users |
| Per-API-key tokens/day | Sidecar + Redis | Per user, accurate | ~150 LOC Python | Multi-team, cost control |
| Tier-based priority | Custom queue | Per user, fair | ~400 LOC Go | Premium/free tiers |
| Full API gateway | Kong / LiteLLM | Per route, complex | Container | Multi-model, multi-tenant |

Pick the simplest tier that solves your actual problem. A research lab with 8 named users does not need Kong. A startup hosting a SaaS layer over Ollama probably does.


Nginx Configuration {#nginx}

Nginx is the most common reverse proxy in front of Ollama. The relevant directives are limit_req_zone, limit_conn_zone, and limit_req.

Step 1: Per-IP Limit (Baseline)

http {
    limit_req_zone $binary_remote_addr zone=ollama_ip:10m rate=30r/m;
    limit_req_status 429;

    upstream ollama_backend {
        server 127.0.0.1:11434;
        keepalive 32;
    }

    server {
        listen 443 ssl http2;
        server_name ollama.example.com;

        # ... ssl_certificate, ssl_certificate_key ...

        location / {
            limit_req zone=ollama_ip burst=5 nodelay;

            proxy_pass http://ollama_backend;
            proxy_http_version 1.1;
            proxy_buffering off;          # critical for streaming
            proxy_read_timeout 600s;       # large model = long replies
            proxy_set_header Connection "";
        }
    }
}

burst=5 nodelay allows a short burst of 5 requests above the steady rate before the proxy starts returning 429. Note that Nginx smooths rate=30r/m to one request every two seconds, not 30 anywhere in a minute, which is why the burst allowance matters. proxy_buffering off is mandatory: without it, Ollama's streaming responses are buffered and users see no output until the full response is generated.
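
A quick way to confirm the limit behaves as intended is to rapid-fire requests and count the 429s. A minimal sketch, assuming the Step 1 config is live at the placeholder hostname:

# Sketch: fire 20 rapid requests at the proxy and count rejections.
import httpx

ok = limited = 0
with httpx.Client() as client:
    for _ in range(20):
        status = client.get("https://ollama.example.com/api/tags").status_code
        if status == 429:
            limited += 1
        else:
            ok += 1

# With rate=30r/m burst=5 nodelay, expect ~6 passes (one steady slot plus
# five burst) and 429 for the rest of the rapid-fire batch.
print(f"passed={ok} rejected={limited}")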

Step 2: Per-API-Key Limit

http {
    # API key required, fall back to IP if missing
    map $http_x_api_key $api_key_id {
        default $binary_remote_addr;
        "" $binary_remote_addr;
        ~. $http_x_api_key;
    }

    limit_req_zone $api_key_id zone=ollama_per_key:10m rate=30r/m;

    server {
        # ... tls, listen ...

        # Reject missing API key entirely
        if ($http_x_api_key = "") {
            return 401 '{"error":"missing X-API-Key header"}';
        }

        location / {
            limit_req zone=ollama_per_key burst=10 nodelay;
            proxy_pass http://ollama_backend;
            proxy_buffering off;
            proxy_read_timeout 600s;
        }
    }
}

Step 3: Per-API-Key Concurrency

Concurrency limit (limit_conn) is what stops a single user opening 40 simultaneous streaming connections.

http {
    limit_conn_zone $api_key_id zone=ollama_conn:10m;

    server {
        location / {
            limit_req zone=ollama_per_key burst=10 nodelay;
            limit_conn ollama_conn 4;     # max 4 simultaneous streams per key
            proxy_pass http://ollama_backend;
        }
    }
}
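
To verify the cap, open more simultaneous streams than limit_conn allows. A sketch with placeholder key and model; note that limit_conn rejects with 503 by default unless limit_conn_status 429 is set:

# Sketch: 6 concurrent streams against limit_conn 4 -- expect two rejections.
import asyncio

import httpx

async def open_stream(client: httpx.AsyncClient, i: int) -> None:
    async with client.stream(
        "POST",
        "https://ollama.example.com/api/generate",
        headers={"X-API-Key": "demo-key"},  # placeholder key
        json={"model": "llama3", "prompt": "Count to 100 slowly."},
    ) as resp:
        print(f"stream {i}: HTTP {resp.status_code}")
        async for _ in resp.aiter_raw():
            pass  # drain the stream

async def main() -> None:
    async with httpx.AsyncClient(timeout=600) as client:
        await asyncio.gather(*(open_stream(client, i) for i in range(6)))

asyncio.run(main())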

Step 4: Validate API Keys

For real auth (not just rate-limit grouping), add an auth_request to a small validation service:

location = /_auth {
    internal;
    proxy_pass http://127.0.0.1:8080/validate;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-Key $http_x_api_key;
}

location / {
    auth_request /_auth;
    limit_req zone=ollama_per_key burst=10 nodelay;
    limit_conn ollama_conn 4;
    proxy_pass http://ollama_backend;
}
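
The validation service itself can be tiny. A sketch of the endpoint the auth_request points at, with the key store as a placeholder; auth_request treats any 2xx as allow and 401/403 as deny:

# validate.py -- key check behind Nginx auth_request.
# Placeholder key store; real deployments read a database or secrets file.
import os

from fastapi import FastAPI, Request, Response

app = FastAPI()
VALID_KEYS = set(filter(None, os.environ.get("OLLAMA_API_KEYS", "").split(",")))

@app.api_route("/validate", methods=["GET", "POST"])
async def validate(request: Request) -> Response:
    key = request.headers.get("x-original-key", "")
    if key in VALID_KEYS:
        return Response(status_code=204)  # 2xx: Nginx lets the request through
    return Response(status_code=401)      # 401/403: Nginx denies it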

The official Nginx documentation for the ngx_http_limit_req_module at nginx.org covers every option these directives accept.


HAProxy Configuration {#haproxy}

HAProxy is the right pick when you need richer ACLs, sticky sessions, or already use HAProxy for other services.

frontend ollama_in
    bind *:443 ssl crt /etc/haproxy/cert.pem alpn h2,http/1.1
    mode http

    # Capture API key for tracking
    http-request set-var(txn.api_key) req.hdr(X-API-Key)

    # Reject missing key
    http-request deny status 401 if !{ var(txn.api_key) -m found }

    # Track request rate per API key
    stick-table type string len 64 size 100k expire 1m store http_req_rate(60s)
    http-request track-sc0 var(txn.api_key)

    # Rate limit: 30 requests per minute per key
    http-request deny status 429 if { sc_http_req_rate(0) gt 30 }

    default_backend ollama_be

backend ollama_be
    mode http
    timeout server 600s
    option http-buffer-request
    server ollama1 127.0.0.1:11434 check

HAProxy's stick-table counters live in the proxy's own process memory, which tends to scale better at very high request rates than Nginx's limit_req, where every request locks a shared-memory zone.


Caddy and Traefik {#caddy-traefik}

Both modern proxies have plugin-based rate limiting.

Caddy

{
    order rate_limit before reverse_proxy
}

ollama.example.com {
    rate_limit {
        zone per_key {
            key {http.request.header.X-API-Key}
            events 30
            window 1m
        }
    }

    reverse_proxy 127.0.0.1:11434 {
        flush_interval -1
        transport http {
            response_header_timeout 600s
        }
    }
}

Caddy's rate_limit is a third-party module — install with xcaddy build --with github.com/mholt/caddy-ratelimit.

Traefik

http:
  middlewares:
    ollama-ratelimit:
      rateLimit:
        average: 30
        period: 1m
        burst: 10
        sourceCriterion:
          requestHeaderName: X-API-Key

  routers:
    ollama:
      rule: Host(`ollama.example.com`)
      middlewares:
        - ollama-ratelimit
      service: ollama
      tls:
        certResolver: letsencrypt

  services:
    ollama:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:11434

Traefik shines in container deployments where the proxy config lives next to the docker-compose. For bare-metal Ollama, Nginx is usually less work.


Token Budgets with Redis {#token-budgets}

Request count limits do not cover the user who sends one prompt with 100K input tokens. For that, you need a token budget. The pattern: a small sidecar reads the eval_count and prompt_eval_count fields from each Ollama response, debits the API key's daily budget in Redis, and rejects future requests when the budget is depleted.

Sidecar (FastAPI + Redis)

# ollama-quota.py
import json
import os
from datetime import datetime

import httpx
import redis
from fastapi import FastAPI, Request
from fastapi.responses import Response, StreamingResponse

OLLAMA = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
DAILY_TOKEN_LIMIT = int(os.environ.get("DAILY_TOKEN_LIMIT", "250000"))
r = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
app = FastAPI()

def quota_key(api_key: str) -> str:
    today = datetime.utcnow().strftime("%Y%m%d")
    return f"ollama:quota:{api_key}:{today}"

@app.middleware("http")
async def enforce_quota(request: Request, call_next):
    api_key = request.headers.get("x-api-key")
    if not api_key:
        # Return a Response here: an HTTPException raised inside middleware
        # bypasses FastAPI's exception handlers and surfaces as a 500.
        return Response(
            status_code=401,
            content='{"error":"missing X-API-Key"}',
            media_type="application/json",
        )

    used = int(r.get(quota_key(api_key)) or 0)
    if used >= DAILY_TOKEN_LIMIT:
        return Response(
            status_code=429,
            headers={
                "Retry-After": "3600",
                "X-RateLimit-Limit": str(DAILY_TOKEN_LIMIT),
                "X-RateLimit-Used": str(used),
            },
            content='{"error":"daily token budget exceeded"}',
            media_type="application/json",
        )
    return await call_next(request)

@app.post("/api/chat")
@app.post("/api/generate")
async def proxy(request: Request):
    api_key = request.headers["x-api-key"]
    body = await request.body()

    async def upstream():
        tail = b""
        async with httpx.AsyncClient(timeout=600) as cli:
            async with cli.stream(
                "POST", f"{OLLAMA}{request.url.path}", content=body
            ) as resp:
                async for chunk in resp.aiter_raw():
                    # Keep a generous tail: the final /api/generate object
                    # can carry a large "context" array.
                    tail = (tail + chunk)[-1_048_576:]
                    yield chunk
        # The final NDJSON object carries prompt_eval_count and eval_count.
        try:
            final = json.loads(tail.splitlines()[-1])
            total = final.get("prompt_eval_count", 0) + final.get("eval_count", 0)
            key = quota_key(api_key)
            r.incrby(key, total)
            r.expire(key, 86400 * 2)  # 48h safety window
        except (ValueError, IndexError):
            pass  # unparseable tail: skip accounting rather than break the stream

    return StreamingResponse(upstream(), media_type="application/x-ndjson")

The debit runs after the stream closes: the sidecar sums prompt_eval_count and eval_count from the final object and increments the day's Redis counter. The 48-hour expiry is deliberately longer than the one-day window; because the key already encodes the UTC date, the TTL exists only to garbage-collect stale keys.
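
One refinement worth considering: do the increment and the expiry in a single transaction, and set the TTL only when the counter is first created. A minimal sketch, assuming redis-py 4.2+ and a Redis 7.0+ server (for EXPIRE's NX flag):

# Sketch: atomic debit in one round trip. expire(..., nx=True) sets the TTL
# only when the key has none yet, so repeated debits never push it forward.
def debit(api_key: str, tokens: int) -> int:
    key = quota_key(api_key)
    pipe = r.pipeline()  # transaction=True by default: MULTI/EXEC
    pipe.incrby(key, tokens)
    pipe.expire(key, 86400 * 2, nx=True)
    used, _ = pipe.execute()
    return used  # handy for an X-RateLimit-Used header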

Wiring Into Nginx

location /api/ {
    proxy_pass http://127.0.0.1:8085;   # quota sidecar
    proxy_buffering off;
    proxy_read_timeout 600s;
}

Nginx still does request/min limiting; the sidecar adds tokens/day on top.

For a complete reference of the patterns and primitives that make this kind of accounting work cleanly, the official Redis pattern guide covers rate limiting and counters.


Fairness Queue Pattern {#fairness}

For deployments with mixed priority — say, free-tier users vs paid — pure rate limits are not enough. You want round-robin fairness so a paid user with budget always beats a free user, but two paid users at the same time alternate. That is a queue, not a rate limit. The sketch below implements the round-robin half; a priority-tier check would layer on top of it.

# Simplified fairness queue (round-robin sketch)
import asyncio
from collections import deque, defaultdict

class FairQueue:
    def __init__(self):
        self.queues = defaultdict(deque)  # per-user FIFO of (request, future)
        self.users = deque()              # round-robin order of user ids
        self.cv = asyncio.Condition()

    async def submit(self, user_id, request):
        async with self.cv:
            if user_id not in self.users:
                self.users.append(user_id)
            future = asyncio.get_running_loop().create_future()
            self.queues[user_id].append((request, future))
            self.cv.notify()
        return await future

    async def worker(self, ollama_call):
        while True:
            async with self.cv:
                while not any(self.queues.values()):
                    await self.cv.wait()
                # Round-robin: rotate users until we find one with work
                while not self.queues[self.users[0]]:
                    self.users.rotate(-1)
                user_id = self.users[0]
                request, future = self.queues[user_id].popleft()
                self.users.rotate(-1)
            # Call Ollama outside the lock so other workers can keep dequeuing
            try:
                result = await ollama_call(request)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)

A single worker means strict serialisation. Multiple workers mean parallel inference up to OLLAMA_NUM_PARALLEL. The queue depth and per-user wait time become the primary capacity-planning metrics; both are worth exporting to Prometheus.
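
A minimal wiring sketch, assuming the FairQueue class above, a local Ollama, and "llama3" as a placeholder model name:

# Sketch: two workers drain the queue round-robin. Worker count should match
# OLLAMA_NUM_PARALLEL so ordering is decided here, not in Ollama's own queue.
import asyncio

import httpx

async def call_ollama(request: dict) -> dict:
    async with httpx.AsyncClient(timeout=600) as client:
        resp = await client.post("http://127.0.0.1:11434/api/generate", json=request)
        resp.raise_for_status()
        return resp.json()

async def main() -> None:
    fq = FairQueue()
    workers = [asyncio.create_task(fq.worker(call_ollama)) for _ in range(2)]
    # Alice queues two jobs, Bob one: dispatch order is alice, bob, alice.
    results = await asyncio.gather(
        fq.submit("alice", {"model": "llama3", "prompt": "one", "stream": False}),
        fq.submit("alice", {"model": "llama3", "prompt": "two", "stream": False}),
        fq.submit("bob", {"model": "llama3", "prompt": "three", "stream": False}),
    )
    print([res["response"][:20] for res in results])
    for w in workers:
        w.cancel()

asyncio.run(main())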


Communicating Limits to Users {#communicating}

Rate-limited users will assume the system is broken unless you tell them otherwise. Three things make 429 responses humane.

1. Use the standard headers.

HTTP/1.1 429 Too Many Requests
Retry-After: 18
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714080540
Content-Type: application/json

{"error":"rate_limit_exceeded","message":"30 requests/minute exceeded. Retry in 18s."}

The OpenAI SDK already understands these headers — clients written against OpenAI will retry automatically.
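
For plain scripts that hit the proxy directly, honoring Retry-After takes only a few lines. A sketch, with the endpoint and key as placeholders:

# Sketch: a client that sleeps for the server-suggested interval on 429
# instead of hammering the proxy.
import time

import httpx

def generate_with_backoff(payload: dict, api_key: str, max_retries: int = 5) -> dict:
    with httpx.Client(timeout=600) as client:
        for _ in range(max_retries):
            resp = client.post(
                "https://ollama.example.com/api/generate",
                headers={"X-API-Key": api_key},
                json=payload,
            )
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp.json()
            time.sleep(int(resp.headers.get("Retry-After", "2")))
    raise RuntimeError("rate limited after retries")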

2. Surface budgets in your UI.

A "Tokens used today: 187K of 250K" widget reduces support tickets dramatically. The numbers come from the same Redis counters as the enforcement, so adding the UI is roughly 30 lines of code.

3. Fail fast at the proxy, not deep in the pipeline.

A 429 in the first millisecond is far better UX than a request that waits 8 seconds in queue, then fails. Always enforce limits at the edge layer (Nginx, HAProxy) before any expensive work.


Pitfalls and Anti-Patterns {#pitfalls}

Pitfall 1: IP-Based Limits Behind a NAT

Symptom: Whole office gets rate-limited together.

Cause: $binary_remote_addr is the office NAT IP. All users share it.

Fix: Switch to API-key-based limiting and require the header. If a user does not have an API key, reject with 401.

Pitfall 2: proxy_buffering on

Symptom: Streaming responses arrive all at once after a long wait.

Cause: Nginx buffers the entire response before forwarding.

Fix: proxy_buffering off in the location block. This is non-negotiable for Ollama.

Pitfall 3: Limit Zone Too Small

Symptom: 503 errors under load with "could not allocate node in limit_req zone" in the Nginx error log.

Cause: zone=ollama:1m holds only about 16K rate-limit states, fewer when long API keys are the key value.

Fix: zone=ollama:10m (roughly 16K states per MB, so ~160K keys) is fine for most teams. zone=ollama:100m for very large deployments.

Pitfall 4: Token Budget Reset Drift

Symptom: Daily token budget resets at random times instead of UTC midnight.

Cause: Using SET ... EX 86400 instead of an explicit UTC date in the key.

Fix: Build the key from datetime.utcnow().strftime("%Y%m%d") so the budget is naturally per-UTC-day.

Pitfall 5: Concurrent Stream Limit Set Too Low

Symptom: Power users complain that opening two browser tabs breaks one.

Cause: limit_conn 1 per key.

Fix: 4-8 is a reasonable concurrent-stream budget per user. The limit is only there to stop runaway clients with hundreds of open connections.

For the related multi-GPU scaling considerations once a single Ollama instance is fully loaded, see the multi-GPU Ollama guide.


Final Notes

Rate limiting is the part of an Ollama deployment that everyone knows they need but nobody plans for until something breaks. The tools are simple — Nginx, a few directives, a small Redis sidecar — but the design questions are not. How granular should limits be? Should you charge by request or by token? What happens when the GPU is overloaded but the limit is not hit? Who decides the priority order?

Start with the smallest thing that solves an actual problem. Per-IP for a prototype. Per-API-key for a real team. Token budgets when usage starts costing real money in electricity. Priority queues only when you have free and paid tiers to differentiate. Layer them, monitor them in Grafana, and revisit the thresholds every couple of months as traffic patterns evolve.

The goal is not perfect fairness — it is a deployment where every user gets the same fast experience they would have on a private Ollama instance, even though the GPU is shared. Get the limits right, and the rest of your team forgets the server exists. Get them wrong, and you spend Monday morning explaining to leadership why the AI tool everyone loved last week stopped working.
