Ollama in Production: Docker, SSL, Auth & Monitoring
Published on April 11, 2026 -- 20 min read
Running ollama run llama3.3 on your laptop is trivial. Running Ollama for a team of 20 people who depend on it daily is a different problem entirely. The model inference part works fine. Everything else — authentication, TLS, monitoring, restart policies, resource limits, log management — is undocumented, scattered across GitHub issues, or left as an exercise for the reader.
This is the guide that fills that gap. A production Ollama deployment that handles real users, survives reboots, alerts you before things break, and does not expose your GPU to the entire internet.
The Production Architecture {#production-architecture}
Internet / Internal Network
|
v
+------------+-------------+
| Nginx Reverse Proxy |
| TLS termination |
| API key validation |
| Rate limiting |
+------------+-------------+
|
+---------------+---------------+
| |
v v
+-------------+----------+ +-----------------+--------+
| Open WebUI | | Direct API |
| (User auth, RBAC, | | (Authenticated clients, |
| chat interface) | | scripts, integrations) |
+-------------+----------+ +-----------------+--------+
| |
+---------------+---------------+
|
v
+------------+-------------+
| Ollama Server |
| (localhost:11434 only) |
| GPU inference |
+------------+-------------+
|
v
+------------+-------------+
| Prometheus + Grafana |
| Metrics, dashboards, |
| alerting |
+--------------------------+
Every component runs in Docker. Everything restarts automatically. Nothing is exposed without authentication.
Docker Compose: The Full Stack {#docker-compose}
This is the complete docker-compose.yml for production. Not a toy example — this includes health checks, resource limits, proper networking, and GPU allocation.
# docker-compose.yml — Ollama Production Stack
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-prod
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_KEEP_ALIVE=10m
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 48G
    healthcheck:
      # The ollama image does not bundle curl; probe with the ollama CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - ai_internal
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui-prod
    ports:
      - "127.0.0.1:3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - DEFAULT_USER_ROLE=user
      - WEBUI_SECRET_KEY_FILE=/run/secrets/webui_secret
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      ollama:
        condition: service_healthy
    secrets:
      - webui_secret
    networks:
      - ai_internal
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
      - ./nginx/auth:/etc/nginx/auth:ro
    depends_on:
      - open-webui
      - ollama
    networks:
      - ai_internal
      - ai_external
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "127.0.0.1:9090:9090"
    # Overriding command replaces the image defaults, so restate them, then
    # enable the admin API that the backup script's snapshot call requires
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.enable-admin-api
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    networks:
      - ai_internal
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "127.0.0.1:3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD_FILE=/run/secrets/grafana_password
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
    depends_on:
      - prometheus
    secrets:
      - grafana_password
    networks:
      - ai_internal
    restart: unless-stopped

  nvidia-exporter:
    image: utkuozdemir/nvidia_gpu_exporter:latest
    container_name: nvidia-exporter
    ports:
      - "127.0.0.1:9835:9835"
    volumes:
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - ai_internal
    restart: unless-stopped

volumes:
  ollama_data:
  webui_data:
  prometheus_data:
  grafana_data:

networks:
  ai_internal:
    internal: true
  ai_external:

secrets:
  webui_secret:
    file: ./secrets/webui_secret.txt
  grafana_password:
    file: ./secrets/grafana_password.txt
Key production decisions in this configuration:
- The ai_internal network is marked internal: true — containers on this network cannot reach the internet, so Ollama, Prometheus, and Grafana are completely isolated. (Pull your models before enabling isolation; ollama pull needs outbound access too.)
- Only Nginx is reachable from other machines. Every other published port (Open WebUI, Prometheus, Grafana) binds to 127.0.0.1, and Ollama publishes nothing to the host at all.
- Health checks on Ollama and Open WebUI ensure Docker restarts unhealthy containers automatically.
- Memory limit of 48GB prevents Ollama from consuming all system RAM if a large context request arrives.
- Log rotation via Docker's json-file driver with 50MB max size prevents disk exhaustion.
For the foundational Docker setup, the Ollama + Open WebUI Docker guide covers installation and first run in detail.
Nginx Reverse Proxy with SSL {#nginx-ssl}
Certificate Setup with Certbot
# Install certbot
sudo apt install certbot -y
# Generate certificate (run before starting the Docker stack)
sudo certbot certonly --standalone -d ai.yourcompany.com
# Copy certs to the project directory
mkdir -p nginx/ssl
sudo cp /etc/letsencrypt/live/ai.yourcompany.com/fullchain.pem nginx/ssl/
sudo cp /etc/letsencrypt/live/ai.yourcompany.com/privkey.pem nginx/ssl/
sudo chmod 644 nginx/ssl/fullchain.pem
sudo chmod 600 nginx/ssl/privkey.pem   # the private key must not be world-readable
Nginx Configuration
# nginx/conf.d/ollama-prod.conf

# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=ui_limit:10m rate=60r/m;

# Upstream definitions
upstream open_webui {
    server open-webui:8080;
}

upstream ollama_api {
    server ollama:11434;
}

# HTTPS server
server {
    listen 443 ssl;
    http2 on;
    server_name ai.yourcompany.com;

    ssl_certificate /etc/nginx/ssl/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers on;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options nosniff always;
    add_header X-Frame-Options DENY always;

    # Web UI (Open WebUI)
    location / {
        limit_req zone=ui_limit burst=20 nodelay;
        proxy_pass http://open_webui;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # WebSocket support for streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;
    }

    # Direct API access (requires API key)
    location /api/ollama/ {
        limit_req zone=api_limit burst=10 nodelay;

        # API key validation: clients send the key as HTTP Basic auth, checked
        # against an htpasswd-style file. auth_basic returns 401 on its own
        # when credentials are missing or wrong, so no extra plumbing is needed.
        auth_basic "API Access";
        auth_basic_user_file /etc/nginx/auth/api_keys;

        # Strip the /api/ollama/ prefix and forward to Ollama
        rewrite ^/api/ollama/(.*) /$1 break;
        proxy_pass http://ollama_api;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}

# HTTP redirect
server {
    listen 80;
    server_name ai.yourcompany.com;
    return 301 https://$host$request_uri;
}
Generate API Keys
# Create API key file
mkdir -p nginx/auth
# Generate API keys for programmatic access
# Format: username:api_key (using htpasswd for basic auth)
sudo apt install apache2-utils -y
# Generate a key, show it once, and store only the hash; the plaintext key
# cannot be recovered from the htpasswd file later, so save it now
API_KEY="sk-$(openssl rand -hex 32)"
echo "service-account-1: ${API_KEY}"
htpasswd -cb nginx/auth/api_keys service-account-1 "${API_KEY}"

API_KEY="sk-$(openssl rand -hex 32)"
echo "service-account-2: ${API_KEY}"
htpasswd -b nginx/auth/api_keys service-account-2 "${API_KEY}"
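Clients authenticate against that htpasswd file with standard HTTP Basic auth, so the Authorization header is just base64 of username:key. A quick Python sketch of what any client library does under the hood (the credentials here are made up):

```python
import base64

def basic_auth_header(username: str, api_key: str) -> str:
    """Build the Authorization header value that Nginx's auth_basic checks."""
    token = base64.b64encode(f"{username}:{api_key}".encode()).decode()
    return f"Basic {token}"

# Hypothetical credentials, for illustration only
print(basic_auth_header("service-account-1", "sk-0123abcd"))
```

In practice curl does this for you with -u service-account-1:$API_KEY.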
Rate Limiting Per User
The Nginx config above implements two rate limit zones:
- UI users: 60 requests/minute with burst of 20 (generous for browser interaction)
- API clients: 30 requests/minute with burst of 10 (prevents automated abuse)
For a team of 20 users, these limits prevent any single user from monopolizing the GPU while still allowing normal usage patterns. Adjust based on your actual capacity.
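For intuition about how these zones behave, nginx's limit_req is a leaky bucket: each request adds one to an "excess" counter that drains at the configured rate, and anything that would push excess past burst gets a 503. A rough Python approximation (not nginx's exact millisecond-scaled arithmetic):

```python
def simulate_limit_req(timestamps, rate_per_min, burst):
    """Approximate nginx limit_req with nodelay: excess drains at the
    configured rate, each allowed request adds 1, and a request that
    would push excess past burst is rejected with 503."""
    rate = rate_per_min / 60.0          # drain rate in requests per second
    excess, last, verdicts = 0.0, None, []
    for t in timestamps:
        if last is None:
            new_excess = 0.0            # first request starts a fresh bucket
        else:
            new_excess = max(excess - (t - last) * rate, 0.0) + 1.0
        if new_excess > burst:
            verdicts.append("503")      # rejected; bucket state unchanged
        else:
            excess, last = new_excess, t
            verdicts.append("ok")
    return verdicts

# A burst of 15 requests, 50 ms apart, against the API zone (30 r/m, burst=10)
verdicts = simulate_limit_req([i * 0.05 for i in range(15)], 30, 10)
print(verdicts.count("ok"), "allowed,", verdicts.count("503"), "rejected")
```

In this approximation 11 of the 15 get through (one plus the burst allowance, since the bucket barely drains in under a second) and the rest are rejected.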
Monitoring: Prometheus + Grafana {#monitoring}
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["nvidia-exporter:9835"]

  # Note: Ollama does not expose a native Prometheus /metrics endpoint at the
  # time of writing; this job assumes a metrics exporter or proxy in front of
  # it. Drop it (and the ollama_* alert rules) if you do not run one.
  - job_name: "ollama"
    metrics_path: /metrics
    static_configs:
      - targets: ["ollama:11434"]

  # Requires the nginx-prometheus-exporter sidecar (not in the compose file above)
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx:9113"]

  # Requires node_exporter on the host; on Linux, map host.docker.internal
  # with extra_hosts: "host.docker.internal:host-gateway"
  - job_name: "node"
    static_configs:
      - targets: ["host.docker.internal:9100"]
Alert Rules
# prometheus/alerts.yml
groups:
  - name: ollama_production
    rules:
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 80C for 5 minutes"
          description: "GPU {{ $labels.gpu }} is at {{ $value }}C. Check cooling."

      - alert: GPUTemperatureCritical
        expr: nvidia_gpu_temperature_celsius > 88
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical — thermal throttling imminent"

      - alert: GPUMemoryExhausted
        expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory above 95% — OOM risk"

      - alert: OllamaDown
        expr: up{job="ollama"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ollama is not responding"

      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile response time exceeds 60 seconds"
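The HighRequestLatency rule leans on histogram_quantile, which is worth understanding: it finds the cumulative bucket the target rank falls into and interpolates linearly inside it. A Python sketch with made-up latency buckets:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    the way Prometheus's histogram_quantile() does: locate the bucket the
    rank lands in, then interpolate linearly within it."""
    total = buckets[-1][1]              # the +Inf bucket counts everything
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                return prev_bound       # fall back to the last finite bound
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count

# Illustrative cumulative buckets: 90 of 100 requests finished within 60 s
buckets = [(10, 40), (30, 70), (60, 90), (120, 98), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 97.5: above 60, so the alert fires
```

The estimate is only as good as the bucket boundaries, which is why coarse default buckets can make a P95 alert look noisier than it is.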
Grafana Dashboard Panels
Create a dashboard at grafana/dashboards/ollama-production.json with these panels:
| Panel | Metric | Purpose |
|---|---|---|
| GPU Temperature | nvidia_gpu_temperature_celsius | Catch cooling failures before throttling |
| GPU Memory Used | nvidia_gpu_memory_used_bytes | Track memory pressure across sessions |
| GPU Utilization | nvidia_gpu_utilization_percent | Is the GPU actually working or idle? |
| Tokens/Second | ollama_tokens_generated_total rate | Performance trending over time |
| Active Requests | ollama_active_requests | Current load on the system |
| Request Queue | ollama_queue_depth | How many requests are waiting |
| P95 Latency | histogram_quantile(0.95, ollama_request_duration) | User experience metric |
| System RAM | node_memory_MemAvailable_bytes | Catch memory leaks early |
| Disk Usage | node_filesystem_avail_bytes | Model storage and logs |
The GPU temperature panel is the single most important metric. A failing fan or clogged heatsink will throttle your GPU from 35 tok/s to 8 tok/s before it shuts down entirely. Catch it at 80C, not at 90C.
Structured Logging {#logging}
Default Ollama logging is minimal. Augment it with structured JSON logs for production debugging:
Docker Log Aggregation
# View real-time logs across all services
docker compose logs -f --tail=100
# Filter to just Ollama
docker compose logs -f ollama
# Export logs for analysis
docker compose logs --no-color ollama > /var/log/ollama/ollama-$(date +%Y%m%d).log
Log Rotation
Docker's built-in log rotation (configured in the compose file above) handles container logs. For additional application logs:
# /etc/logrotate.d/ollama-prod
/var/log/ollama/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0640 root adm
dateext
}
ELK Stack Integration (Optional)
For teams that need searchable, centralized logs:
# Add to docker-compose.yml
filebeat:
image: docker.elastic.co/beats/filebeat:8.12.0
container_name: filebeat
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
networks:
- ai_internal
restart: unless-stopped
Backup Strategy {#backup-strategy}
What to Back Up
| Data | Location | Size | Frequency | Method |
|---|---|---|---|---|
| Docker Compose + configs | ./ | <1MB | Every change (Git) | Git repository |
| Open WebUI database | webui_data volume | 10-500MB | Daily | Volume snapshot |
| User uploads | webui_data volume | Variable | Daily | rsync to NAS |
| Model files | ollama_data volume | 40-200GB | Weekly | rsync to NAS |
| Prometheus metrics | prometheus_data volume | 1-10GB | Weekly | Volume snapshot |
| SSL certificates | ./nginx/ssl/ | <10KB | On renewal | Git (encrypted) |
Automated Backup Script
#!/bin/bash
# backup-ollama-prod.sh — runs nightly via cron
BACKUP_DIR="/backup/ollama-prod"
DATE=$(date +%Y%m%d)
NAS_PATH="/mnt/nas/backups/ollama"
mkdir -p "${BACKUP_DIR}/${DATE}"
cd /opt/ollama-prod   # docker compose commands need the project directory
# 1. Stop Open WebUI briefly for consistent DB backup
docker compose stop open-webui
docker run --rm -v ollama-prod_webui_data:/data -v "${BACKUP_DIR}/${DATE}:/backup" \
  alpine tar czf /backup/webui-data.tar.gz /data
docker compose start open-webui
# 2. Sync model files (Ollama stays running — models are read-only during inference)
rsync -az --progress /var/lib/docker/volumes/ollama-prod_ollama_data/_data/ \
"${NAS_PATH}/models/"
# 3. Prometheus snapshot (requires Prometheus started with --web.enable-admin-api)
curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot | \
  jq -r '.data.name' | \
  xargs -I {} cp -r /var/lib/docker/volumes/ollama-prod_prometheus_data/_data/snapshots/{} \
    "${BACKUP_DIR}/${DATE}/"
# 4. Retention: keep 30 days of local backups (-mindepth 1 protects the root dir)
find "${BACKUP_DIR}" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
echo "[$(date)] Backup complete: ${BACKUP_DIR}/${DATE}"
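The checklist below insists on testing restores, and that is automatable too. This sketch builds a throwaway archive, restores it to a scratch directory, and checks the files came back; in practice you would point it at the real webui-data.tar.gz (the file names here are hypothetical):

```python
import os
import tarfile
import tempfile
from pathlib import Path

def verify_backup(archive: str, expected: set) -> bool:
    """Restore an archive into a scratch directory and confirm the expected
    files reappear; a backup only counts once a restore has succeeded."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            # Trusted archive here; prefer extractall(..., filter="data")
            # on Python 3.12+ to block path traversal
            tar.extractall(scratch)
        restored = {p.name for p in Path(scratch).rglob("*") if p.is_file()}
        return expected <= restored

# Demonstrate against a freshly built throwaway archive
with tempfile.TemporaryDirectory() as src:
    Path(src, "webui.db").write_text("placeholder payload")
    archive = os.path.join(src, "webui-data.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(Path(src, "webui.db"), arcname="webui.db")
    print(verify_backup(archive, {"webui.db"}))  # True
```

Wire a check like this into the nightly cron job and a silently corrupt backup becomes an alert instead of a surprise during an outage.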
Auto-Restart and Failure Recovery {#auto-restart}
Docker Restart Policies
The restart: unless-stopped policy in the compose file handles most failures. But Docker itself can fail.
Systemd Service for Docker Compose
# /etc/systemd/system/ollama-prod.service
[Unit]
Description=Ollama Production Stack
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/ollama-prod
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
ExecReload=/usr/bin/docker compose restart
TimeoutStartSec=300
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable ollama-prod.service
sudo systemctl start ollama-prod.service
# The stack now survives reboots automatically
GPU Recovery
NVIDIA GPUs occasionally enter a bad state (driver crash, ECC error, stuck process). Monitor and recover:
#!/bin/bash
# gpu-watchdog.sh — runs every 5 minutes via cron
# Check if GPU is responsive
if ! nvidia-smi &>/dev/null; then
echo "[$(date)] GPU not responding. Restarting Ollama..." | tee -a /var/log/ollama/watchdog.log
docker compose -f /opt/ollama-prod/docker-compose.yml restart ollama
sleep 30
if ! nvidia-smi &>/dev/null; then
echo "[$(date)] GPU still unresponsive after restart. Alerting." | tee -a /var/log/ollama/watchdog.log
# Send alert (webhook, email, etc.)
curl -s -X POST "https://hooks.slack.com/your-webhook" \
-H "Content-Type: application/json" \
-d '{"text":"CRITICAL: Ollama production GPU unresponsive after restart. Manual intervention required."}'
fi
fi
# Check GPU temperature
TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits 2>/dev/null)
if [ -n "${TEMP}" ] && [ "${TEMP}" -gt 85 ]; then
echo "[$(date)] GPU temperature ${TEMP}C — throttling likely" | tee -a /var/log/ollama/watchdog.log
fi
# Install as a cron.d entry (files in /etc/cron.d require a user field)
echo "*/5 * * * * root /opt/ollama-prod/scripts/gpu-watchdog.sh" | sudo tee /etc/cron.d/ollama-prod
Load Testing with k6 {#load-testing}
Before going live, verify your setup handles expected load. Ollama queues requests, so you need to know your throughput limits.
// load-test.js — k6 script for Ollama production
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    steady_load: {
      executor: 'constant-vus',
      vus: 10,
      duration: '5m',
    },
    spike_test: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '1m', target: 5 },
        { duration: '2m', target: 20 },
        { duration: '1m', target: 5 },
        { duration: '1m', target: 0 },
      ],
      startTime: '6m',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<120000'], // 95% under 2 minutes
    http_req_failed: ['rate<0.05'],      // Under 5% failure rate
  },
};

export default function () {
  const payload = JSON.stringify({
    model: 'llama3.3:70b-instruct-q4_K_M',
    prompt: 'Explain the concept of dependency injection in 3 sentences.',
    stream: false,
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      // Basic auth: base64 of "username:api_key", passed in via --env
      'Authorization': 'Basic ' + __ENV.API_AUTH,
    },
    timeout: '180s',
  };

  const res = http.post(
    'https://ai.yourcompany.com/api/ollama/api/generate',
    payload,
    params
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response has content': (r) => r.json('response') !== '',
  });

  sleep(1);
}
# Run the load test. The htpasswd file stores only hashes, so use the plaintext
# key you saved when generating it; API_AUTH is base64 of "username:key".
k6 run --env API_AUTH=$(printf '%s' "service-account-1:${API_KEY}" | base64 -w0) load-test.js
Expected Results (Single RTX 4090, 70B Q4)
| Concurrent Users | Avg Response Time | P95 Response Time | Throughput |
|---|---|---|---|
| 1 | 8s | 12s | 7.5 req/min |
| 5 | 35s | 52s | 8.5 req/min |
| 10 | 68s | 95s | 8.8 req/min |
| 20 | 140s | 195s | 8.6 req/min |
Throughput stays roughly flat because a single GPU is the bottleneck: even with OLLAMA_NUM_PARALLEL=4, concurrent requests share the same compute, so more concurrent users means longer wait times, not more capacity. If P95 exceeds your SLA, you need more GPUs or a faster model.
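The shape of that table follows from a back-of-the-envelope queueing model: with one serialized GPU and N requests in flight, a newcomer waits roughly N service times. Assuming a ~7 s service time (backed out of the measured ~8.5 req/min), a sketch:

```python
def predicted_avg_latency(concurrent_users: int, service_time_s: float = 7.0) -> float:
    """Closed-queue estimate for a single serialized server: each of the N
    in-flight requests takes roughly one service time, so a new arrival
    waits for about N of them. The 7 s default is an assumption derived
    from ~8.5 requests/minute of steady-state throughput."""
    return concurrent_users * service_time_s

for users in (1, 5, 10, 20):
    print(f"{users:>2} users -> ~{predicted_avg_latency(users):.0f}s average")
```

The model predicts ~35 s at 5 users, ~70 s at 10, and ~140 s at 20, close to the measured averages above, which is a good sanity check that the system really is behaving like one serialized queue.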
The Production Checklist {#production-checklist}
Verify every item before exposing this to users. No exceptions.
Security
- Ollama port 11434 not accessible from outside Docker network
- TLS certificate valid and auto-renewing
- API keys generated and distributed securely
- Open WebUI signup disabled
- Default admin password changed
- Rate limiting configured and tested
- Firewall rules blocking unnecessary ports
Reliability
- Docker restart: unless-stopped on all services
- Systemd service created for the docker compose stack
- Health checks passing on Ollama and Open WebUI
- GPU watchdog cron installed
- Memory limits set on all containers
- Log rotation configured (Docker and application level)
Monitoring
- Prometheus scraping all targets
- Grafana dashboard showing GPU temp, memory, throughput
- Alert rules configured for GPU temp, OOM, service down
- Alert notifications routing to on-call (Slack, PagerDuty, email)
Data
- Automated backup script installed and tested
- Backup restoration tested (actually restore, do not just assume)
- Model files present and loaded (verify with docker exec ollama-prod ollama list; port 11434 is not published to the host)
- SSL certificates backed up
Documentation
- Runbook: how to restart the stack
- Runbook: how to add/remove users
- Runbook: how to update models
- Runbook: how to restore from backup
- On-call contact list for escalation
For the underlying hardware setup, the homelab AI server build guide covers physical server configuration, and the Ubuntu AI workstation guide handles OS-level optimization.
Conclusion
The gap between "Ollama works on my laptop" and "Ollama runs reliably for my team" is entirely about operations. The inference engine is solid. What you build around it — authentication, encryption, monitoring, backup, restart policies — determines whether your team trusts the system or routes around it.
Every shortcut in this stack has a consequence. Skip TLS and someone sniffs prompts on the network. Skip monitoring and a thermal event kills the GPU at 2 AM with no alert. Skip backups and a disk failure means re-creating every user's conversation history from nothing. Skip rate limiting and one automated script saturates the GPU while everyone else waits.
Do it right once. The stack in this guide takes about 3 hours to deploy from scratch. After that, it runs itself.
Starting from zero? Follow the Ollama + Open WebUI Docker setup first, then layer on the production hardening from this guide.