Ollama in Production: Docker, SSL, Auth & Monitoring
Published on April 11, 2026 -- 20 min read
Running ollama run llama3.3 on your laptop is trivial. Running Ollama for a team of 20 people who depend on it daily is a different problem entirely. The model inference part works fine. Everything else — authentication, TLS, monitoring, restart policies, resource limits, log management — is undocumented, scattered across GitHub issues, or left as an exercise for the reader.
This is the guide that fills that gap. A production Ollama deployment that handles real users, survives reboots, alerts you before things break, and does not expose your GPU to the entire internet.
The Production Architecture {#production-architecture}
Internet / Internal Network
|
v
+------------+-------------+
| Nginx Reverse Proxy |
| TLS termination |
| API key validation |
| Rate limiting |
+------------+-------------+
|
+---------------+---------------+
| |
v v
+-------------+----------+ +-----------------+--------+
| Open WebUI | | Direct API |
| (User auth, RBAC, | | (Authenticated clients, |
| chat interface) | | scripts, integrations) |
+-------------+----------+ +-----------------+--------+
| |
+---------------+---------------+
|
v
+------------+-------------+
| Ollama Server |
| (localhost:11434 only) |
| GPU inference |
+------------+-------------+
|
v
+------------+-------------+
| Prometheus + Grafana |
| Metrics, dashboards, |
| alerting |
+--------------------------+
Every component runs in Docker. Everything restarts automatically. Nothing is exposed without authentication.
Docker Compose: The Full Stack {#docker-compose}
This is the complete docker-compose.yml for production. Not a toy example — this includes health checks, resource limits, proper networking, and GPU allocation.
# docker-compose.yml — Ollama Production Stack
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-prod
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_KEEP_ALIVE=10m
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 48G
    healthcheck:
      # The ollama image does not bundle curl; probe with the ollama CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - ai_internal
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui-prod
    ports:
      - "127.0.0.1:3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - DEFAULT_USER_ROLE=user
      - WEBUI_SECRET_KEY_FILE=/run/secrets/webui_secret
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      ollama:
        condition: service_healthy
    secrets:
      - webui_secret
    networks:
      - ai_internal
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  nginx:
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
      - ./nginx/ssl:/etc/nginx/ssl:ro
      - ./nginx/auth:/etc/nginx/auth:ro
    depends_on:
      - open-webui
      - ollama
    networks:
      - ai_internal
      - ai_external
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "127.0.0.1:9090:9090"
    # Overriding command replaces the image defaults, so restate them, then
    # enable the admin API that the backup script's snapshot call requires
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.enable-admin-api
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    networks:
      - ai_internal
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "127.0.0.1:3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD_FILE=/run/secrets/grafana_password
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
    depends_on:
      - prometheus
    secrets:
      - grafana_password
    networks:
      - ai_internal
    restart: unless-stopped

  nvidia-exporter:
    image: utkuozdemir/nvidia_gpu_exporter:latest
    container_name: nvidia-exporter
    ports:
      - "127.0.0.1:9835:9835"
    volumes:
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro
      - /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - ai_internal
    restart: unless-stopped

volumes:
  ollama_data:
  webui_data:
  prometheus_data:
  grafana_data:

networks:
  ai_internal:
    internal: true
  ai_external:

secrets:
  webui_secret:
    file: ./secrets/webui_secret.txt
  grafana_password:
    file: ./secrets/grafana_password.txt
Key production decisions in this configuration:
- The ai_internal network is marked internal: true — containers on this network cannot reach the internet, so Ollama, Prometheus, and Grafana are completely isolated. (Pull your models before enabling isolation; ollama pull needs outbound access too.)
- Only Nginx is reachable from other machines. Every other published port (Open WebUI, Prometheus, Grafana) binds to 127.0.0.1, and Ollama publishes nothing to the host at all.
- Health checks on Ollama and Open WebUI ensure Docker restarts unhealthy containers automatically.
- Memory limit of 48GB prevents Ollama from consuming all system RAM if a large context request arrives.
- Log rotation via Docker's json-file driver with 50MB max size prevents disk exhaustion.
For the foundational Docker setup, the Ollama + Open WebUI Docker guide covers installation and first run in detail.
Nginx Reverse Proxy with SSL {#nginx-ssl}
Certificate Setup with Certbot
# Install certbot
sudo apt install certbot -y
# Generate certificate (run before starting the Docker stack)
sudo certbot certonly --standalone -d ai.yourcompany.com
# Copy certs to the project directory
mkdir -p nginx/ssl
sudo cp /etc/letsencrypt/live/ai.yourcompany.com/fullchain.pem nginx/ssl/
sudo cp /etc/letsencrypt/live/ai.yourcompany.com/privkey.pem nginx/ssl/
sudo chmod 644 nginx/ssl/fullchain.pem
sudo chmod 600 nginx/ssl/privkey.pem   # the private key must not be world-readable
Nginx Configuration
# nginx/conf.d/ollama-prod.conf

# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=ui_limit:10m rate=60r/m;

# Upstream definitions
upstream open_webui {
    server open-webui:8080;
}

upstream ollama_api {
    server ollama:11434;
}

# HTTPS server
server {
    listen 443 ssl;
    http2 on;
    server_name ai.yourcompany.com;

    ssl_certificate /etc/nginx/ssl/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers on;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options nosniff always;
    add_header X-Frame-Options DENY always;

    # Web UI (Open WebUI)
    location / {
        limit_req zone=ui_limit burst=20 nodelay;
        proxy_pass http://open_webui;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # WebSocket support for streaming
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;
    }

    # Direct API access (requires API key)
    location /api/ollama/ {
        limit_req zone=api_limit burst=10 nodelay;

        # API key validation: clients send the key as HTTP Basic auth, checked
        # against an htpasswd-style file. auth_basic returns 401 on its own
        # when credentials are missing or wrong, so no extra plumbing is needed.
        auth_basic "API Access";
        auth_basic_user_file /etc/nginx/auth/api_keys;

        # Strip the /api/ollama/ prefix and forward to Ollama
        rewrite ^/api/ollama/(.*) /$1 break;
        proxy_pass http://ollama_api;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}

# HTTP redirect
server {
    listen 80;
    server_name ai.yourcompany.com;
    return 301 https://$host$request_uri;
}
Generate API Keys
# Create API key file
mkdir -p nginx/auth
# Generate API keys for programmatic access
# Format: username:api_key (using htpasswd for basic auth)
sudo apt install apache2-utils -y
# Generate a key, show it once, and store only the hash; the plaintext key
# cannot be recovered from the htpasswd file later, so save it now
API_KEY="sk-$(openssl rand -hex 32)"
echo "service-account-1: ${API_KEY}"
htpasswd -cb nginx/auth/api_keys service-account-1 "${API_KEY}"

API_KEY="sk-$(openssl rand -hex 32)"
echo "service-account-2: ${API_KEY}"
htpasswd -b nginx/auth/api_keys service-account-2 "${API_KEY}"
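Clients authenticate against that htpasswd file with standard HTTP Basic auth, so the Authorization header is just base64 of username:key. A quick Python sketch of what any client library does under the hood (the credentials here are made up):

```python
import base64

def basic_auth_header(username: str, api_key: str) -> str:
    """Build the Authorization header value that Nginx's auth_basic checks."""
    token = base64.b64encode(f"{username}:{api_key}".encode()).decode()
    return f"Basic {token}"

# Hypothetical credentials, for illustration only
print(basic_auth_header("service-account-1", "sk-0123abcd"))
```

In practice curl does this for you with -u service-account-1:$API_KEY.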
Rate Limiting Per User
The Nginx config above implements two rate limit zones:
- UI users: 60 requests/minute with burst of 20 (generous for browser interaction)
- API clients: 30 requests/minute with burst of 10 (prevents automated abuse)
For a team of 20 users, these limits prevent any single user from monopolizing the GPU while still allowing normal usage patterns. Adjust based on your actual capacity.
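For intuition about how these zones behave, nginx's limit_req is a leaky bucket: each request adds one to an "excess" counter that drains at the configured rate, and anything that would push excess past burst gets a 503. A rough Python approximation (not nginx's exact millisecond-scaled arithmetic):

```python
def simulate_limit_req(timestamps, rate_per_min, burst):
    """Approximate nginx limit_req with nodelay: excess drains at the
    configured rate, each allowed request adds 1, and a request that
    would push excess past burst is rejected with 503."""
    rate = rate_per_min / 60.0          # drain rate in requests per second
    excess, last, verdicts = 0.0, None, []
    for t in timestamps:
        if last is None:
            new_excess = 0.0            # first request starts a fresh bucket
        else:
            new_excess = max(excess - (t - last) * rate, 0.0) + 1.0
        if new_excess > burst:
            verdicts.append("503")      # rejected; bucket state unchanged
        else:
            excess, last = new_excess, t
            verdicts.append("ok")
    return verdicts

# A burst of 15 requests, 50 ms apart, against the API zone (30 r/m, burst=10)
verdicts = simulate_limit_req([i * 0.05 for i in range(15)], 30, 10)
print(verdicts.count("ok"), "allowed,", verdicts.count("503"), "rejected")
```

In this approximation 11 of the 15 get through (one plus the burst allowance, since the bucket barely drains in under a second) and the rest are rejected.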
Monitoring: Prometheus + Grafana {#monitoring}
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["nvidia-exporter:9835"]

  # Note: Ollama does not expose a native Prometheus /metrics endpoint at the
  # time of writing; this job assumes a metrics exporter or proxy in front of
  # it. Drop it (and the ollama_* alert rules) if you do not run one.
  - job_name: "ollama"
    metrics_path: /metrics
    static_configs:
      - targets: ["ollama:11434"]

  # Requires the nginx-prometheus-exporter sidecar (not in the compose file above)
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx:9113"]

  # Requires node_exporter on the host; on Linux, map host.docker.internal
  # with extra_hosts: "host.docker.internal:host-gateway"
  - job_name: "node"
    static_configs:
      - targets: ["host.docker.internal:9100"]
Alert Rules
# prometheus/alerts.yml
groups:
  - name: ollama_production
    rules:
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature_celsius > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 80C for 5 minutes"
          description: "GPU {{ $labels.gpu }} is at {{ $value }}C. Check cooling."

      - alert: GPUTemperatureCritical
        expr: nvidia_gpu_temperature_celsius > 88
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature critical — thermal throttling imminent"

      - alert: GPUMemoryExhausted
        expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory above 95% — OOM risk"

      - alert: OllamaDown
        expr: up{job="ollama"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ollama is not responding"

      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile response time exceeds 60 seconds"
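The HighRequestLatency rule leans on histogram_quantile, which is worth understanding: it finds the cumulative bucket the target rank falls into and interpolates linearly inside it. A Python sketch with made-up latency buckets:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    the way Prometheus's histogram_quantile() does: locate the bucket the
    rank lands in, then interpolate linearly within it."""
    total = buckets[-1][1]              # the +Inf bucket counts everything
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                return prev_bound       # fall back to the last finite bound
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count

# Illustrative cumulative buckets: 90 of 100 requests finished within 60 s
buckets = [(10, 40), (30, 70), (60, 90), (120, 98), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 97.5: above 60, so the alert fires
```

The estimate is only as good as the bucket boundaries, which is why coarse default buckets can make a P95 alert look noisier than it is.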
Grafana Dashboard Panels
Create a dashboard at grafana/dashboards/ollama-production.json with these panels:
| Panel | Metric | Purpose |
|---|---|---|
| GPU Temperature | nvidia_gpu_temperature_celsius | Catch cooling failures before throttling |
| GPU Memory Used | nvidia_gpu_memory_used_bytes | Track memory pressure across sessions |
| GPU Utilization | nvidia_gpu_utilization_percent | Is the GPU actually working or idle? |
| Tokens/Second | ollama_tokens_generated_total rate | Performance trending over time |
| Active Requests | ollama_active_requests | Current load on the system |
| Request Queue | ollama_queue_depth | How many requests are waiting |
| P95 Latency | histogram_quantile(0.95, ollama_request_duration) | User experience metric |
| System RAM | node_memory_MemAvailable_bytes | Catch memory leaks early |
| Disk Usage | node_filesystem_avail_bytes | Model storage and logs |
The GPU temperature panel is the single most important metric. A failing fan or clogged heatsink will throttle your GPU from 35 tok/s to 8 tok/s before it shuts down entirely. Catch it at 80C, not at 90C.
Structured Logging {#logging}
Default Ollama logging is minimal. Augment it with structured JSON logs for production debugging:
Docker Log Aggregation
# View real-time logs across all services
docker compose logs -f --tail=100
# Filter to just Ollama
docker compose logs -f ollama
# Export logs for analysis
docker compose logs --no-color ollama > /var/log/ollama/ollama-$(date +%Y%m%d).log
Log Rotation
Docker's built-in log rotation (configured in the compose file above) handles container logs. For additional application logs:
# /etc/logrotate.d/ollama-prod
/var/log/ollama/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0640 root adm
dateext
}
ELK Stack Integration (Optional)
For teams that need searchable, centralized logs:
# Add to docker-compose.yml
filebeat:
image: docker.elastic.co/beats/filebeat:8.12.0
container_name: filebeat
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
networks:
- ai_internal
restart: unless-stopped
Backup Strategy {#backup-strategy}
What to Back Up
| Data | Location | Size | Frequency | Method |
|---|---|---|---|---|
| Docker Compose + configs | ./ | <1MB | Every change (Git) | Git repository |
| Open WebUI database | webui_data volume | 10-500MB | Daily | Volume snapshot |
| User uploads | webui_data volume | Variable | Daily | rsync to NAS |
| Model files | ollama_data volume | 40-200GB | Weekly | rsync to NAS |
| Prometheus metrics | prometheus_data volume | 1-10GB | Weekly | Volume snapshot |
| SSL certificates | ./nginx/ssl/ | <10KB | On renewal | Git (encrypted) |
Automated Backup Script
#!/bin/bash
# backup-ollama-prod.sh — runs nightly via cron
BACKUP_DIR="/backup/ollama-prod"
DATE=$(date +%Y%m%d)
NAS_PATH="/mnt/nas/backups/ollama"
mkdir -p "${BACKUP_DIR}/${DATE}"
cd /opt/ollama-prod   # docker compose commands need the project directory
# 1. Stop Open WebUI briefly for consistent DB backup
docker compose stop open-webui
docker run --rm -v ollama-prod_webui_data:/data -v "${BACKUP_DIR}/${DATE}:/backup" \
  alpine tar czf /backup/webui-data.tar.gz /data
docker compose start open-webui
# 2. Sync model files (Ollama stays running — models are read-only during inference)
rsync -az --progress /var/lib/docker/volumes/ollama-prod_ollama_data/_data/ \
"${NAS_PATH}/models/"
# 3. Prometheus snapshot (requires Prometheus started with --web.enable-admin-api)
curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot | \
  jq -r '.data.name' | \
  xargs -I {} cp -r /var/lib/docker/volumes/ollama-prod_prometheus_data/_data/snapshots/{} \
    "${BACKUP_DIR}/${DATE}/"
# 4. Retention: keep 30 days of local backups (-mindepth 1 protects the root dir)
find "${BACKUP_DIR}" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
echo "[$(date)] Backup complete: ${BACKUP_DIR}/${DATE}"
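The checklist below insists on testing restores, and that is automatable too. This sketch builds a throwaway archive, restores it to a scratch directory, and checks the files came back; in practice you would point it at the real webui-data.tar.gz (the file names here are hypothetical):

```python
import os
import tarfile
import tempfile
from pathlib import Path

def verify_backup(archive: str, expected: set) -> bool:
    """Restore an archive into a scratch directory and confirm the expected
    files reappear; a backup only counts once a restore has succeeded."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            # Trusted archive here; prefer extractall(..., filter="data")
            # on Python 3.12+ to block path traversal
            tar.extractall(scratch)
        restored = {p.name for p in Path(scratch).rglob("*") if p.is_file()}
        return expected <= restored

# Demonstrate against a freshly built throwaway archive
with tempfile.TemporaryDirectory() as src:
    Path(src, "webui.db").write_text("placeholder payload")
    archive = os.path.join(src, "webui-data.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(Path(src, "webui.db"), arcname="webui.db")
    print(verify_backup(archive, {"webui.db"}))  # True
```

Wire a check like this into the nightly cron job and a silently corrupt backup becomes an alert instead of a surprise during an outage.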
Auto-Restart and Failure Recovery {#auto-restart}
Docker Restart Policies
The restart: unless-stopped policy in the compose file handles most failures. But Docker itself can fail.
Systemd Service for Docker Compose
# /etc/systemd/system/ollama-prod.service
[Unit]
Description=Ollama Production Stack
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/ollama-prod
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down
ExecReload=/usr/bin/docker compose restart
TimeoutStartSec=300
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable ollama-prod.service
sudo systemctl start ollama-prod.service
# The stack now survives reboots automatically
GPU Recovery
NVIDIA GPUs occasionally enter a bad state (driver crash, ECC error, stuck process). Monitor and recover:
#!/bin/bash
# gpu-watchdog.sh — runs every 5 minutes via cron
# Check if GPU is responsive
if ! nvidia-smi &>/dev/null; then
echo "[$(date)] GPU not responding. Restarting Ollama..." | tee -a /var/log/ollama/watchdog.log
docker compose -f /opt/ollama-prod/docker-compose.yml restart ollama
sleep 30
if ! nvidia-smi &>/dev/null; then
echo "[$(date)] GPU still unresponsive after restart. Alerting." | tee -a /var/log/ollama/watchdog.log
# Send alert (webhook, email, etc.)
curl -s -X POST "https://hooks.slack.com/your-webhook" \
-H "Content-Type: application/json" \
-d '{"text":"CRITICAL: Ollama production GPU unresponsive after restart. Manual intervention required."}'
fi
fi
# Check GPU temperature
TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits 2>/dev/null)
if [ -n "${TEMP}" ] && [ "${TEMP}" -gt 85 ]; then
echo "[$(date)] GPU temperature ${TEMP}C — throttling likely" | tee -a /var/log/ollama/watchdog.log
fi
# Install as a cron.d entry (files in /etc/cron.d require a user field)
echo "*/5 * * * * root /opt/ollama-prod/scripts/gpu-watchdog.sh" | sudo tee /etc/cron.d/ollama-prod
Load Testing with k6 {#load-testing}
Before going live, verify your setup handles expected load. Ollama queues requests, so you need to know your throughput limits.
// load-test.js — k6 script for Ollama production
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    steady_load: {
      executor: 'constant-vus',
      vus: 10,
      duration: '5m',
    },
    spike_test: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '1m', target: 5 },
        { duration: '2m', target: 20 },
        { duration: '1m', target: 5 },
        { duration: '1m', target: 0 },
      ],
      startTime: '6m',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<120000'], // 95% under 2 minutes
    http_req_failed: ['rate<0.05'],      // Under 5% failure rate
  },
};

export default function () {
  const payload = JSON.stringify({
    model: 'llama3.3:70b-instruct-q4_K_M',
    prompt: 'Explain the concept of dependency injection in 3 sentences.',
    stream: false,
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      // Basic auth: base64 of "username:api_key", passed in via --env
      'Authorization': 'Basic ' + __ENV.API_AUTH,
    },
    timeout: '180s',
  };

  const res = http.post(
    'https://ai.yourcompany.com/api/ollama/api/generate',
    payload,
    params
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'response has content': (r) => r.json('response') !== '',
  });

  sleep(1);
}
# Run the load test. The htpasswd file stores only hashes, so use the plaintext
# key you saved when generating it; API_AUTH is base64 of "username:key".
k6 run --env API_AUTH=$(printf '%s' "service-account-1:${API_KEY}" | base64 -w0) load-test.js
Expected Results (Single RTX 4090, 70B Q4)
| Concurrent Users | Avg Response Time | P95 Response Time | Throughput |
|---|---|---|---|
| 1 | 8s | 12s | 7.5 req/min |
| 5 | 35s | 52s | 8.5 req/min |
| 10 | 68s | 95s | 8.8 req/min |
| 20 | 140s | 195s | 8.6 req/min |
Throughput stays roughly flat because a single GPU is the bottleneck: even with OLLAMA_NUM_PARALLEL=4, concurrent requests share the same compute, so more concurrent users means longer wait times, not more capacity. If P95 exceeds your SLA, you need more GPUs or a faster model.
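The shape of that table follows from a back-of-the-envelope queueing model: with one serialized GPU and N requests in flight, a newcomer waits roughly N service times. Assuming a ~7 s service time (backed out of the measured ~8.5 req/min), a sketch:

```python
def predicted_avg_latency(concurrent_users: int, service_time_s: float = 7.0) -> float:
    """Closed-queue estimate for a single serialized server: each of the N
    in-flight requests takes roughly one service time, so a new arrival
    waits for about N of them. The 7 s default is an assumption derived
    from ~8.5 requests/minute of steady-state throughput."""
    return concurrent_users * service_time_s

for users in (1, 5, 10, 20):
    print(f"{users:>2} users -> ~{predicted_avg_latency(users):.0f}s average")
```

The model predicts ~35 s at 5 users, ~70 s at 10, and ~140 s at 20, close to the measured averages above, which is a good sanity check that the system really is behaving like one serialized queue.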
The Production Checklist {#production-checklist}
Verify every item before exposing this to users. No exceptions.
Security
- Ollama port 11434 not accessible from outside Docker network
- TLS certificate valid and auto-renewing
- API keys generated and distributed securely
- Open WebUI signup disabled
- Default admin password changed
- Rate limiting configured and tested
- Firewall rules blocking unnecessary ports
Reliability
- Docker restart: unless-stopped on all services
- Systemd service created for the docker compose stack
- Health checks passing on Ollama and Open WebUI
- GPU watchdog cron installed
- Memory limits set on all containers
- Log rotation configured (Docker and application level)
Monitoring
- Prometheus scraping all targets
- Grafana dashboard showing GPU temp, memory, throughput
- Alert rules configured for GPU temp, OOM, service down
- Alert notifications routing to on-call (Slack, PagerDuty, email)
Data
- Automated backup script installed and tested
- Backup restoration tested (actually restore, do not just assume)
- Model files present and loaded (verify with docker exec ollama-prod ollama list; port 11434 is not published to the host)
- SSL certificates backed up
Documentation
- Runbook: how to restart the stack
- Runbook: how to add/remove users
- Runbook: how to update models
- Runbook: how to restore from backup
- On-call contact list for escalation
For the underlying hardware setup, the homelab AI server build guide covers physical server configuration, and the Ubuntu AI workstation guide handles OS-level optimization.
Conclusion
The gap between "Ollama works on my laptop" and "Ollama runs reliably for my team" is entirely about operations. The inference engine is solid. What you build around it — authentication, encryption, monitoring, backup, restart policies — determines whether your team trusts the system or routes around it.
Every shortcut in this stack has a consequence. Skip TLS and someone sniffs prompts on the network. Skip monitoring and a thermal event kills the GPU at 2 AM with no alert. Skip backups and a disk failure means re-creating every user's conversation history from nothing. Skip rate limiting and one automated script saturates the GPU while everyone else waits.
Do it right once. The stack in this guide takes about 3 hours to deploy from scratch. After that, it runs itself.
Starting from zero? Follow the Ollama + Open WebUI Docker setup first, then layer on the production hardening from this guide.