Monitor Ollama with Prometheus & Grafana: Production Dashboards
Published on April 23, 2026 -- 21 min read
The first time an Ollama server silently throttles to a quarter of its normal speed, you discover that "it works" is not a state you can verify by SSHing in once a day. The model still loads. ollama run still answers. But every user is waiting 40 seconds for a response that used to take 6, and you find out from a Slack message instead of a pager. Monitoring fixes that.
This guide builds a production-grade observability stack for Ollama using only open-source tools — Prometheus, the NVIDIA DCGM exporter, a small log-tail collector, and Grafana. By the end you will have dashboards that show tokens per second, queue depth, GPU temperature, and request error rate, plus alert rules that fire before users notice problems.
Quick Start:
Running docker compose up -d on the manifest in this guide gives you Prometheus on :9090, Grafana on :3000 (admin / admin), the DCGM exporter on :9400, and an Ollama metrics sidecar on :9778. Import dashboard ID 19628 into Grafana for an out-of-the-box GPU dashboard, then layer the custom Ollama panels on top.
Table of Contents
- Why Monitor Ollama at All
- The Four Metrics That Matter
- Architecture Overview
- Step-by-Step Setup
- Building Grafana Dashboards
- Alert Rules That Catch Real Problems
- Comparing Stacks: Prometheus vs Alternatives
- Pitfalls and Anti-Patterns
- Frequently Asked Questions
Why Monitor Ollama at All {#why-monitor}
Three failure modes kill local AI deployments, and none of them is visible from ollama run:
Silent thermal throttling. A consumer 4090 throttles at 83 C and an A6000 at 88 C. Once throttled, throughput drops 40-60 percent and stays there until the card cools. Without GPU temperature alerting, you only notice when a user complains. By then you have lost an hour of capacity.
Memory pressure on long context. A 70B model that fits comfortably at 4K context starts swapping into CPU RAM at 16K, and free -h shows nothing wrong because Linux file cache eats whatever is left. Monitoring KV cache size and host swap is the only way to catch it before latency doubles.
Queue backups during burst traffic. Ollama serves requests sequentially per model by default. A burst of 10 requests means user 10 waits 10 inference cycles. Without queue-depth metrics on the reverse proxy, you cannot tell if you need a second instance or just need to set OLLAMA_NUM_PARALLEL.
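If the queue metric shows a sustained backlog on a single host, the first knob to try is Ollama's own parallelism settings. A minimal sketch, assuming Ollama runs as the standard systemd service and the GPU has VRAM headroom for parallel decoding (the values are illustrative, not a recommendation):

# OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE are standard Ollama environment variables.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_QUEUE=128"
sudo systemctl restart ollama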
The full Ollama production deployment guide covers the rest of the production stack — this article is the monitoring layer that turns it from "running" into "running observably."
The Four Metrics That Matter {#four-metrics}
Hundreds of metrics are available. Four drive 90 percent of operational decisions.
Metric 1: tokens_per_second (custom)
The single most important indicator of model health. A drop of 30 percent or more from baseline means thermal throttling, a model swap, or a context-size change.
# HELP ollama_tokens_per_second Inference throughput per request
# TYPE ollama_tokens_per_second histogram
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="10"} 12
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="20"} 487
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="40"} 519
Metric 2: DCGM_FI_DEV_GPU_TEMP (DCGM)
GPU die temperature in Celsius. Anything sustained above 83 C on consumer cards or 88 C on workstation cards is a problem.
Metric 3: DCGM_FI_DEV_FB_USED (DCGM)
Frame buffer (VRAM) used in MiB. Compare against DCGM_FI_DEV_FB_TOTAL for percent utilisation. Crossing 95 percent means OOM is imminent the next time someone sends a slightly longer prompt.
Metric 4: nginx_http_requests_total and nginx_upstream_response_time_seconds (nginx-exporter)
Request rate, error rate, and response time at the proxy layer. Queue depth is implicit in response time — when median response time triples without inference becoming slower, you have a queue.
Everything else (CPU, RAM, disk I/O, network) is supporting context that only matters when one of those four anomalies fires.
Architecture Overview {#architecture}
                          +-------------------+
                          |      Grafana      |
                          |    (port 3000)    |
                          +---------+---------+
                                    |
                                    v  PromQL
                          +---------+---------+
                          |    Prometheus     |
                          |    (port 9090)    |
                          +---------+---------+
                                    |
      +--------------+--------------+--------------+--------------+
      |              |              |              |              |
      v              v              v              v              v
+-----------+  +-----------+  +-----------+  +-----------+  +-----------+
|   DCGM    |  |   node    |  |   nginx   |  |  Ollama   |  | log-tail  |
| Exporter  |  | exporter  |  | exporter  |  |  sidecar  |  | collector |
|   :9400   |  |   :9100   |  |   :9113   |  |   :9778   |  |   :9779   |
+-----------+  +-----------+  +-----------+  +-----------+  +-----------+
Five exporters cover the full stack. The DCGM exporter handles GPU telemetry. node_exporter handles host CPU, RAM, disk. nginx-exporter handles request rate and response time at the proxy. The Ollama sidecar polls the local Ollama API for loaded models and queue state. The log-tail collector parses /var/log/ollama.log for per-request token rates that the API does not expose.
Step-by-Step Setup {#setup}
Step 1: Project Layout
mkdir -p ollama-monitoring/{prometheus,grafana,grafana/provisioning/dashboards,grafana/provisioning/datasources}
cd ollama-monitoring
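The compose file below mounts ./grafana/provisioning into the Grafana container, so it is worth dropping a datasource file in place now so Grafana is ready on first boot. A minimal sketch (the datasource name "Prometheus" is just a convention here):

cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF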
Step 2: docker-compose.yml
version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.51.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    container_name: dcgm-exporter
    ports:
      - "9400:9400"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    cap_add:
      - SYS_ADMIN
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
    restart: unless-stopped

  ollama-sidecar:
    image: ghcr.io/localaimaster/ollama-prom-sidecar:0.4
    container_name: ollama-sidecar
    ports:
      - "9778:9778"
    environment:
      - OLLAMA_URL=http://host.docker.internal:11434
      - SCRAPE_INTERVAL=10
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
Step 3: Prometheus Scrape Config
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ollama-prod

rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: ollama
    static_configs:
      - targets: ['ollama-sidecar:9778']
  - job_name: nginx
    static_configs:
      - targets: ['host.docker.internal:9113']
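The nginx job above assumes an nginx-prometheus-exporter already listening on the host at :9113, which the compose file does not start. A minimal sketch, assuming your reverse proxy exposes stub_status at /stub_status (adjust the URL to your own nginx config):

# Run the official exporter on the host network so Prometheus can scrape :9113.
docker run -d --name nginx-exporter --network host \
  nginx/nginx-prometheus-exporter:1.1.0 \
  --nginx.scrape-uri=http://127.0.0.1/stub_status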
Step 4: Custom Token Throughput Collector
The Ollama sidecar above scrapes basic state. To capture per-request tokens-per-second from streaming responses, run a small log-tail collector:
#!/usr/bin/env bash
# /usr/local/bin/ollama-token-collector.sh
set -euo pipefail
export LOG=/var/log/ollama.log
export PORT=9779
# Tiny Prometheus text-format collector that parses Ollama "eval rate" lines
exec python3 - <<'PY'
import http.server, os, re, threading, time
from collections import defaultdict, deque

LOG = os.environ.get('LOG', '/var/log/ollama.log')
PORT = int(os.environ.get('PORT', '9779'))

# Expects lines containing: eval rate: <float> tokens/s ... model=<name>
EVAL_RE = re.compile(r'eval rate:\s+([0-9.]+)\s+tokens/s.*model=(\S+)')
samples = defaultdict(lambda: deque(maxlen=512))

def tail():
    # Follow the log like `tail -f`, recording the per-model rate
    with open(LOG) as f:
        f.seek(0, 2)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.2)
                continue
            m = EVAL_RE.search(line)
            if m:
                samples[m.group(2)].append(float(m.group(1)))

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose a rolling average per model in Prometheus text format
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        out = ['# TYPE ollama_tokens_per_second gauge']
        for model, vals in samples.items():
            if vals:
                avg = sum(vals) / len(vals)
                out.append(f'ollama_tokens_per_second{{model="{model}"}} {avg}')
        self.wfile.write(('\n'.join(out) + '\n').encode())

threading.Thread(target=tail, daemon=True).start()
http.server.HTTPServer(('0.0.0.0', PORT), Handler).serve_forever()
PY
Run it as a systemd unit and add - targets: ['host.docker.internal:9779'] to the Prometheus config.
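A minimal unit file for that, assuming the script lives at the path shown above:

sudo tee /etc/systemd/system/ollama-token-collector.service >/dev/null <<'EOF'
[Unit]
Description=Ollama token throughput collector
After=network.target

[Service]
ExecStart=/usr/local/bin/ollama-token-collector.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-token-collector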
Step 5: Boot the Stack
docker compose up -d
docker compose ps
Open http://localhost:9090/targets in a browser. Every target should be UP within 30 seconds. If dcgm-exporter is missing, confirm that docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi works — if it does not, the NVIDIA Container Toolkit is misconfigured.
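Each exporter can also be checked directly, which is a faster way to isolate a failing target than re-reading the Prometheus UI:

# Spot-check each exporter's /metrics endpoint from the host.
curl -s http://localhost:9400/metrics | grep -m1 DCGM_FI_DEV_GPU_TEMP   # DCGM
curl -s http://localhost:9100/metrics | grep -m1 node_cpu_seconds_total # node_exporter
curl -s http://localhost:9778/metrics | head -n 5                       # Ollama sidecar
curl -s http://localhost:9779/metrics | head -n 5                       # log-tail collector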
The official Prometheus configuration documentation at prometheus.io is the canonical reference for everything else you will want to tune in the scrape config.
Building Grafana Dashboards {#dashboards}
Two dashboards cover the typical operator workflow: a one-screen overview and a deep-dive GPU dashboard.
Dashboard 1: Ollama Overview
Add a Prometheus data source pointed at http://prometheus:9090, then create panels with the following queries.
Panel: Tokens/sec by model (timeseries)
avg by (model) (ollama_tokens_per_second)
Panel: GPU memory used percent (gauge)
(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100
Panel: GPU temperature (timeseries with thresholds)
DCGM_FI_DEV_GPU_TEMP
Set the panel thresholds at 75 (yellow) and 83 (red) for consumer cards.
Panel: Request rate (timeseries)
sum(rate(nginx_http_requests_total[1m])) by (status)
Panel: P95 request latency (timeseries)
histogram_quantile(0.95, sum by (le) (rate(nginx_http_request_duration_seconds_bucket[5m])))
Dashboard 2: GPU Deep Dive
The community DCGM dashboard at Grafana ID 19628 ships with 24 GPU panels (utilisation, memory, temperature, power, PCIe, NVLink). Import it directly: Grafana > Dashboards > Import > 19628.
For the comparison context that frames why these specific GPU metrics matter, the best GPUs for AI breakdown is the right reference.
Alert Rules That Catch Real Problems {#alerts}
Save the following as prometheus/alerts.yml:
groups:
  - name: ollama
    interval: 30s
    rules:
      - alert: OllamaGpuTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: GPU {{ $labels.gpu }} hot
          description: "{{ $labels.gpu }} at {{ $value }}C for 5+ minutes — throttling likely."
      - alert: OllamaGpuMemoryNearFull
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: GPU {{ $labels.gpu }} VRAM > 95%
          description: "OOM imminent on the next long prompt."
      - alert: OllamaThroughputDegraded
        expr: |
          (
            avg_over_time(ollama_tokens_per_second[10m])
          /
            avg_over_time(ollama_tokens_per_second[24h] offset 1h)
          ) < 0.5
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: Ollama throughput halved vs 24h baseline
          description: "Likely thermal throttle, contention, or quantization mismatch."
      - alert: OllamaErrorRateHigh
        expr: |
          sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(nginx_http_requests_total[5m]))
          > 0.01
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: Ollama 5xx rate > 1%
          description: "Reverse proxy is returning errors — check upstream Ollama."
      - alert: OllamaQueueBacklog
        expr: avg_over_time(nginx_upstream_active_connections[5m]) > 20
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: Ollama queue > 20 concurrent
          description: "Consider raising OLLAMA_NUM_PARALLEL or scaling out."
Reload Prometheus without restarting:
curl -X POST http://localhost:9090/-/reload
Wire alerts to email, Slack, or PagerDuty through Alertmanager. The official Alertmanager documentation walks through routing, grouping, and inhibition rules in detail.
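As a sketch only, a bare-bones Slack route looks like this. The webhook URL is a placeholder, and you would also add an alertmanager container (prom/alertmanager) to the compose file plus an alerting block to prometheus.yml:

mkdir -p alertmanager
cat > alertmanager/alertmanager.yml <<'EOF'
route:
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME  # placeholder webhook
        channel: '#ollama-alerts'
EOF
# Then point Prometheus at it in prometheus/prometheus.yml:
#   alerting:
#     alertmanagers:
#       - static_configs:
#           - targets: ['alertmanager:9093']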
Comparing Stacks: Prometheus vs Alternatives {#comparison}
| Stack | Setup time | Storage cost | GPU coverage | Best for |
|---|---|---|---|---|
| Prometheus + Grafana + DCGM | 30-60 min | $0 self-hosted | Excellent | Self-hosted, single team |
| Datadog | 10 min | $$$$ usage | Good | SaaS, large teams, no SRE |
| New Relic | 15 min | $$$ usage | Decent | Multi-app, mixed stack |
| OpenTelemetry + Tempo + Loki + Mimir | 2-4 hours | $0 self-hosted | Good | Existing OTel pipeline |
| Lightweight (Netdata) | 5 min | $0 | Basic | Solo dev, single host |
Prometheus + Grafana wins for any team running their own Ollama. It is free, has the best GPU integration via DCGM, and the dashboards portfolio (especially Grafana ID 19628) is excellent.
Pitfalls and Anti-Patterns {#pitfalls}
Pitfall 1: Scraping Too Often
The default scrape_interval of 15s is fine. Setting it to 1s for "real-time" dashboards generates 15x the data, hammers DCGM, and rarely surfaces anything you would have missed. Anything below 5s is wasted work.
Pitfall 2: Alerting on Single Samples
expr: DCGM_FI_DEV_GPU_TEMP > 90 without a for: 5m clause will page you for a transient 91 C spike during model load. Always pair temperature and memory alerts with a multi-minute for clause to filter noise.
Pitfall 3: Tokens/sec Computed from /api/tags
/api/tags lists installed models — it does not expose throughput. Computing tok/s from API polling is impossible because Ollama only emits eval timing inside the response payload. Use the log-tail collector (above) or instrument your client.
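If tailing the log is not an option, the non-streaming API response carries the raw numbers itself: the final JSON from Ollama's /api/generate includes eval_count and eval_duration (nanoseconds), so a client can compute tok/s directly. A quick check with curl and jq (the model name is just an example):

curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Say hi", "stream": false}' |
  jq '{model, tokens_per_second: (.eval_count / (.eval_duration / 1e9))}'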
Pitfall 4: Storing Metrics on the Same NVMe as Models
Prometheus writes ~10 MB/min to its TSDB. Model loading reads tens of GB at a stretch. Sharing the same NVMe causes loading stalls during Prometheus compaction. Put Prometheus storage on a separate disk or a different host entirely.
Pitfall 5: Forgetting the Grafana Backup
Dashboards live in Grafana's database. Without provisioning, they vanish if the volume is reset. Put dashboard JSON under grafana/provisioning/dashboards so the stack is reproducible from source control.
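A minimal provider file that makes Grafana load any dashboard JSON dropped into that directory (the provider name is arbitrary):

cat > grafana/provisioning/dashboards/provider.yml <<'EOF'
apiVersion: 1
providers:
  - name: ollama
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
EOF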
For the broader pattern of locking down a production Ollama deployment after monitoring is in place, the securing Ollama guide walks through TLS, API keys, and network isolation.
Final Notes
Monitoring Ollama is the single highest-leverage thing you can do once a deployment is past the prototype stage. The first time the throughput-degraded alert fires at 11pm because a CPU fan died and the GPU is heat-soaking, you will be glad the stack was already in place. The first time you walk into a sprint review with "we doubled concurrent users without a latency regression" backed by a Grafana screenshot, you will be glad you instrumented from day one.
Spin up the docker-compose stack, point Prometheus at your existing Ollama host, import the GPU dashboard, and add the five alert rules above. That is roughly ninety minutes of work and it covers every operational failure mode we have seen in real Ollama deployments. The only thing left is to forget it exists until something actually breaks — at which point the dashboard will already know.