Production Deployment

Monitor Ollama with Prometheus & Grafana: Production Dashboards

April 23, 2026
21 min read
LocalAimaster Research Team


The first time an Ollama server silently throttles to a quarter of its normal speed, you discover that "it works" is not a state you can verify by SSHing in once a day. The model still loads. ollama run still answers. But every user is waiting 40 seconds for a response that used to take 6, and you find out from a Slack message instead of a pager. Monitoring fixes that.

This guide builds a production-grade observability stack for Ollama using only open-source tools — Prometheus, the NVIDIA DCGM exporter, a small log-tail collector, and Grafana. By the end you will have dashboards that show tokens per second, queue depth, GPU temperature, and request error rate, plus alert rules that fire before users notice problems.

Quick Start: docker compose up -d on the manifest in this guide will give you Prometheus on :9090, Grafana on :3000 (admin / admin), the DCGM exporter on :9400, and an Ollama metrics sidecar on :9778. Import dashboard JSON 19628 into Grafana for an out-of-the-box GPU dashboard, then layer the custom Ollama panels on top.


Table of Contents

  1. Why Monitor Ollama at All
  2. The Four Metrics That Matter
  3. Architecture Overview
  4. Step-by-Step Setup
  5. Building Grafana Dashboards
  6. Alert Rules That Catch Real Problems
  7. Comparing Stacks: Prometheus vs Alternatives
  8. Pitfalls and Anti-Patterns
  9. Frequently Asked Questions

Why Monitor Ollama at All {#why-monitor}

Three failure modes kill local AI deployments, and none of them is visible from ollama run:

Silent thermal throttling. A consumer 4090 throttles at 83 C and an A6000 at 88 C. Once throttled, throughput drops 40-60 percent and stays there until the card cools. Without GPU temperature alerting, you only notice when a user complains. By then you have lost an hour of capacity.

Memory pressure on long context. A 70B model that fits comfortably at 4K context starts swapping into CPU RAM at 16K, and free -h shows nothing wrong because Linux file cache eats whatever is left. Monitoring KV cache size and host swap is the only way to catch it before latency doubles.

Queue backups during burst traffic. Ollama serves requests sequentially per model by default. A burst of 10 requests means user 10 waits 10 inference cycles. Without queue-depth metrics on the reverse proxy, you cannot tell if you need a second instance or just need to set OLLAMA_NUM_PARALLEL.

The full Ollama production deployment guide covers the rest of the production stack — this article is the monitoring layer that turns it from "running" into "running observably."


The Four Metrics That Matter {#four-metrics}

Hundreds of metrics are available. Four drive 90 percent of operational decisions.

Metric 1: tokens_per_second (custom)

The single most important indicator of model health. A drop of 30 percent or more from baseline means thermal throttling, a model swap, or a context size change.

# HELP ollama_tokens_per_second Inference throughput per request
# TYPE ollama_tokens_per_second histogram
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="10"} 12
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="20"} 487
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="40"} 519
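The sidecar publishes this as a histogram, but if your client calls Ollama's HTTP API directly you can derive the same number yourself: /api/generate responses include eval_count (generated tokens) and eval_duration (nanoseconds). A minimal sketch:

```python
def tokens_per_second(response):
    """Derive throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is
    the generation time in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9

# Example: 512 tokens generated in 16 seconds of eval time
print(tokens_per_second({"eval_count": 512, "eval_duration": 16_000_000_000}))  # -> 32.0
```

Recording this per request in your client gives you the same signal even before the log-tail collector below is in place.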

Metric 2: DCGM_FI_DEV_GPU_TEMP (DCGM)

GPU die temperature in Celsius. Anything sustained above 83 C on consumer cards or 88 C on workstation cards is a problem.

Metric 3: DCGM_FI_DEV_FB_USED (DCGM)

Frame buffer (VRAM) used in MiB. Compare against DCGM_FI_DEV_FB_TOTAL for percent utilisation. Crossing 95 percent means OOM is imminent the next time someone sends a slightly longer prompt.
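To check this ratio ad hoc without opening Grafana, you can run the same PromQL through Prometheus's instant-query HTTP API. A stdlib-only sketch; PROM_URL is an assumption for the compose setup in this guide:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: Prometheus from the compose file below
QUERY = "(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100"

def vram_percent(payload):
    """Map GPU label -> VRAM-used percent from an instant-query response."""
    return {r["metric"].get("gpu", "0"): float(r["value"][1])
            for r in payload["data"]["result"]}

def fetch_vram_percent():
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return vram_percent(json.load(resp))

# fetch_vram_percent() returns e.g. {"0": 61.2} when Prometheus is reachable
```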

Metric 4: nginx_http_requests_total and nginx_upstream_response_time_seconds (nginx-exporter)

Request rate, error rate, and response time at the proxy layer. Queue depth is implicit in response time — when median response time triples without inference becoming slower, you have a queue.

Everything else (CPU, RAM, disk I/O, network) is supporting context that only matters when one of those four anomalies fires.


Architecture Overview {#architecture}

                +-------------------+
                |     Grafana       |
                |   (port 3000)     |
                +--------+----------+
                         |
                         v PromQL
                +--------+----------+
                |    Prometheus     |
                |   (port 9090)     |
                +--------+----------+
                         |
            +------------+-----------+----------------+----------------+
            |            |           |                |                |
            v            v           v                v                v
   +--------+---+  +-----+------+  +-+---------+  +---+----------+ +--+--------+
   | DCGM       |  | node       |  | nginx     |  | Ollama       | | log-tail  |
   | Exporter   |  | exporter   |  | exporter  |  | sidecar      | | collector |
   | :9400      |  | :9100      |  | :9113     |  | :9778        | | :9779     |
   +------------+  +------------+  +-----------+  +--------------+ +-----------+

Five exporters cover the full stack. The DCGM exporter handles GPU telemetry. node_exporter handles host CPU, RAM, disk. nginx-exporter handles request rate and response time at the proxy. The Ollama sidecar polls the local Ollama API for loaded models and queue state. The log-tail collector parses /var/log/ollama.log for per-request token rates that the API does not expose.
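The sidecar image used in this guide is specific to it. If you would rather roll your own, a minimal version polls Ollama's /api/ps endpoint (which lists loaded models and their VRAM footprint) and renders Prometheus text format. A sketch; the metric name ollama_model_vram_bytes is chosen here for illustration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama port

def loaded_models_metrics(payload):
    """Render Prometheus text format from an Ollama /api/ps payload."""
    lines = ["# TYPE ollama_model_vram_bytes gauge"]  # metric name chosen for illustration
    for m in payload.get("models", []):
        lines.append('ollama_model_vram_bytes{model="%s"} %d'
                     % (m["name"], m.get("size_vram", 0)))
    return "\n".join(lines) + "\n"

def scrape():
    with urllib.request.urlopen(OLLAMA_URL + "/api/ps", timeout=5) as resp:
        return loaded_models_metrics(json.load(resp))
```

Serve the output of scrape() over HTTP (as the log-tail collector below does) and Prometheus can ingest it like any other exporter.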


Step-by-Step Setup {#setup}

Step 1: Project Layout

mkdir -p ollama-monitoring/{prometheus,grafana,grafana/provisioning/dashboards,grafana/provisioning/datasources}
cd ollama-monitoring

Step 2: docker-compose.yml

version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.51.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    container_name: dcgm-exporter
    ports:
      - "9400:9400"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    cap_add:
      - SYS_ADMIN
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
    restart: unless-stopped

  ollama-sidecar:
    image: ghcr.io/localaimaster/ollama-prom-sidecar:0.4
    container_name: ollama-sidecar
    ports:
      - "9778:9778"
    environment:
      - OLLAMA_URL=http://host.docker.internal:11434
      - SCRAPE_INTERVAL=10
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

Step 3: Prometheus Scrape Config

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ollama-prod

rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']

  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: ollama
    static_configs:
      - targets: ['ollama-sidecar:9778']

  - job_name: nginx
    static_configs:
      - targets: ['host.docker.internal:9113']

Step 4: Custom Token Throughput Collector

The Ollama sidecar above scrapes basic state. To capture per-request tokens-per-second from streaming responses, run a small log-tail collector:

#!/usr/bin/env bash
# /usr/local/bin/ollama-token-collector.sh
set -euo pipefail

# Log path and listen port, exported so the embedded Python can read them
export OLLAMA_LOG=/var/log/ollama.log
export COLLECTOR_PORT=9779

# Tiny Prometheus text-format collector that parses Ollama "eval rate" lines
exec python3 - <<'PY'
import http.server, os, re, threading, time
from collections import defaultdict, deque

LOG = os.environ.get('OLLAMA_LOG', '/var/log/ollama.log')
PORT = int(os.environ.get('COLLECTOR_PORT', '9779'))

EVAL_RE = re.compile(r'eval rate:\s+([0-9.]+)\s+tokens/s.*model=(\S+)')
samples = defaultdict(lambda: deque(maxlen=512))

def tail():
    with open(LOG) as f:
        f.seek(0, 2)  # start at end of file; only new lines matter
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.2); continue
            m = EVAL_RE.search(line)
            if m:
                samples[m.group(2)].append(float(m.group(1)))

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain'); self.end_headers()
        out = ['# TYPE ollama_tokens_per_second gauge']
        for model, vals in samples.items():
            if vals:
                avg = sum(vals)/len(vals)
                out.append(f'ollama_tokens_per_second{{model="{model}"}} {avg}')
        self.wfile.write(('\n'.join(out) + '\n').encode())

threading.Thread(target=tail, daemon=True).start()
http.server.HTTPServer(('0.0.0.0', PORT), Handler).serve_forever()
PY

Run it as a systemd unit and add - targets: ['host.docker.internal:9779'] to the Prometheus config.
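Before adding the target, smoke-test the endpoint by hand. A small sketch that parses the exposition format the collector emits (it answers on any path, so /metrics works):

```python
import urllib.request

def parse_metrics(text):
    """Parse 'name{labels} value' lines of Prometheus exposition format."""
    out = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            out[name] = float(value)
    return out

# With the collector running:
#   raw = urllib.request.urlopen("http://localhost:9779/metrics").read().decode()
#   print(parse_metrics(raw))
```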

Step 5: Boot the Stack

docker compose up -d
docker compose ps

Open http://localhost:9090/targets in a browser. Every target should be UP within 30 seconds. If dcgm-exporter is missing, confirm that docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi works — if it does not, the NVIDIA Container Toolkit is misconfigured.
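The same check can be scripted against Prometheus's /api/v1/targets endpoint, which is handy in CI or a post-deploy hook. A sketch:

```python
import json
import urllib.request

def down_targets(payload):
    """Return job names of scrape targets whose health is not 'up'."""
    return [t["labels"]["job"]
            for t in payload["data"]["activeTargets"]
            if t["health"] != "up"]

# with urllib.request.urlopen("http://localhost:9090/api/v1/targets") as r:
#     bad = down_targets(json.load(r))
#     assert not bad, f"targets down: {bad}"
```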

The official Prometheus configuration documentation at prometheus.io is the canonical reference for everything else you will want to tune in the scrape config.


Building Grafana Dashboards {#dashboards}

Two dashboards cover the typical operator workflow: a one-screen overview and a deep-dive GPU dashboard.

Dashboard 1: Ollama Overview

Add a Prometheus data source pointed at http://prometheus:9090 then create panels with the following queries.

Panel: Tokens/sec by model (timeseries)

avg by (model) (ollama_tokens_per_second)

Panel: GPU memory used percent (gauge)

(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100

Panel: GPU temperature (timeseries with thresholds)

DCGM_FI_DEV_GPU_TEMP

Set the panel thresholds at 75 (yellow) and 83 (red) for consumer cards.

Panel: Request rate (timeseries)

sum(rate(nginx_http_requests_total[1m])) by (status)

Panel: P95 request latency (timeseries)

histogram_quantile(0.95, sum by (le) (rate(nginx_http_request_duration_seconds_bucket[5m])))
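histogram_quantile does not read an exact percentile; it linearly interpolates inside the bucket where the target rank falls. A rough Python equivalent for cumulative (le, count) buckets makes the mechanics concrete — a simplified sketch, not Prometheus's exact algorithm:

```python
def histogram_quantile(q, buckets):
    """Approximate PromQL histogram_quantile for cumulative (le, count) buckets.

    buckets must be sorted by le; counts are cumulative, Prometheus-style.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            # linear interpolation inside the bucket that crosses the rank
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 60% of requests under 0.5s, 90% under 1s, all under 4s
print(histogram_quantile(0.95, [(0.5, 60.0), (1.0, 90.0), (4.0, 100.0)]))  # -> 2.5
```

This is also why wide buckets make P95 jumpy: everything above the second-to-last le is an interpolated guess.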

Dashboard 2: GPU Deep Dive

The community DCGM dashboard at Grafana ID 19628 ships with 24 GPU panels (utilisation, memory, temperature, power, PCIe, NVLink). Import it directly: Grafana > Dashboards > Import > 19628.

For the comparison context that frames why these specific GPU metrics matter, the best GPUs for AI breakdown is the right reference.


Alert Rules That Catch Real Problems {#alerts}

Save the following as prometheus/alerts.yml:

groups:
  - name: ollama
    interval: 30s
    rules:
      - alert: OllamaGpuTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: GPU {{ $labels.gpu }} hot
          description: "{{ $labels.gpu }} at {{ $value }}C for 5+ minutes — throttling likely."

      - alert: OllamaGpuMemoryNearFull
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: GPU {{ $labels.gpu }} VRAM > 95%
          description: "OOM imminent on the next long prompt."

      - alert: OllamaThroughputDegraded
        expr: |
          (
            avg_over_time(ollama_tokens_per_second[10m])
            /
            avg_over_time(ollama_tokens_per_second[24h] offset 1h)
          ) < 0.5
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: Ollama throughput halved vs 24h baseline
          description: "Likely thermal throttle, contention, or quantization mismatch."

      - alert: OllamaErrorRateHigh
        expr: |
          sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
          / sum(rate(nginx_http_requests_total[5m]))
          > 0.01
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: Ollama 5xx rate > 1%
          description: "Reverse proxy is returning errors — check upstream Ollama."

      - alert: OllamaQueueBacklog
        expr: avg_over_time(nginx_upstream_active_connections[5m]) > 20
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: Ollama queue > 20 concurrent
          description: "Consider raising OLLAMA_NUM_PARALLEL or scaling out."

Reload Prometheus without restarting:

curl -X POST http://localhost:9090/-/reload

Wire alerts to email, Slack, or PagerDuty through Alertmanager. The official Alertmanager documentation walks through routing, grouping, and inhibition rules in detail.
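Once routing is configured, it is worth sending a synthetic alert through Alertmanager's v2 API to confirm notifications actually arrive end to end. A sketch; the alertname and port here are assumptions:

```python
import json
import urllib.request

ALERTMANAGER = "http://localhost:9093"  # assumption: default Alertmanager port

def build_test_alert():
    """JSON body for a synthetic alert, in the shape the v2 API expects."""
    return json.dumps([{
        "labels": {"alertname": "MonitoringPipelineTest", "severity": "warning"},
        "annotations": {"summary": "Synthetic alert - safe to ignore"},
    }]).encode()

def fire_test_alert():
    req = urllib.request.Request(
        ALERTMANAGER + "/api/v2/alerts",
        data=build_test_alert(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status  # 200 means Alertmanager accepted the alert
```

If the Slack or PagerDuty message for MonitoringPipelineTest never shows up, the problem is routing configuration, not your alert rules.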


Comparing Stacks: Prometheus vs Alternatives {#comparison}

| Stack | Setup time | Storage cost | GPU coverage | Best for |
|---|---|---|---|---|
| Prometheus + Grafana + DCGM | 30-60 min | $0 self-hosted | Excellent | Self-hosted, single team |
| Datadog | 10 min | $$$$ usage | Good | SaaS, large teams, no SRE |
| New Relic | 15 min | $$$ usage | Decent | Multi-app, mixed stack |
| OpenTelemetry + Tempo + Loki + Mimir | 2-4 hours | $0 self-hosted | Good | Existing OTel pipeline |
| Lightweight (Netdata) | 5 min | $0 | Basic | Solo dev, single host |

Prometheus + Grafana wins for any team running their own Ollama. It is free, has the best GPU integration via DCGM, and the dashboards portfolio (especially Grafana ID 19628) is excellent.


Pitfalls and Anti-Patterns {#pitfalls}

Pitfall 1: Scraping Too Often

Default scrape_interval of 15s is fine. Setting it to 1s for "real-time" dashboards generates 15x the data, hammers DCGM, and rarely surfaces anything you would have missed. Anything below 5s is wasted work.

Pitfall 2: Alerting on Single Samples

expr: DCGM_FI_DEV_GPU_TEMP > 90 without a for: 5m clause will page you for a transient 91 C spike during model load. Always pair temperature and memory alerts with a multi-minute for clause to filter noise.

Pitfall 3: Tokens/sec Computed from /api/tags

/api/tags lists installed models — it does not expose throughput. Computing tok/s from API polling is impossible because Ollama only emits eval timing inside the response payload. Use the log-tail collector (above) or instrument your client.

Pitfall 4: Storing Metrics on the Same NVMe as Models

Prometheus writes ~10 MB/min to its TSDB. Model loading reads tens of GB at a stretch. Sharing the same NVMe causes loading stalls during Prometheus compaction. Put Prometheus storage on a separate disk or a different host entirely.

Pitfall 5: Forgetting the Grafana Backup

Dashboards live in Grafana's database. Without provisioning, they vanish if the volume is reset. Put dashboard JSON under grafana/provisioning/dashboards so the stack is reproducible from source control.

For the broader pattern of locking down a production Ollama deployment after monitoring is in place, the securing Ollama guide walks through TLS, API keys, and network isolation.


Final Notes

Monitoring Ollama is the single highest-leverage thing you can do once a deployment is past the prototype stage. The first time the throughput-degraded alert fires at 11pm because a CPU fan died and the GPU is heat-soaking, you will be glad the stack was already in place. The first time you walk into a sprint review with "we doubled concurrent users without a latency regression" backed by a Grafana screenshot, you will be glad you instrumented from day one.

Spin up the docker-compose stack, point Prometheus at your existing Ollama host, import the GPU dashboard, and add the five alert rules above. That is roughly ninety minutes of work and it covers every operational failure mode we have seen in real Ollama deployments. The only thing left is to forget it exists until something actually breaks — at which point the dashboard will already know.

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor

Was this helpful?

SRE-Grade Local AI, Weekly

Get one practical Ollama operations guide per week — monitoring, scaling, security, cost control. No fluff.

Build Real AI on Your Machine

RAG, agents, NLP, vision, and MLOps - chapters across 10 courses that take you from reading about AI to building AI.

Related Guides

Continue your local AI journey with these comprehensive guides

Continue Learning

📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Go from reading about AI to building with AI

10 structured courses. Hands-on projects. Runs on your machine. Start free.

Free Tools & Calculators