Monitor Ollama with Prometheus & Grafana: Production Dashboards
Published on April 23, 2026 -- 21 min read
The first time an Ollama server silently throttles to a quarter of its normal speed, you discover that "it works" is not a state you can verify by SSHing in once a day. The model still loads. ollama run still answers. But every user is waiting 40 seconds for a response that used to take 6, and you find out from a Slack message instead of a pager. Monitoring fixes that.
This guide builds a production-grade observability stack for Ollama using only open-source tools — Prometheus, the NVIDIA DCGM exporter, a small log-tail collector, and Grafana. By the end you will have dashboards that show tokens per second, queue depth, GPU temperature, and request error rate, plus alert rules that fire before users notice problems.
Quick Start:
Running docker compose up -d on the manifest in this guide gives you Prometheus on :9090, Grafana on :3000 (admin / admin), the DCGM exporter on :9400, and an Ollama metrics sidecar on :9778. Import dashboard ID 19628 into Grafana for an out-of-the-box GPU dashboard, then layer the custom Ollama panels on top.
Table of Contents
- Why Monitor Ollama at All
- The Four Metrics That Matter
- Architecture Overview
- Step-by-Step Setup
- Building Grafana Dashboards
- Alert Rules That Catch Real Problems
- Comparing Stacks: Prometheus vs Alternatives
- Pitfalls and Anti-Patterns
- Frequently Asked Questions
Why Monitor Ollama at All {#why-monitor}
Three failure modes kill local AI deployments, and none of them is visible from ollama run:
Silent thermal throttling. A consumer 4090 throttles at 83 C and an A6000 at 88 C. Once throttled, throughput drops 40-60 percent and stays there until the card cools. Without GPU temperature alerting, you only notice when a user complains. By then you have lost an hour of capacity.
Memory pressure on long context. A 70B model that fits comfortably at 4K context starts swapping into CPU RAM at 16K, and free -h shows nothing wrong because Linux file cache eats whatever is left. Monitoring KV cache size and host swap is the only way to catch it before latency doubles.
Queue backups during burst traffic. Ollama serves requests sequentially per model by default. A burst of 10 requests means user 10 waits 10 inference cycles. Without queue-depth metrics on the reverse proxy, you cannot tell if you need a second instance or just need to set OLLAMA_NUM_PARALLEL.
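If the queue metric shows a sustained backlog on a single host, the first knob to try is Ollama's own parallelism settings. A minimal sketch, assuming Ollama runs as the standard systemd service and the GPU has VRAM headroom for parallel decoding (the values are illustrative, not a recommendation):

# OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE are standard Ollama environment variables.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_QUEUE=128"
sudo systemctl restart ollama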
The full Ollama production deployment guide covers the rest of the production stack — this article is the monitoring layer that turns it from "running" into "running observably."
The Four Metrics That Matter {#four-metrics}
Hundreds of metrics are available. Four drive 90 percent of operational decisions.
Metric 1: tokens_per_second (custom)
The single most important indicator of model health. A drop of 30 percent or more from baseline means thermal throttling, a model swap, or a context-size change.
# HELP ollama_tokens_per_second Inference throughput per request
# TYPE ollama_tokens_per_second histogram
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="10"} 12
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="20"} 487
ollama_tokens_per_second_bucket{model="llama3.3:70b",le="40"} 519
Metric 2: DCGM_FI_DEV_GPU_TEMP (DCGM)
GPU die temperature in Celsius. Anything sustained above 83 C on consumer cards or 88 C on workstation cards is a problem.
Metric 3: DCGM_FI_DEV_FB_USED (DCGM)
Frame buffer (VRAM) used in MiB. Compare against DCGM_FI_DEV_FB_TOTAL for percent utilisation. Crossing 95 percent means OOM is imminent the next time someone sends a slightly longer prompt.
Metric 4: nginx_http_requests_total and nginx_upstream_response_time_seconds (nginx-exporter)
Request rate, error rate, and response time at the proxy layer. Queue depth is implicit in response time — when median response time triples without inference becoming slower, you have a queue.
Everything else (CPU, RAM, disk I/O, network) is supporting context that only matters when one of those four anomalies fires.
Architecture Overview {#architecture}
                          +-------------------+
                          |      Grafana      |
                          |    (port 3000)    |
                          +---------+---------+
                                    |
                                    v  PromQL
                          +---------+---------+
                          |    Prometheus     |
                          |    (port 9090)    |
                          +---------+---------+
                                    |
      +--------------+--------------+--------------+--------------+
      |              |              |              |              |
      v              v              v              v              v
+-----------+  +-----------+  +-----------+  +-----------+  +-----------+
|   DCGM    |  |   node    |  |   nginx   |  |  Ollama   |  | log-tail  |
| Exporter  |  | exporter  |  | exporter  |  |  sidecar  |  | collector |
|   :9400   |  |   :9100   |  |   :9113   |  |   :9778   |  |   :9779   |
+-----------+  +-----------+  +-----------+  +-----------+  +-----------+
Five exporters cover the full stack. The DCGM exporter handles GPU telemetry. node_exporter handles host CPU, RAM, disk. nginx-exporter handles request rate and response time at the proxy. The Ollama sidecar polls the local Ollama API for loaded models and queue state. The log-tail collector parses /var/log/ollama.log for per-request token rates that the API does not expose.
Step-by-Step Setup {#setup}
Step 1: Project Layout
mkdir -p ollama-monitoring/{prometheus,grafana,grafana/provisioning/dashboards,grafana/provisioning/datasources}
cd ollama-monitoring
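The compose file below mounts ./grafana/provisioning into the Grafana container, so it is worth dropping a datasource file in place now so Grafana is ready on first boot. A minimal sketch (the datasource name "Prometheus" is just a convention here):

cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF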
Step 2: docker-compose.yml
version: '3.9'

services:
  prometheus:
    image: prom/prometheus:v2.51.2
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
    container_name: dcgm-exporter
    ports:
      - "9400:9400"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    cap_add:
      - SYS_ADMIN
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
    restart: unless-stopped

  ollama-sidecar:
    image: ghcr.io/localaimaster/ollama-prom-sidecar:0.4
    container_name: ollama-sidecar
    ports:
      - "9778:9778"
    environment:
      - OLLAMA_URL=http://host.docker.internal:11434
      - SCRAPE_INTERVAL=10
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
Step 3: Prometheus Scrape Config
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: ollama-prod

rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: dcgm
    static_configs:
      - targets: ['dcgm-exporter:9400']
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: ollama
    static_configs:
      - targets: ['ollama-sidecar:9778']
  - job_name: nginx
    static_configs:
      - targets: ['host.docker.internal:9113']
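The nginx job above assumes an nginx-prometheus-exporter already listening on the host at :9113, which the compose file does not start. A minimal sketch, assuming your reverse proxy exposes stub_status at /stub_status (adjust the URL to your own nginx config):

# Run the official exporter on the host network so Prometheus can scrape :9113.
docker run -d --name nginx-exporter --network host \
  nginx/nginx-prometheus-exporter:1.1.0 \
  --nginx.scrape-uri=http://127.0.0.1/stub_status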
Step 4: Custom Token Throughput Collector
The Ollama sidecar above scrapes basic state. To capture per-request tokens-per-second from streaming responses, run a small log-tail collector:
#!/usr/bin/env bash
# /usr/local/bin/ollama-token-collector.sh
set -euo pipefail
export LOG=/var/log/ollama.log
export PORT=9779
# Tiny Prometheus text-format collector that parses Ollama "eval rate" lines
exec python3 - <<'PY'
import http.server, os, re, threading, time
from collections import defaultdict, deque

LOG = os.environ.get('LOG', '/var/log/ollama.log')
PORT = int(os.environ.get('PORT', '9779'))

# Expects lines containing: eval rate: <float> tokens/s ... model=<name>
EVAL_RE = re.compile(r'eval rate:\s+([0-9.]+)\s+tokens/s.*model=(\S+)')
samples = defaultdict(lambda: deque(maxlen=512))

def tail():
    # Follow the log like `tail -f`, recording the per-model rate
    with open(LOG) as f:
        f.seek(0, 2)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.2)
                continue
            m = EVAL_RE.search(line)
            if m:
                samples[m.group(2)].append(float(m.group(1)))

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Expose a rolling average per model in Prometheus text format
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        out = ['# TYPE ollama_tokens_per_second gauge']
        for model, vals in samples.items():
            if vals:
                avg = sum(vals) / len(vals)
                out.append(f'ollama_tokens_per_second{{model="{model}"}} {avg}')
        self.wfile.write(('\n'.join(out) + '\n').encode())

threading.Thread(target=tail, daemon=True).start()
http.server.HTTPServer(('0.0.0.0', PORT), Handler).serve_forever()
PY
Run it as a systemd unit and add - targets: ['host.docker.internal:9779'] to the Prometheus config.
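A minimal unit file for that, assuming the script lives at the path shown above:

sudo tee /etc/systemd/system/ollama-token-collector.service >/dev/null <<'EOF'
[Unit]
Description=Ollama token throughput collector
After=network.target

[Service]
ExecStart=/usr/local/bin/ollama-token-collector.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-token-collector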
Step 5: Boot the Stack
docker compose up -d
docker compose ps
Open http://localhost:9090/targets in a browser. Every target should be UP within 30 seconds. If dcgm-exporter is missing, confirm that docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi works — if it does not, the NVIDIA Container Toolkit is misconfigured.
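Each exporter can also be checked directly, which is a faster way to isolate a failing target than re-reading the Prometheus UI:

# Spot-check each exporter's /metrics endpoint from the host.
curl -s http://localhost:9400/metrics | grep -m1 DCGM_FI_DEV_GPU_TEMP   # DCGM
curl -s http://localhost:9100/metrics | grep -m1 node_cpu_seconds_total # node_exporter
curl -s http://localhost:9778/metrics | head -n 5                       # Ollama sidecar
curl -s http://localhost:9779/metrics | head -n 5                       # log-tail collector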
The official Prometheus configuration documentation at prometheus.io is the canonical reference for everything else you will want to tune in the scrape config.
Building Grafana Dashboards {#dashboards}
Two dashboards cover the typical operator workflow: a one-screen overview and a deep-dive GPU dashboard.
Dashboard 1: Ollama Overview
Add a Prometheus data source pointed at http://prometheus:9090, then create panels with the following queries.
Panel: Tokens/sec by model (timeseries)
avg by (model) (ollama_tokens_per_second)
Panel: GPU memory used percent (gauge)
(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100
Panel: GPU temperature (timeseries with thresholds)
DCGM_FI_DEV_GPU_TEMP
Set the panel thresholds at 75 (yellow) and 83 (red) for consumer cards.
Panel: Request rate (timeseries)
sum(rate(nginx_http_requests_total[1m])) by (status)
Panel: P95 request latency (timeseries)
histogram_quantile(0.95, sum by (le) (rate(nginx_http_request_duration_seconds_bucket[5m])))
Dashboard 2: GPU Deep Dive
The community DCGM dashboard at Grafana ID 19628 ships with 24 GPU panels (utilisation, memory, temperature, power, PCIe, NVLink). Import it directly: Grafana > Dashboards > Import > 19628.
For the comparison context that frames why these specific GPU metrics matter, the best GPUs for AI breakdown is the right reference.
Alert Rules That Catch Real Problems {#alerts}
Save the following as prometheus/alerts.yml:
groups:
  - name: ollama
    interval: 30s
    rules:
      - alert: OllamaGpuTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: GPU {{ $labels.gpu }} hot
          description: "{{ $labels.gpu }} at {{ $value }}C for 5+ minutes — throttling likely."
      - alert: OllamaGpuMemoryNearFull
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: GPU {{ $labels.gpu }} VRAM > 95%
          description: "OOM imminent on the next long prompt."
      - alert: OllamaThroughputDegraded
        expr: |
          (
            avg_over_time(ollama_tokens_per_second[10m])
          /
            avg_over_time(ollama_tokens_per_second[24h] offset 1h)
          ) < 0.5
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: Ollama throughput halved vs 24h baseline
          description: "Likely thermal throttle, contention, or quantization mismatch."
      - alert: OllamaErrorRateHigh
        expr: |
          sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(nginx_http_requests_total[5m]))
          > 0.01
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: Ollama 5xx rate > 1%
          description: "Reverse proxy is returning errors — check upstream Ollama."
      - alert: OllamaQueueBacklog
        expr: avg_over_time(nginx_upstream_active_connections[5m]) > 20
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: Ollama queue > 20 concurrent
          description: "Consider raising OLLAMA_NUM_PARALLEL or scaling out."
Reload Prometheus without restarting:
curl -X POST http://localhost:9090/-/reload
Wire alerts to email, Slack, or PagerDuty through Alertmanager. The official Alertmanager documentation walks through routing, grouping, and inhibition rules in detail.
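As a sketch only, a bare-bones Slack route looks like this. The webhook URL is a placeholder, and you would also add an alertmanager container (prom/alertmanager) to the compose file plus an alerting block to prometheus.yml:

mkdir -p alertmanager
cat > alertmanager/alertmanager.yml <<'EOF'
route:
  receiver: slack
receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME  # placeholder webhook
        channel: '#ollama-alerts'
EOF
# Then point Prometheus at it in prometheus/prometheus.yml:
#   alerting:
#     alertmanagers:
#       - static_configs:
#           - targets: ['alertmanager:9093']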
Comparing Stacks: Prometheus vs Alternatives {#comparison}
| Stack | Setup time | Storage cost | GPU coverage | Best for |
|---|---|---|---|---|
| Prometheus + Grafana + DCGM | 30-60 min | $0 self-hosted | Excellent | Self-hosted, single team |
| Datadog | 10 min | $$$$ usage | Good | SaaS, large teams, no SRE |
| New Relic | 15 min | $$$ usage | Decent | Multi-app, mixed stack |
| OpenTelemetry + Tempo + Loki + Mimir | 2-4 hours | $0 self-hosted | Good | Existing OTel pipeline |
| Lightweight (Netdata) | 5 min | $0 | Basic | Solo dev, single host |
Prometheus + Grafana wins for any team running their own Ollama. It is free, has the best GPU integration via DCGM, and the dashboards portfolio (especially Grafana ID 19628) is excellent.
Pitfalls and Anti-Patterns {#pitfalls}
Pitfall 1: Scraping Too Often
The default scrape_interval of 15s is fine. Setting it to 1s for "real-time" dashboards generates 15x the data, hammers DCGM, and rarely surfaces anything you would have missed. Anything below 5s is wasted work.
Pitfall 2: Alerting on Single Samples
expr: DCGM_FI_DEV_GPU_TEMP > 90 without a for: 5m clause will page you for a transient 91 C spike during model load. Always pair temperature and memory alerts with a multi-minute for clause to filter noise.
Pitfall 3: Tokens/sec Computed from /api/tags
/api/tags lists installed models — it does not expose throughput. Computing tok/s from API polling is impossible because Ollama only emits eval timing inside the response payload. Use the log-tail collector (above) or instrument your client.
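If tailing the log is not an option, the non-streaming API response carries the raw numbers itself: the final JSON from Ollama's /api/generate includes eval_count and eval_duration (nanoseconds), so a client can compute tok/s directly. A quick check with curl and jq (the model name is just an example):

curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.3:70b", "prompt": "Say hi", "stream": false}' |
  jq '{model, tokens_per_second: (.eval_count / (.eval_duration / 1e9))}'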
Pitfall 4: Storing Metrics on the Same NVMe as Models
Prometheus writes ~10 MB/min to its TSDB. Model loading reads tens of GB at a stretch. Sharing the same NVMe causes loading stalls during Prometheus compaction. Put Prometheus storage on a separate disk or a different host entirely.
Pitfall 5: Forgetting the Grafana Backup
Dashboards live in Grafana's database. Without provisioning, they vanish if the volume is reset. Put dashboard JSON under grafana/provisioning/dashboards so the stack is reproducible from source control.
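A minimal provider file that makes Grafana load any dashboard JSON dropped into that directory (the provider name is arbitrary):

cat > grafana/provisioning/dashboards/provider.yml <<'EOF'
apiVersion: 1
providers:
  - name: ollama
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
EOF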
For the broader pattern of locking down a production Ollama deployment after monitoring is in place, the securing Ollama guide walks through TLS, API keys, and network isolation.
Final Notes
Monitoring Ollama is the single highest-leverage thing you can do once a deployment is past the prototype stage. The first time the throughput-degraded alert fires at 11pm because a CPU fan died and the GPU is heat-soaking, you will be glad the stack was already in place. The first time you walk into a sprint review with "we doubled concurrent users without a latency regression" backed by a Grafana screenshot, you will be glad you instrumented from day one.
Spin up the docker-compose stack, point Prometheus at your existing Ollama host, import the GPU dashboard, and add the five alert rules above. That is roughly ninety minutes of work and it covers every operational failure mode we have seen in real Ollama deployments. The only thing left is to forget it exists until something actually breaks — at which point the dashboard will already know.