Ollama on Kubernetes: Production Team Deployment Guide (2026)
Published April 23, 2026 • 22 min read
Most Ollama tutorials stop at docker run. That works for one developer on one laptop. The minute a second engineer asks for the same endpoint, or you need to survive a node reboot without a 40 GB model re-download, you need Kubernetes. This guide walks the whole path: a working manifest set, a Helm chart you can fork, GPU scheduling that does not silently fall back to CPU, and the operational decisions I learned the hard way running Ollama on a 4-node K3s cluster for a 22-person team.
I am writing this from a setup that has been serving an internal coding assistant for 11 months. Not a homelab demo. The numbers and YAML below are pulled from the live cluster.
Quick Start: Ollama on Kubernetes in 8 Minutes
If you already have a cluster with the NVIDIA device plugin installed, paste this and you have a working Ollama endpoint:
kubectl create namespace ollama
kubectl apply -n ollama -f https://raw.githubusercontent.com/otwld/ollama-helm/main/examples/quickstart.yaml
kubectl wait --for=condition=ready pod -l app=ollama -n ollama --timeout=300s
kubectl exec -n ollama ollama-0 -- ollama pull llama3.1:8b
kubectl port-forward -n ollama svc/ollama 11434:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"hello"}'
Three minutes for the StatefulSet to come up, four minutes for the model pull on a 200 Mbps connection. After that, the model is cached on the PVC and pod restarts take 12 seconds.
That gets you running. The rest of this article is what you do after the demo, when you actually have to run it.
Table of Contents
- Why Kubernetes for Ollama
- Cluster Prerequisites
- GPU Setup with NVIDIA Device Plugin
- The StatefulSet Manifest Explained Line by Line
- Service, Ingress, and TLS
- Helm Chart Walkthrough
- Autoscaling with KEDA
- Multi-Model Routing
- Monitoring and Logs
- Security Hardening
- Real Cluster Benchmarks
- Common Pitfalls
- FAQs
Why Kubernetes for Ollama {#why-k8s}
If you are alone, you do not need this article. brew install ollama is fine. The case for Kubernetes shows up when any of these become true:
- More than one engineer hits the same endpoint and you stop wanting to debug "is your IP allowed."
- You need the model server to survive node reboots, OS updates, and OOM kills without manual recovery.
- You have GPU nodes mixed with CPU nodes and want the scheduler to put inference where the silicon is.
- You need to run multiple model variants (a coding model and a chat model) without one starving the other.
- You want metrics, logs, and access control without bolting them on with shell scripts.
Compared to running Ollama bare on a server, Kubernetes gives you self-healing, declarative state, native ingress, and a real story for upgrades. Compared to managed inference services like Bedrock or Together, you keep weights on disk you control and pay zero per-token egress.
The cost: a learning curve, a control plane to maintain, and a real network policy story. For a team of five or more, it is worth it. We migrated from a single systemctl start ollama server to K3s in a weekend after the third "who restarted it?" Slack message.
For deeper architectural context, our Ollama production deployment guide covers the single-node Docker Compose path, and load balancing Ollama with Nginx is the right next read once you have multiple replicas.
Cluster Prerequisites {#prerequisites}
You need a cluster that meets these baseline conditions:
| Component | Minimum | Recommended | Why |
|---|---|---|---|
| Kubernetes version | 1.27 | 1.30+ | Sidecar containers, native sidecar lifecycle |
| Node OS | Ubuntu 22.04 / Debian 12 | Ubuntu 24.04 LTS | NVIDIA driver 535+ packages |
| GPU driver | 535.x | 550.x | Required for CUDA 12.4 used by Ollama 0.3+ |
| Container runtime | containerd 1.7 | containerd 2.0 | NVIDIA container toolkit support |
| Storage | Any CSI with RWO | local-path or Ceph RBD on SSD | Model cache I/O |
| Networking | CNI with NetworkPolicy | Cilium 1.15+ | API isolation |
I run K3s on Ubuntu 24.04 with the NVIDIA GPU Operator and Cilium. Total install time on a fresh node is under 15 minutes. For managed clusters, EKS, GKE, and AKS all support GPU node pools — just pick one with H100, A100, A10, L4, or RTX 6000 Ada nodes depending on budget.
The official Kubernetes documentation on managing devices covers the device plugin model in depth if you want the upstream reference.
Verifying GPU visibility
Before deploying anything, confirm the cluster sees GPUs:
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Expected: "1" or higher per GPU node, null for CPU nodes
If you see null everywhere, the device plugin is not running. Install it:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
Wait 30 seconds, re-run the allocatable check. If it still shows null, you have a driver or toolkit issue — nvidia-smi on the node should work before you debug Kubernetes.
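On the node itself, this is the usual sequence, assuming containerd at its default config path (nvidia-ctk ships with the NVIDIA container toolkit):

nvidia-smi                                   # driver loaded, GPU visible?
grep -i nvidia /etc/containerd/config.toml   # NVIDIA runtime handler registered?
# If the handler is missing, let the toolkit write the config, then restart containerd:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd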
GPU Setup with NVIDIA Device Plugin {#gpu-setup}
The device plugin advertises GPUs as a Kubernetes resource. There are two paths:
Option A: Standalone device plugin (simpler, works for single-node and homelab clusters):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          name: nvidia-device-plugin-ctr
          securityContext:
            privileged: true
Option B: NVIDIA GPU Operator (managed driver, toolkit, and DCGM metrics — recommended for production):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
The operator runs DCGM exporter on every GPU node, which we will scrape with Prometheus later for VRAM and utilization metrics. On a fresh cluster this saves about three hours of work versus wiring it all by hand.
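Once the operator settles, two quick checks confirm both metrics and scheduling are wired up (the app label here is the one the DCGM exporter ships with, and the same one the Prometheus scrape config later in this guide keys on):

kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'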
Tainting GPU nodes
Stop CPU workloads from landing on expensive GPU hardware:
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node-1 hardware=gpu accelerator=rtx-4090
Then add a matching toleration in the Ollama pod spec. CPU pods skip the node by default; only pods that explicitly tolerate the taint can land there.
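To prove the taint, label, and device plugin all line up before deploying Ollama, a throwaway pod like this should land on the GPU node and print the nvidia-smi table (the CUDA image tag is an assumption; any CUDA base image works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: default
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu       # matches the taint set above
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    hardware: gpu               # matches the label set above
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"

Apply it, wait for Completed, and kubectl logs gpu-smoke-test should show the nvidia-smi table. If the pod sits Pending, kubectl describe it and look for taint or resource mismatches.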
The StatefulSet Manifest Explained Line by Line {#statefulset}
Here is the manifest that has been running our cluster for 11 months. Every flag is there for a reason — I will annotate the load-bearing ones below.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: ollama
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        accelerator: rtx-4090
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          ports:
            - containerPort: 11434
              name: http
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0:11434"
            - name: OLLAMA_KEEP_ALIVE
              value: "24h"
            - name: OLLAMA_NUM_PARALLEL
              value: "4"
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "2"
            - name: OLLAMA_FLASH_ATTENTION
              value: "1"
            - name: OLLAMA_KV_CACHE_TYPE
              value: "q8_0"
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
          startupProbe:
            httpGet:
              path: /
              port: 11434
            failureThreshold: 30
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path-ssd
        resources:
          requests:
            storage: 200Gi
What matters here
- OLLAMA_KEEP_ALIVE: 24h — by default Ollama unloads models after 5 minutes of inactivity. On Kubernetes, that means a 30-second cold reload on every Slack-bot-after-lunch query. Pin it to 24 hours. VRAM cost is a non-issue when the model is the only resident thing.
- OLLAMA_FLASH_ATTENTION: 1 — 15-25% throughput gain on Ampere and Ada GPUs. No reason to leave it off.
- OLLAMA_KV_CACHE_TYPE: q8_0 — quantizes the KV cache to 8-bit. Cuts VRAM usage by ~40% with negligible quality loss for most use cases. Use q4_0 if you are tight on VRAM, f16 if you cannot tolerate any quality drop.
- OLLAMA_NUM_PARALLEL: 4 — number of concurrent requests per model. Higher = more throughput, more VRAM. 4 is a good default for an 8B model on a 24 GB GPU.
- Startup probe with 30 failures × 10 seconds = a 5-minute window — accommodates first-time model pulls. The liveness probe fires only after startup succeeds.
- volumeClaimTemplates with 200 Gi — each replica gets its own PVC. Models pull once per replica, cached forever after.
Apply this and you have a real deployment. The next sections add networking, security, and observability.
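Apply and verify (the filename is whatever you saved the manifest as; ollama-statefulset.yaml here is an assumption):

kubectl apply -f ollama-statefulset.yaml
kubectl -n ollama rollout status statefulset/ollama
# Confirm the tuning env vars landed and see what is resident on the GPU:
kubectl -n ollama exec ollama-0 -- env | grep '^OLLAMA_'
kubectl -n ollama exec ollama-0 -- ollama ps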
Service, Ingress, and TLS {#networking}
Headless Service for direct pod access
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  clusterIP: None
  selector:
    app: ollama
  ports:
    - port: 11434
      name: http
Headless because StatefulSet pods get DNS names like ollama-0.ollama.ollama.svc.cluster.local, useful for sticky sessions if you implement them later.
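For example, to pin a debug request to replica 0 from another pod inside the cluster:

curl http://ollama-0.ollama.ollama.svc.cluster.local:11434/api/tags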
ClusterIP for load-balanced access
apiVersion: v1
kind: Service
metadata:
  name: ollama-lb
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
Internal apps hit ollama-lb.ollama.svc.cluster.local:11434 and Kubernetes round-robins.
Ingress with TLS and API key auth (Nginx)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      if ($http_authorization !~ "^Bearer (sk-team-key-1|sk-team-key-2)$") {
        return 401;
      }
spec:
  ingressClassName: nginx
  tls:
    - hosts: [ollama.internal.example.com]
      secretName: ollama-tls
  rules:
    - host: ollama.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama-lb
                port:
                  number: 11434
The proxy-read-timeout of 600 seconds is critical. Long generations on a 70B model can take 4-5 minutes, and the default 60-second timeout will kill the request mid-token.
The configuration-snippet is a quick API key gate. For production, replace it with an OAuth2-proxy or Pomerium ext-auth filter — never inline secrets like that for real keys.
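With nginx ingress, that swap is two annotations pointing at an oauth2-proxy Service. A minimal sketch; the service name, namespace, and sign-in hostname below are assumptions, not something this cluster ships by default:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.auth.svc.cluster.local:4180/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.internal.example.com/oauth2/start?rd=$escaped_request_uri"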
Helm Chart Walkthrough {#helm}
If you do not want to maintain raw manifests, the community otwld/ollama-helm chart is solid. Install it like this:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama \
--namespace ollama \
--create-namespace \
--set replicaCount=2 \
--set ollama.gpu.enabled=true \
--set ollama.gpu.type=nvidia \
--set ollama.gpu.number=1 \
--set ollama.models.pull[0]=llama3.1:8b \
--set ollama.models.pull[1]=qwen2.5-coder:7b \
--set persistentVolume.enabled=true \
--set persistentVolume.size=200Gi \
--set ingress.enabled=true \
--set ingress.className=nginx \
--set ingress.hosts[0].host=ollama.internal.example.com
The chart handles probes, volume claim templates, the Service, and ingress. You can fork it and add your custom configuration-snippet for auth.
For a values.yaml-based setup:
replicaCount: 2
ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    pull:
      - llama3.1:8b
      - qwen2.5-coder:7b
      - nomic-embed-text:latest
persistentVolume:
  enabled: true
  size: 200Gi
  storageClass: local-path-ssd
resources:
  limits:
    cpu: 8
    memory: 32Gi
  requests:
    cpu: 2
    memory: 8Gi
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: "24h"
  - name: OLLAMA_FLASH_ATTENTION
    value: "1"
  - name: OLLAMA_NUM_PARALLEL
    value: "4"
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: ollama.internal.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts: [ollama.internal.example.com]
      secretName: ollama-tls
helm install ollama ollama-helm/ollama -f values.yaml -n ollama --create-namespace

Done.
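Later config changes go through the same values file; helm get values confirms what the cluster is actually running:

helm upgrade ollama ollama-helm/ollama -f values.yaml -n ollama
helm get values ollama -n ollama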
Autoscaling with KEDA {#autoscaling}
CPU-based HPA is useless here — Ollama is GPU-bound and CPU stays at 5-10% even during heavy generation. Use KEDA with a Prometheus scaler:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
  namespace: ollama
spec:
  scaleTargetRef:
    name: ollama
    kind: StatefulSet
  minReplicaCount: 1
  maxReplicaCount: 6
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: ollama_requests_in_flight
        threshold: "8"
        query: |
          sum(rate(nginx_ingress_controller_requests{ingress="ollama"}[1m]))
Scales up when sustained RPS exceeds 8 per replica. The 5-minute cooldown prevents flapping when traffic is bursty. minReplicaCount=1 keeps a warm pod with the model resident so cold starts are rare.
For predictable workloads (8am-6pm office hours), use a CronHPA instead. We pre-scale to 4 replicas at 8:55am and back to 1 at 6:30pm. Costs us nothing extra, latency stays under 200ms TTFB during peak.
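Kubernetes has no built-in CronHPA, so the simplest version is a CronJob that runs kubectl scale. A sketch, assuming a ServiceAccount named ollama-scaler bound to a Role that allows patching statefulsets/scale (RBAC not shown):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ollama-scale-up
  namespace: ollama
spec:
  schedule: "55 8 * * 1-5"        # 8:55am, weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ollama-scaler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.30
              command: ["kubectl", "-n", "ollama", "scale", "statefulset/ollama", "--replicas=4"]

A mirror CronJob on "30 18 * * 1-5" scales back down to 1.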
Multi-Model Routing {#multi-model}
When the team needs both a chat model and a coding model, do not stuff them into one StatefulSet — VRAM thrashing destroys throughput. Run two StatefulSets, route by Host header:
- host: chat.ollama.internal.example.com
  http:
    paths:
      - path: /
        pathType: Prefix
        backend: { service: { name: ollama-chat-lb, port: { number: 11434 } } }
- host: code.ollama.internal.example.com
  http:
    paths:
      - path: /
        pathType: Prefix
        backend: { service: { name: ollama-code-lb, port: { number: 11434 } } }
Or by path prefix if you prefer one host:
- path: /v1/chat
  pathType: Prefix
  backend: { service: { name: ollama-chat-lb, port: { number: 11434 } } }
- path: /v1/code
  pathType: Prefix
  backend: { service: { name: ollama-code-lb, port: { number: 11434 } } }
Each model gets its own VRAM, its own scaling envelope, and its own SLO. We run llama3.1:8b for chat at 4 replicas and qwen2.5-coder:7b at 2 replicas — load patterns are completely different and decoupling them was the single biggest stability win.
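Clients then pick the model by hostname, no request-body inspection needed (the Bearer key matches the ingress gate from earlier):

curl https://chat.ollama.internal.example.com/api/generate \
  -H "Authorization: Bearer sk-team-key-1" \
  -d '{"model":"llama3.1:8b","prompt":"hello"}'

curl https://code.ollama.internal.example.com/api/generate \
  -H "Authorization: Bearer sk-team-key-1" \
  -d '{"model":"qwen2.5-coder:7b","prompt":"write a unit test"}'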
Monitoring and Logs {#monitoring}
Prometheus scrape config
Ollama itself does not export Prometheus metrics yet, so we scrape the NVIDIA DCGM exporter (installed by the GPU Operator) for GPU metrics, and Nginx ingress for request metrics:
- job_name: 'dcgm'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: nvidia-dcgm-exporter
      action: keep
Useful queries for a Grafana dashboard:
| Metric | Query | Alert threshold |
|---|---|---|
| GPU utilization | avg(DCGM_FI_DEV_GPU_UTIL{pod=~"ollama.*"}) | >95% for 10 min |
| VRAM used % | DCGM_FI_DEV_FB_USED{pod=~"ollama.*"} / (DCGM_FI_DEV_FB_USED{pod=~"ollama.*"} + DCGM_FI_DEV_FB_FREE{pod=~"ollama.*"}) * 100 | >90% |
| RPS | sum(rate(nginx_ingress_controller_requests{ingress="ollama"}[5m])) | — |
| p95 latency | histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket{ingress="ollama"}[5m])) | >5s |
| Pod restarts | increase(kube_pod_container_status_restarts_total{namespace="ollama"}[1h]) | >0 |
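If you run the Prometheus Operator, the first row of that table translates into a PrometheusRule like this (the monitoring namespace and severity label are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts
  namespace: monitoring
spec:
  groups:
    - name: ollama
      rules:
        - alert: OllamaGPUSaturated
          expr: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"ollama.*"}) > 95
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Ollama GPU utilization above 95% for 10 minutes - scale up or shed load"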
Centralized logs with Loki
helm install loki grafana/loki-stack -n monitoring \
--set promtail.enabled=true \
--set loki.persistence.enabled=true \
--set loki.persistence.size=50Gi
Promtail picks up Ollama's stdout automatically. Useful queries:
{namespace="ollama"} |= "error"
{namespace="ollama"} |~ "out of memory|CUDA error"
{namespace="ollama"} | json | duration > 5
For deeper Prometheus + Grafana setup specific to local AI, the Ollama production deployment guide has the dashboard JSON and alert rules.
Security Hardening {#security}
Five layers, in order of importance:
1. NetworkPolicy — block everything except the namespaces that should reach Ollama:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-from-apps
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: apps
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 11434
          protocol: TCP
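A quick way to prove the policy bites, assuming your CNI enforces NetworkPolicy (the Cilium install from the prerequisites does) and your namespaces carry the labels the selectors above expect:

# From a namespace the policy does not allow: expect a timeout.
kubectl run netpol-test -n default --rm -it --restart=Never \
  --image=curlimages/curl -- curl -m 3 http://ollama-lb.ollama.svc.cluster.local:11434/api/tags
# From the allowed namespace (labeled name=apps): expect the model list.
kubectl run netpol-test -n apps --rm -it --restart=Never \
  --image=curlimages/curl -- curl -m 3 http://ollama-lb.ollama.svc.cluster.local:11434/api/tags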
2. ServiceAccount with no token mount — Ollama does not call the K8s API, so do not give it credentials:
spec:
  template:
    spec:
      automountServiceAccountToken: false
3. Pod Security Standards — enforce baseline on the namespace and warn on restricted:
kubectl label namespace ollama \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted
4. ReadOnlyRootFilesystem — Ollama only writes to /root/.ollama (the PVC), so the rest of the filesystem can be read-only:
securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}
5. Secret-backed API keys instead of inline:
apiVersion: v1
kind: Secret
metadata:
  name: ollama-api-keys
  namespace: ollama
type: Opaque
stringData:
  keys.conf: |
    sk-team-eng-2026-xxx
    sk-team-product-2026-yyy
Mount it into the ingress controller with an OAuth2-proxy ext-auth filter. Never bake keys into ingress annotations in source control.
Real Cluster Benchmarks {#benchmarks}
Numbers from our 4-node K3s cluster, measured April 2026:
Hardware: 3× nodes with RTX 4090 24 GB, 1× node with 2× A5000 24 GB. 64 GB RAM each, NVMe SSDs.
Workload: 22 engineers, 6,400 requests/day, 70/30 chat to code.
| Metric | Value |
|---|---|
| llama3.1:8b TTFB | 180 ms p50, 420 ms p95 |
| llama3.1:8b throughput | 92 tok/s per replica |
| qwen2.5-coder:7b TTFB | 220 ms p50, 510 ms p95 |
| qwen2.5-coder:7b throughput | 88 tok/s per replica |
| Pod cold start (warm PVC) | 12 s |
| Pod cold start (fresh PVC) | 78 s |
| Cluster GPU utilization avg | 34% |
| Cluster GPU utilization peak | 91% |
| Pod restarts per month | 3 (all OOM-related, fixed by raising memory limit) |
| Power draw (4 nodes) | 1.4 kW avg, 2.8 kW peak |
The 3 OOM kills happened during a model swap when both 8B and 70B were briefly resident. Setting OLLAMA_MAX_LOADED_MODELS=2 and adding a memory request of 16 Gi prevented recurrence.
Common Pitfalls {#pitfalls}
1. Using a Deployment instead of a StatefulSet. Pods reschedule, PVCs detach, models re-download. Use StatefulSet.
2. Forgetting OLLAMA_HOST=0.0.0.0:11434. Ollama binds to localhost by default, which means the pod accepts connections only from itself. Probe fails, Service does not route, you spend an hour staring at a green pod that is unreachable.
3. Not tainting GPU nodes. General workloads schedule onto your $1,800 GPU and starve Ollama of CPU/RAM. Taint and tolerate.
4. ReadWriteMany on the model PVC. Ollama writes its blob index, two pods fighting over it corrupts the cache. Stick to ReadWriteOnce per replica.
5. Default ingress timeout. 60 seconds is fine for chat, fails for 70B generations. Set proxy-read-timeout to 600+.
6. HPA on CPU. Doesn't scale because GPU is the bottleneck. Use KEDA on RPS or queue depth.
7. No keep-alive. Models unload after 5 minutes, every burst pays the reload cost. Set OLLAMA_KEEP_ALIVE=24h.
8. Pulling models inside the manifest with initContainer. Works, but blocks pod readiness for minutes on every restart. Pre-pull via Job once (see the sketch after this list), then let StatefulSet pods come up against the warm PVC.
9. Skipping NetworkPolicy. Default Kubernetes networking lets every pod talk to every pod. A compromised pod in another namespace can hit your unauthenticated 11434 port. Lock it down on day one.
10. Mismatched model names across replicas. Pull the same exact tag (llama3.1:8b, not latest) into every replica's PVC, otherwise you get inconsistent responses depending on which pod the request lands on.
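A minimal version of the pre-pull Job from pitfall 8. It mounts the PVC that replica 0 will claim, so run it before the StatefulSet comes up (or with the StatefulSet scaled to 0, since ReadWriteOnce claims attach to one node at a time); one Job per replica PVC. The claim name follows the volumeClaimTemplates convention <template>-<pod>:

apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-prepull
  namespace: ollama
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: prepull
          image: ollama/ollama:0.5.7
          command: ["/bin/sh", "-c"]
          # ollama pull talks to a local server, so start one in the background first.
          args:
            - ollama serve & sleep 5 && ollama pull llama3.1:8b && ollama pull qwen2.5-coder:7b
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-ollama-0   # <volumeClaimTemplate>-<pod>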
Conclusion
Ollama on Kubernetes is the right move once you have more than two engineers or one production workload. The setup is more involved than brew install — but the manifests above are battle-tested, the benchmarks are real, and the failure modes are documented. Start with the StatefulSet, layer on ingress and TLS, add KEDA when you actually need scaling, and bolt on monitoring before you hit your first incident.
The biggest mindset shift coming from single-server Ollama is treating models as data, not as code. Models live on PVCs. Pods are cattle. The cluster heals itself when nodes reboot. Once that clicks, running a private 22-person AI platform feels almost boring — which is exactly what production should feel like.
Compared to the single-node Ollama production deployment, Kubernetes adds an order of magnitude more moving parts but pays for itself the first time a node dies at 2am and nobody gets paged. Combine this with Ollama load balancing for the routing layer and the Ollama production checklist for the security review, and you have a stack you can confidently put in front of a paying customer.
Want updates as we roll out the Kubernetes monitoring dashboards and the OAuth2-proxy auth template? Join the Local AI Master newsletter — one email a week, all production playbooks.