Ollama on Kubernetes: Production Team Deployment Guide (2026)
Published April 23, 2026 • 22 min read
Most Ollama tutorials stop at docker run. That works for one developer on one laptop. The minute a second engineer asks for the same endpoint, or you need to survive a node reboot without a 40 GB model re-download, you need Kubernetes. This guide walks the whole path: a working manifest set, a Helm chart you can fork, GPU scheduling that does not silently fall back to CPU, and the operational decisions I learned the hard way running Ollama on a 4-node K3s cluster for a 22-person team.
I am writing this from a setup that has been serving an internal coding assistant for 11 months. Not a homelab demo. The numbers and YAML below are pulled from the live cluster.
Quick Start: Ollama on Kubernetes in 8 Minutes
If you already have a cluster with the NVIDIA device plugin installed, paste this and you have a working Ollama endpoint:
kubectl create namespace ollama
kubectl apply -n ollama -f https://raw.githubusercontent.com/otwld/ollama-helm/main/examples/quickstart.yaml
kubectl wait --for=condition=ready pod -l app=ollama -n ollama --timeout=300s
kubectl exec -n ollama ollama-0 -- ollama pull llama3.1:8b
kubectl port-forward -n ollama svc/ollama 11434:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"hello"}'
Three minutes for the StatefulSet to come up, four minutes for the model pull on a 200 Mbps connection. After that, the model is cached on the PVC and pod restarts take 12 seconds.
That gets you running. The rest of this article is what you do after the demo, when you actually have to run it.
Table of Contents
- Why Kubernetes for Ollama
- Cluster Prerequisites
- GPU Setup with NVIDIA Device Plugin
- The StatefulSet Manifest Explained Line by Line
- Service, Ingress, and TLS
- Helm Chart Walkthrough
- Autoscaling with KEDA
- Multi-Model Routing
- Monitoring and Logs
- Security Hardening
- Real Cluster Benchmarks
- Common Pitfalls
- FAQs
Why Kubernetes for Ollama {#why-k8s}
If you are alone, you do not need this article. brew install ollama is fine. The case for Kubernetes shows up when any of these become true:
- More than one engineer hits the same endpoint and you stop wanting to debug "is your IP allowed."
- You need the model server to survive node reboots, OS updates, and OOM kills without manual recovery.
- You have GPU nodes mixed with CPU nodes and want the scheduler to put inference where the silicon is.
- You need to run multiple model variants (a coding model and a chat model) without one starving the other.
- You want metrics, logs, and access control without bolting them on with shell scripts.
Compared to running Ollama bare on a server, Kubernetes gives you self-healing, declarative state, native ingress, and a real story for upgrades. Compared to managed inference services like Bedrock or Together, you keep weights on disk you control and pay zero per-token egress.
The cost: a learning curve, a control plane to maintain, and a real network policy story. For a team of five or more, it is worth it. We migrated from a single systemctl start ollama server to K3s in a weekend after the third "who restarted it?" Slack message.
For deeper architectural context, our Ollama production deployment guide covers the single-node Docker Compose path, and load balancing Ollama with Nginx is the right next read once you have multiple replicas.
Cluster Prerequisites {#prerequisites}
You need a cluster that meets these baseline conditions:
| Component | Minimum | Recommended | Why |
|---|---|---|---|
| Kubernetes version | 1.27 | 1.30+ | Sidecar containers, native sidecar lifecycle |
| Node OS | Ubuntu 22.04 / Debian 12 | Ubuntu 24.04 LTS | NVIDIA driver 535+ packages |
| GPU driver | 535.x | 550.x | Required for CUDA 12.4 used by Ollama 0.3+ |
| Container runtime | containerd 1.7 | containerd 2.0 | NVIDIA container toolkit support |
| Storage | Any CSI with RWO | local-path or Ceph RBD on SSD | Model cache I/O |
| Networking | CNI with NetworkPolicy | Cilium 1.15+ | API isolation |
I run K3s on Ubuntu 24.04 with the NVIDIA GPU Operator and Cilium. Total install time on a fresh node is under 15 minutes. For managed clusters, EKS, GKE, and AKS all support GPU node pools — just pick one with H100, A100, A10, L4, or RTX 6000 Ada nodes depending on budget.
The official Kubernetes documentation on managing devices covers the device plugin model in depth if you want the upstream reference.
Verifying GPU visibility
Before deploying anything, confirm the cluster sees GPUs:
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Expected: "1" or higher per GPU node, null for CPU nodes
If you see null everywhere, the device plugin is not running. Install it:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
Wait 30 seconds, re-run the allocatable check. If it still shows null, you have a driver or toolkit issue — nvidia-smi on the node should work before you debug Kubernetes.
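On the node itself, this is the usual sequence, assuming containerd at its default config path (nvidia-ctk ships with the NVIDIA container toolkit):

nvidia-smi                                   # driver loaded, GPU visible?
grep -i nvidia /etc/containerd/config.toml   # NVIDIA runtime handler registered?
# If the handler is missing, let the toolkit write the config, then restart containerd:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd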
GPU Setup with NVIDIA Device Plugin {#gpu-setup}
The device plugin advertises GPUs as a Kubernetes resource. There are two paths:
Option A: Standalone device plugin (simpler, works for single-node and homelab clusters):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          name: nvidia-device-plugin-ctr
          securityContext:
            privileged: true
Option B: NVIDIA GPU Operator (managed driver, toolkit, and DCGM metrics — recommended for production):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install --wait gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
The operator runs DCGM exporter on every GPU node, which we will scrape with Prometheus later for VRAM and utilization metrics. On a fresh cluster this saves about three hours of work versus wiring it all by hand.
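Once the operator settles, two quick checks confirm both metrics and scheduling are wired up (the app label here is the one the DCGM exporter ships with, and the same one the Prometheus scrape config later in this guide keys on):

kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'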
Tainting GPU nodes
Stop CPU workloads from landing on expensive GPU hardware:
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node-1 hardware=gpu accelerator=rtx-4090
Then add a matching toleration in the Ollama pod spec. CPU pods skip the node by default; only pods that explicitly tolerate the taint can land there.
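To prove the taint, label, and device plugin all line up before deploying Ollama, a throwaway pod like this should land on the GPU node and print the nvidia-smi table (the CUDA image tag is an assumption; any CUDA base image works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: default
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu       # matches the taint set above
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    hardware: gpu               # matches the label set above
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"

Apply it, wait for Completed, and kubectl logs gpu-smoke-test should show the nvidia-smi table. If the pod sits Pending, kubectl describe it and look for taint or resource mismatches.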
The StatefulSet Manifest Explained Line by Line {#statefulset}
Here is the manifest that has been running our cluster for 11 months. Every flag is there for a reason — I will annotate the load-bearing ones below.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: ollama
  replicas: 2
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        accelerator: rtx-4090
      containers:
        - name: ollama
          image: ollama/ollama:0.5.7
          ports:
            - containerPort: 11434
              name: http
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0:11434"
            - name: OLLAMA_KEEP_ALIVE
              value: "24h"
            - name: OLLAMA_NUM_PARALLEL
              value: "4"
            - name: OLLAMA_MAX_LOADED_MODELS
              value: "2"
            - name: OLLAMA_FLASH_ATTENTION
              value: "1"
            - name: OLLAMA_KV_CACHE_TYPE
              value: "q8_0"
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
          startupProbe:
            httpGet:
              path: /
              port: 11434
            failureThreshold: 30
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path-ssd
        resources:
          requests:
            storage: 200Gi
What matters here
- OLLAMA_KEEP_ALIVE: 24h — by default Ollama unloads models after 5 minutes of inactivity. On Kubernetes, that means a 30-second cold reload on every Slack-bot-after-lunch query. Pin it to 24 hours. VRAM cost is a non-issue when the model is the only resident thing.
- OLLAMA_FLASH_ATTENTION: 1 — 15-25% throughput gain on Ampere and Ada GPUs. No reason to leave it off.
- OLLAMA_KV_CACHE_TYPE: q8_0 — quantizes the KV cache to 8-bit. Cuts VRAM usage by ~40% with negligible quality loss for most use cases. Use q4_0 if you are tight on VRAM, f16 if you cannot tolerate any quality drop.
- OLLAMA_NUM_PARALLEL: 4 — number of concurrent requests per model. Higher = more throughput, more VRAM. 4 is a good default for an 8B model on a 24 GB GPU.
- Startup probe with 30 failures × 10 seconds = a 5-minute window — accommodates first-time model pulls. The liveness probe fires only after startup succeeds.
- volumeClaimTemplates with 200 Gi — each replica gets its own PVC. Models pull once per replica, cached forever after.
Apply this and you have a real deployment. The next sections add networking, security, and observability.
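Apply and verify (the filename is whatever you saved the manifest as; ollama-statefulset.yaml here is an assumption):

kubectl apply -f ollama-statefulset.yaml
kubectl -n ollama rollout status statefulset/ollama
# Confirm the tuning env vars landed and see what is resident on the GPU:
kubectl -n ollama exec ollama-0 -- env | grep '^OLLAMA_'
kubectl -n ollama exec ollama-0 -- ollama ps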
Service, Ingress, and TLS {#networking}
Headless Service for direct pod access
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  clusterIP: None
  selector:
    app: ollama
  ports:
    - port: 11434
      name: http
Headless because StatefulSet pods get DNS names like ollama-0.ollama.ollama.svc.cluster.local, useful for sticky sessions if you implement them later.
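For example, to pin a debug request to replica 0 from another pod inside the cluster:

curl http://ollama-0.ollama.ollama.svc.cluster.local:11434/api/tags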
ClusterIP for load-balanced access
apiVersion: v1
kind: Service
metadata:
  name: ollama-lb
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
Internal apps hit ollama-lb.ollama.svc.cluster.local:11434 and Kubernetes round-robins.
Ingress with TLS and API key auth (Nginx)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      if ($http_authorization !~ "^Bearer (sk-team-key-1|sk-team-key-2)$") {
        return 401;
      }
spec:
  ingressClassName: nginx
  tls:
    - hosts: [ollama.internal.example.com]
      secretName: ollama-tls
  rules:
    - host: ollama.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama-lb
                port:
                  number: 11434
The proxy-read-timeout of 600 seconds is critical. Long generations on a 70B model can take 4-5 minutes, and the default 60-second timeout will kill the request mid-token.
The configuration-snippet is a quick API key gate. For production, replace it with an OAuth2-proxy or Pomerium ext-auth filter — never inline secrets like that for real keys.
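With nginx ingress, that swap is two annotations pointing at an oauth2-proxy Service. A minimal sketch; the service name, namespace, and sign-in hostname below are assumptions, not something this cluster ships by default:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://oauth2-proxy.auth.svc.cluster.local:4180/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.internal.example.com/oauth2/start?rd=$escaped_request_uri"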
Helm Chart Walkthrough {#helm}
If you do not want to maintain raw manifests, the community otwld/ollama-helm chart is solid. Install it like this:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
helm install ollama ollama-helm/ollama \
--namespace ollama \
--create-namespace \
--set replicaCount=2 \
--set ollama.gpu.enabled=true \
--set ollama.gpu.type=nvidia \
--set ollama.gpu.number=1 \
--set ollama.models.pull[0]=llama3.1:8b \
--set ollama.models.pull[1]=qwen2.5-coder:7b \
--set persistentVolume.enabled=true \
--set persistentVolume.size=200Gi \
--set ingress.enabled=true \
--set ingress.className=nginx \
--set ingress.hosts[0].host=ollama.internal.example.com
The chart handles probes, volume claim templates, the Service, and ingress. You can fork it and add your custom configuration-snippet for auth.
For a values.yaml-based setup:
replicaCount: 2
ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    pull:
      - llama3.1:8b
      - qwen2.5-coder:7b
      - nomic-embed-text:latest
persistentVolume:
  enabled: true
  size: 200Gi
  storageClass: local-path-ssd
resources:
  limits:
    cpu: 8
    memory: 32Gi
  requests:
    cpu: 2
    memory: 8Gi
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: "24h"
  - name: OLLAMA_FLASH_ATTENTION
    value: "1"
  - name: OLLAMA_NUM_PARALLEL
    value: "4"
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: ollama.internal.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts: [ollama.internal.example.com]
      secretName: ollama-tls
helm install ollama ollama-helm/ollama -f values.yaml -n ollama --create-namespace

Done.
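Later config changes go through the same values file; helm get values confirms what the cluster is actually running:

helm upgrade ollama ollama-helm/ollama -f values.yaml -n ollama
helm get values ollama -n ollama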
Autoscaling with KEDA {#autoscaling}
CPU-based HPA is useless here — Ollama is GPU-bound and CPU stays at 5-10% even during heavy generation. Use KEDA with a Prometheus scaler:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaler
  namespace: ollama
spec:
  scaleTargetRef:
    name: ollama
    kind: StatefulSet
  minReplicaCount: 1
  maxReplicaCount: 6
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: ollama_requests_in_flight
        threshold: "8"
        query: |
          sum(rate(nginx_ingress_controller_requests{ingress="ollama"}[1m]))
Scales up when sustained RPS exceeds 8 per replica. The 5-minute cooldown prevents flapping when traffic is bursty. minReplicaCount=1 keeps a warm pod with the model resident so cold starts are rare.
For predictable workloads (8am-6pm office hours), use a CronHPA instead. We pre-scale to 4 replicas at 8:55am and back to 1 at 6:30pm. Costs us nothing extra, latency stays under 200ms TTFB during peak.
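Kubernetes has no built-in CronHPA, so the simplest version is a CronJob that runs kubectl scale. A sketch, assuming a ServiceAccount named ollama-scaler bound to a Role that allows patching statefulsets/scale (RBAC not shown):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ollama-scale-up
  namespace: ollama
spec:
  schedule: "55 8 * * 1-5"        # 8:55am, weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ollama-scaler
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.30
              command: ["kubectl", "-n", "ollama", "scale", "statefulset/ollama", "--replicas=4"]

A mirror CronJob on "30 18 * * 1-5" scales back down to 1.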
Multi-Model Routing {#multi-model}
When the team needs both a chat model and a coding model, do not stuff them into one StatefulSet — VRAM thrashing destroys throughput. Run two StatefulSets, route by Host header:
- host: chat.ollama.internal.example.com
  http:
    paths:
      - path: /
        pathType: Prefix
        backend: { service: { name: ollama-chat-lb, port: { number: 11434 } } }
- host: code.ollama.internal.example.com
  http:
    paths:
      - path: /
        pathType: Prefix
        backend: { service: { name: ollama-code-lb, port: { number: 11434 } } }
Or by path prefix if you prefer one host:
- path: /v1/chat
  pathType: Prefix
  backend: { service: { name: ollama-chat-lb, port: { number: 11434 } } }
- path: /v1/code
  pathType: Prefix
  backend: { service: { name: ollama-code-lb, port: { number: 11434 } } }
Each model gets its own VRAM, its own scaling envelope, and its own SLO. We run llama3.1:8b for chat at 4 replicas and qwen2.5-coder:7b at 2 replicas — load patterns are completely different and decoupling them was the single biggest stability win.
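Clients then pick the model by hostname, no request-body inspection needed (the Bearer key matches the ingress gate from earlier):

curl https://chat.ollama.internal.example.com/api/generate \
  -H "Authorization: Bearer sk-team-key-1" \
  -d '{"model":"llama3.1:8b","prompt":"hello"}'

curl https://code.ollama.internal.example.com/api/generate \
  -H "Authorization: Bearer sk-team-key-1" \
  -d '{"model":"qwen2.5-coder:7b","prompt":"write a unit test"}'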
Monitoring and Logs {#monitoring}
Prometheus scrape config
Ollama itself does not export Prometheus metrics yet, so we scrape the NVIDIA DCGM exporter (installed by the GPU Operator) for GPU metrics, and Nginx ingress for request metrics:
- job_name: 'dcgm'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: nvidia-dcgm-exporter
      action: keep
Useful queries for a Grafana dashboard:
| Metric | Query | Alert threshold |
|---|---|---|
| GPU utilization | avg(DCGM_FI_DEV_GPU_UTIL{pod=~"ollama.*"}) | >95% for 10 min |
| VRAM used % | DCGM_FI_DEV_FB_USED{pod=~"ollama.*"} / (DCGM_FI_DEV_FB_USED{pod=~"ollama.*"} + DCGM_FI_DEV_FB_FREE{pod=~"ollama.*"}) * 100 | >90% |
| RPS | sum(rate(nginx_ingress_controller_requests{ingress="ollama"}[5m])) | — |
| p95 latency | histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket{ingress="ollama"}[5m])) | >5s |
| Pod restarts | increase(kube_pod_container_status_restarts_total{namespace="ollama"}[1h]) | >0 |
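If you run the Prometheus Operator, the first row of that table translates into a PrometheusRule like this (the monitoring namespace and severity label are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts
  namespace: monitoring
spec:
  groups:
    - name: ollama
      rules:
        - alert: OllamaGPUSaturated
          expr: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"ollama.*"}) > 95
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Ollama GPU utilization above 95% for 10 minutes - scale up or shed load"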
Centralized logs with Loki
helm install loki grafana/loki-stack -n monitoring \
--set promtail.enabled=true \
--set loki.persistence.enabled=true \
--set loki.persistence.size=50Gi
Promtail picks up Ollama's stdout automatically. Useful queries:
{namespace="ollama"} |= "error"
{namespace="ollama"} |~ "out of memory|CUDA error"
{namespace="ollama"} | json | duration > 5
For deeper Prometheus + Grafana setup specific to local AI, the Ollama production deployment guide has the dashboard JSON and alert rules.
Security Hardening {#security}
Five layers, in order of importance:
1. NetworkPolicy — block everything except the namespaces that should reach Ollama:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-from-apps
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: apps
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - port: 11434
          protocol: TCP
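A quick way to prove the policy bites, assuming your CNI enforces NetworkPolicy (the Cilium install from the prerequisites does) and your namespaces carry the labels the selectors above expect:

# From a namespace the policy does not allow: expect a timeout.
kubectl run netpol-test -n default --rm -it --restart=Never \
  --image=curlimages/curl -- curl -m 3 http://ollama-lb.ollama.svc.cluster.local:11434/api/tags
# From the allowed namespace (labeled name=apps): expect the model list.
kubectl run netpol-test -n apps --rm -it --restart=Never \
  --image=curlimages/curl -- curl -m 3 http://ollama-lb.ollama.svc.cluster.local:11434/api/tags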
2. ServiceAccount with no token mount — Ollama does not call the K8s API, so do not give it credentials:
spec:
  template:
    spec:
      automountServiceAccountToken: false
3. Pod Security Standards — enforce baseline on the namespace and warn on restricted:
kubectl label namespace ollama \
pod-security.kubernetes.io/enforce=baseline \
pod-security.kubernetes.io/warn=restricted
4. ReadOnlyRootFilesystem — Ollama only writes to /root/.ollama (the PVC), so the rest of the filesystem can be read-only:
securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}
5. Secret-backed API keys instead of inline:
apiVersion: v1
kind: Secret
metadata:
  name: ollama-api-keys
  namespace: ollama
type: Opaque
stringData:
  keys.conf: |
    sk-team-eng-2026-xxx
    sk-team-product-2026-yyy
Mount it into the ingress controller with an OAuth2-proxy ext-auth filter. Never bake keys into ingress annotations in source control.
Real Cluster Benchmarks {#benchmarks}
Numbers from our 4-node K3s cluster, measured April 2026:
Hardware: 3× nodes with RTX 4090 24 GB, 1× node with 2× A5000 24 GB. 64 GB RAM each, NVMe SSDs.
Workload: 22 engineers, 6,400 requests/day, 70/30 chat to code.
| Metric | Value |
|---|---|
| llama3.1:8b TTFB | 180 ms p50, 420 ms p95 |
| llama3.1:8b throughput | 92 tok/s per replica |
| qwen2.5-coder:7b TTFB | 220 ms p50, 510 ms p95 |
| qwen2.5-coder:7b throughput | 88 tok/s per replica |
| Pod cold start (warm PVC) | 12 s |
| Pod cold start (fresh PVC) | 78 s |
| Cluster GPU utilization avg | 34% |
| Cluster GPU utilization peak | 91% |
| Pod restarts per month | 3 (all OOM-related, fixed by raising memory limit) |
| Power draw (4 nodes) | 1.4 kW avg, 2.8 kW peak |
The 3 OOM kills happened during a model swap when both 8B and 70B were briefly resident. Setting OLLAMA_MAX_LOADED_MODELS=2 and adding a memory request of 16 Gi prevented recurrence.
Common Pitfalls {#pitfalls}
1. Using a Deployment instead of a StatefulSet. Pods reschedule, PVCs detach, models re-download. Use StatefulSet.
2. Forgetting OLLAMA_HOST=0.0.0.0:11434. Ollama binds to localhost by default, which means the pod accepts connections only from itself. Probe fails, Service does not route, you spend an hour staring at a green pod that is unreachable.
3. Not tainting GPU nodes. General workloads schedule onto your $1,800 GPU and starve Ollama of CPU/RAM. Taint and tolerate.
4. ReadWriteMany on the model PVC. Ollama writes its blob index, two pods fighting over it corrupts the cache. Stick to ReadWriteOnce per replica.
5. Default ingress timeout. 60 seconds is fine for chat, fails for 70B generations. Set proxy-read-timeout to 600+.
6. HPA on CPU. Doesn't scale because GPU is the bottleneck. Use KEDA on RPS or queue depth.
7. No keep-alive. Models unload after 5 minutes, every burst pays the reload cost. Set OLLAMA_KEEP_ALIVE=24h.
8. Pulling models inside the manifest with initContainer. Works, but blocks pod readiness for minutes on every restart. Pre-pull via Job once (see the sketch after this list), then let StatefulSet pods come up against the warm PVC.
9. Skipping NetworkPolicy. Default Kubernetes networking lets every pod talk to every pod. A compromised pod in another namespace can hit your unauthenticated 11434 port. Lock it down on day one.
10. Mismatched model names across replicas. Pull the same exact tag (llama3.1:8b, not latest) into every replica's PVC, otherwise you get inconsistent responses depending on which pod the request lands on.
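A minimal version of the pre-pull Job from pitfall 8. It mounts the PVC that replica 0 will claim, so run it before the StatefulSet comes up (or with the StatefulSet scaled to 0, since ReadWriteOnce claims attach to one node at a time); one Job per replica PVC. The claim name follows the volumeClaimTemplates convention <template>-<pod>:

apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-prepull
  namespace: ollama
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: prepull
          image: ollama/ollama:0.5.7
          command: ["/bin/sh", "-c"]
          # ollama pull talks to a local server, so start one in the background first.
          args:
            - ollama serve & sleep 5 && ollama pull llama3.1:8b && ollama pull qwen2.5-coder:7b
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-ollama-0   # <volumeClaimTemplate>-<pod>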
Conclusion
Ollama on Kubernetes is the right move once you have more than two engineers or one production workload. The setup is more involved than brew install — but the manifests above are battle-tested, the benchmarks are real, and the failure modes are documented. Start with the StatefulSet, layer on ingress and TLS, add KEDA when you actually need scaling, and bolt on monitoring before you hit your first incident.
The biggest mindset shift coming from single-server Ollama is treating models as data, not as code. Models live on PVCs. Pods are cattle. The cluster heals itself when nodes reboot. Once that clicks, running a private 22-person AI platform feels almost boring — which is exactly what production should feel like.
Compared to the single-node Ollama production deployment, Kubernetes adds an order of magnitude more moving parts but pays for itself the first time a node dies at 2am and nobody gets paged. Combine this with Ollama load balancing for the routing layer and the Ollama production checklist for the security review, and you have a stack you can confidently put in front of a paying customer.
Want updates as we roll out the Kubernetes monitoring dashboards and the OAuth2-proxy auth template? Join the Local AI Master newsletter — one email a week, all production playbooks.